Design of Data and
Data Models
Conceptual, logical, and physical data models: ER notation, product and sum types, normalization, and a step-by-step data modeling process for architects.
Introduction to Data Modeling
Data is a first-class architectural concern. The way you structure, store, and access data shapes nearly every quality attribute: performance, scalability, consistency, and evolvability.
Data Outlives Code
- Applications get rewritten; data persists for decades
Drives Quality
- Data model shapes performance, scalability & consistency
Three Model Levels
- Conceptual → Logical → Physical, each at a different level of abstraction
Data modeling is the process of analyzing, defining, and visualizing the data stored and managed in our system, establishing a shared, precise understanding of what data exists, how it is structured, and how it behaves.
Software architecture is a blueprint for the:
- Structure of our system's components
- Relationships between those components
- Rules and principles guiding its evolution
A data model, in turn, is a blueprint for the:
- Data entities in our system and their attributes
- Relationships between those entities
- Rules governing the data: privacy, security, integrity constraints, etc.
- How data is stored and used across the system
Both use diagrams to describe a system, but they serve different purposes. Class modeling captures runtime behavior and object identity (what objects do); data modeling captures how data is persisted, stored, and constrained (what data is).
| Aspect | Class Modeling (UML) | Data Modeling (ER / Schema) |
|---|---|---|
| Focus | Runtime behavior & object identity | Data persistence, storage & constraints |
| Contains | Attributes + Methods (behavior) | Attributes only, no methods |
| Output | Source code: classes, interfaces | DB schemas: tables, DDL, ER diagrams |
| Tool | UML Class Diagrams | ER Diagrams, Logical / Physical schemas |
| Audience | Developers, architects | DBAs, data architects, developers |
| Key Concern | Encapsulation, inheritance, polymorphism | Normalization, referential integrity, cardinality |
The same entity (e.g., Customer) will look very different in each model.
- Applications get rewritten; data persists for decades
- A poor data model is one of the hardest things to refactor
- Schema changes can require expensive migrations across all consumers
- Performance: query patterns must align with data model
- Scalability: how data is partitioned determines horizontal scaling
- Consistency: ACID vs. eventual consistency affects correctness
- Evolvability: schema evolution vs. rigid table structures
A data model is an abstract representation of the data structures, relationships, and constraints that describe how data is organized, stored, and manipulated within a system. It serves as a blueprint for database design.
- Real-world objects or concepts (e.g., Customer, Order, Product)
- Each entity has attributes (properties)
- One attribute (or combination) forms the primary key
- How entities are associated: 1:1, 1:N, M:N
- Direction and cardinality define data access patterns
- Foreign keys enforce referential integrity
- Rules that data must satisfy (NOT NULL, UNIQUE, CHECK)
- Business rules encoded at the data level
- Referential integrity β foreign keys must point to valid records
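These constraint kinds can be sketched directly in DDL. A minimal SQLite example (table and column names are illustrative, not from the text):

```python
import sqlite3

# In-memory database; FK enforcement is off by default in SQLite
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE          -- NOT NULL + UNIQUE constraints
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL CHECK (total >= 0)       -- business rule at the data level
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")   # valid FK

# Referential integrity: an order pointing at a missing customer is rejected
try:
    conn.execute("INSERT INTO orders VALUES (11, 999, 5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The point is that the database, not the application, is the last line of defense for these rules.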
- Start with the simplest model that satisfies requirements
- Avoid over-engineering before understanding access patterns
- Complexity can be added later; removing it is costly
- Split related data into separate tables with PK / FK links
- One fact in one place: eliminates update anomalies
- Reduces storage and keeps data consistent
- Store `productId` (FK), not `productName` (string copy)
- Names change; IDs are stable and immutable
- Compact keys = faster joins and less storage
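A sketch of the "store the ID, not the name" rule with illustrative tables in SQLite: the product name lives in exactly one place, so renaming it never touches the order rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE order_item (
        order_id   INTEGER,
        product_id INTEGER REFERENCES product(product_id),  -- FK, not a name copy
        quantity   INTEGER
    );
    INSERT INTO product VALUES (1, 'Widget');
    INSERT INTO order_item VALUES (100, 1, 3);
""")

# One fact in one place: a rename is a single UPDATE; no order rows change
conn.execute("UPDATE product SET name = 'Widget Pro' WHERE product_id = 1")

row = conn.execute("""
    SELECT p.name, oi.quantity
    FROM order_item oi JOIN product p USING (product_id)
""").fetchone()
print(row)  # → ('Widget Pro', 3)
```

Had the order row stored the name string, every rename would require updating every historical order.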
- Store locale and language codes alongside data
- Use translation tables for all user-facing text
- Use Unicode-safe column types (`UTF-8`, `NVARCHAR`)
- PK/FK relationships do not restrict data visibility
- Any DB user can read all rows without proper roles
- Apply database roles and Row-Level Security (RLS)
- Never rely solely on application-layer access control
- Large joins across many tables cause latency
- For read-heavy workloads: controlled redundancy speeds queries
- Trade-off: storage cost vs query performance
- Apply selectively; never denormalize prematurely
Not all data has the same access pattern. Read-heavy and write-heavy workloads place fundamentally different demands on the database. Choosing the wrong type causes avoidable performance problems at scale.
| Workload Profile | Problem at Scale | Recommended Store | Key Trade-off |
|---|---|---|---|
| Read-heavy, simple lookups | Relational joins slow under high read load | Document / Key-Value Store | Better read speed vs weaker consistency |
| Write-heavy, transactional | NoSQL stores often lack full ACID guarantees | Relational (RDBMS) | Strong consistency vs limited horizontal scale |
| Highly connected data | Deep multi-hop joins become O(n) in RDBMS | Graph Database | Fast traversals vs complexity & standardization |
| Large joins, many tables | Join latency degrades response time | Denormalize or add a cache | Faster reads vs storage cost & write overhead |
Polyglot persistence is the practice of using different data stores for different parts of the same system, each optimized for its specific data shape and access pattern. A real-world social media system illustrates this well.
- Feed items are read far more often than they are written
- Flexible schema: posts, videos, and links differ in structure
- Document store allows denormalized, ready-to-render documents
- Key-Value cache (e.g., Redis) serves hot feeds in microseconds
- Social graphs (followers, friends, recommendations) are highly connected
- Relational joins struggle with multi-hop traversals ("friends of friends")
- Graph DB traverses edges natively in near-constant time per hop
- Powers recommendations, shortest path, and influence scoring
- Data stored in tables (relations) with rows and columns
- Strong ACID guarantees: Atomicity, Consistency, Isolation, Durability
- Schema-on-write: structure defined before data is inserted
- Best for: structured data with clear relationships
- Examples: PostgreSQL, MySQL, Oracle, SQL Server
- Document: JSON-like documents (MongoDB, CouchDB)
- Key-Value: simple lookup by key (Redis, DynamoDB)
- Column-Family: wide rows, column-oriented (Cassandra, HBase)
- Graph: nodes and edges (Neo4j, Amazon Neptune)
- Schema-on-read: flexible, evolving structure
| Criteria | Relational | Document | Key-Value | Graph |
|---|---|---|---|---|
| Data Structure | Highly structured | Semi-structured | Unstructured | Interconnected |
| Query Pattern | Complex joins, aggregations | By document ID, nested queries | Simple get/put | Traversals, path queries |
| Consistency | Strong (ACID) | Tunable | Eventual | Tunable |
| Scalability | Vertical (mostly) | Horizontal | Horizontal | Vertical + some horizontal |
| Schema | Rigid, enforced | Flexible | Schema-free | Flexible |
| Use Case | Banking, ERP, inventory | CMS, catalogs, profiles | Caching, sessions | Social networks, recommendations |
- Data Modeling: the process of analyzing, defining, and visualizing data stored and managed within our system
- Analyzing: understand what data exists and how it is used
- Defining: specify entities, attributes, relationships, and rules
- Visualizing: communicate the model through diagrams and schemas
- Data modeling has significant impact on software architecture
- Data modeling influences technology choices
- Data models affect fulfillment of functional requirements
- Data models affect fulfillment of quality requirements
Data Model Types and Notations
Different data model types suit different problems. The architect must choose not only the model but also the notation used to communicate it: ER diagrams, UML, or database-specific schemas.
Conceptual
- High-level entities & relationships; business language, no tech detail
Logical
- Attributes, keys & constraints; independent of any specific database
Physical
- Tables, columns, indexes; optimized for a specific DBMS
- What data exists in the business domain
- High-level entities and their relationships
- No technical details; business language only
- Audience: stakeholders, domain experts, architects
- Tool: ER diagrams, UML class diagrams
- Can use informal sketches or whiteboard drawings; no standard formal notation required
- Typically a whiteboard / discovery conversation with stakeholders, not a strict artifact
- How data is structured, independent of technology
- Defines attributes, data types, keys, relationships
- Applies normalization rules
- Database / technology-independent: not tied to any specific DB engine
- Audience: software engineers, data engineers, architects
- Uses formal diagramming notations:
- Information Engineering / Crow's Foot notation
- Barker's notation
- IDEF1X
- Chen's notation
- UML (unofficial use)
- Where and how data is stored on disk
- Database-specific: table names, column types, indexes, partitions
- May include denormalization for performance
- Audience: DBAs, developers
- Tool: DDL scripts, DB-specific diagrams
The Entity-Relationship model (Peter Chen, 1976) is the most widely used notation for conceptual and logical data modeling. It represents entities, their attributes, and the relationships between them.
| Symbol | Meaning | Example |
|---|---|---|
| Rectangle | Entity | Customer, Order |
| Ellipse | Attribute | name, email, orderId |
| Diamond | Relationship | "places", "contains" |
| Underlined | Primary Key | customerId |
| Double Rectangle | Weak Entity | OrderItem (depends on Order) |
| 1:N line | Cardinality | Customer 1:N Order |
| Notation | Meaning |
|---|---|
| 1:1 | One-to-One: each entity maps to exactly one of the other |
| 1:N | One-to-Many: one entity maps to many of the other |
| M:N | Many-to-Many: requires a junction/bridge table in the physical model |
| 0..1 | Optional: zero or one (partial participation) |
| 1..* | Mandatory many: at least one |
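In the physical model an M:N relationship always becomes a junction table. A sketch in SQLite (entity names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);

    -- Junction table: one row per (student, course) pair resolves the M:N
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)   -- each pair at most once
    );

    INSERT INTO student VALUES (1, 'Ada'), (2, 'Alan');
    INSERT INTO course  VALUES (10, 'Databases'), (20, 'Compilers');
    INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# Who takes course 10? Resolve the M:N with two joins through the junction.
names = [r[0] for r in conn.execute("""
    SELECT s.name FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    WHERE e.course_id = 10 ORDER BY s.name
""")]
print(names)  # → ['Ada', 'Alan']
```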
UML class diagrams can represent data models with classes as entities, attributes as fields, and associations as relationships. They are especially useful when the data model is closely aligned with the domain model.
Crow's Foot notation is the most widely used ER notation in the logical modeling phase. It encodes both the minimum and maximum cardinality of a relationship directly onto each end of the line; no separate labels are needed.
Each end of a relationship line carries two markers, read along the line toward the entity:
- Outer mark (nearest the entity) gives the maximum: single bar `|` = 1, crow's foot `<` = many
- Inner mark (just behind it) gives the minimum: single bar `|` = 1, circle `o` = 0
| Symbol | Name | Min / Max |
|---|---|---|
| `--\|\|--` | One and only one | Min 1 · Max 1 (mandatory single) |
| `--o\|--` | Zero or one | Min 0 · Max 1 (optional single) |
| `--\|<--` | One or many | Min 1 · Max N (mandatory multi) |
| `--o<--` | Zero or many | Min 0 · Max N (optional multi) |
- Three types of abstraction for data models, each serving a different audience and level of detail
- Conceptual: key entities, attributes and relationships (high-level picture)
- Logical: attributes, data types and cardinality of relationships
- Physical: maps entities to storage-specific elements (tables, columns, collections)
- Notations for visualizing models: Crow's Foot, Barker's, IDEF1X, Chen's, UML
Step-by-Step Data Modeling Process
Data modeling is an iterative, multi-step process that begins with identifying the entities in a domain and ends with an optimized physical data model. Each step builds upon the previous one.
1. Entities
- Identify core domain objects
2. Attributes
- Define properties & data types
3. Relationships
- Map how entities relate (1:1, 1:N, M:N)
4. Physical Model
- Choose DB tech & create tables
5. Optimize
- Normalize, index & tune for performance
Collections of distinct:
- Real-world objects
- Locations
- Concepts
- Events
- Uniquely identifiable
- Has properties / attributes
"An employee can schedule a meeting with one or several other employees. Every meeting takes place in a specific conference room. Only one meeting can occur in a conference room at any given time."
- Properties / characteristics of an entity
- At least one attribute per entity
- Stores a single, meaningful piece of information
"An employee can schedule a meeting with one or several other employees. Every meeting takes place in a specific conference room. Only one meeting can occur in a conference room at any given time."
Choose the appropriate data modeling technique based on the nature of your data and relationships. Each technique has distinct strengths and trade-offs.
- Joining tables is very easy: allows creating complex relationships
- Good query optimizers
- Support transactions and strict consistency
- Schema enforcement: "Schema on Write"
- Tables/rows don't map cleanly onto programming data structures (lists, arrays, hash tables)
- Requires ORM technologies, which create dependencies and performance overhead
- Strict schema creates performance overhead
- Schema updates are difficult, time-consuming, and introduce downtime
- Entity → Collection of documents of the same type
- Each record/instance is a document: a set of properties in a tree-like structure
- Flexible schema: "Schema on Read"
- Native support for programming language data structures (lists, arrays, maps)
- Great for data with no relationships or one-to-many relationships; no complex joins needed
- Better storage locality for each record
- Flexible schema: no complex migrations, no downtime
- Easily scalable
- Not good for many-to-many and many-to-one relationships
- Flexible schema (schema on read) is not always desirable
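The one-to-many case the document model handles well can be sketched as a single denormalized document: the order and its items live together, so reading the whole record needs no join (field names are illustrative).

```python
import json

# An Order entity as one self-contained document: the one-to-many
# "order contains items" relationship is embedded, not joined.
order_doc = {
    "orderId": 1001,
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [                                   # nested list, no junction table
        {"productId": 1, "quantity": 3},
        {"productId": 7, "quantity": 1},
    ],
}

# Better storage locality: the whole record serializes as one blob
blob = json.dumps(order_doc)

# Reading it back needs no joins; the tree maps onto native dicts/lists
restored = json.loads(blob)
total_items = sum(item["quantity"] for item in restored["items"])
print(total_items)  # → 4
```

The flip side is the listed weakness: if many orders embed the same product data, many-to-one facts get duplicated across documents.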
- Records are modeled as nodes in a graph, labeled with an Entity type
- Each node can have properties and relationships to other nodes
- Performing graph traversals through an unknown number of edges (joins) is easy
- Great for many-to-one and many-to-many relationships
- Support query languages (like Cypher) for easy graph querying
- Not great when we don't have high connectivity
- Not good when the number of joins is small and predefined
- More complex and less standard query language
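The "friends of friends" traversal that strains relational joins is natural in a graph model. A toy adjacency-list sketch (the data is illustrative):

```python
# Social graph as adjacency lists: nodes are users, edges are "follows"
follows = {
    "ada":   {"alan", "grace"},
    "alan":  {"grace"},
    "grace": {"linus"},
    "linus": set(),
}

def within_hops(graph: dict, start: str, hops: int) -> set:
    """All nodes reachable from `start` in at most `hops` edge traversals."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        # Each hop is one sweep over outgoing edges, roughly constant cost
        # per edge, instead of another self-join in a relational schema
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen - {start}

print(within_hops(follows, "ada", 2))  # friends plus friends-of-friends
```

Adding a third hop is one more loop iteration here; in SQL it would be yet another join, which is the O(n)-per-hop problem the table above refers to.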
| Technique | Purpose |
|---|---|
| Normalization | → Reduce Duplication and Storage Cost |
| Denormalization | → Improve Read Performance |
| Indexing | → Improve Search / Query Performance |
| Sharding / Partitioning | → Improve Scalability |
| Replication | → Improve Availability and Fault Tolerance |
- Step-by-step process to create a data model: identify, define, choose, create, optimize
- Step 1: Identify the Entities (the nouns in the domain)
- Step 2: Identify the Attributes / Properties
- Step 3: Identify the Relationships between entities
- Step 4: Choose data modeling technique & create physical model
- Step 5: Optimize (normalize, index, shard, replicate)
- Relational · Document · Graph, each with distinct trade-offs
Products and Sums in Data Modeling
Product types combine data with AND; sum types combine data with OR. Together they form Algebraic Data Types (ADTs), a powerful way to model data that prevents invalid states at the type level.
Product Types (AND)
- Combine multiple fields together: struct, record, tuple, class
Sum Types (OR)
- Exactly one of several variants: enum, sealed class, union
- Groups multiple values together; the resulting type contains all of its parts simultaneously
- Represents an AND relationship: field A AND field B AND field C
- Has a fixed number of required attributes / fields
- Total possible values = the product (multiplication) of each field's possibilities
- Other names: Record, Struct, Data class, Tuple
- Example: a User row is the product `(id × name × email × created_at)`
- Represents a choice between multiple values: variant A OR variant B OR variant C
- Creates an OR relationship: a value can be one of several variants, never two at once
- Total possible values = the sum (addition) of each variant's possibilities
- Other names: Discriminated union, Disjoint union, Tagged union, Sealed type, Mixed data
- Relational databases are built around product types: every row is a fixed set of columns (field A AND field B)
- There is no built-in OR construct to say "this row is either a card payment or a bank payment"
- Sum types must be encoded manually using mapping strategies like:
- Single-table with a discriminator column
- Table-per-type with foreign keys
- Table-per-concrete-type
- Each strategy has trade-offs in nullability, query complexity, and data integrity
- A `CREATE TABLE` with columns is already a product type: every row has ALL fields
- No special encoding needed: databases handle products out of the box
- Sum types require a design decision: you must choose a mapping strategy
- Variant constraints must be enforced at the application level or via `CHECK` constraints; the database won't do it for you
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Single Table | One table, discriminator column, nullable fields per variant | Simple, fast queries, no joins | Nullable columns, wasted space |
| Table per Type | Base table + one table per variant with FK | No nulls, clean schema | Requires joins, more complex |
| Table per Class | Each variant gets its own table with all fields (no base table) | No joins, no nulls | No shared queries, base field duplication |
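The single-table strategy can be sketched in DDL: one table, a discriminator column, nullable per-variant fields, and a `CHECK` constraint doing the variant enforcement the database won't do on its own (the payment schema is illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment (
        payment_id  INTEGER PRIMARY KEY,
        kind        TEXT NOT NULL CHECK (kind IN ('card', 'bank')),  -- discriminator
        card_number TEXT,   -- card variant field (NULL for bank payments)
        iban        TEXT,   -- bank variant field (NULL for card payments)
        -- enforce that exactly the chosen variant's fields are present
        CHECK (
            (kind = 'card' AND card_number IS NOT NULL AND iban IS NULL) OR
            (kind = 'bank' AND iban IS NOT NULL AND card_number IS NULL)
        )
    )""")

conn.execute("INSERT INTO payment VALUES (1, 'card', '4111-1111', NULL)")
conn.execute("INSERT INTO payment VALUES (2, 'bank', NULL, 'DE00000000')")

# A row claiming to be a card payment but carrying bank fields is rejected
try:
    conn.execute("INSERT INTO payment VALUES (3, 'card', NULL, 'DE00000000')")
except sqlite3.IntegrityError:
    print("invalid variant rejected")
```

This shows the table's stated cons directly: every variant field must be nullable, and unused columns waste space on each row.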
- In JPA, these strategies correspond to `@Inheritance(strategy = SINGLE_TABLE)`, `JOINED`, and `TABLE_PER_CLASS` respectively
- `Boolean × Boolean` = 2 × 2 = 4 states: `(true, true)`, `(true, false)`, `(false, true)`, `(false, false)`
- A struct with 3 booleans: 2³ = 8 states
- Every field multiplies the state space
- `Boolean | TriState` = 2 + 3 = 5 states
- It's one OR the other, never both
- `Circle | Rectangle` = Circle values + Rectangle values
- Each variant adds to the state space
An abstract representation of our data that:
- Represents our business logic independently of any database or language
- Is clear: every stakeholder can understand the domain model
- Is safe: the type system captures constraints before code is written
- Tells us which states are valid and which are invalid
- Products define what must be present; sums define what can vary
- Take advantage of capabilities of each technology
- Easily migrate to new technologies while isolating disruption
- Make technology-specific optimizations:
- Performance optimizations
- Security enhancements
- Storage, caching, indexing strategies, etc.
- The abstract model stays stable even as implementations change
- Product Type (AND): groups multiple values together; must contain values of all its components
- Product name origin: the number of possible states is the "product" (multiplication) of each field's possibilities
- Sum Type (OR): represents a choice between multiple values; one variant at a time, never two
- Sum name origin: the number of possible states is the "sum" (addition) of each variant's possibilities
- Databases don't natively support sum types; they must be encoded via single-table, table-per-type, or table-per-class
- Products + Sums separate the data model from its representations; the abstract model stays stable as implementations change
Data Modeling Fundamentals
- Data outlives code; a first-class concern
- Entities, relationships, constraints
- Conceptual → Logical → Physical
- Data drives quality attributes
Model Types & Notations
- Relational: tables, ACID, structured
- NoSQL: document, key-value, graph
- ER diagrams: Chen notation
- UML class diagrams for data
Step-by-Step Modeling
- Entities β Attributes β Relationships
- Relational · Document · Graph techniques
- Optimize: Normalize, Index, Shard, Replicate
- Each technique has distinct trade-offs
Algebraic Data Types
- Product = AND (structs, records)
- Sum = OR (sealed, unions)
- Make invalid states unrepresentable
- DB mapping: single table, joined, per-class