LearningTree

Design of Data and
Data Models

Conceptual, logical, and physical data models β€” ER notation, product and sum types, normalization, and a step-by-step data modeling process for architects.

01
Chapter One Β· Data Modeling

Introduction to Data Modeling

Data is a first-class architectural concern. The way you structure, store, and access data shapes nearly every quality attribute β€” performance, scalability, consistency, and evolvability.

πŸ—„οΈ

Data Outlives Code

  • Applications get rewritten β€” data persists for decades
πŸ“Š

Drives Quality

  • Data model shapes performance, scalability & consistency
πŸ—οΈ

Three Model Levels

  • Conceptual β†’ Logical β†’ Physical, each at a different abstraction
What is Data Modelling?

Data modelling is the process of analyzing, defining and visualizing the data stored and managed in our system β€” establishing a shared, precise understanding of what data exists, how it is structured, and how it behaves.

Software Architecture

A blueprint for the:

  • Structure of our system's components
  • Relationships between those components
  • Rules and principles guiding its evolution
Data Model

A blueprint for the:

  • Data entities in our system and their attributes
  • Relationships between those entities
  • Rules governing the data β€” privacy, security, integrity constraints, etc.
  • How data is stored and used across the system
In short: Just as the software architecture is the blueprint for the system's structure, the data model is the blueprint for the system's data β€” both are essential, first-class design artifacts.
Class Modelling vs Data Modelling

Both use diagrams to describe a system β€” but they serve different purposes. Class modelling captures runtime behaviour and object identity (what objects do); data modelling captures how data is persisted, stored, and constrained (what data is).

CLASS MODELLING Customer Β«classΒ» - id : int - name : String - email : String + getOrders() : List<Order> + placeOrder() : Order + updateEmail(s : String) Behaviour (methods) Focus: runtime behaviour & object identity vs DATA MODELLING Customer Β«entityΒ» PK id : INTEGER name : VARCHAR(100) email : VARCHAR(255) created_at : TIMESTAMP FK order_id : INTEGER β€” no methods β€” data & constraints only Focus: persistence, storage & constraints
Aspect Class Modelling (UML) Data Modelling (ER / Schema)
Focus Runtime behaviour & object identity Data persistence, storage & constraints
Contains Attributes + Methods (behaviour) Attributes only β€” no methods
Output Source code β€” classes, interfaces DB schemas β€” tables, DDL, ER diagrams
Tool UML Class Diagrams ER Diagrams, Logical / Physical schemas
Audience Developers, architects DBAs, data architects, developers
Key Concern Encapsulation, inheritance, polymorphism Normalization, referential integrity, cardinality
Key takeaway: A class diagram's methods section has no equivalent in a data model β€” data models deliberately exclude behaviour and focus purely on what data exists, how it is structured, and what rules govern it. The same domain entity (e.g. Customer) will look very different in each model.
Why Architects Must Care About Data
Data Outlives Code
  • Applications get rewritten; data persists for decades
  • A poor data model is one of the hardest things to refactor
  • Schema changes can require expensive migrations across all consumers
Data Drives Quality Attributes
  • Performance β€” query patterns must align with data model
  • Scalability β€” how data is partitioned determines horizontal scaling
  • Consistency β€” ACID vs. eventual consistency affects correctness
  • Evolvability β€” schema evolution vs. rigid table structures
What is a Data Model?

A data model is an abstract representation of the data structures, relationships, and constraints that describe how data is organized, stored, and manipulated within a system. It serves as a blueprint for database design.

Entities
  • Real-world objects or concepts (e.g., Customer, Order, Product)
  • Each entity has attributes (properties)
  • One attribute (or combination) forms the primary key
Relationships
  • How entities are associated: 1:1, 1:N, M:N
  • Direction and cardinality define data access patterns
  • Foreign keys enforce referential integrity
Constraints
  • Rules that data must satisfy (NOT NULL, UNIQUE, CHECK)
  • Business rules encoded at the data level
  • Referential integrity β€” foreign keys must point to valid records
Core Design Principles
β‘  Keep It Simple
  • Start with the simplest model that satisfies requirements
  • Avoid over-engineering before understanding access patterns
  • Complexity can be added later; removing it is costly
β‘‘ Normalize to Avoid Duplication
  • Split related data into separate tables with PK / FK links
  • One fact in one place β€” eliminates update anomalies
  • Reduces storage and keeps data consistent
β‘’ Prefer IDs Over Names
  • Store productId (FK) β€” not productName (string copy)
  • Names change; IDs are stable and immutable
  • Compact keys = faster joins and less storage
β‘£ Design for Internationalization
  • Store locale and language codes alongside data
  • Use translation tables for all user-facing text
  • Use Unicode-safe column types (UTF-8, NVARCHAR)
β‘€ Secure Data with Role-Based Access
  • PK/FK relationships do not restrict data visibility
  • Any DB user can read all rows without proper roles
  • Apply database roles and Row-Level Security (RLS)
  • Never rely solely on application-layer access control
β‘₯ Denormalize Selectively for Performance
  • Large joins across many tables cause latency
  • For read-heavy workloads: controlled redundancy speeds queries
  • Trade-off: storage cost vs query performance
  • Apply selectively β€” never denormalize prematurely
Choosing the Right Database for the Workload

Not all data has the same access pattern. Read-heavy and write-heavy workloads place fundamentally different demands on the database. Choosing the wrong type causes avoidable performance problems at scale.

Workload Profile Problem at Scale Recommended Store Key Trade-off
Read-heavy, simple lookups Relational joins slow under high read load Document / Key-Value Store Better read speed vs weaker consistency
Write-heavy, transactional NoSQL stores often lack full ACID guarantees Relational (RDBMS) Strong consistency vs limited horizontal scale
Highly connected data Deep multi-hop joins become O(n) in RDBMS Graph Database Fast traversals vs complexity & standardization
Large joins, many tables Join latency degrades response time Denormalize or add a cache Faster reads vs storage cost & write overhead
Polyglot Persistence β€” Matching Data to the Right Store

Polyglot persistence is the practice of using different data stores for different parts of the same system β€” each optimized for its specific data shape and access pattern. A real-world social media system illustrates this well.

News Feed β†’ NoSQL (Document / Key-Value)
  • Feed items are read far more often than they are written
  • Flexible schema β€” posts, videos, and links differ in structure
  • Document store allows denormalized, ready-to-render documents
  • Key-Value cache (e.g., Redis) serves hot feeds in microseconds
Trade-off: Better scalability & read performance vs eventual consistency β€” a user may briefly see a slightly stale feed.
Users & Relationships β†’ Graph DB
  • Social graphs (followers, friends, recommendations) are highly connected
  • Relational joins struggle with multi-hop traversals ("friends of friends")
  • Graph DB traverses edges natively in near-constant time per hop
  • Powers recommendations, shortest path, and influence scoring
Trade-off: Better graph traversal & recommendation quality vs added operational complexity, learning curve, and standardization challenges.
Architect's perspective: There is no single database optimal for all use cases. The architect's job is to match the data shape and access pattern to the right store β€” and to manage the added operational complexity that comes with running multiple systems.
Data Model Types β€” Relational vs. Non-Relational
Relational Model
  • Data stored in tables (relations) with rows and columns
  • Strong ACID guarantees β€” Atomicity, Consistency, Isolation, Durability
  • Schema-on-write β€” structure defined before data is inserted
  • Best for: structured data with clear relationships
  • Examples: PostgreSQL, MySQL, Oracle, SQL Server
Non-Relational (NoSQL) Models
  • Document β€” JSON-like documents (MongoDB, CouchDB)
  • Key-Value β€” simple lookup by key (Redis, DynamoDB)
  • Column-Family β€” wide rows, column-oriented (Cassandra, HBase)
  • Graph β€” nodes and edges (Neo4j, Amazon Neptune)
  • Schema-on-read β€” flexible, evolving structure
When to Use Which Model
Criteria Relational Document Key-Value Graph
Data Structure Highly structured Semi-structured Unstructured Interconnected
Query Pattern Complex joins, aggregations By document ID, nested queries Simple get/put Traversals, path queries
Consistency Strong (ACID) Tunable Eventual Tunable
Scalability Vertical (mostly) Horizontal Horizontal Vertical + some horizontal
Schema Rigid, enforced Flexible Schema-free Flexible
Use Case Banking, ERP, inventory CMS, catalogs, profiles Caching, sessions Social networks, recommendations
📋 Chapter 1 — Summary
  • Data Modeling β€” process of analyzing, defining, and visualizing data stored and managed within our system
  • Analyzing β€” understand what data exists and how it is used
  • Defining β€” specify entities, attributes, relationships and rules
  • Visualizing β€” communicate the model through diagrams and schemas
  • Data modeling has significant impact on software architecture
  • Data modeling influences technology choices
  • Data models affect fulfillment of functional requirements
  • Data models affect fulfillment of quality requirements
02
Chapter Two Β· Data Modeling

Data Model Types and Notations

Different data model types suit different problems. The architect must choose not only the model but also the notation used to communicate it β€” ER diagrams, UML, or database-specific schemas.

πŸ’‘

Conceptual

  • High-level entities & relationships β€” business language, no tech detail
πŸ“

Logical

  • Attributes, keys & constraints β€” independent of any specific database
πŸ—„οΈ

Physical

  • Tables, columns, indexes β€” optimised for a specific DBMS
Three Levels of Data Models
Key concept: Data modeling progresses through three levels of abstraction β€” from business concepts to physical implementation. Each level serves a different audience and purpose.
πŸ’‘
Conceptual Model
  • What data exists in the business domain
  • High-level entities and their relationships
  • No technical details β€” business language only
  • Audience: stakeholders, domain experts, architects
  • Tool: ER diagrams, UML class diagrams
  • Can use informal sketches or whiteboard drawings β€” no standard formal notation required
  • Typically a whiteboard / discovery conversation with stakeholders, not a strict artefact
πŸ“
Logical Model
  • How data is structured β€” independent of technology
  • Defines attributes, data types, keys, relationships
  • Applies normalization rules
  • Database / technology-independent β€” not tied to any specific DB engine
  • Audience: software engineers, data engineers, architects
  • Uses formal diagramming notations:
    • Information Engineering / Crow's Foot notation
    • Barker's notation
    • IDEF1X
    • Chen's notation
    • UML (unofficial use)
πŸ—„οΈ
Physical Model
  • Where and how data is stored on disk
  • Database-specific: table names, column types, indexes, partitions
  • May include denormalization for performance
  • Audience: DBAs, developers
  • Tool: DDL scripts, DB-specific diagrams
Progression: Conceptual Model (business view) β†’ Logical Model (structure view) β†’ Physical Model (implementation view). Each step adds more detail and technical specificity.
Conceptual β†’ Logical β†’ Physical: An Example
Conceptual β€” business entities only
// Business concepts Customer ──places──▸ Order Order ──contains──▸ Product // No attributes, no keys // Just entities & relationships
Logical β€” attributes & types added
Customer { PK customerId : Integer name : String email : String } Order { PK orderId : Integer FK customerId : Integer orderDate : Date totalAmount : Decimal }
Physical β€” DDL for PostgreSQL
CREATE TABLE customers ( customer_id SERIAL PRIMARY KEY, name VARCHAR(100) NOT NULL, email VARCHAR(255) UNIQUE ); CREATE TABLE orders ( order_id SERIAL PRIMARY KEY, customer_id INT REFERENCES customers(customer_id), order_date DATE DEFAULT NOW(), total NUMERIC(10,2) ); CREATE INDEX idx_cust ON orders(customer_id);
Entity-Relationship (ER) Notation

The Entity-Relationship model (Peter Chen, 1976) is the most widely used notation for conceptual and logical data modeling. It represents entities, their attributes, and the relationships between them.

ER Diagram Elements
SymbolMeaningExample
RectangleEntityCustomer, Order
EllipseAttributename, email, orderId
DiamondRelationship"places", "contains"
UnderlinedPrimary KeycustomerId
Double RectangleWeak EntityOrderItem (depends on Order)
1 β€” N lineCardinalityCustomer 1 β€” N Order
Cardinality Notations
NotationMeaning
1:1One-to-One β€” each entity maps to exactly one of the other
1:NOne-to-Many β€” one entity maps to many of the other
M:NMany-to-Many β€” requires a junction/bridge table in physical model
0..1Optional β€” zero or one (partial participation)
1..*Mandatory many β€” at least one
ER Diagram β€” E-Commerce Example
Customer custId name places 1 N Order orderId date contains 1 N Product price
UML Class Diagrams for Data Modeling

UML class diagrams can represent data models with classes as entities, attributes as fields, and associations as relationships. They are especially useful when the data model is closely aligned with the domain model.

UML Advantages: Unified notation for both data model and domain model. Supports stereotypes, multiplicity, and composition.
UML-style diagram β€” Customer ↔ Order
Customer Β«entityΒ» PK customerId : int name : String email : String createdAt : Date + getOrders() : List<Order> Β«behaviourΒ» 1 N places Order Β«entityΒ» PK orderId : int FK customerId : int date : Date totalAmount : Decimal status : String β€” data model: no methods β€”
Crow's Foot Notation

Crow's Foot notation is the most widely used ER notation in the logical modeling phase. It encodes both the minimum and maximum cardinality of a relationship directly onto each end of the line β€” no separate labels needed.

Each end of a relationship line carries two markers β€” read from the entity outward:

  • Inner mark (closest to entity) β€” minimum: single bar | = 1, circle β—‹ = 0
  • Outer mark (away from entity) β€” maximum: single bar | = 1, crow's foot < = many
SymbolNameMin / Max
β€”β€”||β€”β€”One and only oneMin 1 Β· Max 1 β€” mandatory single
β€”β€”|β—‹β€”β€”Zero or oneMin 0 Β· Max 1 β€” optional single
β€”β€”|<β€”β€”One or manyMin 1 Β· Max N β€” mandatory multi
β€”β€”β—‹<β€”β€”Zero or manyMin 0 Β· Max N β€” optional multi
Reading tip: Always read cardinality from the entity outward along the line. Inner mark = minimum; outer mark = maximum.
Crow's Foot β€” Customer ↔ Order (one places one-or-many)
CUSTOMER PK customerId : int name : String email : String ORDER PK orderId : int FK customerId : int date : Date 1 N places exactly one 1 customer owns each order one or many N orders per customer
📋 Chapter 2 — Summary
  • Three types of abstraction for data models, each serving a different audience and level of detail
  • Conceptual β€” key entities, attributes and relationships (high-level picture)
  • Logical β€” attributes, data types and cardinality of relationships
  • Physical β€” maps entities to storage-specific elements (tables, columns, collections)
  • Notations for visualizing models: Crow's Foot, Barker's, IDEF1X, Chen's, UML
03
Chapter Three Β· Data Modeling

Step-by-Step Data Modeling Process

Data modeling is an iterative, multi-step process that begins with identifying the entities in a domain and ends with an optimized physical data model. Each step builds upon the previous one.

πŸ“¦

1. Entities

  • Identify core domain objects
🏷️

2. Attributes

  • Define properties & data types
πŸ”—

3. Relationships

  • Map how entities relate (1:1, 1:N, M:N)
πŸ—„οΈ

4. Physical Model

  • Choose DB tech & create tables
⚑

5. Optimize

  • Normalise, index & tune for performance
Step 1 β€” Identify the Entities
What is an Entity?

Collections of distinct:

  • Real-world objects
  • Locations
  • Concepts
  • Events
Each Instance / Record is:
  • Uniquely identifiable
  • Has properties / attributes
Tip: Look for "Nouns" in the domain description β€” they are your candidate entities.
Example Scenario

"An employee can schedule a meeting with one or several other employees. Every meeting takes place in a specific conference room. Only one meeting can occur in a conference room at any given time."

E EMPLOYEE EmployeeID E MEETING MeetingID E CONFERENCE_ROOM RoomID
Step 2 β€” Identify the Attributes / Properties
What are Attributes?
  • Properties / characteristics of an entity
  • At least one attribute per entity
  • Stores a single, meaningful piece of information
Example β€” Employee Attributes
E EMPLOYEE EmployeeID PK FirstName LastName Department EmailAddress SocialSecurityNumber Gender
Step 3 β€” Identify the Relationships

"An employee can schedule a meeting with one or several other employees. Every meeting takes place in a specific conference room. Only one meeting can occur in a conference room at any given time."

Identify Relationships
E EMPLOYEE EmployeeID FirstName LastName EmailAddress attends / schedules E MEETING MeetingID MeetingTime Duration takes place E CONFERENCE_ROOM RoomID Location Capacity
Identify Relationships β€” Constraints / Rules
E EMPLOYEE EmployeeID FirstName LastName EmailAddress attends / schedules E MEETING MeetingID MeetingTime Duration takes place E CONFERENCE_ROOM RoomID Location Capacity CONSTRAINTS: Only one meeting per conference room at any given time Β· MeetingTime is required (β—‰)
Step 4 β€” Identify the Data Modeling Technique & Create Physical Model

Choose the appropriate data modeling technique based on the nature of your data and relationships. Each technique has distinct strengths and trade-offs.

A. Relational Data Modeling
Employee PK employeeId : INT firstName : VARCHAR lastName : VARCHAR department : VARCHAR email : VARCHAR Meeting PK meetingId : INT FK roomId : INT meetingTime : DATETIME duration : INT ConferenceRoom PK roomId : INT location : VARCHAR capacity : INT
Advantages
  • Joining tables is very easy β€” allows creating complex relationships
  • Good query optimizers
  • Support transactions and strict consistency
  • Schema enforcement β€” "Schema on Write"
Disadvantages
  • Tables/rows don't fit programming data structures (Lists, Arrays, Hash Tables)
  • Requires ORM technologies β€” creates dependencies and performance overhead
  • Strict schema creates performance overhead
  • Schema updates are difficult, time consuming and introduce downtime
One-to-Many Relationship Example
Employee PK employeeId name email 1 N schedules Meeting PK meetingId FK employeeId meetingTime
B. Document Data Modeling
  • Entity → Collection of documents of the same type
  • Each record/instance is a document β€” a set of properties in a tree-like structure
  • Flexible schema β€” "Schema on Read"
Restaurant Search Data Model
restaurant β”œβ”€β”€ id β”œβ”€β”€ name β”œβ”€β”€ locations[] β”œβ”€β”€ hours β”‚ β”œβ”€β”€ Monday .. Sunday └── menus β”œβ”€β”€ breakfast β”œβ”€β”€ lunch └── dinner
Advantages
  • Native support for programming language data structures (lists, arrays, maps)
  • Great for no relationships and one-to-many β€” no complex joins needed
  • Better storage locality for each record
  • Flexible schema β€” no complex migrations, no downtime
  • Easily scalable
Disadvantages
  • Not good for many-to-many and many-to-one relationships
  • Flexible schema (schema on read) is not always desirable
One-to-Many Relationship β€” Document Style
Employee (doc) employeeId: 42 name: "Alice" meetings: [ { id: 1, time: ... }, 1 N Meeting (embedded) meetingId: 1 time: "10:00" room: "A-101" duration: 60
C. Graph Data Modeling
Graph Model Basics
  • Records are modeled as nodes in a graph, labeled with an Entity type
  • Each node can have properties and relationships to other nodes
Employee Alice Employee Bob Meeting 10:00 AM Room A-101 Cap: 10 SCHEDULES ATTENDS IN_ROOM
Advantages
  • Performing graph traversals through unknown number of edges (joins) is easy
  • Great for many-to-one and many-to-many relationships
  • Support query languages (like Cypher) for easy graph querying
Disadvantages
  • Not great when we don't have high connectivity
  • Not good when the number of joins is small and predefined
  • More complex and less standard query language
Step 5 β€” Optimize the Data Model
TechniquePurpose
Normalization→ Reduce Duplication and Storage Cost
Denormalization→ Improve Read Performance
Indexing→ Improve Search / Query Performance
Sharding / Partitioning→ Improve Scalability
Replication→ Improve Availability and Fault Tolerance
📋 Chapter 3 — Summary
  • Step-by-step process to create a data model β€” identify, define, choose, create, optimize
  • Step 1: Identify the Entities β€” nouns in the domain
  • Step 2: Identify the Attributes / Properties
  • Step 3: Identify the Relationships between entities
  • Step 4: Choose data modeling technique & create physical model
  • Step 5: Optimize β€” normalize, index, shard, replicate
  • Relational Β· Document Β· Graph β€” each with distinct trade-offs
04
Chapter Four Β· Algebraic Data Types

Products and Sums in Data Modeling

Product types combine data with AND; sum types combine data with OR. Together they form Algebraic Data Types (ADTs) β€” a powerful way to model data that prevents invalid states at the type level.

βœ–οΈ

Product Types (AND)

  • Combine multiple fields together β€” struct, record, tuple, class
βž•

Sum Types (OR)

  • Exactly one of several variants β€” enum, sealed class, union
Product Types β€” AND Composition
"Product" Type
  • Groups multiple values together β€” the resulting type contains all of its parts simultaneously
  • Represents an AND relationship β€” field A AND field B AND field C
  • Has a fixed number of required attributes / fields
  • Total possible values = the product (multiplication) of each field's possibilities
  • Other names: Record, Struct, Data class, Tuple
"Product" Type β€” Reason for the Name
ENTITY DIAGRAM E User Name : String Age : Int WHY "PRODUCT" TYPE? Field Γ— Field = States Name Γ— Age = N ↓ ↓ ↓ S values Γ— I values = N states Name (type: String) Γ— Age (type: Int) = N (Number of possible User States) S Γ— I = N The name "Product Type" comes from the mathematical Cartesian product . The total number of possible states of the type is the product of the number of possible values of each field ( S Γ— I = N ).
Product Type β€” Java Class
public class Address { private String street; // AND private String city; // AND private String zip; // AND private String country; // AND } // Address = street Γ— city Γ— zip Γ— country // Every Address has ALL four fields
Product Type β€” Record / Tuple
// Java 16+ Record β€” product type public record Point( double x, // AND double y // AND ) {} // TypeScript β€” object type type Point = { x: number; // AND y: number; // AND }
Product type examples: Structs, classes, records, tuples, database rows β€” anything where you have field A AND field B. A database table row is a product type: (id Γ— name Γ— email Γ— created_at).
Sum Types β€” OR Composition
"Sum" Type
  • Represents a choice between multiple values β€” variant A OR variant B OR variant C
  • Creates an OR relationship β€” a value can be one of several variants, never two at once
  • Total possible values = the sum (addition) of each variant's possibilities
  • Other names: Discriminated union, Disjoint union, Tagged union, Sealed type, Mixed data
"Sum" Type β€” Example 1
public enum OrderStatus { PENDING, CANCELLED, SUCCESSFUL } OrderStatus currentStatus = OrderStatus.PENDING; // OrderStatus can have any one state and not both
"Sum" Type β€” Reason for the Name
currentStatus ← PENDING or CANCELLED or SUCCESSFUL Only one variant at a time β€” never two simultaneously Number of States: 1 + 1 + 1 = 3 states The name "Sum Type" comes from the mathematical addition (sum) of possibilities. The total number of possible states is the sum of each variant's states ( 1 + 1 + 1 = 3 ). Contrast with Product Type where states are multiplied (S Γ— I = N).
Sum Type β€” Java Sealed Interface (17+)
public sealed interface Shape permits Circle, Rectangle, Triangle {} record Circle(double radius) implements Shape {} record Rectangle(double w, double h) implements Shape {} record Triangle(double a, double b, double c) implements Shape {} // Shape = Circle + Rectangle + Triangle // A Shape is one of these β€” never two
Sum Type β€” TypeScript Discriminated Union
type PaymentMethod = | { kind: "creditCard"; cardNumber: string; expiry: string; } | { kind: "bankTransfer"; iban: string; } | { kind: "paypal"; email: string; } // PaymentMethod = CreditCard + BankTransfer + PayPal // The "kind" field is the discriminator
Why Sum Types Matter β€” Making Invalid States Unrepresentable
❌ Without Sum Type β€” nullable fields
public class Payment { String type; // "card" or "bank" or "paypal" // Card fields β€” null if not card String cardNumber; String expiry; // Bank fields β€” null if not bank String iban; // PayPal fields β€” null if not paypal String paypalEmail; } // ⚠ Can have cardNumber AND iban set // ⚠ Can have type="card" but no cardNumber // ⚠ Invalid states are representable!
✓ With Sum Type β€” only valid states
sealed interface Payment {} record CardPayment( String cardNumber, String expiry ) implements Payment {} record BankPayment( String iban ) implements Payment {} record PayPalPayment( String email ) implements Payment {} // βœ“ CardPayment ALWAYS has cardNumber // βœ“ Can't mix card + bank fields // βœ“ Compiler enforces completeness
Anti-pattern: Using a single class with nullable fields to represent different variants. This creates "stringly typed" code where invalid states are possible and only caught at runtime (or not at all).
Products + Sums in Database Design
PRODUCT TYPE (AND) customers PK id : INT AND name : VARCHAR AND email : VARCHAR AND phone : VARCHAR id Γ— name Γ— email Γ— phone Every row has ALL fields SUM TYPE (OR) payments PK id : INT type : VARCHAR discriminator OR OR card_payments card_num expiry bank_payments iban Card + Bank A payment is ONE of these β€” never both Product (AND) Every row contains all fields simultaneously. States = field₁ Γ— fieldβ‚‚ Γ— … Γ— fieldβ‚™ Sum (OR) A record is one variant at a time. States = variant₁ + variantβ‚‚ + … + variantβ‚™
Product β€” a standard table row
CREATE TABLE customers ( id INT PRIMARY KEY, name VARCHAR(100), -- AND email VARCHAR(255), -- AND phone VARCHAR(20) -- AND ); -- Row = id Γ— name Γ— email Γ— phone -- Every customer has ALL fields
Sum β€” table-per-type (inheritance mapping)
-- Base: all payments share these CREATE TABLE payments ( id INT PRIMARY KEY, type VARCHAR(20) -- discriminator ); -- Variant: card-specific fields CREATE TABLE card_payments ( payment_id INT REFERENCES payments, card_num VARCHAR(16), expiry DATE ); -- Variant: bank-specific fields CREATE TABLE bank_payments ( payment_id INT REFERENCES payments, iban VARCHAR(34) );
Databases Don't Natively Support Sum Types
  • Relational databases are built around product types β€” every row is a fixed set of columns (field A AND field B)
  • There is no built-in OR construct to say "this row is either a card payment or a bank payment"
  • Sum types must be encoded manually using mapping strategies like:
    • Single-table with a discriminator column
    • Table-per-type with foreign keys
    • Table-per-concrete-type
  • Each strategy has trade-offs in nullability, query complexity, and data integrity
Product Types Map Naturally
  • A CREATE TABLE with columns is already a product type β€” every row has ALL fields
  • No special encoding needed β€” databases handle products out of the box
  • Sum types require a design decision β€” you must choose a mapping strategy
  • Variant constraints must be enforced at the application level or via CHECK constraints β€” the database won't do it for you
Mapping Strategies for Sum Types in Databases
Strategy Description Pros Cons
Single Table One table, discriminator column, nullable fields per variant Simple, fast queries, no joins Nullable columns, wasted space
Table per Type Base table + one table per variant with FK No nulls, clean schema Requires joins, more complex
Table per Class Each variant gets its own table with all fields (no base table) No joins, no nulls No shared queries, base field duplication
JPA/Hibernate: These strategies map to @Inheritance(strategy = SINGLE_TABLE), JOINED, and TABLE_PER_CLASS respectively.
Algebraic Thinking β€” Counting States
Product β€” Multiply
  • Boolean Γ— Boolean = 2 Γ— 2 = 4 states
  • (true,true), (true,false), (false,true), (false,false)
  • A struct with 3 booleans: 2Β³ = 8 states
  • Every field multiplies the state space
Sum β€” Add
  • Boolean | TriState = 2 + 3 = 5 states
  • It's one OR the other, never both
  • Circle | Rectangle = Circle values + Rectangle values
  • Each variant adds to the state space
Design principle: Use products when a value requires all parts. Use sums when a value is one of several alternatives. Sums constrain the state space, making invalid states unrepresentable.
Importance of Products and Sums
Separation of the Data Model from its Representations

An abstract representation of our data that:

  • Represents our business logic independently of any database or language
  • Is clear β€” every stakeholder can understand the domain model
  • Is safe β€” the type system captures constraints before code is written
  • Tells us which states are valid and which are invalid
  • Products define what must be present; sums define what can vary
Decoupling of Data Model Implementations β€” Benefits
  • Take advantage of capabilities of each technology
  • Easily migrate to new technologies while isolating disruption
  • Make technology-specific optimizations:
    • Performance optimizations
    • Security enhancements
    • Storage, caching, indexing strategies, etc.
  • The abstract model stays stable even as implementations change
Example β€” PurchaseTransaction with Product + Sum
E PurchaseTransaction userId PRODUCT (AND) productId quantity paymentMethod SUM (OR) OR E CreditCard cardNumber securityCode expiryDate E PayPal email Product: userId Γ— productId Γ— quantity Γ— paymentMethod Sum: paymentMethod = CreditCard + PayPal
Key insight: The data model is an abstract blueprint β€” it separates what the data is from how it is stored. Products ensure all required fields are present; sums ensure only valid variants exist. Together, they make invalid states unrepresentable before any code or schema is written.
📋 Chapter 4 — Summary
  • Product Type (AND): groups multiple values together β€” must contain values of all its components
  • Product name origin: number of possible states is the "product" (multiplication) of each field's possibilities
  • Sum Type (OR): represents a choice between multiple values β€” one variant at a time, never two
  • Sum name origin: number of possible states is the "sum" (addition) of each variant's possibilities
  • Databases don't natively support sum types β€” must be encoded via single-table, table-per-type, or table-per-class
  • Products + Sums separate data model from representations β€” abstract model stays stable as implementations change
Summary β€” All Data Modeling Topics at a Glance
01 Β· Introduction

Data Modeling Fundamentals

  • Data outlives code β€” first-class concern
  • Entities, relationships, constraints
  • Conceptual β†’ Logical β†’ Physical
  • Data drives quality attributes
02 Β· Types & Notations

Model Types & Notations

  • Relational β€” tables, ACID, structured
  • NoSQL β€” document, key-value, graph
  • ER diagrams β€” Chen notation
  • UML class diagrams for data
03 Β· Process

Step-by-Step Modeling

  • Entities β†’ Attributes β†’ Relationships
  • Relational Β· Document Β· Graph techniques
  • Optimize: Normalize, Index, Shard, Replicate
  • Each technique has distinct trade-offs
04 Β· Products & Sums

Algebraic Data Types

  • Product = AND (structs, records)
  • Sum = OR (sealed, unions)
  • Make invalid states unrepresentable
  • DB mapping: single table, joined, per-class