Design of Data and
Data Models
Conceptual, logical, and physical data models: ER notation, product and sum types, normalization, and a step-by-step data modeling process for architects.
Introduction to Data Modeling
Data is a first-class architectural concern. The way you structure, store, and access data shapes nearly every quality attribute: performance, scalability, consistency, and evolvability.
Data Outlives Code
- Applications get rewritten; data persists for decades
Drives Quality
- Data model shapes performance, scalability & consistency
Three Model Levels
- Conceptual → Logical → Physical, each at a different level of abstraction
Data modeling is the process of analyzing, defining, and visualizing the data stored and managed in our system, establishing a shared, precise understanding of what data exists, how it is structured, and how it behaves.
Software architecture is a blueprint for the:
- Structure of our system's components
- Relationships between those components
- Rules and principles guiding its evolution
A data model, in turn, is a blueprint for the:
- Data entities in our system and their attributes
- Relationships between those entities
- Rules governing the data: privacy, security, integrity constraints, etc.
- How data is stored and used across the system
Both use diagrams to describe a system, but they serve different purposes. Class modeling captures runtime behavior and object identity (what objects do); data modeling captures how data is persisted, stored, and constrained (what data is).
| Aspect | Class Modeling (UML) | Data Modeling (ER / Schema) |
|---|---|---|
| Focus | Runtime behavior & object identity | Data persistence, storage & constraints |
| Contains | Attributes + Methods (behavior) | Attributes only, no methods |
| Output | Source code: classes, interfaces | DB schemas: tables, DDL, ER diagrams |
| Tool | UML Class Diagrams | ER Diagrams, Logical / Physical schemas |
| Audience | Developers, architects | DBAs, data architects, developers |
| Key Concern | Encapsulation, inheritance, polymorphism | Normalization, referential integrity, cardinality |
The same entity (e.g., Customer) will look very different in each model.
- Applications get rewritten; data persists for decades
- A poor data model is one of the hardest things to refactor
- Schema changes can require expensive migrations across all consumers
- Performance: query patterns must align with data model
- Scalability: how data is partitioned determines horizontal scaling
- Consistency: ACID vs. eventual consistency affects correctness
- Evolvability: schema evolution vs. rigid table structures
A data model is an abstract representation of the data structures, relationships, and constraints that describe how data is organized, stored, and manipulated within a system. It serves as a blueprint for database design.
- Real-world objects or concepts (e.g., Customer, Order, Product)
- Each entity has attributes (properties)
- One attribute (or combination) forms the primary key
- How entities are associated: 1:1, 1:N, M:N
- Direction and cardinality define data access patterns
- Foreign keys enforce referential integrity
- Rules that data must satisfy (NOT NULL, UNIQUE, CHECK)
- Business rules encoded at the data level
- Referential integrity β foreign keys must point to valid records
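These constraint kinds can be sketched directly in DDL. A minimal SQLite example (table and column names are illustrative, not from the text):

```python
import sqlite3

# In-memory database; FK enforcement is off by default in SQLite
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE          -- NOT NULL + UNIQUE constraints
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL CHECK (total >= 0)       -- business rule at the data level
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")   # valid FK

# Referential integrity: an order pointing at a missing customer is rejected
try:
    conn.execute("INSERT INTO orders VALUES (11, 999, 5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The point is that the database, not the application, is the last line of defense for these rules.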
- Start with the simplest model that satisfies requirements
- Avoid over-engineering before understanding access patterns
- Complexity can be added later; removing it is costly
- Split related data into separate tables with PK / FK links
- One fact in one place: eliminates update anomalies
- Reduces storage and keeps data consistent
- Store `productId` (FK), not `productName` (string copy)
- Names change; IDs are stable and immutable
- Compact keys = faster joins and less storage
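A sketch of the "store the ID, not the name" rule with illustrative tables in SQLite: the product name lives in exactly one place, so renaming it never touches the order rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE order_item (
        order_id   INTEGER,
        product_id INTEGER REFERENCES product(product_id),  -- FK, not a name copy
        quantity   INTEGER
    );
    INSERT INTO product VALUES (1, 'Widget');
    INSERT INTO order_item VALUES (100, 1, 3);
""")

# One fact in one place: a rename is a single UPDATE; no order rows change
conn.execute("UPDATE product SET name = 'Widget Pro' WHERE product_id = 1")

row = conn.execute("""
    SELECT p.name, oi.quantity
    FROM order_item oi JOIN product p USING (product_id)
""").fetchone()
print(row)  # → ('Widget Pro', 3)
```

Had the order row stored the name string, every rename would require updating every historical order.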
- Store locale and language codes alongside data
- Use translation tables for all user-facing text
- Use Unicode-safe column types (`UTF-8`, `NVARCHAR`)
- PK/FK relationships do not restrict data visibility
- Any DB user can read all rows without proper roles
- Apply database roles and Row-Level Security (RLS)
- Never rely solely on application-layer access control
- Large joins across many tables cause latency
- For read-heavy workloads: controlled redundancy speeds queries
- Trade-off: storage cost vs query performance
- Apply selectively; never denormalize prematurely
Not all data has the same access pattern. Read-heavy and write-heavy workloads place fundamentally different demands on the database. Choosing the wrong type causes avoidable performance problems at scale.
| Workload Profile | Problem at Scale | Recommended Store | Key Trade-off |
|---|---|---|---|
| Read-heavy, simple lookups | Relational joins slow under high read load | Document / Key-Value Store | Better read speed vs weaker consistency |
| Write-heavy, transactional | NoSQL stores often lack full ACID guarantees | Relational (RDBMS) | Strong consistency vs limited horizontal scale |
| Highly connected data | Deep multi-hop joins become O(n) in RDBMS | Graph Database | Fast traversals vs complexity & standardization |
| Large joins, many tables | Join latency degrades response time | Denormalize or add a cache | Faster reads vs storage cost & write overhead |
Polyglot persistence is the practice of using different data stores for different parts of the same system, each optimized for its specific data shape and access pattern. A real-world social media system illustrates this well.
- Feed items are read far more often than they are written
- Flexible schema: posts, videos, and links differ in structure
- Document store allows denormalized, ready-to-render documents
- Key-Value cache (e.g., Redis) serves hot feeds in microseconds
- Social graphs (followers, friends, recommendations) are highly connected
- Relational joins struggle with multi-hop traversals ("friends of friends")
- Graph DB traverses edges natively in near-constant time per hop
- Powers recommendations, shortest path, and influence scoring
- Data stored in tables (relations) with rows and columns
- Strong ACID guarantees: Atomicity, Consistency, Isolation, Durability
- Schema-on-write: structure defined before data is inserted
- Best for: structured data with clear relationships
- Examples: PostgreSQL, MySQL, Oracle, SQL Server
- Document: JSON-like documents (MongoDB, CouchDB)
- Key-Value: simple lookup by key (Redis, DynamoDB)
- Column-Family: wide rows, column-oriented (Cassandra, HBase)
- Graph: nodes and edges (Neo4j, Amazon Neptune)
- Schema-on-read: flexible, evolving structure
| Criteria | Relational | Document | Key-Value | Graph |
|---|---|---|---|---|
| Data Structure | Highly structured | Semi-structured | Unstructured | Interconnected |
| Query Pattern | Complex joins, aggregations | By document ID, nested queries | Simple get/put | Traversals, path queries |
| Consistency | Strong (ACID) | Tunable | Eventual | Tunable |
| Scalability | Vertical (mostly) | Horizontal | Horizontal | Vertical + some horizontal |
| Schema | Rigid, enforced | Flexible | Schema-free | Flexible |
| Use Case | Banking, ERP, inventory | CMS, catalogs, profiles | Caching, sessions | Social networks, recommendations |
- Data Modeling: the process of analyzing, defining, and visualizing data stored and managed within our system
- Analyzing: understand what data exists and how it is used
- Defining: specify entities, attributes, relationships, and rules
- Visualizing: communicate the model through diagrams and schemas
- Data modeling has significant impact on software architecture
- Data modeling influences technology choices
- Data models affect fulfillment of functional requirements
- Data models affect fulfillment of quality requirements
Data Model Types and Notations
Different data model types suit different problems. The architect must choose not only the model but also the notation used to communicate it: ER diagrams, UML, or database-specific schemas.
Conceptual
- High-level entities & relationships; business language, no tech detail
Logical
- Attributes, keys & constraints; independent of any specific database
Physical
- Tables, columns, indexes; optimized for a specific DBMS
- What data exists in the business domain
- High-level entities and their relationships
- No technical details; business language only
- Audience: stakeholders, domain experts, architects
- Tool: ER diagrams, UML class diagrams
- Can use informal sketches or whiteboard drawings; no standard formal notation required
- Typically a whiteboard / discovery conversation with stakeholders, not a strict artifact
- How data is structured, independent of technology
- Defines attributes, data types, keys, relationships
- Applies normalization rules
- Database / technology-independent: not tied to any specific DB engine
- Audience: software engineers, data engineers, architects
- Uses formal diagramming notations:
- Information Engineering / Crow's Foot notation
- Barker's notation
- IDEF1X
- Chen's notation
- UML (unofficial use)
- Where and how data is stored on disk
- Database-specific: table names, column types, indexes, partitions
- May include denormalization for performance
- Audience: DBAs, developers
- Tool: DDL scripts, DB-specific diagrams
The Entity-Relationship model (Peter Chen, 1976) is the most widely used notation for conceptual and logical data modeling. It represents entities, their attributes, and the relationships between them.
| Symbol | Meaning | Example |
|---|---|---|
| Rectangle | Entity | Customer, Order |
| Ellipse | Attribute | name, email, orderId |
| Diamond | Relationship | "places", "contains" |
| Underlined | Primary Key | customerId |
| Double Rectangle | Weak Entity | OrderItem (depends on Order) |
| 1:N line | Cardinality | Customer 1:N Order |
| Notation | Meaning |
|---|---|
| 1:1 | One-to-One: each entity maps to exactly one of the other |
| 1:N | One-to-Many: one entity maps to many of the other |
| M:N | Many-to-Many: requires a junction/bridge table in the physical model |
| 0..1 | Optional: zero or one (partial participation) |
| 1..* | Mandatory many: at least one |
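In the physical model an M:N relationship always becomes a junction table. A sketch in SQLite (entity names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);

    -- Junction table: one row per (student, course) pair resolves the M:N
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)   -- each pair at most once
    );

    INSERT INTO student VALUES (1, 'Ada'), (2, 'Alan');
    INSERT INTO course  VALUES (10, 'Databases'), (20, 'Compilers');
    INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# Who takes course 10? Resolve the M:N with two joins through the junction.
names = [r[0] for r in conn.execute("""
    SELECT s.name FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    WHERE e.course_id = 10 ORDER BY s.name
""")]
print(names)  # → ['Ada', 'Alan']
```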
UML class diagrams can represent data models with classes as entities, attributes as fields, and associations as relationships. They are especially useful when the data model is closely aligned with the domain model.
Crow's Foot notation is the most widely used ER notation in the logical modeling phase. It encodes both the minimum and maximum cardinality of a relationship directly onto each end of the line; no separate labels are needed.
Each end of a relationship line carries two markers, read along the line toward the entity:
- Outer mark (nearest the entity) gives the maximum: single bar `|` = 1, crow's foot `<` = many
- Inner mark (just behind it) gives the minimum: single bar `|` = 1, circle `o` = 0
| Symbol | Name | Min / Max |
|---|---|---|
| `--\|\|--` | One and only one | Min 1 · Max 1 (mandatory single) |
| `--o\|--` | Zero or one | Min 0 · Max 1 (optional single) |
| `--\|<--` | One or many | Min 1 · Max N (mandatory multi) |
| `--o<--` | Zero or many | Min 0 · Max N (optional multi) |
- Three types of abstraction for data models, each serving a different audience and level of detail
- Conceptual: key entities, attributes and relationships (high-level picture)
- Logical: attributes, data types and cardinality of relationships
- Physical: maps entities to storage-specific elements (tables, columns, collections)
- Notations for visualizing models: Crow's Foot, Barker's, IDEF1X, Chen's, UML
Step-by-Step Data Modeling Process
Data modeling is an iterative, multi-step process that begins with identifying the entities in a domain and ends with an optimized physical data model. Each step builds upon the previous one.
1. Entities
- Identify core domain objects
2. Attributes
- Define properties & data types
3. Relationships
- Map how entities relate (1:1, 1:N, M:N)
4. Physical Model
- Choose DB tech & create tables
5. Optimize
- Normalize, index & tune for performance
Collections of distinct:
- Real-world objects
- Locations
- Concepts
- Events
- Uniquely identifiable
- Has properties / attributes
"An employee can schedule a meeting with one or several other employees. Every meeting takes place in a specific conference room. Only one meeting can occur in a conference room at any given time."
- Properties / characteristics of an entity
- At least one attribute per entity
- Stores a single, meaningful piece of information
"An employee can schedule a meeting with one or several other employees. Every meeting takes place in a specific conference room. Only one meeting can occur in a conference room at any given time."
Choose the appropriate data modeling technique based on the nature of your data and relationships. Each technique has distinct strengths and trade-offs.
- Joining tables is very easy: allows creating complex relationships
- Good query optimizers
- Support transactions and strict consistency
- Schema enforcement: "Schema on Write"
- Tables/rows don't map cleanly onto programming data structures (lists, arrays, hash tables)
- Requires ORM technologies, which create dependencies and performance overhead
- Strict schema creates performance overhead
- Schema updates are difficult, time-consuming, and introduce downtime
- Entity → Collection of documents of the same type
- Each record/instance is a document: a set of properties in a tree-like structure
- Flexible schema: "Schema on Read"
- Native support for programming language data structures (lists, arrays, maps)
- Great for data with no relationships or one-to-many relationships; no complex joins needed
- Better storage locality for each record
- Flexible schema: no complex migrations, no downtime
- Easily scalable
- Not good for many-to-many and many-to-one relationships
- Flexible schema (schema on read) is not always desirable
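The one-to-many case the document model handles well can be sketched as a single denormalized document: the order and its items live together, so reading the whole record needs no join (field names are illustrative).

```python
import json

# An Order entity as one self-contained document: the one-to-many
# "order contains items" relationship is embedded, not joined.
order_doc = {
    "orderId": 1001,
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [                                   # nested list, no junction table
        {"productId": 1, "quantity": 3},
        {"productId": 7, "quantity": 1},
    ],
}

# Better storage locality: the whole record serializes as one blob
blob = json.dumps(order_doc)

# Reading it back needs no joins; the tree maps onto native dicts/lists
restored = json.loads(blob)
total_items = sum(item["quantity"] for item in restored["items"])
print(total_items)  # → 4
```

The flip side is the listed weakness: if many orders embed the same product data, many-to-one facts get duplicated across documents.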
- Records are modeled as nodes in a graph, labeled with an Entity type
- Each node can have properties and relationships to other nodes
- Performing graph traversals through an unknown number of edges (joins) is easy
- Great for many-to-one and many-to-many relationships
- Support query languages (like Cypher) for easy graph querying
- Not great when we don't have high connectivity
- Not good when the number of joins is small and predefined
- More complex and less standard query language
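The "friends of friends" traversal that strains relational joins is natural in a graph model. A toy adjacency-list sketch (the data is illustrative):

```python
# Social graph as adjacency lists: nodes are users, edges are "follows"
follows = {
    "ada":   {"alan", "grace"},
    "alan":  {"grace"},
    "grace": {"linus"},
    "linus": set(),
}

def within_hops(graph: dict, start: str, hops: int) -> set:
    """All nodes reachable from `start` in at most `hops` edge traversals."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        # Each hop is one sweep over outgoing edges, roughly constant cost
        # per edge, instead of another self-join in a relational schema
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen - {start}

print(within_hops(follows, "ada", 2))  # friends plus friends-of-friends
```

Adding a third hop is one more loop iteration here; in SQL it would be yet another join, which is the O(n)-per-hop problem the table above refers to.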
| Technique | Purpose |
|---|---|
| Normalization | → Reduce Duplication and Storage Cost |
| Denormalization | → Improve Read Performance |
| Indexing | → Improve Search / Query Performance |
| Sharding / Partitioning | → Improve Scalability |
| Replication | → Improve Availability and Fault Tolerance |
- Step-by-step process to create a data model: identify, define, choose, create, optimize
- Step 1: Identify the Entities (the nouns in the domain)
- Step 2: Identify the Attributes / Properties
- Step 3: Identify the Relationships between entities
- Step 4: Choose data modeling technique & create physical model
- Step 5: Optimize (normalize, index, shard, replicate)
- Relational · Document · Graph, each with distinct trade-offs
Products and Sums in Data Modeling
Product types combine data with AND; sum types combine data with OR. Together they form Algebraic Data Types (ADTs), a powerful way to model data that prevents invalid states at the type level.
Product Types (AND)
- Combine multiple fields together: struct, record, tuple, class
Sum Types (OR)
- Exactly one of several variants: enum, sealed class, union
- Groups multiple values together; the resulting type contains all of its parts simultaneously
- Represents an AND relationship: field A AND field B AND field C
- Has a fixed number of required attributes / fields
- Total possible values = the product (multiplication) of each field's possibilities
- Other names: Record, Struct, Data class, Tuple
- Example: a User row is the product `(id × name × email × created_at)`
- Represents a choice between multiple values: variant A OR variant B OR variant C
- Creates an OR relationship: a value can be one of several variants, never two at once
- Total possible values = the sum (addition) of each variant's possibilities
- Other names: Discriminated union, Disjoint union, Tagged union, Sealed type, Mixed data
- Relational databases are built around product types: every row is a fixed set of columns (field A AND field B)
- There is no built-in OR construct to say "this row is either a card payment or a bank payment"
- Sum types must be encoded manually using mapping strategies like:
- Single-table with a discriminator column
- Table-per-type with foreign keys
- Table-per-concrete-type
- Each strategy has trade-offs in nullability, query complexity, and data integrity
- A `CREATE TABLE` with columns is already a product type: every row has ALL fields
- No special encoding needed: databases handle products out of the box
- Sum types require a design decision: you must choose a mapping strategy
- Variant constraints must be enforced at the application level or via `CHECK` constraints; the database won't do it for you
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Single Table | One table, discriminator column, nullable fields per variant | Simple, fast queries, no joins | Nullable columns, wasted space |
| Table per Type | Base table + one table per variant with FK | No nulls, clean schema | Requires joins, more complex |
| Table per Class | Each variant gets its own table with all fields (no base table) | No joins, no nulls | No shared queries, base field duplication |
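The single-table strategy can be sketched in DDL: one table, a discriminator column, nullable per-variant fields, and a `CHECK` constraint doing the variant enforcement the database won't do on its own (the payment schema is illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment (
        payment_id  INTEGER PRIMARY KEY,
        kind        TEXT NOT NULL CHECK (kind IN ('card', 'bank')),  -- discriminator
        card_number TEXT,   -- card variant field (NULL for bank payments)
        iban        TEXT,   -- bank variant field (NULL for card payments)
        -- enforce that exactly the chosen variant's fields are present
        CHECK (
            (kind = 'card' AND card_number IS NOT NULL AND iban IS NULL) OR
            (kind = 'bank' AND iban IS NOT NULL AND card_number IS NULL)
        )
    )""")

conn.execute("INSERT INTO payment VALUES (1, 'card', '4111-1111', NULL)")
conn.execute("INSERT INTO payment VALUES (2, 'bank', NULL, 'DE00000000')")

# A row claiming to be a card payment but carrying bank fields is rejected
try:
    conn.execute("INSERT INTO payment VALUES (3, 'card', NULL, 'DE00000000')")
except sqlite3.IntegrityError:
    print("invalid variant rejected")
```

This shows the table's stated cons directly: every variant field must be nullable, and unused columns waste space on each row.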
- In JPA, these strategies correspond to `@Inheritance(strategy = SINGLE_TABLE)`, `JOINED`, and `TABLE_PER_CLASS` respectively
- `Boolean × Boolean` = 2 × 2 = 4 states: `(true, true)`, `(true, false)`, `(false, true)`, `(false, false)`
- A struct with 3 booleans: 2³ = 8 states
- Every field multiplies the state space
- `Boolean | TriState` = 2 + 3 = 5 states
- It's one OR the other, never both
- `Circle | Rectangle` = Circle values + Rectangle values
- Each variant adds to the state space
An abstract representation of our data that:
- Represents our business logic independently of any database or language
- Is clear: every stakeholder can understand the domain model
- Is safe: the type system captures constraints before code is written
- Tells us which states are valid and which are invalid
- Products define what must be present; sums define what can vary
- Take advantage of capabilities of each technology
- Easily migrate to new technologies while isolating disruption
- Make technology-specific optimizations:
- Performance optimizations
- Security enhancements
- Storage, caching, indexing strategies, etc.
- The abstract model stays stable even as implementations change
- Product Type (AND): groups multiple values together; must contain values of all its components
- Product name origin: the number of possible states is the "product" (multiplication) of each field's possibilities
- Sum Type (OR): represents a choice between multiple values; one variant at a time, never two
- Sum name origin: the number of possible states is the "sum" (addition) of each variant's possibilities
- Databases don't natively support sum types; they must be encoded via single-table, table-per-type, or table-per-class
- Products + Sums separate the data model from its representations; the abstract model stays stable as implementations change
Data Modeling Fundamentals
- Data outlives code; a first-class concern
- Entities, relationships, constraints
- Conceptual → Logical → Physical
- Data drives quality attributes
Model Types & Notations
- Relational: tables, ACID, structured
- NoSQL: document, key-value, graph
- ER diagrams: Chen notation
- UML class diagrams for data
Step-by-Step Modeling
- Entities β Attributes β Relationships
- Relational · Document · Graph techniques
- Optimize: Normalize, Index, Shard, Replicate
- Each technique has distinct trade-offs
Algebraic Data Types
- Product = AND (structs, records)
- Sum = OR (sealed, unions)
- Make invalid states unrepresentable
- DB mapping: single table, joined, per-class