
Data Modeling

Designing the structure of data to reflect business reality and enable system evolution

Frameworks: TOGAF ADM · NIST CSF · ISO 27001 · AWS Well-Architected · Google SRE · AI-Native
💡
In Plain English

Data modeling is deciding how your data is organized before you build the system — like designing the filing system before you start filing. Get it right and the system is fast and easy to change. Get it wrong and every new feature requires painful restructuring.

📈
Business Value

Poor data models are among the leading causes of performance problems and some of the hardest technical debt to remediate. Google's Spanner, Amazon's DynamoDB, and the core banking systems running in the Philippines all rest on data models designed years before the applications that run on them. The model outlives the code.

📖 Detailed Explanation

Data modeling is the architectural discipline of representing business reality as structured data. A good data model captures the entities, relationships, constraints, and invariants of the business domain in a form that enables efficient querying, consistent mutation, and evolutionary change. A poor data model makes simple business rules hard to enforce, makes common queries expensive, and makes schema evolution a deployment-day nightmare.

Entity-Relationship Modeling (ER) is the foundational technique for relational databases. Entities represent business objects (Customer, Account, Transaction). Relationships connect them (Customer has many Accounts; Account has many Transactions). Attributes describe them (Account has balance, currency, status). Cardinality constraints (one-to-one, one-to-many, many-to-many) encode business rules. A well-drawn ER model is executable documentation — it captures business rules that would otherwise live only in engineers' heads.

Normalization vs. Denormalization is the central tension in relational modeling. Normalization (1NF through 3NF/BCNF) eliminates data redundancy by ensuring each fact is stored in exactly one place. This prevents update anomalies — changing a customer's address in one table doesn't risk inconsistency with another. Denormalization intentionally replicates data to enable faster queries. For an OLTP core banking system, normalize to 3NF. For a reporting data warehouse, denormalize into a star schema (fact tables + dimension tables). For an event-driven system, event-carried state is a form of denormalization by necessity.
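The OLTP-versus-warehouse split can be sketched with SQLite standing in for both sides. This is an illustrative toy schema, not the banking model used elsewhere in this section: the 3NF tables keep each fact in one place, while the star-style fact table deliberately replicates the customer's city so analytical queries need no joins.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized (3NF) OLTP side: each fact lives in exactly one place.
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, full_name TEXT, city TEXT);
CREATE TABLE account  (account_id  INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customer, currency TEXT);
CREATE TABLE txn      (txn_id INTEGER PRIMARY KEY, account_id INTEGER REFERENCES account, amount REAL, txn_date TEXT);
""")
cur.execute("INSERT INTO customer VALUES (1, 'Maria Santos', 'Manila')")
cur.execute("INSERT INTO account  VALUES (10, 1, 'PHP')")
cur.execute("INSERT INTO txn      VALUES (100, 10, 250.0, '2024-01-15')")

# Denormalized star-schema fact table for the reporting side: customer
# attributes are replicated into each fact row so aggregates need no joins.
cur.execute("""
CREATE TABLE fact_txn AS
SELECT t.txn_id, t.amount, t.txn_date, a.currency, c.full_name, c.city
FROM txn t JOIN account a ON a.account_id = t.account_id
           JOIN customer c ON c.customer_id = a.customer_id
""")
print(cur.execute("SELECT city, SUM(amount) FROM fact_txn GROUP BY city").fetchall())
# → [('Manila', 250.0)]
```

In a real deployment the fact table would live in a separate warehouse and be refreshed by a pipeline; the point here is only the shape of the two models.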

Domain-Driven Design (DDD) and Bounded Contexts reshape how data modeling is approached in microservices architectures. In a monolith, there is one canonical Customer entity. In a microservices architecture, different bounded contexts have legitimately different views of the same real-world concept. The Order Service's Customer is just { customerId, shippingAddress, loyaltyTier }. The CRM's Customer is { customerId, fullProfile, contactHistory, lifetimeValue }. Forcing them to share one data model creates coupling that defeats the purpose of service decomposition. Each bounded context owns its schema, its invariants, and its Customer representation.
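A minimal sketch of the two representations, using Python dataclasses. The field names follow the examples in the text; the classes themselves are hypothetical, and the only thing the two contexts share is the identifier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderCustomer:          # the Order Service's bounded-context view
    customer_id: str
    shipping_address: str
    loyalty_tier: str

@dataclass(frozen=True)
class CrmCustomer:            # the CRM's bounded-context view
    customer_id: str
    full_profile: dict
    contact_history: list
    lifetime_value: float

# The same real-world person, modeled twice, with no shared schema:
order_view = OrderCustomer("c-42", "123 Rizal Ave, Manila", "gold")
crm_view = CrmCustomer("c-42", {"name": "Maria Santos"}, [], 1850.0)
assert order_view.customer_id == crm_view.customer_id  # only the ID couples them
```

Each service would persist its own representation in its own store; synchronization happens through events or APIs, never through a shared table.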

Event Sourcing as Data Model replaces mutable state with an immutable append-only log of domain events. Instead of storing the current balance of an account, you store every Deposit, Withdrawal, and Transfer event. The current balance is derived by replaying the event log. This pattern is architecturally powerful: you have a complete audit trail by default, you can reconstruct the state of the system at any point in time, and you can project new read models from the event log without touching the write side. Greg Young pioneered this pattern for financial systems; it is now a common architectural choice for payment systems that must maintain complete audit trails under PCI DSS and BSP regulations.
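A toy replay, with hypothetical event types, shows both ideas at once: current state derived from the log, and a point-in-time ("what was the balance on date X?") projection obtained by truncating the replay.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str      # "Deposit" | "Withdrawal" (hypothetical event types)
    amount: float
    at: str        # ISO date, so string comparison orders correctly

# Append-only log: events are facts and are never updated or deleted.
log: list[Event] = [
    Event("Deposit", 1000.0, "2024-01-10"),
    Event("Withdrawal", 250.0, "2024-01-12"),
    Event("Deposit", 500.0, "2024-01-20"),
]

def balance(events, as_of=None):
    """Replay the immutable log; `as_of` gives a point-in-time projection."""
    total = 0.0
    for e in events:
        if as_of is not None and e.at > as_of:
            break
        total += e.amount if e.kind == "Deposit" else -e.amount
    return total

print(balance(log))                      # current state → 1250.0
print(balance(log, as_of="2024-01-15"))  # temporal query → 750.0
```

Production systems avoid replaying from the beginning on every read by caching snapshots, but the snapshot is always derivable from the log, never authoritative.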

NoSQL Data Modeling inverts the ER approach. Instead of modeling entities first and queries second, you model queries first and structure data to answer them efficiently. In DynamoDB, you design your table's partition key and sort key based on your most critical access patterns. If your top query is "get all transactions for account X in the last 30 days," your partition key is accountId and your sort key is timestamp. Attempting to run ad-hoc queries on a DynamoDB table designed for specific access patterns produces expensive and slow full table scans — the data model and the query patterns must be co-designed.
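The access-pattern-first idea can be illustrated with a toy in-memory stand-in for a partition-key/sort-key table. This is not the real DynamoDB API; it only models why the designed-for query is cheap: items are grouped by partition key and kept sorted by sort key, so the range query touches exactly one partition.

```python
from bisect import insort
from collections import defaultdict

# partition_key (accountId) -> list of (sort_key, item), kept sorted by sort key
table = defaultdict(list)

def put(account_id, timestamp, item):
    insort(table[account_id], (timestamp, item))

def query(account_id, start_ts, end_ts):
    # The designed-for access pattern: one partition, one sorted range scan.
    return [item for ts, item in table[account_id] if start_ts <= ts <= end_ts]

put("acct-1", "2024-01-05", {"amount": 100})
put("acct-1", "2024-01-20", {"amount": 200})
put("acct-2", "2024-01-21", {"amount": 999})

# Only acct-1's partition is touched; acct-2 is never scanned.
print(query("acct-1", "2024-01-01", "2024-01-31"))
# → [{'amount': 100}, {'amount': 200}]
```

Any query this layout was not designed for (say, "all transactions above ₱500 across all accounts") would have to visit every partition, which is exactly the full-scan cost the text warns about.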

Schema Evolution is the ongoing challenge: how do you change the data model after the system is in production? For relational databases: additive changes (add column, add table, add index) are generally safe, though adding a column with a default can still lock large tables on older database engines. Breaking changes (rename column, change data type, remove column) require multi-stage migrations: add the new column, backfill it, migrate readers, migrate writers, drop the old column. For event-sourced systems: events are immutable and versioned — schema changes require adding new event versions, not modifying existing ones, to preserve the integrity of historical event logs.
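The expand-backfill-contract sequence can be sketched in SQLite. The phone-number migration below is hypothetical, and in production each stage would be a separate deploy with the reader/writer cutover in between, not one script.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, phone TEXT)")
cur.execute("INSERT INTO customer VALUES (1, '0917-555-0101')")

# Stage 1 (expand): additive change -- adding a nullable column is safe.
cur.execute("ALTER TABLE customer ADD COLUMN phone_e164 TEXT")

# Stage 2 (backfill): derive the new column from the old one.
cur.execute(
    "UPDATE customer SET phone_e164 = '+63' || replace(substr(phone, 2), '-', '')"
)

# Stages 3-4 are application deploys: readers switch to phone_e164, then writers.
# Stage 5 (contract): only once nothing touches `phone` can it be dropped.
if sqlite3.sqlite_version_info >= (3, 35, 0):   # DROP COLUMN needs SQLite 3.35+
    cur.execute("ALTER TABLE customer DROP COLUMN phone")

print(cur.execute("SELECT phone_e164 FROM customer").fetchall())
# → [('+639175550101',)]
```

The same choreography applies to PostgreSQL or MySQL; only the lock behavior of each stage differs by engine and version.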

📈 Architecture Diagram

erDiagram
    CUSTOMER {
        uuid customer_id PK
        string national_id UK
        string full_name
        string email UK
        string phone
        timestamp created_at
        string kyc_status
    }
    ACCOUNT {
        uuid account_id PK
        uuid customer_id FK
        string account_number UK
        string account_type
        decimal balance
        string currency
        string status
        timestamp opened_at
    }
    TRANSACTION {
        uuid transaction_id PK
        uuid account_id FK
        string transaction_type
        decimal amount
        string currency
        string reference_number UK
        timestamp initiated_at
        timestamp settled_at
        string status
        string channel
    }
    TRANSACTION_EVENT {
        uuid event_id PK
        uuid transaction_id FK
        string event_type
        jsonb payload
        timestamp event_at
        string actor_id
    }

    CUSTOMER ||--o{ ACCOUNT : "owns"
    ACCOUNT ||--o{ TRANSACTION : "records"
    TRANSACTION ||--o{ TRANSACTION_EVENT : "produces"

Banking domain ER model: normalized to 3NF for transactional integrity, with a Transaction_Event table implementing event sourcing for audit trail and regulatory compliance.

🌎 Real-World Examples

Spotify — Data Mesh Implementation
Stockholm, Sweden · Music Streaming · 100M+ songs

Spotify was an early Data Mesh adopter. Each domain team (Playlists, Recommendations, Artist Analytics) owns their data products — they define schemas, SLAs, and access policies. Their 'Backstage' developer portal (open-sourced) serves as the data catalog where teams register and discover data products. Cross-domain data access goes through well-defined data product interfaces, never direct database queries.

✓ Result: Data product discovery time reduced from days to minutes; data quality incidents dropped 60% after domain ownership was established

Snowflake — Cloud Data Architecture
Bozeman, USA · Cloud Data Platform · 7,000+ customers

Snowflake's own internal data architecture is the reference implementation of their platform's capabilities: a single Data Cloud with data sharing across 7,000+ customers via Snowflake Secure Data Sharing. Their engineering team uses Snowflake to monitor Snowflake — internal metrics, usage data, and query performance all flow through the same platform they sell. Zero ETL data movement between departments.

✓ Result: Single data platform for 7,000+ enterprise customers; cross-organization data sharing with zero data movement latency

LinkedIn — Real-Time Data Platform
Sunnyvale, USA · Professional Network · 900M members

LinkedIn created Apache Kafka (open-sourced in 2011) to solve their data pipeline problem: 175+ applications producing data that 200+ applications need to consume. Their current platform processes 7 trillion+ events per day. While Nathan Marz coined the Lambda Architecture (batch + speed layers), LinkedIn's Jay Kreps proposed the Kappa Architecture (stream-only) as its simplification. Their engineering blog is one of the most influential data engineering resources.

✓ Result: 7 trillion events/day processed with < 5 second end-to-end latency; Kafka now used by 80% of Fortune 100 companies

Wise (TransferWise) — Financial Data Integrity
London, UK · International Payments · $12B monthly volume

Wise's data architecture for cross-border payments enforces immutability at every layer: every payment event is append-only (event sourcing), every balance change has an immutable audit trail, and data reconciliation runs continuously to detect discrepancies. Their 'double-entry bookkeeping' pattern applied at the database level ensures financial data integrity that satisfies FCA, FinCEN, and MAS regulatory requirements simultaneously.

✓ Result: Zero financial reconciliation failures in 8 years of operation; $12B+ monthly payment volume with 100% audit trail completeness

🌟 Core Principles

1. Model the Business Domain, Not the Technology

The data model should reflect how the business thinks about its data — not how the database stores it most efficiently. Start with the business entities and relationships; optimize for the technology second.

2. Each Fact in One Place (Normalization)

In OLTP systems, every fact should be stored exactly once. If customer address is stored in both the Customer table and the Order table, updating one without the other creates inconsistency. Normalize until you have a consistency problem; then denormalize selectively.

3. Design for Query Patterns in NoSQL

In NoSQL systems (DynamoDB, Cassandra), model data to answer your specific query patterns efficiently. General-purpose schemas on NoSQL systems produce expensive, slow queries. Know your top 5 access patterns before designing the schema.

4. Events are Immutable; State is Derived

In event-sourced systems, past events are facts that cannot be changed. Current state is a projection derived from replaying events. This distinction preserves the integrity of the historical record and is required for PCI DSS and BSP audit trail compliance.

5. Schema Evolution is a First-Class Concern

Plan for how the schema will change before deploying it. Additive changes are generally safe. Breaking changes require multi-stage migrations. Event schemas must be versioned. Design with evolution in mind from day one.

⚙️ Implementation Steps

1. Identify the Domain Entities and Boundaries

Work with business stakeholders to enumerate all entities, their key attributes, and the relationships between them. Use event storming or domain storytelling to discover entities through business processes rather than database tables.

2. Apply Normalization to OLTP Schemas

Walk through 1NF (atomic attributes), 2NF (no partial dependencies), and 3NF (no transitive dependencies) for all relational tables. Document any intentional denormalization with the query performance justification.

3. Design the Write Model (Transactions)

Identify the transactional boundaries: what operations must be atomic? Each transaction boundary maps to either a single table mutation, a multi-table transaction within one service, or a saga across services. Never span a transaction across service boundaries.
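A single-service transactional boundary, sketched with SQLite and a hypothetical transfer operation: the debit and credit share one database transaction, so both apply or neither does.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))")
con.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100.0), ("B", 0.0)])
con.commit()

def transfer(src, dst, amount):
    try:
        with con:  # one atomic transaction: commits on success, rolls back on error
            con.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            con.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
        return True
    except sqlite3.IntegrityError:  # CHECK constraint fired -> whole transfer rolled back
        return False

assert transfer("A", "B", 60.0) is True
assert transfer("A", "B", 60.0) is False   # would overdraw A; nothing is applied
print(con.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# → [('A', 40.0), ('B', 60.0)]
```

If A and B lived in different services, this single transaction would be impossible; the operation would instead be modeled as a saga with compensating actions.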

4. Design the Read Models (Query Optimization)

For reporting and analytics use cases, design denormalized read models (star schemas, materialized views, Elasticsearch indexes) separately from the normalized write model. This is the CQRS principle applied to data modeling.
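A toy projection (illustrative names, plain Python dicts instead of a real materialized view) shows the separation: the normalized write model stays authoritative, and a denormalized read model is rebuilt from it for the query side.

```python
# Normalized write model: each fact stored once.
write_model = {
    "customers": {1: {"name": "Maria Santos"}},
    "accounts": {10: {"customer_id": 1, "currency": "PHP"}},
    "txns": [
        {"account_id": 10, "amount": 250.0},
        {"account_id": 10, "amount": -40.0},
    ],
}

def project_account_summary(wm):
    """Denormalized read model: one pre-joined summary row per account."""
    summary = {}
    for acct_id, acct in wm["accounts"].items():
        total = sum(t["amount"] for t in wm["txns"] if t["account_id"] == acct_id)
        summary[acct_id] = {
            "owner": wm["customers"][acct["customer_id"]]["name"],
            "currency": acct["currency"],
            "balance": total,
        }
    return summary

print(project_account_summary(write_model))
# → {10: {'owner': 'Maria Santos', 'currency': 'PHP', 'balance': 210.0}}
```

In production the projection would live in its own store (a star schema, a materialized view, or a search index) and be refreshed incrementally from change events rather than recomputed wholesale.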

5. Define Data Retention and Archival Strategy

For each entity: what is the retention period? (BSP: 5 years minimum for transaction records.) How are records archived? How are deletion requests handled (GDPR/DPA right to erasure)? Design retention into the schema from the start.

✅ Governance Checkpoints

| Checkpoint | Owner | Gate Criteria | Status |
| --- | --- | --- | --- |
| ER Diagram Reviewed by Domain Experts | Data Architect | Business stakeholders have validated entity definitions and relationships | Required |
| Normalization Documented | Data Architect | Normalization level stated with justifications for any denormalization | Required |
| Migration Plan Defined | Data Engineer | Schema migration approach documented for all planned changes | Required |
| Retention Policy Aligned to BSP/DPA | Compliance Architect | Retention periods defined per entity aligned to regulatory requirements | Required |

◈ Recommended Patterns

✦ Star Schema (Data Warehouse)

A central Fact table (transactions, events) surrounded by Dimension tables (customer, product, time, location). Highly denormalized for fast analytical queries. The Kimball methodology for dimensional modeling.

✦ Event Sourcing Store

An append-only event log is the primary data store. Current state is derived by replaying events. Enables complete audit trails, temporal queries ('what was the balance on 2024-01-15?'), and multiple independent read projections from the same event stream.

✦ Polyglot Persistence

Use the right database technology for each access pattern: PostgreSQL for transactional data requiring ACID guarantees; DynamoDB for high-throughput key-value access; Elasticsearch for full-text search; Redis for caching; TimescaleDB for time-series metrics.

⛔ Anti-Patterns to Avoid

⛔ God Table

A single database table with 200+ columns, used by every service in the system to store anything vaguely related to the central entity. Modifications require coordinating every team. Queries return enormous row payloads. The hardest technical debt to escape.

⛔ EAV (Entity-Attribute-Value) Overuse

Storing all attributes as rows in a generic (entity_id, attribute_name, attribute_value) table to achieve 'schema flexibility.' Makes SQL queries unmaintainable, destroys query performance, and loses type safety. Use JSONB columns or document databases for genuinely dynamic attributes.
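A sketch of the recommended alternative, using a plain JSON text column in SQLite as a stand-in for JSONB (the product schema and attributes are hypothetical): stable attributes stay as typed columns, and only the genuinely dynamic ones go into one document column.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (product_id TEXT PRIMARY KEY, name TEXT, attrs TEXT)")
con.execute(
    "INSERT INTO product VALUES (?, ?, ?)",
    ("p-1", "Router", json.dumps({"ports": 8, "poe": True})),  # dynamic attrs as one document
)

# Typed columns stay queryable with plain SQL; the dynamic attributes come
# back as one document and decode in a single step -- no N-way self-joins
# across an (entity, attribute, value) table as EAV would require.
name, attrs_raw = con.execute(
    "SELECT name, attrs FROM product WHERE product_id = 'p-1'"
).fetchone()
attrs = json.loads(attrs_raw)
print(name, attrs["ports"])   # → Router 8
```

PostgreSQL's JSONB goes further by letting you index and filter on paths inside the document, which covers most cases where EAV is reached for.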

🤖 AI Augmentation Extensions

🤖 AI-Assisted ER Generation

LLM agents translate business domain descriptions into draft ER diagrams with entity definitions, relationships, and cardinality constraints. Dramatically accelerates the initial modeling phase for new domains.

⚡ AI-generated ER models require validation by a data architect. LLMs may miss domain-specific constraints, regulatory requirements, and cardinality edge cases.
🤖 Schema Drift Detection

Automated monitoring compares deployed database schemas against the version-controlled schema definitions. Deviations (manually applied changes, ORM-driven schema drift) trigger alerts and generate remediation migrations.

⚡ Run schema drift detection in every environment — staging and production. Manual schema changes in production that bypass version control are a critical audit finding.
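A minimal drift check can be sketched against SQLite's `PRAGMA table_info`. Both the declared schema and the "manual hotfix" below are hypothetical; real tooling would also compare types, constraints, and indexes.

```python
import sqlite3

# Version-controlled declaration (what git says the schema should be).
DECLARED = {"customer": {"customer_id", "full_name", "email"}}

# Deployed database -- someone hand-added `nickname` outside version control.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id TEXT, full_name TEXT, email TEXT, nickname TEXT)")

def detect_drift(con, declared):
    """Report columns present-but-undeclared and declared-but-missing, per table."""
    drift = {}
    for table, want in declared.items():
        have = {row[1] for row in con.execute(f"PRAGMA table_info({table})")}  # row[1] = column name
        if have != want:
            drift[table] = {"unexpected": have - want, "missing": want - have}
    return drift

print(detect_drift(con, DECLARED))
# → {'customer': {'unexpected': {'nickname'}, 'missing': set()}}
```

The same comparison works against PostgreSQL's `information_schema.columns`; the alerting and auto-generated remediation migration are where real tools add value.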
