Data Modeling Concepts

When you query your data warehouse, you want fast answers to business questions. That requires data modeling — organizing data for analytics rather than transactions.

OLAP vs OLTP

Most applications use OLTP (Online Transaction Processing) databases, optimized for writes. Data warehouses use OLAP (Online Analytical Processing) patterns, optimized for reads.

Aspect	OLTP (App Database)	OLAP (Data Warehouse)
Optimized for	Write performance	Read/scan performance
Schema	Normalized (3NF)	Denormalized (Star Schema)
Tables	Many small tables	Fewer wide tables
Queries	Point lookups	Aggregations, scans
Data	Current state	Historical snapshots

If you've worked with PostgreSQL or MySQL in web apps, you've used OLTP patterns. Data warehousing is a different mindset — optimizing for analysts, not transactions.

Why This Matters

Business users ask questions like:

"What were our sales by product category last month?"
"Which customers haven't ordered in 90 days?"
"What's our refund rate by channel?"

These questions require:

Clean data — No duplicates, consistent types
Joined data — Customer + Order + Product in one query
Historical data — Track changes over time

Normalized databases require 6+ JOINs to answer these questions. Star schemas need 1-2.

Star Schema Basics

A star schema organizes data around facts (events) and dimensions (context).

The fact table sits in the center with foreign keys to dimension tables — hence "star" schema.

Fact Tables (fct_*)

What they are: Business events or transactions — things that happened.

Characteristics	Examples
Contain measures (amounts, counts)	`total_price`, `quantity`
Typically large and growing	Millions of rows
Have foreign keys to dimensions	`customer_id`, `date_key`

Examples: Orders placed, emails sent, payments received, page views

Dimension Tables (dim_*)

What they are: Descriptive context — the who, what, where, when, how.

Characteristics	Examples
Contain attributes for filtering	`customer_name`, `product_category`
Relatively static	Thousands of rows
Have a primary key referenced by facts	`customer_id`, `product_id`

Examples: Customers, products, dates, locations, employees

The Grain Concept

Grain is the most important modeling decision: "What does one row represent?"

Table	Grain	One Row =
`fct_orders`	Order	One customer order
`fct_order_lines`	Line item	One product in an order
`fct_daily_sales`	Product-day	One product's sales for one day

Never mix grains! If fct_orders is one-row-per-order, don't add SKU-level fields. Create a separate fct_order_lines table instead.

Why We Deduplicate

Raw data often contains duplicates because:

API pagination fetched the same record twice
Incremental sync re-extracted updated records
Extraction ran multiple times

The fix: Keep only the most recent version of each record.

-- Deduplication pattern
SELECT *
FROM customers
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id           -- What makes this unique?
  ORDER BY updated_at DESC  -- Which version to keep?
) = 1

Normalization vs Denormalization

Normalized (OLTP)

Minimal data duplication
Fast writes, slow complex reads
Many tables, many joins

Denormalized (OLAP)

Some data duplication is okay
Slower writes, fast analytical reads
Fewer tables, fewer joins

In star schemas, we intentionally denormalize to make queries fast. Customer name might appear in both dim_customer and fct_orders — that's fine for analytics.

Applies to Any Business

Star schemas aren't e-commerce specific. The pattern works anywhere:

Business	Facts (Events)	Dimensions (Context)
E-commerce	Orders, Returns	Customers, Products
Restaurant	Orders, Reservations	Guests, Menu Items
SaaS	Signups, Sessions	Users, Features
Healthcare	Visits, Procedures	Patients, Providers

The questions change, but the structure stays the same.

Key Terminology

Term	Definition	Example
Grain	What one row represents	"One order line item"
Surrogate Key	Artificial unique ID	`customer_key = MD5(email)`
Natural Key	Business ID from source	`shopify_customer_id = 12345`
Conformed Dimension	Same dimension across facts	`dim_date` used everywhere

See Data Pipeline for how data flows through layers
Learn about Actions for creating transforms
Review Data Extraction to populate your bronze layer

Data Modeling

Data Modeling Concepts

OLAP vs OLTP

Why This Matters

Star Schema Basics

Fact Tables (fct_*)

Dimension Tables (dim_*)

The Grain Concept

Why We Deduplicate

Normalization vs Denormalization

Normalized (OLTP)

Denormalized (OLAP)

Applies to Any Business

Key Terminology

Best Practices

Keep It Simple

Document the Grain

Use Surrogate Keys

Build Incrementally

Next Steps

On this page