Data Modeling
Star schemas, OLAP patterns, and dimensional modeling basics
Data Modeling Concepts
When you query your data warehouse, you want fast answers to business questions. That requires data modeling — organizing data for analytics rather than transactions.
OLAP vs OLTP
Most applications use OLTP (Online Transaction Processing) databases, optimized for writes. Data warehouses use OLAP (Online Analytical Processing) patterns, optimized for reads.
| Aspect | OLTP (App Database) | OLAP (Data Warehouse) |
|---|---|---|
| Optimized for | Write performance | Read/scan performance |
| Schema | Normalized (3NF) | Denormalized (Star Schema) |
| Tables | Many small tables | Fewer wide tables |
| Queries | Point lookups | Aggregations, scans |
| Data | Current state | Historical snapshots |
If you've worked with PostgreSQL or MySQL in web apps, you've used OLTP patterns. Data warehousing is a different mindset — optimizing for analysts, not transactions.
Why This Matters
Business users ask questions like:
- "What were our sales by product category last month?"
- "Which customers haven't ordered in 90 days?"
- "What's our refund rate by channel?"
These questions require:
- Clean data — No duplicates, consistent types
- Joined data — Customer + Order + Product in one query
- Historical data — Track changes over time
Normalized databases require 6+ JOINs to answer these questions. Star schemas need 1-2.
Star Schema Basics
A star schema organizes data around facts (events) and dimensions (context).
The fact table sits in the center with foreign keys to dimension tables — hence "star" schema.
Fact Tables (fct_*)
What they are: Business events or transactions — things that happened.
| Characteristics | Examples |
|---|---|
| Contain measures (amounts, counts) | total_price, quantity |
| Typically large and growing | Millions of rows |
| Have foreign keys to dimensions | customer_id, date_key |
Examples: Orders placed, emails sent, payments received, page views
Dimension Tables (dim_*)
What they are: Descriptive context — the who, what, where, when, how.
| Characteristics | Examples |
|---|---|
| Contain attributes for filtering | customer_name, product_category |
| Relatively static | Thousands of rows |
| Have a primary key referenced by facts | customer_id, product_id |
Examples: Customers, products, dates, locations, employees
The Grain Concept
Grain is the most important modeling decision: "What does one row represent?"
| Table | Grain | One Row = |
|---|---|---|
fct_orders | Order | One customer order |
fct_order_lines | Line item | One product in an order |
fct_daily_sales | Product-day | One product's sales for one day |
Never mix grains! If fct_orders is one-row-per-order, don't add SKU-level fields. Create a separate fct_order_lines table instead.
Why We Deduplicate
Raw data often contains duplicates because:
- API pagination fetched the same record twice
- Incremental sync re-extracted updated records
- Extraction ran multiple times
The fix: Keep only the most recent version of each record.
-- Deduplication pattern
SELECT *
FROM customers
QUALIFY ROW_NUMBER() OVER (
PARTITION BY id -- What makes this unique?
ORDER BY updated_at DESC -- Which version to keep?
) = 1Normalization vs Denormalization
Normalized (OLTP)
- Minimal data duplication
- Fast writes, slow complex reads
- Many tables, many joins
Denormalized (OLAP)
- Some data duplication is okay
- Slower writes, fast analytical reads
- Fewer tables, fewer joins
In star schemas, we intentionally denormalize to make queries fast. Customer name might appear in both dim_customer and fct_orders — that's fine for analytics.
Applies to Any Business
Star schemas aren't e-commerce specific. The pattern works anywhere:
| Business | Facts (Events) | Dimensions (Context) |
|---|---|---|
| E-commerce | Orders, Returns | Customers, Products |
| Restaurant | Orders, Reservations | Guests, Menu Items |
| SaaS | Signups, Sessions | Users, Features |
| Healthcare | Visits, Procedures | Patients, Providers |
The questions change, but the structure stays the same.
Key Terminology
| Term | Definition | Example |
|---|---|---|
| Grain | What one row represents | "One order line item" |
| Surrogate Key | Artificial unique ID | customer_key = MD5(email) |
| Natural Key | Business ID from source | shopify_customer_id = 12345 |
| Conformed Dimension | Same dimension across facts | dim_date used everywhere |
Best Practices
Keep It Simple
Start with basic fact and dimension tables. Add complexity only when needed.
Document the Grain
Always write down what one row represents. This prevents confusion later.
Use Surrogate Keys
Generate consistent keys (like MD5 hashes) rather than relying on source system IDs.
Build Incrementally
Start with one fact table and its dimensions. Add more as you prove value.
Next Steps
- See Data Pipeline for how data flows through layers
- Learn about Actions for creating transforms
- Review Data Extraction to populate your bronze layer