VirtuousAI

Data Modeling

Star schemas, OLAP patterns, and dimensional modeling basics

Data Modeling Concepts

When you query your data warehouse, you want fast answers to business questions. That requires data modeling — organizing data for analytics rather than transactions.

OLAP vs OLTP

Most applications use OLTP (Online Transaction Processing) databases, optimized for writes. Data warehouses use OLAP (Online Analytical Processing) patterns, optimized for reads.

AspectOLTP (App Database)OLAP (Data Warehouse)
Optimized forWrite performanceRead/scan performance
SchemaNormalized (3NF)Denormalized (Star Schema)
TablesMany small tablesFewer wide tables
QueriesPoint lookupsAggregations, scans
DataCurrent stateHistorical snapshots

If you've worked with PostgreSQL or MySQL in web apps, you've used OLTP patterns. Data warehousing is a different mindset — optimizing for analysts, not transactions.

Why This Matters

Business users ask questions like:

  • "What were our sales by product category last month?"
  • "Which customers haven't ordered in 90 days?"
  • "What's our refund rate by channel?"

These questions require:

  1. Clean data — No duplicates, consistent types
  2. Joined data — Customer + Order + Product in one query
  3. Historical data — Track changes over time

Normalized databases require 6+ JOINs to answer these questions. Star schemas need 1-2.

Star Schema Basics

A star schema organizes data around facts (events) and dimensions (context).

The fact table sits in the center with foreign keys to dimension tables — hence "star" schema.

Fact Tables (fct_*)

What they are: Business events or transactions — things that happened.

CharacteristicsExamples
Contain measures (amounts, counts)total_price, quantity
Typically large and growingMillions of rows
Have foreign keys to dimensionscustomer_id, date_key

Examples: Orders placed, emails sent, payments received, page views

Dimension Tables (dim_*)

What they are: Descriptive context — the who, what, where, when, how.

CharacteristicsExamples
Contain attributes for filteringcustomer_name, product_category
Relatively staticThousands of rows
Have a primary key referenced by factscustomer_id, product_id

Examples: Customers, products, dates, locations, employees

The Grain Concept

Grain is the most important modeling decision: "What does one row represent?"

TableGrainOne Row =
fct_ordersOrderOne customer order
fct_order_linesLine itemOne product in an order
fct_daily_salesProduct-dayOne product's sales for one day

Never mix grains! If fct_orders is one-row-per-order, don't add SKU-level fields. Create a separate fct_order_lines table instead.

Why We Deduplicate

Raw data often contains duplicates because:

  • API pagination fetched the same record twice
  • Incremental sync re-extracted updated records
  • Extraction ran multiple times

The fix: Keep only the most recent version of each record.

-- Deduplication pattern
SELECT *
FROM customers
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id           -- What makes this unique?
  ORDER BY updated_at DESC  -- Which version to keep?
) = 1

Normalization vs Denormalization

Normalized (OLTP)

  • Minimal data duplication
  • Fast writes, slow complex reads
  • Many tables, many joins

Denormalized (OLAP)

  • Some data duplication is okay
  • Slower writes, fast analytical reads
  • Fewer tables, fewer joins

In star schemas, we intentionally denormalize to make queries fast. Customer name might appear in both dim_customer and fct_orders — that's fine for analytics.

Applies to Any Business

Star schemas aren't e-commerce specific. The pattern works anywhere:

BusinessFacts (Events)Dimensions (Context)
E-commerceOrders, ReturnsCustomers, Products
RestaurantOrders, ReservationsGuests, Menu Items
SaaSSignups, SessionsUsers, Features
HealthcareVisits, ProceduresPatients, Providers

The questions change, but the structure stays the same.

Key Terminology

TermDefinitionExample
GrainWhat one row represents"One order line item"
Surrogate KeyArtificial unique IDcustomer_key = MD5(email)
Natural KeyBusiness ID from sourceshopify_customer_id = 12345
Conformed DimensionSame dimension across factsdim_date used everywhere

Best Practices

Keep It Simple

Start with basic fact and dimension tables. Add complexity only when needed.

Document the Grain

Always write down what one row represents. This prevents confusion later.

Use Surrogate Keys

Generate consistent keys (like MD5 hashes) rather than relying on source system IDs.

Build Incrementally

Start with one fact table and its dimensions. Add more as you prove value.

Next Steps

On this page