VirtuousAI

Data Pipeline

Bronze, silver, and gold layers for progressive data refinement

Data Pipeline Architecture

VirtuousAI uses a medallion architecture to progressively refine your data from raw extracts to analytics-ready tables. This pattern ensures data quality, enables auditing, and supports incremental updates.

The Three Layers

Bronze: Raw Data

The bronze layer stores exactly what came from your data sources — no transformations, no cleaning.

AspectBronze Layer
PurposePreserve original data
FormatParquet files on S3
UpdatesAppend-only (never modified)
ProblemsDuplicates, messy types, nested structures

Why keep raw data?

  • Audit trail — You can always see what was originally extracted
  • Reprocessing — Re-run transforms if logic changes
  • Debugging — Investigate issues by examining source data

Silver: Cleaned Data

The silver layer contains deduplicated, typed, consistently named data. Each source gets its own staging tables.

AspectSilver Layer
PurposeClean and standardize
FormatDelta Lake tables
UpdatesFull refresh or incremental merge
Namingstg_shopify_customers, stg_amazon_orders

What silver provides:

  • One row per business entity (deduplicated)
  • Correct data types (decimals, timestamps)
  • Consistent naming conventions
  • Still source-scoped (not yet unified)

Gold: Business Models

The gold layer contains dimensional models ready for analytics. Data from multiple sources is unified here.

AspectGold Layer
PurposeEnable analytics
FormatDelta Lake tables
UpdatesBuilt from silver tables
Namingdim_customer, fct_orders

What gold provides:

  • Unified entities across sources
  • Pre-computed aggregations
  • Star schema for fast queries
  • Business terminology (not source system names)

Why This Architecture?

The medallion architecture isn't just organizational — it solves real problems that emerge when working with production data.

Problem 1: Duplicate Records

APIs often return the same record multiple times (pagination issues, incremental syncs). Bronze keeps all of them; silver deduplicates.

Problem 2: Inconsistent Types

Price might be a string in one API response and a number in another. Silver casts everything to consistent types.

Problem 3: Nested Structures

API responses often have deeply nested JSON. Silver flattens these into queryable columns.

Problem 4: Cross-Source Analysis

Customers appear in Shopify, Klaviyo, and your email system. Gold unifies them into a single dim_customer.

Transformations

Data moves between layers via duckdb_transform actions. These execute SQL transformations using DuckDB.

Transform Modes

Replaces the entire target table. Simple and safe.

Use when:

  • Table is small (< 1M rows)
  • Schema changes frequently
  • You want guaranteed consistency

Updates existing rows and inserts new ones based on merge keys.

Use when:

  • Table is large
  • Most data doesn't change
  • You need faster refresh times

Example Transform

A typical bronze-to-silver transform:

SELECT
  id as customer_id,
  email,
  CAST(total_spent AS DECIMAL(18,2)) as total_spent,
  created_at as created_at_utc
FROM bronze_customers
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id
  ORDER BY updated_at DESC
) = 1

This query:

  1. Renames fields for consistency
  2. Casts monetary values to decimal
  3. Deduplicates by keeping only the latest version of each customer

Storage and Format

LayerFormatLocation
BronzeParquets3://vai-flux-bronze/tenant_id={id}/...
SilverDelta Lakes3://vai-flux-silver/tenant_id={id}/...
GoldDelta Lakes3://vai-flux-gold/tenant_id={id}/...

Why Delta Lake for silver/gold?

  • ACID transactions
  • Schema evolution
  • Time travel (query historical versions)
  • Efficient MERGE operations

Next Steps

On this page