Data Pipeline Architecture

VirtuousAI uses a medallion architecture to progressively refine your data from raw extracts to analytics-ready tables. This pattern ensures data quality, enables auditing, and supports incremental updates.

The Three Layers

Bronze: Raw Data

The bronze layer stores exactly what came from your data sources — no transformations, no cleaning.

Aspect	Bronze Layer
Purpose	Preserve original data
Format	Parquet files on S3
Updates	Append-only (never modified)
Problems	Duplicates, messy types, nested structures

Why keep raw data?

Audit trail — You can always see what was originally extracted
Reprocessing — Re-run transforms if logic changes
Debugging — Investigate issues by examining source data

Silver: Cleaned Data

The silver layer contains deduplicated, typed, consistently named data. Each source gets its own staging tables.

Aspect	Silver Layer
Purpose	Clean and standardize
Format	Delta Lake tables
Updates	Full refresh or incremental merge
Naming	`stg_shopify_customers`, `stg_amazon_orders`

What silver provides:

One row per business entity (deduplicated)
Correct data types (decimals, timestamps)
Consistent naming conventions
Still source-scoped (not yet unified)

Gold: Business Models

The gold layer contains dimensional models ready for analytics. Data from multiple sources is unified here.

Aspect	Gold Layer
Purpose	Enable analytics
Format	Delta Lake tables
Updates	Built from silver tables
Naming	`dim_customer`, `fct_orders`

What gold provides:

Unified entities across sources
Pre-computed aggregations
Star schema for fast queries
Business terminology (not source system names)

Table is small (< 1M rows)
Schema changes frequently
You want guaranteed consistency

Updates existing rows and inserts new ones based on merge keys.

Use when:

Table is large
Most data doesn't change
You need faster refresh times

Example Transform

A typical bronze-to-silver transform:

SELECT
  id as customer_id,
  email,
  CAST(total_spent AS DECIMAL(18,2)) as total_spent,
  created_at as created_at_utc
FROM bronze_customers
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id
  ORDER BY updated_at DESC
) = 1

This query:

Renames fields for consistency
Casts monetary values to decimal
Deduplicates by keeping only the latest version of each customer

Storage and Format

Layer	Format	Location
Bronze	Parquet	`s3://vai-flux-bronze/tenant_id={id}/...`
Silver	Delta Lake	`s3://vai-flux-silver/tenant_id={id}/...`
Gold	Delta Lake	`s3://vai-flux-gold/tenant_id={id}/...`

Why Delta Lake for silver/gold?

ACID transactions
Schema evolution
Time travel (query historical versions)
Efficient MERGE operations

Next Steps

Learn about Data Modeling for gold layer design
See Actions for transform configuration
Try Data Extraction to populate bronze

Data Pipeline

Data Pipeline Architecture

The Three Layers

Bronze: Raw Data

Silver: Cleaned Data

Gold: Business Models

Why This Architecture?

Problem 1: Duplicate Records

Problem 2: Inconsistent Types

Problem 3: Nested Structures

Problem 4: Cross-Source Analysis

Transformations

Transform Modes

Example Transform

Storage and Format

Next Steps

On this page