Data Pipeline
Bronze, silver, and gold layers for progressive data refinement
Data Pipeline Architecture
VirtuousAI uses a medallion architecture to progressively refine your data from raw extracts to analytics-ready tables. This pattern ensures data quality, enables auditing, and supports incremental updates.
The Three Layers
Bronze: Raw Data
The bronze layer stores exactly what came from your data sources — no transformations, no cleaning.
| Aspect | Bronze Layer |
|---|---|
| Purpose | Preserve original data |
| Format | Parquet files on S3 |
| Updates | Append-only (never modified) |
| Problems | Duplicates, messy types, nested structures |
Why keep raw data?
- Audit trail — You can always see what was originally extracted
- Reprocessing — Re-run transforms if logic changes
- Debugging — Investigate issues by examining source data
Silver: Cleaned Data
The silver layer contains deduplicated, typed, consistently named data. Each source gets its own staging tables.
| Aspect | Silver Layer |
|---|---|
| Purpose | Clean and standardize |
| Format | Delta Lake tables |
| Updates | Full refresh or incremental merge |
| Naming | stg_shopify_customers, stg_amazon_orders |
What silver provides:
- One row per business entity (deduplicated)
- Correct data types (decimals, timestamps)
- Consistent naming conventions
- Still source-scoped (not yet unified)
Gold: Business Models
The gold layer contains dimensional models ready for analytics. Data from multiple sources is unified here.
| Aspect | Gold Layer |
|---|---|
| Purpose | Enable analytics |
| Format | Delta Lake tables |
| Updates | Built from silver tables |
| Naming | dim_customer, fct_orders |
What gold provides:
- Unified entities across sources
- Pre-computed aggregations
- Star schema for fast queries
- Business terminology (not source system names)
Why This Architecture?
The medallion architecture isn't just organizational — it solves real problems that emerge when working with production data.
Problem 1: Duplicate Records
APIs often return the same record multiple times (pagination issues, incremental syncs). Bronze keeps all of them; silver deduplicates.
Problem 2: Inconsistent Types
Price might be a string in one API response and a number in another. Silver casts everything to consistent types.
Problem 3: Nested Structures
API responses often have deeply nested JSON. Silver flattens these into queryable columns.
Problem 4: Cross-Source Analysis
Customers appear in Shopify, Klaviyo, and your email system. Gold unifies them into a single dim_customer.
Transformations
Data moves between layers via duckdb_transform actions. These execute SQL transformations using DuckDB.
Transform Modes
Replaces the entire target table. Simple and safe.
Use when:
- Table is small (< 1M rows)
- Schema changes frequently
- You want guaranteed consistency
Updates existing rows and inserts new ones based on merge keys.
Use when:
- Table is large
- Most data doesn't change
- You need faster refresh times
Example Transform
A typical bronze-to-silver transform:
SELECT
id as customer_id,
email,
CAST(total_spent AS DECIMAL(18,2)) as total_spent,
created_at as created_at_utc
FROM bronze_customers
QUALIFY ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY updated_at DESC
) = 1This query:
- Renames fields for consistency
- Casts monetary values to decimal
- Deduplicates by keeping only the latest version of each customer
Storage and Format
| Layer | Format | Location |
|---|---|---|
| Bronze | Parquet | s3://vai-flux-bronze/tenant_id={id}/... |
| Silver | Delta Lake | s3://vai-flux-silver/tenant_id={id}/... |
| Gold | Delta Lake | s3://vai-flux-gold/tenant_id={id}/... |
Why Delta Lake for silver/gold?
- ACID transactions
- Schema evolution
- Time travel (query historical versions)
- Efficient MERGE operations
Next Steps
- Learn about Data Modeling for gold layer design
- See Actions for transform configuration
- Try Data Extraction to populate bronze