IDR Performance Benchmark Results

Status: 10M rows tested on all four platforms; 50M and 100M runs pending
Dataset: Retail customer data (deterministic seed: 42)
Last Updated: 2026-01-04


Test Environment Summary

Platform   | Configuration  | Instance Type     | Notes
-----------|----------------|-------------------|----------------------------
DuckDB     | Local / Docker | MacBook Pro M1/M2 | Single-node, 16GB RAM
Snowflake  | TBD            | Warehouse size    |
BigQuery   | On-demand      | Serverless        | Auto-scaling, pay-per-query
Databricks | TBD            | Cluster config    |

10 Million Rows (DuckDB Baseline)

Timing Results

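Platform   | Data Load | Entity Extract | Edge Build | Label Prop | Output Gen | Total
-----------|-----------|----------------|------------|------------|------------|------
DuckDB     | 1s        | 7s             | 33s        | 81s        | 12s        | 143s
Snowflake  | 5s        | 17s            | 58s        | 53s        | 26s        | 168s
BigQuery   | 5s        | 10s            | 50s        | 101s       | 91s        | 295s
Databricks | 14s       | 36s            | 77s        | 115s       | ~75s       | 317s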
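
The Total column is not always the sum of the five listed phases. The sketch below, using only the numbers from the table above, reports the difference per platform and names the largest phase. Why a gap exists is not stated in this document; treating it as time spent in steps the table does not itemize is an assumption.

```python
# Timings in seconds, copied from the 10M timing table.
PHASES = ("data_load", "entity_extract", "edge_build", "label_prop", "output_gen")
TIMINGS = {
    "DuckDB":     ((1, 7, 33, 81, 12), 143),
    "Snowflake":  ((5, 17, 58, 53, 26), 168),
    "BigQuery":   ((5, 10, 50, 101, 91), 295),
    "Databricks": ((14, 36, 77, 115, 75), 317),  # Output Gen is approximate (~75s)
}


def summarize(timings):
    """Return {platform: (phase_sum, unaccounted_seconds, dominant_phase)}."""
    summary = {}
    for platform, (phases, total) in timings.items():
        phase_sum = sum(phases)
        dominant = PHASES[phases.index(max(phases))]
        summary[platform] = (phase_sum, total - phase_sum, dominant)
    return summary


if __name__ == "__main__":
    for platform, (phase_sum, gap, dominant) in summarize(TIMINGS).items():
        print(f"{platform:<11} phases={phase_sum}s unaccounted={gap}s dominant={dominant}")
```

Label Prop is the largest phase everywhere except Snowflake, where Edge Build (58s) is slightly larger than Label Prop (53s).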

Metrics Results

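Platform   | Entities   | Edges      | Clusters  | Largest | Singletons | LP Iters
-----------|------------|------------|-----------|---------|------------|---------
DuckDB     | 10,000,000 | 16,124,751 | 1,839,324 | TBD     | TBD        | 6
Snowflake  | 10,000,000 | 16,124,751 | 1,839,324 | TBD     | TBD        | 6
BigQuery   | 10,000,000 | 16,124,751 | 1,839,324 | TBD     | TBD        | 6
Databricks | 10,000,000 | 16,124,751 | 1,839,324 | TBD     | TBD        | 6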
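
To make the metric columns concrete, here is a minimal sketch of how they could be derived on a toy graph. This document does not describe the algorithm behind the "Label Prop" phase, so the min-label formulation below is an assumption, and the function names `label_propagation` and `cluster_metrics` are hypothetical rather than taken from the IDR code.

```python
from collections import Counter


def label_propagation(entities, edges):
    """Min-label propagation over an undirected edge list.

    Every entity starts with its own id as label. In each pass, both ends of
    an edge adopt the smaller of the two labels from the previous pass.
    The loop stops after the first pass that changes nothing, and that final
    confirming pass is included in the returned iteration count.
    """
    labels = {e: e for e in entities}
    iterations = 0
    changed = True
    while changed:
        changed = False
        iterations += 1
        new_labels = dict(labels)
        for a, b in edges:
            low = min(labels[a], labels[b])
            for node in (a, b):
                if low < new_labels[node]:
                    new_labels[node] = low
                    changed = True
        labels = new_labels
    return labels, iterations


def cluster_metrics(labels):
    """Derive the Clusters / Largest / Singletons columns from final labels."""
    sizes = Counter(labels.values())
    return {
        "entities": len(labels),
        "clusters": len(sizes),
        "largest": max(sizes.values()),
        "singletons": sum(1 for size in sizes.values() if size == 1),
    }


if __name__ == "__main__":
    toy_entities = [1, 2, 3, 4, 5, 6]
    toy_edges = [(1, 2), (2, 3), (4, 5)]  # entity 6 has no edges
    labels, iters = label_propagation(toy_entities, toy_edges)
    print(cluster_metrics(labels), "iterations:", iters)
```

The number of passes grows with the longest chain of edges in any cluster, which is one reason an iterative phase like this can dominate total runtime.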

Consistency Check

  • [ ] All platforms produced same cluster count
  • [ ] Largest cluster size matches across platforms
  • [ ] Singleton count matches across platforms
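
The checklist above could be automated along these lines. This is a sketch only: the `METRICS` dictionary is typed in from the 10M metrics table, `None` stands in for TBD, and `check_consistency` is a hypothetical helper, not part of the benchmark tooling.

```python
# One row per platform, copied from the 10M metrics table; None means TBD.
ROW_10M = {
    "entities": 10_000_000,
    "edges": 16_124_751,
    "clusters": 1_839_324,
    "largest": None,
    "singletons": None,
    "lp_iters": 6,
}
METRICS = {p: dict(ROW_10M) for p in ("DuckDB", "Snowflake", "BigQuery", "Databricks")}


def check_consistency(metrics, keys):
    """Return {key: True | False | None} across all platforms.

    True  -> every platform reports the same value
    False -> at least two platforms disagree
    None  -> at least one platform has no value yet, so no verdict
    """
    result = {}
    for key in keys:
        values = [row[key] for row in metrics.values()]
        if any(value is None for value in values):
            result[key] = None
        else:
            result[key] = len(set(values)) == 1
    return result


if __name__ == "__main__":
    print(check_consistency(METRICS, ("clusters", "largest", "singletons")))
```

With the values currently published, only the cluster-count check can be decided; the other two remain open until Largest and Singletons are filled in.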

50 Million Rows (Planned)

Timing Results

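Platform   | Data Load | Entity Extract | Edge Build | Label Prop | Output Gen | Total
-----------|-----------|----------------|------------|------------|------------|------
DuckDB     | TBD       | TBD            | TBD        | TBD        | TBD        | TBD
Snowflake  | TBD       | TBD            | TBD        | TBD        | TBD        | TBD
BigQuery   | TBD       | TBD            | TBD        | TBD        | TBD        | TBD
Databricks | TBD       | TBD            | TBD        | TBD        | TBD        | TBD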

Metrics Results

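Platform   | Entities | Edges | Clusters | Largest | Singletons | LP Iters
-----------|----------|-------|----------|---------|------------|---------
DuckDB     | TBD      | TBD   | TBD      | TBD     | TBD        | TBD
Snowflake  | TBD      | TBD   | TBD      | TBD     | TBD        | TBD
BigQuery   | TBD      | TBD   | TBD      | TBD     | TBD        | TBD
Databricks | TBD      | TBD   | TBD      | TBD     | TBD        | TBD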

Consistency Check

  • [ ] All platforms produced same cluster count
  • [ ] Largest cluster size matches across platforms
  • [ ] Singleton count matches across platforms

100 Million Rows (Planned)

Timing Results

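Platform   | Data Load | Entity Extract | Edge Build | Label Prop | Output Gen | Total
-----------|-----------|----------------|------------|------------|------------|------
DuckDB     | TBD       | TBD            | TBD        | TBD        | TBD        | TBD
Snowflake  | TBD       | TBD            | TBD        | TBD        | TBD        | TBD
BigQuery   | TBD       | TBD            | TBD        | TBD        | TBD        | TBD
Databricks | TBD       | TBD            | TBD        | TBD        | TBD        | TBD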

Metrics Results

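Platform   | Entities | Edges | Clusters | Largest | Singletons | LP Iters
-----------|----------|-------|----------|---------|------------|---------
DuckDB     | TBD      | TBD   | TBD      | TBD     | TBD        | TBD
Snowflake  | TBD      | TBD   | TBD      | TBD     | TBD        | TBD
BigQuery   | TBD      | TBD   | TBD      | TBD     | TBD        | TBD
Databricks | TBD      | TBD   | TBD      | TBD     | TBD        | TBD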

Consistency Check

  • [ ] All platforms produced same cluster count
  • [ ] Largest cluster size matches across platforms
  • [ ] Singleton count matches across platforms

Performance Charts

Total Duration by Platform & Scale

Platform     | 10M         | 50M      | 100M
-------------|-------------|----------|----------
DuckDB       | ███ 143s    |          |
Snowflake    | ████ 168s   |          |
BigQuery     | ██████ 295s |          |
Databricks   | ██████ 317s |          |
(To be replaced with actual Chart.js visualization)

Cost Analysis (Cloud Only)

Platform   | 10M Cost | 50M Cost | 100M Cost | Notes
-----------|----------|----------|-----------|----------------------------
DuckDB     | Free     | Free     | Free      | Local/Docker
Snowflake  | ~$0.25   | TBD      | TBD       | XS Warehouse: 0.1 credits
BigQuery   | ~$0.50   | TBD      | TBD       | On-demand: $6.25/TB scanned
Databricks | TBD      | TBD      | TBD       | Serverless SQL Warehouse
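
The cost estimates can be cross-checked against the rates in the Notes column. The sketch below back-calculates what each estimate implies; the resulting scan volume and per-credit price are derived values, not measurements reported by the benchmark, and the helper names are hypothetical.

```python
BIGQUERY_USD_PER_TB = 6.25   # on-demand rate from the Notes column
SNOWFLAKE_CREDITS_10M = 0.1  # credits consumed, from the Notes column


def implied_tb_scanned(cost_usd, usd_per_tb=BIGQUERY_USD_PER_TB):
    """TB scanned that would produce cost_usd at the on-demand rate."""
    return cost_usd / usd_per_tb


def implied_usd_per_credit(cost_usd, credits=SNOWFLAKE_CREDITS_10M):
    """Per-credit price that would produce cost_usd for the given credits."""
    return cost_usd / credits


if __name__ == "__main__":
    tb = implied_tb_scanned(0.50)
    print(f"BigQuery ~$0.50 implies about {tb * 1000:.0f} GB scanned")
    print(f"Snowflake ~$0.25 implies about ${implied_usd_per_credit(0.25):.2f} per credit")
```

If the actual bytes scanned or the contracted credit price differ from these implied values, the 10M cost estimates should be revised accordingly.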

Observations & Insights

DuckDB

  • 10M rows in 143 seconds (~2.4 min) - excellent for local/dev workloads
  • Label Propagation dominates: 81s (57% of total) - also the largest phase on BigQuery and Databricks, though not on Snowflake
  • Edge Building: 33s (23%), fast single-node execution
  • Output Gen: 12s (8%), efficient local writes
  • Throughput: ~70,000 entities/second
  • Created 16.1M edges from 10M entities
  • Resolved into 1.84M clusters (~5.4 entities/cluster average)
  • RAM usage: ~8-12GB peak for 10M entities

Snowflake

  • 10M rows in 168 seconds (~2.8 min) - fastest cloud platform!
  • Only 1.17x slower than local DuckDB
  • Label Propagation: 53s (32%) - fastest LP of any platform tested, including DuckDB (81s)
  • Edge Building: 58s (35%), excellent parallel execution
  • Entity + Identifier Extraction: 22s combined (13%)
  • Output Gen: 26s (15%), efficient MERGE operations
  • Identical metrics: 16.1M edges, 1.84M clusters, 6 LP iterations
  • Throughput: ~59,500 entities/second
  • Warehouse: IDR_WH (standard size)

BigQuery

  • 10M rows in 295 seconds (~4.9 min) - 2.1x slower than DuckDB
  • Label Propagation: 101s (34%) - serverless overhead on iterative queries
  • Output Gen: 91s (31%) - MERGE operations have high network overhead
  • Edge Building: 50s (17%), good parallel execution
  • Identical metrics: 16.1M edges, 1.84M clusters, 6 LP iterations
  • Throughput: ~34,000 entities/second
  • Estimated cost: ~$0.50 for 10M rows (on-demand pricing)
  • Expected to become more competitive at larger scales due to horizontal scaling (unverified until the 50M and 100M runs complete)

Databricks

  • 10M rows in 317 seconds (~5.3 min) - 2.2x slower than DuckDB
  • Similar overall to BigQuery (317s vs 295s)
  • Identical metrics: 16.1M edges, 1.84M clusters, 6 LP iterations
  • Throughput: ~31,500 entities/second
  • Unity Catalog adds overhead for table operations
  • Likely better suited to larger datasets where Spark parallelism can pay off (not yet tested at 50M or 100M)
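
The throughput and slowdown figures quoted in the observations follow directly from the 10M totals. A short sketch of that arithmetic, using only values from the timing table:

```python
ROWS = 10_000_000
TOTAL_SECONDS = {"DuckDB": 143, "Snowflake": 168, "BigQuery": 295, "Databricks": 317}


def throughput(rows, seconds):
    """Entities processed per second of total runtime."""
    return rows / seconds


def slowdown(seconds, baseline_seconds):
    """How many times slower a run is than the baseline run."""
    return seconds / baseline_seconds


if __name__ == "__main__":
    baseline = TOTAL_SECONDS["DuckDB"]
    for platform, seconds in TOTAL_SECONDS.items():
        print(
            f"{platform:<11} {throughput(ROWS, seconds):>8,.0f} entities/s  "
            f"{slowdown(seconds, baseline):.2f}x vs DuckDB"
        )
```

The bullets above round these values, for example 69,930 entities/second is reported as ~70,000.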

Recommendations

When to Use Each Platform

Use Case              | Recommended Platform | Rationale
----------------------|----------------------|-----------------------
< 10M rows, local dev | DuckDB               | Fast, free, no infra
10-50M rows, batch    | TBD                  | Based on testing
50-100M rows          | TBD                  | Based on testing
> 100M rows           | TBD                  | Based on testing
Real-time/streaming   | TBD                  | Based on latency needs

Test Commands

# Generate 20M dataset
python tools/scale_test/data_generator.py --rows=20000000 --seed=42 --output=data/

# Run DuckDB benchmark
python tools/scale_test/benchmark.py \
    --platform=duckdb \
    --data=data/retail_customers_20m.parquet \
    --rows=20000000 \
    --db=idr_benchmark.duckdb

# Generate 50M dataset
python tools/scale_test/data_generator.py --rows=50000000 --seed=42 --output=data/

# Generate 100M dataset
python tools/scale_test/data_generator.py --rows=100000000 --seed=42 --output=data/

Appendix: Test Data Distribution

Target distribution (retail industry standard):

Cluster Size       | Percentage | @ 20M | @ 50M | @ 100M
-------------------|------------|-------|-------|-------
1 (singleton)      | 35%        | 7M    | 17.5M | 35M
2 (pairs)          | 25%        | 5M    | 12.5M | 25M
3-5 (small)        | 20%        | 4M    | 10M   | 20M
6-15 (medium)      | 12%        | 2.4M  | 6M    | 12M
16-50 (large)      | 5%         | 1M    | 2.5M  | 5M
51-200 (v.large)   | 2%         | 400K  | 1M    | 2M
201-1000 (massive) | 1%         | 200K  | 500K  | 1M
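
The per-scale columns are the percentage applied to the total row count, so the table reads each percentage as a share of rows. A minimal sketch of that calculation follows; `rows_per_bucket` is a hypothetical helper and says nothing about how `data_generator.py` actually samples cluster sizes.

```python
# Target share of rows per cluster-size bucket, in whole percent.
DISTRIBUTION_PCT = {
    "1": 35,
    "2": 25,
    "3-5": 20,
    "6-15": 12,
    "16-50": 5,
    "51-200": 2,
    "201-1000": 1,
}


def rows_per_bucket(total_rows, distribution_pct=DISTRIBUTION_PCT):
    """Expected number of rows falling in each cluster-size bucket."""
    # Integer arithmetic keeps the results exact for round row counts.
    return {bucket: total_rows * pct // 100 for bucket, pct in distribution_pct.items()}


if __name__ == "__main__":
    for scale in (20_000_000, 50_000_000, 100_000_000):
        print(scale, rows_per_bucket(scale))
```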

Identifier match rates:

  • Email: 55%
  • Phone: 25%
  • Loyalty ID: 10%
  • Address: 10%
  • Chain patterns: 15%