Status: 10M run complete on DuckDB, Snowflake, BigQuery, and Databricks; 50M and 100M runs pending
Dataset: Retail customer data (deterministic seed: 42)
Last Updated: 2026-01-04
Test Environment Summary
| Platform | Configuration | Instance Type | Notes |
|---|---|---|---|
| DuckDB | Local / Docker | MacBook Pro M1/M2 | Single-node, 16GB RAM |
| Snowflake | TBD | Warehouse size | |
| BigQuery | On-demand | Serverless | Auto-scaling, pay-per-query |
| Databricks | TBD | Cluster config | |
10 Million Rows (DuckDB Baseline)
Timing Results
| Platform | Data Load | Entity Extract | Edge Build | Label Prop | Output Gen | Total |
|---|---|---|---|---|---|---|
| DuckDB | 1s | 7s | 33s | 81s | 12s | 143s |
| Snowflake | 5s | 17s | 58s | 53s | 26s | 168s |
| BigQuery | 5s | 10s | 50s | 101s | 91s | 295s |
| Databricks | 14s | 36s | 77s | 115s | ~75s | 317s |
Metrics Results
| Platform | Entities | Edges | Clusters | Largest | Singletons | LP Iters |
|---|---|---|---|---|---|---|
| DuckDB | 10,000,000 | 16,124,751 | 1,839,324 | TBD | TBD | 6 |
| Snowflake | 10,000,000 | 16,124,751 | 1,839,324 | TBD | TBD | 6 |
| BigQuery | 10,000,000 | 16,124,751 | 1,839,324 | TBD | TBD | 6 |
| Databricks | 10,000,000 | 16,124,751 | 1,839,324 | TBD | TBD | 6 |
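For readers unfamiliar with the Label Prop stage, here is a minimal sketch of how edges can collapse into clusters over a few iterations. It assumes synchronous minimum-label propagation on an undirected edge list; this document does not say which variant the benchmark uses, and the function name `label_propagation` is hypothetical.

```python
def label_propagation(nodes, edges):
    """Assign each node the smallest node id reachable through the edges.

    Assumption: synchronous min-label propagation. Each pass reads the
    labels from the previous pass and stops when a pass changes nothing.
    """
    labels = {n: n for n in nodes}  # every node starts as its own cluster
    iterations = 0
    changed = True
    while changed:
        changed = False
        iterations += 1
        new_labels = dict(labels)
        for a, b in edges:
            low = min(labels[a], labels[b])
            if new_labels[a] > low:
                new_labels[a] = low
                changed = True
            if new_labels[b] > low:
                new_labels[b] = low
                changed = True
        labels = new_labels
    return labels, iterations


# Toy graph: 1-2-3 form one cluster, 4-5 form another.
labels, iterations = label_propagation([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print("clusters:", len(set(labels.values())), "passes:", iterations)
```

The pass count returned here includes the final pass that detects no change; whether the "LP Iters" column counts that pass is not stated in this document.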
Consistency Check
- [x] All platforms produced same cluster count (1,839,324 on all four)
- [ ] Largest cluster size matches across platforms
- [ ] Singleton count matches across platforms
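The checklist above can be evaluated mechanically once the TBD cells are filled in. The sketch below is one possible way to do it, with values copied from the 10M metrics table; the `METRICS` dictionary and the `check_consistency` helper are hypothetical names, not part of the benchmark tooling.

```python
# Values copied from the 10M "Metrics Results" table; None marks a TBD cell.
METRICS = {
    "DuckDB":     {"clusters": 1_839_324, "largest": None, "singletons": None},
    "Snowflake":  {"clusters": 1_839_324, "largest": None, "singletons": None},
    "BigQuery":   {"clusters": 1_839_324, "largest": None, "singletons": None},
    "Databricks": {"clusters": 1_839_324, "largest": None, "singletons": None},
}


def check_consistency(metrics, key):
    """Return True or False when every platform reports `key`, else None."""
    values = [platform_metrics[key] for platform_metrics in metrics.values()]
    if any(value is None for value in values):
        return None  # at least one platform is still TBD
    return len(set(values)) == 1


STATUS = {True: "match", False: "MISMATCH", None: "pending (TBD)"}
for key in ("clusters", "largest", "singletons"):
    print(f"{key}: {STATUS[check_consistency(METRICS, key)]}")
```

Returning `None` for incomplete data keeps a pending check from being reported as a pass.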
50 Million Rows (Planned)
Timing Results
| Platform | Data Load | Entity Extract | Edge Build | Label Prop | Output Gen | Total |
|---|---|---|---|---|---|---|
| DuckDB | TBD | TBD | TBD | TBD | TBD | TBD |
| Snowflake | TBD | TBD | TBD | TBD | TBD | TBD |
| BigQuery | TBD | TBD | TBD | TBD | TBD | TBD |
| Databricks | TBD | TBD | TBD | TBD | TBD | TBD |
Metrics Results
| Platform | Entities | Edges | Clusters | Largest | Singletons | LP Iters |
|---|---|---|---|---|---|---|
| DuckDB | TBD | TBD | TBD | TBD | TBD | TBD |
| Snowflake | TBD | TBD | TBD | TBD | TBD | TBD |
| BigQuery | TBD | TBD | TBD | TBD | TBD | TBD |
| Databricks | TBD | TBD | TBD | TBD | TBD | TBD |
Consistency Check
- [ ] All platforms produced same cluster count
- [ ] Largest cluster size matches across platforms
- [ ] Singleton count matches across platforms
100 Million Rows (Planned)
Timing Results
| Platform | Data Load | Entity Extract | Edge Build | Label Prop | Output Gen | Total |
|---|---|---|---|---|---|---|
| DuckDB | TBD | TBD | TBD | TBD | TBD | TBD |
| Snowflake | TBD | TBD | TBD | TBD | TBD | TBD |
| BigQuery | TBD | TBD | TBD | TBD | TBD | TBD |
| Databricks | TBD | TBD | TBD | TBD | TBD | TBD |
Metrics Results
| Platform | Entities | Edges | Clusters | Largest | Singletons | LP Iters |
|---|---|---|---|---|---|---|
| DuckDB | TBD | TBD | TBD | TBD | TBD | TBD |
| Snowflake | TBD | TBD | TBD | TBD | TBD | TBD |
| BigQuery | TBD | TBD | TBD | TBD | TBD | TBD |
| Databricks | TBD | TBD | TBD | TBD | TBD | TBD |
Consistency Check
- [ ] All platforms produced same cluster count
- [ ] Largest cluster size matches across platforms
- [ ] Singleton count matches across platforms
Platform | 10M | 50M | 100M
-------------|-----------|----------|----------
DuckDB | ███ 143s | |
Snowflake | ████ 168s | |
BigQuery | ██████ 295s| |
Databricks | ██████ 317s| |
(To be replaced with actual Chart.js visualization)
Cost Analysis (Cloud Only)
| Platform | 10M Cost | 50M Cost | 100M Cost | Notes |
|---|---|---|---|---|
| DuckDB | Free | Free | Free | Local/Docker |
| Snowflake | ~$0.25 | TBD | TBD | XS Warehouse: 0.1 credits |
| BigQuery | ~$0.50 | TBD | TBD | On-demand: $6.25/TB scanned |
| Databricks | TBD | TBD | TBD | Serverless SQL Warehouse |
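The cost notes imply simple arithmetic: BigQuery on-demand cost is data scanned times the per-TB rate, and Snowflake cost is credits consumed times the credit price. The sketch below works backwards from the table to show what the 10M estimates imply. The per-credit price is not stated in this document; the value derived here is only what the table's two Snowflake numbers imply, and the function names are hypothetical.

```python
BQ_PRICE_PER_TB = 6.25  # USD per TB scanned, from the Notes column


def bigquery_cost(tb_scanned, price_per_tb=BQ_PRICE_PER_TB):
    """On-demand cost in USD for a given volume of data scanned."""
    return tb_scanned * price_per_tb


def snowflake_cost(credits, price_per_credit):
    """Warehouse cost in USD; the credit price varies by edition and region."""
    return credits * price_per_credit


# Working backwards from the 10M column of the cost table:
implied_tb_scanned = 0.50 / BQ_PRICE_PER_TB  # TB scanned implied by ~$0.50
implied_credit_price = 0.25 / 0.1            # USD/credit implied by ~$0.25 and 0.1 credits

print(f"BigQuery: ~{implied_tb_scanned * 1000:.0f} GB scanned implied")
print(f"Snowflake: ~${implied_credit_price:.2f} per credit implied")
```

Once 50M and 100M runs report bytes scanned and credits consumed, the same two functions can fill in the TBD cells.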
Observations & Insights
DuckDB
- 10M rows in 143 seconds (~2.4 min) - excellent for local/dev workloads
- Label Propagation dominates: 81s (57% of total); it is also the largest stage on BigQuery and Databricks, though not on Snowflake
- Edge Building: 33s (23%), fast single-node execution
- Output Gen: 12s (8%), efficient local writes
- Throughput: ~70,000 entities/second
- Created 16.1M edges from 10M entities
- Resolved into 1.84M clusters (~5.4 entities/cluster average)
- RAM usage: ~8-12GB peak for 10M entities
Snowflake
- 10M rows in 168 seconds (~2.8 min) - fastest cloud platform!
- Only 1.17x slower than local DuckDB
- Label Propagation: 53s (32%) - fastest LP of any platform tested, including DuckDB (81s)
- Edge Building: 58s (35%), excellent parallel execution
- Entity + Identifier Extraction: 22s combined (13%)
- Output Gen: 26s (15%), efficient MERGE operations
- Identical metrics: 16.1M edges, 1.84M clusters, 6 LP iterations
- Throughput: ~59,500 entities/second
- Warehouse: IDR_WH (standard size)
BigQuery
- 10M rows in 295 seconds (~4.9 min) - 2.1x slower than DuckDB
- Label Propagation: 101s (34%) - serverless overhead on iterative queries
- Output Gen: 91s (31%) - MERGE operations have high network overhead
- Edge Building: 50s (17%), good parallel execution
- Identical metrics: 16.1M edges, 1.84M clusters, 6 LP iterations
- Throughput: ~34,000 entities/second
- Estimated cost: ~$0.50 for 10M rows (on-demand pricing)
- BigQuery may become more competitive at larger scales due to horizontal scaling (not yet verified; 50M and 100M runs are pending)
Databricks
- 10M rows in 317 seconds (~5.3 min) - 2.2x slower than DuckDB
- Similar to BigQuery performance
- Identical metrics: 16.1M edges, 1.84M clusters, 6 LP iterations
- Throughput: ~31,500 entities/second
- Unity Catalog adds overhead for table operations
- Likely better suited to larger datasets that benefit from Spark parallelism (not yet verified at 50M or 100M)
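The percentages, throughput figures, and slowdown ratios quoted in the observations above can be reproduced from the 10M timing table. The following sketch shows the arithmetic; the `TIMINGS` dictionary and helper names are hypothetical, and the Databricks Output Gen value is approximate in the source (~75s).

```python
ROWS = 10_000_000

# Seconds per stage, copied from the 10M "Timing Results" table.
TIMINGS = {
    "DuckDB":     {"edge_build": 33, "label_prop": 81,  "output_gen": 12, "total": 143},
    "Snowflake":  {"edge_build": 58, "label_prop": 53,  "output_gen": 26, "total": 168},
    "BigQuery":   {"edge_build": 50, "label_prop": 101, "output_gen": 91, "total": 295},
    "Databricks": {"edge_build": 77, "label_prop": 115, "output_gen": 75, "total": 317},  # output_gen is ~75s
}


def stage_share(platform, stage):
    """Percentage of the platform's total runtime spent in one stage."""
    timing = TIMINGS[platform]
    return round(100 * timing[stage] / timing["total"])


def throughput(platform):
    """Entities processed per second over the whole run."""
    return ROWS / TIMINGS[platform]["total"]


def slowdown(platform, baseline="DuckDB"):
    """Total runtime relative to the baseline platform."""
    return TIMINGS[platform]["total"] / TIMINGS[baseline]["total"]


for name in TIMINGS:
    print(
        f"{name}: LP {stage_share(name, 'label_prop')}%, "
        f"{throughput(name):,.0f} entities/s, "
        f"{slowdown(name):.2f}x DuckDB"
    )
```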
Recommendations
| Use Case | Recommended Platform | Rationale |
|---|---|---|
| < 10M rows, local dev | DuckDB | Fast, free, no infra |
| 10-50M rows, batch | TBD | Based on testing |
| 50-100M rows | TBD | Based on testing |
| > 100M rows | TBD | Based on testing |
| Real-time/streaming | TBD | Based on latency needs |
Test Commands
# Generate 20M dataset
python tools/scale_test/data_generator.py --rows=20000000 --seed=42 --output=data/
# Run DuckDB benchmark
python tools/scale_test/benchmark.py \
    --platform=duckdb \
    --data=data/retail_customers_20m.parquet \
    --rows=20000000 \
    --db=idr_benchmark.duckdb
# Generate 50M dataset
python tools/scale_test/data_generator.py --rows=50000000 --seed=42 --output=data/
# Generate 100M dataset
python tools/scale_test/data_generator.py --rows=100000000 --seed=42 --output=data/
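The three generator invocations differ only in `--rows`, so they can be produced from one helper. This is a small sketch that builds the command lines without running them; the script path and flags are the ones shown above, while `generator_command` is a hypothetical helper name.

```python
import shlex

GENERATOR = "tools/scale_test/data_generator.py"


def generator_command(rows, seed=42, output="data/"):
    """Build the argv list for one data_generator.py run (does not execute it)."""
    return [
        "python",
        GENERATOR,
        f"--rows={rows}",
        f"--seed={seed}",
        f"--output={output}",
    ]


for rows in (20_000_000, 50_000_000, 100_000_000):
    print(shlex.join(generator_command(rows)))
```

Passing the resulting list to `subprocess.run(..., check=True)` would execute each step and stop on the first failure.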
Appendix: Test Data Distribution
Target distribution (retail industry standard):
| Cluster Size | Percentage | @ 20M | @ 50M | @ 100M |
|---|---|---|---|---|
| 1 (singleton) | 35% | 7M | 17.5M | 35M |
| 2 (pairs) | 25% | 5M | 12.5M | 25M |
| 3-5 (small) | 20% | 4M | 10M | 20M |
| 6-15 (medium) | 12% | 2.4M | 6M | 12M |
| 16-50 (large) | 5% | 1M | 2.5M | 5M |
| 51-200 (v.large) | 2% | 400K | 1M | 2M |
| 201-1000 (massive) | 1% | 200K | 500K | 1M |
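The per-scale columns in the table above appear to be each percentage applied to the total row count. A short sketch of that calculation follows, which can also generate targets for other dataset sizes; `DISTRIBUTION` and `bucket_counts` are hypothetical names.

```python
# Target percentages copied from the distribution table.
DISTRIBUTION = {
    "1 (singleton)": 35,
    "2 (pairs)": 25,
    "3-5 (small)": 20,
    "6-15 (medium)": 12,
    "16-50 (large)": 5,
    "51-200 (v.large)": 2,
    "201-1000 (massive)": 1,
}


def bucket_counts(total_rows):
    """Apply each bucket's percentage to the total row count."""
    return {name: total_rows * pct // 100 for name, pct in DISTRIBUTION.items()}


for total in (20_000_000, 50_000_000, 100_000_000):
    counts = bucket_counts(total)
    print(f"{total:,} rows -> {counts['1 (singleton)']:,} in the singleton bucket")
```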
Identifier match rates:
- Email: 55%
- Phone: 25%
- Loyalty ID: 10%
- Address: 10%
- Chain patterns: 15%