Frequently Asked Questions
Quick answers to common questions about SQL Identity Resolution.
General Questions
What is identity resolution?
Identity resolution is the process of matching and linking records across multiple data sources that refer to the same real-world entity (customer, user, device, etc.). It creates a unified view from fragmented data.
What's the difference between deterministic and probabilistic matching?
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Deterministic | Exact match on identifiers (email, phone, ID) | Explainable, fast, low false positives | Misses fuzzy matches |
| Probabilistic | Similarity scoring + ML | Catches more matches | Black box, more false positives |
SQL Identity Resolution uses deterministic matching for transparency and reliability.
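In SQL terms, deterministic matching is simply an exact join on a normalized identifier. The sketch below illustrates the idea; the crm.customers and shop.orders tables and their columns are illustrative, not part of the product schema.
-- Deterministic match: exact join on a normalized identifier (illustrative tables/columns)
SELECT
  c.customer_id AS crm_key,
  o.order_id    AS ecommerce_key
FROM crm.customers c
JOIN shop.orders   o
  ON LOWER(TRIM(c.email)) = LOWER(TRIM(o.email))  -- exact match after light normalization
WHERE c.email IS NOT NULL;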
Can I use this with my existing CDP?
Yes! SQL Identity Resolution creates identity clusters that can be exported to any system:
- As a replacement: Use it instead of your CDP's built-in resolution
- As a supplement: Run it alongside and compare results
- As an upstream source: Feed resolved IDs back to your CDP
Does this handle GDPR/CCPA compliance?
Your data never leaves your data warehouse. This is a key advantage over SaaS solutions. You maintain full control over:
- Data residency
- Access logging
- Right to deletion
- Consent management
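The right-to-deletion item, for example, can be executed with ordinary SQL against the output tables. The DELETE below is a minimal sketch using the membership table from the export example further down; the resolved_id value is illustrative, and a real workflow would also cover the golden profile and any source copies.
-- Right to deletion: remove one person's resolved membership rows (sketch)
DELETE FROM idr_out.identity_resolved_membership_current
WHERE resolved_id = 'resolved_42';  -- illustrative resolved_id from a deletion request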
Technical Questions
How does label propagation work?
Label propagation is a graph algorithm that assigns each node (entity) to a cluster:
- Each entity starts with its own label (entity_key)
- In each iteration, every entity adopts the smallest label among itself and its neighbors
- Repeat until no labels change
- All entities with the same label form a cluster
Typically converges in 5-15 iterations.
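A single iteration can be written as one set-based statement. The sketch below assumes simplified edges(src_key, dst_key) and labels(entity_key, label) tables (not the actual internal schema), edges stored in both directions, and Postgres/DuckDB-style UPDATE ... FROM syntax.
-- One label-propagation iteration: each entity adopts the smallest label
-- among its neighbors when that label is smaller than its own.
UPDATE labels
SET label = n.new_label
FROM (
  SELECT e.src_key AS entity_key, MIN(nb.label) AS new_label
  FROM edges e
  JOIN labels nb ON nb.entity_key = e.dst_key
  GROUP BY e.src_key
) n
WHERE labels.entity_key = n.entity_key
  AND n.new_label < labels.label;  -- no rows updated means the labels have converged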
What happens when two systems have the same email?
They get linked! That's the point. If CRM:customer_123 and Ecommerce:order_456 both have john@acme.com, they'll be in the same cluster with the same resolved_id.
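Under the hood, a shared identifier value is what creates the link: any two entity keys holding the same (type, value) pair become connected and end up in the same cluster. The self-join below sketches that edge-building step; stg_identifiers is an illustrative staging table, not the product's internal name.
-- Shared identifier -> edge between two entity keys (illustrative staging table)
SELECT
  a.entity_key AS src_key,
  b.entity_key AS dst_key
FROM stg_identifiers a
JOIN stg_identifiers b
  ON  a.identifier_type  = b.identifier_type   -- e.g. 'EMAIL'
  AND a.identifier_value = b.identifier_value  -- e.g. 'john@acme.com'
  AND a.entity_key < b.entity_key;             -- one edge per pair, no self-joins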
How do I prevent over-matching?
Use identifier exclusions for common/shared values:
INSERT INTO idr_meta.identifier_exclusion VALUES
  ('EMAIL', 'noreply@%', TRUE, 'Generic noreply address'),
  ('PHONE', '0000000000', FALSE, 'Invalid phone');
And set max_group_size limits to cap how many records a single cluster can absorb.
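Where max_group_size is configured depends on your deployment, so treat the query below as a monitoring sketch rather than the official mechanism: it reuses the membership table from the export example and an illustrative ceiling of 500 to surface clusters that have grown suspiciously large.
-- Audit query: flag clusters larger than an illustrative ceiling of 500 members
SELECT resolved_id, COUNT(*) AS member_count
FROM idr_out.identity_resolved_membership_current
GROUP BY resolved_id
HAVING COUNT(*) > 500
ORDER BY member_count DESC;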
Can I run this incrementally?
Yes! Set run_mode='INCR' to only process records with watermark_column > last_watermark_value. This is the recommended production mode.
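Conceptually, incremental mode only reads source rows whose watermark column has advanced past the stored last_watermark_value. The filter below is a sketch: the crm.customers source, its updated_at column, and the idr_meta.source_watermark tracking table are illustrative names, not the product's actual objects.
-- Conceptual incremental filter: only rows newer than the stored watermark
SELECT *
FROM crm.customers src
WHERE src.updated_at > (
  SELECT last_watermark_value
  FROM idr_meta.source_watermark        -- hypothetical watermark tracking table
  WHERE source_name = 'crm.customers'
);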
Performance Questions
How long does it take?
| Records | Platform | Time |
|---|---|---|
| 100K | DuckDB | ~5 seconds |
| 1M | DuckDB | ~30 seconds |
| 10M | Snowflake (XS) | ~3 minutes |
| 100M | Snowflake (M) | ~15 minutes |
See Benchmark Results for detailed metrics.
What's the memory requirement?
- DuckDB: ~2GB RAM per 1M records
- Cloud platforms: Handled by the warehouse (scale compute as needed)
Can I parallelize the run?
The algorithm runs as a single job, but:
- Internal SQL operations use parallel execution
- You can partition large datasets by entity_type
- Incremental mode reduces per-run volume
Integration Questions
How do I schedule runs?
| Platform | Scheduler |
|---|---|
| DuckDB | Cron, Airflow, dbt Cloud |
| Snowflake | Tasks + Streams |
| BigQuery | Cloud Scheduler |
| Databricks | Jobs / Workflows |
See Scheduling Guide.
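As one example, a nightly Snowflake Task could kick off an incremental run. The warehouse name and the run_identity_resolution procedure below are placeholders for whatever entry point your deployment exposes; only the Task and cron mechanics are standard Snowflake.
-- Nightly incremental run at 02:00 UTC via a Snowflake Task
CREATE OR REPLACE TASK idr_nightly_run
  WAREHOUSE = idr_wh                        -- placeholder warehouse
  SCHEDULE  = 'USING CRON 0 2 * * * UTC'
AS
  CALL run_identity_resolution('INCR');     -- hypothetical entry-point procedure

-- Tasks are created suspended; enable it with:
ALTER TASK idr_nightly_run RESUME;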
Can I use this with dbt?
Yes! The dbt_idr package provides native dbt integration.
See dbt Package Guide.
How do I export results to downstream systems?
Query the output tables directly:
-- Export to downstream
SELECT
  m.entity_key,
  m.resolved_id,
  g.email_primary,
  g.phone_primary
FROM idr_out.identity_resolved_membership_current m
JOIN idr_out.golden_profile_current g
  ON g.resolved_id = m.resolved_id;