Skip to content

SQL Identity Resolution

Production-grade deterministic identity resolution for modern data warehouses.

  • Multi-Platform


    Run on DuckDB, Snowflake, BigQuery, or Databricks with the same logic.

    Platform Setup

  • Production Ready


    Dry run mode, metrics export, audit logging, and data quality controls.

    Production Guide

  • SQL-First & dbt Native


    Pure SQL execution or use the dbt package for seamless integration.

    dbt Package

  • Open Source


    Full control, full transparency, zero licensing costs.

    GitHub


Why SQL Identity Resolution?

Your data never leaves your warehouse

Unlike SaaS solutions, SQL IDR runs entirely within your data platform. No data egress, no third-party processing, full compliance control.


What is Identity Resolution?

Identity resolution is the process of matching and merging records across multiple data sources to create a unified view of each entity (customer, user, account).

This solution uses deterministic matching with label propagation to create stable, explainable identity clusters.

graph LR
    A[CRM: customer_123<br/>email: john@acme.com] --> IDR[Identity Resolution]
    B[Ecommerce: order_456<br/>email: john@acme.com] --> IDR
    C[Mobile: device_789<br/>phone: 555-0123] --> IDR
    IDR --> D[Unified Identity<br/>resolved_id: customer_123]

Quick Start

# Clone and setup
git clone https://github.com/anilkulkarni87/sql-identity-resolution.git
cd sql-identity-resolution

# Create schema
duckdb idr.duckdb < sql/duckdb/00_ddl_all.sql

# Run (dry run first!)
python sql/duckdb/idr_run.py --db=idr.duckdb --run-mode=FULL --dry-run
-- Create schemas and tables
-- Run sql/snowflake/00_ddl_all.sql

-- Execute (dry run first!)
CALL idr_run('FULL', 30, TRUE);  -- TRUE = dry run
# Setup
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa.json

# Create schema
bq query < sql/bigquery/00_ddl_all.sql

# Run (dry run first!)
python sql/bigquery/idr_run.py --project=my-project --dry-run
# Import notebooks from sql/databricks/notebooks/
# Run IDR_QuickStart.py to setup
# Set DRY_RUN widget to "true" for preview

Full Quick Start Guide


Key Features

Feature Description
Deterministic Matching Rule-based matching on email, phone, loyalty ID, etc.
Label Propagation Graph-based clustering using connected components
Incremental Processing Watermark-based delta processing
Dry Run Mode Preview changes before committing
Metrics Export Prometheus, DataDog, webhook integrations
Production Hardening Max group size limits, identifier exclusions
Golden Profiles Automatic best-record selection

Architecture Overview

graph TB
    subgraph Sources["Source Systems"]
        S1[CRM]
        S2[E-commerce]
        S3[Mobile App]
    end

    subgraph Meta["Metadata Layer"]
        M1[source_table]
        M2[rule]
        M3[identifier_mapping]
    end

    subgraph Process["Processing"]
        P1[Build Edges]
        P2[Label Propagation]
        P3[Cluster Assignment]
    end

    subgraph Output["Output Layer"]
        O1[Membership]
        O2[Clusters]
        O3[Golden Profiles]
        O4[Metrics]
    end

    Sources --> Process
    Meta --> Process
    Process --> Output

Full Architecture


Next Steps