SQL Identity Resolution

Production-grade deterministic identity resolution for modern data warehouses.

Multi-Platform

Run on DuckDB, Snowflake, BigQuery, or Databricks with the same logic.

Platform Setup
Production Ready

Dry run mode, metrics export, audit logging, and data quality controls.

Production Guide
SQL-First & dbt Native

Pure SQL execution or use the dbt package for seamless integration.

dbt Package
Open Source

Full control, full transparency, zero licensing costs.

GitHub

Why SQL Identity Resolution?

Your data never leaves your warehouse

Unlike SaaS solutions, SQL IDR runs entirely within your data platform. No data egress, no third-party processing, full compliance control.

What is Identity Resolution?

Identity resolution is the process of matching and merging records across multiple data sources to create a unified view of each entity (customer, user, account).

This solution uses deterministic matching with label propagation to create stable, explainable identity clusters.

graph LR
    A[CRM: customer_123<br/>email: john@acme.com] --> IDR[Identity Resolution]
    B[Ecommerce: order_456<br/>email: john@acme.com] --> IDR
    C[Mobile: device_789<br/>phone: 555-0123] --> IDR
    IDR --> D[Unified Identity<br/>resolved_id: customer_123]

Quick Start

DuckDBSnowflakeBigQueryDatabricks

# Clone and setup
git clone https://github.com/anilkulkarni87/sql-identity-resolution.git
cd sql-identity-resolution

# Create schema
duckdb idr.duckdb < sql/duckdb/00_ddl_all.sql

# Run (dry run first!)
python sql/duckdb/idr_run.py --db=idr.duckdb --run-mode=FULL --dry-run

-- Create schemas and tables
-- Run sql/snowflake/00_ddl_all.sql

-- Execute (dry run first!)
CALL idr_run('FULL', 30, TRUE);  -- TRUE = dry run

# Setup
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa.json

# Create schema
bq query < sql/bigquery/00_ddl_all.sql

# Run (dry run first!)
python sql/bigquery/idr_run.py --project=my-project --dry-run

# Import notebooks from sql/databricks/notebooks/
# Run IDR_QuickStart.py to setup
# Set DRY_RUN widget to "true" for preview

Full Quick Start Guide

Key Features

Feature	Description
Deterministic Matching	Rule-based matching on email, phone, loyalty ID, etc.
Label Propagation	Graph-based clustering using connected components
Incremental Processing	Watermark-based delta processing
Dry Run Mode	Preview changes before committing
Metrics Export	Prometheus, DataDog, webhook integrations
Production Hardening	Max group size limits, identifier exclusions
Golden Profiles	Automatic best-record selection

Architecture Overview

graph TB
    subgraph Sources["Source Systems"]
        S1[CRM]
        S2[E-commerce]
        S3[Mobile App]
    end

    subgraph Meta["Metadata Layer"]
        M1[source_table]
        M2[rule]
        M3[identifier_mapping]
    end

    subgraph Process["Processing"]
        P1[Build Edges]
        P2[Label Propagation]
        P3[Cluster Assignment]
    end

    subgraph Output["Output Layer"]
        O1[Membership]
        O2[Clusters]
        O3[Golden Profiles]
        O4[Metrics]
    end

    Sources --> Process
    Meta --> Process
    Process --> Output

Full Architecture

Next Steps

Quick Start

Get running in 5 minutes
Configuration

Set up rules and sources
Dry Run Mode

Preview changes safely
Metrics

Monitor your pipeline