Architecture

This document describes the system architecture of SQL Identity Resolution.

High-Level Architecture

graph TB
    subgraph Sources["Source Layer"]
        S1[CRM System]
        S2[E-commerce]
        S3[Mobile App]
        S4[Support Tickets]
    end

    subgraph Meta["Metadata Layer"]
        M1["source_table<br/>(table registry)"]
        M2["rule<br/>(matching rules)"]
        M3["identifier_mapping<br/>(column mappings)"]
        M4["config<br/>(settings)"]
    end

    subgraph Process["Processing Layer"]
        P1["Extract Entities<br/>(delta detection)"]
        P2["Build Edges<br/>(identifier matching)"]
        P3["Label Propagation<br/>(connected components)"]
        P4["Assign Clusters<br/>(membership update)"]
    end

    subgraph Output["Output Layer"]
        O1["identity_resolved_membership_current"]
        O2["identity_clusters_current"]
        O3["golden_profile_current"]
        O4["metrics_export"]
        O5["run_history"]
    end

    Sources --> P1
    Meta --> P1
    P1 --> P2
    P2 --> P3
    P3 --> P4
    P4 --> Output

Cross-Platform Design

The same core logic runs on all platforms with platform-specific adapters:

graph LR
    subgraph Core["Core SQL Logic"]
        A[DDL Schema]
        B[Edge Building SQL]
        C[Label Propagation SQL]
        D[Output Generation SQL]
    end

    subgraph Adapters["Platform Adapters"]
        DA["DuckDB<br/>Python CLI"]
        SN["Snowflake<br/>Stored Procedure"]
        BQ["BigQuery<br/>Python CLI"]
        DB["Databricks<br/>Notebook"]
    end

    Core --> DA
    Core --> SN
    Core --> BQ
    Core --> DB

Data Flow

1. Input Processing

sequenceDiagram
    participant Source as Source Table
    participant Meta as Metadata
    participant Work as Work Tables

    Meta->>Work: Read source_table config
    Source->>Work: Extract delta (watermark filter)
    Work->>Work: Generate entity_key
    Work->>Work: Extract identifiers
    Note over Work: entities_delta table populated

2. Edge Building

sequenceDiagram
    participant Entities as entities_delta
    participant Rules as rule/mapping
    participant Edges as edges_new

    Entities->>Rules: Apply identifier extraction
    Rules->>Edges: Match on canonicalized values
    Note over Edges: Edge = (entity_a, entity_b, identifier_type)

3. Label Propagation

sequenceDiagram
    participant Edges as edges_new
    participant LP as Label Propagation
    participant Labels as lp_labels

    Edges->>LP: Initialize labels (label = entity_key)
    loop Until convergence or max_iters
        LP->>LP: Propagate MIN label along edges
    end
    LP->>Labels: Final cluster assignments

4. Output Generation

sequenceDiagram
    participant Labels as lp_labels
    participant Membership as membership_current
    participant Clusters as clusters_current
    participant Golden as golden_profile_current

    Labels->>Membership: Update resolved_id
    Membership->>Clusters: Compute cluster sizes
    Membership->>Golden: Build best-record profiles

Schema Design

idr_meta (Configuration)

Table	Purpose
`source_table`	Registry of source tables to process
`rule`	Matching rules with priority and limits
`identifier_mapping`	Maps source columns to identifier types
`run_state`	Watermark tracking per source
`config`	Key-value configuration settings
`identifier_exclusion`	Values to exclude from matching

idr_work (Processing)

Table	Purpose	Lifecycle
`entities_delta`	Entities to process this run	Per-run
`identifiers`	Extracted identifier values	Per-run
`edges_new`	Entity pairs with matching identifiers	Per-run
`lp_labels`	Label propagation results	Per-run
`membership_updates`	Proposed membership changes	Per-run

idr_out (Output)

Table	Purpose
`identity_resolved_membership_current`	Entity → Cluster mapping
`identity_clusters_current`	Cluster metadata (size, confidence_score, edge_diversity, match_density, primary_reason)
`golden_profile_current`	Best-record profiles per cluster
`run_history`	Audit log of all runs
`stage_metrics`	Per-stage timing metrics
`metrics_export`	Exportable metrics
`dry_run_results`	Per-entity dry run changes
`dry_run_summary`	Aggregate dry run statistics
`skipped_identifier_groups`	Audit of skipped groups

Processing Modes

Full Mode

Processes all entities from all source tables:

graph LR
    A[All Source Records] --> B[Full Entity Extraction]
    B --> C[Complete Edge Building]
    C --> D[Full Label Propagation]
    D --> E[Complete Membership Update]

Incremental Mode

Processes only changed entities (watermark-based):

graph LR
    A[Source Records] --> B{updated_at > last_watermark?}
    B -->|Yes| C[Delta Entity Extraction]
    B -->|No| D[Skip]
    C --> E[Incremental Edge Building]
    E --> F[Focused Label Propagation]
    F --> G[Delta Membership Update]

Dry Run Architecture

graph TB
    subgraph Normal["Normal Run"]
        N1[Process Entities]
        N2[Build Edges]
        N3[Label Propagation]
        N4[Update Membership]
        N5[Update Clusters]
        N6[Update Golden Profiles]
        N7[Update Watermarks]
    end

    subgraph DryRun["Dry Run"]
        D1[Process Entities]
        D2[Build Edges]
        D3[Label Propagation]
        D4[Compute Diff]
        D5[Write dry_run_results]
        D6[Write dry_run_summary]
        D7["⛔ Skip Production Writes"]
    end

Metrics Export Architecture

graph LR
    subgraph Runner["IDR Runner"]
        R1[record_metric calls]
    end

    subgraph Storage["metrics_export table"]
        S1[(metric_name, value, dimensions)]
    end

    subgraph Exporter["metrics_exporter.py"]
        E1[DuckDB Adapter]
        E2[Snowflake Adapter]
        E3[BigQuery Adapter]
    end

    subgraph Destinations["Export Destinations"]
        D1[stdout]
        D2[Webhook]
        D3[Prometheus]
        D4[DataDog]
    end

    Runner --> Storage
    Storage --> Exporter
    Exporter --> Destinations

Security Model

graph TB
    subgraph Access["Access Control"]
        A1[Service Account / Role]
    end

    subgraph Permissions["Required Permissions"]
        P1["SELECT on source tables"]
        P2["ALL on idr_meta.*"]
        P3["ALL on idr_work.*"]
        P4["ALL on idr_out.*"]
    end

    subgraph Data["Data Flow"]
        D1["Source Data<br/>(read-only)"]
        D2["Work Tables<br/>(transient)"]
        D3["Output Tables<br/>(persistent)"]
    end

    Access --> Permissions
    Permissions --> Data

Next Steps

Matching Algorithm - Deep dive into label propagation
Data Model - Complete schema reference
Configuration - Setting up rules and sources