Matching Algorithm
This document explains the deterministic matching algorithm used in SQL Identity Resolution.
Overview
The algorithm uses Label Propagation on a graph of entities connected by shared identifiers. This is a form of connected components detection that assigns all entities in a connected subgraph to the same cluster.
Algorithm Steps
Step 1: Identifier Extraction
Extract identifiers from source entities based on mapping rules:
graph LR
subgraph Source["Source Record"]
A["customer_id: 123<br/>email: john@acme.com<br/>phone: 555-0123"]
end
subgraph Identifiers["Extracted Identifiers"]
I1["EMAIL: john@acme.com"]
I2["PHONE: 5550123"]
end
Source --> Identifiers
Normalization rules: - Email: lowercase, trim whitespace - Phone: digits only - Custom: configurable per identifier type
Step 2: Edge Building
Create edges between entities that share an identifier:
graph LR
subgraph Entities
E1["Entity A<br/>email: john@acme.com"]
E2["Entity B<br/>email: john@acme.com"]
E3["Entity C<br/>phone: 555-0123"]
end
subgraph Edges
EG1["(A, B, EMAIL)"]
EG2["(B, C, PHONE)"]
end
E1 & E2 --> EG1
E2 & E3 --> EG2
SQL Logic:
-- Create edges from shared identifiers
INSERT INTO idr_work.edges_new
SELECT
a.entity_key AS entity_a,
b.entity_key AS entity_b,
a.identifier_type
FROM idr_work.identifiers a
JOIN idr_work.identifiers b
ON a.identifier_type = b.identifier_type
AND a.identifier_value_norm = b.identifier_value_norm
AND a.entity_key < b.entity_key -- Avoid duplicates
Step 3: Label Propagation
Iteratively propagate the minimum label along edges until convergence:
graph TB
subgraph Iteration0["Iteration 0 (Initial)"]
A0["A: label=A"]
B0["B: label=B"]
C0["C: label=C"]
end
subgraph Iteration1["Iteration 1"]
A1["A: label=A"]
B1["B: label=A (min of A,B)"]
C1["C: label=B (min of B,C)"]
end
subgraph Iteration2["Iteration 2"]
A2["A: label=A"]
B2["B: label=A"]
C2["C: label=A (min of A,B)"]
end
subgraph Final["Converged"]
AF["A: label=A ✓"]
BF["B: label=A ✓"]
CF["C: label=A ✓"]
end
Iteration0 --> Iteration1
Iteration1 --> Iteration2
Iteration2 --> Final
SQL Logic (one iteration):
-- Propagate minimum label along edges
UPDATE idr_work.lp_labels curr
SET label = (
SELECT MIN(neighbor.label)
FROM idr_work.edges_new e
JOIN idr_work.lp_labels neighbor
ON neighbor.entity_key = CASE
WHEN e.entity_a = curr.entity_key THEN e.entity_b
ELSE e.entity_a
END
WHERE e.entity_a = curr.entity_key OR e.entity_b = curr.entity_key
)
WHERE EXISTS (
SELECT 1 FROM idr_work.edges_new e
WHERE e.entity_a = curr.entity_key OR e.entity_b = curr.entity_key
);
Step 4: Cluster Assignment
The final label becomes the resolved_id:
| entity_key | label (resolved_id) |
|---|---|
| customer:A | customer:A |
| customer:B | customer:A |
| customer:C | customer:A |
| customer:D | customer:D |
Handling Edge Cases
Singletons
Entities with no matching identifiers get resolved_id = entity_key:
-- Singletons: entities with no edges
INSERT INTO idr_work.membership_updates
SELECT entity_key, entity_key AS resolved_id
FROM idr_work.entities_delta
WHERE entity_key NOT IN (
SELECT entity_key FROM idr_work.lp_labels
);
Large Groups (max_group_size)
To prevent generic identifiers (e.g., test@test.com) from creating mega-clusters:
graph TB
subgraph Normal["Normal Group (size=5)"]
N1[Entity 1]
N2[Entity 2]
N3[Entity 3]
N4[Entity 4]
N5[Entity 5]
end
subgraph Large["Large Group (size=15000)"]
L1[Entity 1]
L2["..."]
L3[Entity 15000]
end
subgraph Result["After max_group_size=10000"]
R1["Normal: Connected ✓"]
R2["Large: Skipped ⚠️<br/>(logged to skipped_identifier_groups)"]
end
Normal --> R1
Large --> R2
Configuration:
INSERT INTO idr_meta.rule (rule_id, identifier_type, max_group_size)
VALUES ('email_exact', 'EMAIL', 10000);
Identifier Exclusions
Skip known bad identifiers:
INSERT INTO idr_meta.identifier_exclusion
VALUES ('EMAIL', 'test@test.com', FALSE, 'Generic test email');
INSERT INTO idr_meta.identifier_exclusion
VALUES ('EMAIL', '%@example.com', TRUE, 'Example domain pattern');
Convergence
The algorithm converges when no labels change between iterations:
graph LR
A[Iteration N] --> B{Labels Changed?}
B -->|Yes| C[Iteration N+1]
B -->|No| D[Converged]
C --> A
E[Max Iterations] --> F{Reached Limit?}
F -->|Yes| D
Typical convergence: - Small graphs: 2-5 iterations - Medium graphs: 5-10 iterations - Large graphs: 10-20 iterations - Max default: 30 iterations
Complexity Analysis
| Step | Time Complexity | Space Complexity |
|---|---|---|
| Identifier Extraction | O(n) | O(n × m) |
| Edge Building | O(n × m) | O(e) |
| Label Propagation | O(i × e) | O(n) |
| Cluster Assignment | O(n) | O(n) |
Where: - n = number of entities - m = avg identifiers per entity - e = number of edges - i = iterations until convergence
Comparison with Other Algorithms
| Algorithm | Pros | Cons |
|---|---|---|
| Label Propagation (this) | Simple, deterministic, SQL-native | May not handle probabilistic matches |
| Union-Find | Faster for sparse graphs | Harder to implement in SQL |
| ML-based (Dedupe) | Handles fuzzy matches | Requires training, black box |
| Fellegi-Sunter | Statistical rigor | Complex, requires tuning |
Next Steps
- Data Model - Schema reference
- Configuration - Setting up rules
- Production Hardening - max_group_size, exclusions