Production Deployment: BigQuery
This guide details the exact steps to deploy SQL Identity Resolution (IDR) to a production Google BigQuery environment.
Prerequisites
- GCP Project: A Google Cloud project with the BigQuery API enabled.
- Service Account: A service account with `BigQuery Admin` or `BigQuery Data Editor` + `BigQuery Job User` roles.
- Python Environment: For running `load_metadata.py` and `idr_run.py`.
Step 1: Schema Setup
1.1 Create Datasets
Create the three required datasets in your project.
```bash
export PROJECT_ID="your-project-id"

bq mk --dataset ${PROJECT_ID}:idr_meta
bq mk --dataset ${PROJECT_ID}:idr_work
bq mk --dataset ${PROJECT_ID}:idr_out
```
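To confirm the datasets exist, list them with the `bq` CLI:

```bash
# List the datasets in the project; idr_meta, idr_work, and idr_out should all appear.
bq ls ${PROJECT_ID}:
```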
1.2 Create Tables
Execute the DDL script to create all necessary tables.
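The DDL file's location depends on your checkout; assuming it sits at `sql/bigquery/schema/create_tables.sql` (a placeholder path, adjust to your repo layout), it can be piped through the `bq` CLI:

```bash
# Execute the table DDL; the file path below is an assumed location in the repo.
bq --project_id=${PROJECT_ID} query --use_legacy_sql=false < sql/bigquery/schema/create_tables.sql
```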
Step 2: Configuration
Create a `production.yaml` file defining your rules and sources.
Example `production.yaml`:
```yaml
rules:
  - rule_id: email_exact
    identifier_type: EMAIL
    priority: 1
    settings:
      canonicalize: LOWERCASE

sources:
  - table_id: ga4_events
    table_fqn: your-project-id.analytics_123456.events_*
    entity_key_expr: user_pseudo_id
    trust_rank: 2
    identifiers:
      - type: COOKIE
        expr: user_pseudo_id
```
Step 3: Metadata Loading
Use the `load_metadata.py` tool to push your configuration to BigQuery.
```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"

python tools/load_metadata.py \
  --platform=bigquery \
  --project=your-project-id \
  --config=production.yaml
```
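To confirm the load, query the metadata dataset. The table name `idr_meta.rules` below is an assumption; substitute whatever table the DDL script actually created:

```bash
# Spot-check the loaded configuration; idr_meta.rules is an assumed table name.
bq query --use_legacy_sql=false "SELECT * FROM \`${PROJECT_ID}.idr_meta.rules\` LIMIT 10"
```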
Step 4: Execution & Scheduling
The IDR process is a Python script (`sql/bigquery/core/idr_run.py`) that orchestrates BigQuery jobs.
Option A: Cloud Run Jobs (Recommended)
- Containerize the application (Dockerfile provided in repo).
- Deploy to Cloud Run Jobs or Google Kubernetes Engine (GKE).
- Schedule with Cloud Scheduler.
Command:
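A minimal sketch, assuming the image has been pushed to Artifact Registry and the job is named `idr-job` in `us-central1` (all names are placeholders):

```bash
# Deploy the containerized runner as a Cloud Run job (image path and job name are placeholders).
gcloud run jobs deploy idr-job \
  --image=us-central1-docker.pkg.dev/${PROJECT_ID}/idr/idr-runner:latest \
  --region=us-central1 \
  --args="--project=${PROJECT_ID},--run-mode=FULL"

# Trigger a run manually; Cloud Scheduler can invoke the same job on a cron schedule.
gcloud run jobs execute idr-job --region=us-central1
```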
Option B: Cloud Composer (Airflow)
Use the `BashOperator` or `KubernetesPodOperator` to run the script.
```python
from airflow.operators.bash import BashOperator

# Run the IDR pipeline in FULL mode via the repo's idr_run.py entry point.
run_idr = BashOperator(
    task_id='run_idr',
    bash_command='python sql/bigquery/core/idr_run.py --project={{ var.value.gcp_project }} --run-mode=FULL',
    dag=dag,
)
```
Step 5: Monitoring
Monitor the pipeline using the `idr_out` tables.
Check Run History:
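The `run_history` table and its columns below are assumed names; use whatever the DDL script actually creates:

```bash
# Show the most recent IDR runs; table and column names here are assumptions.
bq query --use_legacy_sql=false \
  "SELECT * FROM \`${PROJECT_ID}.idr_out.run_history\` ORDER BY started_at DESC LIMIT 20"
```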