Soda Spark - Open Source Data Quality Tool

I stumbled upon Soda while looking for open source data quality (DQ) tools. It was quite easy to get started with and to incorporate into existing data pipelines. This repo has a notebook that should help others explore Soda and see if it suits their needs. The notebook is self-explanatory, but I wanted to jot down detailed steps and share them for anyone looking for the same thing.

Soda Spark

The documentation is clean and easy to read, and can be found here. Below are the general steps for onboarding to Soda.

1. Install soda-spark.
2. Read about the metrics available by default.
3. Identify and define the Soda scan yml file for your dataset.
4. Execute the scan on your DataFrame.
   1. It currently returns a scan_results object.
   2. It previously returned a DataFrame.
5. Take action based on the results.
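The steps above can be sketched roughly as follows. This is a minimal sketch, assuming soda-spark is installed and exposes `scan.execute(scan_definition, df)`. The table name `demodata`, the column `id`, and the helper `run_scan` are hypothetical and only serve as illustration.

```python
# Hypothetical scan definition for a DataFrame that has an `id` column.
SCAN_DEFINITION = """
table_name: demodata
metrics:
  - row_count
  - missing_count
  - max
tests:
  - row_count > 0
columns:
  id:
    tests:
      - missing_count == 0
"""


def run_scan(df):
    """Run the scan on a Spark DataFrame and return the scan result object."""
    # Imported lazily so the definition above can be inspected without Spark.
    from sodaspark import scan  # assumes `pip install soda-spark`

    return scan.execute(SCAN_DEFINITION, df)
```

Keeping the definition as a plain string makes it easy to version alongside the pipeline code; it could equally be read from a yml file.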

### Installation

Installation was tricky when I tried it in Databricks Community Edition. You can refer to Step 1 in the notebook for the workaround. Otherwise it's pretty straightforward.

Notebooks reference

- Set up the cluster by installing the required libraries: Setup cluster
- Define the required variables for Soda: Define variables
- Define the scan definition and execute tests: Data Quality

Scan Results

The scan result is a Python object comprising two child objects:

- Measurements: holds the results of all metrics defined in the yml file.
- Test_result: holds the results of the tests defined at the table level and column level.

We can programmatically check a specific measurement or test_result and take action based on it.
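A programmatic check might look something like the sketch below. The `Fake*` classes are simplified, hypothetical stand-ins so that the example runs without Spark or Soda. The attribute names (`measurements`, `test_results`, `metric`, `column_name`, `value`, `passed`) are assumptions about the real objects and should be verified against your Soda version.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


# Hypothetical, simplified stand-ins for the objects returned by a scan.
@dataclass
class FakeMeasurement:
    metric: str
    column_name: Optional[str]
    value: Any


@dataclass
class FakeTestResult:
    expression: str
    passed: bool


@dataclass
class FakeScanResult:
    measurements: List[FakeMeasurement] = field(default_factory=list)
    test_results: List[FakeTestResult] = field(default_factory=list)


def get_measurement(result, metric, column_name=None):
    """Return the value of one metric, or None if it was not measured."""
    for m in result.measurements:
        if m.metric == metric and m.column_name == column_name:
            return m.value
    return None


def failed_tests(result):
    """Return the expressions of all tests that did not pass."""
    return [t.expression for t in result.test_results if not t.passed]


result = FakeScanResult(
    measurements=[
        FakeMeasurement("row_count", None, 2),
        FakeMeasurement("missing_count", "id", 1),
    ],
    test_results=[
        FakeTestResult("row_count > 0", True),
        FakeTestResult("missing_count == 0", False),
    ],
)

failures = failed_tests(result)
if failures:
    # Take action: this sketch only reports; a pipeline might raise or alert.
    print(f"Data quality tests failed: {failures}")
```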

### My Approach

I currently convert the scan results into a measurement DataFrame and a test_result DataFrame for easier analysis. My planned next steps were to publish these results to InfluxDB, then visualize them and define alerts there. Before I could do that, I stumbled upon Soda Cloud (a paid service with a free trial).
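One possible way to do such a conversion is to flatten each child object into plain rows first. The sketch below is not the notebook's code: the `Fake*` classes and the helper names are hypothetical, and the field names are assumptions about what the scan result exposes.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FakeMeasurement:  # hypothetical stand-in for a Soda measurement
    metric: str
    column_name: Optional[str]
    value: Any


@dataclass
class FakeTestResult:  # hypothetical stand-in for a Soda test result
    expression: str
    passed: bool


def rows_from_measurements(measurements) -> List[Dict[str, Any]]:
    """One flat row per measurement; values are stringified so the column has a single type."""
    return [
        {"metric": m.metric, "column_name": m.column_name, "value": str(m.value)}
        for m in measurements
    ]


def rows_from_test_results(test_results) -> List[Dict[str, Any]]:
    """One flat row per test, with its expression and outcome."""
    return [{"test": t.expression, "passed": t.passed} for t in test_results]
```

Each list can then be passed to `spark.createDataFrame` (with an explicit schema if a column is entirely null) to get the two DataFrames.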

Soda Cloud

Soda Cloud does exactly what I wanted to do with InfluxDB, but without any additional work. Here are the steps I needed to follow:

1. Set up a Soda Cloud account.
2. Create an API key.
3. Set up the Soda Server Client.
4. Add another argument when executing the scan: `scan.execute(scan_definition, df, soda_server_client=soda_server_client)`
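Steps 3 and 4 could look roughly like the sketch below. The environment variable names and both helper functions are hypothetical. The `SodaServerClient` import path, its `host`, `api_key_id` and `api_key_secret` arguments, and the default host `cloud.soda.io` are assumptions based on my recollection of the soda-spark README, so please verify them against the version you install.

```python
import os


def cloud_credentials(env=None):
    """Collect Soda Cloud settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    missing = [k for k in ("SODA_API_KEY_ID", "SODA_API_KEY_SECRET") if not env.get(k)]
    if missing:
        raise KeyError(f"Missing environment variables: {missing}")
    return {
        "host": env.get("SODA_HOST", "cloud.soda.io"),  # assumed default host
        "api_key_id": env["SODA_API_KEY_ID"],
        "api_key_secret": env["SODA_API_KEY_SECRET"],
    }


def run_scan_with_cloud(scan_definition, df):
    """Execute the scan and publish the results to Soda Cloud."""
    # Local imports so this module loads without Soda installed.
    # Import path and constructor arguments are assumptions; check your version.
    from sodaspark import scan
    from sodasql.soda_server_client.soda_server_client import SodaServerClient

    soda_server_client = SodaServerClient(**cloud_credentials())
    return scan.execute(scan_definition, df, soda_server_client=soda_server_client)
```

Reading the key from the environment (or a secret scope in Databricks) keeps it out of the notebook itself.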

The image below shows a monitor created automatically from the scan yml file.

The image below shows an alert for a test that has failed.

You can look at all the datasets you are monitoring in one place.

You can look at the schema, monitors and Sample_data (if published) for each dataset:

Advantages (from my perspective)

- Easy setup.
- Quick to onboard.
- The open source version alone can be enough, since we can define actions based on the scan results.
- If you use Soda scans for all pipelines, Soda Cloud has the potential to act as a data catalog with data health monitoring.
- Excellent community support on Slack.
- Sample data or failed records can also be sent to a cloud platform of your choice instead of Soda Cloud.

Some things could take time because of the learning curve. I intend to add more examples here.

- Defining the scan yml file.
- Understanding the metrics that are provided by default.
- Group metrics and historical metrics.

Next Steps and Limitations

Soda Cloud is a paid service, and I think that is fair for what it provides.

- Explore publishing the scan results to InfluxDB (community edition).
- Visualize them in InfluxDB. This could be a good stack for personal projects.
- Soda currently supports many databases as well as soda-spark.
- Soda streaming might be available soon.