A collection of Databricks notebooks for testing and learning
View the Project on GitHub anilkulkarni87/databricks_notebooks
I stumbled upon Soda while on the lookout for open source DQ tools. It was quite easy to get started with and to incorporate into existing data pipelines. This repo has a notebook that will help others explore Soda further and see if it suits their needs. The notebook is self-explanatory, but I wanted to jot down detailed steps and share them for folks who are looking for the same.
The documentation is clean and easy to read and can be found here. Below are the general steps for onboarding to Soda.
scan_results
### Installation

Installation was tricky when I tried it in Databricks Community Edition. You can refer to Step 1 in the notebook for the workaround. Otherwise it's pretty straightforward.
scan_results is a Python object made up of two child objects, measurement and test_result. You can inspect either measurement or test_result and take action based on it.

### My Approach

I have currently converted the scan results into a measurement DataFrame and a test_result DataFrame for easier analysis. My planned next steps were to publish these results to InfluxDB, then visualize them and define alerts there. Before I could do that, I stumbled upon Soda Cloud (a paid service with a free trial).
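The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `ScanMeasurement` and `ScanTestResult` are hypothetical stand-ins for the two child objects, and their attribute names (`metric`, `column_name`, `value`, `expression`, `passed`) are assumptions that may differ from what Soda really exposes.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


# Hypothetical stand-ins for the child objects of a scan result.
# The real Soda classes may use different attribute names.
@dataclass
class ScanMeasurement:
    metric: str
    column_name: Optional[str]
    value: float


@dataclass
class ScanTestResult:
    expression: str
    passed: bool


def to_rows(items) -> List[dict]:
    """Flatten dataclass instances into plain dicts, one dict per row."""
    return [asdict(item) for item in items]


def failed_tests(results: List[ScanTestResult]) -> List[ScanTestResult]:
    """Keep only the tests that did not pass, so the pipeline can act on them."""
    return [r for r in results if not r.passed]


# Made-up sample values, for illustration only.
measurements = [
    ScanMeasurement("row_count", None, 1200.0),
    ScanMeasurement("max", "amount", 99.5),
]
test_results = [
    ScanTestResult("row_count > 0", True),
    ScanTestResult("duplicate_count == 0", False),
]

measurement_rows = to_rows(measurements)
test_result_rows = to_rows(test_results)
failures = failed_tests(test_results)

# In a Databricks notebook, row lists like these could be handed to
# spark.createDataFrame(...) to get one DataFrame per child object.
```

Keeping the two row lists separate mirrors the two DataFrames mentioned above, and `failures` gives a simple hook for acting on a failed test, for example by raising an error to stop the pipeline.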
Soda Cloud does exactly what I wanted to do with InfluxDB, but without any additional work. Here are the steps I needed to follow:
scan.execute(scan_definition, df, soda_server_client=soda_server_client)
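For context, here is one way the `soda_server_client` passed to the call above might be built without hard-coding API keys in the notebook. Treat it as a sketch under assumptions: the environment variable names are hypothetical, the default host `cloud.soda.io` is assumed, and the `SodaServerClient` import path and its `host`, `api_key_id` and `api_key_secret` keyword arguments should be checked against the Soda version you have installed.

```python
import os


def soda_cloud_credentials(env=None):
    """Read Soda Cloud settings from environment variables.

    The variable names are hypothetical; adapt them to your secret store.
    """
    env = os.environ if env is None else env
    required = ("SODA_API_KEY_ID", "SODA_API_KEY_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing Soda Cloud settings: " + ", ".join(missing))
    return {
        "host": env.get("SODA_HOST", "cloud.soda.io"),  # assumed default host
        "api_key_id": env["SODA_API_KEY_ID"],
        "api_key_secret": env["SODA_API_KEY_SECRET"],
    }


def run_scan_with_cloud(scan_definition, df, env=None):
    """Run a scan and publish the results to Soda Cloud.

    Imports are done lazily so the credentials helper works without Soda
    installed. The import paths and constructor arguments are assumptions.
    """
    from sodasql.soda_server_client.soda_server_client import SodaServerClient
    from sodaspark import scan

    soda_server_client = SodaServerClient(**soda_cloud_credentials(env))
    return scan.execute(scan_definition, df, soda_server_client=soda_server_client)
```

On Databricks, the same values could come from a secret scope rather than environment variables; the point is only to keep the key ID and secret out of the notebook source.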
The image below shows a monitor created automatically based on the scan yml file.
The image below shows an alert for a test that has failed.
You can look at all the datasets you are monitoring in one place.
You can look at the schema, monitors and Sample_data (if published) for each dataset:
Some things could take time because of the learning curve. I intend to add more examples here.
Soda Cloud is a paid service, and I think that is fair for what it provides.