databricks_notebooks

A collection of Databricks notebooks for testing and learning


πŸ“ Table of Contents

Great Expectations - Open Source Data Quality Tool

Great Expectations fills a gap that exists in our data pipelines. As we transition from Big Data to Good Data, teams have begun to realize the importance of good data. But the definition of good has also evolved over time. We want metrics that define the health of our pipeline and beyond.

Concepts in great-expectations

  1. Data Context
    1. Data Sources
    2. Data Connectors
  2. Stores
    1. Expectations Store
      • Expectations are stored
      • Backend: Azure Storage
    2. Validations Store
      • Validation results are stored. It's the output produced when an expectation is run on a batch of data.
      • Backend: Azure Storage
    3. Evaluation Parameter store
      • Yet to figure out
      • I want to compare today's row count with yesterday's row count for a specific batch. How do I do it?
    4. Profile Store
      • There is still work happening on this one.
      • What does this mean?
    5. Metrics Store
      • Metrics extracted from validation results are stored.
      • Backend: Postgres
      • Only some metrics are being inserted.
      • This feature is still evolving.
    6. Checkpoint Store
      • Checkpoint definitions are stored.
      • Backend: Azure Storage
  3. Data docs Sites
    • Backend: Azure Storage
    • The static HTML site is built here.
  4. Checkpoints
    • Backend: Azure Storage
    • The complete YAML files are persisted.
    • Can define email alerts.
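
Pulling these concepts together, below is a minimal sketch of how a Data Context with Azure-backed stores might be declared in a Databricks notebook. The connection string, container, prefixes, and datasource/connector names are all placeholders, and the class names follow the BaseDataContext-style configuration used in these notebooks; check your Great Expectations version's docs for the current equivalents.

```python
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig

# Placeholder values -- replace with your own storage account details.
AZ_CONN = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
CONTAINER = "great-expectations"


def azure_backend(prefix):
    """Build a TupleAzureBlobStoreBackend config for a given prefix."""
    return {
        "class_name": "TupleAzureBlobStoreBackend",
        "container": CONTAINER,
        "prefix": prefix,
        "connection_string": AZ_CONN,
    }


project_config = DataContextConfig(
    # One Spark datasource with a runtime connector for in-notebook DataFrames.
    datasources={
        "spark_datasource": {
            "class_name": "Datasource",
            "execution_engine": {"class_name": "SparkDFExecutionEngine"},
            "data_connectors": {
                "runtime_connector": {
                    "class_name": "RuntimeDataConnector",
                    "batch_identifiers": ["run_id"],
                }
            },
        }
    },
    # Expectations, validations, and checkpoints persisted to Azure Storage.
    stores={
        "expectations_store": {"class_name": "ExpectationsStore", "store_backend": azure_backend("expectations")},
        "validations_store": {"class_name": "ValidationsStore", "store_backend": azure_backend("validations")},
        "checkpoint_store": {"class_name": "CheckpointStore", "store_backend": azure_backend("checkpoints")},
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_store",
    validations_store_name="validations_store",
    checkpoint_store_name="checkpoint_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    # The static Data Docs site, also written to Azure Storage.
    data_docs_sites={
        "az_site": {
            "class_name": "SiteBuilder",
            "store_backend": azure_backend("data_docs"),
            "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
        }
    },
)

context = BaseDataContext(project_config=project_config)
```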

Installation

Installation was a cakewalk. I have tried this in the Databricks Community Edition.
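
In a notebook cell, the library can be installed with %pip. The azure-storage-blob package is my assumption here; it is only needed if you use the Azure-backed stores described above.

```python
# Databricks notebook cell: install Great Expectations for this cluster session.
# azure-storage-blob is only needed for the Azure Storage backends (assumption).
%pip install great_expectations azure-storage-blob
```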

Notebooks reference

  1. Set up the cluster by installing the required libraries: Setup cluster
  2. Define the BaseDataContext for GE: Setup Data Context
  3. Define expectations and run tests: Data Quality (a rough sketch of this step follows below)
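
As a rough illustration of the third notebook, here is a hedged sketch of defining an expectation suite against a Spark DataFrame and validating it. It assumes the `context` and the `spark_datasource`/`runtime_connector` names from the configuration sketch above; `df`, the asset name, and the suite name are illustrative.

```python
from great_expectations.core.batch import RuntimeBatchRequest

# Assumes `context` from the Data Context sketch above and a Spark DataFrame `df`.
batch_request = RuntimeBatchRequest(
    datasource_name="spark_datasource",
    data_connector_name="runtime_connector",
    data_asset_name="customers",            # illustrative asset name
    runtime_parameters={"batch_data": df},
    batch_identifiers={"run_id": "2021-06-01"},
)

suite_name = "customers.warning"            # illustrative suite name
context.create_expectation_suite(suite_name, overwrite_existing=True)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=suite_name,
)

# Define a couple of expectations and persist the suite to the Expectations Store.
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_table_row_count_to_be_between(min_value=1)
validator.save_expectation_suite(discard_failed_expectations=False)

# Run the suite against this batch of data and inspect the overall outcome.
results = validator.validate()
print(results.success)
```

Running the same suite through a Checkpoint instead of `validator.validate()` is what persists the results to the Validations Store and refreshes Data Docs.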

Data Docs

Please take a look at the Data Docs that GE is able to build: Data Docs
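
Once validations have run, the static site can be rebuilt straight from the notebook; both calls below are standard Data Context methods.

```python
# Rebuild the static Data Docs site in the configured Azure backend,
# then print the URL(s) where the site can be browsed.
context.build_data_docs()
print(context.get_docs_sites_urls())
```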

Advantages (from my perspective)

  1. The open-source version is complete in terms of defining data quality checks, executing them, and persisting the expectations and the results. Finally, Data Docs makes them presentable.
  2. Helps in building a static website for Data Docs that can be shared across the company/team.
  3. Can easily translate results to a Spark DataFrame for persisting them in a database (see the sketch after this list).
  4. Can build a wrapper around the configs to make onboarding easier.
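
For point 3, here is a rough sketch of persisting validation results. It assumes the `results` object from the validation sketch above and the `spark` session that Databricks provides; the target table name is illustrative.

```python
import json

# Flatten each expectation's outcome into a row; kwargs/result kept as JSON strings.
rows = [
    (
        r.expectation_config.expectation_type,
        bool(r.success),
        json.dumps(r.expectation_config.kwargs, default=str),
        json.dumps(r.result, default=str),
    )
    for r in results.results
]

result_df = spark.createDataFrame(
    rows, "expectation_type string, success boolean, kwargs string, result string"
)

# Illustrative target table -- adjust database/table names to your workspace.
result_df.write.mode("append").saveAsTable("data_quality.validation_results")
```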

Open Questions or To-Do

✍️ Authors

🎉 Acknowledgements