Preventing Data Quality Problems from Hitting the Production Environment Using Great Expectations

By Itay Braun, CTO, Metis

In this post you'll learn how to apply data quality tests using an open-source platform called Great Expectations.

Why Should I Care?

Data quality matters because it determines the accuracy and reliability of the information used to make decisions. Poor-quality data can lead to incorrect conclusions and faulty decision-making, which can have serious consequences. In a healthcare setting, for example, poor-quality data can lead to incorrect diagnoses and inappropriate treatment, putting patients at risk. In a business setting, it can lead to bad financial decisions, resulting in financial losses and damage to the company's reputation. In short, ensuring data quality is essential for trusting the decisions built on it.

Can you give me 3 real-world examples?

  1. A database that has data with missing or incorrect values, leading to incorrect results when querying the database.

  2. A data analysis model that is trained on data with bias, leading to inaccurate or unfair predictions.

  3. A data pipeline that has errors in the data transformation process, leading to inconsistent or incorrect data.

Those problems are relevant to any organization: a hospital that has incorrectly entered patient information, a retail company whose inaccurate sales data leads to poor inventory management, or a financial institution whose incorrect customer information produces wrong credit scores.

Using Great Expectations - The Quick Way

Great Expectations is an open-source tool that helps organizations improve the quality of their data. It provides a framework for defining and testing expectations for data, and allows users to validate and document their data. Great Expectations can be used to check for common data quality issues, such as missing values, out-of-range values, and duplicates, as well as more complex issues, such as data drift and hidden bias. It can be integrated into existing data pipelines and workflows to provide ongoing data quality assurance.

In this post you'll learn a simple and quick implementation, using Python:

  • Save the data from a query to a pandas DataFrame.

  • Create a Great Expectations dataset from the DataFrame.

  • Apply expectations (= rules), such as "at least 99% of the values must be within a given range".

  • View the results.

It is worth mentioning that there are other good open-source data quality platforms, such as Soda, dbt tests, and Deequ.

Step 1 - Get the Data

Usually, you'll only care about recent data. Sometimes the data can be analyzed in batch, say all the rows inserted or updated during the last day, using a SQL query. If the data needs to be analyzed more often, consider setting up a read-only replica of the DB and running the data quality tests there. For example, an AWS RDS read-only replica can be set up in a few clicks.

If you need real-time data quality, you'll need Change Data Capture (CDC) to detect the changes. I won't cover that subject in this post.

import great_expectations as ge
import pandas as pd
import sqlalchemy
import json

engine = sqlalchemy.create_engine('postgresql://postgres:***@***.***.eu-central-1.rds.amazonaws.com:5432/airbases')
# print("Connecting with engine " + str(engine))
connection = engine.connect()

query = """
SELECT * FROM postgres_air.flight limit 1000;
"""
flight = pd.read_sql_query(query, connection)
## Generate a Great Expectations DataFrame (the exact type is great_expectations.dataset.pandas_dataset.PandasDataset).
## This object exposes the built-in expectation methods
df = ge.from_pandas(flight)

In this example, the query just used the first 1000 rows. In the real world, the query would run only on the data changed since the last check.
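As a minimal sketch of that incremental approach, assuming the flight table carries an update_ts timestamp column (the column name is an assumption; adjust it to your schema), the query could look like this:

query = """
SELECT *
FROM postgres_air.flight
WHERE update_ts >= now() - interval '1 day';  -- only rows changed during the last day
"""
flight = pd.read_sql_query(query, connection)
df = ge.from_pandas(flight)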

Step 2 - Run the Expectations

Each expectation uses a different configuration, documented here. I added a short description to each data quality test for better explainability.

## Evaluate an expectation (type: great_expectations.core.expectation_validation_result.ExpectationValidationResult).
expectation_results = df.expect_column_values_to_be_unique("flight_id")
## Add a custom description
expectation_results.description = "Unique values in the column flight_id"

print("1. Test Name: " + expectation_results.description)
print("Success: " + str(expectation_results.success))
print("Type: " + str(expectation_results.expectation_config.expectation_type))

# print(expectation_results) ## Print the full results. Notice this is not JSON, though it looks like one. 

expectation_results = df.expect_column_values_to_be_between(
  column = "flight_no", 
  min_value = "100", 
  max_value = "1800",
  mostly = 0.99
)
expectation_results.description = "At least 99% of the results are within the range 100 to 1800"
print("2. Test Name: " + expectation_results.description)
print("Success: " + str(expectation_results.success))
print("Type: " + str(expectation_results.expectation_config.expectation_type))
if not expectation_results.success:
    print(expectation_results)

Step 3 - View the Results

In this simple example, the results are printed on the screen. In an automated, real-world scenario you would use the native Great Expectations GitHub Action, which sends the diagnostic report to their cloud, or a custom Python script integrated with any other code or API to implement your own workflow (a sketch follows the output below).

1. Test Name: Unique values in the column flight_id
Success: True
Type: expect_column_values_to_be_unique
2. Test Name: At least 99% of the results are within the range 100 to 1800
Success: False
Type: expect_column_values_to_be_between
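As a small sketch of that custom-script route (the webhook URL and payload shape here are my own assumptions, not part of Great Expectations), you could collect the validation results and push any failures to a Slack-style webhook:

import requests

## Hypothetical webhook endpoint - replace with your Slack / Teams / internal URL
WEBHOOK_URL = "https://hooks.example.com/data-quality"

## Collect every result object evaluated above and keep only the failures
results = [expectation_results]
failures = [r for r in results if not r.success]

for r in failures:
    requests.post(WEBHOOK_URL, json={
        "text": "Data quality test failed: " + r.expectation_config.expectation_type
    })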

When something goes wrong, you will probably want to see the full expectation_results object (as JSON) to get a better understanding of the problem. In the example below, the second expectation failed because fewer than 99% of the values fall within the configured range.
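A minimal way to dump the result as real JSON (assuming the to_json_dict() method on the validation result object, which worked in the version I tested):

## json was already imported in step 1
## Serialize the validation result to a plain dict and pretty-print it as JSON
print(json.dumps(expectation_results.to_json_dict(), indent=2))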

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 1000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 825,
    "unexpected_percent": 82.5,
    "unexpected_percent_total": 82.5,
    "unexpected_percent_nonmissing": 82.5,
    "partial_unexpected_list": [
      "385",
      "3535",
      "5093",
      "189",
      "2574",
      "4238",
      "541",
      "2408",
      "4861",
      "1895",
      "4519",
      "5500",
      "352",
      "4922",
      "3735",
      "3002",
      "358",
      "5185",
      "2068",
      "390"
    ]
  },
  "meta": {},
  "expectation_config": {
    "kwargs": {
      "column": "flight_no",
      "min_value": "100",
      "max_value": "1800",
      "mostly": 0.99,
      "result_format": "BASIC"
    },
    "meta": {},
    "expectation_type": "expect_column_values_to_be_between"
  },
  "success": false
}

I suspect that the comparison treats "100" and "1800" as strings, so the string "390" is seen as out of range (lexicographically, "390" sorts after "1800"). However, using integers generated weird errors, probably related to the column's data types. More testing is needed on our side before we'll use this library to monitor the data quality of our app in production.
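If the data type is indeed the culprit, one possible workaround (a sketch, assuming the flight_no values are numeric strings) is to cast the column before building the dataset and pass integer bounds:

## Cast flight_no to a numeric dtype; values that cannot be parsed become NaN
flight["flight_no"] = pd.to_numeric(flight["flight_no"], errors="coerce")
df = ge.from_pandas(flight)

expectation_results = df.expect_column_values_to_be_between(
  column = "flight_no",
  min_value = 100,
  max_value = 1800,
  mostly = 0.99
)
print("Success: " + str(expectation_results.success))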

Custom Rules (= Expectations)

The out-of-the-box expectations are useful but quite simple and naive. In reality, you want to write custom, advanced expectations, related directly to recent changes in the app's logic, to monitor the production DB after releasing a new version.

Technically, it is possible to add custom expectations, as explained here. We are still learning how to simplify the process, as I see the ability to easily add custom rules as critical for adopting this, or any other, data quality framework.
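As a rough sketch of what that can look like with the dataset-style API used in this post (the class, the expectation name, and the "plausible flight number" rule are illustrative assumptions, and I'm assuming from_pandas accepts a dataset_class argument as described in the legacy docs):

from great_expectations.dataset import PandasDataset, MetaPandasDataset

class CustomFlightDataset(PandasDataset):
    _data_asset_type = "CustomFlightDataset"

    ## column_map_expectation evaluates the check value by value and returns a boolean per row
    @MetaPandasDataset.column_map_expectation
    def expect_column_values_to_be_plausible_flight_no(self, column):
        ## Illustrative rule: a flight number is a 1-4 digit numeric string
        return column.astype(str).str.fullmatch(r"\d{1,4}")

## Build the dataset with the custom class so the new expectation is available
df = ge.from_pandas(flight, dataset_class=CustomFlightDataset)
expectation_results = df.expect_column_values_to_be_plausible_flight_no("flight_no", mostly=0.99)
print("Success: " + str(expectation_results.success))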

Conclusion

Bad data is another DB-related problem. A smart developer should constantly monitor data quality to find problems as quickly as possible, ideally before they hit the production environment.

Great Expectations is one of the most popular data quality frameworks. Here we learned how to quickly implement it, but the developer experience (DX) can still be better.

By the way, I used OpenAI to write the introduction. It is very useful and lets you focus on the more important and complicated technical parts.