Great Expectations

Details of the Great Expectations integration with the ETIQ library

Feature added in Etiq 1.6

Overview

ETIQ adds integration with the Great Expectations OSS library. This allows you to quickly add a suite of tests for your dataset and show the results in the ETIQ dashboard.

The Great Expectations Python library needs to be installed if you want to use this functionality:

pip install great_expectations

Use Cases

The library exposes Snapshot.scan_expectations(), through which you can run suites or import existing results. Suites can come from an existing context, be declared manually in code, or be declared via a JSON config.

For the results to be shown in the ETIQ dashboard, you need to run etiq.login() before starting the scan.
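
As a minimal sketch of this step (assuming the login call takes your dashboard URL and an API token; both values below are placeholders, not values from this guide):

import etiq

# Log in so that subsequent scan results are uploaded to the dashboard.
# Replace the URL and token with your own dashboard details.
etiq.login("https://dashboard.etiq.ai/", "<your-api-token>")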

Running An Existing Expectation Suite

If you have an existing expectation suite, you can pass that suite as an argument to Snapshot.scan_expectations():

import great_expectations as ge

context = ge.get_context()

# What Suites do we have?
suite_names = context.list_expectation_suite_names()
# For example...
chosen_suite = suite_names[0]

# Now tell ETIQ to run the chosen suite against your snapshot
# (an Etiq snapshot created earlier, as shown in "Declaring Expectations In Code" below):
(segments, issues, aggregates) = snapshot.scan_expectations(context=context, suite_name=chosen_suite)

Importing Existing Results

If you've already run the suite, you can pass the results in directly. This will cause the results to be uploaded to the dashboard:

# existing_suite_checkpoint is a checkpoint from your own existing suite.
my_results = existing_suite_checkpoint.run()

# This will upload results to the dashboard if logged in.
# It also returns the results in the usual etiq style: three pandas DataFrames
# covering segments, issues and aggregated issues.
(segments, issues, aggregates) = snapshot.scan_expectations(results=my_results)
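
Since the returned values are pandas DataFrames, you can inspect them directly once the scan has run, for example:

# Quick look at any issues raised by the expectation suite.
print(f"{len(issues)} expectation issues found")
print(issues.head())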

Declaring Expectations In Code

You can declare your expectations in code directly, e.g. as part of a notebook. We supply a helper method Snapshot.get_validator() to quickly get a Great Expectations validator object. This will then have all the expectations available to you. The Great Expectations website holds a useful reference list.

# Regular snapshot created in ETIQ
snapshot = project.snapshots.create(
    name="My Simple expectations",
    dataset=dataset,
    model=etiq.model.DefaultXGBoostClassifier(),
)

# Get a regular GX validator from the snapshot.
validator = snapshot.get_validator()

# Now we can declare our expectations as normal GX expectations. For example:
validator.expect_column_values_to_not_be_null("age")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=70)

# Finally, run the validation by calling `scan_expectations`:
segments, issues, aggregates = snapshot.scan_expectations(validator=validator)

These expectations are only run when scan_expectations is called.
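
If you want to see what has been queued before the scan runs, the standard Great Expectations Validator API can list the registered expectations. A small sketch (attribute names may differ slightly between Great Expectations versions):

# List the expectations currently registered on the validator.
suite = validator.get_expectation_suite(discard_failed_expectations=False)
for expectation in suite.expectations:
    print(expectation.expectation_type, expectation.kwargs)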

Declaring Expectations in JSON Config

Much of the ETIQ library can be driven using a JSON config file. We provide a simple way to specify expectations in this text-based format:

expectation_config.json
{
  "scan_expectations": {
    "json_suite": [
      {
        "expect_column_values_to_not_be_null": "age"
      },
      {
        "expect_column_values_to_be_between": {
          "column": "age",
          "min_value": 0,
          "max_value": 120,
        }
      }, 
    ]
  }
}

The Python code is then just:

# ...
from etiq import etiq_config
# ...

with etiq_config("expectation_config.json"):
    segments, issues, aggregates = snapshot.scan_expectations()

Syntax/Required Sections

  • scan_expectations - An Object.

  • scan_expectations.json_suite - A list of JSON expectations.

Expectations can be specified in two ways:

  1. If the expectation takes a single argument, it can just be {<expectation_name>: <argument>}

  2. If the expectation takes more than one argument, the value should be an object where each key-value pair gives an argument name and its value.

The above expectations translate to:

validator.expect_column_values_to_not_be_null("age")
validator.expect_column_values_to_be_between(column="age", min_value=0, max_value=120)

Integration API Documentation

Snapshot.get_validator() -> Validator

This convenience method returns a validator object for creating expectation suites against.

Snapshot.scan_expectations(validator, context, suite_name, results)

This is the entry point for running expectations. Unless you are using the JSON config method above, you must pass in one or more of these arguments:

  • validator - A GX Validator object to run.

  • context - A GX context object. This would contain an existing suite to run.

  • suite_name - String, name of suite to run from the context.

  • results - Existing results to send to the dashboard.
