Drift RCA Scan

Description of Drift RCA Scan

The etiq library currently provides two kinds of RCA drift scan, one for feature drift and one for target drift:

  1. scan_drift_metrics_rca

  2. scan_target_drift_metrics_rca

As a quick refresher:

Feature drift takes place when the distributions of the input features change. For instance, perhaps you built your model on a sample dataset from the winter period and it's now summer, so your model predicting what kind of dessert people are more likely to buy is no longer as accurate.

Similarly, target drift takes place when the distribution of the predicted feature (the target) changes from one time period to the next.

For more details look at the Drift Scan Type section.

If only part of the data has drifted, your overall drift tests might not pick up on it, but an RCA scan would. The scan auto-discovers problematic segments on its own, without the user having to specify which segments to test.
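To illustrate why a segment-level scan can catch what an overall test misses, here is a minimal sketch using NumPy only. It does not use the etiq library: the `psi` helper is a generic Population Stability Index implementation written for this example, and the synthetic data, the 5% segment and the 0.15 threshold are assumptions chosen for illustration.

```python
import numpy as np


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two 1-D samples (illustrative helper)."""
    # Use shared bin edges so both samples are compared on the same grid.
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) and division by zero for empty bins.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


rng = np.random.default_rng(0)
n = 20_000

# Hypothetical segment covering roughly 5% of the rows.
in_segment = rng.random(n) < 0.05

base = rng.normal(40, 10, n)      # values at training time
current = rng.normal(40, 10, n)   # values at scoring time
current[in_segment] += 15         # drift happens only inside the segment

overall_psi = psi(base, current)
segment_psi = psi(base[in_segment], current[in_segment])

print(f"overall PSI: {overall_psi:.3f}")
print(f"segment PSI: {segment_psi:.3f}")
```

In this sketch the overall PSI stays under 0.15 while the PSI inside the segment is far above it, which is the situation the RCA scans are designed to surface.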

RCA scans for concept drift are not currently implemented, but they are on our roadmap and should be introduced in the near future.

For both types of scans, you have to set the parameters below:

Parameter | Description

thresholds | A dictionary indexed by measure name, specifying a lower and an upper threshold that indicate a drift issue.

drift_measures | A list of drift measures to use for the scan. If not specified, all drift measures, including user-defined ones, are used.

ignore_lower_threshold | A boolean flag that allows the lower threshold to be ignored when scanning for segments with drift issues (set to True by default).

ignore_upper_threshold | A boolean flag that allows the upper threshold to be ignored when scanning for segments with drift issues (set to True by default).

minimum_segment_size | The minimum number of samples a segment must contain before it can be considered significant. By default this is set to 2% of the number of samples.

features | (Additional parameter, only used for scan_drift_metrics_rca.) A list of features to consider when scanning for feature drift. By default all features in the snapshot dataset are scanned.
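One way to read the thresholds and the two ignore flags together is sketched below. This is not etiq's implementation: `breaches_thresholds` is a hypothetical helper, and it assumes a segment is flagged when the measure falls below the lower threshold or rises above the upper one, unless that side is ignored.

```python
def breaches_thresholds(value, thresholds, *, ignore_lower, ignore_upper):
    """Return True if a drift measure value counts as an issue (assumed semantics)."""
    lower, upper = thresholds
    below = (not ignore_lower) and value < lower
    above = (not ignore_upper) and value > upper
    return below or above


psi_thresholds = [0.0, 0.15]

# Only the upper threshold is active here.
print(breaches_thresholds(0.20, psi_thresholds, ignore_lower=True, ignore_upper=False))
print(breaches_thresholds(0.10, psi_thresholds, ignore_lower=True, ignore_upper=False))
```

Under these assumed semantics, ignoring both thresholds would mean no segment is ever flagged, so at least one of the two flags normally needs to be set to false.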

An example config file is as follows:

{
    "dataset": {
        "label": "income",
        "bias_params": {
            "protected": "gender",
            "privileged": 1,
            "unprivileged": 0,
            "positive_outcome_label": 1,
            "negative_outcome_label": 0
        },
        "train_valid_test_splits": [0.8, 0.2, 0.0],
        "cat_col": "cat_vars",
        "cont_col": "cont_vars"
    },
    "scan_target_drift_metrics_rca": {
        "thresholds": {
            "psi": [0.0, 0.15]
        },
        "drift_measures": ["psi"],
        "ignore_lower_threshold": true,
        "ignore_upper_threshold": false,
        "minimum_segment_size": 1000  
    },
    "scan_drift_metrics_rca": {
        "thresholds": {
            "psi": [0.0, 0.15]
        },
        "drift_measures": ["psi"],
        "ignore_lower_threshold": true,
        "ignore_upper_threshold": false,
        "minimum_segment_size": 1000,
        "features": ["hours-per-week"] 
    }          
}
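Before running a scan it can help to sanity-check the RCA sections of a config like the one above. The sketch below uses only the Python standard library. `check_rca_config` is a hypothetical helper, not part of etiq, and the checks only encode what the parameter descriptions on this page state.

```python
import json

# The two RCA sections from the example config (dataset section omitted).
CONFIG_TEXT = """
{
    "scan_target_drift_metrics_rca": {
        "thresholds": {"psi": [0.0, 0.15]},
        "drift_measures": ["psi"],
        "ignore_lower_threshold": true,
        "ignore_upper_threshold": false,
        "minimum_segment_size": 1000
    },
    "scan_drift_metrics_rca": {
        "thresholds": {"psi": [0.0, 0.15]},
        "drift_measures": ["psi"],
        "ignore_lower_threshold": true,
        "ignore_upper_threshold": false,
        "minimum_segment_size": 1000,
        "features": ["hours-per-week"]
    }
}
"""

RCA_SCANS = ("scan_drift_metrics_rca", "scan_target_drift_metrics_rca")


def check_rca_config(config):
    """Return a list of human-readable problems found in the RCA sections."""
    problems = []
    for scan in RCA_SCANS:
        params = config.get(scan)
        if params is None:
            continue
        # Each threshold entry should be a [lower, upper] pair.
        for measure, bounds in params.get("thresholds", {}).items():
            if len(bounds) != 2 or bounds[0] > bounds[1]:
                problems.append(f"{scan}: thresholds[{measure!r}] should be [lower, upper]")
        # The ignore flags are booleans.
        for flag in ("ignore_lower_threshold", "ignore_upper_threshold"):
            if flag in params and not isinstance(params[flag], bool):
                problems.append(f"{scan}: {flag} should be true or false")
        # A segment needs at least one sample.
        size = params.get("minimum_segment_size")
        if size is not None and size <= 0:
            problems.append(f"{scan}: minimum_segment_size should be positive")
        # 'features' is only used by the feature drift scan.
        if "features" in params and scan != "scan_drift_metrics_rca":
            problems.append(f"{scan}: 'features' only applies to scan_drift_metrics_rca")
    return problems


problems = check_rca_config(json.loads(CONFIG_TEXT))
print(problems or "RCA config looks consistent")
```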

You can use any of the drift measures already provided.

The minimum segment size will affect the results. By default the scans use 2% of your sample size, which is also the value we recommend. However, if 2% of your sample is too small to form a significant segment (e.g. fewer than 1000 samples), please set a larger value.
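The recommendation above amounts to a small calculation, sketched here. `recommended_min_segment_size` is a hypothetical helper, and the floor of 1000 samples is simply the example figure mentioned above, not a fixed rule.

```python
def recommended_min_segment_size(n_samples, fraction=0.02, floor=1000):
    """2% of the sample size, raised to a floor when 2% would be too small."""
    return max(int(n_samples * fraction), floor)


print(recommended_min_segment_size(200_000))  # 2% is 4000, above the floor
print(recommended_min_segment_size(20_000))   # 2% is 400, so the floor applies
```

The resulting number can then be used as the minimum_segment_size value in the config file.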

For more details about how RCA scans work for target and feature drift see the notebook at https://github.com/ETIQ-AI/ml-testing/tree/main/RCA/RCA%20Drift%20Metrics.
