Etiq Docs
Drift RCA Scan
Description of Drift RCA Scan
There are currently two kinds of RCA drift scan in the etiq library:
  1. scan_drift_metrics_rca
  2. scan_target_drift_metrics_rca
These are used to scan for feature and target drift respectively.
As a quick refresher:
Feature drift takes place when the distributions of the input features change. For instance, perhaps you built your model on a sample dataset from the winter period and it is now summer, so your model predicting what kind of dessert people are more likely to buy is no longer as accurate.
Similarly, target drift occurs when the distribution of the predicted feature changes from one time period to the next.
For more details look at the Drift Scan Type section.
If only part of the data has drifted, your overall tests might not pick up on it, but an RCA scan would. The scan auto-discovers problematic segments on its own, without the need for the user to specify segments to test.
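To see why a segment-level check can catch what a whole-dataset test misses, here is a minimal sketch. It uses a textbook population stability index (PSI) over fixed bins. The `psi` helper, the toy data and the "north"/"south" segments are all assumptions made for illustration; this is not the etiq implementation, and it does not reproduce how etiq discovers segments.

```python
import numpy as np

def psi(expected, actual, bins):
    """Textbook PSI between two samples using shared bin edges (illustrative only)."""
    eps = 1e-6  # avoids log(0) and division by zero for empty bins
    e_counts, _ = np.histogram(expected, bins=bins)
    a_counts, _ = np.histogram(actual, bins=bins)
    e_frac = np.clip(e_counts / e_counts.sum(), eps, None)
    a_frac = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Toy data: hours worked per week for two hypothetical segments.
# The large "north" segment is unchanged; only the small "south" segment drifts.
base_north = np.tile([20, 30, 40, 50], 450)   # 1800 rows
base_south = np.tile([20, 30, 40, 50], 50)    # 200 rows
new_north = base_north.copy()
new_south = np.tile([40, 50, 50, 50], 50)     # shifted towards longer hours

bins = [0, 25, 35, 45, 100]
overall = psi(np.concatenate([base_north, base_south]),
              np.concatenate([new_north, new_south]), bins)
south_only = psi(base_south, new_south, bins)

print(f"overall PSI: {overall:.3f}")   # stays under an upper threshold of 0.15
print(f"south PSI:   {south_only:.3f}")  # well above 0.15
```

With an upper threshold of 0.15, the combined data would pass while the small drifting segment would be flagged, which is the situation an RCA scan is meant to surface.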
RCA scans for concept drift are not currently implemented, but they are on our roadmap and should be introduced in the near future.
For both types of scans, you have to set the parameters below:
  • thresholds: This is a dictionary indexed by measure name, specifying a lower and an upper threshold beyond which a value indicates a drift issue
  • drift_measures: This is a list of drift measures to use for the scan. If not specified, all drift measures, including user-defined ones, are used.
  • ignore_lower_threshold: This boolean flag allows the lower threshold to be ignored when scanning for segments with drift issues (by default this is set to True)
  • ignore_upper_threshold: This boolean flag allows the upper threshold to be ignored when scanning for segments with drift issues (by default this is set to True)
  • minimum_segment_size: This allows the user to set the minimum number of samples in a segment before the segment can be considered significant. By default this is set to 2% of the number of samples.
In addition, scan_drift_metrics_rca takes one further parameter:
  • features: This is a list of features to consider when scanning for feature drift. By default all features in the snapshot dataset will be scanned.
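One plausible reading of how the threshold flags and the minimum segment size combine is sketched below. The functions `is_drift_issue` and `default_minimum_segment_size` are hypothetical and not part of the etiq API. The sketch assumes a segment is flagged when its measure falls below the lower bound or above the upper bound, unless that bound is ignored, and that the 2% default is rounded down.

```python
def is_drift_issue(value, thresholds, *, ignore_lower_threshold,
                   ignore_upper_threshold):
    """Hypothetical check: does a drift measure value breach its thresholds?"""
    lower, upper = thresholds
    below = (not ignore_lower_threshold) and value < lower
    above = (not ignore_upper_threshold) and value > upper
    return below or above

def default_minimum_segment_size(n_samples):
    """Assumed default: 2% of the number of samples, rounded down."""
    return n_samples * 2 // 100

thresholds = {"psi": [0.0, 0.15]}

# Only the upper bound is active here, so 0.22 is flagged and 0.05 is not.
flagged = is_drift_issue(0.22, thresholds["psi"],
                         ignore_lower_threshold=True,
                         ignore_upper_threshold=False)
not_flagged = is_drift_issue(0.05, thresholds["psi"],
                             ignore_lower_threshold=True,
                             ignore_upper_threshold=False)
print(flagged, not_flagged)
print(default_minimum_segment_size(50_000))
```

Under this reading, ignoring both bounds means no value can ever be flagged, so at least one bound normally needs to stay active for the scan to report anything.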
An example config file is as follows:
{
  "dataset": {
    "label": "income",
    "bias_params": {
      "protected": "gender",
      "privileged": 1,
      "unprivileged": 0,
      "positive_outcome_label": 1,
      "negative_outcome_label": 0
    },
    "train_valid_test_splits": [0.8, 0.2, 0.0],
    "cat_col": "cat_vars",
    "cont_col": "cont_vars"
  },
  "scan_target_drift_metrics_rca": {
    "thresholds": {
      "psi": [0.0, 0.15]
    },
    "drift_measures": ["psi"],
    "ignore_lower_threshold": true,
    "ignore_upper_threshold": false,
    "minimum_segment_size": 1000
  },
  "scan_drift_metrics_rca": {
    "thresholds": {
      "psi": [0.0, 0.15]
    },
    "drift_measures": ["psi"],
    "ignore_lower_threshold": true,
    "ignore_upper_threshold": false,
    "minimum_segment_size": 1000,
    "features": ["hours-per-week"]
  }
}
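Before running a scan it can help to sanity-check an RCA section of the config. The sketch below uses only Python's standard `json` module. The `check_rca_section` helper and the rules it enforces (every listed measure has a `[lower, upper]` threshold pair with lower not exceeding upper, and `minimum_segment_size` is positive) are assumptions drawn from the parameter descriptions, not validation that etiq itself is known to perform.

```python
import json

# A trimmed copy of one scan section; in practice this would be read from the config file.
CONFIG_TEXT = """
{
  "scan_drift_metrics_rca": {
    "thresholds": {"psi": [0.0, 0.15]},
    "drift_measures": ["psi"],
    "ignore_lower_threshold": true,
    "ignore_upper_threshold": false,
    "minimum_segment_size": 1000,
    "features": ["hours-per-week"]
  }
}
"""

def check_rca_section(section):
    """Return a list of problems found in one RCA scan section (hypothetical helper)."""
    problems = []
    thresholds = section.get("thresholds", {})
    for measure in section.get("drift_measures", []):
        pair = thresholds.get(measure)
        if pair is None:
            problems.append(f"no thresholds for measure '{measure}'")
        elif len(pair) != 2 or pair[0] > pair[1]:
            problems.append(f"thresholds for '{measure}' must be [lower, upper]")
    size = section.get("minimum_segment_size")
    if size is not None and size <= 0:
        problems.append("minimum_segment_size must be positive")
    return problems

config = json.loads(CONFIG_TEXT)
problems = check_rca_section(config["scan_drift_metrics_rca"])
print(problems or "config section looks consistent")
```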
You can use any of the drift measures already provided.
For more details about how RCA scans work for target and feature drift see the notebook at https://github.com/ETIQ-AI/ml-testing/tree/main/RCA/RCA%20Drift%20Metrics.