Drift can impact your model in production and make it perform worse than you initially expected.
There are a few different kinds of drift:
- 1. Feature drift: Feature drift takes place when the distributions of the input features change.
- For instance, perhaps you built your model on a sample dataset from the winter period and it's now summer, and your model predicting what kind of dessert people are more likely to buy is no longer as accurate.
- 2. Target drift: Similar to feature drift, target drift occurs when the distribution of the predicted feature changes from one time period to the next.
- 3. Concept drift: Concept drift occurs when the relationships between the input features and the predicted feature change over time.
- 4. Prediction drift: Prediction drift refers to instances where something has happened to the model scoring itself while running in production.
- This means that with the same or a similar input dataset you'd get different predictions in the current period than you did in the previous period.
We do not include scans related to prediction drift, as we think other scans are more likely to uncover these issues, given our main use cases right now (classification).
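As a minimal, self-contained illustration of feature drift (synthetic data only, not the etiq API), consider a single input feature whose distribution shifts between the period the model was built on and the current period:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: the same input feature sampled in two periods.
winter = rng.normal(loc=5.0, scale=1.0, size=10_000)  # training-period sample
summer = rng.normal(loc=8.0, scale=1.0, size=10_000)  # production-period sample

# A first symptom of feature drift: the sample means have moved apart,
# so a model fitted on the winter data now sees inputs it rarely saw in training.
print(f"mean shift: {abs(summer.mean() - winter.mean()):.2f}")  # roughly 3.0 here
```

The drift measures described later in this section formalise this kind of comparison instead of relying on a single summary statistic like the mean.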
To measure drift you will need a comparison or benchmark dataset. To load your comparison dataset use the following:
# Create a dataset with the comparison data
dataset_s = etiq.SimpleDatasetBuilder.from_dataframe(data_encoded, target_feature='income').build()
# Create a dataset with the current period's data
todays_dataset_s = etiq.SimpleDatasetBuilder.from_dataframe(todays_dataset_df, target_feature='income').build()
Then log your snapshot and scan the dataset for drift:
# Create the snapshot
snapshot = project.snapshots.create(name="Test Snapshot", dataset=todays_dataset_s, comparison_dataset=dataset_s, model=None)
To run a drift scan you will not need a model, but you will need at least 2 datasets. For each snapshot, you log the current period dataset and the previous period dataset. When you call the drift scan, it will assess whether any drift issues occurred.
While most people think of drift as something that happens in production, in fact it is something you can test for as you build your model as well. If you have datasets from a different time period (a year ago, a quarter ago, seasonal), then you might want to see if the distributions of the features and labels have drifted over time (feature/target drift) or if the type of relationships between features and target have changed over time.
Thus, if you have the datasets, we recommend using these scans pre-production to get an indication of potential issues you might encounter when your model does go live.
So far we have packaged the drift scans with an assumption relevant to Target Drift and Concept Drift: in production, you will have to create datasets containing actuals. Irrespective of what the model predicted, this feature refers to what actually happened in reality, e.g. whether the customer defaulted on their loan or whether the transaction was fraudulent. At the moment you will use the same parameter in your config file, ‘label’, but here this feature denotes what actually happened in reality. If requested by our users, we are open to adding an actuals dataset type to the packaging in the future, and separately a ‘predicted’ feature.
For drift in production, just like for the other scans that can be used in production, we will shortly release an Etiq + Airflow demo. If you can’t wait or you use a different orchestration tool, please email us at [email protected].
Depending on the type of scan, your use case, and the stage you are at in the model building/deployment process, there are multiple ways to set up your datasets, sampling percentages, and scans. For a common classification use case, please see the suggested set-up below:
Remember you won’t need the model for your drift scans, just for your other scans.
Drift scans try to detect differences between two probability distributions. The feature drift and target drift scans use PSI and the KS test:
Population Stability Index (PSI): PSI measures the shift of a population over time or the shift between two samples of a population. This involves binning the two distributions and then comparing the population percentages in each bin such that:

PSI = Σ_i (Actual%_i − Expected%_i) × ln(Actual%_i / Expected%_i)

where i represents the bin number, Actual is the present population distribution and Expected is the reference population distribution. PSI < 0.1 indicates an insignificant change, 0.1 < PSI < 0.25 indicates a minor change, and PSI > 0.25 indicates a major change.
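The PSI calculation can be sketched in plain numpy as below. This is our own illustration, not etiq's internal implementation; the bin count and the small clipping constant (to guard against empty bins) are arbitrary choices:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a current sample."""
    # Fix the bin edges on the reference (Expected) distribution.
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) / division by zero in empty bins (arbitrary constant).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(1)
same = psi(rng.normal(0, 1, 50_000), rng.normal(0, 1, 50_000))
shifted = psi(rng.normal(0, 1, 50_000), rng.normal(1, 1, 50_000))
print(same)     # well under 0.1: insignificant change
print(shifted)  # well over 0.25: major change
```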
Kolmogorov-Smirnov (KS) Test: This is a test to assess whether two data samples (D1 and D2) belong to the same probability distribution. The measure is defined as:

KS = sup_x |P(x) − Q(x)|

where P(x) and Q(x) are the Cumulative Distribution Functions of the 1-D datasets D1 and D2, and sup_x is the supremum over x (the largest value of |P(x) − Q(x)| across all x). The KS test identifies differences in the location and shape of the cumulative distribution functions of samples D1 and D2, and it works well with numerical data.
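The two-sample KS test is available in scipy; a small sketch on synthetic samples (illustrative only, not part of etiq's API):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
d1 = rng.normal(0, 1, 5_000)          # reference sample
d2_same = rng.normal(0, 1, 5_000)     # drawn from the same distribution
d2_shifted = rng.normal(1, 1, 5_000)  # location-shifted distribution

# The KS statistic is the largest gap between the two empirical CDFs;
# a small p-value rejects the hypothesis that the samples share a distribution.
result_same = ks_2samp(d1, d2_same)
result_shifted = ks_2samp(d1, d2_shifted)
print(result_same.statistic, result_same.pvalue)        # small statistic, large p-value
print(result_shifted.statistic, result_shifted.pvalue)  # large statistic, tiny p-value
```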
To assess whether concept drift took place, it’s not enough to see if the distributions of the input features or of the predicted feature have changed over time, instead we need to understand if the conditional probabilities changed over time. This is because concept drift looks not at changes in data but at changes in relationships between input features and the predicted feature.
The measures included in our library for this are listed below. You can also use these measures for data drift.
Kullback–Leibler (KL) Divergence: This measures the difference between two probability distributions and provides a straightforward metric for monitoring any significant changes in the input data or the model output. If Q and P represent the distributions for the old and new data respectively, then the KL divergence for Q and P is defined as:

KL(P || Q) = Σ_x P(x) × ln(P(x) / Q(x))
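A small sketch of the KL calculation over binned distributions (our own illustration, not etiq's implementation; the clipping constant is an arbitrary guard against log(0)):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(P || Q) for two discrete distributions defined over the same bins."""
    # Clip to guard against log(0) / division by zero (arbitrary small constant).
    p = np.clip(np.asarray(p, dtype=float), 1e-12, None)
    q = np.clip(np.asarray(q, dtype=float), 1e-12, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

old = np.array([0.1, 0.4, 0.4, 0.1])      # Q: reference (old) distribution
new = np.array([0.1, 0.4, 0.4, 0.1])      # P: identical current distribution
drifted = np.array([0.3, 0.3, 0.2, 0.2])  # P: drifted current distribution
print(kl_divergence(new, old))      # 0.0 for identical distributions
print(kl_divergence(drifted, old))  # positive once the distribution has moved
```

Note that KL divergence is asymmetric: KL(P || Q) generally differs from KL(Q || P), which is one motivation for the JS divergence below.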
Jensen-Shannon (JS) Divergence: This is an extension of KL divergence that is symmetric and smoother. If Q and P represent the distributions for the old and new data respectively, then the JS divergence for Q and P is defined as:

JS(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M), where M = (1/2)(P + Q)
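JS divergence can be computed directly from the KL definition; a self-contained sketch (illustrative only, not etiq's implementation):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JS(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M), with M = (P + Q) / 2."""
    p, q = p / p.sum(), q / q.sum()
    m = (p + q) / 2

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing to the sum
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return float((kl(p, m) + kl(q, m)) / 2)

p = np.array([0.1, 0.4, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.2, 0.2])
print(js_divergence(p, q))  # symmetric in p and q, bounded above by ln(2)
```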
Wasserstein Distance/Earth Mover's Distance: This is a distance measure between two 1-D distributions. It measures the minimum amount of work needed to convert distribution P into distribution Q, where work is calculated by multiplying the amount of distribution weight moved by the distance it is moved. For binned 1-D distributions, the measure is defined as:

EMD = Σ_{i=1}^{n} |CDF_P(i) − CDF_Q(i)|

where n is the number of bins and i is the bin number. Use this to monitor the input distribution (a single feature at a time) or prediction probabilities.
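scipy provides a 1-D Wasserstein distance that works directly on samples, without explicit binning; a small sketch on synthetic data (illustrative only):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
p_sample = rng.normal(0.0, 1.0, 10_000)  # reference sample
q_sample = rng.normal(0.5, 1.0, 10_000)  # location-shifted sample

# For 1-D samples, the Wasserstein distance is the area between the two
# empirical CDFs; for a pure location shift it approaches the size of the shift.
print(wasserstein_distance(p_sample, p_sample))  # 0.0 for identical samples
print(wasserstein_distance(p_sample, q_sample))  # close to the 0.5 shift
```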
All of these measure how two probability distributions differ, but PSI and the Kolmogorov-Smirnov test measure differences in empirical distributions and are therefore less suitable for measuring concept drift.
Custom Drift Measures: It is possible to define custom drift measures to be used with etiq. This is done using the drift_measure decorator for a custom target or feature drift measure, or the concept_drift_measure decorator for a custom concept drift measure. For examples please see the notebook here.
For continuous features, you will want to consider pre-bucketing your features; we are adding an option in our next release so that you don't need to do this. You will also run into issues if you have fewer or more categories than in your base dataset, as the resulting distribution shifts will impact most of the metrics above. We are also adding functionality to cover this eventuality.
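One way to pre-bucket a continuous feature is sketched below. This is our own illustration, not an etiq recommendation: the quantile-based edges and the bucket count are arbitrary choices. The key point is to fix the bucket edges on the base-period data and reuse them for the current data, so both periods share the same categories:

```python
import numpy as np

rng = np.random.default_rng(4)
reference = rng.normal(40, 10, 1_000)  # e.g. a continuous feature in the base period
current = rng.normal(45, 10, 1_000)    # the same feature in the current period

# Decile edges computed on the reference data only, then applied to both periods.
edges = np.quantile(reference, np.linspace(0, 1, 11))
reference_buckets = np.digitize(reference, edges[1:-1])
current_buckets = np.digitize(current, edges[1:-1])
print(np.bincount(reference_buckets, minlength=10))  # ~100 per bucket by construction
print(np.bincount(current_buckets, minlength=10))    # mass shifted towards upper buckets
```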