Leakage

What is leakage?

Leakage can be very detrimental to your model. If a feature accidentally encodes the target, your model will appear to perform very well, with accuracy much higher than expected. This performance will not hold in production, and you are likely to deploy the wrong model.

There are multiple types of leakage. At the moment we provide scans for two:

Target leakage. This occurs when information about the target leaks into a feature, e.g. if you're trying to predict yearly income and a monthly salary feature (for the same time period) is accidentally included in your dataset. While this mistake seems hard to make, think of datasets with hundreds of features sourced from different databases and repositories around a business, perhaps calculated by multiple teams.

Demographic leakage. This occurs when a feature encodes information about one of your protected demographic features, e.g. if relationship status contains information related to a customer's gender, then using relationship status as a feature in a predictive model is highly problematic. If you have identified that this is not a problem for your use case, do not use this scan. However, depending on the methodology you use to build your model, this might pose other types of issues.

These are just two of the ways leakage can occur. We are adding further scans to deal with more types of leakage.

Metrics

A good indicator for both types of leakage above is whether any features in the dataset are very highly correlated, as this likely means that one has leaked into another. This criterion resembles the proxy issue in the bias scans; the main difference is the threshold, which is set much higher for leakage.
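The correlation check described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Etiq's implementation: the helper names `pearson` and `flag_leaky_features`, the toy data, and the 0.95 threshold are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def flag_leaky_features(features, target, threshold=0.95):
    """Return the names of features whose |correlation| with the target exceeds the threshold.

    The 0.95 default is an illustrative value, not a documented Etiq setting.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    ]

# Toy data: monthly_salary is yearly_income / 12, so it encodes the target.
yearly_income = [24000, 36000, 48000, 60000, 90000]
features = {
    "monthly_salary": [2000, 3000, 4000, 5000, 7500],
    "age": [45, 23, 52, 31, 38],
}

flagged = flag_leaky_features(features, yearly_income)
print(flagged)  # ['monthly_salary']
```

Setting the threshold this high reflects the point made above: a leakage check looks for near-duplicates of the target, whereas a proxy check in the bias scans would use a lower bar.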

We provide multiple correlation measures, to be used based on the type of features: Pearson, Cramér's V, Rank-Biserial, Point-Biserial. Remember to specify in the config or the snapshot which features are of which type, so that the appropriate measure can be applied to each pair of features. You can customize the measures in the config, but the default and recommended settings are below:

  • "continuous_continuous_measure" : "pearsons"

  • "categorical_categorical_measure": "cramersv"

  • "categorical_continuous_measure": "rankbiserial"

  • "binary_continuous_measure": "pointbiserial"
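To show why declaring feature types matters, here is a rough sketch of how a measure could be selected for a pair of features from the default mapping above, together with a plain (uncorrected) Cramér's V for two categorical columns. The `choose_measure` and `cramers_v` helpers are hypothetical and are not part of Etiq's API; in particular, treating binary features as categorical when no continuous feature is involved is an assumption of this sketch.

```python
import math
from collections import Counter

# Default measure per feature-type pair, copied from the list above.
DEFAULT_MEASURES = {
    "continuous_continuous_measure": "pearsons",
    "categorical_categorical_measure": "cramersv",
    "categorical_continuous_measure": "rankbiserial",
    "binary_continuous_measure": "pointbiserial",
}

ALLOWED_TYPES = {"continuous", "categorical", "binary"}

def choose_measure(type_a, type_b, measures=DEFAULT_MEASURES):
    """Pick the configured measure for a pair of feature types (hypothetical helper)."""
    types = {type_a, type_b}
    if not types <= ALLOWED_TYPES:
        raise ValueError(f"unknown feature type in {sorted(types)}")
    if types == {"continuous"}:
        return measures["continuous_continuous_measure"]
    if types == {"binary", "continuous"}:
        return measures["binary_continuous_measure"]
    if types == {"categorical", "continuous"}:
        return measures["categorical_continuous_measure"]
    # Assumption: with no continuous feature involved, binary is handled as categorical.
    return measures["categorical_categorical_measure"]

def cramers_v(xs, ys):
    """Cramér's V (no bias correction) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    chi2 = 0.0
    for x, row_n in row_totals.items():
        for y, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # one of the columns has a single category
    return math.sqrt(chi2 / (n * k))

# Toy data echoing the relationship status / gender example.
relationship = ["husband", "wife", "husband", "wife", "single", "single"]
gender = ["M", "F", "M", "F", "M", "F"]

print(choose_measure("categorical", "binary"))  # cramersv
print(round(cramers_v(relationship, gender), 2))  # about 0.82
```

If a feature's type is not declared, a selection step like this has nothing to go on, which is why the types should be stated in the config or the snapshot.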

Setting up leakage scans

Stage: Pre-production (if you use an Etiq-wrapped model)

Scans: scan_target_leakage(), scan_demographic_leakage()

Snapshot set-up: You can use the whole dataset and set the split percentages however you prefer, leaving at least 10% in the validation sample. The Etiq dataset loader will split the data for you when it creates the snapshot. By default the scan is run on the training sample.

Stage: Pre-production (if you log an already trained model)

Scans: scan_target_leakage(), scan_demographic_leakage()

Snapshot set-up: You should use your actual training dataset. By default the scan is run on the training sample.
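The set-up notes above ask for at least 10% of the data to be left in the validation sample. As a library-agnostic illustration of that constraint (this is not the Etiq dataset loader, and `split_rows` is a hypothetical helper), a split could be checked like this:

```python
import random

def split_rows(n_rows, validation_fraction=0.2, seed=0):
    """Shuffle row indices and split them into (train, validation) index lists.

    Rejects splits that would leave less than 10% of rows for validation.
    """
    if not 0.10 <= validation_fraction < 1.0:
        raise ValueError("keep at least 10% (and less than 100%) of rows for validation")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    n_valid = max(1, round(n_rows * validation_fraction))
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = split_rows(1000, validation_fraction=0.1)
print(len(train_idx), len(valid_idx))  # 900 100
```

A leakage scan would then be run on the rows selected by `train_idx`, matching the default of scanning the training sample.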

Production vs. pre-production

This scan is more appropriate for training/pre-production stages.

Example Notebooks

For example notebooks, code and config files for leakage scans please see repo link.
