Etiq Docs
Leakage

What is leakage?

Leakage can seriously harm your model. If a feature accidentally encodes the target, your model will appear to perform very well, with accuracy much higher than expected. This performance will not hold in production, so you risk deploying the wrong model.
There are multiple types of leakage. At the moment we provide scans for two of them:
  1. Target leakage. This occurs when a feature encodes information about the target. For example, you are trying to predict yearly income and a monthly salary feature (for the same time period) is accidentally included in your dataset. While it seems hard to make this mistake, think of datasets with hundreds of features sourced from different databases and repositories around a business, perhaps calculated by multiple teams.
  2. Demographic leakage. This occurs when a feature encodes information about one of your protected demographic features. For example, if relationship status contains information related to a customer's gender, then using relationship status as a feature in a predictive model is highly problematic.

Metrics

At the moment we only provide Pearson's correlation, which is designed for numeric data, so there is an issue with calculating correlations for known categorical features. Fixing this is a priority on our roadmap.
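To illustrate why Pearson's correlation is a poor fit for categorical features, here is a minimal sketch in plain Python. It does not use the Etiq library; the `pearson` helper, the data and the two encodings are all hypothetical. The target is fully determined by the category, yet the measured correlation depends entirely on which arbitrary integer code each category receives.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; treated as 0 in this sketch
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: the target is 1 exactly when the status is "married".
status = ["single", "married", "divorced"] * 4
target = [1 if s == "married" else 0 for s in status]

# Two equally valid integer encodings of the same categorical feature.
encoding_a = {"single": 0, "married": 1, "divorced": 2}
encoding_b = {"single": 0, "married": 2, "divorced": 1}

corr_a = pearson([encoding_a[s] for s in status], target)
corr_b = pearson([encoding_b[s] for s in status], target)

print(f"encoding A: r = {corr_a:.2f}")
print(f"encoding B: r = {corr_b:.2f}")
```

With encoding A the correlation is zero even though the feature determines the target, while encoding B gives a high value. A correlation threshold applied to integer-coded categories can therefore miss leakage or flag it inconsistently.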
A good indicator for both types of leakage above is whether any features in the dataset are highly correlated with each other, as this likely means that one has leaked into another. This criterion resembles the proxy issue in the bias scans; the main difference is the level of the thresholds.
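The correlation check described above can be sketched in plain Python as follows. This is an illustration of the idea only, not Etiq's implementation: the `find_correlated_pairs` helper, the example columns and the 0.9 threshold are assumptions made for this example.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; treated as 0 in this sketch
    return cov / math.sqrt(var_x * var_y)

def find_correlated_pairs(columns, threshold):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical dataset: monthly salary leaks the yearly income target.
monthly_salary = [2000, 2500, 3100, 4000, 5200, 6100]
columns = {
    "monthly_salary": monthly_salary,
    "age": [45, 23, 52, 31, 38, 27],
    "yearly_income": [12 * m for m in monthly_salary],  # the target
}

# The 0.9 threshold is an arbitrary choice for this example.
flagged = find_correlated_pairs(columns, threshold=0.9)
for name_a, name_b, r in flagged:
    print(f"possible leakage: {name_a} ~ {name_b} (r = {r:.2f})")
```

Including the target and any protected demographic columns in `columns` covers both leakage types with the same check: a flagged pair involving the target suggests target leakage, and a flagged pair involving a protected feature suggests demographic leakage.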

Setting up leakage scans

| Stage | Scan | Snapshot set-up |
| --- | --- | --- |
| Pre-production (if you use an etiq wrapped model) | scan_leakage() | You can use the whole dataset and set the split percentages however you prefer (leaving at least 10% in the validation sample). The Etiq dataset loader will split it for you when it creates the snapshot. By default the scan is run on the training sample. |
| Pre-production (if you log an already trained model) | scan_leakage() | You should use your actual training dataset. By default the scan is run on the training sample. |

Production vs. pre-production

This scan is more appropriate for training/pre-production stages.

Example Notebooks

For example notebooks, code and config files for leakage scans, please see repo link.