Leakage can seriously harm your model. If a feature accidentally encodes the target, the model will appear to perform very well, with accuracy much higher than expected. That performance will not hold in production, so you are likely to deploy the wrong model.
There are multiple types of leakage; at the moment we provide scans for two:
Target leakage. This occurs when information about the target has leaked into a feature, e.g. if you're trying to predict yearly income and a monthly salary feature (for the same time period) is accidentally included in your dataset. While it seems hard to make this mistake, think of datasets with hundreds of features sourced from different databases and repositories around a business, perhaps calculated by multiple teams.
Demographic leakage. This occurs when information about one of your protected demographic features has leaked into another feature, e.g. if relationship status contains information related to a customer's gender, then using relationship status as a feature in a predictive model is highly problematic. If you have a use case where you've established that this is not a problem, then do not use this scan. However, depending on the methodology you use in your model build, this might pose other types of issues.
A good indicator for both types of leakage above is whether any of the features in the dataset are highly correlated, as this means that one has likely leaked into another. This criterion resembles the proxy issue in the bias scans; the main difference is the threshold level, which is much higher for leakage.
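The correlation check described above can be illustrated with a minimal sketch. It is not the scan's actual implementation: the helper names `pearson` and `flag_leaky_features`, the sample data, and the `0.95` threshold are all assumptions made for illustration, and only Pearson correlation between continuous features is shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; treat as no signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flag_leaky_features(features, target, threshold=0.95):
    """Return {feature name: correlation} for features whose absolute
    correlation with the target reaches the (hypothetical) threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Illustrative data: monthly salary is exactly yearly income / 12.
yearly_income = [24000, 36000, 48000, 60000, 90000]
features = {
    "monthly_salary": [2000, 3000, 4000, 5000, 7500],
    "age": [41, 23, 35, 52, 29],
}

flagged = flag_leaky_features(features, yearly_income)
print(sorted(flagged))
```

With this data only `monthly_salary` is flagged, mirroring the yearly income example. The same check could be run against a protected demographic feature instead of the target to look for demographic leakage, using a measure suited to the feature types involved.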
We provide multiple correlation measures, chosen according to the types of the features being compared: Pearson, Cramér's V, Rank-Biserial and Point-Biserial. Remember to specify in the config or the snapshot which features are of which type, so that you can make full use of the multiple-measure functionality. You can customize the measures in the config, but the default and recommended settings are below:
- "continuous_continuous_measure": "pearsons"
- "categorical_categorical_measure": "cramersv"
- "categorical_continuous_measure": "rankbiserial"
- "binary_continuous_measure": "pointbiserial"
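To show how the defaults above map feature-type pairs to measures, here is a small sketch. The keys and values are copied from the list above; the function `pick_measure`, the type labels `"continuous"`, `"categorical"` and `"binary"`, and the choice to treat a binary feature as categorical when it is paired with a non-continuous feature are assumptions for illustration, not the library's API.

```python
DEFAULT_MEASURES = {
    "continuous_continuous_measure": "pearsons",
    "categorical_categorical_measure": "cramersv",
    "categorical_continuous_measure": "rankbiserial",
    "binary_continuous_measure": "pointbiserial",
}

# Hypothetical type labels used only in this sketch.
KNOWN_TYPES = {"continuous", "categorical", "binary"}


def pick_measure(type_a, type_b, config=None):
    """Return the configured measure name for a pair of feature types."""
    config = DEFAULT_MEASURES if config is None else config
    for t in (type_a, type_b):
        if t not in KNOWN_TYPES:
            raise ValueError(f"unknown feature type: {t!r}")

    pair = {type_a, type_b}
    if pair == {"continuous"}:
        key = "continuous_continuous_measure"
    elif pair == {"binary", "continuous"}:
        key = "binary_continuous_measure"
    elif pair == {"categorical", "continuous"}:
        key = "categorical_continuous_measure"
    else:
        # Assumption: with no continuous feature in the pair, binary
        # features are handled as categorical ones.
        key = "categorical_categorical_measure"
    return config[key]


print(pick_measure("continuous", "binary"))
```

The order of the two types does not matter in this sketch, and passing a modified copy of the dictionary as `config` stands in for customizing the measures.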
This scan is more appropriate for training/pre-production stages.