In this context, bias means algorithmic bias: unintended discrimination that occurs as a result of an automated decision.
Legislation defines a set of protected features. For example, in the UK, the Equality Act 2010 protects citizens against discrimination on the basis of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, and sexual orientation. The unprivileged group within a protected feature (for example, people over 65 when age is the protected feature) tends to be discriminated against and, as a result, tends to be the group that legislation protects. The privileged group within the protected feature tends not to be discriminated against.
If you do not tackle this issue, your model is potentially unethical, unintentionally discriminatory and a compliance risk. You are also potentially leaving customer groups underserved, and thus leaving money on the table.
Some of the metrics commonly used in the algorithmic fairness literature that the Etiq library provides are:
- Equal Opportunity: measures the difference in true positive rate between a privileged demographic group and an unprivileged demographic group.
- Demographic Parity: measures the difference between the number of positive labels out of the total in a privileged demographic group and in an unprivileged demographic group.
- Equal Odds TNR: measures the difference in true negative rate between the privileged and unprivileged groups. The full measure in the literature looks for an optimal point where both the difference in true positive rate and the difference in true negative rate between demographic groups are minimized.
- Individual Fairness: measures whether individuals with similar features receive the same model responses.
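As a rough illustration of how the group metrics above can be computed, here is a minimal sketch in plain Python. It is not the Etiq implementation: the function names, the toy data, the use of predicted labels for Demographic Parity, and the sign convention (privileged minus unprivileged) are all assumptions made for this example.

```python
def rate(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def group_rates(y_true, y_pred, group, value):
    """Compute TPR, TNR and positive-prediction rate for one demographic group."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == value]
    true_pos = sum(1 for t, p in rows if t == 1 and p == 1)
    true_neg = sum(1 for t, p in rows if t == 0 and p == 0)
    positives = sum(1 for t, _ in rows if t == 1)
    negatives = sum(1 for t, _ in rows if t == 0)
    predicted_pos = sum(1 for _, p in rows if p == 1)
    return {
        "tpr": rate(true_pos, positives),
        "tnr": rate(true_neg, negatives),
        "positive_rate": rate(predicted_pos, len(rows)),
    }


def fairness_gaps(y_true, y_pred, group, privileged, unprivileged):
    """Differences between groups (assumed convention: privileged minus unprivileged)."""
    priv = group_rates(y_true, y_pred, group, privileged)
    unpriv = group_rates(y_true, y_pred, group, unprivileged)
    return {
        "equal_opportunity": priv["tpr"] - unpriv["tpr"],
        "demographic_parity": priv["positive_rate"] - unpriv["positive_rate"],
        "equal_odds_tnr": priv["tnr"] - unpriv["tnr"],
    }


# Toy data: group "a" is treated as privileged, group "b" as unprivileged.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

gaps = fairness_gaps(y_true, y_pred, group, privileged="a", unprivileged="b")
print(gaps)
```

On this toy data the privileged group has the higher true positive rate and positive prediction rate, while the unprivileged group has the higher true negative rate, so the first two gaps are positive and the third is negative.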
Our Bias Metrics scan compares the metrics above against thresholds to check whether the model meets each benchmark.
The syntax to run the scan after you’ve logged a snapshot is:
The thresholds are set by the user, but most metrics should ideally be as close to 0 as possible, meaning that the model should not behave differently (and with detrimental outcomes) for the protected groups.
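To make the threshold idea concrete, the sketch below checks a set of metric values against `[lower, upper]` pairs written in the same shape as the `[0.0, 0.2]` entries in the example config on this page. Treating each pair as an inclusive range on the absolute metric value is an assumption, as are the function name and the sample metric values; the scan's own logic may differ.

```python
# Threshold pairs in the same shape as the example config entries.
thresholds = {
    "equal_opportunity": [0.0, 0.2],
    "demographic_parity": [0.0, 0.2],
    "equal_odds_tnr": [0.0, 0.2],
}


def check_thresholds(metrics, thresholds):
    """Return {metric_name: bool}; True means the absolute value is within range.

    Assumption: each threshold is an inclusive [lower, upper] range.
    """
    results = {}
    for name, (lower, upper) in thresholds.items():
        value = abs(metrics[name])
        results[name] = lower <= value <= upper
    return results


# Hypothetical metric values, for illustration only.
measured = {
    "equal_opportunity": 0.05,
    "demographic_parity": 0.31,
    "equal_odds_tnr": -0.12,
}

results = check_thresholds(measured, thresholds)
for name, passed in results.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```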
The consensus in the literature (and our view) is that algorithmic bias can be mitigated but not removed entirely.
Our Bias Sources scan identifies potential sources of bias based on a framework that includes:
- Proxies: features that act as proxies for demographics
- Sample size disparity: difference in sample sizes and in the number of positive/negative labels between the protected demographic group and the majority demographic group
- Segment size: are some customer profiles poorly represented in your sample?
- Limited features/correlation issue: features are less reliable for a certain demographic group. This is often linked with sampling, but more fundamentally it could be that some groups' behaviour is less well encoded by the available features
It is useful to look at these metrics globally to uncover issues across your sample, but many issues will only be visible for specific groups or specific records. The Bias Sources scan aims to identify which groups have the issues above.
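As an example of one check from the list above, sample size disparity can be inspected by hand with a few lines of plain Python. This is only a sketch under assumptions: the function name, the column values and the toy data are made up for illustration, and the Bias Sources scan itself works on segments rather than on a single global split like this one.

```python
from collections import defaultdict


def group_summary(labels, group):
    """Per-group sample size, share of the sample, and positive-label rate."""
    counts = defaultdict(lambda: {"n": 0, "positives": 0})
    for y, g in zip(labels, group):
        counts[g]["n"] += 1
        counts[g]["positives"] += int(y == 1)
    total = len(labels)
    return {
        g: {
            "n": c["n"],
            "share": c["n"] / total,
            "positive_rate": c["positives"] / c["n"],
        }
        for g, c in counts.items()
    }


# Toy data: 7 records from one group and 3 from the other.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
group = ["Male"] * 7 + ["Female"] * 3

summary = group_summary(labels, group)
for name, stats in summary.items():
    print(name, stats)
```

A large gap in `share` or `positive_rate` between groups is a hint that the model has fewer, or differently labelled, examples to learn from for one of them.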
The Bias Sources scan is run on the training dataset by default, as this is where your model learns potentially harmful, unfairly discriminatory patterns. You will not be running this scan in production. The Bias Metrics scan is run on the validation dataset.
The syntax to run the scan after you've logged the relevant config file and a snapshot is:
You have two options for running the Bias Sources scan:
1. If you don't set anything in the config, the segments will be fuzzy rather than based on business rules.
2. If you set the option auto in the config (as in the config we are currently using), the segments will be based on business rules.
If you use the auto option, you will need to specify the categorical and continuous features. You can do this either from the config as in this case:
"train_valid_test_splits": [0.8, 0.1, 0.1],
"cat_col": ["workclass", "relationship", "occupation", "gender", "race", "native-country", "marital-status", "income", "education"],
"cont_col": ["age", "educational-num", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]
"equal_opportunity": [0.0, 0.2],
"demographic_parity": [0.0, 0.2],
"equal_odds_tnr": [0.0, 0.2],
"individual_fairness": [0.0, 0.2],
"equal_odds_tpr": [0.0, 0.2]
Or you can run it from the notebook:
import etiq
from etiq.model import DefaultXGBoostClassifier
# Load your dataset.
# For bias sources you currently need either this specific syntax or the
# categorical and continuous features set up in the config.
# cat_vars and cont_vars are lists of column names, like "cat_col" and "cont_col" in the config.
dataset_loader = etiq.dataset(data_encoded)
dl = etiq.dataset_loader.DatasetLoader(
    data=data_encoded, label='income', bias_params=dataset_loader.bias_params,
    train_valid_test_splits=[0.8, 0.1, 0.1], cat_col=cat_vars,
    cont_col=cont_vars, names_col=data_encoded.columns.values)
# Load our model
model = DefaultXGBoostClassifier()
# Create a snapshot
snapshot = project.snapshots.create(name="Snapshot 2", dataset=dl.initial_dataset, model=model, bias_params=dataset_loader.bias_params)
We provide multiple correlation measures, chosen according to the type of the features: Pearson, Cramér's V, rank-biserial and point-biserial. Remember to specify in the config or the snapshot which features are of which type, so that you can make full use of the multiple-measure functionality. You can customize this in the config, but the default and recommended setup is below:
- "continuous_continuous_measure" : "pearsons"
- "categorical_categorical_measure": "cramersv"
- "categorical_continuous_measure": "rankbiserial"
- "binary_continuous_measure": "pointbiserial"
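For readers who want to see what these four measures compute, here is a self-contained sketch in plain Python. It is written for illustration and is not the library's code: the function names, the choice of the uncorrected Cramér's V, the pairwise (simple difference) form of the rank-biserial correlation, and the toy data are all assumptions. Only the keys of `MEASURES` are taken from the config values listed above.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two numeric sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def point_biserial(binary, values):
    """Point-biserial correlation: Pearson with the binary feature coded 0/1."""
    return pearson(binary, values)


def rank_biserial(binary, values):
    """Rank-biserial correlation via the simple difference formula:
    (favourable pairs - unfavourable pairs) / all pairs across the two groups."""
    ones = [v for b, v in zip(binary, values) if b == 1]
    zeros = [v for b, v in zip(binary, values) if b == 0]
    favourable = sum(1 for a in ones for b in zeros if a > b)
    unfavourable = sum(1 for a in ones for b in zeros if a < b)
    return (favourable - unfavourable) / (len(ones) * len(zeros))


def cramers_v(a, b):
    """Cramér's V (no bias correction) between two categorical sequences."""
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    chi2 = 0.0
    for va, na in count_a.items():
        for vb, nb in count_b.items():
            expected = na * nb / n
            chi2 += (joint[(va, vb)] - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Map the config measure names to the sketch functions above.
MEASURES = {
    "pearsons": pearson,
    "cramersv": cramers_v,
    "rankbiserial": rank_biserial,
    "pointbiserial": point_biserial,
}

# Toy data, for illustration only.
r_cc = MEASURES["pearsons"]([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
v_perfect = MEASURES["cramersv"](["x", "x", "y", "y"], ["p", "p", "q", "q"])
v_none = MEASURES["cramersv"](["x", "x", "y", "y"], ["p", "q", "p", "q"])
r_rb = MEASURES["rankbiserial"]([0, 0, 0, 1, 1, 1], [1, 2, 3, 4, 5, 6])
r_pb = MEASURES["pointbiserial"]([0, 0, 0, 1, 1, 1], [1, 2, 3, 4, 5, 6])
print(r_cc, v_perfect, v_none, r_rb, r_pb)
```

A feature whose measure against a protected attribute is close to 1 (or to -1 for the signed measures) is a candidate proxy and worth a closer look.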
There are many additional sources of bias, which require more background or context knowledge than just observing the data or the model:
- 'Tainted' examples: the target variable reflects past bias, e.g. a model predicting who might make a good hire using data on who was hired in the past rather than on who was objectively the best candidate for the role
- Skewed sample: the dataset is not representative of the population for which the model will be used
Bias is one of the most complex topics today. We started Etiq to help teams tackle this problem.
We don’t believe that having a few scans in place is enough to tackle this problem, and we don’t think our bias sources scans are by any means exhaustive. Additionally, the metrics themselves are often misleading; we have published some research on this topic. However, if these scans lead data science and engineering teams to at least start treating algorithmic bias and fairness as a problem they should tackle, one as important as (if not more important than) accuracy-based performance, drift or data issues, then we feel at least part of our mission is accomplished.
If you are interested in this problem in more depth, we’d be very happy to hear from you. We have done research in the space and have additional pipelines built as part of the lower-level API, which we’re happy to share and walk you through if you’re interested; just email us.