"Algorithmic bias" refers to unintended discrimination occurring as a result of an automated decision. The term "protected feature" refers to a specific demographic characteristic such as age or sex. Legislation defines a series of protected features. For example, in the UK, citizens are protected against discrimination on the basis of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex or sexual orientation status by the Equality Act 2010.
The unprivileged group within the protected feature (for example, people over 65 when age is the protected feature) tends to be discriminated against and as a result tends to be the one protected by legislation. The privileged group within the protected feature tends to not be discriminated against.
We also use terms such as "debias" in the library as convenient shorthand. The consensus in the literature (and our view) is that algorithmic bias can be mitigated but not removed entirely.
How is bias measured?
There is no consensus on the most appropriate way to measure bias. However, depending on the framework used, there are some key metrics worth knowing.
Before we get into this, a quick explanation of model building. A model uses data to make predictions. During training, a model "learns" how to use the training data to work out which combinations of features predict a positive or negative outcome (the label). Testing the model on a validation or test dataset lets the user quantify how accurate the model's predictions are. There are many ways to measure bias, and this is an ongoing research topic. One way is to compute fairness metrics on the predictions that a trained model makes for a dataset, together with the corresponding ground truth. The fairness metrics attempt to encode in mathematical terms a notion of what a fair outcome should be for the model and the dataset. For each individual project, it is important to consider whether a particular fairness metric actually captures what a fair model means in that context.
Some of the metrics commonly used in the algorithmic fairness literature that the Etiq library provides are:
Demographic parity - is the ratio of users predicted to be positive over all the users the same for all groups in a demographic? For instance, is the proportion of women accepted for an interview the same as the proportion of men?
Equal opportunity - is the model equally good at identifying actual positives in all demographic groups? In other words, is the true positive rate the same for all demographics? The true positive rate measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition). If the true positive rate is lower for a group, that group is likely experiencing bias.
Equal odds - an extension of equal opportunity (also called equalized odds). It looks not just at true positive rates but also at false positive rates for the different demographic groups, to ensure that the model performs equally well for all of them.
Individual fairness - a different angle on bias: customers who display the same characteristics should be treated the same. There is not yet an agreed formal definition of this.
Demographic parity, equal opportunity and equal odds are described in this paper, and individual fairness is described in this paper.
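To make the group metrics above concrete, here is a minimal sketch in plain Python. It is not the Etiq library API: the function names `group_rates` and `max_gap`, the toy data, and the choice to summarize each metric as the largest gap between groups are all assumptions made for illustration. Equal odds is compared here through both the true positive rate and the false positive rate.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, true positive rate (TPR) and false positive rate (FPR).

    Labels and predictions are assumed to be 0 or 1 (hypothetical helper).
    """
    counts = defaultdict(
        lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0}
    )
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred  # predicted positive and actually positive
        else:
            c["neg"] += 1
            c["fp"] += pred  # predicted positive but actually negative

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # Demographic parity compares this value across groups.
            "positive_rate": c["pred_pos"] / c["n"],
            # Equal opportunity compares the TPR across groups.
            "tpr": c["tp"] / c["pos"] if c["pos"] else None,
            # Equal odds compares both the TPR and the FPR across groups.
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
        }
    return rates


def max_gap(rates, key):
    """Largest difference in one rate between any two groups (undefined rates skipped)."""
    values = [r[key] for r in rates.values() if r[key] is not None]
    return max(values) - min(values)


# Toy data, invented for illustration only.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["men"] * 4 + ["women"] * 4

rates = group_rates(y_true, y_pred, groups)
demographic_parity_gap = max_gap(rates, "positive_rate")
equal_opportunity_gap = max_gap(rates, "tpr")
equal_odds_gap = max(max_gap(rates, "tpr"), max_gap(rates, "fpr"))
```

On this toy data, men have a positive rate of 0.75 and women 0.25, so the demographic parity gap is 0.5. A gap of 0 would mean the groups are treated identically under that metric; how large a gap is acceptable is a project-level decision.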
The fairness & algorithmic bias literature is very complex and there is no consensus on how to measure and mitigate algorithmic bias.
What are some solutions and general approaches to the algorithmic bias problem?
In our understanding of the fairness literature, below are the key general areas:
Optimization: pre-processing, in-processing and post-processing methods which attempt to optimize for both fairness metrics and accuracy. Some examples include: mapping the training data to a space independent of the specific demographic, adversarial debiasing, and calibrating the model once it's built. Repair approaches range from very non-intrusive ones, e.g. resampling, to those that change the labels and feature distributions quite heavily.
Causality: causality-based approaches overlap with both counterfactual and optimization approaches, but are firmly rooted in the idea that a dataset can be modelled as a causal graph, which can then indicate whether belonging to a certain demographic class impacts other features and, via them, the outcome.
What framework do you use to measure, identify and mitigate the extent of the problem?
For the pipeline we released, we use the group metrics and sources of bias approaches.
The sources of bias framework used relies on this lecture. According to this framework, there are roughly 5 sources of bias within the model build process (that is, excluding issues like team diversity, data collection, etc.). 3 of them are visible from the data and/or model:
proxies - features that act as proxies for demographics
sample size disparity - the sample size for the protected demographic group is considerably lower than for the majority group
limited features - features might be less reliable for a certain demographic group than for the majority group
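Two of the sources above, proxies and sample size disparity, can be checked with very simple statistics. The sketch below is an illustration under stated assumptions, not how the Etiq pipeline detects them: the helper names, the 0.5 correlation threshold, the 0/1 encoding of the protected feature, and the toy data are all hypothetical. It flags numeric features whose Pearson correlation with the protected feature is strong, and reports the share of records in each group.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def proxy_candidates(features, protected, threshold=0.5):
    """Names of features whose absolute correlation with the protected feature exceeds the threshold.

    The threshold is an arbitrary illustrative value, not a recommended setting.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, protected)) > threshold
    ]


def group_shares(protected):
    """Share of records in each group of the protected feature."""
    counts = Counter(protected)
    total = len(protected)
    return {group: count / total for group, count in counts.items()}


# Toy data, invented for illustration: 1 marks the unprivileged group.
protected = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
features = {
    "part_time": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "tenure_years": [2, 5, 8, 2, 5, 8, 2, 5, 8, 5],
}

flagged = proxy_candidates(features, protected)
shares = group_shares(protected)
```

Here `part_time` is flagged as a possible proxy while `tenure_years` is not, and the unprivileged group makes up 30% of the records. Linear correlation only catches simple proxies; combinations of features can encode a demographic without any single feature being strongly correlated with it.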
The remaining 2 sources of bias require more background or context knowledge:
'tainted' examples - the target variable is reflective of past bias, e.g. a model predicting who might make a good hire using data on who was hired in the past, not on who was objectively the best candidate for the role
skewed sample - the dataset is not representative of the population for which the model will be used
As expected, different bias sources can be mitigated by different repairs. The repair we focus on at the moment is at the pre-processing stage only: changing the dataset in such a way as to mitigate some of the inherent bias issues it presents.
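To give a feel for what a pre-processing repair can look like, here is a minimal sketch of reweighing, a well-known technique from the fairness literature that assigns each record a weight so that the protected feature and the label look statistically independent in the weighted data. This is a swapped-in illustration, not the repair implemented in our pipeline; the function names and toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight per record: expected joint frequency under independence / observed joint frequency."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    return positive / total


# Toy data, invented for illustration: group "a" has 4 of 6 positive labels,
# group "b" has only 1 of 4.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
```

After reweighing, both groups have the same weighted positive rate in this example, even though the raw rates differ. The weights would then be passed to a training procedure that supports per-record weights; the records themselves are left unchanged, which is why this sits at the non-intrusive end of the repair spectrum.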
This pipeline is experimental.
1.3 includes additional pipelines and we will make them available on AWS Marketplace shortly. If you want to use them just get in touch: [email protected]
I'm just starting to look into the algorithmic bias topic. What are some resources to read for further reference?
How many records should I use to get an outcome from this pipeline?
So far we have tested on samples of at least 20K rows. If you try it on datasets with fewer than 10K rows, please submit any issues or questions on our Slack channel.
How confident can I be of the results of the pipeline?
Our goal is for our pipelines to be transparent enough in terms of the outcome that it will be clear to the user how reliable the results are. We are also working on adding stability measures to our library.