Bias

What is bias?

In this context, bias refers to algorithmic bias. "Algorithmic bias" refers to unintended discrimination occurring as a result of an automated decision.

Legislation defines a series of protected features. For example, in the UK, the Equality Act 2010 protects citizens against discrimination on the basis of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, and sexual orientation.

The unprivileged group within the protected feature (for example, people over 65 when age is the protected feature) tends to be discriminated against and as a result tends to be the one protected by legislation. The privileged group within the protected feature tends to not be discriminated against.

If you do not tackle this issue, your model is potentially unethical, may discriminate unintentionally, and is at risk from a compliance point of view. You are also potentially leaving customer groups underserved, and thus leaving money on the table.

Bias Metrics Scan

Some of the metrics commonly used in the algorithmic fairness literature that the Etiq library provides are:

| Metric | Description |
| --- | --- |
| Equal Opportunity | Measures the difference in true positive rate between a privileged demographic group and an unprivileged demographic group. |
| Demographic Parity | Measures the difference between the number of positive labels out of the total for a privileged demographic group vs. an unprivileged demographic group. |
| Equal Odds TNR | Measures the difference in true negative rate between the privileged and unprivileged groups. The full measure in the literature looks for an optimal point where the difference in true positive rate and the difference in true negative rate between demographic groups are both minimized. |
| Individual Fairness | Measures whether individuals with similar features observe the same model responses. |
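To make the group-based definitions above concrete, here is a minimal sketch in plain Python. It is not the Etiq implementation: the function names are hypothetical, the sign convention (privileged minus unprivileged) is an assumption, and "positive labels" is taken here to mean positive predictions. Individual fairness is left out because it needs a notion of similarity between records.

```python
from typing import Dict, Sequence


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the group is empty."""
    return numerator / denominator if denominator else 0.0


def group_rates(y_true: Sequence[int], y_pred: Sequence[int],
                group: Sequence[int], value: int) -> Dict[str, float]:
    """TPR, TNR and positive-prediction rate for one demographic group."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == value]
    true_pos = sum(1 for t, p in rows if t == 1 and p == 1)
    true_neg = sum(1 for t, p in rows if t == 0 and p == 0)
    positives = sum(1 for t, _ in rows if t == 1)
    negatives = sum(1 for t, _ in rows if t == 0)
    predicted_pos = sum(1 for _, p in rows if p == 1)
    return {
        "tpr": rate(true_pos, positives),
        "tnr": rate(true_neg, negatives),
        "positive_rate": rate(predicted_pos, len(rows)),
    }


def bias_metrics(y_true, y_pred, group, privileged=1, unprivileged=0):
    """Differences between groups (assumed: privileged minus unprivileged)."""
    priv = group_rates(y_true, y_pred, group, privileged)
    unpriv = group_rates(y_true, y_pred, group, unprivileged)
    return {
        "equal_opportunity": priv["tpr"] - unpriv["tpr"],
        "demographic_parity": priv["positive_rate"] - unpriv["positive_rate"],
        "equal_odds_tnr": priv["tnr"] - unpriv["tnr"],
    }


# Toy data: first four records are privileged (1), last four unprivileged (0)
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
example = bias_metrics(y_true, y_pred, group)
# example == {"equal_opportunity": 0.5, "demographic_parity": 0.5, "equal_odds_tnr": -0.5}
```

In this toy data the model finds all privileged positives but only half of the unprivileged ones, which is why the equal opportunity difference is 0.5 rather than 0.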

Our Bias Metrics scan uses the metrics above with certain thresholds to see if the model meets that benchmark or not.

The syntax to run the scan after you’ve logged a snapshot is:

snapshot.scan_bias_metrics()

You set the thresholds, but most metrics should ideally be as close to 0 as possible: the model should not behave differently, with detrimental outcomes, for the protected groups.
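As an illustration of how a threshold check might work, the sketch below assumes each threshold is a [lower, upper] pair, the shape used in the config example on this page, and that a metric passes when its value falls inside that range (inclusive). The function name is hypothetical and the exact comparison Etiq applies may differ.

```python
from typing import Dict, List


def check_thresholds(metrics: Dict[str, float],
                     thresholds: Dict[str, List[float]]) -> Dict[str, bool]:
    """Return True per metric when lower <= value <= upper (assumed rule)."""
    results = {}
    for name, (lower, upper) in thresholds.items():
        if name in metrics:  # skip thresholds for metrics that were not computed
            results[name] = lower <= metrics[name] <= upper
    return results


thresholds = {
    "equal_opportunity": [0.0, 0.2],
    "demographic_parity": [0.0, 0.2],
}
# Illustrative metric values, not real scan output
metrics = {"equal_opportunity": 0.05, "demographic_parity": 0.31}
outcome = check_thresholds(metrics, thresholds)
# outcome == {"equal_opportunity": True, "demographic_parity": False}
```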

The consensus in the literature (and our view) is that algorithmic bias can be mitigated but not removed entirely.

This is still a new area of research, and the metrics available can be misleading. For more resources please see our research post on this topic.

Bias Sources Scan

Our Bias Sources scan identifies potential sources of bias based on a framework that includes:

| Source | Description |
| --- | --- |
| Proxies | Features that are proxies for demographics. |
| Sample size disparity | Difference in sample sizes, and in the number of positive/negative labels, between the protected demographic group and the majority demographic group. |
| Segment size | Are some customer profiles poorly represented in your sample? |
| Limited features / correlation issue | Features are less reliable for a certain demographic group. This is often linked with sampling, but more fundamentally it could be that some groups' behaviour is less well encoded by the available features. |

It is useful to look at these metrics globally to uncover issues across your sample. But a lot of the issues will only be visible for specific groups or specific records. The Bias Sources scan aims to identify which groups have the issues above.
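As a rough illustration of the sample size disparity check, the sketch below summarises, per demographic group, the number of records, the share of the sample, and the proportion of positive labels. It is a simplified stand-in rather than the Etiq scan, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def sample_size_summary(groups: Sequence[Hashable], labels: Sequence[int],
                        positive_label: int = 1) -> Dict[Hashable, dict]:
    """Per-group record count, share of the sample and positive-label rate."""
    counts = Counter(groups)
    positives = Counter(g for g, y in zip(groups, labels) if y == positive_label)
    total = len(groups)
    return {
        g: {
            "count": n,
            "share": n / total,
            "positive_rate": positives[g] / n,
        }
        for g, n in counts.items()
    }


# Toy data: six records in group 1, two records in group 0
groups = [1, 1, 1, 1, 1, 1, 0, 0]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
summary = sample_size_summary(groups, labels)
# summary[0] == {"count": 2, "share": 0.25, "positive_rate": 0.5}
```

A group that makes up a small share of the sample, or has very few positive labels, gives the model little to learn from, which is the kind of issue this source is about.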

The Bias Sources scan is run on the training dataset by default, as this is where your model learns any potentially harmful, unfairly discriminatory pattern. You will not run this scan in production. The Bias Metrics scan is run on the validation dataset.

The syntax to run the scan after you've logged the relevant config file and a snapshot is:

snapshot.scan_bias_sources()

You can run the Bias Sources scan in one of two ways:

  1. If you don't set anything in the config, the segments will be fuzzy rather than based on business rules.

  2. If you set "auto": true in the config (as in the config used here), the segments will be based on business rules.

If you use the auto option, you will need to specify the categorical and continuous features. You can do this either from the config as in this case:

{
    "dataset": {
        "label": "income",
        "bias_params": {
            "protected": "gender",
            "privileged": 1,
            "unprivileged": 0,
            "positive_outcome_label": 1,
            "negative_outcome_label": 0
        },
        "train_valid_test_splits": [0.8, 0.1, 0.1],
        "remove_protected_from_features": true,
        "cat_col": ["workclass", "relationship", "occupation", "gender", "race", "native-country", "marital-status", "income", "education"],
        "cont_col": ["age", "educational-num", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]
    },
    "scan_bias_metrics": {
        "thresholds": {
            "equal_opportunity": [0.0, 0.2],
            "demographic_parity": [0.0, 0.2],
            "equal_odds_tnr": [0.0, 0.2],
            "individual_fairness": [0.0, 0.2],
            "equal_odds_tpr": [0.0, 0.2]
        }
    },
    "scan_bias_sources": {
        "auto": true
    }
}

Or you can run it from the notebook:

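# Load your dataset.
# For the bias sources scan, declare the categorical and continuous features
# here, or set them up in the config instead.
# `data_encoded` (the encoded dataset) and `project` (your Etiq project) are
# assumed to have been created earlier in the notebook.
import etiq
from etiq.model import DefaultXGBoostClassifier

# Feature types, matching the config example above
cat_vars = ["workclass", "relationship", "occupation", "gender", "race", "native-country", "marital-status", "income", "education"]
cont_vars = ["age", "educational-num", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]

dataset_loader = etiq.dataset(data_encoded)
dl = etiq.dataset_loader.DatasetLoader(data=data_encoded, label='income', bias_params=dataset_loader.bias_params,
                   train_valid_test_splits=[0.8, 0.1, 0.1], cat_col=cat_vars,
                   cont_col=cont_vars, names_col=data_encoded.columns.values)

# Load our model
model = DefaultXGBoostClassifier()

# Create a snapshot
snapshot = project.snapshots.create(name="Snapshot 2", dataset=dl.initial_dataset, model=model, bias_params=dataset_loader.bias_params)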

We provide multiple correlation measures, chosen according to the type of features: Pearson, Cramér's V, rank-biserial and point-biserial. To make full use of them, remember to specify in the config or the snapshot which features are of which type. You can customize the measures in the config, but the default and recommended settings are:

  • "continuous_continuous_measure" : "pearsons"

  • "categorical_categorical_measure": "cramersv"

  • "categorical_continuous_measure": "rankbiserial"

  • "binary_continuous_measure": "pointbiserial"
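For intuition on what two of these measures capture, here is a small plain-Python sketch of Cramér's V (categorical vs. categorical) and Pearson's r (continuous vs. continuous). Applied to a 0/1 feature and a continuous one, Pearson's r equals the point-biserial correlation. These are textbook formulas without bias correction, written for illustration only; the function names are hypothetical and this is not the code Etiq runs.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cramers_v(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Cramér's V from the chi-squared statistic of the contingency table."""
    n = len(x)
    joint = Counter(zip(x, y))
    x_counts, y_counts = Counter(x), Counter(y)
    chi2 = 0.0
    for x_value, x_count in x_counts.items():
        for y_value, y_count in y_counts.items():
            expected = x_count * y_count / n
            observed = joint.get((x_value, y_value), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's r; with a 0/1 variable this is the point-biserial correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


# A feature that perfectly tracks the protected attribute is a clear proxy
proxy_strength = cramers_v(["a", "a", "b", "b"], [1, 1, 0, 0])  # 1.0
no_association = cramers_v(["a", "a", "b", "b"], [1, 0, 1, 0])  # 0.0
binary_vs_continuous = pearson([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0])
```

A value close to 1 between a feature and the protected attribute suggests the feature may act as a proxy for demographics.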

There are many additional sources of bias, which require more background or context knowledge than just observing the data or the model:

  • 'Tainted' examples: the target variable is reflective of past bias

    • e.g. a model predicting who might make a good hire using data on who was hired in the past not on who was the objectively best candidate for the role

  • Skewed sample: the dataset is not representative of the population for which the model will be used

Production vs. pre-production

| Stage | Scan | Snapshot set-up |
| --- | --- | --- |
| Pre-production (Etiq-wrapped model) | Bias Sources: scan_bias_sources() | You can use the whole dataset and set up the split percentages as you prefer (leaving at least % in the validation sample). The Etiq dataset loader will split it for you when it creates the snapshot. By default the scan will be run on the training sample. The "label" parameter refers to the predicted feature (and because this is your training/test/validation data, it will also be your actuals). |
| Pre-production (Etiq-wrapped model) | Bias Metrics: scan_bias_metrics() | You can use the whole dataset and set up the split percentages as you prefer. The Etiq dataset loader will split it for you when it creates the snapshot. By default the scan will be run on the validation sample. The "label" parameter refers to the predicted feature (and because this is your training/test/validation data, it will also be your actuals). |
| Pre-production (already trained user model) | Bias Sources: scan_bias_sources() | You should log your actual training dataset as training by setting the split in the config file like this: "train_valid_test_splits": [1.0, 0.0, 0.0]. By default the scan will be run on the training sample. You will have to run this scan separately from the bias metrics and bias accuracy scans. (We are working on changing this.) The "label" parameter refers to the predicted feature (and because this is your training/test/validation data, it will also be your actuals). |
| Pre-production (already trained user model) | Bias Metrics: scan_bias_metrics() | You should log your actual test/validation dataset (the sample you did not use to train the model) as validation by setting the split in the config file like this: "train_valid_test_splits": [0.0, 1.0, 0.0]. By default the scan will be run on the validation sample. The "label" parameter will be the predicted feature, not the actual. You won't have actuals at this stage of model deployment yet. |
| Production | Bias Metrics: scan_bias_metrics() (Individual_Fairness; Demographic_Parity) | You should log your dataset as validation. By default the scan will be run on the validation sample. These metrics do not require actuals. The "label" parameter will be the predicted feature, not the actual. You won't have actuals at this stage of model deployment yet. |
| Production | Bias Metrics: scan_bias_metrics() (Equal_Opportunity; Equal_Odds) | You can only run this scan in production once you have actuals. You should log your dataset as validation by setting the split in the config file like this: "train_valid_test_splits": [0.0, 1.0, 0.0]. By default the scan will be run on the validation sample. The "label" parameter will be the actuals feature once you have it, and you will need to set up your dataset in advance (e.g. using Airflow). |
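For an already trained model, the split settings above could be written into config files along these lines. This is a sketch under stated assumptions: the helper name and the `SPLITS` mapping are hypothetical, only the "dataset", "label" and "train_valid_test_splits" keys come from the config example on this page, and a real config will also need the other fields shown there (for example "bias_params").

```python
import json

# Which sample the whole logged dataset should be treated as
SPLITS = {
    "training": [1.0, 0.0, 0.0],    # for the bias sources scan
    "validation": [0.0, 1.0, 0.0],  # for the bias metrics scan
}


def build_dataset_config(sample: str, label: str = "income") -> dict:
    """Build a partial config that logs the whole dataset as one sample."""
    if sample not in SPLITS:
        raise ValueError(f"sample must be one of {sorted(SPLITS)}")
    return {
        "dataset": {
            "label": label,
            "train_valid_test_splits": SPLITS[sample],
        }
    }


sources_config = build_dataset_config("training")
metrics_config = build_dataset_config("validation")

# Serialise to JSON text, ready to be written to a config file
metrics_config_json = json.dumps(metrics_config, indent=4)
```

Keeping one config per scan reflects the note above that, for an already trained model, the bias sources scan has to be run separately from the bias metrics scan.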

Bias Scans Limitations

Bias is one of the most complex topics today. We started Etiq to help teams tackle this problem.

We don’t believe that having a few scans in place is enough to tackle this problem, and we don’t think our bias sources scans are by any means exhaustive. Additionally, the metrics themselves are often misleading; we have published some research on this topic here. However, if these scans lead data science and engineering teams to at least start considering algorithmic bias and fairness as a problem they should tackle, one as important as accuracy-based performance, drift or data issues, if not more so, then we feel at least part of our mission is accomplished.

If you are interested in this problem in more depth, we’d be very happy to hear from you. We have done research in the space and have built additional pipelines as part of the lower-level API, which we’re happy to share and walk you through (email us at info@etiq.ai).

Example notebooks

For example notebooks, code and config files for accuracy scans please see repo link.
