Why run accuracy scans?

Accuracy metrics are what I optimize my models on. Why should I have tests on accuracy metrics as well?

  1. High accuracy can be indicative of a problem just as much as low accuracy

    • For instance, if a plain accuracy metric is 10% higher than you expected, you might have data leakage somewhere or another issue.

  2. Optimizing for a metric pre-production does not equate to optimizing for that metric in production

    • You will be better off getting a good model off the ground, one with no obvious issues and which is likely to be robust, than chasing a 1% higher accuracy with a model that is potentially overfitting, that unfairly discriminates against protected demographic groups, or that will experience abrupt performance decay.
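
As a rough illustration of point 1, a test on an accuracy metric can flag values on either side of an expected band, not only values that are too low. This is a minimal sketch in plain Python: the function `check_accuracy_band` and the band limits are hypothetical and are not part of the Etiq API.

```python
def check_accuracy_band(value, lower, upper):
    """Classify a metric value against an expected [lower, upper] band."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if value < lower:
        return "too_low"
    if value > upper:
        # Higher than expected: worth checking for leakage before celebrating.
        return "suspiciously_high"
    return "ok"


# Hypothetical band: what you expect from this kind of model.
expected_lower, expected_upper = 0.70, 0.85

for observed in (0.65, 0.80, 0.95):
    print(observed, check_accuracy_band(observed, expected_lower, expected_upper))
```

The band itself is a judgment call based on what you know about the problem; the point is only that the upper limit is tested as well as the lower one.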


Our accuracy scans so far provide 3 metrics:

Accuracy

% correct out of total

True Positive Rate

the proportion of positive outcome labels that are correctly classified out of all positive outcome labels

True Negative Rate

the proportion of negative outcome labels that are correctly classified out of all negative outcome labels
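
The three definitions above can be written out in a few lines of plain Python. This is an illustrative sketch only, not how the scans compute the metrics internally: the helper `accuracy_metrics` and the sample labels are made up for the example.

```python
def accuracy_metrics(actual, predicted, positive=1):
    """Return (accuracy, true positive rate, true negative rate)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    # Correctly classified positives and negatives.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    n_pos = sum(1 for a, _ in pairs if a == positive)
    n_neg = len(pairs) - n_pos
    accuracy = (tp + tn) / len(pairs) if pairs else float("nan")
    tpr = tp / n_pos if n_pos else float("nan")
    tnr = tn / n_neg if n_neg else float("nan")
    return accuracy, tpr, tnr


# Made-up labels: 4 positives and 6 negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

acc, tpr, tnr = accuracy_metrics(actual, predicted)
print(f"accuracy={acc:.3f} TPR={tpr:.3f} TNR={tnr:.3f}")
```

In this sample, 8 of 10 rows are correct, 3 of the 4 positives are caught, and 5 of the 6 negatives are caught, so the three metrics differ even though they come from the same predictions.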

In addition to these 3 metrics, you can add your own through custom metrics.

Setting up accuracy scans

Depending on the type of scan, your use case, and your stage in the model building/deployment process, there are several ways to set up your dataset, the sampling % in your snapshot, and your scan.

For a common classification use case, please see the suggested set-up below:

Stage | Scan | Snapshot set-up


(if you use an etiq wrapped model)


You can use the whole dataset and set up the split % however you prefer (leaving at least 10% in the validation sample).

The Etiq dataset loader will split it for you when it creates the snapshot. By default the scan will be run on the validation sample.

The ‘label’ parameter refers to the predicted feature (and because this is your training / test / validation data, it will also hold your actuals).


(if you log your own already trained model)


You should log your actual test / validation dataset (the sample you did not use to train the model) as validation/test by setting the split in the config file like this: "train_valid_test_splits": [0.0, 1.0, 0.0].

By default the scan will be run on the validation sample. The "label" parameter will be the predicted feature, not the actual. You won’t have actuals yet at this stage of model deployment.


(if you have actuals)

scan_accuracy_metrics()
    • TPR: true positive rate
    • TNR: true negative rate

You can run this scan in production only once you have actuals. You should log the dataset used in production as validation by setting the split in the config file like this: "train_valid_test_splits": [0.0, 1.0, 0.0].

The "label" parameter will be the actuals feature once you have it, and you will need to set up your dataset in advance (e.g. using Airflow).


(if you do not have actuals)

scan_accuracy_metrics()
    • custom metric

You might have custom metrics which you are labelling as accuracy but for which you do not need actuals. In this instance, the label will be what the model scores/predicts rather than the actuals.
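
The split settings described in the set-up above can be sanity-checked before they go into a config file. The sketch below is plain Python with a hypothetical helper, `validate_splits`. Only the `train_valid_test_splits` and `label` names come from this page. The label value `"outcome"` and the flat layout of the fragment are assumptions, so check the example config files for where these keys actually sit.

```python
import json
import math


def validate_splits(splits, min_valid=0.0):
    """Check a [train, valid, test] list of fractions and return a copy."""
    if len(splits) != 3:
        raise ValueError("expected three fractions: train, valid, test")
    if any(s < 0.0 or s > 1.0 for s in splits):
        raise ValueError("each fraction must be between 0 and 1")
    if not math.isclose(sum(splits), 1.0, abs_tol=1e-9):
        raise ValueError("fractions must sum to 1.0")
    if splits[1] < min_valid:
        raise ValueError("validation fraction is below the required minimum")
    return list(splits)


# Etiq-wrapped model: any split you like, keeping at least 10% for validation.
wrapped_model = validate_splits([0.8, 0.1, 0.1], min_valid=0.1)

# Already trained model, or production data: everything logged as validation.
logged_model = validate_splits([0.0, 1.0, 0.0])

# Hypothetical fragment; "outcome" is a made-up column name.
fragment = {"label": "outcome", "train_valid_test_splits": logged_model}
print(json.dumps(fragment))
```

A check like this catches a split that leaves too little in the validation sample, which matters because the scan runs on that sample by default.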

Production vs. pre-production

So far we have packaged the scans with the following assumption: in production, you will have to create datasets that contain actuals. Irrespective of what the model predicted, this feature refers to what actually happened in reality (e.g. whether the customer defaulted on their loan, whether the transaction was fraudulent, etc.).

At the moment you will use the same parameter in your config file, ‘label’, but in production this feature denotes what actually happened in reality. If our users request it, we are open to adding an actuals dataset type to the packaging in the future, and separately a ‘predicted’ feature.

You can use whatever accuracy metric you want in the scans to monitor your model’s performance. However, depending on how the observed outcomes came about, some metrics will be more helpful than others.

For instance, take a case where you do not have control groups: a model that predicts default rates, where as a result of the model you give loans only to people who present a low enough risk profile. Among those people, looking at who kept paying their loans over the first time period gives you a reliable true positive rate. It might decrease over time if the loan period isn’t complete, but at least it is not misleading. An overall accuracy rate, however, would not make much sense, because you never gave loans to anyone for whom you predicted a low likelihood of repayment, so you have no outcomes for them. A lot of algorithmic bias related problems stem from these issues.
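
The loan example can be made concrete with a few made-up records. In this sketch an outcome exists only for applicants who were predicted to repay and therefore received a loan. The data and variable names are hypothetical.

```python
# Each record: (predicted_to_repay, actually_repaid).
# actually_repaid is None when no loan was granted, so nothing was observed.
applications = [
    (True, True), (True, True), (True, True), (True, False),
    (False, None), (False, None), (False, None),
]

# Only rows with an observed outcome can be scored at all.
observed = [(p, a) for p, a in applications if a is not None]

# Share of approved applicants who kept paying: measurable in production.
repaid_among_approved = sum(1 for _, a in observed if a) / len(observed)

# Outcomes for rejected applicants: none exist, so the "negative" side of
# any overall accuracy figure has nothing to be computed from.
rejected_with_outcome = [a for p, a in observed if not p]

print(f"repaid among approved: {repaid_among_approved:.2f}")
print(f"rejected applicants with an observed outcome: {len(rejected_with_outcome)}")
```

Any accuracy figure computed on `observed` alone only restates the repayment rate among approved applicants, because every observed row has the same prediction.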

Example notebooks

For example notebooks, code and config files for accuracy scans please see repo link.
