Accuracy

Why run accuracy scans?

Accuracy metrics are what I optimize my models on, so why should I also have tests on them?

  1. High accuracy can indicate a problem just as much as low accuracy

    • For instance, if a plain accuracy metric comes out 10% higher than you expected, you might have data leakage or another issue somewhere (a sketch of such a sanity test follows this list).

  2. Optimizing for a metric pre-production does not equate to optimizing for that metric in production

    • You will be better off getting a good model off the ground, one with no obvious issues and which is likely to be robust, than chasing a 1% higher accuracy with a model that may be overfitting, unfairly discriminating against protected demographic groups, or headed for abrupt performance decay.
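To make the first point concrete, below is a minimal sketch of a two-sided accuracy check. The expected band, tolerance, and helper name are illustrative assumptions, not part of any scan package:

```python
from sklearn.metrics import accuracy_score

# Assumed expected band, e.g. taken from cross-validation during development.
EXPECTED_ACCURACY = 0.82
TOLERANCE = 0.05  # flag anything more than 5 points away from expectations

def check_accuracy(y_true, y_pred):
    """Fail loudly when accuracy drifts outside the expected band, in either
    direction: too low suggests decay or bugs, too high suggests leakage."""
    acc = accuracy_score(y_true, y_pred)
    if acc > EXPECTED_ACCURACY + TOLERANCE:
        raise AssertionError(f"Accuracy {acc:.3f} is suspiciously high; check for leakage.")
    if acc < EXPECTED_ACCURACY - TOLERANCE:
        raise AssertionError(f"Accuracy {acc:.3f} is below expectations; check for drift.")
    return acc

# Toy usage with made-up labels and predictions (accuracy 0.80, inside the band).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
print(check_accuracy(y_true, y_pred))
```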

Metrics

Our accuracy scans currently provide three built-in metrics.

In addition to these, you can define your own via custom metrics.
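How a custom metric is registered may vary; as a sketch, assume any callable that takes true and predicted labels and returns a float, in the style of scikit-learn metrics, can serve as one:

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Hypothetical custom metric: the mean of the per-class error rates,
    which penalizes models that only get the majority class right."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class_errors = [
        np.mean(y_pred[y_true == cls] != cls)  # error rate within one class
        for cls in np.unique(y_true)
    ]
    return float(np.mean(per_class_errors))

print(balanced_error_rate([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.25
```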

Setting up accuracy scans

Depending on the type of scan, your use case, and where you are in the model building and deployment process, there are multiple ways to combine the dataset and sampling % in your snapshot with the settings of your scan.

For a common classification use case, see the suggested set-up below.

Production vs. pre-production

So far we have packaged the scans with the following assumption: in production you will create datasets containing actuals. Irrespective of what the model predicted, this feature refers to what actually happened in reality (e.g. did the customer default on their loan, was the transaction fraudulent, etc.).

At the moment you will use the same parameter in your config file, ‘label’, but in production it denotes what actually happened in reality. If our users request it, we are open to adding an actuals dataset type to the packaging in the future, along with a separate ‘predicted’ feature.
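As a rough illustration, a production config might look like the sketch below. Only the ‘label’ parameter is mentioned above; every other key and value is an assumption made up for this example:

```yaml
# Hypothetical scan config; only 'label' is a documented parameter.
dataset: loans_production_2024q1   # assumed dataset name
label: has_defaulted               # in production this column holds the actuals
sampling_pct: 10                   # assumed: % of snapshot rows to scan
metrics:
  - accuracy
```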

You can use whatever accuracy metric you want in the scans to monitor your model’s performance. However, if you consider how the observed outcomes came about, some metrics will be more helpful than others.

For instance, let’s take a case where you do not have control groups: a model predicting default rates, where as a result of the model you give loans only to people who present a low enough risk profile. Among those people, looking at who kept paying their loans over the first time period gives you a reliable precision-style rate for your low-risk predictions; it might decrease over time if the loan period is not yet complete, but at least it is not misleading. Trying to compute an overall accuracy rate, however, would not make much sense, because you never gave loans to anyone for whom you predicted a low likelihood of repayment in the first place. Many problems related to algorithmic bias stem from exactly this kind of selective feedback.
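The toy simulation below, built entirely on made-up numbers, illustrates the point: the repayment rate among approved loans is observable, while overall accuracy would require outcomes for rejected applicants that never materialize:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model scores: predicted probability of default for 1,000 applicants.
p_default = rng.uniform(0, 1, size=1_000)
approved = p_default < 0.3  # loans go only to predicted-low-risk applicants

# In this simulation every outcome exists, with default occurring at the
# predicted probability; in production you only observe the approved subset.
defaults = rng.uniform(0, 1, size=1_000) < p_default

# Observable in production: repayment rate among approved loans.
print(f"Repayment rate among approved: {1 - defaults[approved].mean():.2f}")

# Not observable: overall accuracy needs outcomes for rejected applicants too.
```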

Example notebooks

For example notebooks, code, and config files for accuracy scans, please see the repo link.
