Accuracy metrics are what I optimize my models on. Why should I also have tests on them?
- 1. High accuracy can be indicative of a problem just as much as low accuracy.
- For instance, if a plain accuracy metric comes in 10% higher than you expected, you might have leakage somewhere, or another issue.
- 2. Optimizing for a metric pre-production does not equate to optimizing for that metric in production.
- You will be better off getting a good model off the ground (one with no obvious issues, and which is likely to be robust) than chasing 1% more accuracy with a model that potentially overfits, one that unfairly discriminates against protected demographic groups, or one that will suffer abrupt performance decay.
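Point 1 above can be expressed as a simple two-sided test: fail not only when accuracy is too low, but also when it is suspiciously high. The sketch below is illustrative only; `EXPECTED_ACCURACY` and `TOLERANCE` are hypothetical values you would tune to your own model's historical performance, not part of any scan API.

```python
# Hypothetical benchmark and tolerance: tune to your own model's history.
EXPECTED_ACCURACY = 0.82  # assumption: accuracy seen in previous evaluations
TOLERANCE = 0.05          # assumption: acceptable deviation in either direction

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actuals."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def check_accuracy(y_true, y_pred):
    """Return the accuracy, failing if it falls outside the expected band.

    An accuracy far ABOVE the benchmark fails too: it can signal leakage
    just as a low score signals underperformance.
    """
    acc = accuracy(y_true, y_pred)
    if acc > EXPECTED_ACCURACY + TOLERANCE:
        raise AssertionError(
            f"Accuracy {acc:.2%} is suspiciously high: check for leakage")
    if acc < EXPECTED_ACCURACY - TOLERANCE:
        raise AssertionError(
            f"Accuracy {acc:.2%} is below the expected range")
    return acc
```

A two-sided band like this is what catches the "10% higher than expected" case described above, which a conventional one-sided accuracy threshold would happily wave through.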
Our accuracy scans so far provide 3 metrics:
Depending on the type of scan, your use case, and the stage you are at in the model building/deployment process, there are multiple ways to set up the dataset and sampling % in your snapshot, and in your scan.
For a common classification use case, please see the suggested setup below:
So far we have packaged the scans with the following assumption: in production you will have to create datasets containing actuals. Irrespective of what the model predicted, this feature refers to what actually happened in reality (e.g. did the customer default on their loan, was the transaction fraudulent, etc.).
At the moment you will use the same parameter in your config file, ‘label’, but here it denotes what actually happened in reality. If our users request it, we are open to adding an actuals dataset type to the packaging in the future, along with a separate ‘predicted’ feature.
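For illustration, the ‘label’ parameter in a production config might be pointed at the actuals column like this. Only ‘label’ is a documented parameter here; the value and the comment are hypothetical:

```yaml
# hypothetical config fragment for a loan-default classification scan
label: has_defaulted  # the actuals: what happened in reality, not what the model predicted
```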