Data Issues
Data Issues Description
Data collection and validation form an essential part of any machine learning pipeline. A number of issues can arise during the data collection phase, and the Etiq library provides a way of detecting them. Instead of requiring the user to define explicit rules for what constitutes valid data, Etiq generates the rules automatically from an exemplar dataset.
The different kinds of data issues detected are:
Identical Feature
This is a data issue where a feature in the comparison dataset has values that are identical copies of the values for that feature in the exemplar dataset.
Missing Feature
This is a data issue where a feature in the exemplar dataset is missing from the comparison dataset.
Unknown Feature
This is a data issue where a feature in the comparison dataset is missing from the exemplar dataset.
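As an illustration of the three feature-level issues described so far, here is a minimal sketch in plain Python. It does not use the Etiq API: the `compare_features` function, the dict-of-lists dataset representation, and the sample data are all assumptions made for this example.

```python
def compare_features(exemplar, comparison):
    """Compare two datasets given as dicts of feature name -> list of values.

    Hypothetical helper for illustration only; not part of the Etiq library.
    """
    exemplar_names = set(exemplar)
    comparison_names = set(comparison)
    return {
        # Features present in the exemplar but absent from the comparison.
        "missing": sorted(exemplar_names - comparison_names),
        # Features present in the comparison but absent from the exemplar.
        "unknown": sorted(comparison_names - exemplar_names),
        # Shared features whose values are exact copies of the exemplar's.
        "identical": sorted(
            name
            for name in exemplar_names & comparison_names
            if exemplar[name] == comparison[name]
        ),
    }


exemplar = {
    "age": [31, 45, 52],
    "city": ["Leeds", "York", "Bath"],
    "income": [28.0, 41.5, 39.0],
}
comparison = {
    "age": [31, 45, 52],
    "city": ["Leeds", "Hull", "Bath"],
    "score": [0.2, 0.7, 0.9],
}

feature_issues = compare_features(exemplar, comparison)
print(feature_issues)
```

In this sample, `income` is a missing feature, `score` is an unknown feature, and `age` is an identical feature.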
Missing Feature Category
This is a data issue where a categorical feature has values in the exemplar dataset which are missing from the comparison dataset.
Unknown Feature Category
This is a data issue where a categorical feature has values in the comparison dataset which are missing from the exemplar dataset.
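The two category checks amount to set differences between the values seen in each dataset. The sketch below shows the idea in plain Python; `compare_categories` and the sample values are hypothetical and are not taken from the Etiq library.

```python
def compare_categories(exemplar_values, comparison_values):
    """Find categories seen in only one of the two datasets.

    Hypothetical helper for illustration only; not part of the Etiq library.
    """
    exemplar_set = set(exemplar_values)
    comparison_set = set(comparison_values)
    return {
        # Categories the exemplar contains but the comparison never shows.
        "missing_categories": sorted(exemplar_set - comparison_set),
        # Categories the comparison contains but the exemplar never shows.
        "unknown_categories": sorted(comparison_set - exemplar_set),
    }


category_issues = compare_categories(
    ["Leeds", "York", "Bath", "York"],  # exemplar values of a categorical feature
    ["Leeds", "Hull", "Bath"],          # comparison values of the same feature
)
print(category_issues)
```

Here `York` would be reported as a missing category and `Hull` as an unknown category.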
Feature Value Below Minimum
This is a data issue where a continuous feature has value(s) in the comparison dataset which are lower than the minimum value for that feature in the exemplar dataset.
Feature Value Above Maximum
This is a data issue where a continuous feature has value(s) in the comparison dataset which are higher than the maximum value for that feature in the exemplar dataset.
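The minimum and maximum checks can be pictured as deriving a valid range from the exemplar and flagging comparison values that fall outside it. The following is a minimal sketch in plain Python, assuming strict inequalities (a value equal to the exemplar minimum or maximum is not flagged); `out_of_range` is a hypothetical name, not an Etiq function.

```python
def out_of_range(exemplar_values, comparison_values):
    """Flag comparison values outside the range observed in the exemplar.

    Hypothetical helper for illustration only; not part of the Etiq library.
    Returns (row index, value) pairs for each violation.
    """
    low, high = min(exemplar_values), max(exemplar_values)
    return {
        "min": low,
        "max": high,
        "below_min": [(i, v) for i, v in enumerate(comparison_values) if v < low],
        "above_max": [(i, v) for i, v in enumerate(comparison_values) if v > high],
    }


range_issues = out_of_range(
    [28.0, 41.5, 39.0],        # exemplar values of a continuous feature
    [30.0, 12.5, 41.5, 60.0],  # comparison values of the same feature
)
print(range_issues)
```

With this data the exemplar range is 28.0 to 41.5, so row 1 (12.5) is below the minimum and row 3 (60.0) is above the maximum.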
Order Violation
This is a data issue where, for a particular record, the ordering of two features is violated, e.g. a start date that is later than an end date. This is only available where a snapshot has a single dataset.
Missing ID
This is a data issue for a particular record where an id feature has a missing value. This is only available where a snapshot has a single dataset.
Duplicate Record
This is a data issue where a particular record is a duplicate of at least one other record. Note that records are identified by the tuple of all the id values unless a subset is specified. This is only available where a snapshot has a single dataset.
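The three single-dataset checks (order violation, missing ID, duplicate record) operate record by record. The sketch below illustrates them in plain Python on a list of dicts. All function names, the `id`, `start` and `end` feature names, and the sample records are assumptions made for this example; this is not how Etiq implements or exposes these checks.

```python
from collections import Counter
from datetime import date

# Sample records; the feature names are hypothetical.
records = [
    {"id": 1, "start": date(2023, 1, 1), "end": date(2023, 2, 1)},
    {"id": 2, "start": date(2023, 5, 1), "end": date(2023, 4, 1)},      # start after end
    {"id": None, "start": date(2023, 3, 1), "end": date(2023, 3, 15)},  # missing id
    {"id": 1, "start": date(2023, 6, 1), "end": date(2023, 7, 1)},      # repeats id 1
]


def order_violations(rows, earlier, later):
    """Indices of records where the `earlier` feature exceeds the `later` one."""
    return [i for i, row in enumerate(rows) if row[earlier] > row[later]]


def missing_ids(rows, id_features):
    """Indices of records where any id feature has no value."""
    return [i for i, row in enumerate(rows) if any(row[f] is None for f in id_features)]


def duplicate_records(rows, id_features):
    """Indices of records whose tuple of id values occurs more than once."""
    keys = [tuple(row[f] for f in id_features) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] > 1]


order_issues = order_violations(records, "start", "end")
missing_id_issues = missing_ids(records, ["id"])
duplicate_issues = duplicate_records(records, ["id"])
print(order_issues, missing_id_issues, duplicate_issues)
```

Passing a list of id features to `duplicate_records` mirrors the idea that a record is identified by the tuple of its id values, and that a subset of id features can be used instead.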
Data Issues Scans
Just as you do for drift, you will have to create your snapshot using the dataset you are assessing for issues and a comparison dataset. The example below shows how to do a data issues scan where we have an exemplar dataset against which we want to compare another dataset.
If, however, we only want to scan a single dataset, the following example would be more appropriate.
There are a number of config parameters that can be set to switch the different tests on and off or, where applicable, to restrict a test to certain features. For the full range of config options, see the example notebook for data issues.
Privacy Considerations
If you use data issues scans on our SaaS version, you have the option to leave out details on the Data profile charts by using the config option below:
However, the distribution charts might still pick up some details that you do not want to leave your organization's environment. If you need help using or testing Etiq on-prem or on your own instance, just reach out to us directly: info@etiq.ai. Don't forget we are also on AWS Marketplace (release 1.3.4).
Example Notebooks
For example notebooks, code and config files for data issues scans, please see repo link.