Etiq Docs

Data Issues

Data Issues Description

Data collection and validation form an essential part of any machine learning pipeline. A number of issues can arise during data collection, and the Etiq library provides a way of detecting them. Instead of requiring the user to define explicit rules for what constitutes valid data, Etiq generates the rules automatically from an exemplar dataset.
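To make the idea of rules generated from an exemplar dataset concrete, here is a minimal sketch in plain Python. It is not Etiq's implementation: the `derive_rules` helper, the `Dataset` alias and the rule layout are all hypothetical, and the sketch assumes that a feature holding only numbers is continuous and any other feature is categorical.

```python
from typing import Any, Dict, List

# Hypothetical in-memory dataset: column name -> list of values.
Dataset = Dict[str, List[Any]]


def derive_rules(exemplar: Dataset) -> Dict[str, Dict[str, Any]]:
    """Build one validity rule per feature from the exemplar dataset."""
    rules: Dict[str, Dict[str, Any]] = {}
    for name, values in exemplar.items():
        numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if numeric:
            # Continuous feature: remember the observed range.
            rules[name] = {"kind": "continuous", "min": min(values), "max": max(values)}
        else:
            # Categorical feature: remember the set of observed categories.
            rules[name] = {"kind": "categorical", "categories": set(values)}
    return rules


exemplar = {
    "age": [23, 35, 61],
    "country": ["UK", "FR", "UK"],
}
rules = derive_rules(exemplar)
```

In this sketch the user never writes a rule by hand: the expected feature names, value ranges and categories all come from the exemplar data.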
The different kinds of data issues detected are:

| Data Issue | Description |
| --- | --- |
| Identical Feature | A feature in one dataset has values that are exact copies of the values in the exemplar dataset. |
| Missing Feature | A feature in the exemplar dataset is missing from the comparison dataset. |
| Unknown Feature | A feature in the comparison dataset is missing from the exemplar dataset. |
| Missing Feature Category | A categorical feature has values in the exemplar dataset that are missing from the comparison dataset. |
| Unknown Feature Category | A categorical feature has values in the comparison dataset that are missing from the exemplar dataset. |
| Feature Value Below Minimum | A continuous feature has one or more values in the comparison dataset that are lower than the minimum value for that feature in the exemplar dataset. |
| Feature Value Above Maximum | A continuous feature has one or more values in the comparison dataset that are higher than the maximum value for that feature in the exemplar dataset. |
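The issue types listed above can be illustrated with a short plain-Python sketch that compares a comparison dataset against an exemplar. This is not how Etiq implements its scan: the `find_data_issues` function, the issue labels and the `Dataset` alias are hypothetical, features holding only numbers are assumed to be continuous, and "identical feature" is interpreted here as the comparison values matching the exemplar values exactly.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical in-memory dataset: column name -> list of values.
Dataset = Dict[str, List[Any]]


def is_continuous(values: List[Any]) -> bool:
    """Treat a non-empty, all-numeric column as continuous (an assumption)."""
    return bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def find_data_issues(exemplar: Dataset, comparison: Dataset) -> List[Tuple[str, str]]:
    """Return (issue_type, feature_name) pairs found in the comparison dataset."""
    issues: List[Tuple[str, str]] = []
    # Feature-level checks: compare the sets of column names.
    for name in exemplar:
        if name not in comparison:
            issues.append(("missing_feature", name))
    for name in comparison:
        if name not in exemplar:
            issues.append(("unknown_feature", name))
    # Value-level checks on features present in both datasets.
    for name in exemplar:
        if name not in comparison:
            continue
        ref, new = exemplar[name], comparison[name]
        if ref == new:
            issues.append(("identical_feature", name))
        if is_continuous(ref) and is_continuous(new):
            if min(new) < min(ref):
                issues.append(("feature_value_below_minimum", name))
            if max(new) > max(ref):
                issues.append(("feature_value_above_maximum", name))
        else:
            if set(ref) - set(new):
                issues.append(("missing_feature_category", name))
            if set(new) - set(ref):
                issues.append(("unknown_feature_category", name))
    return issues


exemplar = {
    "age": [23, 35, 61],
    "country": ["UK", "FR", "UK"],
    "plan": ["a", "b", "a"],
    "id": [1, 2, 3],
}
comparison = {
    "age": [19, 40, 70],
    "country": ["UK", "DE", "UK"],
    "id": [1, 2, 3],
    "score": [0.1, 0.2, 0.3],
}
issues = find_data_issues(exemplar, comparison)
```

With these toy inputs, `plan` is reported as missing, `score` as unknown, `age` as falling outside the exemplar range on both sides, `country` as having both a missing and an unknown category, and `id` as identical.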

Data Issues Scans

As with drift scans, you create a snapshot from the dataset you are assessing for issues and a comparison dataset. For example:
snapshot = project.snapshots.create(name="Data Issues Snapshot",
                                    dataset=base_dataset,
                                    comparison_dataset=comparison_dataset,
                                    model=None)
For now you don't need to specify any config parameters. To run the scan:
snapshot.scan_data_issues()
At the moment you can use the same data issues scans in pre-production and production.

Example Notebooks

For example notebooks, code and config files for data issues scans, please see repo link.