Data Issues

Data Issues Description

Data collection and validation form an essential part of any machine learning pipeline. A number of issues can arise during the data collection phase, and the Etiq library provides a way of detecting them. Instead of having the user define explicit rules as to what constitutes valid data, the rules are generated automatically from an exemplar dataset.

The different kinds of data issues detected are:

Data Issues and Descriptions

Identical Feature

This is a data issue where a feature in one dataset has values which are identical copies of the values in the exemplar dataset.

Missing Feature

This is a data issue where a feature in the exemplar dataset is missing from the comparison dataset.

Unknown Feature

This is a data issue where a feature in the comparison dataset is missing from the exemplar dataset.

Missing Feature Category

This is a data issue where a categorical feature has values in the exemplar dataset which are missing from the comparison dataset.

Unknown Feature Category

This is a data issue where a categorical feature has values in the comparison dataset which are missing from the exemplar dataset.

Feature Value Below Minimum

This is a data issue where a continuous feature has value(s) in the comparison dataset which are lower than the minimum value for that feature in the exemplar dataset.

Feature Value Above Maximum

This is a data issue where a continuous feature has value(s) in the comparison dataset which are higher than the maximum value for that feature in the exemplar dataset.

Order Violation

This is a data issue where, for a particular record, the ordering of two features is violated, e.g. a start date that falls later than an end date. This is only available where a snapshot has a single dataset.

Missing ID

This is a data issue for a particular record where an id feature has a missing value. This is only available where a snapshot has a single dataset.

Duplicate Record

This is a data issue where a particular record is a duplicate of at least one other record. Note that records are identified by the tuple of all the id values unless a subset is specified. This is only available where a snapshot has a single dataset.
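To make the comparison-based issue types above more concrete, here is a minimal sketch in plain Python of how rules could be derived from an exemplar dataset and then checked against a comparison dataset. It is not Etiq's implementation or API: the function names derive_rules and find_issues, the issue labels, and the toy column-to-values dataset format are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the Etiq API. All names here are hypothetical.
# A dataset is modelled as a dict mapping column name -> list of values.


def derive_rules(exemplar, categorical, continuous):
    """Derive simple validity rules from an exemplar dataset."""
    return {
        "features": set(exemplar),
        "categories": {f: set(exemplar[f]) for f in categorical},
        "ranges": {f: (min(exemplar[f]), max(exemplar[f])) for f in continuous},
    }


def find_issues(rules, comparison):
    """Return (issue_type, feature) pairs found in the comparison dataset."""
    issues = []
    present = set(comparison)

    # Missing / unknown features: compare the two sets of column names.
    for f in sorted(rules["features"] - present):
        issues.append(("missing_feature", f))
    for f in sorted(present - rules["features"]):
        issues.append(("unknown_feature", f))

    # Missing / unknown categories: compare the sets of observed values.
    for f, known in rules["categories"].items():
        if f not in present:
            continue
        seen = set(comparison[f])
        if known - seen:
            issues.append(("missing_feature_category", f))
        if seen - known:
            issues.append(("unknown_feature_category", f))

    # Range checks: compare values against the exemplar minimum and maximum.
    for f, (low, high) in rules["ranges"].items():
        if f not in present:
            continue
        if any(v < low for v in comparison[f]):
            issues.append(("feature_value_below_minimum", f))
        if any(v > high for v in comparison[f]):
            issues.append(("feature_value_above_maximum", f))
    return issues


exemplar = {
    "age": [20, 35, 50],
    "colour": ["red", "green", "blue"],
    "income": [10.0, 20.0, 30.0],
}
comparison = {
    "age": [18, 40, 65],
    "colour": ["red", "green", "pink"],
    "score": [1, 2, 3],
}

rules = derive_rules(exemplar, categorical=["colour"], continuous=["age", "income"])
issues = find_issues(rules, comparison)
for issue_type, feature in issues:
    print(issue_type, feature)
```

In this toy data the comparison dataset lacks the income column, adds a score column, drops the blue category, introduces a pink category, and has age values on both sides of the exemplar range, so each comparison-based issue type is triggered once.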

Data Issues Scans

Just as you do for drift, you will have to create your snapshot using the dataset you are assessing for issues and a comparison dataset. The example below shows how to do a data issues scan where we have an exemplar dataset against which we want to compare another dataset.

snapshot = project.snapshots.create(name="Data Issues Snapshot",
                                    dataset=base_dataset,
                                    comparison_dataset=comparison_dataset,
                                    model=None)
snapshot.scan_data_issues()

If, however, we only want to scan a single dataset, the following example is more appropriate:

snapshot = project.snapshots.create(name="Data Issues Snapshot",
                                    dataset=base_dataset,
                                    model=None)
snapshot.scan_data_issues()
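The three issue types that apply only to a single dataset (order violation, missing ID, duplicate record) can be pictured with the following plain-Python sketch. This is an illustration of the concepts under stated assumptions, not how Etiq implements or exposes them: the helper names, the record format (a list of dicts), and the use of None for a missing value are all hypothetical.

```python
# Illustrative sketch only: NOT the Etiq API. All names here are hypothetical.
from collections import Counter
from datetime import date

records = [
    {"id": 1, "start": date(2023, 1, 1), "end": date(2023, 2, 1)},
    {"id": 2, "start": date(2023, 5, 1), "end": date(2023, 4, 1)},     # start after end
    {"id": None, "start": date(2023, 3, 1), "end": date(2023, 3, 2)},  # missing id
    {"id": 1, "start": date(2023, 6, 1), "end": date(2023, 7, 1)},     # repeats id 1
]


def order_violations(rows, earlier, later):
    """Indices of rows where the 'earlier' feature exceeds the 'later' one."""
    return [
        i for i, r in enumerate(rows)
        if r[earlier] is not None and r[later] is not None and r[earlier] > r[later]
    ]


def missing_ids(rows, id_features):
    """Indices of rows where any id feature has a missing value."""
    return [i for i, r in enumerate(rows) if any(r[f] is None for f in id_features)]


def duplicate_records(rows, id_features):
    """Indices of rows whose tuple of id values occurs more than once."""
    keys = [tuple(r[f] for f in id_features) for r in rows]
    counts = Counter(keys)
    return [i for i, k in enumerate(keys) if counts[k] > 1]


violations = order_violations(records, "start", "end")
missing = missing_ids(records, ["id"])
duplicates = duplicate_records(records, ["id"])
print(violations, missing, duplicates)
```

Records are keyed by the tuple of the chosen id features, mirroring the description of Duplicate Record above; passing a shorter list of id features corresponds to specifying a subset.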

Several config parameters let you switch individual tests on or off, or restrict certain issue checks to specific features where applicable. For the full range of config options, see the example notebook for data issues.

Privacy Considerations

If you use data issues scans on our SaaS version, you have the option to leave out details on the data profile charts by using the generate_data_profiles config option below:

base_snapshot = project.snapshots.create(name="Base Snapshot",
                                         dataset=base_dataset,
                                         model=None,
                                         generate_data_profiles=True)

However, the distribution charts might still pick up some details that you do not want to leave your organization's environment. If you need help using or testing Etiq on premise or on your own instance, just reach out to us directly at info@etiq.ai. Don't forget we are also on AWS Marketplace (release 1.3.4).

Example Notebooks

For example notebooks, code and config files for data issues scans, please see the repo link.
