Data Issues

Data Issues Description

Data collection and validation form an essential part of any machine learning pipeline. A number of issues can arise during the data collection phase, and the Etiq library provides a way of detecting them. Instead of requiring the user to define explicit rules for what constitutes valid data, Etiq generates the rules automatically from an exemplar dataset.
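To make the idea of rules generated from an exemplar dataset more concrete, here is a minimal sketch in plain Python. It is not Etiq's implementation: the names `FeatureRule`, `infer_rules` and `find_issues`, and the three issue labels, are hypothetical and chosen only for illustration. The sketch assumes each dataset is a list of dictionaries, learns a numeric range or a set of seen categories per feature, and then reports rows in another dataset that break those learned rules.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Set, Tuple


@dataclass
class FeatureRule:
    """Rule learned for one feature of the exemplar (hypothetical structure)."""
    allow_missing: bool
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    categories: Optional[Set[Any]] = None


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def infer_rules(exemplar: List[Dict[str, Any]]) -> Dict[str, FeatureRule]:
    """Derive one rule per feature from the exemplar rows."""
    features = sorted({key for row in exemplar for key in row})
    rules: Dict[str, FeatureRule] = {}
    for feature in features:
        values = [row.get(feature) for row in exemplar]
        present = [v for v in values if v is not None]
        allow_missing = len(present) < len(values)
        if present and all(_is_number(v) for v in present):
            # Numeric feature: remember the observed range
            rules[feature] = FeatureRule(allow_missing, min(present), max(present))
        else:
            # Anything else is treated as categorical: remember the seen values
            rules[feature] = FeatureRule(allow_missing, categories=set(present))
    return rules


def find_issues(rows: List[Dict[str, Any]],
                rules: Dict[str, FeatureRule]) -> List[Tuple[int, str, str]]:
    """Return (row index, feature, issue label) for every broken rule."""
    issues: List[Tuple[int, str, str]] = []
    for index, row in enumerate(rows):
        for feature, rule in rules.items():
            value = row.get(feature)
            if value is None:
                if not rule.allow_missing:
                    issues.append((index, feature, "missing value"))
            elif rule.categories is not None:
                if value not in rule.categories:
                    issues.append((index, feature, "unseen category"))
            elif not _is_number(value) or not rule.minimum <= value <= rule.maximum:
                issues.append((index, feature, "out of range"))
    return issues


exemplar = [{"age": 30, "city": "Leeds"}, {"age": 45, "city": "York"}]
new_rows = [{"age": 99, "city": "Leeds"},
            {"age": 40, "city": None},
            {"age": 35, "city": "Bath"}]
rules = infer_rules(exemplar)
issues = find_issues(new_rows, rules)
```

In this toy example the exemplar never contains a missing `city`, so a missing value in the new data is flagged, as are an `age` outside the exemplar's range and a `city` the exemplar never saw. The point of the design is that the user supplies only a trusted dataset, not a hand-written rule set.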

The different kinds of data issues detected are:

Data Issues Scans

As with drift scans, you create a snapshot from the dataset you are assessing for issues and a comparison dataset. The example below runs a data issues scan in which one dataset is compared against an exemplar dataset.

snapshot = project.snapshots.create(name="Data Issues Snapshot",
                                    dataset=base_dataset,
                                    comparison_dataset=comparison_dataset,
                                    model=None)
snapshot.scan_data_issues()

If, however, we only want to scan a single dataset, the following example is more appropriate:

snapshot = project.snapshots.create(name="Data Issues Snapshot",
                                    dataset=base_dataset,
                                    model=None)
snapshot.scan_data_issues()

Several config parameters let you switch individual tests on or off, or restrict certain checks to specific features where applicable. For the full range of config options, see the example notebook for data issues.

Privacy Considerations

If you use data issues scans on our SaaS version, you have the option to leave out details on the Data profile charts by using the config option below:

base_snapshot = project.snapshots.create(name="Base Snapshot",
                                         dataset=base_dataset,
                                         model=None,
                                         generate_data_profiles=True)

However, the distribution charts might still reveal details that you do not want to leave your organization's environment. If you need help using or testing Etiq on premise or on your own instance, reach out to us directly at info@etiq.ai. Etiq is also available on AWS Marketplace (release 1.3.4).

Example Notebooks

For example notebooks, code and config files for data issues scans, please see repo link.
