Data Fingerprinting
Added in etiq 1.6.0
Data Fingerprinting enables rapid comparison of two related datasets by computing a set of metrics (a "fingerprint") for each feature, testing whether the fingerprints match, and using the results to determine the relationship between the two datasets.
Use Cases:
Oftentimes data scientists or analysts pick up issues with data based on higher-level aggregates, e.g. there is a discrepancy in the monthly sum for a given category of payment. With the fingerprinting feature, we have added these types of aggregate/pivot tests into our testing suite. Tests can be carried out on complete datasets, or on a subset of features.
Typical uses of the fingerprinting feature are:
"Fingerprinting" a dataset: calculating the dataset’s metrics (min, max, mean, median, missing, sum, unique, std for each column - more details on these below)
Testing whether two datasets’ metrics (fingerprints) match, within a certain tolerance
At the dataset level, testing whether two datasets have the same number of rows
Creating summary objects that provide detailed results from these tests
Metrics:
By default, for each dataset, the following metrics are determined for each column of a suitable type (a subset of these metrics can also be calculated if preferred). The overall count of rows in each dataset is also evaluated.
| Metric Name | Description | Per Table or Per Feature? |
| --- | --- | --- |
| count | How many rows are there in the dataset? | Table |
| min | Minimum value | Feature |
| max | Maximum value | Feature |
| mean | Mean value | Feature |
| median | Median value | Feature |
| missing | How many rows are missing values? | Feature |
| sum | Sum of values | Feature |
| unique | How many unique values? | Feature |
| std | Standard deviation | Feature |
Table 1: Metric names and descriptions.
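For intuition only, the sketch below computes roughly the same per-feature metrics with pandas. It is illustrative of what a fingerprint captures, not how etiq computes it internally.

```python
import pandas as pd

def illustrative_fingerprint(df: pd.DataFrame) -> dict:
    """Roughly the metrics a fingerprint captures (illustrative only)."""
    fingerprint = {"count": len(df)}  # table-level row count
    for column in df.select_dtypes("number").columns:
        series = df[column]
        fingerprint[column] = {
            "min": series.min(),
            "max": series.max(),
            "mean": series.mean(),
            "median": series.median(),
            "missing": int(series.isna().sum()),
            "sum": series.sum(),
            "unique": series.nunique(),
            "std": series.std(),
        }
    return fingerprint
```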
Data Relationships
Pairs of datasets can be related in four possible ways:
- pivot - one dataset is an aggregation of the other
- replica - one dataset has the same columns but different data (e.g. sales data from month to month)
- sampling - one dataset is a row-wise sample of another
- part - one dataset has a subset of columns from the other dataset (and all rows for those columns)
The user does not need to specify how the datasets are related; the relationship will be inferred from the datasets automatically.
Usage
Running a fingerprint scan is straightforward: you will need two etiq snapshot objects representing the two datasets. In your notebook or script you can simply call the scan.
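A minimal sketch of the call is shown below. The exact call pattern is an assumption (here the scan is invoked on one snapshot with the second snapshot passed in as the comparison), and `snapshot_a` and `snapshot_b` are placeholder names for snapshots you have already created.

```python
# Assumption: snapshot_a and snapshot_b are existing etiq snapshot objects
# built from the two datasets you want to compare.
segments, issues, issue_aggregates = snapshot_a.scan_fingerprint(snapshot_b)

# Each result is a pandas DataFrame:
#   segments         - the segments scanned (currently a single "all" segment)
#   issues           - one row per feature metric that did not match
#   issue_aggregates - counts of tests run and failed per feature/metric
print(issues)
```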
As with our other scan_* methods on each snapshot, the scan returns three pandas DataFrame objects containing the results.
If you are connected to the etiq dashboard in this session, the scan results will also be uploaded and shown under the snapshot.
Segments
A list of “segments” within the data. Presently just a single “all” segment.
Issues
A list of issues we’ve found with the data. An issue occurs when a feature metric does not match between tables.
We show the name of the feature and the metric which did not match.
You can supply arguments to the scan_fingerprint method to tune this output - for example, if you expect a feature to vary between datasets, you can adjust the margin so that false positives do not occur. See the API Usage section below for more.
Issue Aggregates
This table shows an aggregate of all tests run: for each feature and metric tested, it gives the number of tests that ran and how many of them failed.
The “threshold” column shows the error margin used for that test. See API Usage below for details.
API Usage
Margin Specification
Often the two datasets will not be identical, so you can specify a margin of error as a fraction between 0 (no difference tolerated) and 1 (complete difference). A margin of 1 will never flag any errors!
The margin can either be set for the entire dataset with the margin keyword when calling the scan.
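For example, a sketch assuming the same call pattern as above:

```python
# Allow up to a 5% difference in any metric before an issue is raised.
segments, issues, issue_aggregates = snapshot_a.scan_fingerprint(
    snapshot_b,
    margin=0.05,
)
```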
Alternatively, you may wish to specify the margin per feature, as your data may vary more in some columns than others. This is supported with the per_field_margin keyword, a dictionary whose keys are feature names and whose values are margins specified as floats. Features not included in this dictionary will use the default margin.
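For example (the feature names used here are purely illustrative):

```python
# "amount" may legitimately differ by up to 10%; "category" must match exactly.
# Any feature not listed falls back to the default margin.
segments, issues, issue_aggregates = snapshot_a.scan_fingerprint(
    snapshot_b,
    margin=0.02,
    per_field_margin={
        "amount": 0.10,
        "category": 0.0,
    },
)
```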
Metrics
There may be times when it does not make sense to run all metrics - perhaps you are only interested in one or two, or you know that others will vary too much between datasets resulting in false positives.
The “metrics” argument allows you to specify which metrics to run; the available metrics are listed above in the Metrics section.
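For example, a sketch assuming the accepted names match the metric names in Table 1:

```python
# Only compare row counts, sums and means; skip the other metrics.
segments, issues, issue_aggregates = snapshot_a.scan_fingerprint(
    snapshot_b,
    metrics=["count", "sum", "mean"],
)
```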