Snapshot

Logging a snapshot

Etiq works via a lightweight logging mechanism. You log your data and your model (together, a snapshot) and then run a scan on it, which is the testing functionality itself. As you experiment with more snapshots, you keep scanning your model versions, and all test results and issues found are sent to a centralised dashboard.

A snapshot is a combination of a dataset and a model, especially in pre-production testing. To start testing your system you need to log your snapshot to Etiq, which means logging the dataset and the model. For an end-to-end notebook example, go here.

Before you log a snapshot you will need to load your config file. Otherwise you will get an error.

# Log your dataset (SimpleDatasetBuilder can also be used)
dataset = etiq.BiasDatasetBuilder.dataset(data_encoded, label="<target-feature-name>")

# Log your already trained model
model = Model(model_architecture=standard_model, model_fitted=model_fit)

# Create a snapshot
snapshot = project.snapshots.create(name="<snapshot-name>", dataset=dataset, model=model, bias_params=etiq.BiasDatasetBuilder.bias_params())

For validation and production stages, snapshots are not produced in the course of experimentation; they are produced as a model is deployed and runs in production. From the point of view of Etiq’s logging mechanism, however, they are logged the same way: each time your model scores a new batch of data, it records a new snapshot. The information needed for testing differs slightly between production and pre-production, and so do the tests themselves. For drift-type tests, or in production generally, you might not have the model available; what you do need is the dataset you are checking for drift and the benchmark dataset you are comparing it against.

This is how you’d log it to Etiq:

# Log a dataset with the comparison data

dataset_s = etiq.SimpleDatasetBuilder.dataset(data_encoded, label="<target-feature-name>")

# Log a dataset with the data from your current view
todays_dataset_s = etiq.SimpleDatasetBuilder.dataset(todays_dataset_df, label="<target-feature-name>")

# Create the snapshot
snapshot = project.snapshots.create(name="<snapshot-name>", dataset=todays_dataset_s, comparison_dataset=dataset_s, model=None)
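
As an illustration of where the two dataframes above might come from, here is a minimal sketch that separates the most recent scoring batch from the earlier benchmark data using pandas only. The `scored_at` column, the toy values and the name `benchmark_df` are assumptions made for this example; `benchmark_df` plays the role of `data_encoded` in the snippet above.

```python
import pandas as pd

# Toy scoring log; "scored_at" is a hypothetical timestamp column.
scored = pd.DataFrame({
    "scored_at": pd.to_datetime(
        ["2024-03-01", "2024-03-01", "2024-03-02", "2024-03-02", "2024-03-03"]),
    "age": [31, 44, 52, 29, 38],
    "income": [0, 1, 1, 0, 1],
})

# Rows from the most recent day form the current view;
# everything earlier is used as the benchmark to compare against.
latest_day = scored["scored_at"].max()
is_latest = scored["scored_at"] == latest_day

todays_dataset_df = scored.loc[is_latest].drop(columns=["scored_at"])
benchmark_df = scored.loc[~is_latest].drop(columns=["scored_at"])
```

The two resulting dataframes can then be wrapped with `etiq.SimpleDatasetBuilder.dataset(...)` and passed as `dataset` and `comparison_dataset` respectively, as shown above.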

Dataset

At the moment we support uploading pandas or Spark dataframes to the Etiq dataset object, but we are adding new formats all the time. The dataset you use should already be transformed so that it can be passed to a model class from any of the supported libraries. While Etiq contains some transformations, we recommend using your own. With certain types of transformations (such as normalization), please do NOT apply the transformation to your whole dataset before splitting it into train, validation and test sets, as this can contribute to leakage. (We will be adding scans to check for this in the future.)
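
To make the leakage point concrete, the following is a minimal sketch, using pandas only, of normalizing after the split: the mean and standard deviation are computed on the training rows and then reused for the validation rows. The column names and values are illustrative assumptions, not part of Etiq.

```python
import pandas as pd

# Toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 63, 29, 40, 58, 33, 45],
    "income": [0, 0, 1, 1, 1, 0, 1, 1, 0, 1],
})

# 1. Split first.
train = df.sample(frac=0.8, random_state=0)
valid = df.drop(train.index)

# 2. Compute the normalization statistics from the training split only.
age_mean = train["age"].mean()
age_std = train["age"].std()

# 3. Apply those same training statistics to every split.
train = train.assign(age=(train["age"] - age_mean) / age_std)
valid = valid.assign(age=(valid["age"] - age_mean) / age_std)
```

Computing `age_mean` and `age_std` on the full dataframe instead would let information from the validation rows influence the training features.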

NB: Certain scans might not be available for certain datasets.

There are currently two types of datasets. These are the SimpleDataset and the BiasDataset. The SimpleDataset is a container for the data to be used by the Etiq package. This includes training, validation and testing data along with metadata identifying categorical, continuous, id and date features.

The BiasDataset contains additional metadata identifying "bias" features.

Create a simple dataset from a Pandas DataFrame

In order to create a simple dataset from a pandas dataframe the following function:

etiq.SimpleDatasetBuilder.dataset(features, target,
                                  label, cat_col, cont_col, id_col, date_col,
                                  convert_date_cols, datetime_format,
                                  train_valid_test_splits,
                                  random_seed,
                                  name)

can be used. The only required parameter is a dataframe containing the features (and target). If no other parameters are supplied etiq will either use defaults or make its best guess for the other parameters.

The target label (specified using the label parameter) or a separate dataframe containing the targets (specified using the target parameter) should also be specified; otherwise the last column of the features dataframe is chosen as the target by default. The other parameters are as follows:

  • cat_col - A list of columns containing categorical data

  • cont_col - A list of columns containing continuous data

  • id_col - A list of columns containing id data (note these are not used by the model)

  • date_col - A list of columns containing date information (note these are also not used by the model)

  • convert_date_cols - A True/False flag that determines whether or not to convert the datetime columns from strings to the native datetime format. (This defaults to False)

  • datetime_format - The format to use when converting the datetime columns.

  • train_valid_test_splits - Tuple containing the training/validation/testing split proportions

  • random_seed - Number used to seed the random number generator for random splits

  • name - The name to use for the dataset.

Note that the non-dataframe parameters can be loaded from the dataset section of a config file. See the dataset config options.
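
As a hedged sketch of how these parameters might be filled in explicitly rather than left to Etiq's best guess, the example below derives the categorical and continuous column lists from pandas dtypes. The dataframe, its column names, the split proportions and the strftime-style `datetime_format` value are assumptions for illustration; the Etiq call itself is wrapped in a function that is not executed here, and uses only the parameter names listed above.

```python
import pandas as pd

# Toy dataframe; all column names are illustrative.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "signup_date": ["2023-01-05", "2023-02-11", "2023-03-20", "2023-04-02"],
    "region": ["north", "south", "south", "east"],
    "age": [34, 51, 27, 45],
    "balance": [1200.5, 830.0, 4100.25, 15.0],
    "income": [1, 0, 1, 0],
})

label = "income"
id_col = ["customer_id"]
date_col = ["signup_date"]
excluded = set(id_col + date_col + [label])

# Remaining non-numeric columns are treated as categorical,
# remaining numeric columns as continuous.
cat_col = [c for c in df.columns
           if c not in excluded and not pd.api.types.is_numeric_dtype(df[c])]
cont_col = [c for c in df.columns
            if c not in excluded and pd.api.types.is_numeric_dtype(df[c])]

dataset_kwargs = dict(
    label=label, cat_col=cat_col, cont_col=cont_col,
    id_col=id_col, date_col=date_col,
    convert_date_cols=True, datetime_format="%Y-%m-%d",  # format string is an assumption
    train_valid_test_splits=(0.8, 0.1, 0.1), random_seed=42,
    name="customers",
)

def build_dataset(features):
    """Pass the explicit parameters to Etiq (requires etiq to be installed)."""
    import etiq
    return etiq.SimpleDatasetBuilder.dataset(features, **dataset_kwargs)
```

Listing the columns explicitly keeps the dataset definition reproducible even if a later version of the dataframe changes a column's dtype.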

If, however, the dataframe has already been split into training, validation and testing then the following function:

etiq.SimpleDatasetBuilder.datasets(training_features, training_target,
                                   validation_features, validation_target,
                                   testing_features, testing_target,
                                   label, cat_col, cont_col, id_col, date_col,
                                   convert_date_cols, datetime_format,
                                   name)

can be used where

  • training_features - a dataframe containing the training features

  • training_target - an (optional) dataframe containing the training target

  • validation_features - a dataframe containing the validation features

  • validation_target - an (optional) dataframe containing the validation target

  • testing_features - a dataframe containing the testing features

  • testing_target - an (optional) dataframe containing the testing target

The other parameters are identical to the previous dataset constructor.
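
For illustration, one way to produce the three pre-split dataframes with pandas alone is sketched below. The 60/20/20 proportions, the seed and the dataset name are assumptions; the Etiq call is placed in a function that is not executed here and uses only the keyword names from the signature above.

```python
import pandas as pd

# Toy data with a binary target column.
df = pd.DataFrame({"feature": range(10), "income": [0, 1] * 5})

# Shuffle once with a fixed seed, then slice into 60/20/20.
shuffled = df.sample(frac=1.0, random_state=17).reset_index(drop=True)
n_train = len(shuffled) * 6 // 10
n_valid = len(shuffled) * 2 // 10

train = shuffled.iloc[:n_train]
valid = shuffled.iloc[n_train:n_train + n_valid]
test = shuffled.iloc[n_train + n_valid:]

def build_presplit_dataset():
    """Wrap the pre-split dataframes (requires etiq to be installed)."""
    import etiq
    return etiq.SimpleDatasetBuilder.datasets(
        training_features=train,
        validation_features=valid,
        testing_features=test,
        label="income",
        name="presplit-example",
    )
```

Because the target column is left inside each features dataframe, the optional `*_target` arguments are omitted and the target is identified through `label`.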

Create a simple dataset from a Spark DataFrame

If the etiq.spark module is installed, a simple dataset can be constructed from a spark dataframe. The two constructors that can be used are as follows:

etiq.SimpleSparkDatasetBuilder.dataset(features,
                                       label, cat_col, cont_col, id_col, date_col,
                                       convert_date_cols, datetime_format,
                                       train_valid_test_splits,
                                       random_seed,
                                       name)

where features is a spark dataframe and the other parameters are as described in the previous section. To create a simple dataset from "pre-split" spark dataframes we use

etiq.SimpleSparkDatasetBuilder.datasets(training_features, 
                                        validation_features, 
                                        testing_features, 
                                        label, cat_col, cont_col, id_col, date_col,
                                        convert_date_cols, datetime_format,
                                        name)

where the parameters are as described in the previous section.

NB: Both features and targets have to be defined in the same spark dataframe.

Create a Bias dataset from a Pandas DataFrame

In order to create a bias dataset from a pandas dataframe the following function:

etiq.BiasDatasetBuilder.dataset(features, target,
                                label, cat_col, cont_col, id_col, date_col,
                                convert_date_cols, datetime_format,
                                train_valid_test_splits,
                                bias_params,
                                random_seed,
                                name)

can be used, where

  • bias_params - a named tuple specifying the bias metadata.

The other parameters are as described for the simple dataset.

The corresponding function for creating a bias dataset from pre-split training, validation and testing dataframes (at least one needs to be specified) is

etiq.BiasDatasetBuilder.datasets(training_features, training_target,
                                 validation_features, validation_target,
                                 testing_features, testing_target,
                                 label, cat_col, cont_col, id_col, date_col,
                                 convert_date_cols, datetime_format,
                                 bias_params,
                                 name)

Create a Bias dataset from a Spark DataFrame

In order to create a bias dataset from a spark dataframe the following function:

etiq.BiasSparkDatasetBuilder.dataset(features, target,
                                     label, cat_col, cont_col, id_col, date_col,
                                     convert_date_cols, datetime_format,
                                     train_valid_test_splits,
                                     bias_params,
                                     random_seed,
                                     name)

can be used, where

  • bias_params - a named tuple specifying the bias metadata.

The other parameters are as described for the simple dataset.

The corresponding function for creating a bias dataset from pre-split training, validation and testing spark dataframes (at least one needs to be specified) is

etiq.BiasSparkDatasetBuilder.datasets(training_features, 
                                      validation_features,
                                      testing_features, 
                                      label, cat_col, cont_col, id_col, date_col,
                                      convert_date_cols, datetime_format,
                                      bias_params,
                                      name)

Model

You can use any already trained model from the supported libraries: XGBoost, LightGBM, PyTorch, TensorFlow, Keras and scikit-learn. Etiq should also be compatible with any model that follows the scikit-learn fit/predict convention.
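
To show what the fit/predict convention means in practice, here is a minimal sketch of a class that follows it, written in plain Python. The class `MajorityClassModel` is hypothetical and exists only for this example; whether Etiq accepts any particular custom class is not confirmed here.

```python
from collections import Counter

class MajorityClassModel:
    """Toy model that always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learn from the training data and return self, as sklearn estimators do.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Return one prediction per input row.
        return [self.majority_ for _ in range(len(X))]

X_train = [[25], [40], [58], [33]]
y_train = [0, 1, 1, 1]

standard_model = MajorityClassModel()
model_fit = standard_model.fit(X_train, y_train)
predictions = model_fit.predict([[30], [61]])
```

The key points are that `fit(X, y)` trains on features and labels and returns the fitted object, and that `predict(X)` returns one label per row.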

For example purposes, we also provide out-of-the-box model architectures for some model types: DefaultXGBoostClassifier (a wrapper around the XGBoost classifier), DefaultRandomForestClassifier (a wrapper around the random forest classifier from sklearn) and DefaultLogisticRegression (a wrapper around the logistic regression classifier from sklearn). However, most use cases will use your own fitted model or a pre-calculated model.

To use an already fitted model, use the following syntax. For a notebook example, go here.


# Load the dataset
dataset = etiq.BiasDatasetBuilder.datasets(training_features=test,
                                           validation_features=valid)
# Load the bias parameters
bias_params = etiq.BiasDatasetBuilder.bias_params()

# Create your already trained model and log it.
model = Model(model_architecture=standard_model, model_fitted=model_fit)

# Create a Snapshot
snapshot = project.snapshots.create(name="<snapshot-name>",
                                    dataset=dataset,
                                    model=model,
                                    bias_params=bias_params)

To use a wrapper model from the Etiq library, use the following syntax. For a notebook example, go here.

# Log your dataset
dataset = etiq.BiasDatasetBuilder.dataset(data_encoded, label="<target-feature-name>")

# Load our model
from etiq.model import DefaultXGBoostClassifier
model = DefaultXGBoostClassifier()

# Creating a snapshot
snapshot = project.snapshots.create(name="<snapshot-name>", dataset=dataset, model=model, bias_params=etiq.BiasDatasetBuilder.bias_params())

Pre-calculated model

Sometimes it is not possible to use a machine learning model directly in Python, but we still want to evaluate the model's performance. To accommodate such a use case we provide a "pre-calculated" model. This simply contains the predicted labels of the model we would like to evaluate on a dataset.

An example of how to use such a model is provided below.

# This bit is done by someone with access to the model
from sklearn.model_selection import train_test_split

train, valid = train_test_split(data_encoded, test_size=0.2, random_state=17)
train.reset_index(inplace=True, drop=True)
valid.reset_index(inplace=True, drop=True)

# Split data then train the model
y_train = train['income'].copy() # labels we're going to train the model to predict
x_train = train.drop(columns=['income'])
y_valid = valid['income'].copy() 
x_valid = valid.drop(columns=['income'])

# Train a model to predict 'income'
# (MyCustomModel is a placeholder for your own model class)
standard_model = MyCustomModel()
model_fit = standard_model.fit(x_train, y_train)
y_train_pred = standard_model.predict(x_train)
y_valid_pred = standard_model.predict(x_valid)

model_df = data_encoded.copy()
model_df['predicted'] = standard_model.predict(data_encoded.drop(columns=['income']))
model_df.to_csv('data-with-predictions.csv', index=False)

# The precalculated csv file is then provided.
# This bit can be run by someone without access to the model
import pandas as pd
import etiq

labelled_data_with_predictions = pd.read_csv('data-with-predictions.csv')
labelled_data_without_predictions = labelled_data_with_predictions.drop(
    columns=['predicted'])
precalc_model = etiq.model.PrecalculatedModel(labelled_data_with_predictions,
                                              prediction_label='predicted')
dataset = etiq.SimpleDatasetBuilder.dataset(labelled_data_without_predictions,
                                            label="income")


# Creating a snapshot
snapshot = project.snapshots.create(name="<snapshot-name>",
                                    dataset=dataset,
                                    model=precalc_model)
# Run model performance scans                                     
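
Before handing the CSV over, it can help to sanity-check that the prediction column lines up with the labels. The sketch below does this with pandas only, on a small in-memory dataframe that stands in for `data-with-predictions.csv`; the values are invented for illustration and this check is not an Etiq scan.

```python
import pandas as pd

# Stand-in for the contents of 'data-with-predictions.csv' (toy values).
labelled_data_with_predictions = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "income": [0, 1, 1, 0],
    "predicted": [0, 1, 0, 0],
})

# Fraction of rows where the prediction matches the label.
accuracy = (labelled_data_with_predictions["income"]
            == labelled_data_with_predictions["predicted"]).mean()

# Counts of actual vs. predicted labels.
confusion = pd.crosstab(labelled_data_with_predictions["income"],
                        labelled_data_with_predictions["predicted"])
```

If the accuracy looks implausibly low, a common cause is that the predictions were written in a different row order from the labelled data.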
