Snapshot
Logging a snapshot
Etiq works via a lightweight logging mechanism. You log your data and your model (a snapshot) and then you run a scan on it, which is the testing functionality itself. As you experiment with more and more snapshots, you keep scanning your model versions, and all the test results and issues found are sent to a centralised dashboard.
A snapshot is a combination of a dataset and a model, especially in pre-production testing. To start testing your system you need to log your snapshot to Etiq, which means logging both the dataset and the model. For an end-to-end notebook example, go here.
Before you log a snapshot you will need to load your config file; otherwise you will get an error.
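For instance (a minimal sketch; the `load_config` helper and the config filename follow the Etiq examples and should be checked against your version):

```python
import etiq

# Load the testing config before logging anything - logging a snapshot
# without a loaded config raises an error.
etiq.load_config("./config.json")
```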
In the validation and production stages, snapshots are not produced in the course of experimentation; they are produced as a model is deployed and runs in production. From the point of view of Etiq's logging mechanism, however, they are logged in the same way: each time your model scores a new batch of data, you record a new snapshot. The information needed for testing differs slightly between production and pre-production, and the tests themselves differ a bit too. For drift-type tests, or in production generally, you might not have the model available; more importantly, you need the dataset you are checking for drift and the benchmark dataset you are comparing it against.
This is how you’d log it to Etiq:
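Put together, the flow looks like this. This is a sketch: the names used (`login`, `projects.open`, `snapshots.create`, `scan_accuracy_metrics`) follow the Etiq examples but should be checked against your version's API reference, and `df` and `my_fitted_model` are placeholders for your own data and model.

```python
import etiq

# Connect to the Etiq dashboard (token from your account page).
etiq.login("https://dashboard.etiq.ai/", "<your-token>")

# Load the testing config (required before logging a snapshot).
etiq.load_config("./config.json")

project = etiq.projects.open(name="Demo project")

# Build a dataset and attach your fitted model (see the sections below).
dataset = etiq.SimpleDatasetBuilder.dataset(features=df, label="target")
model = etiq.Model(model_architecture=my_fitted_model, model_fitted=True)

# A snapshot ties the dataset and model together.
snapshot = project.snapshots.create(name="Snapshot 1",
                                    dataset=dataset,
                                    model=model)

# Run a scan - results are sent to the dashboard.
snapshot.scan_accuracy_metrics()
```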
Dataset
At the moment we support uploading pandas or Spark dataframes to the Etiq dataset object, and we are adding new formats all the time. The dataset you use should already be transformed so that it can be input to a model class from any of the libraries mentioned. While Etiq contains some transformations, we recommend using your own. With certain types of transformation (such as normalization), please do NOT apply the transformation to your whole dataset prior to splitting it into train/validation/test, as this can contribute to leakage. (We will be adding scans to check for this in the future.)
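To illustrate the leakage point with scikit-learn: split first, then fit the transformation on the training set only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Split FIRST...
train, test = train_test_split(df, test_size=0.2, random_state=42)

# ...then fit the scaler on the training set only and apply it to both.
scaler = StandardScaler().fit(train[["x"]])
train_scaled = scaler.transform(train[["x"]])
test_scaled = scaler.transform(test[["x"]])

# Fitting the scaler on the full dataframe instead would leak the test
# set's statistics (mean/std) into the training pipeline.
```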
NB: Certain scans may not be available for certain datasets.
There are currently two types of datasets. These are the SimpleDataset and the BiasDataset. The SimpleDataset is a container for the data to be used by the Etiq package. This includes training, validation and testing data along with metadata identifying categorical, continuous, id and date features.
The BiasDataset contains additional metadata identifying "bias" features.
Create a simple dataset from a Pandas DataFrame
A simple dataset can be created from a pandas dataframe using the dataset builder function. The only required parameter is a dataframe containing the features (and target). If no other parameters are supplied, Etiq will either use defaults or make its best guess for the other parameters.
The target label (specified using the label parameter), or a separate dataframe containing the targets (specified using the target parameter), should also be given; otherwise the last column of the features dataframe will be chosen as the target by default. The other parameters are as follows:
cat_col - A list of columns containing categorical data
cont_col - A list of columns containing continuous data
id_col - A list of columns containing id data (note these are not used by the model)
date_col - A list of columns containing date information (note these are also not used by the model)
convert_date_cols - A True/False flag that determines whether or not to convert the datetime columns from strings to the native datetime format. (This defaults to False)
datetime_format - The format to use when converting the datetime columns.
train_valid_test_splits - Tuple containing the training/validation/testing split proportions
random_seed - Number used to seed the random number generator for random splits
name - The name to use for the dataset.
Note that the non-dataframe parameters can be loaded from the dataset section of a config file. See the dataset config options.
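A sketch of this constructor, assuming it is exposed as `SimpleDatasetBuilder.dataset` (check the API reference for the exact name and signature); the column names are illustrative:

```python
import etiq
import pandas as pd

df = pd.read_csv("adult.csv")  # features plus target column

dataset = etiq.SimpleDatasetBuilder.dataset(
    features=df,
    label="income",                          # target column
    cat_col=["workclass", "education"],      # categorical features
    cont_col=["age", "hours-per-week"],      # continuous features
    train_valid_test_splits=(0.6, 0.2, 0.2), # split proportions
    random_seed=42,
    name="adult-dataset",
)
```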
If, however, the dataframe has already been split into training, validation and testing sets, a pre-split builder function can be used instead, where:
training_features - a dataframe containing the training features
training_target - an (optional) dataframe containing the training target
validation_features - a dataframe containing the validation features
validation_target - an (optional) dataframe containing the validation target
testing_features - a dataframe containing the testing features
testing_target - an (optional) dataframe containing the testing target
The other parameters are identical to the previous dataset constructor.
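A sketch under the same caveat, assuming a `SimpleDatasetBuilder.datasets` builder that takes the pre-split dataframes (`train_x`, `train_y`, etc. are placeholders for your own splits):

```python
import etiq

dataset = etiq.SimpleDatasetBuilder.datasets(
    training_features=train_x,
    training_target=train_y,
    validation_features=valid_x,
    validation_target=valid_y,
    testing_features=test_x,
    testing_target=test_y,
    cat_col=["workclass", "education"],
    name="adult-presplit",
)
```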
Create a simple dataset from a Spark DataFrame
If the etiq.spark module is installed, a simple dataset can be constructed from a Spark dataframe. Two constructors are available: one takes a single features Spark dataframe, with the other parameters as described in the previous section; the other creates a simple dataset from "pre-split" Spark dataframes, again with the parameters described in the previous section.
NB: Both features and targets have to be defined in the same spark dataframe.
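A sketch assuming the Spark builder mirrors the pandas one; the import path and class name here are assumptions, not confirmed API, so check the etiq.spark reference:

```python
from etiq.spark import SimpleSparkDatasetBuilder  # assumed import path

# features_df is a Spark dataframe holding both the feature columns
# and the target column (they must live in the same dataframe).
dataset = SimpleSparkDatasetBuilder.dataset(features=features_df,
                                            label="income",
                                            cat_col=["workclass"])
```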
Create a Bias dataset from a Pandas DataFrame
In order to create a bias dataset from a pandas dataframe, use the bias dataset builder function, where:
bias_param - a named tuple specifying the bias metadata.
The other parameters are as described for the simple dataset.
The corresponding function for creating a bias dataset from pre-split training, validation and testing dataframes (at least one needs to be specified) is also available.
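A sketch, assuming a `BiasParams` named tuple and a `BiasDatasetBuilder` mirroring the simple builder (check the API reference for exact names and fields):

```python
import etiq
from etiq import BiasParams  # named tuple holding the bias metadata

bias_params = BiasParams(protected="gender",
                         privileged="male",
                         unprivileged="female",
                         positive_outcome_label=1,
                         negative_outcome_label=0)

dataset = etiq.BiasDatasetBuilder.dataset(features=df,
                                          label="income",
                                          bias_param=bias_params)
```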
Create a Bias dataset from a Spark DataFrame
In order to create a bias dataset from a Spark dataframe, use the Spark bias dataset builder function, where:
bias_param - a named tuple specifying the bias metadata.
The other parameters are as described for the simple dataset.
The corresponding function for creating a bias dataset from pre-split training, validation and testing Spark dataframes (at least one needs to be specified) is also available.
Model
You can use any already trained model from the supported libraries: XGBoost, LightGBM, PyTorch, TensorFlow, Keras and scikit-learn. Etiq should be compatible with any model that follows the sklearn fit/predict convention.
For example purposes, we also provide out-of-the-box model architectures for some model types: DefaultXGBoostClassifier (a wrapper around the XGBoost classifier), DefaultRandomForestClassifier (a wrapper around the random forest classifier from sklearn) and DefaultLogisticRegression (a wrapper around the logistic regression classifier from sklearn). However, most use cases will use your own fitted model or a pre-calculated model.
To use a model you have already fitted, call it with the following syntax. For a notebook example, go here.
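For example, any scikit-learn-style model that has already been fitted can be handed over; the Etiq wrapper name in the final comment is an assumption to be checked against the API reference:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Any model following the sklearn fit/predict convention works.
clf = RandomForestClassifier(random_state=0).fit(X, y)
preds = clf.predict(X)

# The fitted model is then passed to Etiq when the snapshot is created,
# e.g. (name hedged):
#   model = etiq.Model(model_architecture=clf, model_fitted=True)
```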
To use a wrapper model from the Etiq library, call it with the following syntax. For a notebook example, go here.
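A sketch using one of the default wrappers listed above; the import location is an assumption, and the `project` and `dataset` objects are assumed to come from the snapshot-logging step:

```python
import etiq

# The out-of-the-box wrapper is fitted by Etiq on the snapshot's
# training split rather than being fitted by you in advance.
model = etiq.DefaultXGBoostClassifier()

snapshot = project.snapshots.create(name="snapshot-default-model",
                                    dataset=dataset,
                                    model=model)
```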
Pre-calculated model
Sometimes it is not possible to directly use a machine learning model in Python, for various reasons. However, we still want to evaluate the model's performance. To accommodate such a use case we make available a "pre-calculated" model. This simply contains the prediction labels for the model we would like to evaluate on a dataset.
An example of how to use such a model is provided below.
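A sketch of the idea: the predictions come from outside Python (for example a SQL or SAS scoring job), and only the labels are handed over. The Etiq wrapper name in the final comment is an assumption to be checked against the API reference.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Prediction labels produced outside Python, loaded as an array aligned
# row-for-row with the evaluation dataset.
predictions = np.array([1, 0, 1, 1, 0, 1])
actuals = np.array([1, 0, 0, 1, 0, 1])

# You can sanity-check them locally before logging...
acc = accuracy_score(actuals, predictions)

# ...and pass them to Etiq's pre-calculated model, e.g. (name hedged):
#   model = etiq.Model(predictions=predictions)
```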