Config

Configuration file description

An etiq config file is a JSON file that sets default parameters for loading datasets and running scans. Using a config file is entirely optional: any scan can be run without providing default parameters in a config. Parameters set in the config file for a dataset can also be overridden by passing explicit arguments to the scan itself.

A config can be loaded globally with the load_config function. Once loaded, the config persists throughout the session unless a different config is subsequently loaded. For example:

etiq.load_config("./config_demo.json")

assuming the config file is config_demo.json.

A context manager equivalent is also provided. It can be used as follows:

    with etiq.etiq_config("./config_demo.json"):
        # Scans under this config
        ...
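
Because the config is plain JSON, it can also be generated from Python. The sketch below uses only the standard library to write a small config file and read it back. The values and the temporary location are illustrative, and the etiq call is left as a comment because it needs etiq installed.

```python
import json
import tempfile
from pathlib import Path

# Illustrative values only; the keys mirror the examples on this page.
config = {
    "dataset": {
        "label": "income",
        "train_valid_test_splits": [0.8, 0.2, 0.0],
    },
    "scan_accuracy_metrics": {
        "thresholds": {"accuracy": [0.8, 1.0]},
    },
}

# Write the config to a temporary directory as config_demo.json.
config_path = Path(tempfile.mkdtemp()) / "config_demo.json"
config_path.write_text(json.dumps(config, indent=4))

# Reading the file back confirms that it is valid JSON.
loaded = json.loads(config_path.read_text())
print(sorted(loaded))

# With etiq installed, the file could then be loaded as described above:
# etiq.load_config(str(config_path))
```

Generating the file this way avoids hand-editing mistakes such as missing commas or unquoted keys.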

These config options were built to match our existing scans and requirements. If you need an option that you cannot find, reach out to us (info@etiq.ai). In the future you will be able to add further options yourself.

Config Structure

  • A 'dataset' section containing default parameters to be used when loading datasets. These include bias parameters if applicable. For example:

        "dataset": {
            "label": "income",
            "bias_params": {
                "protected": "gender",
                "privileged": 1,
                "unprivileged": 0,
                "positive_outcome_label": 1,
                "negative_outcome_label": 0
            },
            "train_valid_test_splits": [0.0, 1.0, 0.0],
            "remove_protected_from_features": true
        }
  • Scan-specific sections, one for each type of scan, which hold the metrics, thresholds and other options available for that scan. These sections are optional and do not have to be provided in order to run the scans. An example for "scan_accuracy_metrics":

        "scan_accuracy_metrics": {
            "thresholds": {
                "accuracy": [0.8, 1.0],
                "true_pos_rate": [0.6, 1.0],
                "true_neg_rate": [0.6, 1.0]
            },
            "positive_outcome_label": 1,
            "negative_outcome_label": 0
        }

This allows us to run the corresponding scan with just those default parameters, i.e.

snapshot.scan_accuracy_metrics()

If for some reason we want to run an accuracy metrics scan with the positive and negative outcome labels flipped, we can override the default parameters in the config file by running

snapshot.scan_accuracy_metrics(positive_outcome_label=0, negative_outcome_label=1)
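
The override behaviour described above can be pictured as a dictionary merge in which explicit arguments win over config defaults. This is only a sketch of the idea: resolve_params is a hypothetical helper, not part of etiq, and treating None as "not provided" is an assumption.

```python
def resolve_params(config_defaults, **overrides):
    """Return the config defaults updated with explicitly passed arguments."""
    params = dict(config_defaults)  # copy so the defaults stay untouched
    # Assumption: an argument left as None means "use the config value".
    params.update({k: v for k, v in overrides.items() if v is not None})
    return params


defaults = {"positive_outcome_label": 1, "negative_outcome_label": 0}

# No explicit arguments: the config defaults are used as they are.
unchanged = resolve_params(defaults)

# Explicit arguments take precedence over the config defaults.
flipped = resolve_params(defaults, positive_outcome_label=0, negative_outcome_label=1)
print(flipped)
```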

Depending on the type of scan you are running and the stage (pre-production vs. in production), multiple additional config options are available.

Dataset Details Options

The Scan Types section describes which dataset parameters are relevant for each scan, e.g. bias_params. Some parameters, however, can be used across scans:

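  • train_valid_test_splits - the proportions of the dataset assigned to the training, validation and test splits, in that order, e.g. [0.8, 0.2, 0.0].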

  • To specify which features are categorical and which are continuous, use cat_col and cont_col, each followed by a list of feature names. You can also do this outside the config if that is easier, as described in the Dataset section.

  • remove_protected_from_features. When you are building a model in a regulated sector, you will not be able to use a protected demographic feature directly in the model. However, you still need the protected feature(s) in your dataset to assess whether you have a bias issue. Other use cases may not have this restriction. This option therefore lets you either treat the protected feature(s) as part of the model, or use them only to assess whether the model has a bias issue.

    "dataset": {
        "label": "income",
        "bias_params": {
            "protected": "gender",
            "privileged": 1,
            "unprivileged": 0,
            "positive_outcome_label": 1,
            "negative_outcome_label": 0
        },
        "train_valid_test_splits": [0.8, 0.2, 0.0],
        "cat_col": ["workclass", "relationship", "occupation", "gender", "race", "native-country", "marital-status", "income", "education"],
        "cont_col": ["age", "educational-num", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"], 
        "remove_protected_from_features": true
    }
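
Before loading a config it can be useful to sanity-check the dataset section. The sketch below is not part of etiq: check_dataset_section is a hypothetical helper, and the rules it applies (three non-negative splits that sum to 1, and no feature listed as both categorical and continuous) are assumptions drawn from the examples on this page.

```python
import math


def check_dataset_section(dataset):
    """Return a list of human-readable problems found in a dataset section."""
    problems = []

    splits = dataset.get("train_valid_test_splits")
    if splits is not None:
        if len(splits) != 3 or any(s < 0 for s in splits):
            problems.append("train_valid_test_splits must be three non-negative numbers")
        elif not math.isclose(sum(splits), 1.0):
            # Assumption: the three proportions are expected to sum to 1.
            problems.append("train_valid_test_splits should sum to 1.0")

    # A feature should not be declared both categorical and continuous.
    cat_cols = set(dataset.get("cat_col") or [])
    cont_cols = set(dataset.get("cont_col") or [])
    overlap = cat_cols & cont_cols
    if overlap:
        problems.append(f"listed as both categorical and continuous: {sorted(overlap)}")

    return problems


good = {
    "label": "income",
    "train_valid_test_splits": [0.8, 0.2, 0.0],
    "cat_col": ["workclass", "gender"],
    "cont_col": ["age", "hours-per-week"],
}
bad = {
    "train_valid_test_splits": [0.8, 0.3, 0.0],
    "cat_col": ["age", "gender"],
    "cont_col": ["age"],
}
print(check_dataset_section(good))
print(check_dataset_section(bad))
```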

RCA Type scans

  • Defining the issue based on thresholds. When searching for issues you can specify which side of the interval should be considered, e.g. if you only want to find issues where accuracy is under a certain threshold rather than above it, you can set the option "ignore_upper_threshold": true

  • Adding a minimum segment size, so that only segments big enough for your use case are surfaced.

  • Filtering which metrics are calculated, e.g. "metric_filter": ["accuracy", "true_pos_rate", "true_neg_rate"]. This helps you run only what you need.

    "scan_accuracy_metrics_rca": {
        "thresholds": {
            "accuracy": [0.0, 0.5],
            "true_pos_rate": [0.0, 0.5],
            "true_neg_rate": [0.0, 0.5]
        },
        "ignore_lower_threshold": false,
        "ignore_upper_threshold": true,
        "metric_filter": ["accuracy", "true_pos_rate", "true_neg_rate"],
        "minimum_segment_size": 1000
    },
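
To illustrate what the minimum segment size and metric filter do conceptually, the sketch below applies both to some made-up segment results. The segment structure and the filter_segments helper are hypothetical and do not reflect how etiq represents or processes segments internally.

```python
rca_config = {
    "metric_filter": ["accuracy", "true_pos_rate"],
    "minimum_segment_size": 1000,
}

# Made-up segment results, for illustration only.
segments = [
    {"name": "A", "size": 2500,
     "metrics": {"accuracy": 0.42, "true_pos_rate": 0.55, "true_neg_rate": 0.61}},
    {"name": "B", "size": 300,
     "metrics": {"accuracy": 0.35, "true_pos_rate": 0.40, "true_neg_rate": 0.30}},
]


def filter_segments(segments, config):
    """Drop segments that are too small and keep only the requested metrics."""
    min_size = config.get("minimum_segment_size") or 0
    wanted = config.get("metric_filter")
    kept = []
    for segment in segments:
        if segment["size"] < min_size:
            continue  # too small to be surfaced
        metrics = segment["metrics"]
        if wanted is not None:
            metrics = {m: v for m, v in metrics.items() if m in wanted}
        kept.append({**segment, "metrics": metrics})
    return kept


surfaced = filter_segments(segments, rca_config)
print(surfaced)
```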

Drift Type Scans

For drift metrics scans, both the regular scan and the RCA scan, a few different options are available:

  • For both scan_drift_metrics and scan_drift_metrics_rca, you will be able to select which metrics you want using the option "drift_measures". For more info on what drift measures you can choose check out the Drift section.

  • As with typical RCA scans, for the drift RCA scan you can choose how the thresholds are handled and set the minimum segment size, below which a segment will not be considered an issue.

  • Features: you can restrict the drift scan to specific features if you wish. The config option is "features", followed by the names of the features.

  • Number of bins: for target and concept drift type scans, you can choose the number of bins you want to use. We will add binning options for feature drift type scan as well in the near future.

    "scan_drift_metrics": {
        "thresholds": {
            "psi": [0.0, 0.15],
            "kolmogorov_smirnov": [0.05, 1.0]
        },
        "drift_measures": ["kolmogorov_smirnov", "psi"]       
    },
    "scan_drift_metrics_rca": {
        "thresholds": {
            "psi": [0.0, 0.15],
            "kolmogorov_smirnov": [0.05, 1.0]
        },
        "drift_measures": ["psi", "kolmogorov_smirnov"],
        "ignore_lower_threshold": true,
        "ignore_upper_threshold": false,
        "features": null,
        "minimum_segment_size": 1000 
    },
    "scan_target_drift_metrics_rca": {
        "thresholds": {
            "psi": [0.0, 0.15],
            "kolmogorov_smirnov": [0.05, 1.0]
        },
        "drift_measures": ["psi", "kolmogorov_smirnov"],
        "ignore_lower_threshold": true,
        "ignore_upper_threshold": false,
        "features": null,
        "minimum_segment_size": 1000 
    },
    "scan_concept_drift_metrics": {
        "thresholds": {
            "earth_mover_distance": [0.0, 0.2],
            "kl_divergence": [0.0, 0.2],
            "jensen_shannon_distance": [0.0, 0.2]
        },
        "drift_measures": ["earth_mover_distance", "kl_divergence", "jensen_shannon_distance"],
        "number_of_bins": 10
    },
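
As background on how a binned drift measure and its threshold interact, the sketch below computes the population stability index (PSI) with the standard formula, the sum over bins of (actual - expected) * ln(actual / expected). It is not etiq's implementation: the equal-width bins taken from the baseline range, the small eps used for empty bins, and reading the configured interval as the acceptable range are all assumptions.

```python
import math


def psi(expected, actual, number_of_bins=10, eps=1e-6):
    """Population stability index between a baseline and a comparison sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / number_of_bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * number_of_bins
        for v in values:
            # Values outside the baseline range fall into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), number_of_bins - 1)
            counts[idx] += 1
        # Replace empty bins with eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


baseline = list(range(100))
shifted = [v + 30 for v in baseline]

lower, upper = [0.0, 0.15]  # the "psi" interval from the config example
same_score = psi(baseline, baseline)
drift_score = psi(baseline, shifted)
print(same_score, lower <= same_score <= upper)
print(drift_score, lower <= drift_score <= upper)
```

The number of bins changes how finely the two distributions are compared, which is why it is exposed as a config option for binned drift scans.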

Scans with Correlation/Association Measures

There are scan types which are based on different types of correlation/association measures. These include scan_bias_sources:

    "scan_bias_sources": {
        "auto": true,
        "nr_groups": 20,
        "continuous_continuous_measure"  :  "pearsons",
        "categorical_categorical_measure": "cramersv",
        "categorical_continuous_measure": "rankbiserial",
        "binary_continuous_measure": "pointbiserial"
    },

as well as scans related to leakage, such as scan_target_leakage and scan_demographic_leakage:

    "scan_target_leakage": {
        "leakage_threshold": 0.85,
        "minimum_segment_size": 1000,
        "continuous_continuous_measure": "pearsons",
        "categorical_categorical_measure": "cramersv",
        "categorical_continuous_measure": "rankbiserial",
        "binary_continuous_measure": "pointbiserial"
    },
    "scan_demographic_leakage": {
        "leakage_threshold": 0.85,
        "minimum_segment_size": 1000,
        "continuous_continuous_measure": "pearsons",
        "categorical_categorical_measure": "cramersv",
        "categorical_continuous_measure": "rankbiserial",
        "binary_continuous_measure": "pointbiserial"
    }

As you can see from the examples above, you have options on what correlation type to use based on the type of features you are looking to assess for correlations:

  • "continuous_continuous_measure" : "pearsons"

  • "categorical_categorical_measure": "cramersv"

  • "categorical_continuous_measure": "rankbiserial"

  • "binary_continuous_measure": "pointbiserial"

The options above are the defaults, but you can customize them. The option names are fairly self-explanatory: for example, "continuous_continuous_measure": "pearsons" means that Pearson's correlation is used when both features are continuous. Remember that to make full use of these options you should specify which features are categorical, binary and continuous, either in the config or when running the scans.
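
The mapping from feature types to measures can be pictured as a simple lookup on the option names. The sketch below is illustrative only: measure_for is a hypothetical helper rather than an etiq function, and "my_measure" is a placeholder, not a measure name that etiq is known to accept.

```python
DEFAULT_MEASURES = {
    "continuous_continuous_measure": "pearsons",
    "categorical_categorical_measure": "cramersv",
    "categorical_continuous_measure": "rankbiserial",
    "binary_continuous_measure": "pointbiserial",
}


def measure_for(type_a, type_b, scan_config=None):
    """Look up the configured measure for a pair of feature types."""
    # Values in the scan's config section take precedence over the defaults.
    options = {**DEFAULT_MEASURES, **(scan_config or {})}
    # The pair is unordered, so try both orderings of the option name.
    for key in (f"{type_a}_{type_b}_measure", f"{type_b}_{type_a}_measure"):
        if key in options:
            return options[key]
    raise KeyError(f"no measure configured for {type_a}/{type_b}")


both_continuous = measure_for("continuous", "continuous")
mixed = measure_for("continuous", "categorical")
# Placeholder override value, used here only to show precedence.
custom = measure_for("continuous", "continuous",
                     {"continuous_continuous_measure": "my_measure"})
print(both_continuous, mixed, custom)
```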
