Version: 0.11.0

# Model Interpretation on Spark

## Interpretable Machine Learning

Interpretable Machine Learning helps developers, data scientists and business stakeholders gain a comprehensive understanding of their machine learning models. It can also be used to debug models, explain predictions and enable auditing to meet regulatory requirements.

## Why run model interpretation on Spark

Model-agnostic interpretation methods can be computationally expensive because each explanation requires many evaluations of the model. Model interpretation on Spark enables users to interpret a black-box model at massive scale with the Apache Spark™ distributed computing ecosystem. Various components support local interpretation for tabular, vector, image and text classification models, with two popular model-agnostic interpretation methods: LIME and Kernel SHAP.

## Usage

Both LIME and Kernel SHAP are local interpretation methods. Local interpretation explains why the model predicts a certain outcome for a given observation.

Both explainers extend org.apache.spark.ml.Transformer. After setting the explainer parameters, call the transform function on a DataFrame of observations to interpret the model's behavior on those observations.
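As a rough sketch of that workflow in PySpark, the helper below configures a TabularSHAP explainer with the common params described in this document and then calls transform. It assumes the SynapseML Python package exposes the explainers under synapse.ml.explainers. The helper name, the "shapValues" column name and the choice of class index 1 are illustrative, not prescribed by the library.

```python
def explain_with_tabular_shap(model, observations, background, input_cols):
    """Return `observations` with an added column of Kernel SHAP explanations."""
    # Imported lazily so this helper can be defined without a Spark session;
    # calling it requires SynapseML to be installed on the cluster (assumption).
    from synapse.ml.explainers import TabularSHAP

    explainer = TabularSHAP(
        model=model,                # fitted Transformer to be explained
        inputCols=input_cols,       # columns the black-box model consumes
        backgroundData=background,  # DataFrame containing all input columns
        targetCol="probability",    # explain the class-probability output
        targetClasses=[1],          # illustrative: explain class index 1
        outputCol="shapValues",     # illustrative output column name
    )
    # Explainers are Transformers, so explaining is a single transform call.
    return explainer.transform(observations)
```

The returned DataFrame keeps the original observation columns and adds the explanation column named by outputCol.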

To see examples of model interpretability on Spark in action, take a look at these sample notebooks:

| | Tabular models | Vector models | Image models | Text models |
|---|---|---|---|---|
| LIME explainers | TabularLIME | VectorLIME | ImageLIME | TextLIME |
| Kernel SHAP explainers | TabularSHAP | VectorSHAP | ImageSHAP | TextSHAP |

### Common local explainer params

All local explainers support the following params:

| Param | Type | Default | Description |
|---|---|---|---|
| targetCol | String | "probability" | The column name of the prediction target to explain (i.e. the response variable). This is usually set to "prediction" for regression models and "probability" for probabilistic classification models. |
| targetClasses | Array[Int] | empty array | The indices of the classes for multinomial classification models. |
| targetClassesCol | String | | The name of the column that specifies the indices of the classes for multinomial classification models. |
| outputCol | String | | The name of the output column for interpretation results. |
| model | Transformer | | The model to be explained. |

### Common LIME explainer params

All LIME based explainers (TabularLIME, VectorLIME, ImageLIME, TextLIME) support the following params:

| Param | Type | Default | Description |
|---|---|---|---|
| regularization | Double | 0 | Regularization param for the underlying lasso regression. |
| kernelWidth | Double | sqrt(number of features) * 0.75 | Kernel width for the exponential kernel. |
| numSamples | Int | 1000 | Number of samples to generate. |
| metricsCol | String | "r2" | Column name for fitting metrics. |
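To make the kernelWidth default concrete, the sketch below computes it for a given feature count and evaluates an exponential kernel at a few distances. The kernel formula is an assumption borrowed from the reference LIME implementation, sqrt(exp(-d^2 / width^2)); this document does not state the exact form used by the explainers, and both function names are hypothetical.

```python
import math

def default_kernel_width(num_features: int) -> float:
    # Default from the table above: sqrt(number of features) * 0.75.
    return math.sqrt(num_features) * 0.75

def exponential_kernel(distance: float, kernel_width: float) -> float:
    # Assumed form (reference LIME implementation), not confirmed for this library.
    # Samples close to the observation get weights near 1, distant ones near 0.
    return math.sqrt(math.exp(-(distance ** 2) / (kernel_width ** 2)))

width = default_kernel_width(16)
weights = [exponential_kernel(d, width) for d in (0.0, 1.0, 3.0, 6.0)]
```

A larger kernelWidth flattens the weights, so distant samples influence the local surrogate more; a smaller one keeps the fit tightly local.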

### Common SHAP explainer params

All Kernel SHAP based explainers (TabularSHAP, VectorSHAP, ImageSHAP, TextSHAP) support the following params:

| Param | Type | Default | Description |
|---|---|---|---|
| infWeight | Double | 1E8 | The double value to represent infinite weight. |
| numSamples | Int | 2 * (number of features) + 2048 | Number of samples to generate. |
| metricsCol | String | "r2" | Column name for fitting metrics. |
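The sketch below illustrates why a finite stand-in for infinity is needed. In the standard Kernel SHAP formulation, the Shapley kernel weight is infinite for the empty and the full coalition, so an implementation has to substitute a large number such as infWeight. How the explainers apply infWeight internally is an assumption here, and the function names are hypothetical.

```python
from math import comb

def default_num_samples(num_features: int) -> int:
    # Default from the table above: 2 * (number of features) + 2048.
    return 2 * num_features + 2048

def shapley_kernel_weight(num_features: int, coalition_size: int,
                          inf_weight: float = 1e8) -> float:
    m, z = num_features, coalition_size
    if z == 0 or z == m:
        # The closed-form weight divides by zero here; use the finite stand-in.
        return inf_weight
    # Standard Shapley kernel: (M - 1) / (C(M, z) * z * (M - z)).
    return (m - 1) / (comb(m, z) * z * (m - z))

weights_by_size = [shapley_kernel_weight(4, z) for z in range(5)]
```

For four features the weights by coalition size are symmetric, with the two extreme coalitions dominating the weighted regression.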

### Tabular model explainer params

All tabular model explainers (TabularLIME, TabularSHAP) support the following params:

| Param | Type | Default | Description |
|---|---|---|---|
| inputCols | Array[String] | | The names of input columns to the black-box model. |
| backgroundData | DataFrame | | A dataframe containing background data. It must contain all the input columns needed by the black-box model. |

### Vector model explainer params

All vector model explainers (VectorLIME, VectorSHAP) support the following params:

| Param | Type | Default | Description |
|---|---|---|---|
| inputCol | String | | The name of the input vector column to the black-box model. |
| backgroundData | DataFrame | | A dataframe containing background data. It must contain the input vector column needed by the black-box model. |

### Image model explainer params

All image model explainers (ImageLIME, ImageSHAP) support the following params:

| Param | Type | Default | Description |
|---|---|---|---|
| inputCol | String | | The name of the input image column to the black-box model. |
| cellSize | Double | 16 | Number that controls the size of the super-pixels. |
| modifier | Double | 130 | Controls the trade-off between spatial and color distance of super-pixels. |
| superpixelCol | String | "superpixels" | The column holding the super-pixel decompositions. |

### Text model explainer params

All text model explainers (TextLIME, TextSHAP) support the following params:

| Param | Type | Default | Description |
|---|---|---|---|
| inputCol | String | | The name of the input text column to the black-box model. |
| tokensCol | String | "tokens" | The column holding the text tokens. |

### TabularLIME

| Param | Type | Default | Description |
|---|---|---|---|
| categoricalFeatures | Array[String] | empty array | The names of columns that should be treated as categorical variables. |

For categorical features, TabularLIME creates new samples by drawing samples based on the value distribution from the background dataset. For numerical features, it creates new samples by drawing from a normal distribution with mean taken from the target value to be explained, and standard deviation taken from the background dataset.
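The sampling rule above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's implementation: rows are modeled as dictionaries, the population standard deviation is used, and the function name sample_tabular_row is hypothetical.

```python
import random
import statistics

def sample_tabular_row(target, background, categorical, rng):
    """Draw one perturbed sample around `target` using `background` rows."""
    sample = {}
    for name, value in target.items():
        column = [row[name] for row in background]
        if name in categorical:
            # Picking a random background row follows the empirical value distribution.
            sample[name] = rng.choice(column)
        else:
            # Normal distribution centered on the value being explained,
            # with spread taken from the background data.
            sample[name] = rng.gauss(value, statistics.pstdev(column))
    return sample

rng = random.Random(0)
background = [{"color": "red", "age": 25.0}, {"color": "blue", "age": 35.0}]
target = {"color": "red", "age": 42.0}
samples = [sample_tabular_row(target, background, {"color"}, rng) for _ in range(5)]
```

Each sample keeps the target's schema, so it can be scored by the black-box model directly.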

### VectorLIME

VectorLIME assumes all features are numerical; categorical features are not supported.

### ImageLIME

| Param | Type | Default | Description |
|---|---|---|---|
| samplingFraction | Double | 0.7 | The fraction of super-pixels to keep on during sampling. |

ImageLIME creates new samples by randomly turning super-pixels on or off, keeping each one on with probability samplingFraction.
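A minimal sketch of this perturbation, assuming a grayscale image stored as nested lists and a same-shaped grid of super-pixel ids, might look like the following. The function names and the data layout are hypothetical and chosen only for illustration.

```python
import random

def sample_superpixel_states(num_superpixels, sampling_fraction, rng):
    # Each super-pixel stays on independently with probability sampling_fraction.
    return [rng.random() < sampling_fraction for _ in range(num_superpixels)]

def mask_image(pixels, superpixel_ids, states, background=0):
    # Pixels whose super-pixel is off are replaced with the background color (black).
    return [
        [value if states[sp] else background for value, sp in zip(row, id_row)]
        for row, id_row in zip(pixels, superpixel_ids)
    ]

pixels = [[5, 6], [7, 8]]
superpixel_ids = [[0, 0], [1, 1]]  # top row is super-pixel 0, bottom row is 1
states = sample_superpixel_states(2, 0.7, random.Random(0))
perturbed = mask_image(pixels, superpixel_ids, states)
```

The on/off states, not the pixel values, are the features of the local surrogate model.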

### TextLIME

| Param | Type | Default | Description |
|---|---|---|---|
| samplingFraction | Double | 0.7 | The fraction of word tokens to keep on during sampling. |

TextLIME creates new samples by randomly turning word tokens on or off, keeping each one on with probability samplingFraction.
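For text, the same idea applies to tokens. The sketch below generates several perturbed token lists together with their on/off states, replacing tokens that are off with an empty string. The function name and the return layout are hypothetical.

```python
import random

def sample_token_states(tokens, sampling_fraction, num_samples, rng):
    """Return (states, perturbed_tokens) pairs for `num_samples` perturbations."""
    samples = []
    for _ in range(num_samples):
        # Keep each token independently with probability sampling_fraction.
        states = [rng.random() < sampling_fraction for _ in tokens]
        # Tokens that are off are replaced with an empty string.
        perturbed = [tok if on else "" for tok, on in zip(tokens, states)]
        samples.append((states, perturbed))
    return samples

tokens = ["the", "movie", "was", "great"]
samples = sample_token_states(tokens, 0.7, 3, random.Random(0))
```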

## Result interpretation

### LIME explainers

LIME explainers return an array of vectors, and each vector maps to a class being explained. Each component of the vector is the coefficient for the corresponding feature, super-pixel, or word token from the local surrogate model.

• For categorical variables, super-pixels, or word tokens, the coefficient shows the average change in model outcome if this feature is unknown to the model, if the super-pixel is replaced with background color (black), or if the word token is replaced with empty string.
• For numeric variables, the coefficient shows the change in model outcome if the feature value is incremented by 1 unit.

### SHAP explainers

SHAP explainers return an array of vectors, and each vector maps to a class being explained. Each vector starts with the base value, and each following component of the vector is the Shapley value for each feature, super-pixel, or token.

The base value and Shapley values are additive, and they should add up to the model output for the target observation.
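The additivity property can be checked on a toy example. The sketch below computes exact Shapley values by enumerating feature orderings against a single background row, which is a stand-in technique: Kernel SHAP estimates the same quantities through sampling and weighted regression. The toy model and all names are hypothetical. The output follows the layout described above, with the base value first.

```python
from itertools import permutations

def exact_shapley(model, x, background):
    """Return [base value, phi_1, ..., phi_n] for observation `x`."""
    n = len(x)
    base = model(background)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(background)
        previous = base
        for i in order:
            # Switch feature i from its background value to its observed value
            # and record the marginal change in model output.
            current[i] = x[i]
            value = model(current)
            contributions[i] += value - previous
            previous = value
    return [base] + [c / len(orderings) for c in contributions]

def toy_model(features):
    a, b, c = features
    return 2.0 * a + b * c

x = [1.0, 2.0, 3.0]
explanation = exact_shapley(toy_model, x, [0.0, 0.0, 0.0])
base_value, shapley_values = explanation[0], explanation[1:]
```

Summing base_value and shapley_values recovers toy_model(x); the interaction term b * c is split evenly between the two features involved.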

#### Base value

• For tabular and vector models, the base value represents the mean outcome of the model for the background dataset.
• For image models, the base value represents the model outcome for a background (all black) image.
• For text models, the base value represents the model outcome for an empty string.