I looked into SHAP.
Focusing on paper 1 in the bibliography, I explain what SHAP is, also drawing on references 6, 7, and 8. Since paper 1 keeps its definitions rather abstract, the concrete examples in reference 7 make it easier to follow.
What is it? The paper proposes SHAP (SHapley Additive exPlanations) values, a unified framework for interpreting model predictions.
What makes it better than previous research? There are many methods for defining feature importance, but the relationships among them were not well understood. The approach proposed in this paper yields a unique solution backed by theory, and the new class unifies six existing methods.
What is the key to the technique and methods? (1) A new class of additive feature attribution methods is identified. (2) This class is shown to have a unique solution with several desirable properties.
Let the model we want to explain be $f(\boldsymbol{x})$.
y = f(\boldsymbol{x})
Let the explanation model be $g(\boldsymbol{x}')$. The methods below do not compute the importance of features over the entire dataset; instead, for each input $\boldsymbol{x}$ they compute the importance of the features for the prediction $f(\boldsymbol{x})$ on that input. Such methods are called local methods; LIME, for example, is a local method.
Explanation models often use a mapping function that connects the original model's input $\boldsymbol{x}$ to a simplified input $\boldsymbol{x}'$ for the explanation model.
\boldsymbol{x} = h_{\boldsymbol{x}}(\boldsymbol{x}')
Because the mapping function carries the subscript $\boldsymbol{x}$, it is a function that depends on the input value. The simplification converts the original model's input $\boldsymbol{x}$ into a form $\boldsymbol{x}'$ from which we can interpret which parts of $\boldsymbol{x}$ contribute to the prediction. For example, if the input is a Bag of Words (BoW) vector and you want to know which words contribute to the prediction, all you need is the presence or absence of each word in $\boldsymbol{x}$, so the simplification is the operation of converting every non-zero count to 1. The simplified input $\boldsymbol{x}'$ can also be mapped back to the original input by $h_{\boldsymbol{x}}$, because $h_{\boldsymbol{x}}$ holds the information of the local input $\boldsymbol{x}$.
\mathrm{Simplification}: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0] \rightarrow [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] \\
h_{\boldsymbol{x}}: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] \rightarrow [1, 2, 1, 1, 2, 1, 1, 0, 0, 0] = \boldsymbol{x} \odot \boldsymbol{x'}
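As a concrete sketch of the BoW example (the vectors follow the example above; the code itself is only an illustration, not from the paper):

```python
import numpy as np

# Original BoW input: word counts for a 10-word vocabulary.
x = np.array([1, 2, 1, 1, 2, 1, 1, 0, 0, 0])

# Simplification: keep only the presence/absence of each word (binary variables).
x_simplified = (x != 0).astype(int)   # -> [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

def h_x(z_prime, x=x):
    """Mapping function h_x: send a simplified (binary) input z' back to the
    original input space, using the locally stored input x.
    Words switched on in z' keep their original counts; the rest become 0."""
    return x * z_prime   # element-wise product x ⊙ z'

# Mapping the full simplified vector back recovers the original input.
print(h_x(x_simplified))                               # [1 2 1 1 2 1 1 0 0 0]
# Switching a word off shows what the original model would see without it.
print(h_x(np.array([0, 1, 1, 1, 1, 1, 1, 0, 0, 0])))   # [0 2 1 1 2 1 1 0 0 0]
```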
If the input is an image and you want to know which parts of the image contribute to the prediction, group adjacent pixels into patches and express the presence or absence of each patch as 1/0. When mapping back to the original input space with $h_{\boldsymbol{x}}$, the patches that are present (the 1 entries) are restored to the original pixels, while the parts outside those patches (the 0 entries) are filled with an appropriate background colour, which is discussed later. The blog in Reference 8 is very helpful on this point.
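A rough sketch of the same idea for images, assuming square patches on a regular grid and a fixed background value (both are illustrative choices; what "missing" should mean is discussed in the SHAP section below):

```python
import numpy as np

def h_x_image(z_prime, image, patch=8, background=128):
    """Map a binary patch mask z' back to image space: patches with z'_i = 1
    keep the original pixels, patches with z'_i = 0 are filled with a fixed
    background value."""
    h, w = image.shape[:2]
    out = np.full_like(image, background)
    n_cols = w // patch
    for idx, on in enumerate(z_prime):      # z' is ordered row-major over the patch grid
        if on:
            r, c = divmod(idx, n_cols)
            out[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = \
                image[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
    return out

# Toy usage: keep only the first three patches of a random 32x32 image.
img = np.random.default_rng(0).integers(0, 256, size=(32, 32), dtype=np.uint8)
mask = np.zeros((32 // 8) * (32 // 8), dtype=int)
mask[:3] = 1
masked_img = h_x_image(mask, img)   # this is what the original model f would see
```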
Local methods determine the weights of $g$ so that $g(\boldsymbol{z}') \simeq f(h_{\boldsymbol{x}}(\boldsymbol{z}'))$ whenever $\boldsymbol{z}' \simeq \boldsymbol{x}'$.
Paper 1 builds the explanation model using what it calls additive feature attribution methods. The simplification is restricted to a mapping into binary variables, so the explanation model is a linear function of binary variables.
g(z') = \phi_0 + \sum_{i=1}^M\phi_i z_i'
Here $\boldsymbol{z}' \in \{0,1\}^M$ is the simplified feature vector and $\phi_i \in \mathbb{R}$ is a coefficient representing the effect of feature $i$. In this model the effect of each feature is represented by $\phi_i$, and the effects of all features are added together to approximate the output of the original model.
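In code, the explanation model is nothing more than a linear function of the binary vector (a minimal sketch; `phi_0` and `phi` stand for whatever attribution values we end up with):

```python
import numpy as np

def g(z_prime, phi_0, phi):
    """Explanation model: g(z') = phi_0 + sum_i phi_i * z'_i."""
    return phi_0 + np.dot(phi, z_prime)

# With all features switched on, local accuracy (below) demands g(x') = f(x).
print(g(np.array([1, 1, 0]), phi_0=0.5, phi=np.array([0.2, -0.1, 0.3])))  # -> 0.6
```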
The coefficients $\phi_i$ can be determined uniquely by imposing the following three conditions: local accuracy, missingness, and consistency. The first, local accuracy, requires the explanation model to reproduce the original model's output at the input being explained:
\begin{aligned}
f(\boldsymbol{x}) = g(\boldsymbol{x}') &= \phi_0 + \sum_{i=1}^M\phi_i x_i' \\
\boldsymbol{x} &= h_\boldsymbol{x}(\boldsymbol{x}') \\
\end{aligned}
The second condition, missingness, requires that a feature absent from the simplified input receives no attribution:
x_i' = 0 \Rightarrow \phi_i = 0
The third condition is consistency. Abbreviate $f_\boldsymbol{x}(\boldsymbol{z}') = f(h_\boldsymbol{x}(\boldsymbol{z}'))$, and write $\boldsymbol{z}'\backslash i$ for $\boldsymbol{z}'$ with $z_i' = 0$. If, for any two models $f$ and $f'$,
f'_\boldsymbol{x}(\boldsymbol{z}') - f'_\boldsymbol{x}(\boldsymbol{z}'\backslash i) \geq f_\boldsymbol{x}(\boldsymbol{z}') - f_\boldsymbol{x}(\boldsymbol{z}'\backslash i) \ \ \mathrm{for \ all \ inputs} \ \boldsymbol{z}' \in \{0,1\}^M
holds, then $\phi_i(f', \boldsymbol{x}) \geq \phi_i(f, \boldsymbol{x})$ must also hold. In other words, if the model is changed so that the marginal contribution of feature $i$ increases or stays the same (regardless of the other features), then $\phi_i$ must also increase or stay the same.
It is known that imposing these three conditions determines the explanation model uniquely:
\phi_i(f, \boldsymbol{x}) = \sum_{\boldsymbol{z}'\subseteq \boldsymbol{x}'}\frac{|\boldsymbol{z}'|!(M-|\boldsymbol{z}'|-1)!}{M!}[f_\boldsymbol{x}(\boldsymbol{z}')-f_\boldsymbol{x}(\boldsymbol{z}'\backslash i)]
Here $|\boldsymbol{z}'|$ is the number of non-zero entries in $\boldsymbol{z}'$, and $\boldsymbol{z}' \subseteq \boldsymbol{x}'$ runs over all vectors whose non-zero entries are a subset of the non-zero entries of $\boldsymbol{x}'$. This is exactly the Shapley value from cooperative game theory:
\phi_i = \sum_{\mathcal{S}\subseteq \mathcal{N}\backslash \{i\}}\frac{|\mathcal{S}|!\,(|\mathcal{N}|-|\mathcal{S}|-1)!}{|\mathcal{N}|!}\left( v\left(\mathcal{S}\cup\{i\}\right) - v\left(\mathcal{S}\right) \right)
Here $\mathcal{N}$ is the set of all players, $\mathcal{S}$ is a coalition that does not contain player $i$, and $v$ is the value function that assigns a payoff to each coalition; $\phi_i$ is player $i$'s fair share of the total payoff.
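To make the formula concrete, here is a brute-force computation of the Shapley values over all subsets (a minimal sketch, not from the paper: `f_x` stands for the composed function $f(h_\boldsymbol{x}(\cdot))$ defined on binary vectors, and the enumeration is exponential in $M$, so this only works for small $M$):

```python
import itertools
import math
import numpy as np

def shapley_values(f_x, M):
    """Exact Shapley values for a set function f_x on binary vectors z' in {0,1}^M:
    phi_i = sum over subsets S not containing i of
            |S|!(M-|S|-1)!/M! * [f_x(S ∪ {i}) - f_x(S)]."""
    phi = np.zeros(M)
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            for S in itertools.combinations(others, size):
                z = np.zeros(M, dtype=int)
                z[list(S)] = 1
                without_i = f_x(z)          # coalition S
                z[i] = 1
                with_i = f_x(z)             # coalition S ∪ {i}
                weight = (math.factorial(size) * math.factorial(M - size - 1)
                          / math.factorial(M))
                phi[i] += weight * (with_i - without_i)
    return phi

# Toy check: for an additive f_x, the Shapley value of feature i is its own term.
w = np.array([1.0, 2.0, -3.0])
phi = shapley_values(lambda z: float(np.dot(w, z)), M=3)
print(phi)   # ≈ [ 1.  2. -3.]
```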
From the above, the additive feature attribution method is unique for a given $h_\boldsymbol{x}$. Conversely, importance measures that are not based on Shapley values are considered undesirable, because they necessarily violate at least one of the three properties that feature importances should satisfy.
SHAP (SHapley Additive exPlanation) Values
SHAP values are proposed as the unified measure of feature importance. They are obtained by approximating $f_\boldsymbol{x}$ in the Shapley value equation with a conditional expectation of the original model $f$.
f_\boldsymbol{x}(\boldsymbol{z}') \equiv f(h_\boldsymbol{x}(\boldsymbol{z}')) \rightarrow E[f(\boldsymbol{z})|z_S]
Why does $f$ need to be replaced by a conditional expectation? Strictly speaking, the mapping $h_\boldsymbol{x}(\boldsymbol{z}') = z_S$ is an operation that removes the features not contained in the index set $S$. When mapping from the explanation model's input space back to the original feature space, there is no obviously right value for those removed features. For an image, for example, should everything outside the patches be replaced with 0, or with 255? In multiple regression a removed feature can be replaced with 0, because dropping the feature's term from the model is equivalent to setting that feature to 0; most models, however, cannot handle missing values this naturally. Therefore, instead of treating the features not in $S$ as missing, the mapping assigns them values at random according to the data distribution, so that the model can still accept the input. As a result, the model's prediction $f$ becomes an expectation conditioned on $z_S$.
Computing Shapley values exactly from the definition is difficult because it is computationally expensive (the sum $\sum_{\boldsymbol{z}'\subseteq \boldsymbol{x}'}$ has exponentially many terms). However, SHAP values can be approximated by combining the insights of the additive feature attribution methods. The approximations assume feature independence and, optionally, model linearity.
\begin{aligned}
f(h_\boldsymbol{x}(\boldsymbol{z}')) = E[f(\boldsymbol{z})|z_S] &= E_{\bar z_S|z_S}[f(z_S, \bar z_S)]\\
&\approx E_{\bar z_S}[f(z_S, \bar z_S)] \because \mathrm{independent \ features} \\
&\approx f(z_S, E[\bar z_S]) \because \mathrm{model \ is \ linear}
\end{aligned}
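As a sketch of what these two approximation steps mean in practice (the function and variable names are illustrative assumptions, and `X_background` stands for a sample of data used to represent the features outside $S$):

```python
import numpy as np

def f_x_expected(model_predict, x, S, X_background):
    """Approximate E[f(z) | z_S]: keep the features in S at their values in x and
    average the prediction over background samples for the features not in S
    (the 'independent features' step)."""
    Z = X_background.copy()
    Z[:, S] = x[S]                     # fix the features in S to their values in x
    return model_predict(Z).mean()     # average out the remaining features

def f_x_linear(model_predict, x, S, X_background):
    """Further approximation for (near-)linear models: replace the features not in S
    by their mean instead of averaging predictions (the 'model is linear' step)."""
    z = X_background.mean(axis=0)
    z[S] = x[S]
    return model_predict(z[None, :])[0]

# For a truly linear model the two coincide; for non-linear models the second
# is only a further approximation.
predict = lambda X: X @ np.array([1.0, 2.0, -1.0])
X_bg = np.random.default_rng(0).normal(size=(1000, 3))
x = np.array([0.5, -1.0, 2.0])
print(f_x_expected(predict, x, S=[0, 2], X_background=X_bg))
print(f_x_linear(predict, x, S=[0, 2], X_background=X_bg))
```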
The paper shows that six methods can be obtained as approximations of SHAP values, and this is the sense in which those methods are unified.
- Model-agnostic approximation methods
  - Shapley sampling values
  - Kernel SHAP (new method)
- Model-specific approximation methods
  - Linear SHAP
  - Low-Order SHAP
  - Max SHAP (new method)
  - Deep SHAP (new method)
Shapley sampling values
There is an original paper for this method, but I gave up on reading it because it is paywalled.
Kernel SHAP
Kernel SHAP is obtained as follows.
The SHAP kernel is defined by the following expression:
\pi_\boldsymbol{x}(\boldsymbol{z}') = \frac{M-1}{\binom{M}{|\boldsymbol{z}'|}\,|\boldsymbol{z}'|\,(M-|\boldsymbol{z}'|)}
The explanation model $g$ is then fitted by solving the optimization problem with the following weighted loss function:
L(f, g, \pi_\boldsymbol{x}) = \sum_{\boldsymbol{z}'\in Z}\left[ f(h_\boldsymbol{x}(\boldsymbol{z}')) - g(\boldsymbol{z}') \right]^2 \pi_\boldsymbol{x}(\boldsymbol{z}')
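Below is a minimal sketch of the Kernel SHAP idea under these definitions: enumerate the coalitions $\boldsymbol{z}'$, map them back with the background-mean version of $h_\boldsymbol{x}$ from above, and solve the weighted least-squares problem. It enumerates all coalitions instead of sampling, handles the two infinite-weight points only approximately, and all names are illustrative assumptions rather than the paper's or any library's implementation.

```python
import itertools
import math
import numpy as np

def kernel_shap(model_predict, x, X_background):
    """Fit g(z') = phi_0 + sum_i phi_i z'_i by weighted least squares with the
    Shapley kernel pi_x(z') = (M-1) / [C(M,|z'|) * |z'| * (M-|z'|)]."""
    M = len(x)
    background = X_background.mean(axis=0)

    rows, targets, weights = [], [], []
    for z in itertools.product([0, 1], repeat=M):
        z = np.array(z)
        k = z.sum()
        if k == 0 or k == M:
            continue                                 # infinite kernel weight, see below
        mapped = np.where(z == 1, x, background)     # h_x(z') with background means
        rows.append(z)
        targets.append(model_predict(mapped[None, :])[0])
        weights.append((M - 1) / (math.comb(M, k) * k * (M - k)))

    # The all-zeros and all-ones coalitions have infinite weight; they pin
    # phi_0 = f(background) and sum(phi) = f(x) - phi_0.  Here the second
    # constraint is imposed approximately as one heavily weighted extra row.
    phi_0 = model_predict(background[None, :])[0]
    f_x = model_predict(x[None, :])[0]

    Z = np.vstack([np.array(rows, dtype=float), np.ones(M)])
    y = np.append(np.array(targets) - phi_0, f_x - phi_0)
    W = np.diag(np.append(weights, 1e6))

    phi = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return phi_0, phi

# Sanity check against Linear SHAP (next section): for a linear model,
# Kernel SHAP should reproduce w_i * (x_i - E[x_i]) up to numerical error.
w = np.array([1.0, -2.0, 0.5, 3.0])
predict = lambda X: X @ w + 1.0
X_bg = np.random.default_rng(0).normal(size=(200, 4))
x = np.array([1.0, 1.0, -1.0, 0.5])
print(kernel_shap(predict, x, X_bg)[1])
print(w * (x - X_bg.mean(axis=0)))
```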
Linear SHAP
For a linear model $f(\boldsymbol{x}) = \sum_{j=1}^M w_j x_j + b$, if the features are independent, the SHAP values are as follows. The derivation is given in detail in the paper's appendix.
\begin{aligned}
\phi_0(f, \boldsymbol{x}) &= b \\
\phi_i(f, \boldsymbol{x}) &= w_i(x_i - E[x_i])
\end{aligned}
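A sketch in code (the linear model and `X_background`, which supplies $E[x_i]$, are illustrative assumptions):

```python
import numpy as np

def linear_shap(w, x, X_background):
    """phi_i = w_i * (x_i - E[x_i]) for a linear model with independent features."""
    return w * (x - X_background.mean(axis=0))

# Illustrative use with made-up numbers:
w = np.array([0.5, -2.0, 1.0])
X_bg = np.random.default_rng(0).normal(size=(1000, 3))
x = np.array([1.0, 0.0, 3.0])
print(linear_shap(w, x, X_bg))
```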
Low-Order SHAP
(I do not really understand this one.) Even in the case of linear regression, Kernel SHAP requires only $\mathcal{O}(2^M + M^3)$ computation, so it is more efficient to make the approximation that the features are independent (is that approximation what Low-Order SHAP refers to?).
Max SHAP
The paper says there is supplementary material, but I could not find it, so I do not know the details. The reviewers point out that it is wrong.
Deep SHAP
The claim is that SHAP values can be computed for Dense and Maxpool layers, so SHAP values for a whole network can be obtained by backpropagating them (is that really possible through the non-linear parts?).
It seems that decision-tree models can compute SHAP values efficiently. For LightGBM and XGBoost, you can easily output SHAP values with the library in Reference 4.
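Assuming Reference 4 is the `shap` Python library, a minimal usage sketch with XGBoost might look like this (the toy data and model settings are made up for illustration):

```python
import numpy as np
import shap
import xgboost

# Toy data and model; a LightGBM model works the same way.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # shape: (n_samples, n_features)

print(shap_values[0])             # per-feature contributions for the first sample
print(explainer.expected_value)   # phi_0: the average model prediction
```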