Boruta is a feature selection method based on feature importance.
I studied it by referring to the explanation on the following site. The explanation is very easy to follow, and the accompanying experiments are very helpful.
- [Feature selection method Boruta using random forest and test](https://aotamasaki.hatenablog.com/entry/2019/01/05/195813)
To decide whether a feature is useful for prediction, Boruta's algorithm performs a statistical test with the null hypothesis "the importance of this feature is the same as the importance of a feature that does not contribute to the classification (regression)." If the null hypothesis is correct, which of the two importances comes out larger should be random, so the number of times a feature beats the non-contributing feature should follow a binomial distribution with $p = 0.5$.
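As a concrete illustration of this test, here is a minimal sketch (the number of iterations and the hit count are made-up values) of computing the one-sided p-values from a binomial distribution with $p = 0.5$, in the same spirit as the boruta_py code quoted further below:

```python
from scipy.stats import binom

# Hypothetical example: out of 20 Boruta iterations, a feature beat the
# shadow features' maximum importance 16 times ("16 hits").
n_iter = 20
hits = 16

# P(X >= hits) under Binomial(n_iter, 0.5): a small value suggests accepting the feature
p_accept = binom.sf(hits - 1, n_iter, 0.5)
# P(X <= hits) under Binomial(n_iter, 0.5): a small value suggests rejecting the feature
p_reject = binom.cdf(hits, n_iter, 0.5)

print(p_accept, p_reject)
```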
What caught my attention here is that there is not just one shadow feature; there are as many shadow features as original feature dimensions. The comparison is then "the importance of a given feature VS the highest importance among the shadow features".
(The image above is taken from [Feature selection method Boruta using random forest and test](https://aotamasaki.hatenablog.com/entry/2019/01/05/195813).) The implementation in boruta_py works the same way:
```python
def _do_tests(self, dec_reg, hit_reg, _iter):
    active_features = np.where(dec_reg >= 0)[0]
    hits = hit_reg[active_features]
    # get uncorrected p values based on hit_reg
    to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
    to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()
    ...
```
However, if we write $S_{\mathrm{max}}$ for the random variable representing the maximum feature importance among the shadow features (say there are four of them, $S_1, \dots, S_4$), then
$$
S_{\mathrm{max}} \leq F_1 \Leftrightarrow S_1\leq F_1 \wedge S_2\leq F_1 \wedge S_3\leq F_1 \wedge S_4\leq F_1
$$
Therefore,
$$
\mathrm{Pr}(S_{\mathrm{max}} \leq F_1) = \mathrm{Pr}(S_1 \leq F_1)\,\mathrm{Pr}(S_2 \leq F_1)\,\mathrm{Pr}(S_3 \leq F_1)\,\mathrm{Pr}(S_4 \leq F_1) = \left(\frac{1}{2}\right)^4
$$
If the null hypothesis is correct, the number of times a feature's importance exceeds the largest feature importance among the shadow features should therefore follow a binomial distribution with $p = (1/2)^4$.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import binom

plt.rcParams['font.family'] = 'IPAPGothic'
plt.rcParams["figure.figsize"] = [12, 6]
plt.rcParams['font.size'] = 20
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['ytick.labelsize'] = 15

# Number of trials (Boruta iterations)
N = 100
# Possible numbers of hits among the N trials (array)
k = np.arange(N + 1)

# Plot the two binomial pmfs
fig, ax = plt.subplots(1, 1)
ax.plot(k, binom.pmf(k, N, p=0.5), 'bo', ms=8, label='p=0.5')
ax.vlines(k, 0, binom.pmf(k, N, p=0.5), colors='b', lw=1, alpha=0.2)
ax.plot(k, binom.pmf(k, N, p=0.5**4), 'x', ms=8, color='r', label='p=0.5^4')
ax.vlines(k, 0, binom.pmf(k, N, p=0.5**4), colors='r', lw=1, alpha=0.2)
ax.set_xlabel('number of hits')
ax.set_ylabel('probability')
ax.legend()
plt.show()
```
As you can see from the graph, the two distributions are quite different, and so are the resulting rejection regions.
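To put rough numbers on that difference, here is a small sketch (assuming 100 iterations and a one-sided significance level of 0.05, both arbitrary choices) that compares the approximate upper-tail critical hit counts under the two values of $p$ using `binom.ppf`:

```python
from scipy.stats import binom

N = 100        # number of Boruta iterations (arbitrary, as in the plot above)
alpha = 0.05   # one-sided significance level (arbitrary)

# Approximate upper-tail critical values: smallest k with P(X <= k) >= 1 - alpha
crit_p05 = binom.ppf(1 - alpha, N, 0.5)
crit_p0625 = binom.ppf(1 - alpha, N, 0.5 ** 4)

print(crit_p05, crit_p0625)
```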
Having written all this, I am left wondering: does Boruta still work properly when actually used? Or is there something wrong with my way of thinking?