With scikit-learn's RandomForestClassifier you can solve classification problems using a random forest. One feature of random forests is that they can identify outliers: samples whose attribute values differ from those typical of their class, among the data belonging to the same class. Since scikit-learn does not officially provide a function to compute this outlier measure, I wrote a script that outputs it. (Incidentally, it can be computed in R.)
To find outliers, we use the apply method of scikit-learn's RandomForestClassifier. Given a batch of input data, this method returns, for each decision tree built by the random forest algorithm, the index of the leaf that each sample falls into.
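As a quick illustration (a minimal sketch; the dataset and parameters here are arbitrary choices of mine):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10).fit(X, y)
leaves = rf.apply(X)   # shape (n_samples, n_estimators)
print(leaves.shape)    # (150, 10): leaves[n][t] is the leaf index of sample n in tree t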
In order to find the outliers, we first need to compute the proximity of each pair of samples.
The proximity is computed from the array returned by the apply method, which has shape [number of samples, number of decision trees].
For the proximity, we count how many times a sample $x_k$ falls into the same leaf as the sample $x_n$, summing over all the decision trees the forest has built. Finally, the sum is divided by the number of decision trees to normalize it, giving the proximity of $x_n$. The result is returned as a two-dimensional array of shape [number of samples, number of samples]. (Incidentally, this array is a symmetric matrix.)
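Written as a formula (my notation; $T$ is the number of decision trees and $\mathrm{leaf}_t(x)$ denotes the leaf that tree $t$ assigns to $x$):

$$\mathrm{prox}(x_n, x_k) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\left[\mathrm{leaf}_t(x_n) = \mathrm{leaf}_t(x_k)\right]$$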
def proximity(data):
    # data: leaf indices from apply(), shape (n_samples, n_estimators)
    n_samples = np.zeros((len(data), len(data)))
    n_estimators = len(data[0])
    for est in np.transpose(np.array(data)):  # iterate over trees
        for n, n_node in enumerate(est):
            for k, k_node in enumerate(est):
                if n_node == k_node:  # same leaf in this tree
                    n_samples[n][k] += 1
    n_samples = n_samples / n_estimators  # normalize by the number of trees
    return n_samples
After computing the proximities, we then compute the outlier measure. An array of the correct labels is passed as an argument so that outliers are computed within each class. The processing flow is as follows; the corresponding formulas are shown after the list.
- Compute the average proximity within the class
- Compute the raw outlier measure for each sample
- Compute the median and the median absolute deviation (MAD) of the outlier measure for each class
- Normalize each sample's outlier measure using the median and MAD
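Written as formulas (my reading of the steps above, following Breiman's random forest outlier measure; $N$ is the total number of samples, matching the code below):

$$\mathrm{raw}(x_n) = \frac{N}{\sum_{k:\,y_k = y_n} \mathrm{prox}(x_n, x_k)^2}, \qquad \mathrm{outlier}(x_n) = \frac{\mathrm{raw}(x_n) - \mathrm{med}_{y_n}}{\mathrm{MAD}_{y_n}}$$

where $\mathrm{med}_{y_n}$ and $\mathrm{MAD}_{y_n}$ are the median and median absolute deviation of $\mathrm{raw}$ within the class $y_n$.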
That's all there is to it, and it is easy to write using numpy. If you also rewrite the for loops with numpy, the computation can be sped up, as in the sketch below.
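For example, the two inner loops of proximity can be replaced by a numpy broadcast comparison (a sketch of my own rewrite, not the original code; it computes the same matrix):

import numpy as np

def proximity_fast(leaves):
    # leaves: output of apply(), shape (n_samples, n_estimators)
    leaves = np.asarray(leaves)
    n_samples, n_estimators = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for leaf_col in leaves.T:  # one pass per tree
        # pairwise "same leaf" comparison for this tree
        prox += (leaf_col[:, None] == leaf_col[None, :])
    return prox / n_estimators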
The same approach also works with XGBoost through its scikit-learn wrapper, so outliers can be identified there as well.
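For instance, something like the following should work (a hedged sketch; I am assuming the apply method of XGBClassifier, whose exact signature and output shape may vary by xgboost version):

from sklearn.datasets import load_iris
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = load_iris(return_X_y=True)
xgb = XGBClassifier(n_estimators=10, max_depth=5).fit(X, y)
app = xgb.apply(X)    # leaf indices, same convention as RandomForestClassifier.apply
prx = proximity(app)  # reuses the proximity/outlier functions defined in this article
out = outlier(prx, y)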
By the way, the median and MAD are used to normalize the outlier measure, instead of the mean and standard deviation, because they are robust statistics that are not easily affected by outliers.
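A quick illustration of the difference (my own toy example):

import numpy as np

x = np.array([1.0, 1.1, 0.9, 1.05, 10.0])  # one extreme value
print(np.mean(x), np.std(x))  # both pulled strongly toward the extreme value
med = np.median(x)
mad = np.median(np.abs(x - med))
print(med, mad)               # barely affected by it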
def outlier(data, label):
    N = len(label)
    pbar = [0] * N
    data = np.square(data)
    # Sum of squared proximities to samples of the same class
    for n, n_prox2 in enumerate(data):
        for k, k_prox2 in enumerate(n_prox2):
            if label[n] == label[k]:
                pbar[n] += k_prox2
        if pbar[n] == 0.0:
            pbar[n] = 1.0e-32  # avoid division by zero
    # Raw outlier measure
    out = N / np.array(pbar)
    # Median of the outlier measure for each class
    meds = {}
    for n, l in enumerate(label):
        if l not in meds.keys():
            meds[l] = []
        meds[l].append(out[n])
    label_uniq = list(set(label))
    med_uniq = {}  # the median of each class goes into this variable
    for l in label_uniq:
        med_uniq[l] = np.median(meds[l])
    # Median absolute deviation (MAD) of the outlier measure for each class
    mads = {}
    for n, l in enumerate(label):
        if l not in mads.keys():
            mads[l] = []
        mads[l].append(np.abs(out[n] - med_uniq[l]))
    mad_uniq = {}  # the MAD of each class goes into this variable
    for l in label_uniq:
        mad_uniq[l] = np.median(mads[l])
    # Normalize each sample's outlier measure with the median and MAD
    outlier = [0] * N
    for n, l in enumerate(label):
        if mad_uniq[l] == 0.0:
            outlier[n] = out[n] - med_uniq[l]
        else:
            outlier[n] = (out[n] - med_uniq[l]) / mad_uniq[l]
    return outlier
Using the functions above, I tried identifying outliers in the iris data from sklearn's sample datasets. The sample code that outputs an image of the result is shown below.
outlier.py
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt

def proximity(data):
    # data: leaf indices from apply(), shape (n_samples, n_estimators)
    n_samples = np.zeros((len(data), len(data)))
    n_estimators = len(data[0])
    for est in np.transpose(np.array(data)):  # iterate over trees
        for n, n_node in enumerate(est):
            for k, k_node in enumerate(est):
                if n_node == k_node:  # same leaf in this tree
                    n_samples[n][k] += 1
    n_samples = n_samples / n_estimators  # normalize by the number of trees
    return n_samples

def outlier(data, label):
    N = len(label)
    pbar = [0] * N
    data = np.square(data)
    # Sum of squared proximities to samples of the same class
    for n, n_prox2 in enumerate(data):
        for k, k_prox2 in enumerate(n_prox2):
            if label[n] == label[k]:
                pbar[n] += k_prox2
        if pbar[n] == 0.0:
            pbar[n] = 1.0e-32  # avoid division by zero
    # Raw outlier measure
    out = N / np.array(pbar)
    # Median of the outlier measure for each class
    meds = {}
    for n, l in enumerate(label):
        if l not in meds.keys():
            meds[l] = []
        meds[l].append(out[n])
    label_uniq = list(set(label))
    med_uniq = {}  # the median of each class goes into this variable
    for l in label_uniq:
        med_uniq[l] = np.median(meds[l])
    # Median absolute deviation (MAD) of the outlier measure for each class
    mads = {}
    for n, l in enumerate(label):
        if l not in mads.keys():
            mads[l] = []
        mads[l].append(np.abs(out[n] - med_uniq[l]))
    mad_uniq = {}  # the MAD of each class goes into this variable
    for l in label_uniq:
        mad_uniq[l] = np.median(mads[l])
    # Normalize each sample's outlier measure with the median and MAD
    outlier = [0] * N
    for n, l in enumerate(label):
        if mad_uniq[l] == 0.0:
            outlier[n] = out[n] - med_uniq[l]
        else:
            outlier[n] = (out[n] - med_uniq[l]) / mad_uniq[l]
    return outlier

if __name__ == '__main__':
    iris = load_iris()
    X = iris.data
    y = iris.target
    div = 50  # the iris dataset has 50 samples per class
    best_oob = 0.0
    app = None
    for i in range(20):
        rf = RandomForestClassifier(max_depth=5, n_estimators=10, oob_score=True)
        rf.fit(X, y)
        # keep the leaf indices from the run with the best OOB score
        # (note: the fitted attribute is oob_score_, not the oob_score constructor flag)
        if rf.oob_score_ > best_oob:
            best_oob = rf.oob_score_
            app = rf.apply(X)
    prx = proximity(app)
    out = outlier(prx, y)
    fig = plt.figure(figsize=[7, 4])
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(np.arange(div), out[:div], c="r", marker='o', label='class 0')
    ax.scatter(np.arange(div, div*2), out[div:div*2], c="b", marker='^', label='class 1')
    ax.scatter(np.arange(div*2, div*3), out[div*2:], c="g", marker='s', label='class 2')
    ax.set_ylabel('outlier')
    ax.legend(loc="best")
    fig.savefig("out.png")