Identify outliers with RandomForestClassifier in scikit-learn

With scikit-learn's RandomForestClassifier you can solve classification problems using a random forest. One feature of random forests is that, among the data belonging to the same class, they can identify outliers whose attribute values differ from the values typical of that class. Since scikit-learn itself does not provide a function to calculate this outlier measure, this time I wrote a script to output it. (By the way, it can be calculated in R.)

To find outliers, use the apply method of scikit-learn's RandomForestClassifier. Given a batch of input data, this method returns, for each decision tree built by the random forest algorithm, the index of the leaf that each sample falls into.

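For example, this minimal sketch (the variable names are just for illustration) shows what apply returns:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10).fit(X, y)
leaves = rf.apply(X)  # leaf indices, shape [number of samples, number of decision trees]
print(leaves.shape)   # (150, 10)
print(leaves[0])      # the leaf that sample 0 falls into in each of the 10 trees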

About the code

To find the outliers, it is first necessary to compute the proximity between each pair of samples.

Calculation of proximity

The proximity is calculated using the array returned by the apply method as the argument. The apply method returns a two-dimensional array of shape [number of samples, number of decision trees].

To compute the proximity, for each sample $x_{n}$ we count the samples $x_{k}$ that fall into the same leaf as $x_{n}$, and sum this count over all the decision trees. Finally, the sum is divided by the number of decision trees to normalize it, which gives the proximity of $x_{n}$. The result is returned as a two-dimensional array of shape [number of samples, number of samples]. (By the way, this matrix is symmetric, with ones on the diagonal.)
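
Written as a formula, with $T$ decision trees and $\mathrm{leaf}_{t}(x)$ denoting the leaf that $x$ falls into in tree $t$, the proximity computed here is

$$
\mathrm{prox}(x_{n}, x_{k}) = \frac{1}{T} \sum_{t=1}^{T} I\left(\mathrm{leaf}_{t}(x_{n}) = \mathrm{leaf}_{t}(x_{k})\right)
$$

where $I(\cdot)$ is 1 when the condition holds and 0 otherwise.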

def proximity(data):
  # data: leaf indices returned by apply, shape [number of samples, number of decision trees]
  n_samples = np.zeros((len(data),len(data)))
  n_estimators = len(data[0])

  # for each decision tree, count the pairs of samples that fall into the same leaf
  for e,est in enumerate(np.transpose(np.array(data))):
    for n,n_node in enumerate(est):
      for k,k_node in enumerate(est):
        if n_node == k_node:
          n_samples[n][k] += 1

  # normalize by the number of decision trees
  n_samples = 1.0 * np.array(n_samples) / n_estimators

  return n_samples
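
For example, the array returned by apply can be passed straight to this function (a minimal sketch, reusing rf and X from the illustration above):

leaves = rf.apply(X)     # [number of samples, number of decision trees]
prx = proximity(leaves)  # [number of samples, number of samples]
print(prx[0][0])         # 1.0, since a sample always shares a leaf with itself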

Calculation of outliers

After computing the proximity matrix, the next step is to find the outliers. The array of true class labels is passed as an argument so that the outlier measure is computed within each class. The processing flow is as follows (the corresponding formulas are shown after the list).

- Calculate the within-class proximity of each sample (the sum of its squared proximities to the samples in the same class)

- Calculate the raw outlier measure of each sample from the within-class proximity

- Calculate the median and the median absolute deviation (MAD) of the outlier measure for each class

- Normalize each sample's outlier measure with the class median and MAD
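
In formulas, and matching the code below, with $N$ the total number of samples:

$$
\bar{p}(n) = \sum_{k:\ y_{k} = y_{n}} \mathrm{prox}(x_{n}, x_{k})^{2}, \qquad \mathrm{out}(n) = \frac{N}{\bar{p}(n)}
$$

The final score of each sample is $(\mathrm{out}(n) - \mathrm{med}) / \mathrm{MAD}$, where the median and MAD are those of $\mathrm{out}$ within the sample's own class.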

That is all there is to it, and it is easy to write using numpy. If the for loops are vectorized as well, it can also be made faster, as sketched below.
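
For example, the proximity computation can be vectorized with numpy broadcasting instead of the triple for loop. This is a rough sketch; note that the intermediate boolean array has shape [number of samples, number of samples, number of decision trees], so it trades memory for speed:

import numpy as np

def proximity_vectorized(leaves):
  # leaves: leaf indices from apply(), shape [n_samples, n_estimators]
  leaves = np.asarray(leaves)
  # compare the leaf index of every pair of samples in every tree,
  # then average over the trees -> [n_samples, n_samples] proximity matrix
  return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)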

XGBoost can also be used through its scikit-learn wrapper, so it is possible to identify outliers with XGBoost in the same way.
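
A minimal sketch of the XGBoost case (assuming the xgboost package is installed; its scikit-learn wrapper also has an apply method that returns leaf indices with one column per boosted tree, so the same proximity and outlier functions described here can be reused, with X and y as in the earlier sketch):

from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=10, max_depth=5)
xgb.fit(X, y)
leaves_xgb = xgb.apply(X)                    # leaf indices, one column per boosted tree
out_xgb = outlier(proximity(leaves_xgb), y)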

By the way, the median and MAD are used to normalize the outlier measure, instead of the mean and standard deviation, because they are robust statistics that are not easily affected by outliers.

def outlier(data, label):
  N = len(label)
  pbar = [0] * N
  data = np.square(data)

  # Within-class proximity: sum the squared proximities to samples in the same class
  for n,n_prox2 in enumerate(data):
    for k,k_prox2 in enumerate(n_prox2):
      if label[n] == label[k]:
        pbar[n] += k_prox2
    if pbar[n] == 0.0:
      pbar[n] = 1.0e-32

  # Raw outlier measure: number of samples divided by the within-class proximity
  out = N / np.array(pbar)

  # Find the median of the raw outlier measure for each class
  meds = {}
  for n,l in enumerate(label):
    if l not in meds.keys():
      meds[l] = []
    meds[l].append(out[n])
  
  label_uniq = list(set(label))
  med_uniq = {} # median of the raw outlier measure for each class
  for l in label_uniq:
    med_uniq[l] = np.median(meds[l])
  
  # Find the median absolute deviation (MAD) of the outlier measure for each class
  mads = {}
  for n,l in enumerate(label):
    if l not in mads.keys():
      mads[l] = []
    mads[l].append(np.abs(out[n] - med_uniq[l]))

  mad_uniq = {} # MAD of the raw outlier measure for each class
  for l in label_uniq:
    mad_uniq[l] = np.median(mads[l])

  # Normalize each sample's outlier measure with the class median and MAD
  outlier = [0] * N
  for n,l in enumerate(label):
    if mad_uniq[l] == 0.0:
      outlier[n] = out[n] - med_uniq[l]
    else:
      outlier[n] = (out[n] - med_uniq[l]) / mad_uniq[l]

  return outlier

Sample

Using the above functions, I tried to identify outliers in the iris dataset included in sklearn's sample data. The sample code that outputs the image of the result is shown below.

out.png: scatter plot of the normalized outlier measure for each iris sample, colored by class

Code

outlier.py


from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt

def proximity(data):
  # data: leaf indices returned by apply, shape [number of samples, number of decision trees]
  n_samples = np.zeros((len(data),len(data)))
  n_estimators = len(data[0])
  # for each decision tree, count the pairs of samples that fall into the same leaf
  for e,est in enumerate(np.transpose(np.array(data))):
    for n,n_node in enumerate(est):
      for k,k_node in enumerate(est):
        if n_node == k_node:
          n_samples[n][k] += 1
  # normalize by the number of decision trees
  n_samples = 1.0 * np.array(n_samples) / n_estimators
  return n_samples

def outlier(data, label):
  N = len(label)
  pbar = [0] * N
  data = np.square(data)

  # Within-class proximity: sum the squared proximities to samples in the same class
  for n,n_prox2 in enumerate(data):
    for k,k_prox2 in enumerate(n_prox2):
      if label[n] == label[k]:
        pbar[n] += k_prox2
    if pbar[n] == 0.0:
      pbar[n] = 1.0e-32

  # Raw outlier measure: number of samples divided by the within-class proximity
  out = N / np.array(pbar)

  # Find the median of the raw outlier measure for each class
  meds = {}
  for n,l in enumerate(label):
    if l not in meds.keys():
      meds[l] = []
    meds[l].append(out[n])
  
  label_uniq = list(set(label))
  med_uniq = {} # median of the raw outlier measure for each class
  for l in label_uniq:
    med_uniq[l] = np.median(meds[l])
  
  # Find the median absolute deviation (MAD) of the outlier measure for each class
  mads = {}
  for n,l in enumerate(label):
    if l not in mads.keys():
      mads[l] = []
    mads[l].append(np.abs(out[n] - med_uniq[l]))

  mad_uniq = {} # MAD of the raw outlier measure for each class
  for l in label_uniq:
    mad_uniq[l] = np.median(mads[l])

  # Normalize each sample's outlier measure with the class median and MAD
  outlier = [0] * N
  for n,l in enumerate(label):
    if mad_uniq[l] == 0.0:
      outlier[n] = out[n] - med_uniq[l]
    else:
      outlier[n] = (out[n] - med_uniq[l]) / mad_uniq[l]

  return outlier


if __name__ == '__main__':
  iris = load_iris()
  X = iris.data
  y = iris.target
  div = 50        # number of samples per iris class
  best_oob = 0.0  # best out-of-bag score seen so far

  # fit several forests and keep the leaf assignments of the one
  # with the highest out-of-bag score
  for i in range(20):
    rf = RandomForestClassifier(max_depth=5,n_estimators=10,oob_score=True)
    rf.fit(X, y)
    if rf.oob_score_ > best_oob:
      best_oob = rf.oob_score_
      app = rf.apply(X)
  
  prx = proximity(app)
  out = outlier(prx,y)
  
  fig = plt.figure(figsize=[7,4])
  ax = fig.add_subplot(1,1,1)

  ax.scatter(np.arange(div),out[:div], c="r",marker='o', label='class 0')
  ax.scatter(np.arange(div,div*2),out[div:div*2], c="b",marker='^', label='class 1')
  ax.scatter(np.arange(div*2,div*3),out[div*2:], c="g",marker='s', label='class 2')
  
  ax.set_ylabel('outlier') 
  ax.legend(loc="best")
  fig.savefig("out.png ")
  

