So far, I have summarized logistic regression, support vector machines, neural networks, and so on. This time, I will summarize the decision tree, which is the basis of methods such as XGBoost, LightGBM, and Random Forest.
A decision tree is an algorithm that splits the data step by step and outputs the analysis result in the form of a tree.
The main advantage of analyzing data with a decision tree is that the result is easy to interpret. Other classifiers (support vector machines, neural networks, and so on) perform very complicated, black-boxed calculations internally, so it is hard for anyone who is not familiar with the internals of the model to understand what it is doing. A decision tree, on the other hand, is easy to understand because the grounds for each split are explicit, as in the figure above. I think this **ease of understanding** is a huge advantage, because I don't think it is a good attitude for an engineer to produce results with a model that he or she does not understand.
A decision tree is an algorithm that can be applied to both classification and regression, but this time I will focus on classification. There are several decision tree algorithms, such as CART and C4.5; this article deals with CART.
The data is split so that, after the split, each group contains as few elements of other classes as possible (= the impurity after the split is minimized). Impurity is an indicator of how much elements of different classes are mixed together within a group.
Let's consider this briefly with an example. I will generate a scattered sample with scikit-learn's make_blobs() method.
RF.ipynb
# Imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
from sklearn.datasets import make_blobs

# Generate 500 points spread across 4 clusters
X, y = make_blobs(n_samples=500, centers=4,
                  random_state=8, cluster_std=2.2)

# Scatter plot the points
plt.figure(figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet')
The original scatter plot is here.
Now, let's think about how to divide this scatter plot.
With the dividing line on the left, the data can be split into a group containing mostly elements of classes $[0]$ and $[1]$ (and few of $[2]$ and $[3]$) and a group containing mostly elements of $[2]$ and $[3]$ (and few of $[0]$ and $[1]$). A split like this has low impurity. With the dividing line on the right, on the other hand, elements of classes $[0]$ through $[3]$ remain mixed together in the groups after the split; such a split is said to have high impurity. By repeatedly finding dividing lines like the one on the left (the number of repetitions is called the depth of the decision tree), we classify the data as intended.
Now, let's summarize how to express this impurity. This time we focus on the Gini diversity index, which is more commonly called the Gini impurity.
Consider a node $t$ of the decision tree (a node is a group obtained after a split). Suppose the node contains $n$ samples drawn from $c$ classes. If the number of samples belonging to class $i$ at node $t$ is $n_i$, the proportion $p(i|t)$ of samples belonging to class $i$ can be expressed as follows.

$$p(i|t) = \frac{n_i}{n}$$
At this time, the Gini impurity $I_G(t)$ can be expressed as follows.

$$I_G(t) = 1 - \sum_{i=1}^{c} p(i|t)^2$$
If a good split is made, the sum of $p(i|t)^2$ becomes large, and therefore $I_G(t)$ becomes small. We build a good classifier using this evaluation index.
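As a quick sanity check, here is a small sketch (not part of the original notebook) that computes the Gini impurity of a perfectly pure node and of an evenly mixed node with four classes:

import numpy as np

def gini_impurity(class_counts):
    """Gini impurity I_G(t) = 1 - sum_i p(i|t)^2 for one node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([100, 0, 0, 0]))    # pure node -> 0.0
print(gini_impurity([25, 25, 25, 25]))  # evenly mixed 4 classes -> 0.75

The pure node gives an impurity of 0, while the evenly mixed node gives the maximum value for four classes, which matches the intuition above.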
Let's color each class region so that the splits are easier to understand.
RF.ipynb
def visualize_tree(classifier, X, y, boundaries=True, xlim=None, ylim=None):
    '''
    Visualize the decision tree.
    INPUTS: classification model, X, y, optional x/y limits.
    OUTPUTS: visualization of the decision regions using a meshgrid
    '''
    # Build the model with fit
    classifier.fit(X, y)

    # Automatically adjust the axis limits
    if xlim is None:
        xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)

    x_min, x_max = xlim
    y_min, y_max = ylim

    # Make a mesh grid
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    # Save the classifier's predictions as Z
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])

    # Reshape Z to match the mesh grid
    Z = Z.reshape(xx.shape)

    # Color each predicted region
    plt.figure(figsize=(10, 10))
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='jet')

    # Also draw the training data
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)

    def plot_boundaries(i, xlim, ylim):
        '''
        Draw the decision boundaries, recursing into the tree.
        '''
        if i < 0:
            return

        tree = classifier.tree_

        # Call recursively to draw the boundary of each split
        if tree.feature[i] == 0:
            plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
            plot_boundaries(tree.children_left[i],
                            [xlim[0], tree.threshold[i]], ylim)
            plot_boundaries(tree.children_right[i],
                            [tree.threshold[i], xlim[1]], ylim)
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])

    if boundaries:
        plot_boundaries(0, plt.xlim(), plt.ylim())
RF.ipynb
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
visualize_tree(clf,X,y)
The data was split nicely. Above, the depth of the decision tree is set with max_depth. Making max_depth too large (= making the tree too deep) leads to overfitting. Let's try it with max_depth=6.
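A minimal sketch of that experiment, reusing the visualize_tree helper defined above (the variable name clf_deep is mine):

# Same settings as before, but with a deeper tree (max_depth=6)
clf_deep = DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0)
visualize_tree(clf_deep, X, y)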
You can see that the space is over-divided (especially around the red class). The point is that this depth is something you need to tune yourself.
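One common way to choose the depth is cross-validation; here is a sketch using scikit-learn's GridSearchCV (the candidate depths below are my own assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search over a few candidate depths with 5-fold cross-validation
param_grid = {'max_depth': [2, 3, 4, 5, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)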
By the way, the diagram of the decision tree trained with depth 2 is shown below.
The split thresholds and Gini impurity used as the criteria for classification are listed at each node, and value shows the number of elements belonging to each of the classes [0] to [3].
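As an aside, recent versions of scikit-learn can draw a similar diagram directly with sklearn.tree.plot_tree, without any extra software; a minimal sketch:

from sklearn import tree

# clf was already fitted inside visualize_tree above
plt.figure(figsize=(12, 8))
tree.plot_tree(clf, class_names=['0', '1', '2', '3'], filled=True, rounded=True)
plt.show()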
By the way, although the image above looks simple, producing and saving it required installing a library and some extra software. In my case, three steps were needed: installing pydotplus, installing Graphviz, and adding Graphviz to the PATH. pydotplus is a library that saves the splits of the decision tree to a .dot file.
console
pip install pydotplus
Like any other library, it can be installed with pip.
I downloaded the installer (graphviz-2.38.msi) from this URL.
https://graphviz.gitlab.io/_pages/Download/Download_windows.html
After the download is complete, double-click graphviz-2.38.msi to install it. In addition, install the graphviz package with pip.
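A sketch of that command (assuming the standard package name graphviz on PyPI):

console
pip install graphviz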
Then, to convert the .dot file to PDF, add the folder that contains dot.exe to the PATH. I have summarized how to do this before, so please refer to the article below.
https://qiita.com/Fumio-eisan/items/340de9fe220a90607013
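Alternatively, the PATH can also be extended from inside the notebook; a sketch, where the install directory is only an example and depends on your environment:

import os

# Example only: point this at the folder that actually contains dot.exe on your machine
os.environ["PATH"] += os.pathsep + r"C:\Program Files (x86)\Graphviz2.38\bin"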
Finally, I also needed to rewrite part of graphviz.py; the site referenced at the end of this article describes the change. With that done, the decision tree can be converted to .dot and PDF as follows.
RF.ipynb
import os

import pydotplus
from graphviz import Source
from sklearn.tree import export_graphviz

# Export the fitted tree to text_classification.dot
export_graphviz(
    clf,
    out_file=os.path.join("text_classification.dot"),
    class_names=['1', '2', '3', '4'],
    rounded=True,
    filled=True
)

# Write the tree to random.dot as well
with open("random.dot", 'w') as f:
    f = export_graphviz(clf, out_file=f)

# Convert the dot data to a PDF via pydotplus
data = export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(data)
graph.write_pdf("random.pdf")
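As a side note, the same .dot-to-PDF conversion can also be done from the console with Graphviz's dot command, assuming dot.exe is on the PATH:

console
dot -Tpdf random.dot -o random.pdf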
I referred to this site.
GraphViz error handling (GraphViz's executables not found) https://niwakomablog.com/graphviz-error-handling/
In this article, we have summarized the theory and implementation of classification with decision trees. The idea is easy to understand and easy to implement. However, converting the decision tree to a PDF at the end was a bit of a hurdle. Next, I would like to tackle regression.
The full program is here. https://github.com/Fumio-eisan/RF_20200423