So far, I have summarized logistic regression, support vector machines, neural networks, and so on. This time, I will summarize the decision tree, which is the basis of methods such as XGBoost, LightGBM, and Random Forest.
A decision tree is an algorithm that splits the data step by step and outputs the analysis result in the form of a tree.
The main advantage of analyzing data with a decision tree is that the result is easy to interpret. Other classifiers (support vector machines, neural networks, and so on) perform very complicated, black-boxed calculations internally, so it is hard for anyone who is not familiar with the internals of the model to understand what it is doing. A decision tree, on the other hand, is easy to understand because the grounds for each split are explicit, as in the figure above. I think this **ease of understanding** is a huge advantage, because I don't think it is a good attitude for an engineer to produce results with a model that he or she does not understand.
A decision tree is an algorithm that can be applied to both classification and regression, but this time I will focus on classification. There are several decision tree algorithms, such as CART and C4.5; this article deals with CART.
The data is split so that, after the split, each group contains as few elements of other classes as possible (= the impurity after the split is minimized). Impurity is an indicator of how much elements of different classes are mixed together within a group.
Let's consider this briefly with an example. I will generate a scattered sample with scikit-learn's make_blobs() method.
RF.ipynb
# Imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
from sklearn.datasets import make_blobs

# Generate 500 points spread across 4 clusters
X, y = make_blobs(n_samples=500, centers=4,
                  random_state=8, cluster_std=2.2)

# Scatter plot the points
plt.figure(figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet')
The original scatter plot is here.
Now, let's think about how to divide this scatter plot.
With the dividing line on the left, the data can be split into a group containing mostly elements of classes $[0]$ and $[1]$ (and few of $[2]$ and $[3]$) and a group containing mostly elements of $[2]$ and $[3]$ (and few of $[0]$ and $[1]$). A split like this has low impurity. With the dividing line on the right, on the other hand, elements of classes $[0]$ through $[3]$ remain mixed together in the groups after the split; such a split is said to have high impurity. By repeatedly finding dividing lines like the one on the left (the number of repetitions is called the depth of the decision tree), we classify the data as intended.
Now, let's summarize how to express this impurity. This time we focus on the Gini diversity index, which is more commonly called the Gini impurity.
Consider a node $t$ of the decision tree (a node is a group obtained after a split). Suppose the node contains $n$ samples drawn from $c$ classes. If the number of samples belonging to class $i$ at node $t$ is $n_i$, the proportion $p(i|t)$ of samples belonging to class $i$ can be expressed as follows.

$$p(i|t) = \frac{n_i}{n}$$
At this time, the Gini impurity $I_G(t)$ can be expressed as follows.

$$I_G(t) = 1 - \sum_{i=1}^{c} p(i|t)^2$$
If a good split is made, the sum of $p(i|t)^2$ becomes large, and therefore $I_G(t)$ becomes small. We build a good classifier using this evaluation index.
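As a quick sanity check, here is a small sketch (not part of the original notebook) that computes the Gini impurity of a perfectly pure node and of an evenly mixed node with four classes:

import numpy as np

def gini_impurity(class_counts):
    """Gini impurity I_G(t) = 1 - sum_i p(i|t)^2 for one node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([100, 0, 0, 0]))    # pure node -> 0.0
print(gini_impurity([25, 25, 25, 25]))  # evenly mixed 4 classes -> 0.75

The pure node gives an impurity of 0, while the evenly mixed node gives the maximum value for four classes, which matches the intuition above.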
Let's color each class region so that the splits are easier to understand.
RF.ipynb
def visualize_tree(classifier, X, y, boundaries=True, xlim=None, ylim=None):
    '''
    Visualize the decision tree.
    INPUTS: classification model, X, y, optional x/y limits.
    OUTPUTS: visualization of the decision regions using a meshgrid
    '''
    # Build the model with fit
    classifier.fit(X, y)

    # Automatically adjust the axis limits
    if xlim is None:
        xlim = (X[:, 0].min() - 0.1, X[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)

    x_min, x_max = xlim
    y_min, y_max = ylim

    # Make a mesh grid
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    # Save the classifier's predictions as Z
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])

    # Reshape Z to match the mesh grid
    Z = Z.reshape(xx.shape)

    # Color each predicted region
    plt.figure(figsize=(10, 10))
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='jet')

    # Also draw the training data
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='jet')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)

    def plot_boundaries(i, xlim, ylim):
        '''
        Draw the decision boundaries, recursing into the tree.
        '''
        if i < 0:
            return

        tree = classifier.tree_

        # Call recursively to draw the boundary of each split
        if tree.feature[i] == 0:
            plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
            plot_boundaries(tree.children_left[i],
                            [xlim[0], tree.threshold[i]], ylim)
            plot_boundaries(tree.children_right[i],
                            [tree.threshold[i], xlim[1]], ylim)
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])

    if boundaries:
        plot_boundaries(0, plt.xlim(), plt.ylim())
RF.ipynb
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
visualize_tree(clf,X,y)
The data was split nicely. Above, the depth of the decision tree is set with max_depth. Making max_depth too large (= making the tree too deep) leads to overfitting. Let's try it with max_depth=6.
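A minimal sketch of that experiment, reusing the visualize_tree helper defined above (the variable name clf_deep is mine):

# Same settings as before, but with a deeper tree (max_depth=6)
clf_deep = DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0)
visualize_tree(clf_deep, X, y)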
You can see that the space is over-divided (especially around the red class). The point is that this depth is something you need to tune yourself.
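One common way to choose the depth is cross-validation; here is a sketch using scikit-learn's GridSearchCV (the candidate depths below are my own assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search over a few candidate depths with 5-fold cross-validation
param_grid = {'max_depth': [2, 3, 4, 5, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)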
By the way, the diagram of the decision tree trained with depth 2 is shown below.
The split thresholds and Gini impurity used as the criteria for classification are listed at each node, and value shows the number of elements belonging to each of the classes [0] to [3].
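As an aside, recent versions of scikit-learn can draw a similar diagram directly with sklearn.tree.plot_tree, without any extra software; a minimal sketch:

from sklearn import tree

# clf was already fitted inside visualize_tree above
plt.figure(figsize=(12, 8))
tree.plot_tree(clf, class_names=['0', '1', '2', '3'], filled=True, rounded=True)
plt.show()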
By the way, although the image above looks simple, producing and saving it required installing a library and some extra software. In my case, three steps were needed: installing pydotplus, installing Graphviz, and adding Graphviz to the PATH. pydotplus is a library that saves the splits of the decision tree to a .dot file.
console
pip install pydotplus
Like any other library, it can be installed with pip.
I downloaded the installer (graphviz-2.38.msi) from this URL.
https://graphviz.gitlab.io/_pages/Download/Download_windows.html
After the download is complete, double-click graphviz-2.38.msi to install it. In addition, install the graphviz package with pip.
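A sketch of that command (assuming the standard package name graphviz on PyPI):

console
pip install graphviz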
Then, to convert the .dot file to PDF, add the folder that contains dot.exe to the PATH. I have summarized how to do this before, so please refer to the article below.
https://qiita.com/Fumio-eisan/items/340de9fe220a90607013
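Alternatively, the PATH can also be extended from inside the notebook; a sketch, where the install directory is only an example and depends on your environment:

import os

# Example only: point this at the folder that actually contains dot.exe on your machine
os.environ["PATH"] += os.pathsep + r"C:\Program Files (x86)\Graphviz2.38\bin"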
Finally, I also needed to rewrite part of graphviz.py; the site referenced at the end of this article describes the change. With that done, the decision tree can be converted to .dot and PDF as follows.
RF.ipynb
import os

import pydotplus
from graphviz import Source
from sklearn.tree import export_graphviz

# Export the fitted tree to text_classification.dot
export_graphviz(
    clf,
    out_file=os.path.join("text_classification.dot"),
    class_names=['1', '2', '3', '4'],
    rounded=True,
    filled=True
)

# Write the tree to random.dot as well
with open("random.dot", 'w') as f:
    f = export_graphviz(clf, out_file=f)

# Convert the dot data to a PDF via pydotplus
data = export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(data)
graph.write_pdf("random.pdf")
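As a side note, the same .dot-to-PDF conversion can also be done from the console with Graphviz's dot command, assuming dot.exe is on the PATH:

console
dot -Tpdf random.dot -o random.pdf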
I referred to this site.
GraphViz error handling (GraphViz's executables not found) https://niwakomablog.com/graphviz-error-handling/
In this article, we have summarized the theory and implementation of classification with decision trees. The idea is easy to understand and easy to implement. However, converting the decision tree to a PDF at the end was a bit of a hurdle. Next, I would like to tackle regression.
The full program is here. https://github.com/Fumio-eisan/RF_20200423