**Here, let's first work through an example of a classification tree.**
```python
# Class that builds a decision tree model
from sklearn.tree import DecisionTreeClassifier
# Module containing decision tree utilities
from sklearn import tree
# Package of datasets for machine learning
from sklearn import datasets
# Utility for splitting data
from sklearn.model_selection import train_test_split
# Module for displaying images in a Notebook
from IPython.display import Image
# Module for visualizing the decision tree model
import pydotplus

iris = datasets.load_iris()
print(iris)
```
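`load_iris()` returns a `Bunch`, a dict-like object, so `print(iris)` dumps everything at once. A quicker way to see what it contains (the exact keys may vary by scikit-learn version):

```python
# List the keys of the Bunch object (dict-like) returned by load_iris()
print(iris.keys())
```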
| No. | Variable name | Meaning | Note | Data type |
|---|---|---|---|---|
| 1 | sepal length | Sepal length | Continuous value (cm) | float64 |
| 2 | sepal width | Sepal width | Continuous value (cm) | float64 |
| 3 | petal length | Petal length | Continuous value (cm) | float64 |
| 4 | petal width | Petal width | Continuous value (cm) | float64 |
| 5 | species | Species | setosa=0, versicolor=1, virginica=2 | int64 |
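To eyeball these variables in tabular form, one option (my own addition; pandas is not used elsewhere in this article) is to wrap the data in a DataFrame:

```python
import pandas as pd

# View the features and the class label together as a table
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
print(df.head())
```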
```python
# Labels of the explanatory variables (features)
print(iris.feature_names)
# Shape of the explanatory variables
print(iris.data.shape)
# Show the first 5 rows of the explanatory variables
iris.data[0:5, :]
# Labels of the objective variable (target)
print(iris.target_names)
# Shape of the objective variable
print(iris.target.shape)
# Show the objective variable
iris.target
```
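`iris.target` holds the classes as the integer codes 0, 1, and 2. A quick way to see which code corresponds to which species:

```python
# Map each integer class code back to its species name
for code, name in enumerate(iris.target_names):
    print(code, '->', name)
# 0 -> setosa
# 1 -> versicolor
# 2 -> virginica
```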
```python
# Store the explanatory variables and the objective variable
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```
With `random_state=0` in the arguments, the same split is reproduced no matter how many times it is repeated.
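As a quick check of that reproducibility (the `a_*`/`b_*` names are my own, just for this comparison):

```python
import numpy as np

# Two splits with the same random_state yield identical arrays
a_train, a_test, _, _ = train_test_split(X, y, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, random_state=0)
print(np.array_equal(a_train, b_train))  # True
print(np.array_equal(a_test, b_test))    # True
```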
```python
# Initialize the class that builds the decision tree model
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
# Fit the decision tree model
model = clf.fit(X_train, y_train)
# Calculate the accuracy on the training and test data
print('Accuracy (train): {:.3f}'.format(model.score(X_train, y_train)))
print('Accuracy (test): {:.3f}'.format(model.score(X_test, y_test)))
```
`criterion='gini'` explicitly specifies **Gini impurity** as the split criterion. (Gini impurity is the default, so it does not actually need to be written.) `max_depth=3` limits the tree to at most 3 levels in this example. Growing a deeper tree with more levels can raise the accuracy, but it also increases the risk of overfitting.

The accuracy is calculated with the `score()` function. The training accuracy is very high at 0.982, and the test accuracy is slightly below that, but both are close to 1.0.
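One way to see this trade-off for yourself is to vary `max_depth` and compare training and test accuracy (a quick sketch; the depth values are arbitrary):

```python
# Deeper trees fit the training data better but can overfit the test data
for depth in [1, 2, 3, 5, 10]:
    m = DecisionTreeClassifier(criterion='gini', max_depth=depth,
                               random_state=0).fit(X_train, y_train)
    print(depth,
          round(m.score(X_train, y_train), 3),
          round(m.score(X_test, y_test), 3))
```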
**Convert the decision tree model to a png and display it in the Notebook**

```python
# Convert the decision tree model to DOT data
dot_data = tree.export_graphviz(model,                             # the fitted decision tree model
                                out_file=None,                     # return a string instead of writing a file
                                feature_names=iris.feature_names,  # display names of the features
                                class_names=iris.target_names,     # display names of the classes
                                filled=True)                       # color nodes by their majority class
# Build the graph from the DOT data
graph = pydotplus.graph_from_dot_data(dot_data)
# Display the diagram
Image(graph.create_png())
```
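Incidentally, if Graphviz/pydotplus is not available in your environment, scikit-learn 0.21 and later ships a Matplotlib-based alternative, `tree.plot_tree`; a minimal sketch:

```python
import matplotlib.pyplot as plt

# Matplotlib-based rendering; no Graphviz installation required
fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(model,
               feature_names=iris.feature_names,
               class_names=list(iris.target_names),
               filled=True, ax=ax)
plt.show()
```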
In the code above, the DOT data is converted to a graph with the `graph_from_dot_data()` function of the **pydotplus** module, which handles the DOT language in Python. The png is then created with `create_png()` and displayed via the `Image()` method of the **IPython.display** module.

The root condition `petal width (cm) <= 0.8` means that the petal width is 0.8 cm or less. If the condition holds, follow the `True` arrow; otherwise follow the `False` arrow. The node at the end of the `True` arrow shows a **Gini impurity of 0.0**: all 37 of its samples are purely classified as the setosa species. This is the first goal. The `False` side branches again into True and False according to a new condition ➁. In this way we descend the hierarchy, branching toward the goal where the **Gini impurity is 0.0**.
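For reference, Gini impurity is 1 minus the sum of the squared class proportions in a node. A small helper of my own (not part of scikit-learn) makes the 0.0 value at the setosa leaf concrete:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0] * 37))        # 0.0    -- a pure node, like the setosa leaf
print(gini_impurity([0, 1, 2] * 10))  # ~0.667 -- an evenly mixed node
```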
```python
# Export to a png file
graph.write_png("iris.png")

# Download from Google Colaboratory
from google.colab import files
files.download('iris.png')
```
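Note that `google.colab.files` exists only on Colab. In a local Jupyter environment the last two lines are unnecessary; the file is simply written to the current working directory, which you can confirm like this:

```python
import os

# Outside Colab: just check that the png landed in the working directory
print(os.path.abspath("iris.png"), os.path.exists("iris.png"))
```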