**Here, let's first work through an example of a classification tree.**
```python
# Class that builds a decision tree model
from sklearn.tree import DecisionTreeClassifier
# Module containing decision tree utilities
from sklearn import tree
# Package of datasets for machine learning
from sklearn import datasets
# Utility for splitting data
from sklearn.model_selection import train_test_split
# Module for displaying images in a Notebook
from IPython.display import Image
# Module for visualizing the decision tree model
import pydotplus

iris = datasets.load_iris()
print(iris)
```
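`load_iris()` returns a `Bunch`, a dict-like object, so `print(iris)` dumps everything at once. A quicker way to see what it contains (the exact keys may vary by scikit-learn version):

```python
# List the keys of the Bunch object (dict-like) returned by load_iris()
print(iris.keys())
```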
| No. | Variable name | Meaning | Note | Data type |
|---|---|---|---|---|
| 1 | sepal length | Sepal length | Continuous value (cm) | float64 |
| 2 | sepal width | Sepal width | Continuous value (cm) | float64 |
| 3 | petal length | Petal length | Continuous value (cm) | float64 |
| 4 | petal width | Petal width | Continuous value (cm) | float64 |
| 5 | species | Species | setosa=0, versicolor=1, virginica=2 | int64 |
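To eyeball these variables in tabular form, one option (my own addition; pandas is not used elsewhere in this article) is to wrap the data in a DataFrame:

```python
import pandas as pd

# View the features and the class label together as a table
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
print(df.head())
```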
```python
# Labels of the explanatory variables (features)
print(iris.feature_names)
# Shape of the explanatory variables
print(iris.data.shape)
# Show the first 5 rows of the explanatory variables
iris.data[0:5, :]
# Labels of the objective variable (target)
print(iris.target_names)
# Shape of the objective variable
print(iris.target.shape)
# Show the objective variable
iris.target
```
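`iris.target` holds the classes as the integer codes 0, 1, and 2. A quick way to see which code corresponds to which species:

```python
# Map each integer class code back to its species name
for code, name in enumerate(iris.target_names):
    print(code, '->', name)
# 0 -> setosa
# 1 -> versicolor
# 2 -> virginica
```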
```python
# Store the explanatory variables and the objective variable
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```
With `random_state=0` in the arguments, the same split is reproduced no matter how many times it is repeated.
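As a quick check of that reproducibility (the `a_*`/`b_*` names are my own, just for this comparison):

```python
import numpy as np

# Two splits with the same random_state yield identical arrays
a_train, a_test, _, _ = train_test_split(X, y, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, random_state=0)
print(np.array_equal(a_train, b_train))  # True
print(np.array_equal(a_test, b_test))    # True
```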
```python
# Initialize the class that builds the decision tree model
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
# Fit the decision tree model
model = clf.fit(X_train, y_train)
# Calculate the accuracy on the training and test data
print('Accuracy (train): {:.3f}'.format(model.score(X_train, y_train)))
print('Accuracy (test): {:.3f}'.format(model.score(X_test, y_test)))
```
`criterion='gini'` explicitly specifies **Gini impurity** as the split criterion. (Gini impurity is the default, so it does not actually need to be written.) `max_depth=3` limits the tree to at most 3 levels in this example. Growing a deeper tree with more levels can raise the accuracy, but it also increases the risk of overfitting.

The accuracy is calculated with the `score()` function. The training accuracy is very high at 0.982, and the test accuracy is slightly below that, but both are close to 1.0.
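One way to see this trade-off for yourself is to vary `max_depth` and compare training and test accuracy (a quick sketch; the depth values are arbitrary):

```python
# Deeper trees fit the training data better but can overfit the test data
for depth in [1, 2, 3, 5, 10]:
    m = DecisionTreeClassifier(criterion='gini', max_depth=depth,
                               random_state=0).fit(X_train, y_train)
    print(depth,
          round(m.score(X_train, y_train), 3),
          round(m.score(X_test, y_test), 3))
```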
**Convert the decision tree model to a png and display it in the Notebook**

```python
# Convert the decision tree model to DOT data
dot_data = tree.export_graphviz(model,                             # the fitted decision tree model
                                out_file=None,                     # return a string instead of writing a file
                                feature_names=iris.feature_names,  # display names of the features
                                class_names=iris.target_names,     # display names of the classes
                                filled=True)                       # color nodes by their majority class
# Build the graph from the DOT data
graph = pydotplus.graph_from_dot_data(dot_data)
# Display the diagram
Image(graph.create_png())
```
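Incidentally, if Graphviz/pydotplus is not available in your environment, scikit-learn 0.21 and later ships a Matplotlib-based alternative, `tree.plot_tree`; a minimal sketch:

```python
import matplotlib.pyplot as plt

# Matplotlib-based rendering; no Graphviz installation required
fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(model,
               feature_names=iris.feature_names,
               class_names=list(iris.target_names),
               filled=True, ax=ax)
plt.show()
```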
In the code above, the DOT data is converted to a graph with the `graph_from_dot_data()` function of the **pydotplus** module, which handles the DOT language in Python. The png is then created with `create_png()` and displayed via the `Image()` method of the **IPython.display** module.

The root condition `petal width (cm) <= 0.8` means that the petal width is 0.8 cm or less. If the condition holds, follow the `True` arrow; otherwise follow the `False` arrow. The node at the end of the `True` arrow shows a **Gini impurity of 0.0**: all 37 of its samples are purely classified as the setosa species. This is the first goal. The `False` side branches again into True and False according to a new condition ➁. In this way we descend the hierarchy, branching toward the goal where the **Gini impurity is 0.0**.
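For reference, Gini impurity is 1 minus the sum of the squared class proportions in a node. A small helper of my own (not part of scikit-learn) makes the 0.0 value at the setosa leaf concrete:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0] * 37))        # 0.0    -- a pure node, like the setosa leaf
print(gini_impurity([0, 1, 2] * 10))  # ~0.667 -- an evenly mixed node
```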
```python
# Export to a png file
graph.write_png("iris.png")

# Download from Google Colaboratory
from google.colab import files
files.download('iris.png')
```
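Note that `google.colab.files` exists only on Colab. In a local Jupyter environment the last two lines are unnecessary; the file is simply written to the current working directory, which you can confirm like this:

```python
import os

# Outside Colab: just check that the png landed in the working directory
print(os.path.abspath("iris.png"), os.path.exists("iris.png"))
```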