A decision tree is a method that classifies data by applying conditions to it one after another. In the figure below, I am trying to decide whether or not to go windsurfing, so the data is first split by the strength of the wind and then by whether or not it is sunny.
The model on the right is called a decision tree. As shown in the figure on the left, a decision tree performs classification by carrying out linear classification multiple times.
Extracted from 'Introduction to Machine Learning', Udacity
By the way, decision trees can be used for both regression and classification, but this time I will talk about classification.
```python
from sklearn.tree import DecisionTreeClassifier

# The constructor and its default values (as of the scikit-learn
# version used here; some parameters, such as min_impurity_split and
# presort, have since been removed in newer versions)
clf = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features=None, random_state=None, max_leaf_nodes=None,
    min_impurity_split=1e-07, class_weight=None, presort=False)
```
The constructor has many parameters; only the main ones are explained below.
With min_samples_split=2, a node keeps branching as long as it contains two or more samples. In the figure below, the area circled in light blue contains two or more samples, so branching continues there. However, letting nodes split down to so few samples makes overfitting very likely, so this parameter needs tuning.
Extracted from 'Introduction to Machine Learning', Udacity
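As a rough sketch of the effect (the iris dataset and the value 50 are my own choices for illustration), a larger min_samples_split stops branching earlier and produces a much smaller tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Default: nodes keep splitting as long as they hold at least 2 samples
clf_fine = DecisionTreeClassifier(min_samples_split=2, random_state=0)
clf_fine.fit(X_train, y_train)

# Splitting stops once a node holds fewer than 50 samples
clf_coarse = DecisionTreeClassifier(min_samples_split=50, random_state=0)
clf_coarse.fit(X_train, y_train)

# The coarser tree has far fewer nodes and is less prone to overfitting
print(clf_fine.tree_.node_count, clf_coarse.tree_.node_count)
```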
The criterion parameter specifies how the quality of a split is measured, with 'gini' or 'entropy'.
'gini': uses Gini impurity; the lower the impurity of each resulting node, the better the split. 'entropy': uses information gain to find the most efficient conditions. In practice there is usually not much difference between the two.
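For reference, here is a small sketch (my own illustration) of what the two measures compute for a node's class proportions p_k: Gini impurity is 1 - sum(p_k^2) and entropy is -sum(p_k * log2(p_k)), and both are 0 for a perfectly pure node.

```python
import numpy as np

def gini(p):
    # Gini impurity: 1 - sum(p_k^2)
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy: -sum(p_k * log2(p_k)); skip zero proportions to avoid log(0)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5, 1.0  (maximally impure)
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))  # 0.18, ~0.47 (much purer)
```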
max_depth determines and limits the maximum depth of the decision tree, which helps prevent overfitting.
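As a rough sketch (the iris dataset and the depth values are my own choices), comparing training and test accuracy at different depths makes the trade-off visible:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

for depth in (1, 3, None):  # None means the tree grows until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_train, y_train), clf.score(X_test, y_test))
```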
```python
import pydotplus
from IPython.display import Image
from sklearn import tree
from sklearn.datasets import load_iris

# Train a tree on the iris dataset, then render it with Graphviz
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
```
The image and the code are extracted from the 'scikit-learn documentation'.
--min_samples_leaf specifies the minimum number of samples a leaf node must contain (a small sketch follows below). --max_depth limits the depth of the tree.
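Here is a minimal sketch of min_samples_leaf (the value 10 is my own choice for illustration): forcing every leaf to hold at least 10 samples keeps the tree from chasing individual points.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Every leaf must contain at least 10 samples
clf = DecisionTreeClassifier(min_samples_leaf=10)
clf.fit(iris.data, iris.target)

# Far fewer nodes than an unconstrained tree would have
print(clf.tree_.node_count)
```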
--Good point
Whereas other algorithms require preprocessing such as normalizing the data or creating dummy variables, a decision tree needs almost no preprocessing and can handle both categorical and numerical data directly. It can also be visualized as shown above, so it is a very easy-to-understand algorithm.
--Bad point
It overfits easily. Also, because the data is classified with vertical and horizontal straight lines, data that cannot be separated by boundary lines parallel to the axes cannot be classified well (a small demonstration follows below).
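As a small demonstration (synthetic data of my own making), when the true boundary runs along the diagonal, the tree has to approximate it with a staircase of axis-parallel cuts:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # the true boundary is the diagonal x0 = x1

# A shallow tree can only make a few axis-parallel cuts,
# so it approximates the diagonal poorly
shallow = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(shallow.score(X, y))  # noticeably below 1.0

# A deep tree reaches 100% on the training data, but only by
# memorizing a staircase of tiny cuts, i.e. by overfitting
deep = DecisionTreeClassifier().fit(X, y)
print(deep.score(X, y))  # 1.0
```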
The above is an outline of decision trees as far as I understand them. I will keep updating this post, so if you have anything to add or fix, I would appreciate your comments.