A decision tree is a method that classifies data by applying conditions to it one after another. In the figure below, I am trying to decide whether or not to go windsurfing, so the data is first split by the strength of the wind and then by whether or not it is sunny.
The model on the right is called a decision tree. As shown in the figure on the left, a decision tree performs classification by carrying out linear classification multiple times.
Extracted from 'Introduction to Machine Learning', Udacity
By the way, decision trees can be used for both regression and classification, but this time I will talk about classification.
```python
from sklearn.tree import DecisionTreeClassifier

# The constructor and its default values (as of the scikit-learn
# version used here; some parameters, such as min_impurity_split and
# presort, have since been removed in newer versions)
clf = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features=None, random_state=None, max_leaf_nodes=None,
    min_impurity_split=1e-07, class_weight=None, presort=False)
```
The constructor has many parameters; only the main ones are explained below.
With min_samples_split=2, a node keeps branching as long as it contains two or more samples. In the figure below, the area circled in light blue contains two or more samples, so branching continues there. However, letting nodes split down to so few samples makes overfitting very likely, so this parameter needs tuning.
Extracted from 'Introduction to Machine Learning', Udacity
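As a rough sketch of the effect (the iris dataset and the value 50 are my own choices for illustration), a larger min_samples_split stops branching earlier and produces a much smaller tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Default: nodes keep splitting as long as they hold at least 2 samples
clf_fine = DecisionTreeClassifier(min_samples_split=2, random_state=0)
clf_fine.fit(X_train, y_train)

# Splitting stops once a node holds fewer than 50 samples
clf_coarse = DecisionTreeClassifier(min_samples_split=50, random_state=0)
clf_coarse.fit(X_train, y_train)

# The coarser tree has far fewer nodes and is less prone to overfitting
print(clf_fine.tree_.node_count, clf_coarse.tree_.node_count)
```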
The criterion parameter specifies how the quality of a split is measured, with 'gini' or 'entropy'.
'gini': uses Gini impurity; the lower the impurity of each resulting node, the better the split. 'entropy': uses information gain to find the most efficient conditions. In practice there is usually not much difference between the two.
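For reference, here is a small sketch (my own illustration) of what the two measures compute for a node's class proportions p_k: Gini impurity is 1 - sum(p_k^2) and entropy is -sum(p_k * log2(p_k)), and both are 0 for a perfectly pure node.

```python
import numpy as np

def gini(p):
    # Gini impurity: 1 - sum(p_k^2)
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy: -sum(p_k * log2(p_k)); skip zero proportions to avoid log(0)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5, 1.0  (maximally impure)
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))  # 0.18, ~0.47 (much purer)
```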
max_depth determines and limits the maximum depth of the decision tree, which helps prevent overfitting.
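As a rough sketch (the iris dataset and the depth values are my own choices), comparing training and test accuracy at different depths makes the trade-off visible:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

for depth in (1, 3, None):  # None means the tree grows until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_train, y_train), clf.score(X_test, y_test))
```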
```python
import pydotplus
from IPython.display import Image
from sklearn import tree
from sklearn.datasets import load_iris

# Train a tree on the iris dataset, then render it with Graphviz
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
```
The image and the code are extracted from the 'scikit-learn documentation'.
--min_samples_leaf specifies the minimum number of samples a leaf node must contain (a small sketch follows below). --max_depth limits the depth of the tree.
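Here is a minimal sketch of min_samples_leaf (the value 10 is my own choice for illustration): forcing every leaf to hold at least 10 samples keeps the tree from chasing individual points.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Every leaf must contain at least 10 samples
clf = DecisionTreeClassifier(min_samples_leaf=10)
clf.fit(iris.data, iris.target)

# Far fewer nodes than an unconstrained tree would have
print(clf.tree_.node_count)
```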
--Good point
Whereas other algorithms require preprocessing such as normalizing the data or creating dummy variables, a decision tree needs almost no preprocessing and can handle both categorical and numerical data directly. It can also be visualized as shown above, so it is a very easy-to-understand algorithm.
--Bad point
It overfits easily. Also, because the data is classified with vertical and horizontal straight lines, data that cannot be separated by boundary lines parallel to the axes cannot be classified well (a small demonstration follows below).
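As a small demonstration (synthetic data of my own making), when the true boundary runs along the diagonal, the tree has to approximate it with a staircase of axis-parallel cuts:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # the true boundary is the diagonal x0 = x1

# A shallow tree can only make a few axis-parallel cuts,
# so it approximates the diagonal poorly
shallow = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(shallow.score(X, y))  # noticeably below 1.0

# A deep tree reaches 100% on the training data, but only by
# memorizing a staircase of tiny cuts, i.e. by overfitting
deep = DecisionTreeClassifier().fit(X, y)
print(deep.score(X, y))  # 1.0
```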
The above is an outline of decision trees as far as I understand them. I will keep updating this post, so if you have anything to add or fix, I would appreciate your comments.