Machine learning ③ Summary of decision tree

Summary of Decision Tree

What is Decision Tree?

A decision tree is a method that applies conditions to the data one after another and classifies each sample according to the outcome of each condition. The figure below decides whether or not to go windsurfing: the data are first split by the strength of the wind, and then by whether it is sunny or not.

The model on the right of the figure is called a decision tree. As the left side of the figure shows, a decision tree classifies by applying a linear split several times in succession.

[Figure: windsurfing example, with the split data on the left and the decision tree on the right. Source: 'Introduction to Machine Learning', Udacity]
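The two conditions described above can be written as nested if statements, which is all a small decision tree really is. This is only a sketch: the function name `should_windsurf` and the outcome at each leaf (surf only when it is both windy and sunny) are my assumptions for illustration, not values read from the figure.

```python
def should_windsurf(wind_is_strong: bool, is_sunny: bool) -> bool:
    """Hand-written two-level decision tree (leaf outcomes are assumed)."""
    # First condition: the strength of the wind.
    if not wind_is_strong:
        return False  # assumed leaf: not enough wind
    # Second condition: whether it is sunny or not.
    if is_sunny:
        return True   # assumed leaf: windy and sunny
    return False      # assumed leaf: windy but not sunny


for wind in (False, True):
    for sun in (False, True):
        print(f"strong wind={wind}, sunny={sun} -> {should_windsurf(wind, sun)}")
```

A learned decision tree has the same shape; the difference is that the conditions and their order are chosen from the training data instead of being written by hand.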

Decision trees can be used for both regression and classification; this article covers classification only.

Default code

python

# Constructor defaults at the time of writing; newer scikit-learn releases may differ.
DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
                       min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                       max_features=None, random_state=None, max_leaf_nodes=None,
                       min_impurity_split=1e-07, class_weight=None, presort=False)
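The defaults listed above depend on the scikit-learn version. As far as I know, some of them (for example `presort` and `min_impurity_split`) are no longer accepted by recent releases, so it may be safer to ask the installed library for its own defaults. A minimal sketch using `get_params()`:

```python
from sklearn.tree import DecisionTreeClassifier

# Read the defaults from the installed scikit-learn instead of a printed signature.
params = DecisionTreeClassifier().get_params()

for name in sorted(params):
    print(f"{name} = {params[name]!r}")
```

The exact list of names printed will vary with the installed version.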


Description of the Decision Tree Parameters

There are many parameters. Only the main ones are explained below.

- min_samples_split

With min_samples_split = 2, a node is split further as long as it contains two or more samples. In the figure below, the areas circled in light blue each contain two or more samples, so branching continues there. With a value this small the tree can keep splitting until each leaf holds only a few samples, which makes overfitting likely, so the value needs to be adjusted to the number of samples.

[Figure: regions with two or more samples circled in light blue. Source: 'Introduction to Machine Learning', Udacity]
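To get a feel for min_samples_split, the sketch below fits trees on the Iris data set with a few different values and reports how large each tree becomes. The values 2, 20 and 50 and the variable names are arbitrary choices of mine for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

node_counts = {}
train_acc = {}
for min_split in (2, 20, 50):
    clf = DecisionTreeClassifier(min_samples_split=min_split, random_state=0)
    clf.fit(X, y)
    node_counts[min_split] = clf.tree_.node_count  # total nodes in the fitted tree
    train_acc[min_split] = clf.score(X, y)         # accuracy on the training data
    print(f"min_samples_split={min_split}: "
          f"nodes={node_counts[min_split]}, train accuracy={train_acc[min_split]:.3f}")
```

A larger min_samples_split stops branching earlier, so the tree stays smaller. Training accuracy alone does not show overfitting; a held-out test set is needed for that.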

- criterion

Specifies how the quality of a split is measured: 'gini' or 'entropy'.

'gini': the lower the Gini impurity of the k-th child node, the better the split.
'entropy': uses information gain to find the most efficient split condition.

In practice there seems to be little difference between the two. For details, see the linked posts, for example the blog post sklearn-gini-vs-entropy-criteria.
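The two criteria can be computed by hand from the class counts in a node. The sketch below uses the standard definitions, Gini impurity = 1 - sum(p_k^2) and entropy = sum(p_k * log2(1 / p_k)), where p_k is the fraction of samples of class k. The function names are my own and are not part of scikit-learn.

```python
import math


def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits of a node; classes with zero samples contribute nothing."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = [5, 5]   # node with two classes, evenly mixed
pure = [10, 0]   # node containing a single class
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
print(information_gain(mixed, [[5, 0], [0, 5]]))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why the two criteria often pick similar splits.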

- max_depth

Sets an upper limit on the depth of the decision tree, which helps prevent overfitting.
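As a quick check of what max_depth does, the sketch below compares an unrestricted tree with one limited to depth 2 on the Iris data set. The limit of 2 is an arbitrary value chosen for illustration, and `get_depth()` assumes a reasonably recent scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth=None (the default) lets the tree grow until the leaves are pure.
unlimited = DecisionTreeClassifier(random_state=0).fit(X, y)
# max_depth=2 stops branching after two levels of conditions.
limited = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print("unlimited depth:", unlimited.get_depth())
print("limited depth:  ", limited.get_depth())
```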

Decision Tree visualization (for Iris set)

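python

from IPython.display import Image
import pydotplus
from sklearn import tree
from sklearn.datasets import load_iris

# Train a classifier on the Iris data set, export it as Graphviz DOT text, and render a PNG.
iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())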

[Figure: the fitted Iris decision tree rendered with Graphviz]

The image and the code are taken from the scikit-learn documentation.
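If Graphviz or pydotplus is not installed, a plain-text rendering is a lighter alternative. The sketch below uses `export_text`, which I believe is available in scikit-learn 0.21 and later. The depth limit of 2 is only there to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the fitted tree as indented text, one line per condition or leaf.
text = export_text(clf, feature_names=list(iris.feature_names))
print(text)
```

Each indented line is one condition, and the lines starting with "class:" are the leaves.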

How to prevent overfitting of decision trees

- Use min_samples_leaf to set the minimum number of samples allowed in a leaf (the end of a branch).
- Use max_depth to limit the depth of the tree.
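The two settings above can be combined and then checked on held-out data. The sketch below is one way to do it, under my own assumptions: the values min_samples_leaf=5 and max_depth=3, and the 70/30 split, are arbitrary and would need tuning for real data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = DecisionTreeClassifier(min_samples_leaf=5, max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Leaves are the nodes without children; scikit-learn marks them with -1.
is_leaf = clf.tree_.children_left == -1
leaf_sizes = clf.tree_.n_node_samples[is_leaf]

print("smallest leaf:", leaf_sizes.min())
print("depth:", clf.get_depth())
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```

A large gap between training and test accuracy is the usual sign of overfitting, so comparing the two numbers is a simple way to judge whether the limits are tight enough.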

The pros and cons of Decision Tree

- Good points

Whereas other algorithms require normalizing the data, creating dummy variables, and so on, Decision Tree needs almost no preprocessing and can handle both categorical and numerical data. Also, because the tree can be visualized as shown above, it is a very easy algorithm to understand.

- Bad points

Easy to overfit. The data are split by vertical and horizontal straight lines, so if the classes cannot be separated by boundaries parallel to the axes, the data cannot be classified well.
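The axis-parallel limitation can be seen on synthetic data whose true boundary is the diagonal line x0 = x1. In the sketch below (my own toy setup, not from the course material), a tree with a single split does poorly on the raw features but separates the classes perfectly once the hand-made feature x0 - x1 turns the diagonal boundary into an axis-parallel one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # class boundary is the diagonal x0 = x1

# One axis-parallel split on the raw features cannot follow the diagonal.
stump_raw = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
acc_raw = stump_raw.score(X, y)

# With the single feature x0 - x1, the boundary becomes "feature > 0".
X_diff = (X[:, 0] - X[:, 1]).reshape(-1, 1)
stump_diff = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_diff, y)
acc_diff = stump_diff.score(X_diff, y)

print("one split on raw features:", acc_raw)
print("one split on x0 - x1:     ", acc_diff)
```

A deeper tree can approximate the diagonal with a staircase of many small splits, but that takes many more nodes and is more prone to overfitting.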

Summary

That is my outline of Decision Tree as far as I understand it. I will keep updating this post, so if you have anything to add or correct, I would appreciate a comment.
