For some reason there weren't many pages about GBDT in Japanese, so I wrote one. As the title says, I'm someone who doesn't understand machine learning very well, so I can't respond to pointed criticism from people who actually know the field. Please keep that in mind. I hope this helps people like me who "know a little about machine learning but are lost when it comes to the advanced topics."
GBDT is a supervised machine learning method; the name is an abbreviation of Gradient Boosting Decision Tree. Like SVM, it classifies data based on labeled training data. Unlike SVM, however, which is fundamentally a binary classifier, it can handle multiclass classification directly. I haven't looked into how it differs from other methods such as Random Forest. Sorry. I'll skip the detailed theory, since people smarter than me have already written it up: http://www.housecat442.com/?p=480 http://qiita.com/Quasi-quant2010/items/a30980bd650deff509b4
There was an article that solves a CodeIQ problem with SVM, so I'll try to imitate it with GBDT.
sample_gbdt.py
# -*- coding: utf-8 -*-
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
# Training data: two feature columns and a label column per row
train_data = np.loadtxt('CodeIQ_auth.txt', delimiter=' ')
X_train = [[x[0], x[1]] for x in train_data]
y_train = [int(x[2]) for x in train_data]
# Test data: feature rows only; the expected labels are given separately
X_test = np.loadtxt('CodeIQ_mycoins.txt', delimiter=' ')
y_test = np.array([1,0,0,1,1,0,1,1,1,0,0,1,1,0,0,1,0,0,0,1])
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=1).fit(X_train, y_train)
print ("Predict ",clf.predict(X_test))
print ("Expected", y_test)
print(clf.score(X_test, y_test))
The original article used SVM; I've simply swapped in GBDT. This time it's binary classification, but apparently if you increase the number of label types it will classify into that many classes accordingly, as sketched below.
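As an aside, here is a minimal multiclass sketch (my own addition, not from the original article) using the three-class iris dataset to confirm that the same classifier handles more than two labels without any extra setup:

# Minimal multiclass sketch (not from the original article): the same
# classifier handles the three-class iris dataset with no extra setup.
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
clf = GradientBoostingClassifier(n_estimators=100).fit(iris.data, iris.target)
print(clf.predict(iris.data[:5]))   # predicted labels are 0, 1, or 2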
Back to the problem at hand, here are the results:
('Predict ', array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1]))
('Expected', array([1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]))
0.85
It got three predictions wrong...
Let's take a look at the sklearn documentation (you don't strictly need to read it yourself): http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html There are a lot of parameters, 15 of them, and the English is hard for me to follow. Looking closely, though, the default for max_depth is 3, while mine is 1. Why did I do that?
So I set max_depth back to the default of 3 and tried again.
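The only change from the script above is the max_depth argument:

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3).fit(X_train, y_train)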
('Predict ', array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]))
('Expected', array([1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]))
0.95
It went up! It's up! But it still gets one wrong. Let's keep going and raise max_depth further. For example, 5:
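Again, only the one argument changes:

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=5).fit(X_train, y_train)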
('Predict ', array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0]))
('Expected', array([1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]))
0.9
It went down... I had a feeling that blindly raising it wasn't the right move, and sure enough. Presumably the deeper trees start to overfit the training data. Even on a problem of this scale, changing the parameters changes the results. I learned firsthand how important parameter tuning is in machine learning.
And about this parameter tuning: I get the feeling I can't do it well without somewhat more specialized knowledge. It doesn't seem enough to nudge values around in Python and be happy when the score moves.
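That said, scikit-learn can at least automate the trial and error. Here is a minimal sketch (my own addition, not from the original article, assuming a recent scikit-learn with the sklearn.model_selection module) that cross-validates every combination of the listed values on the training data from the script above and keeps the best one:

# Minimal parameter-search sketch (not from the original article).
# GridSearchCV tries every combination in param_grid with 5-fold
# cross-validation and refits the best model on all training data.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [1, 2, 3, 5],
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Test score :", search.score(X_test, y_test))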
I'll study a bit more seriously and try again.