Since I am writing this article as an output of my studies, there may be some mistakes, so please feel free to comment and point them out. Last time I wrote about regression in Predictive Statistics (Practice Edition: Simple Regression) in Python; this time I will write about classification.
・ Super overview of machine learning
・ Review of classification
・ Modeling flow
・ Practice
Before writing about classification, let's take a brief look at machine learning. Machine learning means learning patterns from past data and using them to make predictions. It is now used all over the world because it can achieve higher prediction accuracy than manual human data analysis. Machine learning involves many parameters (the settings you tune by hand are called hyperparameters in the machine learning world), and their number can reach tens of thousands. Furthermore, if unnecessary parameters are included, overfitting will occur, so adjustment is required. The following two concepts are indispensable when studying machine learning.
・ Supervised learning
・ Unsupervised learning
Supervised learning is a method in which you provide data together with the correct answers. You feed in that data, and the computer predicts future data based on it. The image is a teacher guiding students through their studies.
Unsupervised learning makes predictions without being given the correct answers. It is a method for grasping the essential structure of the data; since there is no correct answer, prediction accuracy cannot be measured directly. The image is a job hunter who hasn't yet found what he wants to do.
In this practice session, we adopt a supervised learning technique called a decision tree.
As I wrote in the previous article, classification means dividing things into categories: for example, dividing dogs into dachshunds and chihuahuas, or cakes into shortcakes and chocolate cakes. If we classify the dogs using supervised learning, we first enter the characteristics (parameters) of dachshunds and chihuahuas: if the body is long, label it a dachshund; if the eyes are round, label it a chihuahua; and let the computer learn to judge. Of course, more parameters can increase classification accuracy, but too many parameters cause overfitting. For example, with the dogs, if you add a parameter for tail length, individual differences make it hard to judge by tail length. So when choosing parameters, classification accuracy will be higher if they are independent of each other, as the sketch below illustrates.
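Here is a minimal, purely hypothetical sketch of this idea: hand-written rules over two made-up features (body length and eye roundness). A real decision tree learns splits like these from the data instead of having them written by hand.
# Hypothetical toy example: classifying a dog from two made-up parameters.
def classify_dog(body_length_cm, eye_roundness):
    # A long body suggests a dachshund; very round eyes suggest a chihuahua.
    if body_length_cm > 40:
        return "dachshund"
    if eye_roundness > 0.8:
        return "chihuahua"
    return "unknown"

print(classify_dog(45, 0.5))  # -> dachshund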
I will explain the flow for creating a model.
① Basic analysis
・ Data reading
・ Confirmation of basic statistics
・ Confirmation and correction of missing values
・ Cross tabulation
・ Binning
② Creating the decision tree model
・ Selection and determination of the explanatory variables
・ Selection and determination of the objective variable
・ Creating a variable for the decision tree model
・ Fitting the data to the decision tree model
・ Predicting the test data with the decision tree model
Here is the actual code.
① Basic analysis
First, let's import the libraries required to create a decision tree model.
import pandas as pd  # data loading and manipulation
import numpy as np  # numerical operations
from matplotlib import pyplot as plt  # plotting
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier as DT  # the decision tree model
from sklearn.tree import export_graphviz  # writes a fitted tree to a .dot file
import pydotplus  # renders .dot files to images
from IPython.display import Image  # displays images in the notebook
Then we use the pandas library to load the data and look at the basic statistics.
train = pd.read_csv("train.csv")  # load the past (training) data
test = pd.read_csv("test.csv")  # load the data to be predicted
sample = pd.read_csv("submit_sample.csv",header=None)  # load the sample submission file
train.describe()  # basic statistics of the training data
test.describe()  # basic statistics of the test data
Next, let's check for missing values.
train.isnull().sum()
test.isnull().sum()
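The flow above also mentions correcting missing values. The code here only checks for them, but as a hedged sketch of one common correction (assuming a numeric column, using the same "Column name" placeholder the article uses elsewhere):
# Hypothetical example: fill missing values in a numeric column with its median
train["Column name"] = train["Column name"].fillna(train["Column name"].median())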
Next, we will do cross tabulation. A crosstab is a table that shows how the values in one column relate to the values in another column. The code looks like this:
#The crosstab function builds the table; the margins option also outputs the totals
pd.crosstab(train["Column name A"],train["Column name B"],margins=True)
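As a concrete but purely hypothetical illustration (assuming a categorical column "job" and a target column "y" exist in train), the normalize option of pd.crosstab shows proportions instead of raw counts:
# Hypothetical columns "job" and "y"; normalize="index" turns each row into proportions
pd.crosstab(train["job"], train["y"], margins=True)
pd.crosstab(train["job"], train["y"], normalize="index")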
Next is binning. Binning counts how many values fall within each interval; the image is the relationship between the classes and frequencies of a histogram.
#Split the column data in the first argument into the intervals given by the second argument
binning_data = pd.cut(train["Column name"],[1,10,20,30,50,100])
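To actually see the counts per bin (the histogram-like frequencies mentioned above), one option is value_counts; a minimal sketch continuing from the line above:
# Count how many values fall into each bin, keeping the bins in interval order
print(binning_data.value_counts(sort=False))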
② Creating the decision tree model
First, decide which parameters you want to use, and assign them to variables. The iloc accessor is convenient when you want to extract multiple rows and columns.
#Extract all rows, and columns 0 through 16, as the explanatory variables
trainX = train.iloc[:,0:17]
y = train["Objective variable"]  # the column we want to predict
#The copy function copies the whole DataFrame (all columns)
testX = test.copy()
Next, let's convert the extracted categorical data into dummy variables.
trainX = pd.get_dummies(trainX)
testX = pd.get_dummies(testX)
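As a tiny hypothetical illustration of what get_dummies does, each categorical value becomes its own 0/1 indicator column:
# Made-up data just to show the effect of get_dummies
demo = pd.DataFrame({"color": ["red", "blue", "red"]})
print(pd.get_dummies(demo))  # produces indicator columns color_blue and color_red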
Now let's prepare the variable that will hold the decision tree model.
#max_depth is the maximum depth of the tree; min_samples_leaf is the minimum number of samples a leaf may contain
clf1 = DT(max_depth=2,min_samples_leaf=500)
Let's fit the model to the past data.
#Pass the explanatory variables first, then the objective variable
clf1 = clf1.fit(trainX,y)
Next, I want to display the tree in Jupyter, but since it cannot be displayed directly, I will write it to a dot file and then render it.
export_graphviz(clf1, out_file="tree.dot", feature_names=trainX.columns, class_names=["0","1"], filled=True, rounded=True)
g = pydotplus.graph_from_dot_file(path="tree.dot")
Image(g.create_png())
Now, let's finally predict. This time we use the predict_proba function instead of the predict function: since this is classification rather than regression, we want the predicted probability of each class rather than a single numeric value.
pred = clf1.predict_proba(testX)
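For reference, here is a hedged sketch of how the result might be turned into a submission file. It assumes predict_proba returns one column per class and that submit_sample.csv has two columns (an id and an answer); both assumptions are mine, not the article's.
# pred has one column per class; column 1 is the probability of class "1"
sample[1] = pred[:,1]
sample.to_csv("submit.csv",index=False,header=False)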
This concludes the analysis using classification. You don't have to memorize the code that displays the decision tree model, so feel free to copy it and change the options as needed. This time I used the decision tree model; when using it, the depth of the tree and the minimum leaf size are important settings. Note that the deeper the tree, the more easily it overfits, so by tuning these parameters you can build a model that is less prone to overfitting. I will write more in the next article.
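As a rough, hypothetical illustration of that tuning point (the values here are chosen only for contrast), a deeper tree with tiny leaves will usually score higher on the training data, which can be a sign of overfitting:
# Deliberately deeper tree; compare training accuracy against the shallow clf1
clf_deep = DT(max_depth=10,min_samples_leaf=5).fit(trainX,y)
print(clf1.score(trainX,y))  # shallow tree's training accuracy
print(clf_deep.score(trainX,y))  # usually higher, hinting at overfitting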