Once you've learned basic Python, the usual advice is to try Kaggle and to study the public kernels there. Plenty of articles and books say so, and it really is a tremendously effective way to improve.

From a true beginner's point of view, though, I often felt **"the kernels are too difficult to read and I don't understand what they mean..."** and **"I don't need advanced techniques yet, I just want to build a basic machine learning model..."**, and that made things hard for me.

So in this article I verify how much the accuracy of various machine learning models changes as the level is raised step by step, from a **"super-basic approach"** to a **"slightly more elaborate approach"**. The purpose is to share what I learned along the way: "I see, this is how you build a super-basic machine learning model" and "this is how you take it one level higher."

As in my other Qiita articles, I use Kaggle's Kickstarter Projects dataset, which comes up often: https://www.kaggle.com/kemical/kickstarter-projects
The models verified are orthodox ones:

・Logistic regression
・SVM
・Decision tree
・Random forest
・AdaBoost

For each model, the following patterns are verified:

・A: No adjustment (defaults)
・B: Regularization (only for the models that need it)
・C: Standardization (only for the models that need it)
・D: Hyperparameter tuning
・E: Feature selection
Here is a summary of the models and patterns above, together with the accuracy results of all the patterns verified below.

Reading the table: pattern 1 is logistic regression with defaults, at an accuracy of 0.52958; pattern 2 is logistic regression with regularization only, at 0.59815; pattern 3 is logistic regression with regularization and standardization, at 0.66181; and so on.

Basically, accuracy is expected to rise as the patterns progress within each model, but the feature selection in E uses the embedded method, which is based on a linear model, so accuracy does not necessarily improve. (In fact, for several models the result was better without E.)

I think it is a good idea to look at the A version of each model first, and then follow the patterns through each model.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | 0.63947 |
| Pattern 17 | AdaBoost | D | 0.67426 |
| Pattern 18 | AdaBoost | D+E | 0.659367 |
For each machine learning model this article covers the implementation only, but I have posted a series of articles that explain the mathematical background, which I hope you will also find useful.

・[[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be)
・[[Machine learning] Understanding SVM from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/4688a50cffb2140f297d)
・[Machine learning] Understanding decision trees from both scikit-learn and mathematics
・[[Machine learning] Understanding Random Forest](https://qiita.com/Hawaii/items/5831e667723b66b46fba)
This section covers the preprocessing common to all models.

Let's do the imports all at once. Assume that this common processing runs at the beginning of every pattern, followed by the code specific to that pattern.
#Import numpy and pandas
import numpy as np
import pandas as pd
#Import for processing date data
import datetime
#Import for training and test data split
from sklearn.model_selection import train_test_split
#Import for standardization
from sklearn.preprocessing import StandardScaler
#Import for accuracy verification
from sklearn.model_selection import cross_val_score
#Import for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
#Import for feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
#Import for logistic regression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix
#Import for SVM
from sklearn.svm import SVC
#Import for decision tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
#Import for Random Forest
from sklearn.ensemble import RandomForestClassifier
#Import for AdaBoost
from sklearn.ensemble import AdaBoostClassifier
df = pd.read_csv(r"C:~~\ks-projects-201801.csv")
The following shows that the dataset has shape (378661, 15). Since the amount of data is quite large, the models that take a long time to process are trained on a subset of the data.
df.shape
Let's also take a quick look at the data with .head().
df.head()
I will omit the details, but since the data contains each campaign's recruitment start time and end time, these are converted into the number of recruitment days.
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
I will omit the details here as well: the objective variable "state" has categories other than success ("successful") and failure ("failed"), but this time only the success and failure rows are used.
df = df[(df["state"] == "successful") | (df["state"] == "failed")]
Then replace success with 1 and failure with 0.
df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)
Before building the models, delete ID and name, which seem unnecessary (arguably name should be kept, but it is dropped this time), along with the variables whose values only become known after the crowdfunding has finished.
df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)
Perform categorical variable processing with pd.get_dummies.
df = pd.get_dummies(df,drop_first = True)
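If get_dummies is unfamiliar, here is a tiny illustration on a hypothetical mini-DataFrame (not the Kickstarter data); drop_first=True drops one dummy column per variable to avoid redundancy:

```python
# Hypothetical example: two categories become a single 0/1 dummy column
demo = pd.DataFrame({"category": ["Music", "Film", "Music"]})
print(pd.get_dummies(demo, drop_first=True))
# -> one column, category_Music (the Film dummy is dropped as redundant)
```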
I would like to implement logistic regression without any adjustments.
First, divide it into training data and test data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, build the logistic regression model. You may wonder why arguments are passed to SGDClassifier at all in a "no adjustment" pattern: unless loss is set to "log", SGDClassifier is not logistic regression in the first place, and since regularized versions are verified after this, penalty is explicitly set to "none" here.

clf = SGDClassifier(loss = "log", penalty = "none", random_state=1234)  #in newer scikit-learn versions: loss="log_loss", penalty=None
clf.fit(X_train,y_train)
Finally, let's verify the accuracy with test data.
clf.score(X_test, y_test)
Then the accuracy is **0.52958**.
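Incidentally, cross_val_score was imported in the common setup, but every number reported in this article comes from .score on the held-out test data. For reference, a cross-validated check of the same model could look like this (a sketch, not used for the reported accuracies):

```python
# 3-fold cross-validated accuracy on the training data (reference only)
scores = cross_val_score(clf, X_train, y_train, cv=3)
print(scores.mean())
```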
I will omit an explanation of what regularization is, but let's verify with L1 and L2 regularization whether accuracy improves when only regularization is applied.
First, divide it into training data and test data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
◆ L1 regularization
Set penalty to "l1" and check the accuracy.
clf_L1 = SGDClassifier(loss = "log", penalty = "l1", random_state=1234)
clf_L1.fit(X_train,y_train)
clf_L1.score(X_test, y_test)
Then, the accuracy was **0.52958**, the same as before.
◆ L2 regularization
Similarly, set penalty to "l2" and check the accuracy.
clf_L2 = SGDClassifier(loss = "log", penalty = "l2", random_state=1234)
clf_L2.fit(X_train,y_train)
clf_L2.score(X_test, y_test)
Then, the accuracy was **0.59815**, higher than pattern 1. L2 regularization may suit this data better.
Let's verify what accuracy we get when standardization is added on top of the regularization. After standardizing, L1 and L2 regularization are applied.
First standardize the data, and then split it into training and test data. (The order matters: standardizing the DataFrame after X has already been extracted would have no effect on the training arrays.)

Since there are few features this time, only goal and days, which actually need it, are standardized; the columns produced by get_dummies are not. That said, standardizing the entire dataset is apparently also fine (I asked the teacher of the machine learning course I am taking). Strictly speaking, fitting the scaler before the split leaks test-set information; fitting it on the training data only would be cleaner.

#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
After that, apply L1 and L2 regularization and verify the accuracy.
#L1 regularization and accuracy
clf_L1 = SGDClassifier(loss = "log", penalty = "l1", random_state=1234)
clf_L1.fit(X_train,y_train)
clf_L1.score(X_test, y_test)
#L2 regularization and accuracy
clf_L2 = SGDClassifier(loss = "log", penalty = "l2", random_state=1234)
clf_L2.fit(X_train,y_train)
clf_L2.score(X_test, y_test)
The accuracy with L1 regularization is **0.66181** and with L2 regularization **0.65750**, a large improvement in one step. After standardization, L1 regularization seems to be the better match. I will record the **0.66181** of L1 regularization, which was the better of the two.
Let's perform hyperparameter tuning on top of pattern 3. Hyperparameter tuning means searching for good values of the parameters that we would otherwise have to choose ourselves when building a machine learning model.

Here, we will use GridSearchCV.

Ideally every parameter would be searched over every possible range, but that would take far too long, so this time we tune penalty and alpha, which seem to matter most for SGDClassifier.
First is standardization processing + division of training data and test data.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
parameters = {'penalty':['l1', 'l2'], 'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 'loss':['log']} #parameter grid to search
model = SGDClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train, y_train)
print(clf.best_params_)
This time I wanted to compare the accuracy of the patterns in the table at the beginning, so I was careful to **include the default values in the grid search**. If the defaults were not included, the accuracy comparison would break down.

penalty is searched over l1 and l2, and alpha over a fairly wide range that includes the default value 0.0001. As before, loss must be specified as "log", or it is not logistic regression in the first place.

The search printed {'alpha': 0.0001, 'loss': 'log', 'penalty': 'l1'}: the best parameters were found.

Note that this best alpha is **exactly the same value as the default** listed on scikit-learn's site. Moreover, the L1 regularization of pattern 3 already used loss="log" and penalty="l1", so in theory the hyperparameter search arrives at **the same parameters as pattern 3's L1 regularization**.

In other words, the accuracy should come out to 0.66181, the same as pattern 3's L1 regularization. Let's continue with that expectation.

Now let's build the model again using these best parameters. The code below uses "**clf.best_params_" to train SGDClassifier with the best parameters just found.
clf_2 = SGDClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train,y_train)
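As an aside, the ** in "**clf.best_params_" is plain Python dictionary unpacking; with hypothetical values it is equivalent to writing the arguments out by hand:

```python
# Hypothetical values for illustration
params = {'alpha': 0.0001, 'loss': 'log', 'penalty': 'l1'}
# These two lines build identical models:
clf_a = SGDClassifier(**params, random_state=1234)
clf_b = SGDClassifier(alpha=0.0001, loss='log', penalty='l1', random_state=1234)
```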
Finally, check the accuracy with the test data.
clf_2.score(X_test, y_test)
It came out to **0.66181**, the same value as pattern 3, as hypothesized.
Up to pattern 4, the models used features I had chosen arbitrarily; here the features are selected with the so-called embedded method.
First, standardize and divide the data.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, select the features with the embedded method.

estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)  #note: the normalize argument was removed in newer scikit-learn versions
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
Overwrite the training and test data with the selected features.
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
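If you want to see which features survived, SelectFromModel exposes get_support(); a small sketch (feature_names is assumed to match the column order of df.drop("state", axis=1)):

```python
# Boolean mask of kept features, mapped back to column names
feature_names = df.drop("state", axis=1).columns
print(feature_names[sfm.get_support()])
```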
Fit SGDClassifier on the overwritten training data (at this point, the hyperparameters have not been tuned yet).
classifier = SGDClassifier(random_state=1234)
classifier.fit(X_train_selected, y_train)
From here, perform hyperparameter tuning. The method is basically the same as before, except that .fit receives the overwritten training data (X_train_selected).
parameters = {'penalty':['l1', 'l2'], 'alpha':[0.0001,0.001, 0.01, 0.1, 1, 10, 100],'loss':['log']}
model = SGDClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_)
Train SGDClassifier again with the best parameters found.
clf_2 = SGDClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train_selected,y_train)
Finally, check the accuracy.
clf_2.score(X_test_selected, y_test)
At **0.66185**, this is the best accuracy so far.
This is the end of logistic regression. Let's summarize the accuracy once.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | |
| Pattern 7 | SVM | C | |
| Pattern 8 | SVM | C+D | |
| Pattern 9 | SVM | C+D+E | |
| Pattern 10 | Decision tree | A | |
| Pattern 11 | Decision tree | D | |
| Pattern 12 | Decision tree | D+E | |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Now let's move on to SVM.
When I actually tried SVM, I realized it takes an enormous amount of time to train. Building the model and tuning hyperparameters with all the training data, as with logistic regression, simply would not finish in a reasonable time, so I took care to shrink the data.

At first I trained on all the data and found over and over that it did not finish even after hours. My advice is to first run with a very small amount of data (and few parameters), note how long that takes, and use it to estimate how long the full run will take before starting it.
I'm going to implement SVM without any adjustments. First, divide it into training data and test data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
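Before committing to a long run, it helps to time a tiny sample first, as described above, and extrapolate; a minimal sketch (the 1% fraction is arbitrary):

```python
import time

# Fit on a very small sample and note the elapsed time before scaling up
X_tiny = pd.DataFrame(X_train).sample(frac=0.01, random_state=1234)
y_tiny = pd.DataFrame(y_train).sample(frac=0.01, random_state=1234)
start = time.time()
SVC(random_state=1234).fit(X_tiny, y_tiny.values.ravel())
print(f"1% sample: {time.time() - start:.1f} seconds")
```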
The major change compared with the other models is this: training an SVM on all the training data, as with logistic regression, takes far too long. Therefore, after the usual split into 70% training data, only 15% of that training data (10.5% of all data) is used for training. Even with 15%, the processing took more than 3 hours to complete.
#Sample 15% of the training data (the shared random_state keeps X and y rows aligned)
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Model building
clf = SVC(random_state=1234)
clf.fit(X_train_sample, y_train_sample)
Now that we have a model, let's check the accuracy on the test data. Note that the test data is not reduced: the full 30% test set is used. If the test data were shrunk only for SVM, the accuracy could not be compared with the other models.
clf.score(X_test,y_test)
The accuracy was **0.61935**.
SVM has no separate regularization step, so pattern 7 starts with standardization.
First, let's standardize and divide the data. After that, it is the same as pattern 6.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Sample 15% of the training data
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Model building
clf = SVC(random_state=1234)
clf.fit(X_train_sample, y_train_sample)
#Accuracy evaluation
clf.score(X_test, y_test)
The accuracy is **0.64871**, which is better than pattern 6.
Next, let's implement standardization + hyperparameter tuning.
First is standardization and data partitioning.
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, perform hyperparameter tuning. Again, searching with all the training data takes too long, so 3% of the 70% training data (2.1% of the total) is used for the tuning.

Ideally more data would be used to adjust the parameters, but the processing really would not finish, so this value was chosen.
#Sample 3% of the training data
X_train_grid = pd.DataFrame(X_train).sample(frac = 0.03,random_state=1234)
y_train_grid = pd.DataFrame(y_train).sample(frac = 0.03,random_state=1234)
Then, perform hyperparameter tuning.
parameters = {'kernel':['linear', 'rbf'], 'C':[0.001, 0.01, 0.1, 1, 10]} #parameter grid to search
model = SVC(random_state=1234)
clf = GridSearchCV(model, parameters, cv=2,return_train_score=False)
clf.fit(X_train_grid, y_train_grid)
print(clf.best_params_, clf.best_score_)
What differs from logistic regression is that GridSearchCV's cv argument is set to 2, and return_train_score is set to False. Originally cv was 3 and return_train_score was left alone, but the processing never finished, so I looked it up and changed these settings to cut the computation.

At this point we have the "optimal parameters"! Train on the training data with these optimal parameters and verify the accuracy on the test data.
#Sample 15% of the training data
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Model training
clf = SVC(**clf.best_params_,random_state=1234)
clf.fit(X_train_sample, y_train_sample)
#Check accuracy with test data
clf.score(X_test, y_test)
The accuracy was **0.65393**.
Finally, add feature selection.
On top of pattern 8, the training data is sampled again for feature selection, so be careful not to confuse which training data is being used for what.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
That completes the feature selection. Next, hyperparameter tuning is performed on the selected features.
#Sample 3% of the selected-feature training data for hyperparameter tuning
X_train_grid = pd.DataFrame(X_train_selected).sample(frac = 0.03,random_state=1234)
y_train_grid = pd.DataFrame(y_train).sample(frac = 0.03,random_state=1234)
#Hyperparameter tuning
parameters = {'kernel':['linear', 'rbf'], 'C':[0.001, 0.01, 0.1, 1, 10]} #parameter grid to search
model = SVC(random_state=1234)
clf = GridSearchCV(model, parameters, cv=2,return_train_score=False)
clf.fit(X_train_grid, y_train_grid)
print(clf.best_params_, clf.best_score_)
At this point, hyperparameter tuning on the features selected by the embedded method is complete, and the best parameters have been determined.

Let's train the SVM with these best parameters on the training data overwritten with the selected features (X_train_selected), again using a 15% sample of the training data.
#Sample 15% of the overwritten training data
X_train_sample = pd.DataFrame(X_train_selected).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Train the model on the 15% sample with the best parameters
clf = SVC(**clf.best_params_,random_state=1234)
clf.fit(X_train_sample, y_train_sample)
#Accuracy evaluation with test data
clf.score(X_test_selected, y_test)
The accuracy is **0.65066**.
Here, let's summarize the accuracy again.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | |
| Pattern 11 | Decision tree | D | |
| Pattern 12 | Decision tree | D+E | |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Next is the decision tree.
Divide the data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Build the decision tree model and verify the accuracy.
clf = DecisionTreeClassifier(random_state=1234)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The accuracy is now **0.63727**.
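As a side note, export_graphviz was imported in the common setup; the fitted tree can be dumped for visual inspection like this (a sketch; the file name is arbitrary, and a depth limit keeps the output readable since the full default tree is very large):

```python
# Write the first few levels of the fitted tree to a Graphviz .dot file
export_graphviz(clf, out_file="tree.dot", max_depth=3,
                class_names=["failed", "successful"], filled=True, rounded=True)
```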
The decision tree does not require regularization or standardization, so we start with hyperparameter tuning.
First is data division.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, run a grid search for hyperparameter tuning. Then build the model with the best parameters and verify the accuracy on the test data.
#GridSearch
parameters = {'criterion':['gini', 'entropy'], 'max_depth':[i for i in range(1, 11)], 'max_features':['auto','sqrt','log2'], 'min_samples_leaf':[i for i in range(1, 11)]} #random_state is left out of the grid: it is passed explicitly below, and having it in best_params_ would clash with that argument
model = DecisionTreeClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
#Build a model with the best parameters
clf = DecisionTreeClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)
#Accuracy verification
clf.score(X_test,y_test)
The accuracy was **0.66376**.
As always, start with data splitting.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
#Hyperparameter tuning
parameters = {'criterion':['gini', 'entropy'], 'max_depth':[i for i in range(1, 11)], 'max_features':['auto','sqrt','log2'], 'min_samples_leaf':[i for i in range(1, 11)]} #random_state again left out to avoid clashing with the explicit argument below
model = DecisionTreeClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf_2 = DecisionTreeClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train_selected,y_train)
#Check accuracy with test data
clf_2.score(X_test_selected, y_test)
The accuracy is **0.65732**.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Divide the data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Build the random forest model and verify the accuracy.
clf = RandomForestClassifier(random_state=1234)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The accuracy is now **0.64522**.
Like the decision tree, the random forest requires neither regularization nor standardization.

As before, hyperparameter tuning follows the data split. The difference from the previous models is that the search range is narrowed somewhat (only a few values per parameter). Even this took about 35 minutes, so expanding the range further seemed impractical given the time available.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Hyperparameter tuning
parameters = {'max_depth':[2,4,6,None], 'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
model = RandomForestClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = RandomForestClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)
#Check accuracy with test data
clf.score(X_test, y_test)
Because narrowing down the values took so long, each grid holds only every other value (even numbers or odd numbers only), chosen so that the default values are still included. The accuracy was **0.67762**.
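As a side note, a fitted random forest exposes feature_importances_, which gives a quick feel for which variables drive the predictions (a sketch; feature_names is assumed to match the column order of df.drop("state", axis=1)):

```python
# Top 10 features by impurity-based importance
feature_names = df.drop("state", axis=1).columns
for name, imp in sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda t: t[1], reverse=True)[:10]:
    print(f"{name}: {imp:.3f}")
```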
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
#Hyperparameter tuning
parameters = {'max_depth':[2,4,6,None], 'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
model = RandomForestClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = RandomForestClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train_selected, y_train)
#Check accuracy with test data
clf.score(X_test_selected, y_test)
The accuracy is now **0.66308**.
Let's check the accuracy again.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Divide the data.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Build the AdaBoost model and verify the accuracy. Note that a full decision tree is passed here as the base estimator (the default would be a depth-1 tree).

clf = AdaBoostClassifier(DecisionTreeClassifier(random_state=1234))
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The accuracy is now **0.63947**.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Hyperparameter tuning
parameters = {'learning_rate':[0.1,0.5,1.0]}
model = AdaBoostClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = AdaBoostClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)
#Check accuracy with test data
clf.score(X_test, y_test)
The accuracy was **0.67426**.
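As a side note, AdaBoost exposes staged_score, which shows how test accuracy evolves as boosting rounds are added (a sketch using the tuned model above):

```python
# Accuracy after every 10th boosting round
for i, score in enumerate(clf.staged_score(X_test, y_test), start=1):
    if i % 10 == 0:
        print(f"{i} rounds: {score:.5f}")
```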
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
#Hyperparameter tuning
parameters = {'learning_rate':[0.1,0.5,1.0]}
model = AdaBoostClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = AdaBoostClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train_selected, y_train)
#Check accuracy with test data
clf.score(X_test_selected, y_test)
The accuracy was **0.659367**.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | 0.63947 |
| Pattern 17 | AdaBoost | D | 0.67426 |
| Pattern 18 | AdaBoost | D+E | 0.659367 |
What did you think?

Surprisingly few sites seem to introduce super-basic ways of building models, and I used to think all the time, "I don't need anything advanced yet, I just want to build a model once!"

This article grew out of that problem of my own, so I hope it helps deepen your understanding too.