Once you've learned basic Python, the usual advice is to try Kaggle and to study the public kernels there. Plenty of articles and books say so, and it really is a tremendously effective way to improve.

From a true beginner's point of view, though, I often felt **"the kernels are too difficult to read and I don't understand what they mean..."** and **"I don't need advanced techniques yet, I just want to build a basic machine learning model..."**, and that made things hard for me.

So in this article I verify how much the accuracy of various machine learning models changes as the level is raised step by step, from a **"super-basic approach"** to a **"slightly more elaborate approach"**. The purpose is to share what I learned along the way: "I see, this is how you build a super-basic machine learning model" and "this is how you take it one level higher."

As in my other Qiita articles, I use Kaggle's Kickstarter Projects dataset, which comes up often: https://www.kaggle.com/kemical/kickstarter-projects
The models verified are orthodox ones:

・Logistic regression
・SVM
・Decision tree
・Random forest
・AdaBoost

For each model, the following patterns are verified:

・A: No adjustment (defaults)
・B: Regularization (only for the models that need it)
・C: Standardization (only for the models that need it)
・D: Hyperparameter tuning
・E: Feature selection
Here is a summary of the models and patterns above, together with the accuracy results of all the patterns verified below.

Reading the table: pattern 1 is logistic regression with defaults, at an accuracy of 0.52958; pattern 2 is logistic regression with regularization only, at 0.59815; pattern 3 is logistic regression with regularization and standardization, at 0.66181; and so on.

Basically, accuracy is expected to rise as the patterns progress within each model, but the feature selection in E uses the embedded method, which is based on a linear model, so accuracy does not necessarily improve. (In fact, for several models the result was better without E.)

I think it is a good idea to look at the A version of each model first, and then follow the patterns through each model.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | 0.63947 |
| Pattern 17 | AdaBoost | D | 0.67426 |
| Pattern 18 | AdaBoost | D+E | 0.659367 |
For each machine learning model this article covers the implementation only, but I have posted a series of articles that explain the mathematical background, which I hope you will also find useful.

・[[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be)
・[[Machine learning] Understanding SVM from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/4688a50cffb2140f297d)
・[Machine learning] Understanding decision trees from both scikit-learn and mathematics
・[[Machine learning] Understanding Random Forest](https://qiita.com/Hawaii/items/5831e667723b66b46fba)
This section covers the preprocessing common to all models.

Let's do the imports all at once. Assume that this common processing runs at the beginning of every pattern, followed by the code specific to that pattern.
#Import numpy and pandas
import numpy as np
import pandas as pd
#Import for processing date data
import datetime
#Import for training and test data split
from sklearn.model_selection import train_test_split
#Import for standardization
from sklearn.preprocessing import StandardScaler
#Import for accuracy verification
from sklearn.model_selection import cross_val_score
#Import for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
#Import for feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
#Import for logistic regression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix
#Import for SVM
from sklearn.svm import SVC
#Import for decision tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
#Import for Random Forest
from sklearn.ensemble import RandomForestClassifier
#Import for AdaBoost
from sklearn.ensemble import AdaBoostClassifier
df = pd.read_csv(r"C:~~\ks-projects-201801.csv")
The following shows that the dataset has shape (378661, 15). Since the amount of data is quite large, the models that take a long time to process are trained on a subset of the data.
df.shape
Let's also take a quick look at the data with .head().
df.head()
I will omit the details, but since the data contains each campaign's recruitment start time and end time, these are converted into the number of recruitment days.
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
I will omit the details here as well: the objective variable "state" has categories other than success ("successful") and failure ("failed"), but this time only the success and failure rows are used.
df = df[(df["state"] == "successful") | (df["state"] == "failed")]
Then replace success with 1 and failure with 0.
df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)
Before building the models, delete ID and name, which seem unnecessary (arguably name should be kept, but it is dropped this time), along with the variables whose values only become known after the crowdfunding has finished.
df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)
Perform categorical variable processing with pd.get_dummies.
df = pd.get_dummies(df,drop_first = True)
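If get_dummies is unfamiliar, here is a tiny illustration on a hypothetical mini-DataFrame (not the Kickstarter data); drop_first=True drops one dummy column per variable to avoid redundancy:

```python
# Hypothetical example: two categories become a single 0/1 dummy column
demo = pd.DataFrame({"category": ["Music", "Film", "Music"]})
print(pd.get_dummies(demo, drop_first=True))
# -> one column, category_Music (the Film dummy is dropped as redundant)
```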
I would like to implement logistic regression without any adjustments.
First, divide it into training data and test data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, build the logistic regression model. You may wonder why arguments are passed to SGDClassifier at all in a "no adjustment" pattern: unless loss is set to "log", SGDClassifier is not logistic regression in the first place, and since regularized versions are verified after this, penalty is explicitly set to "none" here.

clf = SGDClassifier(loss = "log", penalty = "none", random_state=1234)  #in newer scikit-learn versions: loss="log_loss", penalty=None
clf.fit(X_train,y_train)
Finally, let's verify the accuracy with test data.
clf.score(X_test, y_test)
Then the accuracy is **0.52958**.
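Incidentally, cross_val_score was imported in the common setup, but every number reported in this article comes from .score on the held-out test data. For reference, a cross-validated check of the same model could look like this (a sketch, not used for the reported accuracies):

```python
# 3-fold cross-validated accuracy on the training data (reference only)
scores = cross_val_score(clf, X_train, y_train, cv=3)
print(scores.mean())
```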
I will omit an explanation of what regularization is, but let's verify with L1 and L2 regularization whether accuracy improves when only regularization is applied.
First, divide it into training data and test data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
◆ L1 regularization
Set penalty to "l1" and check the accuracy.
clf_L1 = SGDClassifier(loss = "log", penalty = "l1", random_state=1234)
clf_L1.fit(X_train,y_train)
clf_L1.score(X_test, y_test)
Then, the accuracy was **0.52958**, the same as before.
◆ L2 regularization
Similarly, set penalty to "l2" and check the accuracy.
clf_L2 = SGDClassifier(loss = "log", penalty = "l2", random_state=1234)
clf_L2.fit(X_train,y_train)
clf_L2.score(X_test, y_test)
Then, the accuracy was **0.59815**, higher than pattern 1. L2 regularization may suit this data better.
Let's verify what accuracy we get when standardization is added on top of the regularization. After standardizing, L1 and L2 regularization are applied.
First standardize the data, and then split it into training and test data. (The order matters: standardizing the DataFrame after X has already been extracted would have no effect on the training arrays.)

Since there are few features this time, only goal and days, which actually need it, are standardized; the columns produced by get_dummies are not. That said, standardizing the entire dataset is apparently also fine (I asked the teacher of the machine learning course I am taking). Strictly speaking, fitting the scaler before the split leaks test-set information; fitting it on the training data only would be cleaner.

#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
After that, apply L1 and L2 regularization and verify the accuracy.
#L1 regularization and accuracy
clf_L1 = SGDClassifier(loss = "log", penalty = "l1", random_state=1234)
clf_L1.fit(X_train,y_train)
clf_L1.score(X_test, y_test)
#L2 regularization and accuracy
clf_L2 = SGDClassifier(loss = "log", penalty = "l2", random_state=1234)
clf_L2.fit(X_train,y_train)
clf_L2.score(X_test, y_test)
The accuracy with L1 regularization is **0.66181** and with L2 regularization **0.65750**, a large improvement in one step. After standardization, L1 regularization seems to be the better match. I will record the **0.66181** of L1 regularization, which was the better of the two.
Let's perform hyperparameter tuning on top of pattern 3. Hyperparameter tuning means searching for good values of the parameters that we would otherwise have to choose ourselves when building a machine learning model.

Here, we will use GridSearchCV.

Ideally every parameter would be searched over every possible range, but that would take far too long, so this time we tune penalty and alpha, which seem to matter most for SGDClassifier.
First is standardization processing + division of training data and test data.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
parameters = {'penalty':['l1', 'l2'], 'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 'loss':['log']} #parameter grid to search
model = SGDClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train, y_train)
print(clf.best_params_)
This time I wanted to compare the accuracy of the patterns in the table at the beginning, so I was careful to **include the default values in the grid search**. If the defaults were not included, the accuracy comparison would break down.

penalty is searched over l1 and l2, and alpha over a fairly wide range that includes the default value 0.0001. As before, loss must be specified as "log", or it is not logistic regression in the first place.

The search printed {'alpha': 0.0001, 'loss': 'log', 'penalty': 'l1'}: the best parameters were found.

Note that this best alpha is **exactly the same value as the default** listed on scikit-learn's site. Moreover, the L1 regularization of pattern 3 already used loss="log" and penalty="l1", so in theory the hyperparameter search arrives at **the same parameters as pattern 3's L1 regularization**.

In other words, the accuracy should come out to 0.66181, the same as pattern 3's L1 regularization. Let's continue with that expectation.

Now let's build the model again using these best parameters. The code below uses "**clf.best_params_" to train SGDClassifier with the best parameters just found.
clf_2 = SGDClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train,y_train)
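As an aside, the ** in "**clf.best_params_" is plain Python dictionary unpacking; with hypothetical values it is equivalent to writing the arguments out by hand:

```python
# Hypothetical values for illustration
params = {'alpha': 0.0001, 'loss': 'log', 'penalty': 'l1'}
# These two lines build identical models:
clf_a = SGDClassifier(**params, random_state=1234)
clf_b = SGDClassifier(alpha=0.0001, loss='log', penalty='l1', random_state=1234)
```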
Finally, check the accuracy with the test data.
clf_2.score(X_test, y_test)
It came out to **0.66181**, the same value as pattern 3, as hypothesized.
Up to pattern 4, the models used features I had chosen arbitrarily; here the features are selected with the so-called embedded method.
First, standardize and divide the data.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, select the features with the embedded method.

estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)  #note: the normalize argument was removed in newer scikit-learn versions
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
Overwrite the training and test data with the selected features.
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
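If you want to see which features survived, SelectFromModel exposes get_support(); a small sketch (feature_names is assumed to match the column order of df.drop("state", axis=1)):

```python
# Boolean mask of kept features, mapped back to column names
feature_names = df.drop("state", axis=1).columns
print(feature_names[sfm.get_support()])
```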
Fit SGDClassifier on the overwritten training data (at this point, the hyperparameters have not been tuned yet).
classifier = SGDClassifier(random_state=1234)
classifier.fit(X_train_selected, y_train)
From here, perform hyperparameter tuning. The method is basically the same as before, except that .fit receives the overwritten training data (X_train_selected).
parameters = {'penalty':['l1', 'l2'], 'alpha':[0.0001,0.001, 0.01, 0.1, 1, 10, 100],'loss':['log']}
model = SGDClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_)
Train SGDClassifier again with the best parameters found.
clf_2 = SGDClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train_selected,y_train)
Finally, check the accuracy.
clf_2.score(X_test_selected, y_test)
At **0.66185**, this is the best accuracy so far.
This is the end of logistic regression. Let's summarize the accuracy once.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | |
| Pattern 7 | SVM | C | |
| Pattern 8 | SVM | C+D | |
| Pattern 9 | SVM | C+D+E | |
| Pattern 10 | Decision tree | A | |
| Pattern 11 | Decision tree | D | |
| Pattern 12 | Decision tree | D+E | |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Now let's move on to SVM.
When I actually tried SVM, I realized it takes an enormous amount of time to train. Building the model and tuning hyperparameters with all the training data, as with logistic regression, simply would not finish in a reasonable time, so I took care to shrink the data.

At first I trained on all the data and found over and over that it did not finish even after hours. My advice is to first run with a very small amount of data (and few parameters), note how long that takes, and use it to estimate how long the full run will take before starting it.
I'm going to implement SVM without any adjustments. First, divide it into training data and test data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
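Before committing to a long run, it helps to time a tiny sample first, as described above, and extrapolate; a minimal sketch (the 1% fraction is arbitrary):

```python
import time

# Fit on a very small sample and note the elapsed time before scaling up
X_tiny = pd.DataFrame(X_train).sample(frac=0.01, random_state=1234)
y_tiny = pd.DataFrame(y_train).sample(frac=0.01, random_state=1234)
start = time.time()
SVC(random_state=1234).fit(X_tiny, y_tiny.values.ravel())
print(f"1% sample: {time.time() - start:.1f} seconds")
```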
The major change compared with the other models is this: training an SVM on all the training data, as with logistic regression, takes far too long. Therefore, after the usual split into 70% training data, only 15% of that training data (10.5% of all data) is used for training. Even with 15%, the processing took more than 3 hours to complete.
#Sample 15% of the training data (the shared random_state keeps X and y rows aligned)
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Model building
clf = SVC(random_state=1234)
clf.fit(X_train_sample, y_train_sample)
Now that we have a model, let's check the accuracy on the test data. Note that the test data is not reduced: the full 30% test set is used. If the test data were shrunk only for SVM, the accuracy could not be compared with the other models.
clf.score(X_test,y_test)
The accuracy was **0.61935**.
SVM has no separate regularization step, so pattern 7 starts with standardization.
First, let's standardize and divide the data. After that, it is the same as pattern 6.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Sample 15% of the training data
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Model building
clf = SVC(random_state=1234)
clf.fit(X_train_sample, y_train_sample)
#Accuracy evaluation
clf.score(X_test, y_test)
The accuracy is **0.64871**, which is better than pattern 6.
Next, let's implement standardization + hyperparameter tuning.
First is standardization and data partitioning.
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, perform hyperparameter tuning. Again, searching with all the training data takes too long, so 3% of the 70% training data (2.1% of the total) is used for the tuning.

Ideally more data would be used to adjust the parameters, but the processing really would not finish, so this value was chosen.
#Sample 3% of the training data
X_train_grid = pd.DataFrame(X_train).sample(frac = 0.03,random_state=1234)
y_train_grid = pd.DataFrame(y_train).sample(frac = 0.03,random_state=1234)
Then, perform hyperparameter tuning.
parameters = {'kernel':['linear', 'rbf'], 'C':[0.001, 0.01, 0.1, 1, 10]} #parameter grid to search
model = SVC(random_state=1234)
clf = GridSearchCV(model, parameters, cv=2,return_train_score=False)
clf.fit(X_train_grid, y_train_grid)
print(clf.best_params_, clf.best_score_)
What differs from logistic regression is that GridSearchCV's cv argument is set to 2, and return_train_score is set to False. Originally cv was 3 and return_train_score was left alone, but the processing never finished, so I looked it up and changed these settings to cut the computation.

At this point we have the "optimal parameters"! Train on the training data with these optimal parameters and verify the accuracy on the test data.
#Sample 15% of the training data
X_train_sample = pd.DataFrame(X_train).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Model training
clf = SVC(**clf.best_params_,random_state=1234)
clf.fit(X_train_sample, y_train_sample)
#Check accuracy with test data
clf.score(X_test, y_test)
The accuracy was **0.65393**.
Finally, add feature selection.
On top of pattern 8, the training data is sampled again for feature selection, so be careful not to confuse which training data is being used for what.
#Standardization
stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
That completes the feature selection. Next, hyperparameter tuning is performed on the selected features.
#Sample 3% of the selected-feature training data for hyperparameter tuning
X_train_grid = pd.DataFrame(X_train_selected).sample(frac = 0.03,random_state=1234)
y_train_grid = pd.DataFrame(y_train).sample(frac = 0.03,random_state=1234)
#Hyperparameter tuning
parameters = {'kernel':['linear', 'rbf'], 'C':[0.001, 0.01, 0.1, 1, 10]} #parameter grid to search
model = SVC(random_state=1234)
clf = GridSearchCV(model, parameters, cv=2,return_train_score=False)
clf.fit(X_train_grid, y_train_grid)
print(clf.best_params_, clf.best_score_)
At this point, hyperparameter tuning on the features selected by the embedded method is complete, and the best parameters have been determined.

Let's train the SVM with these best parameters on the training data overwritten with the selected features (X_train_selected), again using a 15% sample of the training data.
#Sample 15% of the overwritten training data
X_train_sample = pd.DataFrame(X_train_selected).sample(frac = 0.15, random_state=1234)
y_train_sample = pd.DataFrame(y_train).sample(frac = 0.15, random_state=1234)
#Train the model on the 15% sample with the best parameters
clf = SVC(**clf.best_params_,random_state=1234)
clf.fit(X_train_sample, y_train_sample)
#Accuracy evaluation with test data
clf.score(X_test_selected, y_test)
The accuracy is **0.65066**.
Here, let's summarize the accuracy again.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | |
| Pattern 11 | Decision tree | D | |
| Pattern 12 | Decision tree | D+E | |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Next is the decision tree.
Divide the data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Build the decision tree model and verify the accuracy.
clf = DecisionTreeClassifier(random_state=1234)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The accuracy is now **0.63727**.
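As a side note, export_graphviz was imported in the common setup; the fitted tree can be dumped for visual inspection like this (a sketch; the file name is arbitrary, and a depth limit keeps the output readable since the full default tree is very large):

```python
# Write the first few levels of the fitted tree to a Graphviz .dot file
export_graphviz(clf, out_file="tree.dot", max_depth=3,
                class_names=["failed", "successful"], filled=True, rounded=True)
```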
The decision tree does not require regularization or standardization, so we start with hyperparameter tuning.
First is data division.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Next, run a grid search for hyperparameter tuning. Then build the model with the best parameters and verify the accuracy on the test data.
#GridSearch
parameters = {'criterion':['gini', 'entropy'], 'max_depth':[i for i in range(1, 11)], 'max_features':['auto','sqrt','log2'], 'min_samples_leaf':[i for i in range(1, 11)]} #random_state is left out of the grid: it is passed explicitly below, and having it in best_params_ would clash with that argument
model = DecisionTreeClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
#Build a model with the best parameters
clf = DecisionTreeClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)
#Accuracy verification
clf.score(X_test,y_test)
The accuracy was **0.66376**.
As always, start with data splitting.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
#Hyperparameter tuning
parameters = {'criterion':['gini', 'entropy'], 'max_depth':[i for i in range(1, 11)], 'max_features':['auto','sqrt','log2'], 'min_samples_leaf':[i for i in range(1, 11)]} #random_state again left out to avoid clashing with the explicit argument below
model = DecisionTreeClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf_2 = DecisionTreeClassifier(**clf.best_params_,random_state=1234)
clf_2.fit(X_train_selected,y_train)
#Check accuracy with test data
clf_2.score(X_test_selected, y_test)
The accuracy is **0.65732**.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | |
| Pattern 14 | Random forest | D | |
| Pattern 15 | Random forest | D+E | |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Divide the data.
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Build the random forest model and verify the accuracy.
clf = RandomForestClassifier(random_state=1234)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The accuracy is now **0.64522**.
Like the decision tree, the random forest requires neither regularization nor standardization.

As before, hyperparameter tuning follows the data split. The difference from the previous models is that the search range is narrowed somewhat (only a few values per parameter). Even this took about 35 minutes, so expanding the range further seemed impractical given the time available.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Hyperparameter tuning
parameters = {'max_depth':[2,4,6,None], 'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
model = RandomForestClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = RandomForestClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)
#Check accuracy with test data
clf.score(X_test, y_test)
Because narrowing down the values took so long, each grid holds only every other value (even numbers or odd numbers only), chosen so that the default values are still included. The accuracy was **0.67762**.
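As a side note, a fitted random forest exposes feature_importances_, which gives a quick feel for which variables drive the predictions (a sketch; feature_names is assumed to match the column order of df.drop("state", axis=1)):

```python
# Top 10 features by impurity-based importance
feature_names = df.drop("state", axis=1).columns
for name, imp in sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda t: t[1], reverse=True)[:10]:
    print(f"{name}: {imp:.3f}")
```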
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
#Hyperparameter tuning
parameters = {'max_depth':[2,4,6,None], 'min_samples_leaf':[1,3,5],'min_samples_split':[2,4,6]}
model = RandomForestClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = RandomForestClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train_selected, y_train)
#Check accuracy with test data
clf.score(X_test_selected, y_test)
The accuracy is now **0.66308**.
Let's check the accuracy again.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | |
| Pattern 17 | AdaBoost | D | |
| Pattern 18 | AdaBoost | D+E | |
Divide the data.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
Build the AdaBoost model and verify the accuracy. Note that a full decision tree is passed here as the base estimator (the default would be a depth-1 tree).

clf = AdaBoostClassifier(DecisionTreeClassifier(random_state=1234))
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The accuracy is now **0.63947**.
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Hyperparameter tuning
parameters = {'learning_rate':[0.1,0.5,1.0]}
model = AdaBoostClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3,)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = AdaBoostClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train, y_train)
#Check accuracy with test data
clf.score(X_test, y_test)
The accuracy was **0.67426**.
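As a side note, AdaBoost exposes staged_score, which shows how test accuracy evolves as boosting rounds are added (a sketch using the tuned model above):

```python
# Accuracy after every 10th boosting round
for i, score in enumerate(clf.staged_score(X_test, y_test), start=1):
    if i % 10 == 0:
        print(f"{i} rounds: {score:.5f}")
```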
#Data split
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
#Feature selection
estimator = LassoCV(normalize = True, cv = 10, random_state = 1234)
sfm = SelectFromModel(estimator, threshold = 1e-5)
sfm.fit(X_train,y_train)
#Overwrite training data with selected features
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
#Hyperparameter tuning
parameters = {'learning_rate':[0.1,0.5,1.0]}
model = AdaBoostClassifier(random_state=1234)
clf = GridSearchCV(model, parameters, cv=3)
clf.fit(X_train_selected, y_train)
print(clf.best_params_, clf.best_score_)
#Train learners with optimal parameters
clf = AdaBoostClassifier(**clf.best_params_,random_state=1234)
clf.fit(X_train_selected, y_train)
#Check accuracy with test data
clf.score(X_test_selected, y_test)
The accuracy was **0.659367**.
| | model | pattern | accuracy |
|---|---|---|---|
| Pattern 1 | Logistic regression | A | 0.52958 |
| Pattern 2 | Logistic regression | B | 0.59815 |
| Pattern 3 | Logistic regression | B+C | 0.66181 |
| Pattern 4 | Logistic regression | B+C+D | 0.66181 |
| Pattern 5 | Logistic regression | B+C+D+E | 0.66185 |
| Pattern 6 | SVM | A | 0.61935 |
| Pattern 7 | SVM | C | 0.64871 |
| Pattern 8 | SVM | C+D | 0.65393 |
| Pattern 9 | SVM | C+D+E | 0.65066 |
| Pattern 10 | Decision tree | A | 0.63727 |
| Pattern 11 | Decision tree | D | 0.66376 |
| Pattern 12 | Decision tree | D+E | 0.65732 |
| Pattern 13 | Random forest | A | 0.64522 |
| Pattern 14 | Random forest | D | 0.67762 |
| Pattern 15 | Random forest | D+E | 0.66308 |
| Pattern 16 | AdaBoost | A | 0.63947 |
| Pattern 17 | AdaBoost | D | 0.67426 |
| Pattern 18 | AdaBoost | D+E | 0.659367 |
What did you think?

Surprisingly few sites seem to introduce super-basic ways of building models, and I used to think all the time, "I don't need anything advanced yet, I just want to build a model once!"

This article grew out of that problem of my own, so I hope it helps deepen your understanding too.