Anyone who wants to try machine learning can implement it relatively easily with scikit-learn and similar libraries. However, if you want to achieve results at work or raise your level, **an explanation like "I don't know the background, but I got this result" is obviously weak**.
In my previous post, [Machine learning] Understanding decision trees from both scikit-learn and mathematics, I described decision trees in detail. This time, I will summarize random forests, which are used more often in practical work and in competitions such as Kaggle.
Unlike usual, I don't go into the mathematics this time. My own understanding used to stop at **"a random forest is a combination of decision trees"**, so I organized the topic for myself. **The goal of this post is to help you understand "what a random forest is" and "what to do for parameter tuning" while keeping the background in mind**.
Also, this time I referred to O'Reilly's [Machine learning starting with Python](https://www.amazon.co.jp/dp/4873117984) by Andreas C. Müller.
To understand random forests, we first need to touch on ensemble learning.
Ensemble learning is **a way to build more powerful models by combining multiple machine learning models**.
There are many machine learning models such as "logistic regression", "SVM", and "decision tree", but each of these makes predictions for data independently.
In general, however, there are many cases where some kind of **majority vote**, in which several people come together to decide on an answer, produces better results than one person answering at their own discretion.
Ensemble learning follows exactly this way of thinking: it is a learning method that makes the final decision based on the judgments of multiple machine learning models. The image is below.
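As a minimal sketch of this majority-vote idea in code, scikit-learn's VotingClassifier can combine the three kinds of models mentioned above. The iris dataset and the settings here are just placeholders for illustration, not part of the original article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three independent models vote on the final answer (hard majority vote)
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC()),
    ("tree", DecisionTreeClassifier()),
])
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```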
There are two main types of ensemble learning methods: "bagging" and "boosting". Random forests make predictions based on bagging.
Bagging is a method that trains multiple models in parallel on data sampled by the **bootstrap** method. When new data comes in, the final prediction is a majority vote for classification and an average for regression.
The bootstrap is a method of sampling data from the original dataset by **sampling with replacement**. Because each drawn item is returned to the original data before the next draw, the same data point may be selected multiple times.
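A minimal sketch of sampling with replacement using NumPy (the array here is just dummy data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # dummy "original data": 0..9

# Sample with replacement: the same element may appear more than once,
# and some elements may not appear at all
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```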
Boosting prepares multiple models and trains them in series: each new model is built while referring to the results of the model built before it.
AdaBoost is a well-known boosting-based model (not covered this time).
A random forest is a collection of many slightly different decision trees, built on the **bagging** form of ensemble learning.
A decision tree on its own has the drawback of easily overfitting, and the random forest is one way of dealing with this problem.
As described for bagging, the original data is randomly sampled into several groups, and each decision tree is built on, and overfits to, its own group of data.
**The idea is that if you build many decision trees that each overfit in a different direction, you can reduce the overall degree of overfitting by averaging their results**.
Let's illustrate this idea.
STEP1: Randomly sample data from the original data with the bootstrap and create N groups of data.
STEP2: Create a decision tree model for each of the N groups.
STEP3: Make a prediction with each of the N decision tree models.
STEP4: Take a majority vote over the N groups (an average for regression) to make the final prediction. A rough sketch of these four steps in code follows below.
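A rough, hand-rolled sketch of STEP1–STEP4, assuming the breast cancer dataset and 10 trees purely for illustration (per-split feature sub-sampling is omitted for simplicity; in practice RandomForestClassifier does all of this for you):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 10  # the N in the figure
trees = []

# STEP1-2: bootstrap a data group for each tree and fit one decision tree on it
for _ in range(n_trees):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sampling with replacement
    tree = DecisionTreeClassifier()
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# STEP3: each tree makes its own prediction
predictions = np.array([tree.predict(X_test) for tree in trees])

# STEP4: majority vote over the N trees (for regression you would average instead)
majority_vote = (predictions.mean(axis=0) >= 0.5).astype(int)
print("accuracy =", (majority_vote == y_test).mean())
```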
The concrete implementation with scikit-learn comes in the next section, but first I will explain how to set the main parameters.
As a premise, though, random forests are known to give reasonably good accuracy without much parameter tuning (and without rescaling the data, e.g. standardization). So this time I will only introduce the parameters, and in the implementation below we will build the model with the default settings.
[Machine learning starting with Python](https://www.amazon.co.jp/dp/4873117984), introduced at the beginning, lists the "important parameters to adjust" on page 87: n_estimators and max_features.
◆n_estimators Sets how many decision trees to prepare; it is the N of the "N groups of data" shown in the figure. Bigger is generally better (think of getting a majority vote from more people), but increasing it too much costs time and memory, so in practice it is a balance.
◆max_features I am describing this here for the first time, but there is actually one more thing done when sampling the data in STEP1: "selection of features". Not all features are used to build each model; the features are also assigned randomly when building the decision tree for each group. max_features sets how many features each group uses.
Increasing max_features makes the individual decision trees more similar to each other, while decreasing it makes them differ significantly; however, if it is too small, you end up with decision trees that cannot fit the data well.
"Machine learning starting with Python" generally recommends using the default value for max_features.
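For reference, a hedged sketch of how these two parameters are passed to scikit-learn; the values shown are only placeholders, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators: how many decision trees to build (the N in the figure)
# max_features: how many features each tree/split may consider
#               ("sqrt" of the feature count is the classification default)
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
```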
Now let's actually implement a random forest with scikit-learn.
Use kaggle's Kickstarter Projects dataset. https://www.kaggle.com/kemical/kickstarter-projects
import pandas as pd  # import pandas
import datetime  # for date processing of the original data
from sklearn.model_selection import train_test_split  # for splitting the data
from sklearn.ensemble import RandomForestClassifier  # random forest
df = pd.read_csv(r"C:~~\ks-projects-201801.csv")  # adjust the path to where you saved the dataset
From the following, you can see that the dataset has shape (378661, 15).
df.shape
Let's also take a quick look at the data with .head().
df.head()
Since the focus this time is the random forest, I will omit the details, but because the data contains the start and end times of each crowdfunding campaign, we convert them into the number of "campaign days".
df["deadline"] = pd.to_datetime(df["deadline"])  # campaign end date
df["launched"] = pd.to_datetime(df["launched"])  # campaign start date
df["days"] = (df["deadline"] - df["launched"]).dt.days  # campaign length in days
I will omit the details here as well, but the objective variable "state" has categories other than success ("successful") and failure ("failed"); this time we only use the rows for success and failure.
df = df[(df["state"] == "successful") | (df["state"] == "failed")]
Then replace success with 1 and failure with 0.
df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)
Before building the model, we drop ID and name, which we assume are not needed (name might actually be worth keeping, but we drop it this time), as well as the variables that would only be known after the crowdfunding campaign has finished.
df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)
Perform categorical variable processing with pd.get_dummies.
df = pd.get_dummies(df,drop_first = True)
Finally, separate the features from the target and split the data into training and test sets.
train_data = df.drop("state", axis=1)  # features
y = df["state"].values  # target (1 = success, 0 = failure)
X = train_data.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
clf = RandomForestClassifier(random_state=1234)  # default settings
clf.fit(X_train, y_train)
print("score=", clf.score(X_test, y_test))
If you run the above, you should get an accuracy of about 0.638. For a basic model, that's all there is to it!
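If you later want to go one step beyond the defaults, a minimal sketch of tuning the two parameters discussed above with GridSearchCV might look like the following. The candidate values are just assumptions for illustration, and running this on the full dataset can take a while.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Candidate values are placeholders, not recommendations
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2", None],
}
search = GridSearchCV(RandomForestClassifier(random_state=1234), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```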
How was it? My view is that it is very important to start by implementing a basic end-to-end sequence of steps with scikit-learn, without worrying about accuracy or trying to interpret extremely complicated code from the very beginning.
Once you are used to that, however, I feel it is just as important to understand from the background how these models work behind the scenes. As I learn more, I would like to update this random forest article to a deeper level.
Some parts may be difficult to understand, but I hope this helps deepen your understanding.