Intermediate Machine Learning digs deeper into machine learning
~ Flow of Intermediate Machine Learning ~
This time we cover part 5, cross-validation!
Machine learning is an iterative process. You choose which explanatory variables to use, which model to use, what arguments to pass to that model, and so on. So far we have made these choices by measuring the quality of the model on validation data.
However, this approach has drawbacks. Say you have a dataset with 5000 rows (which counts as a __small__ dataset). Holding out 20% for validation gives 1000 rows. The model you've created may __work well__ on one set of 1000 rows, but __not so well__ on another set of 1000 rows.
As an extreme example, consider the case where the validation data is a single row. When you compare multiple models, which one makes the best prediction for that row comes down to __luck__!
In general, the more validation data you have, the smaller the __measurement error__ (called "noise") in your estimate of model quality, and the more reliable that estimate is. Unfortunately, you can only get a large validation set by taking rows away from the training data. Doing so leaves too little data for training and hurts model quality!
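The link between validation-set size and noise can be seen in a small simulation. The sketch below uses only the standard library and made-up per-row errors (drawn from an exponential distribution, an assumption chosen purely for illustration). `mae_spread` is a hypothetical helper, not part of scikit-learn.

```python
import random
import statistics

rng = random.Random(0)

# Made-up absolute errors for 5000 rows (illustrative only, not real model output).
row_errors = [rng.expovariate(1 / 20000) for _ in range(5000)]

def mae_spread(errors, n_valid, n_trials=200):
    """Standard deviation of the MAE across randomly drawn validation sets."""
    maes = [statistics.mean(rng.sample(errors, n_valid)) for _ in range(n_trials)]
    return statistics.stdev(maes)

spread_small = mae_spread(row_errors, n_valid=10)
spread_large = mae_spread(row_errors, n_valid=1000)
print(f"spread with 10 validation rows:   {spread_small:.0f}")
print(f"spread with 1000 validation rows: {spread_large:.0f}")
```

The MAE measured on 10 rows swings far more from draw to draw than the MAE measured on 1000 rows, which is the noise described above.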
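Cross-validation is a method of measuring model quality more accurately on small datasets. It repeats the modeling process on different subsets of the data, so that every row is used for validation exactly once.
For example, if the validation data is 20% of the total, the process can be repeated 5 times, each time holding out a different 20%. This is called splitting the data into 5 __"folds"__.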
[Figure: the data split into 5 folds, quoted from Kaggle]
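The 5-fold split described above can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for this illustration. It splits rows in order without shuffling and is not how scikit-learn implements its splitters.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_rows, valid_rows) index lists, one pair per fold."""
    fold_size = n_rows // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder rows.
        stop = n_rows if fold == n_folds - 1 else start + fold_size
        valid_rows = list(range(start, stop))
        train_rows = list(range(0, start)) + list(range(stop, n_rows))
        yield train_rows, valid_rows

folds = list(k_fold_indices(n_rows=5000, n_folds=5))
for i, (train_rows, valid_rows) in enumerate(folds, start=1):
    print(f"fold {i}: train on {len(train_rows)} rows, validate on {len(valid_rows)} rows")
```

Each row lands in the validation set of exactly one fold, so across the 5 trials the whole dataset is used for validation.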
Cross-validation takes longer because it trains and evaluates one model per fold. So __you do not have to perform cross-validation when the dataset is large enough__.
There is no clear threshold for when a dataset is large enough, but if your model finishes training within a few minutes, it is probably worth performing cross-validation.
Alternatively, you can run cross-validation once and check: if all folds give similar results, a single validation set will probably suffice.
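One way to judge whether the folds "give similar results" is to compare the spread of the fold scores with their mean. The helper `folds_agree` and its 5% tolerance are hypothetical choices for this sketch, and the scores are made up for illustration.

```python
import statistics

def folds_agree(fold_maes, rel_tol=0.05):
    """Return True when the fold MAEs vary by less than rel_tol of their mean."""
    mean_mae = statistics.mean(fold_maes)
    spread = statistics.pstdev(fold_maes)
    return spread / mean_mae < rel_tol

# Made-up fold scores for illustration only.
stable_folds = [18000, 18200, 17900, 18100, 18050]
noisy_folds = [15000, 21000, 17500, 23000, 16000]

print(folds_agree(stable_folds))
print(folds_agree(noisy_folds))
```

The right tolerance depends on the problem; 5% here is only a placeholder.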
The data used this time is the same as last time. It is located here.
import pandas as pd
from sklearn.model_selection import train_test_split
# Data reading
train_data = pd.read_csv('train.csv', index_col='Id')
test_data = pd.read_csv('test.csv', index_col='Id')
# Exclude rows where the objective variable is missing, isolate the objective variable
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice
train_data.drop(['SalePrice'], axis=1, inplace=True)
# Keep only the numeric columns
numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()
X_test = test_data[numeric_cols].copy()
First, make a pipeline. SimpleImputer fills in the missing values, and the model is RandomForestRegressor.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))])
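With its default settings, SimpleImputer replaces each missing value with the mean of its column. The plain-Python sketch below shows that idea on one column. `impute_mean` is a hypothetical helper, the values are made up, and `None` stands in for the NaN values pandas would use.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

values = [65.0, None, 80.0, 95.0, None]  # made-up values
filled = impute_mean(values)
print(filled)
```

Putting the imputer inside the pipeline means the column means are learned from the training rows of each fold only, so no information leaks in from the validation rows.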
Next, define a function that returns the average MAE across the cross-validation folds. Pass the number of trees in the random forest as n_estimators. __cross_val_score__ in scikit-learn reports the MAE with a __minus__ sign, so it is multiplied by -1. (scikit-learn follows the convention that a higher score is always better, so error metrics are reported as negative values.) Adjust the number of folds with the argument you pass to cv.
from sklearn.model_selection import cross_val_score
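def get_score(n_estimators):
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))])
    # cv=5 is an assumed value (5 folds, as in the example earlier);
    # the negated MAE is multiplied by -1 to get the MAE back.
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=5,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()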
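The sign flip in get_score follows from a "higher is better" scoring convention: negating an error metric lets the same `max` logic pick the best model for every metric. A minimal sketch in plain Python, with made-up targets and predictions and hypothetical model names:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def neg_mean_absolute_error(y_true, y_pred):
    """Negated MAE, so that a higher score means a better model."""
    return -mean_absolute_error(y_true, y_pred)

# Made-up targets and predictions for illustration only.
y_true = [200, 150, 300]
predictions = {"model_a": [210, 140, 310], "model_b": [250, 100, 350]}

scores = {name: neg_mean_absolute_error(y_true, pred) for name, pred in predictions.items()}
best = max(scores, key=scores.get)              # highest score = smallest error
mae = {name: -1 * s for name, s in scores.items()}  # flip the sign back to read the MAE
print(best, mae)
```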
Finally, call the function defined above with several values of n_estimators. Then plot how the MAE changes with the value passed in and find the minimum. get_score returns a single number for each setting, so the results are stored in a dictionary, results, keyed by n_estimators; it starts out empty.
results = {}
for i in range(1, 9):
    results[50*i] = get_score(50*i)
import matplotlib.pyplot as plt
n_estimators_best = min(results, key=results.get)
plt.plot(list(results.keys()), list(results.values()))
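The `min(results, key=results.get)` call above returns the dictionary key whose value is smallest. A short sketch with made-up numbers (not the results of this run) shows the behavior:

```python
# Made-up MAE values keyed by n_estimators (not the results of the run above).
example_results = {50: 18350.0, 100: 18200.0, 150: 18150.0, 200: 18180.0}

# min() iterates over the keys; key=example_results.get ranks each key by its value.
best_n = min(example_results, key=example_results.get)
best_mae = example_results[best_n]
print(best_n, best_mae)
```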
Execution result
