This time, I worked on SIGNATE's "[Practice problem] Prediction of rental bicycle users". I am still fairly new to machine learning, but I hope to improve little by little through competitions.
The task was as follows:
Create a model that predicts the number of rental bicycle users per hour on each day, using two years of seasonal and weather information.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Read and display the files
train = pd.read_csv('train.tsv',sep='\t')
test = pd.read_csv('test.tsv',sep='\t')
train.head()
# Plot overall usage as a first look
plt.figure(figsize=(12,5))
plt.plot(train['id'],train['cnt'])
# Plot the usage for one week as a trial
# 1. Slice out each date into its own variable
_day0703 = train.query('dteday == "2011-07-03"') # Sun
_day0704 = train.query('dteday == "2011-07-04"') # Mon
_day0705 = train.query('dteday == "2011-07-05"') # Tue
_day0706 = train.query('dteday == "2011-07-06"') # Wed
_day0707 = train.query('dteday == "2011-07-07"') # Thu
_day0708 = train.query('dteday == "2011-07-08"') # Fri
_day0709 = train.query('dteday == "2011-07-09"') # Sat
# 2. Plot each date on the same axes
plt.figure(figsize=(12,5))
plt.plot(_day0703['hr'],_day0703['cnt'],label='Sun')
plt.plot(_day0704['hr'],_day0704['cnt'],label='Mon')
plt.plot(_day0705['hr'],_day0705['cnt'],label='Tue')
plt.plot(_day0706['hr'],_day0706['cnt'],label='Wed')
plt.plot(_day0707['hr'],_day0707['cnt'],label='Thu')
plt.plot(_day0708['hr'],_day0708['cnt'],label='Fri')
plt.plot(_day0709['hr'],_day0709['cnt'],label='Sat')
plt.legend()
plt.grid()
・ Usage patterns appear to differ between holidays and weekdays.
・ On weekdays, usage is concentrated from 6 a.m. to 10 a.m. and from 4 p.m. to 9 p.m., which suggests commuting to work or school.
Because the usage pattern changes with holidays and the time of day, I judged that linear regression was not a good fit and chose XGBoost instead.
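To sanity-check that choice before any tuning, a quick cross-validated comparison can be run. This is a minimal sketch of my own (not part of the original workflow), assuming the same column layout used in the training code further below.

# Hedged sketch: compare a linear baseline against untuned XGBoost with 5-fold CV
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

_X = train.drop(['id','dteday','cnt'], axis=1)
_y = train['cnt']
for _name, _model in [('linear', LinearRegression()), ('xgboost', xgb.XGBRegressor())]:
    _scores = cross_val_score(_model, _X, _y, cv=5)  # default scoring is R^2 for regressors
    print(_name, _scores.mean())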
# Import XGBoost and scikit-learn utilities
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
# Create an XGBoost regressor
reg = xgb.XGBRegressor()
# The trend before id 2500 looks different (right after the service launched?), so drop those rows
train = train[train['id'] > 2500]
# Separate the explanatory variables and the target variable
X_train = train.drop(['id','dteday','cnt'], axis=1)
y_train = train['cnt']
X_test = test.drop(['id','dteday'], axis=1)
# Hyperparameter search over max_depth and n_estimators with GridSearchCV
reg_cv = GridSearchCV(reg, {'max_depth': [2,4,6], 'n_estimators': [50,100,200]}, verbose=1)
reg_cv.fit(X_train, y_train)
print(reg_cv.best_params_, reg_cv.best_score_)
# Retrain with the best parameters
reg = xgb.XGBRegressor(**reg_cv.best_params_)
reg.fit(X_train, y_train)
# Predict on the training data
pred_train = reg.predict(X_train)
# Check whether the predictions look reasonable
train_value = y_train.values
_df = pd.DataFrame({'actual':train_value,'pred':pred_train})
_df.plot(figsize=(12,5))
Overall, the predictions seem to follow the actual values well.
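To put a number on that impression, the mean_squared_error imported earlier can report a training RMSE (my own addition; note it is computed on the training data, so it will be optimistic compared with a held-out score).

# Training RMSE as a numeric check (optimistic, since it is measured on the training data)
rmse_train = np.sqrt(mean_squared_error(train_value, pred_train))
print(f'train RMSE: {rmse_train:.2f}')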
# Plot feature importances
importances = pd.Series(reg.feature_importances_, index = X_train.columns)
importances = importances.sort_values()
importances.plot(kind = "barh")
plt.title("imporance in the xgboost Model")
plt.show()
# Predict on the test data
pred_test = reg.predict(X_test)
# Attach the predictions to the sample submission and write the file
sample = pd.read_csv("sample_submit.csv",header=None)
sample[1] = pred_test
sample.to_csv("submit01.csv",index=None,header=None)
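One caveat worth noting (my own addition, not something the original submission did): a squared-error regressor can output negative values even though the target is a count, so clipping at zero before writing the file is a cheap safeguard. The file name here is hypothetical.

# Hedged tweak: counts cannot be negative, so clip predictions at zero
sample[1] = np.clip(pred_test, 0, None)
sample.to_csv("submit01_clipped.csv", index=None, header=None)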
I finished 29th out of 209 participants. Since I simply fed the data into XGBoost this time, there seems to be plenty of room for other ideas, such as feature engineering, other learning models, and ensembling; one small sketch follows below. I would like to try again, and I plan to write another article when I do.
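As one sketch of the "creating features" direction (hypothetical, and it may duplicate columns already present in the data), a day-of-week feature can be derived directly from dteday:

# Hypothetical feature idea: derive day-of-week (0=Mon .. 6=Sun) from dteday
train['dow'] = pd.to_datetime(train['dteday']).dt.dayofweek
test['dow'] = pd.to_datetime(test['dteday']).dt.dayofweek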