I tried SIGNATE "[Practice question] Predicting the number of rental bicycle users"

Introduction

This time, I worked on SIGNATE's "[Practice question] Predicting the number of rental bicycle users". I haven't done much machine learning yet, but I hope to improve little by little through competitions.

Contents of the competition

I worked on the following exercise.

[Practice question] Predicting the number of rental bicycle users

The task is to create a model that predicts the hourly number of rental bicycle users from two years of seasonal and weather information.

Actual code

1. Read data

#Library import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#Read / display files
train = pd.read_csv('train.tsv',sep='\t')
test  = pd.read_csv('test.tsv',sep='\t')
train.head()

data1.png

2. Understanding the data content

#Try plotting usage
plt.figure(figsize=(12,5))
plt.plot(train['id'],train['cnt'])

graph1.png

#Plot the usage for one sample week
# 1.Store in variable by date
_day0703 = train.query('dteday == "2011-07-03"')#Sun
_day0704 = train.query('dteday == "2011-07-04"')#Mon
_day0705 = train.query('dteday == "2011-07-05"')#Tue
_day0706 = train.query('dteday == "2011-07-06"')#Wed
_day0707 = train.query('dteday == "2011-07-07"')#Thu
_day0708 = train.query('dteday == "2011-07-08"')#Fri
_day0709 = train.query('dteday == "2011-07-09"')#Sat
# 2.Graph display of each date
plt.figure(figsize=(12,5))
plt.plot(_day0703['hr'],_day0703['cnt'],label='Sun')
plt.plot(_day0704['hr'],_day0704['cnt'],label='Mon')
plt.plot(_day0705['hr'],_day0705['cnt'],label='Tue')
plt.plot(_day0706['hr'],_day0706['cnt'],label='Wed')
plt.plot(_day0707['hr'],_day0707['cnt'],label='Thu')
plt.plot(_day0708['hr'],_day0708['cnt'],label='Fri')
plt.plot(_day0709['hr'],_day0709['cnt'],label='Sat')
plt.legend()
plt.grid()

graph2.png

・ Usage patterns seem to differ between holidays and weekdays.
・ On weekdays, usage is high from 6 am to 10 am and from 4 pm to 9 pm, so the bikes seem to be used for commuting to work or school.
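
The same comparison can be made over the whole training set rather than a single week. Here is a minimal sketch, assuming the `dteday`, `hr`, and `cnt` columns used above. `hourly_profile` is a hypothetical helper, the demo rows are made up, and public holidays that fall on a weekday are counted as weekdays here.

```python
import numpy as np
import pandas as pd

def hourly_profile(df):
    """Mean of `cnt` per hour, with one column for weekdays and one for weekends."""
    dates = pd.to_datetime(df["dteday"])
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    day_type = np.where(dates.dt.dayofweek >= 5, "weekend", "weekday")
    return (
        df.assign(day_type=day_type)
          .groupby(["hr", "day_type"])["cnt"]
          .mean()
          .unstack("day_type")
    )

# Tiny made-up stand-in for train.tsv (values are hypothetical).
demo = pd.DataFrame({
    "dteday": ["2011-07-03", "2011-07-03", "2011-07-04",
               "2011-07-04", "2011-07-05", "2011-07-05"],
    "hr": [8, 17, 8, 17, 8, 17],
    "cnt": [20, 60, 100, 200, 120, 220],
})
profile = hourly_profile(demo)
print(profile)
```

With the real data, `hourly_profile(train).plot(figsize=(12,5))` should draw the two average curves on one chart.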

Since usage changes with holidays and time of day, I judged that a plain linear regression would not fit well and chose XGBoost instead.
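
To illustrate the reasoning, here is a toy sketch with NumPy only. The two-peak curve is made up (it is not the competition data), and the 3-hour step function is only a rough stand-in for the kind of piecewise-constant fit a tree-based model can represent, not XGBoost itself.

```python
import numpy as np

hours = np.arange(24)
# Hypothetical commuter profile with peaks around 8:00 and 18:00.
counts = (200 * np.exp(-((hours - 8) ** 2) / 8)
          + 250 * np.exp(-((hours - 18) ** 2) / 8)
          + 20)

def r2(y, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Straight line in the hour of day.
slope, intercept = np.polyfit(hours, counts, 1)
linear_r2 = r2(counts, slope * hours + intercept)

# Step function: one constant value per 3-hour bin.
bins = hours // 3
bin_means = np.array([counts[bins == b].mean() for b in range(8)])
step_r2 = r2(counts, bin_means[bins])

print(f"linear R^2: {linear_r2:.2f}, step R^2: {step_r2:.2f}")
```

A single straight line cannot follow two separate peaks, while even a coarse step function tracks them much better.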

3. Learning with XGBoost

#XGBoost library import
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
#Creating an xgboost model
reg = xgb.XGBRegressor()
#Rows up to id 2500 show a different trend (service launch period?), so drop them
train = train[train['id'] > 2500]
#Store explanatory variables and objective variables
X_train = train.drop(['id','dteday','cnt'], axis=1)
y_train = train['cnt']
X_test = test.drop(['id','dteday'], axis=1)
#Hyperparameter search
reg_cv = GridSearchCV(reg, {'max_depth': [2,4,6], 'n_estimators': [50,100,200]}, verbose=1)
reg_cv.fit(X_train, y_train)
print(reg_cv.best_params_, reg_cv.best_score_)
#Learn again with optimal parameters
reg = xgb.XGBRegressor(**reg_cv.best_params_)
reg.fit(X_train, y_train)
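
The cutoff at id 2500 was read off the plot in section 2. One way to put a number on it is to compare the mean of `cnt` on each side of the cutoff. This is a minimal sketch, where `mean_by_segment` is a hypothetical helper and the demo rows are made up. It would have to run on the training data before the `train['id'] > 2500` filter is applied.

```python
import pandas as pd

def mean_by_segment(df, cutoff=2500):
    """Mean of `cnt` for rows at or below the cutoff id and for rows above it."""
    above = df["id"] > cutoff
    return pd.Series({
        "id <= cutoff": df.loc[~above, "cnt"].mean(),
        "id > cutoff": df.loc[above, "cnt"].mean(),
    })

# Made-up rows; with the real data, pass the unfiltered train DataFrame.
demo = pd.DataFrame({"id": [1, 2, 3, 4], "cnt": [10, 20, 100, 200]})
segment_means = mean_by_segment(demo, cutoff=2)
print(segment_means)
```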

4. Check the model

#Prediction using training data
pred_train = reg.predict(X_train)
#Check if the predicted value is valid
train_value = y_train.values
_df = pd.DataFrame({'actual':train_value,'pred':pred_train})
_df.plot(figsize=(12,5))

graph3.png

Overall, the predictions seem to follow the actual values of the training data well.
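
The plot compares predictions with the same rows the model was trained on, so a numeric error on held-out rows is a useful complement. Here is a minimal sketch, assuming `id` increases with time (the plot in section 2 suggests this, but I have not confirmed it). `chronological_split` and `rmse` are hypothetical helpers, and the mean baseline only stands in for a real model.

```python
import numpy as np
import pandas as pd

def chronological_split(df, valid_frac=0.2):
    """Split rows by `id` order: earliest rows for fitting, latest for validation."""
    ordered = df.sort_values("id")
    n_valid = max(1, int(len(ordered) * valid_frac))
    return ordered.iloc[:-n_valid], ordered.iloc[-n_valid:]

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up data: 10 rows with steadily increasing counts.
demo = pd.DataFrame({"id": range(1, 11), "cnt": range(10, 110, 10)})
fit_part, valid_part = chronological_split(demo)

# Stand-in "model": always predict the mean of the fitting rows.
baseline = np.full(len(valid_part), fit_part["cnt"].mean())
valid_rmse = rmse(valid_part["cnt"], baseline)
print(valid_rmse)
```

In the code of this article, the model would be fitted on the earlier part of `train` and scored on the later part. `np.sqrt(mean_squared_error(actual, predicted))` with the function imported in section 3 gives the same value as `rmse`.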

5. Check feature importance

#feature importance plot
importances = pd.Series(reg.feature_importances_, index = X_train.columns)
importances = importances.sort_values()
importances.plot(kind = "barh")
plt.title("importance in the xgboost Model")
plt.show()

graph4.png

6. Create submission file

#Calculation of predicted values for test data
pred_test = reg.predict(X_test)
#Paste the result and output to a file
sample = pd.read_csv("sample_submit.csv",header=None)
sample[1] = pred_test
sample.to_csv("submit01.csv",index=False,header=False)
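
A regression model can output negative values, which make no sense as a user count. Here is a minimal sketch of clipping them at zero before writing the file. `clip_counts` is a hypothetical helper, and I have not checked whether this changes the score in this competition.

```python
import numpy as np

def clip_counts(pred):
    """Replace negative predictions with 0; other values are unchanged."""
    return np.clip(np.asarray(pred, dtype=float), 0, None)

# Made-up predictions for illustration.
demo_pred = [-3.2, 0.0, 15.7]
clipped = clip_counts(demo_pred)
print(clipped)
```

In the code above, this would be applied as `sample[1] = clip_counts(pred_test)`.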

Result Summary

I placed 29th out of 209 participants. This time I simply fed the data into XGBoost, so there is still room for improvement through feature engineering, other models, and ensemble learning. I would like to try again and write another article when I do.
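
As one example of a feature that could be tried, the commuting pattern seen in section 2 can be turned into a flag. This is only a sketch: `add_commute_flag` and the `is_commute` column are hypothetical names, the hour ranges are the ones read off the weekly plot, public holidays are not handled, and I have not tested whether it improves the score.

```python
import pandas as pd

def add_commute_flag(df):
    """Add `is_commute` = 1 for weekday rows in the 6-10 or 16-21 hour ranges."""
    dates = pd.to_datetime(df["dteday"])
    is_weekday = dates.dt.dayofweek < 5  # Monday=0 ... Friday=4
    # Series.between is inclusive on both ends by default.
    in_peak = df["hr"].between(6, 10) | df["hr"].between(16, 21)
    return df.assign(is_commute=(is_weekday & in_peak).astype(int))

# Made-up rows: Sunday 8:00, Tuesday 8:00, Tuesday 13:00.
demo = pd.DataFrame({
    "dteday": ["2011-07-03", "2011-07-05", "2011-07-05"],
    "hr": [8, 8, 13],
})
demo = add_commute_flag(demo)
print(demo)
```

Since `dteday` is dropped before training, the flag would have to be added to both `train` and `test` before `X_train` and `X_test` are created.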
