I couldn't go out during Golden Week, so I tried Kaggle's "Predict Future Sales" competition.
The task is to predict the number of items sold in the next month (November 2015) from daily sales data (January 2013 to October 2015).
The characteristics of this data are:
・There is simply a lot of data.
・The shops and items that appear differ between the training data and the test data.
・The training data is daily while the target is monthly sales, so the format has to be matched to the test data.
・There is a time-series element, so lag features need to be added.
Let's do it.
This time I used the kaggle notebook "Feature engineering, xgboost" as a reference.
import copy
import time
import numpy as np
import pandas as pd
from itertools import product
import matplotlib.pyplot as plt
import seaborn as sns
from lightgbm.sklearn import LGBMRegressor
from lightgbm import plot_importance
from sklearn.preprocessing import LabelEncoder
pd.set_option('display.max_rows', 100)
day_train_data = pd.read_csv('path/sales_train.csv')
month_test_data = pd.read_csv('path/test.csv')
item_categories = pd.read_csv('path/item_categories.csv')
items = pd.read_csv('path/items.csv')
shops = pd.read_csv('path/shops.csv')
Since 'date' in day_train_data is not used this time, delete it first.
day_train_data.drop('date', axis=1, inplace=True)
If you check 'item_category_name' in item_categories, the category names are split into two parts around a '-'.
item_categories['item_category_name'].head()
# [Output result]
# 0 PC - Гарнитуры/Наушники
# 1 Аксессуары - PS2
# 2 Аксессуары - PS3
# 3 Аксессуары - PS4
# 4 Аксессуары - PSP
# Name: item_category_name, dtype: object
We will split this into a big category and a sub category and label encode them. Some categories also have nearly identical names, so those are corrected as well.
item_categories['split_name'] = item_categories['item_category_name'].str.split('-')
item_categories['big_category_name'] =\
    item_categories['split_name'].map(lambda x: x[0].strip())
item_categories['sub_category_name'] =\
    item_categories['split_name'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
item_categories.loc[item_categories['big_category_name'] == 'Чистые носители(штучные)', 'big_category_name'] = 'Чистые носители (шпиль)'
# label encoding
item_categories['big_category_id'] = LabelEncoder().fit_transform(item_categories['big_category_name'])
item_categories['sub_category_id'] = LabelEncoder().fit_transform(item_categories['sub_category_name'])
#Data organization
item_categories = item_categories[['item_category_id', 'big_category_id', 'sub_category_id']]
As with the categories, the city can be extracted from 'shop_name' in shops, so let's turn it into a variable. There are also some typos in the names, so I'll fix those as well.
shops.loc[shops['shop_name'] == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'
shops['city_name'] = shops['shop_name'].map(lambda x: x.split(' ')[0])
shops.loc[shops['city_name'] == '!Якутск', 'city_name'] = 'Якутск'
# label encoding
shops['city_id'] = LabelEncoder().fit_transform(shops['city_name'])
#Data organization
shops = shops[['shop_id', 'city_id']]
Looking at the original 'shop_name' column (before it was dropped above), some shops have nearly identical names. Checking kaggle's Notebooks, they appear to be the same shops, so I'll fix them. There are several more pairs like this.
shops[(shops['shop_id'] == 0) | (shops['shop_id'] == 57)]
# [Output result]
#      shop_name                        shop_id
# 0    !Якутск Орджоникидзе, 56 фран    0
# 57   Якутск Орджоникидзе, 56          57
The tables to modify are day_train_data and month_test_data. In this program it is better not to touch shops itself (since shops is merged in later, leaving duplicate rows for the same shop would blow up the amount of data).
# Fix duplicated shop_id values
day_train_data.loc[day_train_data['shop_id'] == 0, 'shop_id'] = 57
month_test_data.loc[month_test_data['shop_id'] == 0, 'shop_id'] = 57
day_train_data.loc[day_train_data['shop_id'] == 1, 'shop_id'] = 58
month_test_data.loc[month_test_data['shop_id'] == 1, 'shop_id'] = 58
day_train_data.loc[day_train_data['shop_id'] == 10, 'shop_id'] = 11
month_test_data.loc[month_test_data['shop_id'] == 10, 'shop_id'] = 11
We will delete the rows whose item_price or item_cnt_day values are extreme outliers. Also, since one item_price is negative, we will replace it with the median price of the same item sold at the same shop in the same month.
#Remove outliers
day_train_data.drop(day_train_data[day_train_data['item_price'] >= 100000].index, inplace=True)
day_train_data.drop(day_train_data[day_train_data['item_cnt_day'] > 1000].index, inplace=True)
# Fix the erroneous (negative) item_price value
median = day_train_data[(day_train_data['date_block_num'] == 4) & (day_train_data['shop_id'] == 32) & (day_train_data['item_id'] == 2973) & (day_train_data['item_price'] >= 0)]['item_price'].median()
day_train_data.loc[day_train_data['item_price'] < 0, 'item_price'] = median
Now let's create the dataset. The dataset built here is a kind of incomplete panel that combines day_train_data and month_test_data. Since what we predict is each shop-item pair in month_test_data, another option would be to copy month_test_data for every date_block_num and build a complete panel stacked vertically. With that method, however, part of day_train_data would go unused, so this time I take the former approach.
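For reference, the alternative mentioned above (replicating month_test_data for every month to build a complete panel) would look roughly like this sketch; it is not used in the rest of this article.
# Sketch of the alternative (not used below): replicate the test shop-item pairs
# for all 35 months to build a complete panel.
alt_set = pd.concat([month_test_data[['shop_id', 'item_id']].assign(date_block_num=i)
                     for i in range(35)],
                    ignore_index=True)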
# Create data_set
data_set = []
cols = ['date_block_num', 'shop_id', 'item_id']
for date_num in range(34):
    sales = day_train_data[day_train_data['date_block_num'] == date_num]
    data_set.append(np.array(list(product([date_num],
                                          sales['shop_id'].unique(),
                                          sales['item_id'].unique()))))
data_set = pd.DataFrame(np.vstack(data_set), columns=cols)
data_set.sort_values(cols, inplace=True)
data_set = data_set.reset_index(drop=True)
# merge month_test_data
month_test_data['date_block_num'] = 34
data_set = pd.concat([data_set, month_test_data], ignore_index=True, sort=False, keys=cols)
# data_set = data_set + shops [city_id]
data_set = pd.merge(data_set, shops, on=['shop_id'], how='left')
# data_set = data_set + items [item_category_id]
data_set = pd.merge(data_set, items[['item_id', 'item_category_id']], on=['item_id'], how='left')
# data_set = data_set + item_categories [big_category_id, sub_category_id]
data_set = pd.merge(data_set, item_categories, on=['item_category_id'], how='left')
Next, create the monthly sales quantity from day_train_data. This becomes the objective variable. According to kaggle, clipping this variable improves prediction accuracy (the competition clips the true target values into the [0, 20] range).
# Create item_sales_day
day_train_data['item_sales_day'] = day_train_data['item_price'] * day_train_data['item_cnt_day']
# item_cnt_month creation and merge
merge_data = day_train_data.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day'].agg('sum').rename(columns={'item_cnt_day': 'item_cnt_month'})
data_set = pd.merge(data_set, merge_data, on=cols, how='left')
data_set['item_cnt_month'] = data_set['item_cnt_month'].fillna(0).clip(0, 20)
Let's graph the item_cnt_month for each month.
plot_data = day_train_data.groupby(['date_block_num'], as_index=False)['item_cnt_day'].agg('sum').rename(columns={'item_cnt_day': 'item_cnt_month'})
plt.figure(figsize=(20, 10))
sns.lineplot(data=plot_data, x='date_block_num', y='item_cnt_month')
Sales are decreasing year by year, and every December the number of items sold spikes, so there is a clear periodicity. Adding features that capture this should improve accuracy. Let's look more closely with a seasonal decomposition.
import statsmodels.api as sm
res = sm.tsa.seasonal_decompose(plot_data['item_cnt_month'], period=12)  # 'freq=12' in older statsmodels versions
plt.figure(figsize=(15, 9))
plt.subplots_adjust(hspace=0.3)
plt.subplot(411)
plt.plot(res.observed, lw=.6, c='darkblue')
plt.title('observed')
plt.subplot(412)
plt.plot(res.trend, lw=.6, c='indianred')
plt.title('trend')
plt.subplot(413)
plt.plot(res.seasonal, lw=.6, c='indianred')
plt.title('seasonal')
plt.subplot(414)
plt.plot(res.resid, lw=.6, c='indianred')
plt.title('residual')
The decomposition clearly shows both the downward trend and the yearly seasonality. Here we cast the data once to reduce memory usage.
data_set.fillna(0, inplace=True)
data_set['date_block_num'] = data_set['date_block_num'].astype(np.uint8)
data_set['shop_id'] = data_set['shop_id'].astype(np.uint8)
data_set['item_id'] = data_set['item_id'].astype(np.uint16)
data_set['item_cnt_month'] = data_set['item_cnt_month'].astype(np.float16)
data_set['ID'] = data_set['ID'].astype(np.uint32)
data_set['city_id'] = data_set['city_id'].astype(np.uint8)
data_set['item_category_id'] = data_set['item_category_id'].astype(np.uint8)
data_set['big_category_id'] = data_set['big_category_id'].astype(np.uint8)
data_set['sub_category_id'] = data_set['sub_category_id'].astype(np.uint8)
Feature Engineering
The following lag function will be used many times below, but be careful: it temporarily uses a lot of memory.
# Add columns of lag_col shifted by each value in lags (in months)
def feature_lags(df, lags, lag_col):
    tmp = df[['date_block_num', 'shop_id', 'item_id', lag_col]]
    for lag in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num', 'shop_id', 'item_id', str(lag_col) + '_lag_' + str(lag)]
        shifted['date_block_num'] += lag
        shifted = shifted[shifted['date_block_num'] <= 34]
        df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left', copy=False)
    return df
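To illustrate what the function produces, here is a tiny toy check with hypothetical values: the month-1 row for shop 5 picks up that shop-item pair's count from month 0 as its lag-1 feature, and rows with no previous month get NaN.
# Toy example of feature_lags (hypothetical values)
toy = pd.DataFrame({'date_block_num': [0, 1, 1],
                    'shop_id': [5, 5, 6],
                    'item_id': [10, 10, 10],
                    'item_cnt_month': [3.0, 1.0, 2.0]})
feature_lags(toy, lags=[1], lag_col='item_cnt_month')
# -> the (1, 5, 10) row gets item_cnt_month_lag_1 = 3.0; the other rows get NaN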
#target_lags
data_set = feature_lags(df=data_set, lags=[1, 2, 3, 6, 12], lag_col='item_cnt_month')
# date_avg_item_cnt
date_avg_item_cnt = data_set.groupby(['date_block_num'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_avg_item_cnt'})
data_set = pd.merge(data_set, date_avg_item_cnt, on=['date_block_num'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_avg_item_cnt')
data_set.drop(['date_avg_item_cnt'], axis=1, inplace=True)
# date_item_avg_item_cnt
date_item_avg_item_cnt = data_set.groupby(['date_block_num', 'item_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_item_avg_item_cnt'})
data_set = pd.merge(data_set, date_item_avg_item_cnt, on=['date_block_num', 'item_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_item_avg_item_cnt')
data_set.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)
# date_shop_avg_item_cnt
date_shop_avg_item_cnt = data_set.groupby(['date_block_num', 'shop_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_shop_avg_item_cnt'})
data_set = pd.merge(data_set, date_shop_avg_item_cnt, on=['date_block_num', 'shop_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_shop_avg_item_cnt')
data_set.drop(['date_shop_avg_item_cnt'], axis=1, inplace=True)
# date_category_avg_item_cnt
date_category_avg_item_cnt = data_set.groupby(['date_block_num', 'item_category_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_category_avg_item_cnt'})
data_set = pd.merge(data_set, date_category_avg_item_cnt, on=['date_block_num', 'item_category_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_category_avg_item_cnt')
data_set.drop(['date_category_avg_item_cnt'], axis=1, inplace=True)
# date_shop_category_avg_item_cnt
date_shop_category_avg_item_cnt = data_set.groupby(['date_block_num', 'shop_id', 'item_category_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_shop_category_avg_item_cnt'})
data_set = pd.merge(data_set, date_shop_category_avg_item_cnt, on=['date_block_num', 'shop_id', 'item_category_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_shop_category_avg_item_cnt')
data_set.drop(['date_shop_category_avg_item_cnt'], axis=1, inplace=True)
# date_shop_bigcat_avg_item_cnt
date_shop_bigcat_avg_item_cnt = data_set.groupby(['date_block_num', 'shop_id', 'big_category_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_shop_bigcat_avg_item_cnt'})
data_set = pd.merge(data_set, date_shop_bigcat_avg_item_cnt, on=['date_block_num', 'shop_id', 'big_category_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_shop_bigcat_avg_item_cnt')
data_set.drop(['date_shop_bigcat_avg_item_cnt'], axis=1, inplace=True)
# date_shop_subcat_avg_item_cnt
date_shop_subcat_avg_item_cnt = data_set.groupby(['date_block_num', 'shop_id', 'sub_category_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_shop_subcat_avg_item_cnt'})
data_set = pd.merge(data_set, date_shop_subcat_avg_item_cnt, on=['date_block_num', 'shop_id', 'sub_category_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_shop_subcat_avg_item_cnt')
data_set.drop(['date_shop_subcat_avg_item_cnt'], axis=1, inplace=True)
# date_bigcat_avg_item_cnt
date_bigcat_avg_item_cnt = data_set.groupby(['date_block_num', 'big_category_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_bigcat_avg_item_cnt'})
data_set = pd.merge(data_set, date_bigcat_avg_item_cnt, on=['date_block_num', 'big_category_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_bigcat_avg_item_cnt')
data_set.drop(['date_bigcat_avg_item_cnt'], axis=1, inplace=True)
# date_subcat_avg_item_cnt
date_subcat_avg_item_cnt = data_set.groupby(['date_block_num', 'sub_category_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_subcat_avg_item_cnt'})
data_set = pd.merge(data_set, date_subcat_avg_item_cnt, on=['date_block_num', 'sub_category_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_subcat_avg_item_cnt')
data_set.drop(['date_subcat_avg_item_cnt'], axis=1, inplace=True)
# date_city_avg_item_cnt
date_city_avg_item_cnt = data_set.groupby(['date_block_num', 'city_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_city_avg_item_cnt'})
data_set = pd.merge(data_set, date_city_avg_item_cnt, on=['date_block_num', 'city_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_city_avg_item_cnt')
data_set.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)
# date_item_city_avg_item_cnt
date_item_city_avg_item_cnt = data_set.groupby(['date_block_num', 'item_id', 'city_id'], as_index=False)['item_cnt_month'].agg('mean').rename(columns={'item_cnt_month': 'date_item_city_avg_item_cnt'})
data_set = pd.merge(data_set, date_item_city_avg_item_cnt, on=['date_block_num', 'item_id', 'city_id'], how='left')
data_set = feature_lags(df=data_set, lags=[1], lag_col='date_item_city_avg_item_cnt')
data_set.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)
# item_avg_item_price
item_avg_item_price = day_train_data.groupby(['item_id'], as_index=False)['item_price'].agg('mean').rename(columns={'item_price': 'item_avg_item_price'})
data_set = pd.merge(data_set, item_avg_item_price, on=['item_id'], how='left')
# date_item_avg_item_price
date_item_avg_item_price = day_train_data.groupby(['date_block_num', 'item_id'], as_index=False)['item_price'].agg('mean').rename(columns={'item_price': 'date_item_avg_item_price'})
data_set = pd.merge(data_set, date_item_avg_item_price, on=['date_block_num', 'item_id'], how='left')
lags = [1, 2, 3, 4, 5, 6]
data_set = feature_lags(df=data_set, lags=lags, lag_col='date_item_avg_item_price')
for lag in lags:
    data_set['delta_price_lag_' + str(lag)] = \
        (data_set['date_item_avg_item_price_lag_' + str(lag)] - data_set['item_avg_item_price']) / data_set['item_avg_item_price']
# Return the first available lagged price delta (remaining NaNs are filled with 0 afterwards)
def select_trend(row):
    for lag in lags:
        if row['delta_price_lag_' + str(lag)]:
            return row['delta_price_lag_' + str(lag)]
    return 0
data_set['delta_price_lag'] = data_set.apply(select_trend, axis=1)
data_set['delta_price_lag'].fillna(0, inplace=True)
#Delete unused data
features_to_drop = ['item_avg_item_price', 'date_item_avg_item_price']
for lag in lags:
    features_to_drop += ['date_item_avg_item_price_lag_' + str(lag)]
    features_to_drop += ['delta_price_lag_' + str(lag)]
data_set.drop(features_to_drop, axis=1, inplace=True)
date_shop_sales = day_train_data.groupby(['date_block_num', 'shop_id'], as_index=False)['item_sales_day'].agg('sum').rename(columns={'item_sales_day': 'date_shop_sales'})
data_set = pd.merge(data_set, date_shop_sales, on=['date_block_num', 'shop_id'], how='left')
shop_avg_sales = day_train_data.groupby(['shop_id'], as_index=False)['item_sales_day'].agg('mean').rename(columns={'item_sales_day': 'shop_avg_sales'})
data_set = pd.merge(data_set, shop_avg_sales, on=['shop_id'], how='left')
data_set['delta_sales'] = (data_set['date_shop_sales'] - data_set['shop_avg_sales']) / data_set['shop_avg_sales']
data_set = feature_lags(df=data_set, lags=[1], lag_col='delta_sales')
data_set.drop(['date_shop_sales', 'shop_avg_sales', 'delta_sales'], axis=1, inplace=True)
data_set['month'] = data_set['date_block_num'] % 12
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
data_set['day'] = data_set['month'].map(days)
data_set['day'] = data_set['day'].astype(np.uint8)
Fill the NaNs produced by the merges with 0 before splitting data off from data_set.
data_set.fillna(0, inplace=True)
This completes data_set. Here is its info. Without the casting above, it would take more than twice as much memory.
data_set.info()
""" out
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11128004 entries, 0 to 11128003
Data columns (total 29 columns):
date_block_num uint8
shop_id uint8
item_id uint16
item_cnt_month float16
ID uint32
city_id uint8
item_category_id uint8
big_category_id uint8
sub_category_id uint8
item_cnt_month_lag_1 float16
item_cnt_month_lag_2 float16
item_cnt_month_lag_3 float16
item_cnt_month_lag_6 float16
item_cnt_month_lag_12 float16
date_avg_item_cnt_lag_1 float16
date_item_avg_item_cnt_lag_1 float16
date_shop_avg_item_cnt_lag_1 float16
date_category_avg_item_cnt_lag_1 float16
date_shop_category_avg_item_cnt_lag_1 float16
date_shop_bigcat_avg_item_cnt_lag_1 float16
date_shop_subcat_avg_item_cnt_lag_1 float16
date_bigcat_avg_item_cnt_lag_1 float16
date_subcat_avg_item_cnt_lag_1 float16
date_city_avg_item_cnt_lag_1 float16
date_item_city_avg_item_cnt_lag_1 float16
delta_price_lag float64
delta_sales_lag_1 float64
month uint8
day uint8
dtypes: float16(17), float64(2), uint16(1), uint32(1), uint8(8)
memory usage: 764.1 MB
"""
Next, create train_data and test_data by splitting data_set on date_block_num (month 34 is the test month).
train_data = data_set[data_set['date_block_num'] < 34]
test_data = data_set[data_set['date_block_num'] == 34]
target_col = 'item_cnt_month'
exclude_cols = ['ID', target_col]
feature_cols = []
for col in train_data.columns:
    if col not in exclude_cols:
        feature_cols.append(col)
x_train = train_data[train_data['date_block_num'] <= 32][feature_cols]
y_train = train_data[train_data['date_block_num'] <= 32][target_col]
x_val = train_data[train_data['date_block_num'] == 33][feature_cols]
y_val = train_data[train_data['date_block_num'] == 33][target_col]
x_test = test_data[feature_cols]
First of all, I would like to do it by default without tuning.
ts = time.time()
gbm = LGBMRegressor(n_jobs=-1)
gbm.fit(x_train, y_train,
        eval_metric='rmse',
        eval_set=(x_val, y_val))
time.time() - ts
# rmse: 0.948226
# time: 23.153272
The rmse was 0.948226 without tuning. Training took less than 30 seconds, much faster than Random Forest.
Next, I would like to change max_depth and n_estimators.
ts = time.time()
gbm = LGBMRegressor(n_estimators=10000, max_depth=10, n_jobs=-1)
# Note: in LightGBM >= 4.0, early_stopping_rounds/verbose were removed from fit();
# use callbacks=[lightgbm.early_stopping(20), lightgbm.log_evaluation(1)] instead.
gbm.fit(x_train, y_train,
        early_stopping_rounds=20,
        eval_metric='rmse',
        eval_set=(x_val, y_val),
        verbose=1)
time.time() - ts
# Early stopping, best iteration is: [105] valid_0's rmse: 0.943153
# time: 30.228255
The best score was at iteration 105. The rmse does not change much.
Let's visualize the feature importances. date_item_avg_item_cnt_lag_1 contributes the most.
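The plot itself can be drawn with the plot_importance imported at the top; a minimal sketch (max_num_features is just an illustrative choice):
# Plot the learned feature importances
plot_importance(gbm, max_num_features=20, figsize=(10, 8))
plt.show()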
I submitted the results of the prediction with this model to kaggle.
The result was 0.96079.
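For reference, the submission was produced roughly as follows; this is a sketch assuming the competition's submission format (ID, item_cnt_month) and that test_data keeps the original ID order from month_test_data.
# Predict the test month, clip like the target, and write the submission file
y_test = gbm.predict(x_test).clip(0, 20)
submission = pd.DataFrame({'ID': test_data['ID'].astype(np.uint32),
                           'item_cnt_month': y_test})
submission.to_csv('submission.csv', index=False)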
This time, I built the dataset by referring to Kaggle Notebooks and made a prediction.
I haven't tuned the hyperparameters yet, so next time I would like to tune them with Optuna's LightGBM Tuner.