[For beginners] Kaggle exercise (Mercari)

As a training exercise, I worked through a past Kaggle competition and wrote up a brief summary.

Using Mercari's product listing information, we will predict the price with Ridge regression.

## 1. Preparing the modules


import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error

## 2. Data preparation

Read the data.


train = pd.read_csv('train.tsv', sep='\t')
test = pd.read_csv('test.tsv', sep='\t')

Check the number of rows and columns.


print(train.shape)
print(test.shape)

# (1482535, 8)
# (693359, 7)

Combine train and test data.


all_data = pd.concat([train, test])
all_data.head()

Check the basic information of the data.


# show_counts replaces null_counts, which was deprecated in pandas 1.2 and removed in 2.0
all_data.info(show_counts=True)

'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175894 entries, 0 to 693358
Data columns (total 9 columns):
brand_name           1247687 non-null object
category_name        2166509 non-null object
item_condition_id    2175894 non-null int64
item_description     2175890 non-null object
name                 2175894 non-null object
price                1482535 non-null float64
shipping             2175894 non-null int64
test_id              693359 non-null float64
train_id             1482535 non-null float64
dtypes: float64(3), int64(2), object(4)
memory usage: 166.0+ MB
'''

Check the number of unique values in each text column (duplicates are counted only once).


print(all_data.brand_name.nunique())
print(all_data.category_name.nunique())
print(all_data.name.nunique())
print(all_data.item_description.nunique())

# 5289
# 1310
# 1750617
# 1862037

## 3. Preprocessing

Preprocess the data column by column.

Since much of the data is text this time, we vectorize it with bag-of-words (BoW) and TF-IDF.
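To make the difference between the two vectorizers concrete, here is a minimal sketch on three made-up listing titles (the `docs` list is hypothetical and not taken from the Mercari data). It assumes the scikit-learn defaults for both vectorizers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy listings, not taken from the Mercari data.
docs = ["red nike shoes", "blue nike shirt", "red red dress"]

# Bag-of-words: each column is a vocabulary word, each value a raw count.
cv = CountVectorizer()
bow = cv.fit_transform(docs)
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)  # words in column order
print(vocab)
print(bow.toarray())

# TF-IDF: counts are reweighted so that words appearing in many documents
# count for less, then each row is L2-normalized (the scikit-learn default).
tv = TfidfVectorizer()
tfidf = tv.fit_transform(docs)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(tfidf.toarray().round(2))
print(row_norms)
```

In the first row, "shoes" appears in only one document, so it should receive a larger TF-IDF weight than "nike", which appears in two.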

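Vectorizing text this way, and one-hot encoding the other label features, produces a very large number of columns, most of them 0. We therefore store everything as sparse matrices (matrices that are mostly zeros), which keep only the non-zero entries and so save memory.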
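The memory saving can be checked with a small sketch. The labels below are randomly generated stand-ins (an assumption for illustration, not the real categories), and the byte counts are computed from the arrays themselves.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy labels: 5,000 rows drawn from up to 200 distinct categories.
rng = np.random.default_rng(0)
labels = rng.integers(0, 200, size=5_000).astype(str)

# Dense one-hot output stores every zero explicitly.
dense = LabelBinarizer().fit_transform(labels)

# sparse_output=True returns a SciPy sparse matrix that stores only the non-zeros.
sparse = LabelBinarizer(sparse_output=True).fit_transform(labels).tocsr()

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense.shape, sparse.shape)
print(f"dense: {dense_bytes:,} bytes, sparse: {sparse_bytes:,} bytes")
```

Each row contains exactly one non-zero value, so the sparse version only needs storage proportional to the number of rows rather than rows times categories.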

# name

cv = CountVectorizer()
name = cv.fit_transform(all_data.name)


# item_description

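# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
all_data['item_description'] = all_data['item_description'].fillna('null')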

tv = TfidfVectorizer()
item_description = tv.fit_transform(all_data.item_description)


# category_name

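# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
all_data['category_name'] = all_data['category_name'].fillna('null')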

lb = LabelBinarizer(sparse_output=True)
category_name = lb.fit_transform(all_data.category_name)

# brand_name

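# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
all_data['brand_name'] = all_data['brand_name'].fillna('null')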

brand_name = lb.fit_transform(all_data.brand_name)


# item_condition_id, shipping

onehot_cols = ['item_condition_id', 'shipping']
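# get_dummies leaves integer columns unchanged by default, so name them in `columns` to one-hot encode them
onehot_data = csr_matrix(pd.get_dummies(all_data[onehot_cols], columns=onehot_cols, sparse=True))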

Finally, stack these feature matrices side by side into a single sparse matrix.


X_sparse = hstack((name, item_description, category_name, brand_name, onehot_data)).tocsr()
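As a small illustration of what `hstack` does, the sketch below joins two hypothetical feature blocks (made-up values) that describe the same three rows.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical feature blocks for the same 3 rows (items).
text_features = csr_matrix(np.array([[1, 0, 2], [0, 1, 0], [0, 0, 1]]))
category_features = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))

# hstack concatenates column-wise, so every block must have the same row count.
combined = hstack((text_features, category_features)).tocsr()

print(combined.shape)      # rows stay at 3, columns add up to 3 + 2
print(combined.toarray())
```

Converting to CSR afterwards matters for the next step, because CSR supports efficient row slicing.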

## 4. Creating a model

The combined all_data contains both train and test rows, but only the train rows have the target variable (price). Slice X so that it has the same number of rows as y (= the number of rows in the train data).

nrows = train.shape[0]
X = X_sparse[:nrows]
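The slicing works because `pd.concat` keeps the train rows first. A minimal sketch with hypothetical miniature frames (`train_toy` and `test_toy` are made up for illustration):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical miniature train/test frames; only train has the target column.
train_toy = pd.DataFrame({"shipping": [1, 0, 1], "price": [10.0, 25.0, 8.0]})
test_toy = pd.DataFrame({"shipping": [0, 1]})

all_toy = pd.concat([train_toy, test_toy])
features = csr_matrix(all_toy[["shipping"]].to_numpy())

# concat keeps train rows first, so the first len(train_toy) rows are train data.
n_train = train_toy.shape[0]
X_toy = features[:n_train]        # rows that have a known price
X_submit = features[n_train:]     # rows to predict for a submission
y_toy = train_toy["price"]

print(X_toy.shape, X_submit.shape, y_toy.shape)
```

The remaining rows (`X_submit` here) are the ones you would feed to the fitted model when producing a competition submission.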

Because y (price) is spread over a wide range of values, this can distort the predictions. Standardization would also work, but this time we apply a log transform.

The transform used is $\log(y + 1)$, so that it is well defined even when y is 0.


y = np.log1p(train.price)
y[:5]

'''
0    2.397895
1    3.970292
2    2.397895
3    3.583519
4    3.806662
Name: price, dtype: float64
'''
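The following sketch shows the transform and its inverse on a few hypothetical prices (chosen for illustration). `np.log1p` computes log(1 + x) and `np.expm1` computes exp(x) - 1, so the two undo each other.

```python
import numpy as np

# Hypothetical prices, including 0 to show why plain log would fail.
prices = np.array([0.0, 10.0, 52.0, 2000.0])

log_prices = np.log1p(prices)      # log(1 + price); log1p(0) == 0
restored = np.expm1(log_prices)    # exp(x) - 1, the inverse of log1p

print(log_prices.round(3))
print(restored)
```

Read in reverse, a log value of 2.397895 such as the first row in the output above corresponds to a price of 10.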


# Hold out 30% of the train rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


# Ridge regression with the default settings (alpha=1.0)
ridge = Ridge()
ridge.fit(X_train, y_train)

'''
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
'''

## 5. Performance evaluation


y_pred = ridge.predict(X_test)

This time, we evaluate with RMSLE (root mean squared logarithmic error), the variant of RMSE used in the competition.

Because y was log-transformed before modeling, the transform has to be undone after prediction. This is done inside the evaluation function.


def rmse(y_test, y_pred):
    # Undo log1p with expm1, then take the root of the mean squared log error (RMSLE)
    return np.sqrt(mean_squared_log_error(np.expm1(y_test), np.expm1(y_pred)))


rmse(y_test, y_pred)

# 0.4745184301527575
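One way to see what this metric measures: since `mean_squared_log_error` applies log(1 + x) to both inputs, RMSLE on the restored prices should match plain RMSE on the log-scale values. The sketch below checks this on hypothetical numbers (the arrays are made up, not model output).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Hypothetical true and predicted values, already on the log1p scale.
y_true_log = np.log1p(np.array([10.0, 52.0, 35.0, 44.0]))
y_pred_log = y_true_log + np.array([0.1, -0.2, 0.05, 0.3])

# Route 1: undo the log transform, then let scikit-learn apply log1p again.
rmsle_on_prices = np.sqrt(
    mean_squared_log_error(np.expm1(y_true_log), np.expm1(y_pred_log))
)

# Route 2: plain RMSE computed directly on the log1p-scale values.
rmse_on_logs = np.sqrt(mean_squared_error(y_true_log, y_pred_log))

print(rmsle_on_prices, rmse_on_logs)
```

Note that `mean_squared_log_error` raises an error for negative inputs, so route 1 assumes every restored prediction is non-negative.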

With the above, we were able to predict prices from Mercari's product information and evaluate the result.

This article was written with beginners in mind. If you found it helpful, I would appreciate an LGTM.

Thank you for reading.