To study Kaggle, I decided to learn from the code of people who won first place in past competitions. This time the subject of study is the 1st place code from the Mercari competition (Mercari Price Suggestion Challenge).
・ Time measurement using a context manager
・ Pipeline and FunctionTransformer
・ TF-IDF, itemgetter, TfidfVectorizer
・ Good accuracy can be obtained even with a 4-layer MLP (multilayer perceptron)
・ Using partial to fix y_train and change only x_train
The goal is to create a model that predicts a reasonable price at the time of listing.
Automatically suggesting an appropriate price from the product information reduces the time and effort needed at the time of listing, which makes listing easier.
If an item is listed above Mercari's market price, it will not sell. Conversely, if it is listed below the market price, the customer loses out.
Kernel competition: you submit the source code itself to Kaggle. Once submitted, it is run on Kaggle to calculate your score. There are restrictions on computing resources and calculation time:
・ CPU: 4 cores
・ Memory: 16GB
・ Disk: 1GB
・ Time limit: 1 hour
・ GPU: None
RMSLE: Root Mean Squared Logarithmic Error. The lower the score, the smaller the error in estimating the price.
The 1st place model scores RMSLE: 0.3875.
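As a rough illustration of the metric, RMSLE is the square root of the mean squared difference between log(1 + predicted price) and log(1 + actual price). The following is a minimal plain-Python sketch; the function name `rmsle` and the sample prices are made up for illustration and are not part of the competition code.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error of two equal-length sequences."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


actual = [10.0, 25.0, 100.0]               # hypothetical true prices
perfect = rmsle(actual, actual)            # identical predictions give zero error
doubled = rmsle(actual, [2 * a for a in actual])
print(perfect, doubled)
```

Because the error is taken on the log scale, being off by a factor of two costs roughly the same for a cheap item as for an expensive one.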
Column name | Description |
---|---|
name | Product name |
item_condition_id | The condition of the product, such as used or new (1~5). The larger one is in better condition. |
category_name | Rough category/Detailed category/More detailed category |
brand_name | Brand name. Example: Nike, Apple |
price | Past selling price (USD) |
shipping | Whether the seller or the buyer pays the shipping cost. 1 -> the seller pays, 0 -> the buyer pays. |
item_description | Product details |
The submission file contains test_id and price.
・ As short as 100 lines. Simple.
・ A 4-layer MLP, and it is accurate. Were neural networks not yet widely used in this era?
・ TF-IDF. Does combining strings with df['name'].fillna('') + ' ' + df['brand_name'].fillna('') improve accuracy?
・ Standardization of y_train
・ Train 4 models on 4 cores -> ensemble
Since there is a one-hour limit, the code measures how much time is spent in each step. Each step is wrapped in a with timer(...) block, where timer is a context manager that reports the elapsed time of the block.
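The definition of timer is not shown in this article. A minimal sketch of such a context manager, assuming it only needs to print a label and the elapsed seconds, could look like this (the exact output format is an assumption):

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(name):
    """Print how long the enclosed block took, labelled with `name`."""
    t0 = time.time()
    yield  # the body of the `with` block runs here
    print(f'[{name}] done in {time.time() - t0:.0f} s')


# Usage: wrap any step whose duration should be reported.
with timer('sleep a little'):
    time.sleep(0.1)
```

Everything before yield runs when the with block is entered, and everything after it runs when the block finishes, so the timing code does not clutter each step.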
# Imports needed by this snippet; timer, preprocess, vectorizer and y_scaler are defined elsewhere in the kernel
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

with timer('process train'):
    # Load the training data
    train = pd.read_table('../input/train.tsv')
    # Drop rows whose price is $0
    train = train[train['price'] > 0].reset_index(drop=True)
    # Prepare to split the data into training and validation sets
    cv = KFold(n_splits=20, shuffle=True, random_state=42)
    # .split() returns an iterator of (training indices, validation indices) pairs;
    # next() takes only the first pair, so 1/20 of the data is held out for validation
    train_ids, valid_ids = next(cv.split(train))
    # Split into training and validation sets with the obtained indices
    train, valid = train.iloc[train_ids], train.iloc[valid_ids]
    # Reshape price into an (n, 1) column, apply log(1 + price), then standardize
    y_train = y_scaler.fit_transform(np.log1p(train['price'].values.reshape(-1, 1)))
    # Vectorize the features with the pipeline
    X_train = vectorizer.fit_transform(preprocess(train)).astype(np.float32)
    print(f'X_train: {X_train.shape} of {X_train.dtype}')
    del train
# Preprocess the validation data in the same way
with timer('process valid'):
    X_valid = vectorizer.transform(preprocess(valid)).astype(np.float32)
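The target transformation above (log(1 + price) followed by standardization) has to be undone at prediction time. The following plain-Python sketch illustrates the round trip without scikit-learn; the sample prices are made up, and the population standard deviation is used because that is what StandardScaler computes.

```python
import math
import statistics

prices = [3.0, 10.0, 25.0, 100.0]  # hypothetical prices

# Forward: log(1 + price), then scale to zero mean and unit variance.
log_prices = [math.log1p(p) for p in prices]
mean = statistics.fmean(log_prices)
std = statistics.pstdev(log_prices)  # population std (ddof=0)
y_scaled = [(v - mean) / std for v in log_prices]

# Inverse: undo the scaling, then exp(x) - 1 to get prices back.
restored = [math.expm1(v * std + mean) for v in y_scaled]
print(restored)
```

Training on the log scale matches the RMSLE metric, so a plain mean squared error loss on the transformed target optimizes the competition score directly.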
Since the brand name has missing values, they are replaced with empty strings, and then the product name and brand name are combined. To make the later TF-IDF step easier, a new column called text is also created by joining the item description, the combined name, and the category name. 'name', 'text', 'shipping', and 'item_condition_id' are used in the subsequent Pipeline processing.
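import pandas as pd
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Fill missing values with empty strings, then append the brand name to the product name
    df['name'] = df['name'].fillna('') + ' ' + df['brand_name'].fillna('')
    # Build one 'text' field from the description, the combined name and the category
    df['text'] = (df['item_description'].fillna('') + ' ' + df['name'] + ' ' + df['category_name'].fillna(''))
    return df[['name', 'text', 'shipping', 'item_condition_id']]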
The processing is built as a Pipeline so that column extraction and the TF-IDF calculation are performed as a single series of steps.
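from operator import itemgetter
from typing import Dict, List
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer as Tfidf
from sklearn.pipeline import Pipeline, make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def on_field(f: str, *vec) -> Pipeline:
    # Select column(s) f from the DataFrame, then apply the given transformers in order
    return make_pipeline(FunctionTransformer(itemgetter(f), validate=False), *vec)
def to_records(df: pd.DataFrame) -> List[Dict]:
    # Convert each row to a dict so that DictVectorizer can consume it
    return df.to_dict(orient='records')
vectorizer = make_union(
    on_field('name', Tfidf(max_features=100000, token_pattern=r'\w+')),
    on_field('text', Tfidf(max_features=100000, token_pattern=r'\w+', ngram_range=(1, 2))),
    on_field(['shipping', 'item_condition_id'],
             FunctionTransformer(to_records, validate=False), DictVectorizer()),
    n_jobs=4)
y_scaler = StandardScaler()
X_train = vectorizer.fit_transform(preprocess(train)).astype(np.float32)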
The output has 200002 columns in total: 200000 TF-IDF (Bag of Words) scores from the two text fields ('name' and 'text', up to 100000 each) plus the values of 'shipping' and 'item_condition_id'.
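To see how the column count adds up, the same pattern can be tried on a tiny DataFrame. This is only a sketch: the data and the names `toy`, `toy_vectorizer`, and `X_toy` are made up, only the 'name' text field is used, and max_features and n_jobs are left at their defaults.

```python
from operator import itemgetter

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer


def on_field(f, *vec):
    # Pick column(s) f out of the DataFrame, then run the given transformers
    return make_pipeline(FunctionTransformer(itemgetter(f), validate=False), *vec)


def to_records(df):
    # One dict per row, the input format DictVectorizer expects
    return df.to_dict(orient='records')


toy = pd.DataFrame({
    'name': ['red shirt nike', 'blue phone apple', 'red phone case'],
    'shipping': [1, 0, 1],
    'item_condition_id': [1, 3, 2],
})

toy_vectorizer = make_union(
    on_field('name', TfidfVectorizer(token_pattern=r'\w+')),
    on_field(['shipping', 'item_condition_id'],
             FunctionTransformer(to_records, validate=False), DictVectorizer()),
)
X_toy = toy_vectorizer.fit_transform(toy)
# 7 distinct words in 'name' + 2 numeric columns = 9 features for 3 rows
print(X_toy.shape)
```

make_union places the outputs of its branches side by side, which is why the feature counts of the branches simply add up.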
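Four models are trained in parallel on 4 cores using 4 threads, and their predictions are averaged as an ensemble. Two of the models are trained on binarized features and two on the TF-IDF features. During training, partial fixes y_train so that only xs changes between the calls.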
from functools import partial
from multiprocessing.pool import ThreadPool
import keras as ks
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.ConfigProto, tf.Session)
from sklearn.metrics import mean_squared_log_error

def fit_predict(xs, y_train) -> np.ndarray:
    X_train, X_test = xs
    # Limit each session to one thread; the four sessions run side by side in the thread pool
    config = tf.ConfigProto(
        intra_op_parallelism_threads=1, use_per_session_threads=1, inter_op_parallelism_threads=1)
    with tf.Session(graph=tf.Graph(), config=config) as sess, timer('fit_predict'):
        ks.backend.set_session(sess)
        # MLP design: sparse input -> 192 -> 64 -> 64 -> 1
        model_in = ks.Input(shape=(X_train.shape[1],), dtype='float32', sparse=True)
        out = ks.layers.Dense(192, activation='relu')(model_in)
        out = ks.layers.Dense(64, activation='relu')(out)
        out = ks.layers.Dense(64, activation='relu')(out)
        out = ks.layers.Dense(1)(out)
        model = ks.Model(model_in, out)
        model.compile(loss='mean_squared_error', optimizer=ks.optimizers.Adam(lr=3e-3))
        for i in range(3):  # 3 epochs
            with timer(f'epoch {i + 1}'):
                # Batch size doubles every epoch: 2048, 4096, 8192
                model.fit(x=X_train, y=y_train, batch_size=2**(11 + i), epochs=1, verbose=0)
        return model.predict(X_test)[:, 0]  # Return the predictions

with ThreadPool(processes=4) as pool:  # 4 threads
    # Binarized copies of the features (non-zero -> 1.0); np.bool was removed from NumPy, so use bool
    Xb_train, Xb_valid = [x.astype(bool).astype(np.float32) for x in [X_train, X_valid]]
    # Two models on binarized features and two on TF-IDF features
    xs = [[Xb_train, Xb_valid], [X_train, X_valid]] * 2
    # Average the predictions of the 4 models
    y_pred = np.mean(pool.map(partial(fit_predict, y_train=y_train), xs), axis=0)
# Undo the standardization and the log transform to get prices back
y_pred = np.expm1(y_scaler.inverse_transform(y_pred.reshape(-1, 1))[:, 0])
print('Valid RMSLE: {:.4f}'.format(np.sqrt(mean_squared_log_error(valid['price'], y_pred))))
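The combination of partial and ThreadPool.map can be illustrated on its own, without Keras. In this sketch, `toy_fit_predict` and all the numbers are made-up stand-ins for the real training function and data; only the calling pattern matches the code above.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def toy_fit_predict(xs, y_train):
    """Stand-in for fit_predict: 'predicts' test value + mean of y_train."""
    x_train, x_test = xs
    offset = sum(y_train) / len(y_train)
    return [x + offset for x in x_test]


y_train = [1.0, 2.0, 3.0]  # fixed for every call through partial

# Two (train, test) pairs, repeated twice -> 4 tasks, like xs in the kernel
xs = [[[0, 0, 0], [10.0, 20.0]],
      [[0, 0, 0], [30.0, 40.0]]] * 2

with ThreadPool(processes=4) as pool:
    # pool.map passes one element of xs per call; y_train is already bound
    preds = pool.map(partial(toy_fit_predict, y_train=y_train), xs)

# Element-wise average of the 4 prediction lists, like np.mean(..., axis=0)
y_pred = [sum(col) / len(col) for col in zip(*preds)]
print(y_pred)
```

pool.map only supplies a single argument per call, so binding y_train with partial avoids having to pack it into every element of xs.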
[Reference ①](https://copypaste-ds.hatenablog.com/entry/2019/02/15/170121#1-%E3%82%B7%E3%83%B3%E3%83%97%E3%83%AB%E3%81%AAMLP), Reference ②, Reference ③, Reference ④, Bronze medalist's method, Mercari website