A write-up of the 5th place solution to the Kaggle Quora Question Pairs competition.
Title: [5th] 5th Place Solution Summary
Author: Faron, KazAnova
Discussion URL: https://www.kaggle.com/c/quora-question-pairs/discussion/34349
Code (attached to forum): https://kaggle2.blob.core.windows.net/forum-message-attachments/190488/6625/mark_dodgie_qs_in_test.py
Summary
--Trained XGBoost on more than 600 features
--25 or more of those features are predictions from other models (LightGBM, NN, LSTM, SGD)
--Oversampled so that the ratio of positive examples is about 0.13 (sketched below)
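The oversampling step changes the class prior of the training set. A minimal sketch of one way to reach a target positive ratio, assuming a pandas DataFrame with an `is_duplicate` column (the column name and helper are illustrative, not from the write-up):

```python
import pandas as pd

def rebalance(train: pd.DataFrame, target_pos: float = 0.13, seed: int = 0) -> pd.DataFrame:
    """Oversample negative pairs until positives make up ~target_pos of the rows."""
    pos = train[train["is_duplicate"] == 1]
    neg = train[train["is_duplicate"] == 0]
    # Solve len(pos) / (len(pos) + n_neg) == target_pos for n_neg.
    n_neg = int(len(pos) * (1 - target_pos) / target_pos)
    if n_neg > len(neg):  # duplicate random negatives to reach the target ratio
        extra = neg.sample(n_neg - len(neg), replace=True, random_state=seed)
        neg = pd.concat([neg, extra])
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle
```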
NLP
Features were extracted by various methods:
--Extracted from preprocessed text
  --Original question
  --Stemmed
  --Cleaned text
  --Stop words only
  --Stop words removed
--Extracted by aggregating tokens (see the sketch after this list)
  --Common / non-common tokens
  --Number of tokens
  --Longest substring common to both questions
  --Grammar and punctuation errors
--Pretrained GloVe embeddings
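A hedged sketch of the token-aggregation features above (shared and non-shared tokens, token counts, longest common substring); the exact definitions used by the team may differ:

```python
from difflib import SequenceMatcher

def token_features(q1: str, q2: str) -> dict:
    """Simple token-overlap features for a question pair."""
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    shared = t1 & t2
    # Longest common substring, measured in characters.
    m = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
    return {
        "n_tokens_q1": len(t1),
        "n_tokens_q2": len(t2),
        "n_shared_tokens": len(shared),
        "n_nonshared_tokens": len(t1 ^ t2),
        "shared_ratio": len(shared) / max(1, len(t1 | t2)),
        "longest_common_substring": m.size,
    }

print(token_features("How do I learn Python?", "How can I learn Python fast?"))
```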
Graph features, treating each question as a node and each (question 1, question 2) pair as an edge, were valuable (a sketch follows the list below).
--Number of neighboring questions shared by both questions
--Number of unique questions
--Number of paths of length n between question 1 and question 2
--Size of the maximum clique
--Number of connected components
--If y(q1, q3) = y(q2, q3) = a, then y(q1, q2) = a
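A minimal sketch of such graph features using networkx (an assumption; the team's tooling is not specified in the write-up):

```python
import networkx as nx

def build_graph(pairs):
    """pairs: iterable of (q1_id, q2_id) tuples from train + test."""
    g = nx.Graph()
    g.add_edges_from(pairs)
    return g

def graph_features(g, q1, q2, max_len=3):
    common = set(g[q1]) & set(g[q2])  # questions paired with both q1 and q2
    # Simple paths of length <= max_len edges between the two questions.
    n_paths = sum(1 for _ in nx.all_simple_paths(g, q1, q2, cutoff=max_len))
    # Any clique containing both endpoints lies inside their common neighborhood,
    # so the clique search can be restricted to that small subgraph.
    sub = g.subgraph(common | {q1, q2})
    max_clique = max((len(c) for c in nx.find_cliques(sub)
                      if q1 in c and q2 in c), default=0)
    return {
        "common_neighbors": len(common),
        "component_size": len(nx.node_connected_component(g, q1)),
        "n_paths": n_paths,
        "max_clique": max_clique,
    }
```

The "number of connected components" in the list is a global statistic, available as `nx.number_connected_components(g)`; the sketch returns the size of the pair's own component instead.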
Because the negative examples in the dataset were constructed artificially, these graph-based features also led to improvement.
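The transitivity rule in the list above can be turned into a feature by checking labeled training pairs through a shared third question q3. A sketch, with illustrative data structures (a dict of pair labels and an adjacency map, both assumptions):

```python
def transitive_label(labels, adj, q1, q2):
    """labels: {frozenset({a, b}): 0 or 1} for labeled train pairs.
    adj: question id -> set of neighboring question ids.
    Returns the label implied by transitivity via some q3, else None."""
    for q3 in adj.get(q1, set()) & adj.get(q2, set()):
        a = labels.get(frozenset({q1, q3}))
        b = labels.get(frozenset({q2, q3}))
        if a is not None and a == b:
            return a  # y(q1, q3) = y(q2, q3) = a  =>  y(q1, q2) = a
    return None
```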