A write-up of the 5th place solution to the Kaggle Quora Question Pairs competition.
Title: [5th] 5th Place Solution Summary
Author: Faron, KazAnova
Discussion URL: https://www.kaggle.com/c/quora-question-pairs/discussion/34349
Code (attached to forum): https://kaggle2.blob.core.windows.net/forum-message-attachments/190488/6625/mark_dodgie_qs_in_test.py
Summary
--Trained XGBoost on more than 600 features
--25 or more of those features are predictions from other models (LightGBM, NN, LSTM, SGD)
--Oversampled so that the ratio of positive examples is about 0.13 (sketched below)
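The oversampling step changes the class prior of the training set. A minimal sketch of one way to reach a target positive ratio, assuming a pandas DataFrame with an `is_duplicate` column (the column name and helper are illustrative, not from the write-up):

```python
import pandas as pd

def rebalance(train: pd.DataFrame, target_pos: float = 0.13, seed: int = 0) -> pd.DataFrame:
    """Oversample negative pairs until positives make up ~target_pos of the rows."""
    pos = train[train["is_duplicate"] == 1]
    neg = train[train["is_duplicate"] == 0]
    # Solve len(pos) / (len(pos) + n_neg) == target_pos for n_neg.
    n_neg = int(len(pos) * (1 - target_pos) / target_pos)
    if n_neg > len(neg):  # duplicate random negatives to reach the target ratio
        extra = neg.sample(n_neg - len(neg), replace=True, random_state=seed)
        neg = pd.concat([neg, extra])
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle
```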
NLP
Features were extracted by various methods:
--Extracted from preprocessed text
  --Original question
  --Stemmed
  --Cleaned text
  --Stop words only
  --Stop words removed
--Extracted by aggregating tokens (see the sketch after this list)
  --Common / non-common tokens
  --Number of tokens
  --Longest substring common to both questions
  --Grammar and punctuation errors
--Pretrained GloVe embeddings
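A hedged sketch of the token-aggregation features above (shared and non-shared tokens, token counts, longest common substring); the exact definitions used by the team may differ:

```python
from difflib import SequenceMatcher

def token_features(q1: str, q2: str) -> dict:
    """Simple token-overlap features for a question pair."""
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    shared = t1 & t2
    # Longest common substring, measured in characters.
    m = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
    return {
        "n_tokens_q1": len(t1),
        "n_tokens_q2": len(t2),
        "n_shared_tokens": len(shared),
        "n_nonshared_tokens": len(t1 ^ t2),
        "shared_ratio": len(shared) / max(1, len(t1 | t2)),
        "longest_common_substring": m.size,
    }

print(token_features("How do I learn Python?", "How can I learn Python fast?"))
```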
Graph features, treating each question as a node and each (question 1, question 2) pair as an edge, were valuable (a sketch follows the list below).
--Number of neighboring questions shared by both questions
--Number of unique questions
--Number of paths of length n between question 1 and question 2
--Size of the maximum clique
--Number of connected components
--If y(q1, q3) = y(q2, q3) = a, then y(q1, q2) = a
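A minimal sketch of such graph features using networkx (an assumption; the team's tooling is not specified in the write-up):

```python
import networkx as nx

def build_graph(pairs):
    """pairs: iterable of (q1_id, q2_id) tuples from train + test."""
    g = nx.Graph()
    g.add_edges_from(pairs)
    return g

def graph_features(g, q1, q2, max_len=3):
    common = set(g[q1]) & set(g[q2])  # questions paired with both q1 and q2
    # Simple paths of length <= max_len edges between the two questions.
    n_paths = sum(1 for _ in nx.all_simple_paths(g, q1, q2, cutoff=max_len))
    # Any clique containing both endpoints lies inside their common neighborhood,
    # so the clique search can be restricted to that small subgraph.
    sub = g.subgraph(common | {q1, q2})
    max_clique = max((len(c) for c in nx.find_cliques(sub)
                      if q1 in c and q2 in c), default=0)
    return {
        "common_neighbors": len(common),
        "component_size": len(nx.node_connected_component(g, q1)),
        "n_paths": n_paths,
        "max_clique": max_clique,
    }
```

The "number of connected components" in the list is a global statistic, available as `nx.number_connected_components(g)`; the sketch returns the size of the pair's own component instead.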
Because the negative examples in the dataset were constructed artificially, these graph-based features also led to improvement.
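The transitivity rule in the list above can be turned into a feature by checking labeled training pairs through a shared third question q3. A sketch, with illustrative data structures (a dict of pair labels and an adjacency map, both assumptions):

```python
def transitive_label(labels, adj, q1, q2):
    """labels: {frozenset({a, b}): 0 or 1} for labeled train pairs.
    adj: question id -> set of neighboring question ids.
    Returns the label implied by transitivity via some q3, else None."""
    for q3 in adj.get(q1, set()) & adj.get(q2, set()):
        a = labels.get(frozenset({q1, q3}))
        b = labels.get(frozenset({q2, q3}))
        if a is not None and a == b:
            return a  # y(q1, q3) = y(q2, q3) = a  =>  y(q1, q2) = a
    return None
```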