Kaggle competition process from the perspective of score transitions

This article is the day 6 entry in The Road to AI Dojo "Kaggle" by Nikkei xTECH Business AI ① Advent Calendar 2019.

This article is for **Kaggle beginners who don't know how to approach a competition**. Looking at how scores change over the course of a competition, I describe what Kagglers are doing at each stage. As for level, it is intended for people who have learned the basics of machine learning and Kaggle from something like "What to do next after registering with Kaggle-If you do this, you can fight enough! Titanic's Introductory 10 Kernel ~" and are now thinking about taking on a competition that is actually being held.

The content of this article is based on my own experience and on what I have seen and heard; not everyone works this way.

Score transition

First, use the Kaggle API to extract each participant's score transition from the leaderboard.

kaggle competitions leaderboard <competition-name> --download

With the above command, the Submission Date and Public Score at each point where a participant updated their score can be downloaded as a CSV file.
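As a rough illustration, the downloaded file can be turned into a per-team score history like this. This is a minimal sketch, not the script used for the plots in this article: the column names `TeamName`, `SubmissionDate`, and `Score`, and the date format, are assumptions, so check the header of your own file. An inline sample stands in for the real CSV.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Inline sample standing in for the downloaded CSV.
# Column names and date format are assumptions; check your file's header.
SAMPLE_CSV = """TeamName,SubmissionDate,Score
team_a,2019-10-12 08:00:00,0.01450
team_b,2019-10-11 09:30:00,0.01400
team_a,2019-10-20 21:15:00,0.01380
team_b,2019-11-02 13:00:00,0.01310
"""

def load_score_transitions(csv_file):
    """Return {team: [(datetime, score), ...]} sorted by submission time."""
    transitions = defaultdict(list)
    for row in csv.DictReader(csv_file):
        when = datetime.strptime(row["SubmissionDate"], "%Y-%m-%d %H:%M:%S")
        transitions[row["TeamName"]].append((when, float(row["Score"])))
    for history in transitions.values():
        history.sort()
    return dict(transitions)

transitions = load_score_transitions(io.StringIO(SAMPLE_CSV))
for team, history in transitions.items():
    print(team, [score for _, score in history])
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample.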

This is my score transition for the NFL competition that ended the other day. My process in this competition, in which I unfortunately never noticed anything, can be divided into four periods.

This is the score transition of the top 5 teams on the public leaderboard. If I divide their process by guesswork, does it look something like this?

Some teams are constantly improving their scores.

Baseline construction period

This is the period for understanding the data, doing light EDA, and building a plain model with no feature engineering or other tricks. Set up an appropriate cross-validation scheme here (if possible). I think many people don't submit at this stage, but I always **submit** for comparison. One thing you learn is how large the gap is between the plain model and the top of the leaderboard. If you join partway through a competition, a kernel can serve as the baseline.
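A plain baseline with cross-validation might look like the following. This is a sketch under stated assumptions: the data is synthetic, the model is an ordinary least-squares fit chosen only to keep the example dependency-light, and RMSE is an arbitrary metric; a real competition would use its own data, model, metric, and often a different split strategy.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for shuffled K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data standing in for a competition's training set.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

fold_scores = []
for train_idx, valid_idx in kfold_indices(len(y), n_splits=5):
    # Plain least-squares fit on the raw columns: no feature engineering.
    A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    A_valid = np.column_stack([X[valid_idx], np.ones(len(valid_idx))])
    fold_scores.append(rmse(y[valid_idx], A_valid @ coef))

print("CV RMSE per fold:", np.round(fold_scores, 4))
print("CV RMSE mean:", round(float(np.mean(fold_scores)), 4))
```

Submitting the predictions of such a model gives a first pair of CV and LB scores to compare later experiments against.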

The golden age when the score goes up no matter what

In a tabular competition, feature engineering starts here. I prioritize features that are relatively easy to think of and that I expect to raise the score. The first round of parameter tuning is also done here (by the way, I'm a devotee of hand tuning). From here on, I describe tabular and image competitions separately.

table

- Create features that anyone would think of from domain knowledge (usually already listed in kernels)
- Parameter tuning (1st time)
- Aggregate features
- Frequency encoding
- Target encoding
- Clipping
- Binning
- Time series shift, diff, averaging
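Several of the feature types listed above can be sketched with pandas as follows. The toy DataFrame, the column names, the fold assignment, and the percentile and bin choices are all assumptions made for illustration; the target encoding shown is the out-of-fold variant, which is one common way to avoid leaking a row's own label into its feature.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical.
df = pd.DataFrame({
    "city":   ["a", "a", "b", "b", "b", "c"],
    "value":  [1.0, 2.0, 3.0, 50.0, 5.0, 6.0],
    "target": [1, 0, 1, 1, 0, 1],
})

# Frequency encoding: replace each category with its relative frequency.
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# Out-of-fold target encoding: each row is encoded with target means computed
# on the other folds only, so a row never sees its own label.
n_splits = 3
fold = np.arange(len(df)) % n_splits  # simple deterministic fold assignment
global_mean = df["target"].mean()
df["city_te"] = np.nan
for k in range(n_splits):
    in_fold = fold == k
    means = df.loc[~in_fold].groupby("city")["target"].mean()
    df.loc[in_fold, "city_te"] = df.loc[in_fold, "city"].map(means)
# Categories unseen in the other folds fall back to the global mean
# (a simplification for this sketch).
df["city_te"] = df["city_te"].fillna(global_mean)

# Clipping: cap outliers at chosen percentiles.
lo, hi = df["value"].quantile([0.05, 0.95])
df["value_clipped"] = df["value"].clip(lo, hi)

# Binning: quantile-based buckets.
df["value_bin"] = pd.qcut(df["value"], q=3, labels=False)

# Time series features per group: previous value, difference, rolling mean.
grouped = df.groupby("city")["value"]
df["value_lag1"] = grouped.shift(1)
df["value_diff1"] = grouped.diff(1)
df["value_roll2"] = grouped.transform(lambda s: s.rolling(2, min_periods=1).mean())

print(df)
```

The per-group time series features assume the rows are already sorted by time within each group.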

image

- Train on a relatively light network (often resnet34)
- Adjust the learning rate and batch size
- Try some scheduling
- Augmentations that look likely to work for the images at hand (I don't try ones that would visibly change the label after the transform)
- Preprocessing tricks (resizing, noise removal, background processing, etc.)
- Threshold optimization (segmentation)
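The threshold optimization mentioned for segmentation can be sketched as a grid search on validation predictions. The Dice coefficient as the metric, the candidate grid, and the synthetic probability maps are assumptions for illustration; use the competition's actual metric and your model's validation outputs.

```python
import numpy as np

def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice coefficient between two binary masks."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return float((2.0 * intersection + eps) / (pred_mask.sum() + true_mask.sum() + eps))

def best_threshold(probs, true_masks, candidates):
    """Return (threshold, mean Dice) maximizing mean Dice over validation images."""
    scores = [
        np.mean([dice_score(p > t, m) for p, m in zip(probs, true_masks)])
        for t in candidates
    ]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Synthetic validation set: noisy probabilities that are higher inside the mask.
rng = np.random.default_rng(0)
true_masks = rng.random((8, 32, 32)) < 0.3
probs = np.clip(0.35 * true_masks + rng.normal(0.3, 0.1, size=true_masks.shape), 0, 1)

threshold, score = best_threshold(probs, true_masks, np.linspace(0.1, 0.9, 17))
print(f"best threshold={threshold:.2f}, mean Dice={score:.3f}")
```

The chosen threshold is then applied unchanged to the test predictions.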

The period when I don't understand anything

This is the period when nothing works. Features you expected to help do not, CV improves but LB does not, CV does not improve but LB does: you stop understanding anything. Once a certain number of features exist, new ones tend to duplicate information that is already captured, which tends to cause overfitting (in my experience).

table

- Scour kernels and discussions thoroughly for hints.
- Try creating a lot of interaction features.
- Feature selection
- Try rebuilding the model from scratch
- Think from the decision tree's point of view
- Look for a magic feature
- Go through solutions from past competitions
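Generating many interaction features and then screening them could be sketched as below. Pairwise products are only one kind of interaction, and ranking by absolute correlation with the target is a deliberately simple screening step chosen for this example; in practice selection is usually judged by the CV score. The data and feature names are synthetic.

```python
from itertools import combinations

import numpy as np

def pairwise_interactions(X, names):
    """Return (features, names) for the product of every pair of columns."""
    cols, new_names = [], []
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append(X[:, i] * X[:, j])
        new_names.append(f"{names[i]}_x_{names[j]}")
    return np.column_stack(cols), new_names

def rank_by_abs_correlation(F, y, names):
    """Sort feature names by absolute Pearson correlation with the target."""
    corr = [abs(np.corrcoef(F[:, k], y)[0, 1]) for k in range(F.shape[1])]
    order = np.argsort(corr)[::-1]
    return [(names[k], float(corr[k])) for k in order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
names = ["f0", "f1", "f2", "f3"]
# Synthetic target that depends only on the f1 * f3 interaction.
y = X[:, 1] * X[:, 3] + rng.normal(scale=0.1, size=500)

F, f_names = pairwise_interactions(X, names)
ranking = rank_by_abs_correlation(F, y, f_names)
print(ranking[:3])
```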

image

For images, a single training run takes a long time, so I feel that I often either notice something or move into the final adjustment period before reaching the period when I don't understand anything.

- Scour kernels and discussions thoroughly for hints.
- Try mixing-type augmentations such as mixup and cutmix, and ones that appear in recent papers.
- Try augmentations that intuitively seem unlikely to work.
- Tweak the network architecture (← for images this usually doesn't help)
- Pseudo labeling (often works, but the timing is difficult because the base accuracy has to be reasonably high first)
- Use Grad-CAM etc. to check what the NN bases its decisions on.
- Try changing the loss a little
- Try other variations of image resizing
- Think from the NN's point of view
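As an example of the mixing-type augmentations mentioned above, mixup can be sketched in NumPy as follows. This assumes one-hot labels, a single mixing ratio per batch drawn from Beta(alpha, alpha), and partners taken from the same batch; `alpha=0.4` is an arbitrary illustrative value, and real training code would apply this inside the framework's data pipeline.

```python
import numpy as np

def mixup_batch(images, labels, alpha=0.4, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch.

    images: float array (batch, ...); labels: one-hot float array (batch, classes).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # one mixing ratio for the whole batch
    partner = rng.permutation(len(images))  # partner index for every sample
    mixed_images = lam * images + (1.0 - lam) * images[partner]
    mixed_labels = lam * labels + (1.0 - lam) * labels[partner]
    return mixed_images, mixed_labels, lam

rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))
labels = np.eye(3)[[0, 1, 2, 1]]
mixed_images, mixed_labels, lam = mixup_batch(images, labels, alpha=0.4, rng=rng)
print("lambda:", round(float(lam), 3))
print(mixed_labels)
```

Because the labels are mixed with the same ratio as the images, the loss has to accept soft labels.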

This period is hard because others overtake you on the leaderboard. In Freesound Audio Tagging 2019, where I won a gold medal, closely studying a public kernel was the breakthrough.

The period when I noticed something

(The arrow in the figure above is just my imagination.)

Unfortunately I never reached this period in the NFL competition, but looking at the leaderboard, quite a few people jump up suddenly. There are probably various reasons, but reading the solutions, I think the common thread is that **they look at the data closely**.

- Discover a leak
- Feature creation and optimization based on the leak
- Creation of features based on deep insight
- Normalization etc. to absorb the difference between the train and test distributions
- Understand how the decision tree feels
- Understand how the NN feels
- Find the magic feature

"Creating features based on deep insight" is hard to describe in general because it depends on the competition, but I think this article is very helpful. (Reference: Differences between ordinary data scientists and world-class data scientists)

Ensemble & final adjustment period

In this NFL competition, I had no choice but to start ensembling and final adjustments early, but my impression is that many people usually start about a week before the deadline.

Basically, **things that are known to raise the score but increase the amount of computation are often left until last**. Ensembling reliably raises the score, so unless a time limit applies, as in a kernel competition, I keep pushing until I hit it. The second round of parameter tuning also happens here. In a tabular competition, a large number of features have usually been added by this point, so the parameters should be tuned again. The learning rate is also lowered. If you have teamed up and built different models, ensembling is often very effective.

table

- Parameter tuning (2nd time)
- Decrease the learning rate of GBDT
- Ensemble (seed averaging, mix with xgboost / catboost, mix with models with different features, mix with team members' submissions)
- Stacking
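Seed averaging and a simple two-model blend from the list above can be sketched as follows. The predictions are synthetic stand-ins with independent noise, RMSE is an assumed metric, and the blend weight is grid-searched on out-of-fold predictions; real models trained with different seeds make correlated errors, so the gain is usually smaller than in this toy example.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def best_blend_weight(oof_a, oof_b, y_true, steps=101):
    """Grid-search w in [0, 1] for the blend w * a + (1 - w) * b on out-of-fold predictions."""
    weights = np.linspace(0.0, 1.0, steps)
    scores = [rmse(y_true, w * oof_a + (1 - w) * oof_b) for w in weights]
    best = int(np.argmin(scores))
    return float(weights[best]), scores[best]

rng = np.random.default_rng(0)
y = rng.normal(size=1000)

# Stand-ins for out-of-fold predictions of one model trained with 5 seeds.
seed_preds = np.stack([y + rng.normal(scale=0.5, size=y.size) for _ in range(5)])
seed_avg = seed_preds.mean(axis=0)  # seed averaging

# Stand-in for a second, different model (e.g. another GBDT library).
other_model = y + rng.normal(scale=0.3, size=y.size)

w, blended_score = best_blend_weight(seed_avg, other_model, y)
print("single seed RMSE:", round(rmse(y, seed_preds[0]), 4))
print("seed average RMSE:", round(rmse(y, seed_avg), 4))
print("other model RMSE:", round(rmse(y, other_model), 4))
print(f"blend weight={w:.2f}, blended RMSE={blended_score:.4f}")
```

The weight found on out-of-fold predictions is then applied to the corresponding test predictions.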

image

Since a single training run takes a long time for images, I think training for the ensemble often has to start early.

- Migrate to a heavier network (ResNet-101, Densenet-121 ~, inceptionv3, ResNeXt-50-32x4d ~, Wide ResNet-50-2 ~)
- Ensemble with various variations of networks
- TTA (test time augmentation)
- Snapshot ensemble
- Extend epochs (in case of insufficient training)
- Fine adjustment of parameters
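Test time augmentation from the list above can be sketched like this for classification, using a horizontal flip as the only extra view. `predict_fn` is a hypothetical callable standing in for a trained model, and `toy_predict` is a deliberately flip-sensitive toy so that the effect of averaging is visible; neither comes from a real library.

```python
import numpy as np

def predict_with_tta(predict_fn, images):
    """Average predictions over the original and horizontally flipped images.

    predict_fn: hypothetical callable mapping (batch, H, W, C) -> (batch, classes).
    """
    views = [images, images[:, :, ::-1, :]]  # flip along the width axis
    return np.mean([predict_fn(v) for v in views], axis=0)

def toy_predict(images):
    """Toy stand-in for a trained classifier: softmax over left/right half brightness."""
    half = images.shape[2] // 2
    left = images[:, :, :half, :].mean(axis=(1, 2, 3))
    right = images[:, :, half:, :].mean(axis=(1, 2, 3))
    logits = np.stack([left, right], axis=1)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
images = rng.random((4, 16, 16, 3))
plain_probs = toy_predict(images)
tta_probs = predict_with_tta(toy_predict, images)
print("without TTA:", np.round(plain_probs[0], 3))
print("with TTA:   ", np.round(tta_probs[0], 3))
```

For segmentation, the prediction for a flipped image has to be flipped back before averaging.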

This concludes the process from the start of a Kaggle competition to the final submission. Of course, not everyone follows this process, and the order of work depends on the task of the competition, but I feel that the process converges to some extent once you have experienced several competitions. I would also like to learn how stronger competitors work.

Finally, it's a bonus.

Best practices for anyone starting Kaggle

In my opinion, this route is recommended for anyone starting Kaggle.

① What to do next after registering with Kaggle-If you do this, you can fight enough! Getting Started with Titanic 10 Kernel ~. This is scheduled to be published by Kodansha in March 2020 as an introductory book on Kaggle (https://upura.hatenablog.com/entry/2019/12/04/220200).

② Copy a highly voted kernel from a past or current competition. A great kernel is a treasure trove of knowledge. Kernels aimed at beginners tend to collect a lot of votes, so choose one that has many votes and seems to explain things carefully from the beginning. If it produces a score, you can also learn the submission flow. I believe I started with "Start Here: A Gentle Introduction" from the Home Credit competition.

③ Data analysis technology that wins with Kaggle. Needless to say, this is the standard book. It is not a book for beginners at all, so I think it is better to go through the steps above first. Code is included, so for a tabular competition you can get stronger by entering a current competition with this book in hand.

At the end

Now that scattered information and the tacit knowledge of Kagglers have been collected into books, I think this is a great time to get started with Kaggle. I hope this article helps anyone who wants to get started with Kaggle.