Introduction

Data science and machine learning looked interesting, but how should I study them? Starting from that question, I began studying in February and entered a Kaggle competition in March, hoping to practice what I was learning as I went. As a result, I won a *** silver medal (top 3%) ***! In this article I summarize how I studied and how the competition went, so I hope it is useful as one example.

Overview

➀ Introduction of the competition
➁ Flow until the end of the competition (before joining the competition → after joining the competition)
➂ Other things I was studying during the competition

The competition I participated in this time

M5 Forecasting - Accuracy (March-June 2020)
The competition we worked on was a tabular competition on time-series data. The task was product sales forecasting for Walmart, a major retailer in the United States: given data for the past five years, we had to forecast sales for the following month (28 days).
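To make the shape of the task concrete, here is a minimal sketch of a seasonal-naive baseline: it repeats the last week of sales to fill a 28-day horizon. This is not the method our team used; the function name, the weekly season length, and the toy numbers are all assumptions for illustration.

```python
from typing import List, Sequence


def seasonal_naive_forecast(
    history: Sequence[float], horizon: int = 28, season: int = 7
) -> List[float]:
    """Forecast `horizon` future days by repeating the last `season` days.

    Hypothetical baseline for illustration only.
    """
    if len(history) < season:
        raise ValueError("history must contain at least one full season")
    last_season = list(history[-season:])
    # Day i of the forecast copies the matching weekday of the last observed week.
    return [last_season[i % season] for i in range(horizon)]


# Toy daily sales for one product (made-up numbers, two weeks long).
daily_sales = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
forecast = seasonal_naive_forecast(daily_sales)
```

A baseline like this is mainly useful as a reference point: a model that cannot beat it on a held-out 28-day period probably has a problem.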

(My impression: the amount of data was large, so this competition also made me think about writing efficient programs, including how memory is used.)
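One common way to reduce memory use with pandas is to store numeric columns in the smallest dtype that still fits the values. The following is a minimal sketch under that assumption; the helper name `downcast_numeric` and the toy columns are hypothetical, and this is not necessarily what our team's code looked like.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns in the smallest fitting dtype.

    Note: downcasting floats to float32 loses some precision.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy data (made up, not the competition files).
sales = pd.DataFrame({
    "units_sold": np.arange(1000, dtype=np.int64) % 50,
    "sell_price": np.linspace(0.5, 20.0, 1000),
})
small = downcast_numeric(sales)

before = sales.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"memory: {before} bytes -> {after} bytes")
```

On a table with many millions of rows, the same idea can make the difference between fitting in memory and not, at the cost of reduced float precision.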

Timeline

During the four-month competition, from March to the end of June, the timeline was as follows.
3/3 Competition starts
↓
Mid-March: we join the competition
↓
6/1 Ground-truth data for the Public Leaderboard period is released (from this point, even if you submit predictions to Kaggle, you cannot see your relative ranking, so you are effectively blindfolded!)
↓
6/30 Competition ends

Result

As I mentioned at the beginning of the article, the result was *** 114th (top 3%) ***! (I was simply happy!)
・ Number of entries: 88,742
・ Number of teams: 5,558

Before joining the competition (starting to study)

I had experience with other programming languages, but at that point I had never used Python, so I started from a state of "What is pandas? What is matplotlib?", not knowing even basic libraries like those. I treated this period mainly as a time to absorb the basics of machine learning and Python. The details are summarized in another article (Study method for learning machine learning from scratch), so please refer to it for more; here is a brief summary.

I looked for books and sites that cover the following three points and studied from them.

➀ Acquire basic knowledge of machine learning (understanding the words and terms)

→ A Textbook for Thoroughly Understanding the Mechanisms and Techniques of Machine Learning & Deep Learning in One Volume (book)

➁ Understand how to use the libraries that are essential in data science (numpy, pandas, matplotlib, etc.)

→ Introduction to Python for Data Science

➂ An introductory book for actually taking on a competition (Kaggle)

→ Practical Data Science Series: Kaggle Start Book Beginning with Python (book)

After participating in the M5 competition

Four of us entered as a team of friends, but it was the first medal-eligible Kaggle competition for all of us, and a long one, so we felt our way forward. With the help of the other members, I took on the role of task and progress management for the team, so below I summarize the overall flow we followed and the tools we used.

Overall, what the team was doing

・ Weekly meeting
We shared progress, exchanged information such as notebooks, asked questions about things we did not understand, and discussed ideas. I think being able to do this once a week is very effective in a long competition like this one. We could also talk about our future direction and task management here, which made it much easier to move forward.

・ Slack
Rather than waiting for the weekly meeting, whenever something caught our attention we shared and discussed it on Slack. Since many members already used Slack heavily, we also linked it with the tools below so that notifications arrived in one place and we could focus more on Kaggle!

・ Trello
One of our progress management tools. Once linked with Slack, progress written in Trello is posted to Slack, and you can also update Trello from Slack.

・ GitHub
Used for code sharing and so on. Once linked with Slack, each push also posts the commit message to Slack (convenient, because you do not have to report every push yourself!). Beyond that, we shared large data files via Drive.

(Early stage) Focus on EDA / preprocessing (everyone)

I think this is basic in any competition, but first I tried to understand the data. We joined a few weeks after the competition started, so other participants' notebooks had already accumulated; I read them and did my own analysis alongside.

(How to fight as a beginner) What I kept in mind at this stage was the goal I had set before starting Kaggle: to consolidate what I had been studying and put it into practice. With books and sites like the ones introduced above, you can learn the basic flow and usage with Titanic as the example, but some things do not sink in until you practice them. As I worked out how to visualize and process the data I was interested in, I went back to those books and sites and asked the other members.

(Middle stage) Division of labor between EDA / preprocessing and model building (2:2)

After everyone had worked on preprocessing and gained a rough overview of the data, we decided to split roles between members who would continue preprocessing and members who would move on to model building. I did not know how to pace the work, such as how much preprocessing was enough, and in the end we could not make use of the features we devised here, which is something I want to improve next time. Personally, though, I think splitting roles made my own tasks clearer and let me concentrate.

(How to fight as a beginner) In the middle stage I focused on preprocessing, because I was not yet used to it and wanted a little more experience. EDA / preprocessing covers a lot of ground, but I mainly studied and practiced data visualization and analysis in the early stage, and the data processing that follows from it from the middle stage onward.

(Late stage) Emphasis on model building while continuing preprocessing (1:3)

Entering the final stage, I felt the time pressure, so once our approach was settled, I also joined model building, even though I had mainly been doing preprocessing. At the very end in particular, we were blindfolded and could not see any movement in the rankings, so we kept going by trial and error without knowing where we stood.

(How to fight as a beginner) As described later, I had some experience building models through AtmaCup, but I had never worked with time-series data and there was a lot I did not understand, so it was great to be able to work on it while learning from the other members. There was also the question of which periods to use, and for this part I relied heavily on the members who had already been working on it. Next time I want to be able to build multiple models and use multiple validation periods myself.
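The idea of taking multiple validation (verification) periods can be sketched as follows: hold out several consecutive 28-day windows at the end of the training data and evaluate a model on each. This is a minimal sketch under my own assumptions; the function name, the 1-based day indexing, and the example length of 100 days are hypothetical and do not describe our team's actual setup.

```python
from typing import List, Tuple


def rolling_holdout_windows(
    n_days: int, horizon: int = 28, n_folds: int = 3
) -> List[Tuple[int, int, int]]:
    """Return (train_end, valid_start, valid_end) day indices per fold.

    Days are 1-based and inclusive; the most recent fold comes first.
    Each fold trains on days 1..train_end and validates on the next `horizon` days.
    """
    windows = []
    for fold in range(n_folds):
        valid_end = n_days - fold * horizon
        valid_start = valid_end - horizon + 1
        train_end = valid_start - 1
        if train_end < 1:
            raise ValueError("not enough history for the requested folds")
        windows.append((train_end, valid_start, valid_end))
    return windows


# Example with a made-up history of 100 days.
folds = rolling_holdout_windows(n_days=100, horizon=28, n_folds=3)
for train_end, valid_start, valid_end in folds:
    print(f"train: 1..{train_end}, validate: {valid_start}..{valid_end}")
```

Averaging the score over several such windows tends to give a steadier estimate than relying on a single hold-out period, which matters when the leaderboard gives no feedback.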

Summary

Compared with entering a competition alone, I felt that entering as a team widens the range of what you can do, but I also felt that moving forward as a team has its own difficulties. This was my first medal-eligible Kaggle competition, and I think the following were the advantages of forming a team.

・ You can feel free to ask about things you do not understand.
・ Because roles can be divided, you can concentrate on what you want to study and work through it in order, and roles can also be assigned by specialty. (If you start alone, you have to implement everything while studying both preprocessing and model building, which I think is quite hard.)
・ Because approaches are proposed from different perspectives, you end up with more ideas and methods than when thinking alone.

Finally, as a newcomer to data science I do not think I could have reached this medal on my own, so I would like to thank the friends who entered the Kaggle competition with me and answered my questions. Thank you again!

What I was studying in parallel during the competition

This is a little outside the main story, so treat it as a bonus: lastly, I will introduce the other things I was studying during this competition. I plan to write these up in separate articles as part of my output, so I only summarize them briefly here. I mainly worked on the following three things.

Machine Learning (Coursera) (April-May 2020)

I had come to understand how to use the libraries, and could use some of them in practice, but I still did not understand parts of how machine learning actually works, so I took this course to learn the theory. I was able to study the theory using MATLAB, and I am glad it deepened my understanding of machine learning.

Deep Learning Basic Course (University of Tokyo, Matsuo Lab) (April 2020-)

I did not use it in this competition, but I wanted to understand deep learning, so I took this course. In addition to the basics of machine learning and deep learning, it covers CNNs, RNNs, reinforcement learning, VAEs, and more through lectures, and the exercises that follow let you deepen your understanding through practice. (Still in progress.)

AtmaCup #5 (May 29-June 6, 2020)

I entered this competition because I wanted to go through the whole sequence of EDA, preprocessing, and model building by myself. It was a table + signal data competition without time series. The analysis itself was difficult, but several top-ranked Japanese Kagglers took part, and I learned a lot from the solution sharing and the retrospective session after the competition.

How I won a silver medal (top 3%) in a competition I entered one month after starting data science!