The other day, we held an in-house training session based on Kaggle's Titanic competition. This post shares the explanatory materials and the exercises that the participants worked on. Both are published as Kaggle Notebooks, so please take a look if you are interested.
By the way, this is my first post on Qiita.
Since a training session alone cannot teach (or let you learn) everything, I felt that each participant needed to keep working on it afterwards. When I try to learn something new, I sometimes get stuck setting up the environment, so I decided to use Kaggle, which makes that step unnecessary.
--Interested in machine learning
--Has never used Kaggle
--Inexperienced in Python
--Experience the flow of machine learning
--Give participants the feeling that they can write a program by themselves
Since the training uses Kaggle, I also wrote the explanatory material as a Kaggle Notebook.
--Explanatory material: Let's try the Kaggle tutorial "Titanic Survivor Prediction"! https://www.kaggle.com/plasticgrammer/kaggle-titanic
--Practice: Titanic: Predict survivors (ΦωΦ) https://www.kaggle.com/plasticgrammer/titanic-predict-survivors
I wanted to combine explanations and exercises in a well-balanced manner, so I proceeded with the following flow.
Explain data analysis using the materials
--Python basics
--How to use Kaggle, explanation of terms
--Check the flow of machine learning (data loading, data analysis)
Data analysis exercises
Explain up to prediction using the materials
The following content is also described in the exercise notebook, but I will list it in this article as well.
--Check the number of rows and columns of the training data and the test data
--Let's display the first 5 rows of the training data
--Let's display the first 5 rows of the test data
--What is the difference between the training data and the test data? What exactly is machine learning asked to predict here?
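These first exercises could be approached roughly as follows. This is a minimal sketch, not the notebook's own solution. On Kaggle the data would normally be loaded with `pd.read_csv` (the input path in the comment is an assumption); here a few invented rows stand in for the real files so the snippet runs anywhere.

```python
import pandas as pd

# On Kaggle you would typically load the competition files instead, e.g.:
#   train = pd.read_csv("/kaggle/input/titanic/train.csv")  # path is an assumption
#   test = pd.read_csv("/kaggle/input/titanic/test.csv")
# The rows below are invented stand-ins, not real passenger records.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 2, 3, 1],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, None, 54.0],
})
test = pd.DataFrame({
    "PassengerId": [7, 8, 9],
    "Pclass": [3, 2, 1],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, None, 62.0],
})

# Number of rows and columns as (rows, columns)
print(train.shape, test.shape)

# First 5 rows of each data set
print(train.head())
print(test.head())

# Columns that exist only in the training data: this is what must be predicted
target_columns = [c for c in train.columns if c not in test.columns]
print(target_columns)
```

Comparing the column lists is one way to see the answer to the last question: the test data lacks `Survived`, which is the value the model has to predict.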
--Let's display the training data information with the info method
--Let's check the missing values in the training data
--Let's check the missing values in the test data
--Let's check the number of rows for each value of the target variable Survived
--Let's check which values the variable Pclass takes
--Let's check the distribution of the variable Age with a histogram
--Let's check the maximum, mean, and median of the variable Age
--Let's check the distribution of the variable Sex with value_counts and a bar graph
--Using pd.crosstab, let's check the counts of the variable Sex for each value of Survived
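The checks in this list map onto a handful of pandas calls. The sketch below uses invented rows in place of the real `train.csv` and is only one possible way to do the exercises.

```python
import pandas as pd

# Invented stand-in rows; on Kaggle, use the real training data instead.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, None, 54.0, 2.0, 27.0],
})

train.info()                                  # column types and non-null counts
missing = train.isnull().sum()                # missing values per column
survived_counts = train["Survived"].value_counts()
pclass_values = sorted(train["Pclass"].unique())

# Summary statistics for Age (missing values are skipped by default)
age_max = train["Age"].max()
age_mean = train["Age"].mean()
age_median = train["Age"].median()

sex_counts = train["Sex"].value_counts()
# Rows: Sex, columns: Survived, cells: number of passengers
sex_by_survived = pd.crosstab(train["Sex"], train["Survived"])

print(missing, survived_counts, pclass_values, sep="\n")
print(age_max, age_mean, age_median)
print(sex_counts)
print(sex_by_survived)

# In a notebook, train["Age"].plot.hist() and sex_counts.plot.bar()
# draw the histogram and the bar graph (both require matplotlib).
```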
--Let's check the counts of the variable Sex for each value of Survived with a bar graph. Is there a correlation? If so, what is the trend?
--Let's check the counts of the variable Pclass for each value of Survived with a bar graph. Is there a correlation? If so, what is the trend?
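One way to draw these graphs is to build a count table with `pd.crosstab` and plot it. The sketch below again uses invented rows, so the trend it shows says nothing about the real data; the survival-rate summary at the end is an addition that is not part of the exercise text.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a Kaggle notebook
import pandas as pd

# Invented stand-in rows; on Kaggle, use the real training data instead.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
})

# Counts of each Sex / Pclass value within each Survived group
sex_counts = pd.crosstab(train["Survived"], train["Sex"])
pclass_counts = pd.crosstab(train["Survived"], train["Pclass"])

# One group of bars per Survived value, one bar per category
ax_sex = sex_counts.plot.bar(title="Sex by Survived")
ax_pclass = pclass_counts.plot.bar(title="Pclass by Survived")

# Survival rate per category makes the trend easier to read than raw counts
survival_rate_by_sex = train.groupby("Sex")["Survived"].mean()
survival_rate_by_pclass = train.groupby("Pclass")["Survived"].mean()
print(survival_rate_by_sex)
print(survival_rate_by_pclass)
```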
Assumption: a flow that predicts with Random Forest using Age (missing values filled with 0) and Sex has already been created.
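Such a baseline flow might look roughly like the sketch below. It is not the notebook's actual code: the `preprocess` helper, the 0/1 encoding of Sex, and the hyperparameters are assumptions for illustration, and the rows are invented.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented stand-in rows; on Kaggle, use the real train/test files instead.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, None, 54.0, 2.0, 27.0],
})
test = pd.DataFrame({
    "PassengerId": [9, 10, 11],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, None, 62.0],
})


def preprocess(df):
    """Build the two baseline features (hypothetical helper)."""
    features = pd.DataFrame(index=df.index)
    features["Age"] = df["Age"].fillna(0)  # baseline: missing Age becomes 0
    features["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    return features


X_train = preprocess(train)
y_train = train["Survived"]
X_test = preprocess(test)

# random_state fixes the seed so repeated runs give the same model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# The competition expects one Survived value per test PassengerId
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
print(submission)
```

Keeping the feature building in one function means the same steps are applied to both the training data and the test data, which is where each of the following improvements would be added.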
--Let's fill the missing values of Age with the median
--Let's use Fare for prediction
--Let's use Embarked for prediction
--Let's add SibSp + Parch + 1 as FamilySize
--Let's add FamilySize <= 1 as IsAlone
--Let's add the first character of Cabin as a feature
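The improvements in this list could be sketched with pandas as follows. The rows are invented, and several details are assumptions rather than the notebook's method: the column name `CabinInitial`, the placeholder `"U"` for a missing Cabin, filling Embarked with its most frequent value, and one-hot encoding the categorical columns.

```python
import pandas as pd

# Invented stand-in rows; on Kaggle, use the real data instead.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
    "Embarked": ["S", "C", "S", None, "Q"],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 2, 0],
    "Cabin": [None, "C85", None, "C123", "E46"],
})

# Fill missing Age with the median instead of 0
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fare is numeric and can be used as is once any missing values are filled
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Family size counts the passenger plus siblings/spouses and parents/children
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] <= 1).astype(int)

# First character of Cabin; "U" for a missing Cabin is an assumption
df["CabinInitial"] = df["Cabin"].fillna("U").str[0]

# Embarked: fill missing values with the most frequent port (an assumption)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# One-hot encode the categorical columns so the model receives only numbers
features = pd.get_dummies(
    df.drop(columns=["Cabin"]),
    columns=["Embarked", "CabinInitial"],
)
print(features.columns.tolist())
```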
The training took 5 hours. The last task, improving the prediction accuracy, took more time than I expected, and as a result it left the impression of being difficult. We later ran it again as an additional session, and I felt that it would be better to proceed one step at a time with plenty of exercises.
There are many articles about the Titanic competition, and I referred to a number of them. I put this material together as a training task for Python beginners, and I am sharing it in the hope that it helps someone.