Kaggle is the world's largest machine learning competition platform, and Titanic is its tutorial-level task.
The competition asks you to predict whether each passenger on the Titanic survived. https://www.kaggle.com/c/titanic/overview
I went through trial and error until I crossed the 80% accuracy barrier, and this post briefly introduces how. The code is linked at the end (GitHub).
The target to predict is Survived. The training data has 891 rows, and the test data (the rows to be submitted) has 418.
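As a quick sanity check, those sizes can be confirmed with pandas. A minimal sketch, assuming the standard Kaggle file layout (adjust the paths to your setup):

```python
import pandas as pd

# Standard Kaggle Titanic file names; the paths are an assumption.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape)  # (891, 12) - includes the Survived target column
print(test.shape)   # (418, 11) - Survived is what we predict and submit
```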
I looked at other kernels and articles and tried a variety of ideas. The feature-engineering steps I settled on were:
- **Fare**: Missing values are imputed with the mean Fare of each Pclass. (A code sketch of the steps in this list follows below.)
- **Embarked**: Missing values are imputed with the mode (S).
- **Sex**, **Embarked**: Simply converted to dummy variables, e.g. male -> 0, female -> 1.
- **Pclass**: Used as-is with its values 1, 2, and 3.
- **FamilySize**: The number of family members, computed as SibSp + Parch. From FamilySize I created dummy variables for traveling alone (IsAlone) and for small, medium, and large families.
- **Title** (honorific): Extract Mr, Miss, Mrs, etc. from the Name. I expected this to capture not only age but also marital status fairly accurately. The Title was added to the features as dummy variables.
- **Ticket_ini**: Extract the first character of the ticket number and convert it to dummy variables.
- **n_same_ticket**: The number of passengers sharing the same ticket number. People with the same ticket number probably bought their tickets together as a family or a group of friends, so I expected shared ticket numbers to carry group information. (Personal take: SibSp and Parch only reveal accompanying family, whereas this also reveals accompanying friends, which I thought would be an advantage.) Reference: https://yolo-kiyoshi.com/2018/12/16/post-951/
- **Cabin_ini**: Extract the first letter of the Cabin and convert it to dummy variables (missing Cabin values fall into their own N category).
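Here is a minimal sketch of the feature engineering described above, assuming the `train`/`test` DataFrames from earlier. The family-size cut-offs and the grouping of rare categories into `Others` are my guesses, not the exact rules from the repo:

```python
import pandas as pd

# Process train and test together so the dummy columns stay aligned.
df = pd.concat([train, test], sort=False)

# --- Imputation ---
# Fare: fill missing values with the mean Fare of each Pclass.
df["Fare"] = df["Fare"].fillna(df.groupby("Pclass")["Fare"].transform("mean"))
# Embarked: fill missing values with the mode, "S".
df["Embarked"] = df["Embarked"].fillna("S")

# --- Simple encoding ---
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# --- FamilySize and family-size buckets ---
df["FamilySize"] = df["SibSp"] + df["Parch"]
df["IsAlone"] = (df["FamilySize"] == 0).astype(int)
# Bucket boundaries below are an assumption, not the author's exact cut-offs.
df["Family_size_small"] = df["FamilySize"].between(1, 3).astype(int)
df["Family_size_mid"] = df["FamilySize"].between(4, 6).astype(int)
df["Family_size_big"] = (df["FamilySize"] >= 7).astype(int)

# --- Title extracted from Name, e.g. "Braund, Mr. Owen Harris" -> "Mr." ---
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+\.)", expand=False)
common = ["Mr.", "Miss.", "Mrs.", "Master."]
df["Title"] = df["Title"].where(df["Title"].isin(common), "Others")

# --- Ticket features ---
# First character of the ticket number; rare initials grouped as "Others".
df["Ticket_ini"] = df["Ticket"].str[0]
keep = ["1", "2", "3", "A", "C", "P", "S", "W"]
df["Ticket_ini"] = df["Ticket_ini"].where(df["Ticket_ini"].isin(keep), "Others")
# How many passengers share the same ticket number.
df["n_same_ticket"] = df.groupby("Ticket")["Ticket"].transform("count")

# --- Cabin initial; missing cabins become their own "N" category ---
df["Cabin_Initial"] = df["Cabin"].str[0].fillna("N")
keep_cabin = ["B", "C", "D", "E", "N"]
df["Cabin_Initial"] = df["Cabin_Initial"].where(
    df["Cabin_Initial"].isin(keep_cabin), "Others"
)

# --- One-hot encode the remaining categorical features ---
df = pd.get_dummies(df, columns=["Embarked", "Ticket_ini", "Title", "Cabin_Initial"])
```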
This gives the 31 variables below:

1. Pclass
2. Sex
3. Fare
4. n_same_ticket
5. Embarked_C
6. Embarked_Q
7. Embarked_S
8. Ticket_ini_1
9. Ticket_ini_2
10. Ticket_ini_3
11. Ticket_ini_A
12. Ticket_ini_C
13. Ticket_ini_Others
14. Ticket_ini_P
15. Ticket_ini_S
16. Ticket_ini_W
17. Title_Master.
18. Title_Miss.
19. Title_Mr.
20. Title_Mrs.
21. Title_Others
22. Cabin_Initial_B
23. Cabin_Initial_C
24. Cabin_Initial_D
25. Cabin_Initial_E
26. Cabin_Initial_N
27. Cabin_Initial_Others
28. IsAlone
29. Family_size_small
30. Family_size_mid
31. Family_size_big
I used a Random Forest. The hyperparameters were tuned roughly, and the final prediction averaged the outputs of the 10 models from 10-fold cross-validation. The folds were split with stratified k-fold.
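A sketch of the stratified 10-fold averaging. `train_features`/`test_features` are hypothetical DataFrames holding the 31 columns, and the hyperparameter values are placeholders rather than the tuned ones from the repo:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X = train_features.values       # the 31 features for the training rows
y = train["Survived"].values
X_test = test_features.values   # the same 31 features for the test rows

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
test_pred = np.zeros(len(X_test))

for tr_idx, _ in skf.split(X, y):
    # Placeholder hyperparameters; the tuned values are in the repo.
    model = RandomForestClassifier(n_estimators=500, max_depth=7, random_state=0)
    model.fit(X[tr_idx], y[tr_idx])
    # Average the predicted survival probabilities over the 10 fold models.
    test_pred += model.predict_proba(X_test)[:, 1] / skf.n_splits

submission = (test_pred >= 0.5).astype(int)
```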
I also tried LightGBM, XGBoost, and CatBoost, but the public score was better with Random Forest (were the boosted models overfitting?). I built several models and tried ensembling them, but in the end Random Forest alone was best, so I went with that.
Incidentally, the feature importances of the Random Forest came out like this.
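The importance plot itself isn't reproduced here, but with scikit-learn it can be recovered from any fitted fold model. In this sketch, `model` is one of the fitted models from the loop above and `feature_names` is a hypothetical list of the 31 column names:

```python
import pandas as pd

# Rank the 31 features by the Random Forest's impurity-based importances.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```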
I posted the code on GitHub; if you want the details, please check it out. Just running it should reproduce **80.861%** accuracy. https://github.com/taruto1215/Kaggle_Titanic
Actually, I am taking a data science course called GCI2020summer at the Matsuo Laboratory of the University of Tokyo, and that is where I decided to enter the Titanic competition. I figured I would reach about 80% in a couple of days, but by the deadline I had only managed 78-79% accuracy.
I was frustrated, so I kept at it and finally reached 80%. I'm still a beginner, so if you have any advice, please don't hesitate to reach out.