Kaggle is a platform that hosts machine learning competitions. When I signed up, there was content for beginners, so I watched the introductory video right away.
The English was super fast!!! The video covered an overview of the Titanic disaster, an explanation of the dataset, tutorials, and how to use Kaggle.
It was too fast for me to follow, so I read the Japanese Wikipedia article on the [sinking of the Titanic](https://ja.wikipedia.org/wiki/%E3%82%BF%E3%82%A4%E3%82%BF%E3%83%8B%E3%83%83%E3%82%AF%E5%8F%B7%E6%B2%88%E6%B2%A1%E4%BA%8B%E6%95%85) instead.
Roughly summarized:

- The collision happened around midnight while people were asleep, so the initial response was delayed.
- There were not enough lifeboats (the ship was believed to be safe).
- Survival rates differed greatly between the upper and lower classes, between men and women, and by age.
Looking at the diagram, I suspect the mortality rate was high in the section where the iceberg struck and tore a hole in the hull.
There is a trailer that gives you a panoramic view of the ship. Although it is a movie, I think it conveys the size of the ship, the number of people aboard, and the atmosphere of the time. (These people are about to ...)

Titanic (dubbed version) - Trailer
There are 891 rows of training data and 418 rows of test data. The data definitions are as follows:
Variable | Definition | Notes |
---|---|---|
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Name | |
Sex | Sex | |
Age | Age | |
SibSp | Number of siblings / spouses aboard | |
Parch | Number of parents / children aboard | |
Ticket | Ticket number | |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
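Before modeling, it helps to get a feel for the schema above. Here is a minimal sketch using a tiny stand-in DataFrame with the same columns (the rows are illustrative only, not the real `train.csv`):

```python
import pandas as pd

# A tiny stand-in mimicking the Titanic training schema (illustrative rows)
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley', 'Heikkinen, Miss. Laina'],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282'],
    'Fare': [7.25, 71.2833, 7.925],
    'Cabin': [None, 'C85', None],
    'Embarked': ['S', 'C', 'S'],
})

# Shape, column types, and overall survival rate
print(sample.shape)
print(sample.dtypes)
print(sample['Survived'].mean())
```

In practice you would call the same `.shape`, `.dtypes`, and `.mean()` on the real `train.csv` loaded with `pd.read_csv`.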
Many example programs are posted under "Notebook", so I checked some popular ones.
There was also a Japanese tutorial: Kaggle Titanic First Step (1st Step for Kaggle Titanic).
I skimmed it and my head started spinning, so to keep the story simple I first made a model in which everyone survives. All it takes is adding a "Survived" column and uploading the result to Kaggle.
00.py

```python
import pandas as pd

# Read the test CSV
test = pd.read_csv('test.csv')
# Add a Survived column: everyone survives
test["Survived"] = 1
# Check the result
print(test["Survived"])
# Keep only PassengerId and Survived for submission
test = test.loc[:, ['PassengerId', 'Survived']]
# Write to CSV (no index needed)
test.to_csv('titanic1-1.csv', index=False)
```
Check the resulting CSV and submit it to Kaggle.
Public Score: 0.37320, leaderboard: around 15800th

The Public Score is close to the actual survival rate (31.9%). The leaderboard seems to rank each person by their highest score, so I couldn't tell my exact position, but 0.37320 put me around 15800th. There are that many people in the world with the same score, in other words, people thinking the same thing... that impressed me a little.
The lowest score was 0, around 70th from the bottom. A score of 0 means every answer is exactly inverted, which is a score worth paying attention to in its own right. So I uploaded a CSV with `["Survived"] = 0` to Kaggle. Since 1 - 0.37320 = 0.62680, I expected that value, and the Public Score came out as 0.62679. Almost exactly right.
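The everyone-dies submission is the mirror image of the first script. A minimal sketch, using a small stand-in for `test.csv` so it runs anywhere (real PassengerIds in the test set start at 892):

```python
import pandas as pd

# Stand-in for test.csv (illustrative PassengerIds only)
test = pd.DataFrame({'PassengerId': [892, 893, 894]})

# This time, everyone dies
test["Survived"] = 0

# Keep only the submission columns and write the CSV
submission = test.loc[:, ['PassengerId', 'Survived']]
submission.to_csv('titanic0.csv', index=False)

# For a binary target, the all-zeros and all-ones accuracies sum to 1:
# 1 - 0.37320 = 0.62680, which matches the observed 0.62679 up to rounding.
print(submission)
```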
This time, I will simply assign death to men and survival to women. On the Titanic, the mortality rate for men was high and the survival rate for women was high, so even this should have some predictive power.
01.py

```python
# Use pandas
import pandas as pd

# Read the test CSV
test = pd.read_csv('test.csv')
# Add a Survived column, defaulting to 0 (death)
test["Survived"] = 0
# Set women to 1 (survival)
test.loc[test["Sex"] == 'female', "Survived"] = 1
# Keep only PassengerId and Survived for submission
test = test.loc[:, ['PassengerId', 'Survived']]
# Write to CSV (no index needed)
test.to_csv('titanic1.csv', index=False)
```
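The claim that sex is strongly predictive can be checked on the training data with a simple groupby. Sketched here on a made-up mini dataset (on the real `train.csv`, the survival rate is roughly 0.19 for men and 0.74 for women):

```python
import pandas as pd

# Hypothetical mini training set, standing in for train.csv
train = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 1, 1, 1],
})

# Survival rate by sex: the gap between the two groups is what
# makes the male=0 / female=1 rule a reasonable baseline
rates = train.groupby('Sex')['Survived'].mean()
print(rates)
```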
Public Score: 0.76555, leaderboard: 12457th place out of roughly 15,000
The contents seem to be identical to the CSV of Kaggle's sample Gender Based Model (gender_submission.csv).
Even a very simple model scores 0.76555, so how to improve the prediction accuracy from here is where skill comes in.
First things first, though: checking the rules.