This is a revised version of my first post, which I couldn't finish because I ran out of steam partway through. I've edited it to be a little easier to follow.
The Titanic dataset has many missing Age values, so I figured accuracy would improve if I filled them all in.
The libraries imported this time are as follows.
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import RandomForestRegressor as RFR
I habitually abbreviate long names, so if something unfamiliar appears along the way, assume that's what happened.
Putting the original data in a DataFrame looks like this.
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
Writing the whole thing out is tedious, so only the first and last rows are shown. This is the Titanic training data everyone knows. Let's check it with df.info().
df.info()
#result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The data is 891 rows × 12 columns. Age, Cabin, and Embarked have missing values, and Name, Sex, Ticket, Cabin, and Embarked are object (string) columns, so they can't be used as they are.
This is preprocessing everyone probably does. Sex and Embarked are objects, but since they only have two or three distinct values, I replace them with simple numbers. Embarked also has two missing values, but since there are so few and the distribution is heavily skewed, I fill them with the mode.
#Count the number of elements in Embarked
df.Embarked.value_counts()
#result
S 644
C 168
Q 77
#Quantify Sex
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
#Quantify Embarked
df['Embarked'] = df['Embarked'].map({'S':0, 'C':1, 'Q':2})
#Fill Embarked's missing values with the mode, S (encoded as 0)
df['Embarked'] = df['Embarked'].fillna(0)
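Rather than hard-coding 0, the mode can also be looked up with `.mode()`, so the same line keeps working even if the most frequent port changes. A minimal sketch on a toy column (the values here are invented for illustration, not the real dataset):

```python
import pandas as pd

# Toy Embarked column with one missing value (illustrative data only)
df_toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', None, 'Q', 'S']})

df_toy['Embarked'] = df_toy['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# mode()[0] is the most frequent value; here that is 0 (S)
df_toy['Embarked'] = df_toy['Embarked'].fillna(df_toy['Embarked'].mode()[0])

print(df_toy['Embarked'].tolist())  # [0.0, 1.0, 0.0, 0.0, 2.0, 0.0]
```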
This time I'd like to compare filling the Age gaps with the mean against predicting them with a random forest, using the mean-filled version as the baseline.
#In the actual code I work on a copy made with df.copy() so the original isn't clobbered later
#For the baseline, fill the missing Age values with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
#Build the features without the object columns (Name, Ticket, Cabin)
df_data = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
df_label = df[['Survived']]
(train_data, test_data, train_label, test_label) = tts(df_data, df_label, test_size=0.3, random_state=0)
# train_test_split was imported as tts
clf = RFC()  # RandomForestClassifier was imported as RFC
clf.fit(train_data, train_label)
clf.score(test_data, test_label)
Since I want to see the effect of the preprocessing alone, before and after, the model parameters are left at their defaults.
Result is···
0.8134328358208955
The baseline is already a bit high... Can we aim even higher?
I'd like to believe that survival differs by age. For a start, I thought it would be great if the approximate age could be inferred from the other columns.
Let's look at the correlation coefficients with df.corr() after removing the rows with missing Age using dropna.
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|---|---|
Age | 0.033207 | -0.069809 | -0.331339 | -0.084153 | 1.000000 | -0.232625 | -0.179191 | 0.091566 | 0.007461 |
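The row above comes from the dropna + corr pattern; the following sketch shows it on a small made-up frame (the numbers are invented, so the coefficients differ from the table, but the mechanics are the same):

```python
import pandas as pd

# Small made-up numeric frame with missing Age values (not the real Titanic data)
df_toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 2, 1, 3],
    'Age':    [22.0, 38.0, None, 35.0, None, 2.0],
    'SibSp':  [1, 1, 0, 0, 0, 3],
})

# Drop rows where Age is missing, then pull out Age's correlation row
corr_age = df_toy.dropna(subset=['Age']).corr()['Age']
print(corr_age)
```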
Looking at this, Pclass, SibSp, and Parch have the strongest correlations with Age. Pclass's negative correlation suggests people tend to move up a class as they get older. SibSp and Parch make sense because large families contain many children. Fare is probably less correlated because it is summed per family.
At this point the remaining string columns are Name, Ticket, and Cabin. Among these, Name contains titles such as Mr and Miss. In English class I learned Miss for unmarried women and Mrs for married women. Maybe analyzing this would relate to age to some extent?? That was my thought.
In a name like "Braund, Mr. Owen Harris", the title is sandwiched between "," and ".". Extract just that sandwiched part and save it under the column name "Honorific".
I used apply() for the string extraction this time, but map() (with the same lambda) gave exactly the same result. One caution!! **There is a half-width space between "," and the title**. I initially left it out, the characters weren't matched, and I lost a lot of time there.
#Don't forget the half-width space!
df['Honorific'] = df['Name'].apply(lambda x: x.split(', ')[1].split('.')[0])
df['Honorific'].value_counts()
#result
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Col 2
Major 2
Lady 1
Ms 1
the Countess 1
Don 1
Mme 1
Jonkheer 1
Sir 1
Capt 1
Name: Honorific, dtype: int64
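As an aside, `str.extract` with a regex that allows optional whitespace sidesteps the half-width-space trap entirely. A sketch (the first name is from the table above; the second is altered to drop the space on purpose):

```python
import pandas as pd

# Names in the "Surname, Title. Given names" style
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen,Miss. Laina',   # note: no space after the comma
])

# \s* tolerates zero or more spaces after the comma; ([^.]+) captures the title
honorific = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(honorific.tolist())  # ['Mr', 'Miss']
```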
Quite a few titles appear only once. Let's look at the statistics with describe().
df.groupby('Honorific').describe()['Age']
The result looks like this.
Honorific | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Capt | 1.0 | 70.000000 | NaN | 70.00 | 70.000 | 70.0 | 70.00 | 70.0 |
Col | 2.0 | 58.000000 | 2.828427 | 56.00 | 57.000 | 58.0 | 59.00 | 60.0 |
Don | 1.0 | 40.000000 | NaN | 40.00 | 40.000 | 40.0 | 40.00 | 40.0 |
Dr | 6.0 | 42.000000 | 12.016655 | 23.00 | 35.000 | 46.5 | 49.75 | 54.0 |
Jonkheer | 1.0 | 38.000000 | NaN | 38.00 | 38.000 | 38.0 | 38.00 | 38.0 |
Lady | 1.0 | 48.000000 | NaN | 48.00 | 48.000 | 48.0 | 48.00 | 48.0 |
Major | 2.0 | 48.500000 | 4.949747 | 45.00 | 46.750 | 48.5 | 50.25 | 52.0 |
Master | 36.0 | 4.574167 | 3.619872 | 0.42 | 1.000 | 3.5 | 8.00 | 12.0 |
Miss | 146.0 | 21.773973 | 12.990292 | 0.75 | 14.125 | 21.0 | 30.00 | 63.0 |
Mlle | 2.0 | 24.000000 | 0.000000 | 24.00 | 24.000 | 24.0 | 24.00 | 24.0 |
Mme | 1.0 | 24.000000 | NaN | 24.00 | 24.000 | 24.0 | 24.00 | 24.0 |
Mr | 398.0 | 32.368090 | 12.708793 | 11.00 | 23.000 | 30.0 | 39.00 | 80.0 |
Mrs | 108.0 | 35.898148 | 11.433628 | 14.00 | 27.750 | 35.0 | 44.00 | 63.0 |
Ms | 1.0 | 28.000000 | NaN | 28.00 | 28.000 | 28.0 | 28.00 | 28.0 |
Rev | 6.0 | 43.166667 | 13.136463 | 27.00 | 31.500 | 46.5 | 53.25 | 57.0 |
Sir | 1.0 | 49.000000 | NaN | 49.00 | 49.000 | 49.0 | 49.00 | 49.0 |
the Countess | 1.0 | 33.000000 | NaN | 33.00 | 33.000 | 33.0 | 33.00 | 33.0 |
Master seems to be attached to young boys, yet there is a 12-year-old Master and an 11-year-old Mr... Also, Mrs supposedly means a married woman, but the youngest is 14 years old. Really?? And std (standard deviation) is NaN when there is only one person, which makes sense.
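Since the table shows the titles have such distinct age distributions, a simpler alternative worth comparing against the regressor is filling each missing Age with the median of its title group. A sketch on toy data (ages invented for illustration):

```python
import pandas as pd

# Toy data: Age missing for one Mr and one Master (invented values)
df_toy = pd.DataFrame({
    'Honorific': ['Mr', 'Mr', 'Mr', 'Master', 'Master'],
    'Age':       [30.0, 40.0, None, 4.0, None],
})

# groupby().transform('median') broadcasts each group's median back onto its rows
df_toy['Age'] = df_toy['Age'].fillna(
    df_toy.groupby('Honorific')['Age'].transform('median')
)

print(df_toy['Age'].tolist())  # [30.0, 40.0, 35.0, 4.0, 4.0]
```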
Here I quantify the titles for use as features. However, quantifying every single one is a pain.
Looking at the table above, the titles held by only one person span all sorts of ages, so I'll ignore them this time. Put the titles to be used into df_name, and keep the unused rows separately since they'll be needed later. Putting "~" in front of a boolean mask negates it, selecting everything else.
#Data to use
df_name = df[df['Honorific'].isin(['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Major', 'Mlle', 'Col'])]
#Unused data
df_unneed_name = df[~df['Honorific'].isin(['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Major', 'Mlle', 'Col'])]
#Quantify Honorific
df_name['Honorific'] = df_name['Honorific'].map({'Mr':0, 'Miss':1, 'Mrs':2, 'Master':3, 'Dr':4, 'Rev':5, 'Major':6, 'Mlle':7, 'Col':8})
df_name['Honorific'].value_counts()
#result
0 517
1 182
2 125
3 40
4 7
5 6
8 2
7 2
6 2
Name: Honorific, dtype: int64
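Writing out the mapping dict by hand gets tedious; `pd.factorize` assigns integer codes automatically. (The codes follow order of first appearance, so they won't necessarily match the hand-written mapping above.) A sketch:

```python
import pandas as pd

s = pd.Series(['Mr', 'Miss', 'Mr', 'Mrs', 'Master'])

# factorize returns (codes, uniques); codes are ints in order of first appearance
codes, uniques = pd.factorize(s)
print(codes.tolist())    # [0, 1, 0, 2, 3]
print(uniques.tolist())  # ['Mr', 'Miss', 'Mrs', 'Master']
```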
Alright, the preparation is finally done, so let's make the prediction. The rows that have Age are used for training, and the rows with missing Age become the test set.
df_Agefill = df_name.dropna(subset=['Age'])
df_Agefill_data = df_Agefill[['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked','Honorific']] #Training data
df_Agefill_label = df_Agefill[['Age']] #Learning label
df_Agenull = df_name[df_name['Age'].isnull()]
df_Agenull_data = df_Agenull[['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked','Honorific']] #Test data
df_Agenull_label = df_Agenull[['Age']] #Test label
By the way, checking the training data's correlations with df_Agefill.corr(), the correlation coefficient between Age and Honorific is
-0.095726
After all that work... No, I won't give up! The correlation coefficient only captures linear relationships, so it may simply be that the arbitrary numbers assigned to the titles don't line up linearly with age.
# RandomForestRegressor was imported as RFR
clf = RFR()
# .values.ravel() flattens the one-column label to the 1-D shape sklearn expects
clf.fit(df_Agefill_data, df_Agefill_label.values.ravel())
#Store the predictions in the label DataFrame
age_answer = clf.predict(df_Agenull_data)
df_Agenull_label['Age'] = age_answer
df_Agenull_label
#result
Age
5 37.987776
17 31.422079
19 26.808000
26 32.879936
28 20.253988
... ...
859 24.929030
863 15.495167
868 25.716969
878 27.344498
888 7.838333
Since Age is continuous rather than a class, we use the regression counterpart, RandomForestRegressor. The predictions are floating point, but they look plausible, don't they?? That said, there is no ground truth for these rows, so I can't tell how well they fit.
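One pitfall: assigning into df_Agenull_label, which is a slice of another frame, can trigger pandas' SettingWithCopyWarning. An alternative is writing the predictions straight back into the original frame with .loc on the missing-Age mask (the concat-and-sort approach below works too). A sketch with fake predictions (all numbers invented):

```python
import pandas as pd

# Toy frame with two missing ages (invented values)
df_toy = pd.DataFrame({'Age': [22.0, None, 35.0, None]})

# Pretend these came from a fitted regressor's predict()
fake_preds = [28.5, 31.0]

# .loc with a boolean mask writes in place: no copy warning, no re-concat needed
df_toy.loc[df_toy['Age'].isnull(), 'Age'] = fake_preds

print(df_toy['Age'].tolist())  # [22.0, 28.5, 35.0, 31.0]
```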
Now combine the data with the completed Age back into one frame.
#Write the predicted labels back into the test rows
df_Agenull['Age'] = df_Agenull_label
#Concatenate the rows that had Age, the newly filled rows, and the rows set aside by title
df_Age = pd.concat([df_Agefill, df_Agenull, df_unneed_name])
#After concat the index is out of order, so sort it back
df_Age = df_Age.sort_index()
df_Age.info()
#result
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null int64
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 891 non-null float64
12 Honorific 891 non-null object
dtypes: float64(3), int64(6), object(4)
memory usage: 97.5+ KB
Everything is back to normal except the new Honorific column, and Age is now complete. Now, let's check the accuracy with the random forest!!
#[Baseline comparison] identical to the mean-imputation run except for how Age was filled
df_data = df_Age[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
df_label = df_Age[['Survived']]
(train_data, test_data, train_label, test_label) = tts(df_data, df_label, test_size=0.3, random_state=0)
clf = RFC()
clf.fit(train_data, train_label)
clf.score(test_data, test_label)
Result is······
0.832089552238806
A very marginal gain!! But it did go up a little... right??
The improvement wasn't commensurate with all that effort, but well, I'm just glad it went up at all! (lol
Emi-chan, it went up slightly! (a Manzai King reference)