Table of contents

- Chapter 1: Linear Regression Model
- [Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
- [Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
- [Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
- [Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
- [Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
- [Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)
x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m
y \in \left\{0, 1\right\}
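With these definitions, logistic regression models the survival probability by passing a linear combination of the inputs through a sigmoid (the standard formulation, written in the notation above with weights $w$ and intercept $b$):

P(y = 1 \mid x) = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

A prediction of $y = 1$ is returned when this probability exceeds 0.5.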
Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# from <module name> import <class name (or function / variable name)>
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Magic command to display matplotlib plots inline (no need to call plt.show())
%matplotlib inline
In the following, the study_ai_ml folder placed directly under My Drive in Google Drive is used.
#Read titanic data csv file
titanic_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/titanic_train.csv')
#View the beginning of the file and check the dataset
titanic_df.head(5)
I looked up the meaning of each variable:

- PassengerId: passenger ID
- Survived: survival result (1: survived, 0: died)
- Pclass: passenger class (1 is the highest class)
- Name: passenger's name
- Sex: gender
- Age: age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: boarding fare
- Cabin: cabin number
- Embarked: port of embarkation (Cherbourg, Queenstown, Southampton)
#Drop the columns that seem unnecessary for prediction
titanic_df.drop(['PassengerId','Pclass', 'Name', 'SibSp','Parch','Ticket','Fare','Cabin','Embarked'], axis=1, inplace=True)
#Display data with some columns dropped
titanic_df.head()
#Show rows containing null
titanic_df[titanic_df.isnull().any(axis=1)].head(10)
#Fill nulls in the Age column with the mean
titanic_df['AgeFill'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
#Show rows containing null again (Age nulls are now filled)
titanic_df[titanic_df.isnull().any(axis=1)]
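As an aside, `fillna` makes it easy to compare mean versus median imputation; a minimal sketch on a toy Series (made-up values, not the Titanic data):

```python
import pandas as pd

# Toy Age column with one missing value (hypothetical numbers)
age = pd.Series([22.0, 38.0, None, 26.0])

mean_filled = age.fillna(age.mean())      # (22 + 38 + 26) / 3 ≈ 28.67
median_filled = age.fillna(age.median())  # median of 22, 26, 38 = 26.0
```

The median is less sensitive to outliers, which is why it is often preferred for skewed columns like Age; this article uses the mean.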
#titanic_df.dtypes
#titanic_df.head()
#Since the missing values are now filled in AgeFill, the original Age column could be dropped:
#titanic_df = titanic_df.drop(['Age'], axis=1)
#Encode Sex into a new Gender column: female = 0, male = 1
titanic_df['Gender'] = titanic_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
titanic_df.head()
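The `map` call above replaces each string with its value in the dictionary; a standalone sketch of the same pattern on toy data:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'male'])
# Dictionary lookup per element, then cast to integer
gender = sex.map({'female': 0, 'male': 1}).astype(int)
print(gender.tolist())  # [1, 0, 1]
```

Values not present in the dictionary would become NaN, so this mapping assumes Sex contains only 'male' and 'female'.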
np.random.seed(0)  # fix the seed so the jitter below is reproducible
xmin, xmax = -5, 85
ymin, ymax = -0.5, 1.3
index_notsurvived = titanic_df[titanic_df["Survived"]==0].index
index_survived = titanic_df[titanic_df["Survived"]==1].index

from matplotlib.colors import ListedColormap
fig, ax = plt.subplots()
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
sc = ax.scatter(titanic_df.loc[index_notsurvived, 'AgeFill'],
                titanic_df.loc[index_notsurvived, 'Gender']+(np.random.rand(len(index_notsurvived))-0.5)*0.1,
                color='r', label='Not Survived', alpha=0.3)
sc = ax.scatter(titanic_df.loc[index_survived, 'AgeFill'],
                titanic_df.loc[index_survived, 'Gender']+(np.random.rand(len(index_survived))-0.5)*0.1,
                color='b', label='Survived', alpha=0.3)
ax.set_xlabel('AgeFill')
ax.set_ylabel('Gender')
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.legend(bbox_to_anchor=(1.4, 1.03))
Since 1 is male and 0 is female, and red marks the dead while blue marks the survivors, the distribution shows that a relatively large proportion of the women survived.
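Incidentally, the small random offset added to the 0/1 Gender values in the scatter plot is a jitter trick so that overlapping points stay visible; a quick standalone check shows the offset always stays within ±0.05:

```python
import numpy as np

np.random.seed(0)
# Same jitter expression as in the scatter call
jitter = (np.random.rand(1000) - 0.5) * 0.1

# np.random.rand is uniform on [0, 1), so jitter lies in [-0.05, 0.05)
print(jitter.min(), jitter.max())
```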
#Create a list of age and gender only
data2 = titanic_df.loc[:, ["AgeFill", "Gender"]].values
data2
result
array([[22. , 1. ],
[38. , 0. ],
[26. , 0. ],
...,
[29.69911765, 0. ],
[26. , 1. ],
[32. , 1. ]])
Let's draw a stacked histogram of survival by age.
split_data = []
for survived in [0,1]:
    split_data.append(titanic_df[titanic_df.Survived==survived])

temp = [i["AgeFill"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)
Since the missing ages were filled with the mean, the count in the middle bin is inflated. Let's graph it again using the data with the missing values excluded.
temp = [i["Age"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)
Check the survival rate of men and women with a stacked histogram.
temp = [i["Gender"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)
Again you can see that a much larger proportion of the women survived.
#Create a list of survival flags only
label2 = titanic_df.loc[:,["Survived"]].values
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression()
model2.fit(data2, label2)
result
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
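The two warnings are harmless here, but they can be silenced by naming a solver explicitly and flattening y with `ravel()`; a minimal sketch on stand-in data (made-up values shaped like `data2` / `label2`, not the actual Titanic frame):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in (AgeFill, Gender) -> Survived data
X = np.array([[22.0, 1], [38.0, 0], [26.0, 0], [35.0, 1], [54.0, 1], [4.0, 0]])
y = np.array([[0], [1], [1], [0], [0], [1]])    # column vector, like label2

model = LogisticRegression(solver='liblinear')  # explicit solver: no FutureWarning
model.fit(X, y.ravel())                         # 1-d y: no DataConversionWarning
```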
Predict a 30-year-old man.

model2.predict([[30,1]])

result

array([0])

A zero (death) prediction is returned. Let's look at the probability behind that judgment.

model2.predict_proba([[30,1]])

result

array([[0.80664059, 0.19335941]])

We can see a death probability of about 80% and a survival probability of about 20%.
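Under the hood, `predict_proba` is just the sigmoid of the linear score $w^T x + b$; a numpy sketch with illustrative coefficients (made up for the example, not the actual fitted Titanic weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights for [AgeFill, Gender] and intercept (not fitted values)
w = np.array([-0.01, -2.4])
b = 1.3

x = np.array([30.0, 1.0])        # a 30-year-old man
p_survive = sigmoid(w @ x + b)   # P(Survived = 1 | x)
p_die = 1.0 - p_survive          # first column of predict_proba
```

With these weights the score is negative for a man, so the death probability exceeds 0.5 and `predict` would return 0, matching the behaviour seen above.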
Related Sites
- Chapter 1: Linear Regression Model
- [Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
- [Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
- [Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
- [Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
- [Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
- [Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)