Table of contents

- Chapter 1: Linear Regression Model
- [Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
- [Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
- [Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
- [Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
- [Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
- [Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)
x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m
y \in \left\{0, 1\right\}
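With these definitions, logistic regression models the survival probability by passing a linear combination of the inputs through a sigmoid (the standard formulation, written in the notation above with weights $w$ and intercept $b$):

P(y = 1 \mid x) = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

A prediction of $y = 1$ is returned when this probability exceeds 0.5.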
Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# from <module name> import <class name (or function / variable name)>
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Magic command to display matplotlib plots inline (no need to call plt.show())
%matplotlib inline
In the following, the study_ai_ml folder placed directly under My Drive in Google Drive is used.
#Read titanic data csv file
titanic_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/titanic_train.csv')
#View the beginning of the file and check the dataset
titanic_df.head(5)
I looked up the meaning of each variable:

- PassengerId: passenger ID
- Survived: survival result (1: survived, 0: died)
- Pclass: passenger class (1 is the highest class)
- Name: passenger's name
- Sex: gender
- Age: age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: boarding fare
- Cabin: cabin number
- Embarked: port of embarkation (Cherbourg, Queenstown, Southampton)
#Drop the columns that seem unnecessary for prediction
titanic_df.drop(['PassengerId','Pclass', 'Name', 'SibSp','Parch','Ticket','Fare','Cabin','Embarked'], axis=1, inplace=True)
#Display data with some columns dropped
titanic_df.head()
#Show rows containing null
titanic_df[titanic_df.isnull().any(axis=1)].head(10)
#Fill nulls in the Age column with the mean
titanic_df['AgeFill'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
#Show rows containing null again (Age nulls are now filled)
titanic_df[titanic_df.isnull().any(axis=1)]
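As an aside, `fillna` makes it easy to compare mean versus median imputation; a minimal sketch on a toy Series (made-up values, not the Titanic data):

```python
import pandas as pd

# Toy Age column with one missing value (hypothetical numbers)
age = pd.Series([22.0, 38.0, None, 26.0])

mean_filled = age.fillna(age.mean())      # (22 + 38 + 26) / 3 ≈ 28.67
median_filled = age.fillna(age.median())  # median of 22, 26, 38 = 26.0
```

The median is less sensitive to outliers, which is why it is often preferred for skewed columns like Age; this article uses the mean.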
#titanic_df.dtypes
#titanic_df.head()
#Since the missing values are now filled in AgeFill, the original Age column could be dropped:
#titanic_df = titanic_df.drop(['Age'], axis=1)
#Encode Sex into a new Gender column: female = 0, male = 1
titanic_df['Gender'] = titanic_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
titanic_df.head()
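The `map` call above replaces each string with its value in the dictionary; a standalone sketch of the same pattern on toy data:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'male'])
# Dictionary lookup per element, then cast to integer
gender = sex.map({'female': 0, 'male': 1}).astype(int)
print(gender.tolist())  # [1, 0, 1]
```

Values not present in the dictionary would become NaN, so this mapping assumes Sex contains only 'male' and 'female'.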
np.random.seed(0)  # fix the seed so the jitter below is reproducible
xmin, xmax = -5, 85
ymin, ymax = -0.5, 1.3
index_notsurvived = titanic_df[titanic_df["Survived"]==0].index
index_survived = titanic_df[titanic_df["Survived"]==1].index

from matplotlib.colors import ListedColormap
fig, ax = plt.subplots()
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
sc = ax.scatter(titanic_df.loc[index_notsurvived, 'AgeFill'],
                titanic_df.loc[index_notsurvived, 'Gender']+(np.random.rand(len(index_notsurvived))-0.5)*0.1,
                color='r', label='Not Survived', alpha=0.3)
sc = ax.scatter(titanic_df.loc[index_survived, 'AgeFill'],
                titanic_df.loc[index_survived, 'Gender']+(np.random.rand(len(index_survived))-0.5)*0.1,
                color='b', label='Survived', alpha=0.3)
ax.set_xlabel('AgeFill')
ax.set_ylabel('Gender')
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.legend(bbox_to_anchor=(1.4, 1.03))
Since 1 is male and 0 is female, and red marks the dead while blue marks the survivors, the distribution shows that a relatively large proportion of the women survived.
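Incidentally, the small random offset added to the 0/1 Gender values in the scatter plot is a jitter trick so that overlapping points stay visible; a quick standalone check shows the offset always stays within ±0.05:

```python
import numpy as np

np.random.seed(0)
# Same jitter expression as in the scatter call
jitter = (np.random.rand(1000) - 0.5) * 0.1

# np.random.rand is uniform on [0, 1), so jitter lies in [-0.05, 0.05)
print(jitter.min(), jitter.max())
```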
#Create a list of age and gender only
data2 = titanic_df.loc[:, ["AgeFill", "Gender"]].values
data2
result
array([[22. , 1. ],
[38. , 0. ],
[26. , 0. ],
...,
[29.69911765, 0. ],
[26. , 1. ],
[32. , 1. ]])
Let's draw a stacked histogram of survival by age.
split_data = []
for survived in [0,1]:
    split_data.append(titanic_df[titanic_df.Survived==survived])

temp = [i["AgeFill"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)
Since the missing ages were filled with the mean, the count in the middle bin is inflated. Let's graph it again using the data with the missing values excluded.
temp = [i["Age"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)
Check the survival rate of men and women with a stacked histogram.
temp = [i["Gender"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)
Again you can see that a much larger proportion of the women survived.
#Create a list of survival flags only
label2 = titanic_df.loc[:,["Survived"]].values
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression()
model2.fit(data2, label2)
result
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
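The two warnings are harmless here, but they can be silenced by naming a solver explicitly and flattening y with `ravel()`; a minimal sketch on stand-in data (made-up values shaped like `data2` / `label2`, not the actual Titanic frame):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in (AgeFill, Gender) -> Survived data
X = np.array([[22.0, 1], [38.0, 0], [26.0, 0], [35.0, 1], [54.0, 1], [4.0, 0]])
y = np.array([[0], [1], [1], [0], [0], [1]])    # column vector, like label2

model = LogisticRegression(solver='liblinear')  # explicit solver: no FutureWarning
model.fit(X, y.ravel())                         # 1-d y: no DataConversionWarning
```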
Predict a 30-year-old man.

model2.predict([[30,1]])

result

array([0])

A zero (death) prediction is returned. Let's look at the probability behind that judgment.

model2.predict_proba([[30,1]])

result

array([[0.80664059, 0.19335941]])

We can see a death probability of about 80% and a survival probability of about 20%.
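Under the hood, `predict_proba` is just the sigmoid of the linear score $w^T x + b$; a numpy sketch with illustrative coefficients (made up for the example, not the actual fitted Titanic weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights for [AgeFill, Gender] and intercept (not fitted values)
w = np.array([-0.01, -2.4])
b = 1.3

x = np.array([30.0, 1.0])        # a 30-year-old man
p_survive = sigmoid(w @ x + b)   # P(Survived = 1 | x)
p_die = 1.0 - p_survive          # first column of predict_proba
```

With these weights the score is negative for a man, so the death probability exceeds 0.5 and `predict` would return 0, matching the behaviour seen above.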
Related Sites
- Chapter 1: Linear Regression Model
- [Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
- [Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
- [Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
- [Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
- [Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
- [Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)