Me: a complete beginner who has never studied or touched machine learning. Machine learning is a technology I ought to know going forward, and I've been wanting to try it for a while. My stance is to aim for something that just runs, implementing without digging deep into the details. (Keeping it deliberately light, to lower the psychological hurdle of machine learning.)
First, make pandas and scikit-learn available in Python. Installing them with pip should be all it takes...
$ pip install pandas
Traceback (most recent call last):
File "/home/myuser/.local/bin/pip", line 7, in <module>
from pip._internal import main
ImportError: No module named 'pip._internal'
I don't understand the details, but nothing can move forward until pip works. Download get-pip.py from the official site:
$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Run it with python and python3:
$ sudo python get-pip.py
$ sudo python3 get-pip.py
Check if the pip command is available
$ pip --version
pip 20.2.4 from /Library/Python/3.7/site-packages/pip (python 3.7)
pip can now be used safely, so next install pandas and scikit-learn. It should be the same pip install as before:
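$ pip install pandas
$ pip install scikit-learn
↓ Confirm that the installation was successful: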
$ pip show pandas
Name: pandas
Version: 1.1.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location:
Requires: python-dateutil, numpy, pytz
Required-by:
$ pip show scikit-learn
Name: scikit-learn
Version: 0.23.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: None
Author-email: None
License: new BSD
Location:
Requires: joblib, threadpoolctl, scipy, numpy
Required-by: sklearn
A quick look suggests that machine learning roughly follows this flow: prepare the data, preprocess it, train a model, evaluate it, and make predictions.
For a first attempt, I'll try Kaggle's Titanic survival prediction, which I often see in introductions to machine learning.
Download the data (train.csv, test.csv, gender_submission.csv) from the Kaggle site (you need to register a Kaggle account to download it).
When I check the contents, it looks like this
>>> import pandas as pd
>>> gender_submission = pd.read_csv("./Data/gender_submission.csv")
>>> test = pd.read_csv("./Data/test.csv")
>>> train = pd.read_csv("./Data/train.csv")
>>>
>>> gender_submission.head(5)
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
>>> test.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
>>> train.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
gender_submission (PassengerId, Survived)
test (PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)
train (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)
In pandas, the corr() method computes the correlation coefficient between each pair of columns in a DataFrame. Incidentally, corr() supports three calculation methods, selected with the method argument. I can't judge which is appropriate this time, so I'll use the default.
+ 'pearson': Pearson product-moment correlation coefficient ← default
+ 'kendall': Kendall rank correlation coefficient
+ 'spearman': Spearman's rank correlation coefficient
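For example, Spearman could be selected just by passing the argument; a one-line sketch (not actually used below):

train.corr(method='spearman')  # rank-based correlation instead of the default Pearson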
>>> train_corr = train.corr()
>>> train_corr
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
It seems you can easily visualize this as a heatmap with a library called seaborn, so let's try it!
>>> import seaborn
>>> import matplotlib as mpl
>>> import matplotlib.pyplot as plt
>>>
>>> seaborn.heatmap(train_corr,vmax=1, vmin=-1, center=0)
<AxesSubplot:>
>>> plt.show()
I see, this is much easier to read. Maybe there's a fairly strong (negative) correlation between Pclass and Fare...? And Pclass and Fare seem weakly but genuinely correlated with Survived, the value we're predicting this time...?
Next, fill NA (missing) values and convert string columns such as Sex, Embarked, and Cabin to numbers. This time NAs are basically filled with the median, except that Embarked is filled with its most common value, "S". Cabin is reduced to its first letter (which probably represents the class of the cabin), and its NAs are filled with the most common letter, "C".
>>> train.Embarked.value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
↓ Correction function used
def CorrectTitanicData(df):
    # Age : NA -> median value
    df.Age = df.Age.fillna(df.Age.median())
    # Sex : male -> 0, female -> 1
    df.Sex = df.Sex.replace(['male', 'female'], [0, 1])
    # Embarked : NA -> S, then C -> 0, S -> 1, Q -> 2
    df.Embarked = df.Embarked.fillna("S")
    df.Embarked = df.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
    # Fare : NA -> median value
    df.Fare = df.Fare.fillna(df.Fare.median())
    # Cabin : NA -> C, keep only the deck letter, then A~G -> 0~6, T -> 7
    df.Cabin = df.Cabin.fillna('C')
    df.Cabin = df.Cabin.replace('A(.*)', 'A', regex=True)
    df.Cabin = df.Cabin.replace('B(.*)', 'B', regex=True)
    df.Cabin = df.Cabin.replace('C(.*)', 'C', regex=True)
    df.Cabin = df.Cabin.replace('D(.*)', 'D', regex=True)
    df.Cabin = df.Cabin.replace('E(.*)', 'E', regex=True)
    df.Cabin = df.Cabin.replace('F(.*)', 'F', regex=True)
    df.Cabin = df.Cabin.replace('G(.*)', 'G', regex=True)
    df.Cabin = df.Cabin.replace(['A','B','C','D','E','F','G','T'], [0,1,2,3,4,5,6,7])
    return df
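As an aside, since fillna runs first, the seven regex replacements just keep the first letter of Cabin, so they could probably be collapsed into one line (a sketch, not what was actually run):

df.Cabin = df.Cabin.str[0]  # e.g. "C85" -> "C"; single-letter cabins like "T" are unchanged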
Check the correlation between each column again after preprocessing
>>> train = CorrectTitanicData(train)
>>> train_corr = train.corr()
>>> train_corr
PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
PassengerId 1.000000 -0.005007 -0.035144 -0.042939 0.034212 -0.057527 -0.001652 0.012658 -0.035748 -0.017443
Survived -0.005007 1.000000 -0.338481 0.543351 -0.064910 -0.035322 0.081629 0.257307 0.080643 -0.125953
Pclass -0.035144 -0.338481 1.000000 -0.131900 -0.339898 0.083081 0.018443 -0.549500 0.009851 0.305762
Sex -0.042939 0.543351 -0.131900 1.000000 -0.081163 0.114631 0.245489 0.182333 0.070780 -0.022521
Age 0.034212 -0.064910 -0.339898 -0.081163 1.000000 -0.233296 -0.172482 0.096688 -0.032105 -0.040166
SibSp -0.057527 -0.035322 0.083081 0.114631 -0.233296 1.000000 0.414838 0.159651 0.000224 0.030874
Parch -0.001652 0.081629 0.018443 0.245489 -0.172482 0.414838 1.000000 0.216225 0.018232 -0.035957
Fare 0.012658 0.257307 -0.549500 0.182333 0.096688 0.159651 0.216225 1.000000 -0.098064 -0.268865
Cabin -0.035748 0.080643 0.009851 0.070780 -0.032105 0.000224 0.018232 -0.098064 1.000000 0.069852
Embarked -0.017443 -0.125953 0.305762 -0.022521 -0.040166 0.030874 -0.035957 -0.268865 0.069852 1.000000
>>>
>>> seaborn.heatmap(train_corr,vmax=1, vmin=-1, center=0)
<AxesSubplot:>
>>> plt.show()
Now it's clear that Sex has a stronger correlation with Survived.
The eight columns (other than PassengerId) used as predictors this time are Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, and Embarked. Cross-validation is performed with the following seven learning methods.
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.svm import SVC, LinearSVC
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.model_selection import cross_val_score
>>>
>>> predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"]
>>> models = []
>>> models.append(("LogisticRegression",LogisticRegression()))
>>> models.append(("SVC",SVC()))
>>> models.append(("LinearSVC",LinearSVC()))
>>> models.append(("KNeighbors",KNeighborsClassifier()))
>>> models.append(("DecisionTree",DecisionTreeClassifier()))
>>> models.append(("RandomForest",RandomForestClassifier()))
>>> models.append(("MLPClassifier",MLPClassifier(solver='lbfgs', random_state=0)))
>>>
>>> results = []
>>> names = []
>>>
>>> for name,model in models:
... result = cross_val_score(model, train[predictors], train["Survived"], cv=3)
... names.append(name)
... results.append(result)
...
>>> for i in range(len(names)):
... print(names[i],results[i].mean())
...
LogisticRegression 0.7811447811447811
SVC 0.6554433221099888
LinearSVC 0.7317620650953985
KNeighbors 0.7070707070707071
DecisionTree 0.7721661054994389
RandomForest 0.7957351290684623
MLPClassifier 0.7901234567901234
Random Forest seems to get the best score.
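Since cross_val_score returns the score for each fold, the spread per model could also be visualized with a quick boxplot, reusing the matplotlib already imported above (a sketch, not part of the run above):

# Sketch: compare per-fold CV scores across the seven models
plt.boxplot(results, labels=names)
plt.xticks(rotation=45)
plt.ylabel("cross-validation accuracy")
plt.tight_layout()
plt.show()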
Train a random forest on the training data, make predictions on the test data, and save the result in CSV format.
>>> test = pd.read_csv("./Data/test.csv")
>>> test = CorrectTitanicData(test)
>>> algorithm = RandomForestClassifier()
>>> algorithm.fit(train[predictors], train["Survived"])
RandomForestClassifier()
>>> predictions = algorithm.predict(test[predictors])
>>> submission = pd.DataFrame({
... "PassengerId":test["PassengerId"],
... "Survived":predictions
... })
>>> submission.to_csv("submission.csv", index=False)
Since I'd come this far, I went ahead and submitted it to Kaggle. The resulting score was 0.74162.
I'd like to keep raising the accuracy by trial and error from here, but that's it for this time. scikit-learn apparently has GridSearchCV for searching hyperparameters, so using it would likely push the accuracy up...
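For reference, a minimal GridSearchCV sketch for the random forest might look like the following (the parameter grid is only an illustrative guess, not a tuned choice):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid -- these values are guesses, not tuned
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(train[predictors], train["Survived"])
print(search.best_params_, search.best_score_)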