Me: a complete beginner who has never studied or touched machine learning. Machine learning is a technology I ought to know going forward, and I've been wanting to try it for a while. My stance is to aim for something that just runs, implementing without digging deep into the details. (Keeping it deliberately light, to lower the psychological hurdle of machine learning.)
First, make pandas and scikit-learn available in Python. Installing them with pip should be all it takes...
$ pip install pandas
Traceback (most recent call last):
File "/home/myuser/.local/bin/pip", line 7, in <module>
from pip._internal import main
ImportError: No module named 'pip._internal'
I don't understand the details, but nothing can move forward until pip works. Download get-pip.py from the official site:
$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Run it with python and python3:
$ sudo python get-pip.py
$ sudo python3 get-pip.py
Check if the pip command is available
$ pip --version
pip 20.2.4 from /Library/Python/3.7/site-packages/pip (python 3.7)
pip can now be used safely, so next install pandas and scikit-learn. It should be the same pip install as before:
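$ pip install pandas
$ pip install scikit-learn
↓ Confirm that the installation was successful: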
$ pip show pandas
Name: pandas
Version: 1.1.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location:
Requires: python-dateutil, numpy, pytz
Required-by:
$ pip show scikit-learn
Name: scikit-learn
Version: 0.23.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: None
Author-email: None
License: new BSD
Location:
Requires: joblib, threadpoolctl, scipy, numpy
Required-by: sklearn
A quick look suggests that machine learning roughly follows this flow: prepare the data, preprocess it, train a model, evaluate it, and make predictions.
For a first attempt, I'll try Kaggle's Titanic survival prediction, which I often see in introductions to machine learning.
Download the data (train.csv, test.csv, gender_submission.csv) from the Kaggle site (you need to register a Kaggle account to download it).
When I check the contents, it looks like this
>>> import pandas as pd
>>> gender_submission = pd.read_csv("./Data/gender_submission.csv")
>>> test = pd.read_csv("./Data/test.csv")
>>> train = pd.read_csv("./Data/train.csv")
>>>
>>> gender_submission.head(5)
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
>>> test.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
>>> train.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
gender_submission (PassengerId, Survived)
test (PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)
train (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)
In pandas, the corr() method computes the correlation coefficient between each pair of columns in a DataFrame. Incidentally, corr() supports three calculation methods, selected with the method argument. I can't judge which is appropriate this time, so I'll use the default.
+ 'pearson': Pearson product-moment correlation coefficient ← default
+ 'kendall': Kendall rank correlation coefficient
+ 'spearman': Spearman's rank correlation coefficient
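For example, Spearman could be selected just by passing the argument; a one-line sketch (not actually used below):

train.corr(method='spearman')  # rank-based correlation instead of the default Pearson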
>>> train_corr = train.corr()
>>> train_corr
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
It seems you can easily visualize this as a heatmap with a library called seaborn, so let's try it!
>>> import seaborn
>>> import matplotlib as mpl
>>> import matplotlib.pyplot as plt
>>>
>>> seaborn.heatmap(train_corr,vmax=1, vmin=-1, center=0)
<AxesSubplot:>
>>> plt.show()
I see, this is much easier to read. Maybe there's a fairly strong (negative) correlation between Pclass and Fare...? And Pclass and Fare seem weakly but genuinely correlated with Survived, the value we're predicting this time...?
Next, fill NA (missing) values and convert string columns such as Sex, Embarked, and Cabin to numbers. This time NAs are basically filled with the median, except that Embarked is filled with its most common value, "S". Cabin is reduced to its first letter (which probably represents the class of the cabin), and its NAs are filled with the most common letter, "C".
>>> train.Embarked.value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
↓ Correction function used
def CorrectTitanicData(df):
    # Age : NA -> median value
    df.Age = df.Age.fillna(df.Age.median())
    # Sex : male -> 0, female -> 1
    df.Sex = df.Sex.replace(['male', 'female'], [0, 1])
    # Embarked : NA -> S, then C -> 0, S -> 1, Q -> 2
    df.Embarked = df.Embarked.fillna("S")
    df.Embarked = df.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
    # Fare : NA -> median value
    df.Fare = df.Fare.fillna(df.Fare.median())
    # Cabin : NA -> C, keep only the deck letter, then A~G -> 0~6, T -> 7
    df.Cabin = df.Cabin.fillna('C')
    df.Cabin = df.Cabin.replace('A(.*)', 'A', regex=True)
    df.Cabin = df.Cabin.replace('B(.*)', 'B', regex=True)
    df.Cabin = df.Cabin.replace('C(.*)', 'C', regex=True)
    df.Cabin = df.Cabin.replace('D(.*)', 'D', regex=True)
    df.Cabin = df.Cabin.replace('E(.*)', 'E', regex=True)
    df.Cabin = df.Cabin.replace('F(.*)', 'F', regex=True)
    df.Cabin = df.Cabin.replace('G(.*)', 'G', regex=True)
    df.Cabin = df.Cabin.replace(['A','B','C','D','E','F','G','T'], [0,1,2,3,4,5,6,7])
    return df
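As an aside, since fillna runs first, the seven regex replacements just keep the first letter of Cabin, so they could probably be collapsed into one line (a sketch, not what was actually run):

df.Cabin = df.Cabin.str[0]  # e.g. "C85" -> "C"; single-letter cabins like "T" are unchanged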
Check the correlation between each column again after preprocessing
>>> train = CorrectTitanicData(train)
>>> train_corr = train.corr()
>>> train_corr
PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
PassengerId 1.000000 -0.005007 -0.035144 -0.042939 0.034212 -0.057527 -0.001652 0.012658 -0.035748 -0.017443
Survived -0.005007 1.000000 -0.338481 0.543351 -0.064910 -0.035322 0.081629 0.257307 0.080643 -0.125953
Pclass -0.035144 -0.338481 1.000000 -0.131900 -0.339898 0.083081 0.018443 -0.549500 0.009851 0.305762
Sex -0.042939 0.543351 -0.131900 1.000000 -0.081163 0.114631 0.245489 0.182333 0.070780 -0.022521
Age 0.034212 -0.064910 -0.339898 -0.081163 1.000000 -0.233296 -0.172482 0.096688 -0.032105 -0.040166
SibSp -0.057527 -0.035322 0.083081 0.114631 -0.233296 1.000000 0.414838 0.159651 0.000224 0.030874
Parch -0.001652 0.081629 0.018443 0.245489 -0.172482 0.414838 1.000000 0.216225 0.018232 -0.035957
Fare 0.012658 0.257307 -0.549500 0.182333 0.096688 0.159651 0.216225 1.000000 -0.098064 -0.268865
Cabin -0.035748 0.080643 0.009851 0.070780 -0.032105 0.000224 0.018232 -0.098064 1.000000 0.069852
Embarked -0.017443 -0.125953 0.305762 -0.022521 -0.040166 0.030874 -0.035957 -0.268865 0.069852 1.000000
>>>
>>> seaborn.heatmap(train_corr,vmax=1, vmin=-1, center=0)
<AxesSubplot:>
>>> plt.show()
Now it's clear that Sex has a stronger correlation with Survived.
The eight columns (other than PassengerId) used as predictors this time are Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, and Embarked. Cross-validation is performed with the following seven learning methods.
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.svm import SVC, LinearSVC
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.model_selection import cross_val_score
>>>
>>> predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"]
>>> models = []
>>> models.append(("LogisticRegression",LogisticRegression()))
>>> models.append(("SVC",SVC()))
>>> models.append(("LinearSVC",LinearSVC()))
>>> models.append(("KNeighbors",KNeighborsClassifier()))
>>> models.append(("DecisionTree",DecisionTreeClassifier()))
>>> models.append(("RandomForest",RandomForestClassifier()))
>>> models.append(("MLPClassifier",MLPClassifier(solver='lbfgs', random_state=0)))
>>>
>>> results = []
>>> names = []
>>>
>>> for name,model in models:
... result = cross_val_score(model, train[predictors], train["Survived"], cv=3)
... names.append(name)
... results.append(result)
...
>>> for i in range(len(names)):
... print(names[i],results[i].mean())
...
LogisticRegression 0.7811447811447811
SVC 0.6554433221099888
LinearSVC 0.7317620650953985
KNeighbors 0.7070707070707071
DecisionTree 0.7721661054994389
RandomForest 0.7957351290684623
MLPClassifier 0.7901234567901234
Random Forest seems to get the best score.
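Since cross_val_score returns the score for each fold, the spread per model could also be visualized with a quick boxplot, reusing the matplotlib already imported above (a sketch, not part of the run above):

# Sketch: compare per-fold CV scores across the seven models
plt.boxplot(results, labels=names)
plt.xticks(rotation=45)
plt.ylabel("cross-validation accuracy")
plt.tight_layout()
plt.show()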
Train a random forest on the training data, make predictions on the test data, and save the result in CSV format.
>>> test = pd.read_csv("./Data/test.csv")
>>> test = CorrectTitanicData(test)
>>> algorithm = RandomForestClassifier()
>>> algorithm.fit(train[predictors], train["Survived"])
RandomForestClassifier()
>>> predictions = algorithm.predict(test[predictors])
>>> submission = pd.DataFrame({
... "PassengerId":test["PassengerId"],
... "Survived":predictions
... })
>>> submission.to_csv("submission.csv", index=False)
Since I'd come this far, I went ahead and submitted it to Kaggle. The resulting score was 0.74162.
I'd like to keep raising the accuracy by trial and error from here, but that's it for this time. scikit-learn apparently has GridSearchCV for searching hyperparameters, so using it would likely push the accuracy up...
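For reference, a minimal GridSearchCV sketch for the random forest might look like the following (the parameter grid is only an illustrative guess, not a tuned choice):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid -- these values are guesses, not tuned
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(train[predictors], train["Survived"])
print(search.best_params_, search.best_score_)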