・ For those who know the outline of machine learning
・ Or for those who have read "Machine learning from scratch (overview of machine learning)"
・ Understand Kaggle
・ Understand the actual flow of machine learning
・ Get hands-on practice with the Kaggle tutorial
・ Get hands-on practice with scikit-learn
This page is a re-edited version of a presentation. If you would like to see the original slides, they are here: https://www.edocr.com/v/vlzyelxe/tflare/Kaggle_-Machine-learning-to-learn-at-Kaggle
To sum it up without fear of being misleading: Kaggle is a site where companies and researchers post themes related to data science and machine learning, and participants compete to solve them. Some competitions carry prize money. Participants can also publish and explain their solution code, and communicate with each other through comments.
・ Books and similar tutorials tend to use datasets prepared for exposition, so it is hard to get a feel for real work.
・ You have to carry out even the steps that books abbreviate, so you come to understand the actual flow of machine learning.
・ The rankings keep you motivated. (You can compete and collaborate with data analysts around the world.)
・ Prize money is awarded. (Some competitions offer as much as $1.5 million.)
Predict whether passengers survived the sinking of the Titanic.
・ Training data (891 rows x 12 columns, CSV); some values are missing
・ Test data (418 rows x 11 columns, CSV); some values are missing
・ Train on the training data, then predict survival for the test data.
・ PassengerId: sequential row number
・ Survived: survival (0 = No, 1 = Yes); exists only in the training data
・ Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
・ Name: name
・ Sex: gender
・ Age: age
・ SibSp: number of siblings and spouses aboard the Titanic
・ Parch: number of parents and children aboard the Titanic
・ Ticket: ticket number
・ Fare: passenger fare
・ Cabin: cabin number
・ Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Execution code
```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv", dtype={"Age": np.float64})
test = pd.read_csv("test.csv", dtype={"Age": np.float64})

train.head(10)
```
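Before converting anything, it may help to check which columns actually contain the missing values mentioned above. This check is not in the original post, but here is a minimal sketch using the frames just loaded:

```python
# Count missing values per column: in this dataset Age, Cabin and
# Embarked are incomplete in train.csv; Age, Fare and Cabin in test.csv
print(train.isnull().sum())
print(test.isnull().sum())
```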
Execution code
```python
train_corr = train.corr()
train_corr
```
It looks like every column except PassengerId could be useful. Some of the data cannot be used for analysis as it is, so we convert it into usable (numeric) form. There is also missing data, so we correct that as well.
Execution code
```python
def correct_data(titanic_data):
    # Fill missing values and convert categorical columns to numbers
    titanic_data.Age = titanic_data.Age.fillna(titanic_data.Age.median())
    titanic_data.Sex = titanic_data.Sex.replace(['male', 'female'], [0, 1])
    titanic_data.Embarked = titanic_data.Embarked.fillna("S")
    titanic_data.Embarked = titanic_data.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
    titanic_data.Fare = titanic_data.Fare.fillna(titanic_data.Fare.median())
    return titanic_data

train_data = correct_data(train)
test_data = correct_data(test)
```
Execution code
```python
train_corr = train.corr()
train_corr
```
This time we will use the following features:
・ Ticket class
・ Gender
・ Age
・ Number of siblings and spouses aboard the Titanic
・ Number of parents and children aboard the Titanic
・ Passenger fare
・ Port of embarkation
・ Logistic regression
・ Support vector machine
・ k-nearest neighbors
・ Decision tree
・ Random forest
・ Neural network
References: see the following for details on each learning method. "Introduction to Machine Learning with Python" (O'Reilly, Japanese edition: Pythonではじめる機械学習) https://www.oreilly.co.jp/books/9784873117980/
Execution code
```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

models = []
models.append(("LogisticRegression", LogisticRegression()))
models.append(("SVC", SVC()))
models.append(("LinearSVC", LinearSVC()))
models.append(("KNeighbors", KNeighborsClassifier()))
models.append(("DecisionTree", DecisionTreeClassifier()))
models.append(("RandomForest", RandomForestClassifier()))
models.append(("MLPClassifier", MLPClassifier(solver='lbfgs', random_state=0)))
```
In cross-validation, the dataset is split into several parts (here, 3). Each part takes a turn as evaluation data while the model is trained on the rest, and averaging the scores gives a more stable measure of accuracy.
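As a minimal sketch of the splitting that `cv=3` implies (not part of the original post), scikit-learn's `KFold` shows the three train/evaluate index splits; note that for classifiers `cross_val_score` actually uses a stratified variant:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(kf.split(train_data[predictors])):
    # Each fold trains on two thirds of the rows and
    # evaluates on the remaining third
    print(fold, len(train_idx), len(test_idx))
```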
Execution code
```python
results = []
names = []
for name, model in models:
    # 3-fold cross-validation score for each model
    result = cross_val_score(model, train_data[predictors], train_data["Survived"], cv=3)
    names.append(name)
    results.append(result)
```
The three scores are averaged for evaluation. Random forest gave the best result.
Execution code
```python
for i in range(len(names)):
    print(names[i], results[i].mean())
```

Output:

```
LogisticRegression 0.785634118967
SVC 0.687991021324
LinearSVC 0.58810325477
KNeighbors 0.701459034792
DecisionTree 0.766554433221
RandomForest 0.796857463524
MLPClassifier 0.785634118967
```
Using the random forest that performed best, train on the training data, predict on the test data, and submit the result as a CSV.
Execution code
```python
alg = RandomForestClassifier()
alg.fit(train_data[predictors], train_data["Survived"])
predictions = alg.predict(test_data[predictors])

submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions
})
submission.to_csv('submission.csv', index=False)
```
Accuracy: 0.74163, which placed 7043rd out of 7922 participants. That is a little disappointing, so let's optimize.
With grid search, hyperparameters are optimized automatically by trying every combination in a given grid. Be aware that it can take a long time to run.
Execution code
```python
from sklearn.model_selection import GridSearchCV

parameters = {
    'n_estimators': [5, 10, 20, 30, 50, 100, 300],
    'max_depth': [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
    'random_state': [0],
}

gsc = GridSearchCV(RandomForestClassifier(), parameters, cv=3)
gsc.fit(train_data[predictors], train_data["Survived"])
```
Let's apply the parameters optimized above.
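The original post does not show this step. As one sketch, `GridSearchCV` exposes the winning combination via `best_params_`, and (with the default `refit=True`) `best_estimator_` is already retrained on the full training set:

```python
# The best hyperparameter combination found by the search
print(gsc.best_params_)

# best_estimator_ has been refit on all the training data,
# so it can predict on the test set directly
predictions = gsc.best_estimator_.predict(test_data[predictors])
```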
Accuracy: 0.77990, which moved me up to 4129th out of 7922 participants.
When I published the code on Kaggle, I received a comment: it would be better to compute the fill values for the missing data from the test data rather than from the training data. I tried it; the modified code is shown below.
Execution code
```python
def correct_data(train_data, test_data):
    # Fill missing values in the training data using medians
    # taken from the test data as well
    train_data.Age = train_data.Age.fillna(test_data.Age.median())
    train_data.Fare = train_data.Fare.fillna(test_data.Fare.median())
    test_data.Age = test_data.Age.fillna(test_data.Age.median())
    test_data.Fare = test_data.Fare.fillna(test_data.Fare.median())

    train_data = correct_data_common(train_data)
    test_data = correct_data_common(test_data)
    return train_data, test_data

def correct_data_common(titanic_data):
    # Convert categorical columns to numbers
    titanic_data.Sex = titanic_data.Sex.replace(['male', 'female'], [0, 1])
    titanic_data.Embarked = titanic_data.Embarked.fillna("S")
    titanic_data.Embarked = titanic_data.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
    return titanic_data

train_data, test_data = correct_data(train, test)
```
**Accuracy: 0.79426.** **Up to 2189th out of 7922 participants.**
・ Analyze the Name column. (It contains titles such as Mr., Mrs., and Miss, so survival may be guessed from them; see the sketch below.)
・ Use a different learning method (e.g. XGBoost, LightGBM).
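For the first idea, here is a hedged sketch of pulling the title out of the Name column; the regular expression and the set of titles kept are my assumptions, not from the original post:

```python
# Names look like "Braund, Mr. Owen Harris";
# capture the word that ends with a period
train["Title"] = train.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Collapse rare titles into one bucket before numeric encoding
common = {"Mr", "Mrs", "Miss", "Master"}
train["Title"] = train["Title"].where(train["Title"].isin(common), "Other")
print(train["Title"].value_counts())
```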
Prudential Life Insurance Assessment
・ Can you make buying life insurance easier?
・ Predict a risk level from the attributes of life insurance applicants
・ Prize of $30,000
・ Already finished (the code can still be referenced)
・ https://www.kaggle.com/c/prudential-life-insurance-assessment
Zillow Prize: Zillow's Home Value Prediction (Zestimate)
・ Can you improve the algorithm that changed the world of real estate?
・ Predict the error between the Zestimate and the actual sale price, taking all the features of a home into account
・ Prize of $1.2 million
・ Ends in 4 months
・ https://www.kaggle.com/c/zillow-prize-1