--Premise: This is a short article, split off as a continuation of the Modeling article.
--Purpose: Walk through making predictions on the test data and evaluating the model, skipping the mathematical details. (The premise throughout is that evaluation is done on Kaggle.)
--Environment: Kaggle Kernel Notebook
Using the model created in the previous article, we will predict whether each passenger in the test data survived.
First, prepare the test data. When analyzing real data, you normally need to split the original data into training and test sets yourself, but Kaggle has already done the split, so download the test data (test.csv) from the Competition page.
import pandas as pd

test_csv = pd.read_csv('../input/titanic/test.csv', sep=',')
test_csv.head()
Just in case, let's check the overall shape of the test data.
#Dimensional confirmation
test_csv.shape
#Output result
(418, 11)
#Check the number of missing data
test_csv.isnull().sum()
#Output result
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
There were no missing values for Fare in the training data, but there is exactly one in the test data. As we did for Age during modeling, we will fill the missing values with the mean.
The test data will be formatted in the same way as the training data.
#Get rid of unnecessary columns
test = test_csv.drop(['Name', 'SibSp', 'Ticket', 'Cabin'] , axis=1)
#Make a female dummy
test['Female'] = test['Sex'].map(lambda x: 0 if x == 'male' else 1 ).astype(int)
#Make a dummy with Parch 0 and above
test['Parch_d'] = test['Parch'].map(lambda x: 0 if x == 0 else 1).astype(int)
#Embarked makes a dummy with S and others
test['Embarked_S'] = test['Embarked'].map(lambda x: 1 if x == 'S' else 0).astype(int)
#Fill in missing values for Age
test['Age'].fillna(test['Age'].mean(), inplace=True)
#Fill in missing values for Fare
test['Fare'].fillna(test['Fare'].mean(), inplace=True)
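As a sanity check, we can confirm that this preprocessing leaves no missing values. The sketch below uses a small toy DataFrame standing in for test.csv (the rows are made up for illustration), and applies the same transformations as above:

```python
import pandas as pd

# Toy stand-in for test.csv (hypothetical rows, not the real data)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male'],
    'Parch': [0, 2, 0],
    'Embarked': ['S', 'C', 'S'],
    'Age': [22.0, None, 30.0],
    'Fare': [7.25, 71.28, None],
})

# Same transformations as in the main text
toy['Female'] = toy['Sex'].map(lambda x: 0 if x == 'male' else 1).astype(int)
toy['Parch_d'] = toy['Parch'].map(lambda x: 0 if x == 0 else 1).astype(int)
toy['Embarked_S'] = toy['Embarked'].map(lambda x: 1 if x == 'S' else 0).astype(int)
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())
toy['Fare'] = toy['Fare'].fillna(toy['Fare'].mean())

print(toy.isnull().sum().sum())  # 0 -> no missing values remain
```

Running the same check on the real `test` DataFrame should also report zero missing values.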
Now let's make a prediction using the model we created last time.
#Predict (test_x selects the same feature columns used when training the model;
#the exact drop list here assumes the feature set from the previous article)
test_x = test.drop(['PassengerId', 'Sex', 'Parch', 'Embarked'], axis=1)
predict = model.predict(test_x)
predict[:10]
#Output result
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
Note that the output is returned as a NumPy array.
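Since the predictions come back as a NumPy array, a quick sanity check is easy. A minimal sketch using the first ten predictions shown above:

```python
import numpy as np

# The first 10 predictions shown above
predict = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

print(type(predict))         # <class 'numpy.ndarray'>
print((predict == 1).sum())  # 3 -> predicted survivors among these 10
```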
Now, let's verify how good the model is. On Kaggle, when you submit your predictions, a Score is returned, and the model is evaluated by that value. So we will create the data for submission.
submit_csv = pd.concat([test['PassengerId'], pd.Series(predict)], axis=1)
submit_csv.columns = ['PassengerId', 'Survived']
submit_csv.to_csv('./submission.csv', index=False)
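Before submitting, it helps to sanity-check the submission DataFrame's shape and columns. A minimal sketch with dummy values (the PassengerId values and predictions below are illustrative, not the real ones):

```python
import numpy as np
import pandas as pd

# Dummy stand-ins for the real PassengerId column and predictions
passenger_id = pd.Series([892, 893, 894], name='PassengerId')
predict = np.array([0, 1, 0])

submit_csv = pd.concat([passenger_id, pd.Series(predict)], axis=1)
submit_csv.columns = ['PassengerId', 'Survived']

print(submit_csv.shape)          # (3, 2) -- for the real test data this should be (418, 2)
print(list(submit_csv.columns))  # ['PassengerId', 'Survived']
```

For the real data, the shape should be (418, 2): one row per test passenger, with exactly the two columns Kaggle expects.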
Create data like this and submit it from the Competition page. The result was like this.
Since the correct labels for the test data cannot be obtained on Kaggle, I will instead introduce Accuracy as an evaluation metric for the case where correct labels are available.
Accuracy indicates how well the predictions match the actual data. It is calculated as: Accuracy = (number of correctly predicted samples) / (total number of samples).
Here, a correctly predicted sample is one whose actual label 1 was predicted as 1, or whose actual label 0 was predicted as 0.
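The definition above can be sketched with a toy example (the labels here are made up for illustration):

```python
# Toy example: compute accuracy by hand
actual    = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]

# Count samples where the prediction matches the actual label
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)
print(accuracy)  # 0.8 -> 4 of 5 samples predicted correctly
```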
Let's assume that submit_csv also contains a column 'Survived_test', which indicates whether each passenger actually survived.
pd.crosstab(submit_csv['Survived'], submit_csv['Survived_test'])
#Output result(Assumption)
Survived_test  0  1
Survived
0              a  b
1              c  d
The output result should look like the above. Accuracy is then obtained as Accuracy = (a + d) / (a + b + c + d).
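Putting the crosstab and the formula together, accuracy is the sum of the diagonal cells (a and d) divided by the total. A sketch with hypothetical labels (the 'Survived_test' values here are assumed, since the real ones are not available):

```python
import pandas as pd

# Hypothetical predictions vs. assumed ground truth
df = pd.DataFrame({
    'Survived':      [0, 0, 1, 1, 0, 1],
    'Survived_test': [0, 1, 1, 1, 0, 0],
})

ct = pd.crosstab(df['Survived'], df['Survived_test'])

# Diagonal cells (a and d) are the correct predictions
accuracy = (ct.iloc[0, 0] + ct.iloc[1, 1]) / ct.values.sum()
print(accuracy)  # 4 correct out of 6 samples
```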
There are other evaluation metrics for models as well, and they should be chosen according to the purpose of the model.
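For example, precision and recall are common alternatives. A sketch using scikit-learn (assumed to be available, as it is in the Kaggle Kernel environment; the labels are made up for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

print(accuracy_score(actual, predicted))   # fraction of all samples predicted correctly
print(precision_score(actual, predicted))  # of samples predicted 1, fraction actually 1
print(recall_score(actual, predicted))     # of samples actually 1, fraction predicted 1
```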
――I made a mistake along the way and built and evaluated a model without Embarked_S, but that one actually scored better.
――Since this is just a trial, the modeling was somewhat intuitive, but I did build it with a hypothesis in mind.
――Next time, I would like to explain logistic regression.