A memo from taking on Kaggle to study data analysis. I tried the "Titanic" tutorial, but there are many things I don't understand yet, such as pandas and scikit-learn. I thought the code below would do the job, but the score is not good.
The task, roughly explained at the risk of oversimplifying, is to predict which Titanic passengers survived. Train data and test data are given, and both contain features such as sex and age; however, the train data has the survival outcome (0/1) and the test data does not. *** In other words, the problem is to build a survival model from the train data and predict survival for the test data. *** (Correctness can be checked by submitting the predictions on the website.)
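Concretely, the train file has a Survived column and the test file does not. A toy illustration of that difference (made-up rows, not the real Kaggle data):

```python
import pandas as pd

# Train data carries the label we want to learn...
train = pd.DataFrame({'PassengerId': [1, 2], 'Sex': ['male', 'female'],
                      'Age': [22, 38], 'Survived': [0, 1]})
# ...while test data has the same features but no label.
test = pd.DataFrame({'PassengerId': [892, 893], 'Sex': ['male', 'female'],
                     'Age': [34, 47]})
print('Survived' in train.columns, 'Survived' in test.columns)  # True False
```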
import csv

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Preparation
def df_cleaner(df):
    # Fill in the missing parts
    # Age: fill missing values with the median age
    median_age = np.median(df[df['Age'].notnull()]['Age'])
    for passenger in df[df['Age'].isnull()].index:  # .index = positions of the nulls
        df.loc[passenger, 'Age'] = median_age
    # Fare: fill missing values with the median fare
    median_fare = np.median(df[df['Fare'].notnull()]['Fare'])
    for passenger in df[df['Fare'].isnull()].index:
        df.loc[passenger, 'Fare'] = median_fare
    # Convert string data to numeric data
    df.loc[df['Sex'] == 'male', 'Sex'] = 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1
    df.loc[df['Sex'].isnull(), 'Sex'] = 2
    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2
    df.loc[df['Embarked'].isnull(), 'Embarked'] = 3
    return df
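As an aside, the per-row loops in df_cleaner can be replaced by pandas' vectorized fillna and map. A minimal sketch of that alternative on toy data (df_cleaner_vectorized is my name, not from the original code):

```python
import numpy as np
import pandas as pd

def df_cleaner_vectorized(df):
    # Fill missing Age/Fare with their medians in one call each
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    # Map categories to integers; unmapped/missing values become NaN, so fill after
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).fillna(2).astype(int)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).fillna(3).astype(int)
    return df

toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0],
    'Fare': [7.25, 71.28, np.nan],
    'Sex': ['male', 'female', None],
    'Embarked': ['S', None, 'Q'],
})
clean = df_cleaner_vectorized(toy)
print(clean['Age'].tolist())  # missing Age replaced by the median of [22, 30] = 26
print(clean['Sex'].tolist())  # [0, 1, 2]
```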
# Let's make a CSV for submission
def make_csv(file_path, passengerId, predicts):
    # Open in text mode with newline='' (Python 3); 'wb' only worked in Python 2
    with open(file_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PassengerId", "Survived"])
        for row, survived in zip(passengerId, predicts):
            writer.writerow([row, survived])
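For what it's worth, pandas can write the same two-column submission file directly, without the csv module. A sketch (make_csv_pandas and the file name are illustrative):

```python
import pandas as pd

def make_csv_pandas(file_path, passenger_ids, predicts):
    # Build a two-column frame and let pandas handle quoting and newlines
    pd.DataFrame({'PassengerId': passenger_ids,
                  'Survived': predicts}).to_csv(file_path, index=False)

make_csv_pandas('submission.csv', [892, 893], [0, 1])
with open('submission.csv') as f:
    print(f.read())
```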
# Let's see the performance of the model we made
def getScore(answer, predicts):
    sum_p = 0.0
    total = 0.0
    for row, predict in zip(answer, predicts):
        if row == predict:
            sum_p += 1.0
        total += 1.0
    return sum_p / total
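getScore is just classification accuracy, which scikit-learn already provides as sklearn.metrics.accuracy_score; it should agree with the hand-rolled version:

```python
from sklearn.metrics import accuracy_score

answer = [1, 0, 1, 1, 0]
predicts = [1, 0, 0, 1, 0]
# 4 of the 5 predictions match the answers
print(accuracy_score(answer, predicts))  # 0.8
```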
def main():
    # Read in the training and test data
    train = pd.read_csv('./data/train.csv')
    test = pd.read_csv('./data/test.csv')
    # Drop the columns that (presumably) don't help prediction
    train.drop(['Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)
    test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    # Prepare
    train = df_cleaner(train)
    test = df_cleaner(test)
    x_train = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
    y_train = train[['Survived']]
    # Let's make models with random forests of varying size
    scores = []
    for trees in range(1, 100):
        model = RandomForestClassifier(n_estimators=trees)
        model.fit(x_train, np.ravel(y_train))
        # Let's see the match rate (measured on the training data itself)
        pre = model.predict(x_train)
        scores.append(getScore(y_train['Survived'], pre))
    plt.plot(scores, '-r')
    plt.show()
    # Prepare the actual test data (note: the model used below is the last
    # one from the loop, i.e. n_estimators=99, not the best-scoring one)
    x_test = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
    label = test[['PassengerId']]
    # Let's predict using the model
    output = model.predict(x_test)
    # Let's make a CSV for submission
    make_csv("./output/random_forest.csv", label['PassengerId'], output.astype(int))

if __name__ == '__main__':
    main()
The source code is on GitHub.
Score for the code above: 0.75120. Score for a copy-and-paste of the tutorial: 0.76555.
My version is worse... I don't think the algorithm itself is wrong, so it seems I need to look a little more closely at scikit-learn's random forest.
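One likely culprit: the loop scores each forest on the same rows it was trained on, so the plotted curve mostly measures memorization, not generalization. Cross-validation gives a more honest estimate. A sketch on synthetic data (the real Titanic CSVs aren't reproduced here, so the numbers are only illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(300, 7)  # stand-in for the 7 Titanic features
y = (X[:, 1] + 0.3 * rng.rand(300) > 0.65).astype(int)  # noisy label

model = RandomForestClassifier(n_estimators=100, random_state=0)
# Training accuracy: fit on everything, score on the same rows
model.fit(X, y)
train_acc = model.score(X, y)
# 5-fold cross-validated accuracy: each fold is scored on held-out rows
cv_acc = cross_val_score(model, X, y, cv=5).mean()
print(train_acc, cv_acc)  # training accuracy is near-perfect; CV is lower
```

The gap between the two numbers is the part of the plotted "match rate" that is pure overfitting.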
I also found a site that visualizes algorithms to make them easier to understand. It doesn't cover the Titanic problem, but the San Francisco problem there is related, so I'll list it here: Library of Algorithms, a site for understanding algorithms visually.