A memo from taking on Kaggle to study data analysis. I tried the "Titanic" tutorial, but there are many things I don't understand yet, such as pandas and scikit-learn. I thought the code below would do the job, but the score is not good.
The task, roughly explained at the risk of oversimplifying, is to predict which Titanic passengers survived. Train data and test data are given, and both contain features such as sex and age; however, the train data has the survival outcome (0/1) and the test data does not. *** In other words, the problem is to build a survival model from the train data and predict survival for the test data. *** (Correctness can be checked by submitting the predictions on the website.)
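Concretely, the train file has a Survived column and the test file does not. A toy illustration of that difference (made-up rows, not the real Kaggle data):

```python
import pandas as pd

# Train data carries the label we want to learn...
train = pd.DataFrame({'PassengerId': [1, 2], 'Sex': ['male', 'female'],
                      'Age': [22, 38], 'Survived': [0, 1]})
# ...while test data has the same features but no label.
test = pd.DataFrame({'PassengerId': [892, 893], 'Sex': ['male', 'female'],
                     'Age': [34, 47]})
print('Survived' in train.columns, 'Survived' in test.columns)  # True False
```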
import csv

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Preparation
def df_cleaner(df):
    # Fill in the missing parts
    # Age: fill missing values with the median age
    median_age = np.median(df[df['Age'].notnull()]['Age'])
    for passenger in df[df['Age'].isnull()].index:  # .index = positions of the nulls
        df.loc[passenger, 'Age'] = median_age
    # Fare: fill missing values with the median fare
    median_fare = np.median(df[df['Fare'].notnull()]['Fare'])
    for passenger in df[df['Fare'].isnull()].index:
        df.loc[passenger, 'Fare'] = median_fare
    # Convert string data to numeric data
    df.loc[df['Sex'] == 'male', 'Sex'] = 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1
    df.loc[df['Sex'].isnull(), 'Sex'] = 2
    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2
    df.loc[df['Embarked'].isnull(), 'Embarked'] = 3
    return df
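As an aside, the per-row loops in df_cleaner can be replaced by pandas' vectorized fillna and map. A minimal sketch of that alternative on toy data (df_cleaner_vectorized is my name, not from the original code):

```python
import numpy as np
import pandas as pd

def df_cleaner_vectorized(df):
    # Fill missing Age/Fare with their medians in one call each
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    # Map categories to integers; unmapped/missing values become NaN, so fill after
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).fillna(2).astype(int)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).fillna(3).astype(int)
    return df

toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0],
    'Fare': [7.25, 71.28, np.nan],
    'Sex': ['male', 'female', None],
    'Embarked': ['S', None, 'Q'],
})
clean = df_cleaner_vectorized(toy)
print(clean['Age'].tolist())  # missing Age replaced by the median of [22, 30] = 26
print(clean['Sex'].tolist())  # [0, 1, 2]
```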
# Let's make a CSV for submission
def make_csv(file_path, passengerId, predicts):
    # Open in text mode with newline='' (Python 3); 'wb' only worked in Python 2
    with open(file_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PassengerId", "Survived"])
        for row, survived in zip(passengerId, predicts):
            writer.writerow([row, survived])
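For what it's worth, pandas can write the same two-column submission file directly, without the csv module. A sketch (make_csv_pandas and the file name are illustrative):

```python
import pandas as pd

def make_csv_pandas(file_path, passenger_ids, predicts):
    # Build a two-column frame and let pandas handle quoting and newlines
    pd.DataFrame({'PassengerId': passenger_ids,
                  'Survived': predicts}).to_csv(file_path, index=False)

make_csv_pandas('submission.csv', [892, 893], [0, 1])
with open('submission.csv') as f:
    print(f.read())
```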
# Let's see the performance of the model we made
def getScore(answer, predicts):
    sum_p = 0.0
    total = 0.0
    for row, predict in zip(answer, predicts):
        if row == predict:
            sum_p += 1.0
        total += 1.0
    return sum_p / total
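getScore is just classification accuracy, which scikit-learn already provides as sklearn.metrics.accuracy_score; it should agree with the hand-rolled version:

```python
from sklearn.metrics import accuracy_score

answer = [1, 0, 1, 1, 0]
predicts = [1, 0, 0, 1, 0]
# 4 of the 5 predictions match the answers
print(accuracy_score(answer, predicts))  # 0.8
```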
def main():
    # Read in the training and test data
    train = pd.read_csv('./data/train.csv')
    test = pd.read_csv('./data/test.csv')
    # Drop the columns that (presumably) don't help prediction
    train.drop(['Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)
    test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    # Prepare
    train = df_cleaner(train)
    test = df_cleaner(test)
    x_train = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
    y_train = train[['Survived']]
    # Let's make models with random forests of varying size
    scores = []
    for trees in range(1, 100):
        model = RandomForestClassifier(n_estimators=trees)
        model.fit(x_train, np.ravel(y_train))
        # Let's see the match rate (measured on the training data itself)
        pre = model.predict(x_train)
        scores.append(getScore(y_train['Survived'], pre))
    plt.plot(scores, '-r')
    plt.show()
    # Prepare the actual test data (note: the model used below is the last
    # one from the loop, i.e. n_estimators=99, not the best-scoring one)
    x_test = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
    label = test[['PassengerId']]
    # Let's predict using the model
    output = model.predict(x_test)
    # Let's make a CSV for submission
    make_csv("./output/random_forest.csv", label['PassengerId'], output.astype(int))

if __name__ == '__main__':
    main()
The source code is on GitHub.
Score for the code above: 0.75120. Score for a copy-and-paste of the tutorial: 0.76555.
My version is worse... I don't think the algorithm itself is wrong, so it seems I need to look a little more closely at scikit-learn's random forest.
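One likely culprit: the loop scores each forest on the same rows it was trained on, so the plotted curve mostly measures memorization, not generalization. Cross-validation gives a more honest estimate. A sketch on synthetic data (the real Titanic CSVs aren't reproduced here, so the numbers are only illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(300, 7)  # stand-in for the 7 Titanic features
y = (X[:, 1] + 0.3 * rng.rand(300) > 0.65).astype(int)  # noisy label

model = RandomForestClassifier(n_estimators=100, random_state=0)
# Training accuracy: fit on everything, score on the same rows
model.fit(X, y)
train_acc = model.score(X, y)
# 5-fold cross-validated accuracy: each fold is scored on held-out rows
cv_acc = cross_val_score(model, X, y, cv=5).mean()
print(train_acc, cv_acc)  # training accuracy is near-perfect; CV is lower
```

The gap between the two numbers is the part of the plotted "match rate" that is pure overfitting.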
I also found a site that visualizes algorithms to make them easier to understand. It doesn't cover the Titanic problem, but the San Francisco problem there is related, so I'll list it here: Library of Algorithms, a site for understanding algorithms visually.