Ticket class: 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
We will proceed in the following 7 stages.
1. Clarify the questions and define the problem
2. Acquire the training and test data
3. Shape, create, and cleanse the data
4. Analyze patterns, identify trends, and perform exploratory data analysis
5. Model the problem, predict, and solve
6. Visualize and report the problem-solving steps and the final solution
7. Submit the results

The above flow is the general order, but there are exceptions.
Competitions like Kaggle define the problem to be solved, provide a dataset for training a model to solve it, and test the model's results against a test dataset. Kaggle defines this problem as follows:
(1) The training dataset is labeled with each passenger's outcome ("survived" or "died"); it can be used to build a model that predicts passenger survival.
(2) The test dataset is not labeled with the passengers' outcomes. You can predict them by applying the survival-prediction model from (1) to the test dataset.
(3) For this task, we build a model that predicts passenger survival from the training data, then apply it to the test data to predict each passenger's survival.
The data analysis workflow will meet seven key goals.
Classify the data into groups. A group is a collection of records with the same properties. Checking groups first, rather than individual records, reduces the time it takes to understand the nature of your data.
Find out how much influence each variable has on the result, and how strongly the variables are related to one another (collinearity). In this dataset, variables are measurable data such as name, age, sex, and fare. Variables are also called features, and they are the factors that influence the result.
To create a model, you need to transform the data depending on the algorithm you use. Some algorithms accept only datasets that consist entirely of numbers; others accept a mixture of numbers and categorical values (strings). In the former case, it is common to convert categorical values to binary or multi-valued numeric encodings.
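As a minimal sketch of such a conversion (using a tiny made-up DataFrame, since the actual dataset has not been loaded at this point):

```python
import pandas as pd

# Tiny illustrative DataFrame (not the actual Titanic data).
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Binary encoding: map each category to an integer.
df["Sex_code"] = df["Sex"].map({"male": 0, "female": 1})

# Multi-valued (one-hot) encoding: one 0/1 column per category.
dummies = pd.get_dummies(df["Sex"], prefix="Sex")

print(df["Sex_code"].tolist())  # [0, 1, 1, 0]
print(list(dummies.columns))    # ['Sex_female', 'Sex_male']
```

Binary mapping keeps a single column, while one-hot encoding avoids implying an order between categories.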
If the dataset contains missing values, they must be complemented appropriately during data conversion: for example, with the mean of the variable, with 0 (zero), or with the preceding or following values. Another option is to exclude variables that have too many missing values. Performing these operations properly will affect the accuracy of the model you create later.
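The complementing strategies above can be sketched as follows (with a tiny made-up Series, not the real Age column):

```python
import numpy as np
import pandas as pd

# Tiny illustrative Series with missing values (not the actual data).
age = pd.Series([22.0, np.nan, 35.0, np.nan, 28.0])

filled_mean = age.fillna(age.mean())  # complement with the variable's mean
filled_zero = age.fillna(0)           # complement with 0 (zero)
filled_prev = age.ffill()             # complement with the preceding value
```

Dropping a column with too many missing values would instead be `df.drop(columns=["Cabin"])`, for example.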
Similarly, if the dataset contains outliers, they must be handled appropriately during data conversion. They can be treated in the same ways as missing values; if the number of outliers is very small compared to the number of records, you may simply exclude those records. Another option is to exclude variables that contain too many outliers. Performing these operations properly will affect the accuracy of the model you create later.
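One way to detect and handle outliers (the 1.5×IQR rule of thumb, which the text does not prescribe; the data is a made-up sample) looks like this:

```python
import pandas as pd

# Tiny illustrative Series containing one extreme value (not the actual data).
fare = pd.Series([7.25, 8.05, 9.0, 10.5, 512.33])

# Flag values outside 1.5 * IQR beyond the quartiles.
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
mask = (fare >= q1 - 1.5 * iqr) & (fare <= q3 + 1.5 * iqr)

trimmed = fare[mask]                        # exclude the outlier records
capped = fare.clip(upper=q3 + 1.5 * iqr)    # or cap them instead of dropping
```

Whether to drop or cap depends on how many outliers there are relative to the data size, as noted above.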
Creating new variables from existing ones, or by combining existing variables, is also an effective method. For example, you can combine a pair of highly correlated variables into one, which may reduce the time it takes to create the model.
Visualizing the analysis results with graphs is an effective way to interpret them efficiently, and it can also support decision-making on the problem you want to solve.
Now, let's get the data. First, import the required libraries. (If a library is not installed, install it with a command like the following.)
pip3 install <package name>
# In a Jupyter notebook, prefixing a line with ! executes it as a UNIX command.
If the scikit-learn (sklearn) module is not installed, install it as follows.
pip3 install scikit-learn
The Pandas library is useful for tasks such as transforming data. First, use it to load the training and test datasets as DataFrames.
To see what files exist in a directory (folder), run the ls command. (This example uses data prepared in advance.) The training data is at ./8010_titanic_data/train.csv and the test data is at ./8010_titanic_data/test.csv. Running ls shows the files in the current directory.
2010_ml_introduction README.md
2010_ml_introduction.1 Titanic Analysis Excercise.ipynb
4010_Pandas Titanic Analysis.ipynb
4040_data_visualization titanic_data
4050_data_cleansing kernel.ipynb
5010_regression 5020_classfication Kaggle
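Loading the two CSV files into DataFrames would then look like the following. Since the real files are not bundled here, this sketch reads a tiny inline sample so it can run standalone; the commented lines show the actual paths from above:

```python
import io

import pandas as pd

# In the actual notebook you would read the files at the paths shown above:
# train_df = pd.read_csv("./8010_titanic_data/train.csv")
# test_df = pd.read_csv("./8010_titanic_data/test.csv")

# Tiny inline sample (made-up rows) so this sketch runs standalone.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
)
train_df = pd.read_csv(sample_csv)
print(train_df.shape)  # (2, 5)
```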
Pandas also helps you understand the nature of your data. This is necessary for deciding how to convert, complement, and modify the data; in other words, for setting a preprocessing policy.
Let's use Pandas to see which variables (features) exist in the dataset and which of them can be used. To distinguish them from variables in programming, the variables in the analysis will hereafter be referred to as "features".
Here is the code as an example.
It's hard to view everything at once, so let's look at the first 5 rows and the last 5 rows.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 1. Output the features (= column names) of train_df.
print(train_df.columns.values)
# 2. Output the first 5 rows of train_df.
train_df.head(5)
# 3. Output the last 5 rows of train_df.
train_df.tail(5)
# 4. Output the first 5 rows of test_df.
test_df.head(5)
# 5. Output the last 5 rows of test_df.
test_df.tail(5)
So far we have confirmed the feature names. Next, let's look at the values the features take. Features are either categorical (qualitative data) or numerical (quantitative data). Categorical data is further divided into the nominal scale and the ordinal scale.
['PassengerId' ,'Survived' ,'Pclass' ,'Name' ,'Sex' ,'Age' ,'SibSp' ,'Parch','Ticket' ,'Fare' ,'Cabin' ,'Embarked']
For this dataset, the nominal-scale features are Survived, Sex, and Embarked.
These values are stored as strings; for example, a Name value is "Ware, Mr. Frederick". Survived is recorded as 1 or 0, but could equally be expressed as Yes or No.
Ordinal data is data that indicates order. In this dataset, Pclass is ordinal.
Numerical values come in two types: discrete data and continuous data. SibSp and Parch are discrete; Age and Fare are continuous.
Each feature may contain missing values. So let's check the missing values.
To check for missing values, use the info method.
# 1. Use the info method to check for missing values in all features of train_df.
train_df.info()
The following output is given as an example. In train_df, there are 891 records in total, but Age has only 714 non-null values, Cabin only 204, and Embarked only 889. These features therefore contain missing values.
train_df.info()
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
Similarly, test_df has 418 records in total, but you can see that Age has only 332 non-null values, Fare 417, and Cabin 91.
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
To check for duplicated values, use the describe method.
train_df.describe()
To narrow down to just the categorical (object) columns:
train_df.describe(include=['O'])
# count is the number of records, unique is the number of distinct values, top is the most frequent value, and freq is how many times the top value occurs.
# For example, if you focus on the unique count for Sex (gender), you can see that it takes two values.
# This is because Sex is either male or female.
From these data, I would like to achieve the seven goals I mentioned earlier.
The feature Survived stores the passenger's outcome (result). Let's investigate to what extent other features such as Sex and Age are related to it. To find out, look at the correlations.
We will prioritize completing the data for variables likely to affect Survived (i.e., likely to have a strong relationship). First, Age seems related to Survived and needs to be complemented. Embarked may also be relevant.
1. Ticket has a high duplication rate (22%), and there may be no correlation between Ticket and Survived; Ticket may be excluded from the dataset used for analysis.
2. Cabin is highly incomplete in both the training and test datasets (it contains many null values) and may be removed.
3. PassengerId is a unique value identifying each passenger. It does not contribute to Survived and may be removed from the training dataset.
4. Name also takes a unique value identifying each passenger, so it may not contribute to Survived.
Creating a new feature called Family based on Parch and SibSp may be useful for obtaining the total family size.
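A minimal sketch of such a Family feature (using a tiny made-up DataFrame; adding 1 for the passenger themselves is one common convention, not prescribed by the text):

```python
import pandas as pd

# Tiny illustrative DataFrame (not the actual data).
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size = siblings/spouses + parents/children + the passenger.
df["Family"] = df["SibSp"] + df["Parch"] + 1

print(df["Family"].tolist())  # [2, 1, 6]
```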
You can use Name to extract Title as a new feature. For example, titles such as Mr and Mrs may correspond to different survival rates.
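Title extraction can be sketched with a regular expression that captures the word before a period (the two names here are made-up examples in the dataset's "Surname, Title. Given name" format):

```python
import pandas as pd

# Tiny illustrative DataFrame (not the actual data).
df = pd.DataFrame({"Name": ["Ware, Mr. Frederick", "Cumings, Mrs. John Bradley"]})

# Capture the word followed by a period as the Title.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

print(df["Title"].tolist())  # ['Mr', 'Mrs']
```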
You may be able to create a new feature by dividing Age into ranges, because continuous data becomes easier to predict with once converted to discrete data. Like Age, Fare may also be divided into ranges to create a new feature.
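Both kinds of binning can be sketched with pandas (the bin edges, labels, and sample values below are illustrative, not taken from the text):

```python
import pandas as pd

# Tiny illustrative samples (not the actual data).
age = pd.Series([2, 15, 34, 58, 71])
fare = pd.Series([7.25, 8.05, 13.0, 35.5, 80.0, 512.33])

# Fixed-width bands for Age with pd.cut.
age_band = pd.cut(age, bins=[0, 16, 32, 48, 64, 80], labels=[0, 1, 2, 3, 4])

# Equal-sized quantile bands for a skewed feature like Fare with pd.qcut.
fare_band = pd.qcut(fare, 2, labels=False)

print(list(age_band))        # [0, 0, 2, 3, 4]
print(fare_band.tolist())    # [0, 0, 0, 1, 1, 1]
```

`pd.cut` splits on fixed value ranges, while `pd.qcut` splits so each bin holds roughly the same number of records, which suits skewed distributions.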
Also, let's review the data based on the hypothesis at the beginning.
1. Females (Sex = female) are likely to have survived (Survived = 1).
2. Children (Age < ?) are likely to have survived (Survived = 1).
3. Upper-class passengers (Pclass = 1) are likely to have survived (Survived = 1).
Let's create pivot tables to understand the correlations between features. This can only be done for features without missing values, and it only makes sense for features that take categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch) values.
# 1. Create a pivot table for Pclass and Survived.
# Pclass is ordinal data from 1 to 3. Compute the mean of Survived for each value.
train_df[["Pclass", "Survived"]].groupby(["Pclass"], as_index=False).mean().sort_values(by="Survived", ascending=False)
# 2. Create a pivot table for Sex and Survived. Sex takes two values, female and male. Compute the mean of Survived for each.
train_df[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)
# 3. Create a pivot table for Parch and Survived. Parch takes several discrete values. Compute the mean of Survived for each.
train_df[["Parch", "Survived"]].groupby(["Parch"], as_index=False).mean().sort_values(by="Survived", ascending=False)
# 4. Create a pivot table for SibSp and Survived. SibSp takes several discrete values. Compute the mean of Survived for each.
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
For reference: [Summary of grouping / aggregation / conversion processing in Python pandas](http://sinhrks.hatenablog.com/entry/2014/10/13/005327), [Sorting pandas.DataFrame and Series with sort_values and sort_index](https://note.nkmk.me/python-pandas-sort-values-sort-index/)