Ticket class: 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
We will proceed in the following 7 stages.
1. Clarify the questions and define the problem
2. Acquire the training and test data
3. Shape, create, and cleanse the data
4. Analyze patterns, identify trends, and perform exploratory data analysis
5. Model the problem, predict, and solve
6. Visualize and report the problem-solving steps and the final solution
7. Submit the results

The above flow is the general order, but there are exceptions.
Competitions like Kaggle define the problem to be solved, provide a dataset for training a model to solve it, and test the model's results against a test dataset. Kaggle defines this problem as follows:
(1) The training dataset is labeled with each passenger's outcome ("survived" or "died"); it can be used to build a model that predicts passenger survival.
(2) The test dataset is not labeled with the passengers' outcomes. You can predict them by applying the survival-prediction model from (1) to the test dataset.
(3) For this task, we build a model that predicts passenger survival from the training data, then apply it to the test data to predict each passenger's survival.
The data analysis workflow will meet seven key goals.
Classify the data into groups. A group is a collection of records with the same properties. Checking groups first, rather than individual records, reduces the time it takes to understand the nature of your data.
Find out how much influence each variable has on the result, and how strongly the variables are related to one another (collinearity). In this dataset, variables are measurable data such as name, age, sex, and fare. Variables are also called features, and they are the factors that influence the result.
To create a model, you need to transform the data depending on the algorithm you use. Some algorithms accept only datasets that consist entirely of numbers; others accept a mixture of numbers and categorical values (strings). In the former case, it is common to convert categorical values to binary or multi-valued numeric encodings.
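As a minimal sketch of such a conversion (using a tiny made-up DataFrame, since the actual dataset has not been loaded at this point):

```python
import pandas as pd

# Tiny illustrative DataFrame (not the actual Titanic data).
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Binary encoding: map each category to an integer.
df["Sex_code"] = df["Sex"].map({"male": 0, "female": 1})

# Multi-valued (one-hot) encoding: one 0/1 column per category.
dummies = pd.get_dummies(df["Sex"], prefix="Sex")

print(df["Sex_code"].tolist())  # [0, 1, 1, 0]
print(list(dummies.columns))    # ['Sex_female', 'Sex_male']
```

Binary mapping keeps a single column, while one-hot encoding avoids implying an order between categories.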
If the dataset contains missing values, they must be complemented appropriately during data conversion: for example, with the mean of the variable, with 0 (zero), or with the preceding or following values. Another option is to exclude variables that have too many missing values. Performing these operations properly will affect the accuracy of the model you create later.
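The complementing strategies above can be sketched as follows (with a tiny made-up Series, not the real Age column):

```python
import numpy as np
import pandas as pd

# Tiny illustrative Series with missing values (not the actual data).
age = pd.Series([22.0, np.nan, 35.0, np.nan, 28.0])

filled_mean = age.fillna(age.mean())  # complement with the variable's mean
filled_zero = age.fillna(0)           # complement with 0 (zero)
filled_prev = age.ffill()             # complement with the preceding value
```

Dropping a column with too many missing values would instead be `df.drop(columns=["Cabin"])`, for example.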
Similarly, if the dataset contains outliers, they must be handled appropriately during data conversion. They can be treated in the same ways as missing values; if the number of outliers is very small compared to the number of records, you may simply exclude those records. Another option is to exclude variables that contain too many outliers. Performing these operations properly will affect the accuracy of the model you create later.
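One way to detect and handle outliers (the 1.5×IQR rule of thumb, which the text does not prescribe; the data is a made-up sample) looks like this:

```python
import pandas as pd

# Tiny illustrative Series containing one extreme value (not the actual data).
fare = pd.Series([7.25, 8.05, 9.0, 10.5, 512.33])

# Flag values outside 1.5 * IQR beyond the quartiles.
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
mask = (fare >= q1 - 1.5 * iqr) & (fare <= q3 + 1.5 * iqr)

trimmed = fare[mask]                        # exclude the outlier records
capped = fare.clip(upper=q3 + 1.5 * iqr)    # or cap them instead of dropping
```

Whether to drop or cap depends on how many outliers there are relative to the data size, as noted above.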
Creating new variables from existing ones, or by combining existing variables, is also an effective method. For example, you can combine a pair of highly correlated variables into one, which may reduce the time it takes to create the model.
Visualizing the analysis results with graphs is an effective way to interpret them efficiently, and it can also support decision-making on the problem you want to solve.
Now, let's get the data. First, import the required libraries. (If a library is not installed, install it with a command like the following.)
pip3 install <package name>
# In a Jupyter notebook, prefixing a line with ! executes it as a UNIX command.
If the scikit-learn (sklearn) module is not installed, install it as follows.
pip3 install scikit-learn
The Pandas library is useful for tasks such as transforming data. First, use it to load the training and test datasets as DataFrames.
To see what files exist in a directory (folder), run the ls command. (This example uses data prepared in advance.) The training data is at ./8010_titanic_data/train.csv and the test data is at ./8010_titanic_data/test.csv. Running ls shows the files in the current directory.
2010_ml_introduction README.md
2010_ml_introduction.1 Titanic Analysis Excercise.ipynb
4010_Pandas Titanic Analysis.ipynb
4040_data_visualization titanic_data
4050_data_cleansing kernel.ipynb
5010_regression 5020_classfication Kaggle
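Loading the two CSV files into DataFrames would then look like the following. Since the real files are not bundled here, this sketch reads a tiny inline sample so it can run standalone; the commented lines show the actual paths from above:

```python
import io

import pandas as pd

# In the actual notebook you would read the files at the paths shown above:
# train_df = pd.read_csv("./8010_titanic_data/train.csv")
# test_df = pd.read_csv("./8010_titanic_data/test.csv")

# Tiny inline sample (made-up rows) so this sketch runs standalone.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
)
train_df = pd.read_csv(sample_csv)
print(train_df.shape)  # (2, 5)
```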
Pandas also helps you understand the nature of your data. This is necessary for deciding how to convert, complement, and modify the data; in other words, for setting a preprocessing policy.
Let's use Pandas to see which variables (features) exist in the dataset and which of them can be used. To distinguish them from variables in programming, the variables in the analysis will hereafter be referred to as "features".
Here is the code as an example.
It's hard to view everything at once, so let's look at the first 5 rows and the last 5 rows.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 1. Output the features (= column names) of train_df.
print(train_df.columns.values)
# 2. Output the first 5 rows of train_df.
train_df.head(5)
# 3. Output the last 5 rows of train_df.
train_df.tail(5)
# 4. Output the first 5 rows of test_df.
test_df.head(5)
# 5. Output the last 5 rows of test_df.
test_df.tail(5)
So far we have confirmed the feature names. Next, let's look at the values the features take. Features are either categorical (qualitative data) or numerical (quantitative data). Categorical data is further divided into the nominal scale and the ordinal scale.
['PassengerId' ,'Survived' ,'Pclass' ,'Name' ,'Sex' ,'Age' ,'SibSp' ,'Parch','Ticket' ,'Fare' ,'Cabin' ,'Embarked']
For this dataset, the nominal-scale features are Survived, Sex, and Embarked.
These values are stored as strings; for example, a Name value is "Ware, Mr. Frederick". Survived is recorded as 1 or 0, but could equally be expressed as Yes or No.
Ordinal data is data that indicates order. In this dataset, Pclass is ordinal.
Numerical values come in two types: discrete data and continuous data. SibSp and Parch are discrete; Age and Fare are continuous.
Each feature may contain missing values. So let's check the missing values.
To check for missing values, use the info method.
# 1. Use the info method to check for missing values in all features of train_df.
train_df.info()
The following output is given as an example. In train_df, there are 891 records in total, but Age has only 714 non-null values, Cabin only 204, and Embarked only 889. These features therefore contain missing values.
train_df.info()
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
Similarly, test_df has 418 records in total, but you can see that Age has only 332 non-null values, Fare 417, and Cabin 91.
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
To check for duplicated values, use the describe method.
train_df.describe()
To narrow down to just the categorical (object) columns:
train_df.describe(include=['O'])
# count is the number of records, unique is the number of distinct values, top is the most frequent value, and freq is how many times the top value occurs.
# For example, if you focus on the unique count for Sex (gender), you can see that it takes two values.
# This is because Sex is either male or female.
From these data, I would like to achieve the seven goals I mentioned earlier.
The feature Survived stores the passenger's outcome (result). Let's investigate to what extent other features such as Sex and Age are related to it. To find out, look at the correlations.
We will prioritize completing the data for variables likely to affect Survived (i.e., likely to have a strong relationship). First, Age seems related to Survived and needs to be complemented. Embarked may also be relevant.
1. Ticket has a high duplication rate (22%), and there may be no correlation between Ticket and Survived; Ticket may be excluded from the dataset used for analysis.
2. Cabin is highly incomplete in both the training and test datasets (it contains many null values) and may be removed.
3. PassengerId is a unique value identifying each passenger. It does not contribute to Survived and may be removed from the training dataset.
4. Name also takes a unique value identifying each passenger, so it may not contribute to Survived.
Creating a new feature called Family based on Parch and SibSp may be useful for obtaining the total family size.
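A minimal sketch of such a Family feature (using a tiny made-up DataFrame; adding 1 for the passenger themselves is one common convention, not prescribed by the text):

```python
import pandas as pd

# Tiny illustrative DataFrame (not the actual data).
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size = siblings/spouses + parents/children + the passenger.
df["Family"] = df["SibSp"] + df["Parch"] + 1

print(df["Family"].tolist())  # [2, 1, 6]
```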
You can use Name to extract Title as a new feature. For example, titles such as Mr and Mrs may correspond to different survival rates.
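Title extraction can be sketched with a regular expression that captures the word before a period (the two names here are made-up examples in the dataset's "Surname, Title. Given name" format):

```python
import pandas as pd

# Tiny illustrative DataFrame (not the actual data).
df = pd.DataFrame({"Name": ["Ware, Mr. Frederick", "Cumings, Mrs. John Bradley"]})

# Capture the word followed by a period as the Title.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

print(df["Title"].tolist())  # ['Mr', 'Mrs']
```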
You may be able to create a new feature by dividing Age into ranges, because continuous data becomes easier to predict with once converted to discrete data. Like Age, Fare may also be divided into ranges to create a new feature.
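Both kinds of binning can be sketched with pandas (the bin edges, labels, and sample values below are illustrative, not taken from the text):

```python
import pandas as pd

# Tiny illustrative samples (not the actual data).
age = pd.Series([2, 15, 34, 58, 71])
fare = pd.Series([7.25, 8.05, 13.0, 35.5, 80.0, 512.33])

# Fixed-width bands for Age with pd.cut.
age_band = pd.cut(age, bins=[0, 16, 32, 48, 64, 80], labels=[0, 1, 2, 3, 4])

# Equal-sized quantile bands for a skewed feature like Fare with pd.qcut.
fare_band = pd.qcut(fare, 2, labels=False)

print(list(age_band))        # [0, 0, 2, 3, 4]
print(fare_band.tolist())    # [0, 0, 0, 1, 1, 1]
```

`pd.cut` splits on fixed value ranges, while `pd.qcut` splits so each bin holds roughly the same number of records, which suits skewed distributions.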
Also, let's review the data based on the hypothesis at the beginning.
1. Females (Sex = female) are likely to have survived (Survived = 1).
2. Children (Age < ?) are likely to have survived (Survived = 1).
3. Upper-class passengers (Pclass = 1) are likely to have survived (Survived = 1).
Let's create pivot tables to understand the correlations between features. This can only be done for features without missing values, and it only makes sense for features that take categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch) values.
# 1. Create a pivot table for Pclass and Survived.
# Pclass is ordinal data from 1 to 3. Compute the mean of Survived for each value.
train_df[["Pclass", "Survived"]].groupby(["Pclass"], as_index=False).mean().sort_values(by="Survived", ascending=False)
# 2. Create a pivot table for Sex and Survived. Sex takes two values, female and male. Compute the mean of Survived for each.
train_df[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)
# 3. Create a pivot table for Parch and Survived. Parch takes several discrete values. Compute the mean of Survived for each.
train_df[["Parch", "Survived"]].groupby(["Parch"], as_index=False).mean().sort_values(by="Survived", ascending=False)
# 4. Create a pivot table for SibSp and Survived. SibSp takes several discrete values. Compute the mean of Survived for each.
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
For reference: [Summary of grouping / aggregation / conversion processing in Python pandas](http://sinhrks.hatenablog.com/entry/2014/10/13/005327), [Sorting pandas.DataFrame and Series with sort_values and sort_index](https://note.nkmk.me/python-pandas-sort-values-sort-index/)