This is the story of my first time participating in a Kaggle competition. In the previous post, "Learning with Kaggle's Titanic," I went through all the steps from learning to submitting, but the result (76%) fell short of the sample submission. This time, I would like to scrutinize the data in order to raise my score in the Titanic competition.
Last time I picked the training features somewhat arbitrarily and trained on "Pclass", "Sex", "Age", and "Fare". This time, I would like to select the features used for training on firmer grounds.
A scatter plot is useful for checking correlation. A scatter plot is a graph like the one below.
If the scatter plot looks like the one above, you can see an upward-sloping relationship. For example, if the horizontal axis is height and the vertical axis is weight, taller people tend to be heavier, so the scatter plot should look like the one above. (I expect.)
However, a scatter plot is effective only when both the horizontal and vertical axes are quantitative variables. If you make a scatter plot of Titanic's "sex" (a qualitative variable) against "survival" (also a qualitative variable), the horizontal axis has only the two values "male" and "female", and the vertical axis only the two values "0" and "1". The plot collapses to at most four points, which is not very meaningful. The appropriate way to examine correlation depends on the type of data (quantitative variable vs. qualitative variable).
Earlier, the terms "quantitative variable" and "qualitative variable" came up. Variables are classified into the following scales according to their properties.
Variable type | Scale type | Meaning | Example |
---|---|---|---|
Qualitative variable | Nominal scale | Used only to distinguish categories | Sex, prefecture |
Qualitative variable | Ordinal scale | Only the order (magnitude relation) is meaningful | Rank, seismic intensity |
Quantitative variable | Interval scale | Evenly spaced; differences are meaningful | Temperature, calendar year |
Quantitative variable | Ratio scale | Differences and ratios are meaningful (0 is the origin) | Height, price |
Scatter plots are useful when both variables are quantitative. I think they are also reasonably effective for a quantitative variable against an ordinal scale. They are not very effective when one side is a nominal scale. For example, plotting Titanic's "survival (0, 1)" against "fare" on a scatter plot looks like this:
The horizontal axis (survival) takes only the two values 0 and 1, so no correlation is visible. For nominal scales, scatter plots are unlikely to be effective.
Besides scatter plots, correlation can also be quantified with a "correlation coefficient". Let's look at that here.
The correlation coefficient is an indicator of how strongly two sets of data are related. numpy has a function for computing it.
numpy.corrcoef(x, y)[0, 1]
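As a small illustration (toy numbers I made up, not Titanic data), two arrays that rise together in a straight line give a coefficient of 1.0:

```python
import numpy

# Hypothetical height-like and weight-like values that rise together.
x = numpy.array([150, 160, 170, 180])
y = numpy.array([50, 58, 66, 74])
r = numpy.corrcoef(x, y)[0, 1]
print(r)  # close to 1.0: the toy data is perfectly linear
```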
numpy's corrcoef is, more precisely, "Pearson's product-moment correlation coefficient", which applies to correlation between quantitative variables as described later. There are other measures, such as "Cramér's V (coefficient of association)" and the "correlation ratio", each used for a particular combination of scales. I summarized the variable scales and the corresponding measure below.
Variable 1 | Variable 2 | Measure used | Titanic variables |
---|---|---|---|
Nominal scale | Nominal scale | Cramér's V | Sex, ticket number, cabin number, port of embarkation |
Nominal scale | Ordinal scale | Rank correlation ratio | Ticket class |
Nominal scale | Quantitative variable | Correlation ratio | Age, SibSp, Parch, Fare |
Below are code samples for "Cramér's V" and the "correlation ratio". I think the "rank correlation ratio" comes out almost the same as the correlation ratio.
import numpy
import pandas
######################################
# Cramer's V (coefficient of association)
#   >= 0.5  : very strong correlation
#   >= 0.25 : strong correlation
#   >= 0.1  : slightly weak correlation
#   <  0.1  : no correlation
######################################
def cramersV(x, y):
    """
    Calc Cramer's V.

    Parameters
    ----------
    x : {numpy.ndarray, pandas.Series}
    y : {numpy.ndarray, pandas.Series}
    """
    table = numpy.array(pandas.crosstab(x, y)).astype(numpy.float32)
    n = table.sum()
    colsum = table.sum(axis=0)
    rowsum = table.sum(axis=1)
    expect = numpy.outer(rowsum, colsum) / n
    chisq = numpy.sum((table - expect) ** 2 / expect)
    return numpy.sqrt(chisq / (n * (numpy.min(table.shape) - 1)))
######################################
# Correlation ratio
#   >= 0.5  : very strong correlation
#   >= 0.25 : strong correlation
#   >= 0.1  : slightly weak correlation
#   <  0.1  : no correlation
######################################
def CorrelationV(x, y):
    """
    Calc correlation ratio (eta squared).

    Parameters
    ----------
    x : nominal scale {numpy.ndarray, pandas.Series}
    y : ratio scale {numpy.ndarray, pandas.Series}
    """
    # total variation of y
    variation = ((y - y.mean()) ** 2).sum()
    # within-class variation: deviation from each class mean
    within_class = sum([((y[x == i] - y[x == i].mean()) ** 2).sum() for i in numpy.unique(x)])
    # eta^2 = 1 - (within-class variation / total variation)
    return 1 - within_class / variation
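As a quick sanity check (synthetic data I made up, not the Titanic set), perfectly associated labels should give a Cramér's V of 1.0, and groups with no within-group spread should give a correlation ratio of 1.0. Inlining the same formulas as above:

```python
import numpy
import pandas

# Perfectly associated labels -> Cramer's V should be 1.0
x = pandas.Series([0, 0, 1, 1])
y = pandas.Series(['a', 'a', 'b', 'b'])
table = numpy.array(pandas.crosstab(x, y)).astype(numpy.float32)
n = table.sum()
expect = numpy.outer(table.sum(axis=1), table.sum(axis=0)) / n
chisq = numpy.sum((table - expect) ** 2 / expect)
v = numpy.sqrt(chisq / (n * (numpy.min(table.shape) - 1)))
print(v)  # 1.0

# No spread within each class -> correlation ratio should be 1.0
g = pandas.Series([0, 0, 1, 1])
vals = pandas.Series([1.0, 1.0, 5.0, 5.0])
variation = ((vals - vals.mean()) ** 2).sum()
within = sum(((vals[g == i] - vals[g == i].mean()) ** 2).sum() for i in numpy.unique(g))
eta2 = 1 - within / variation
print(eta2)  # 1.0
```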
Pearson's coefficient ranges from -1 to 1, while Cramér's V and the correlation ratio range from 0 to 1. The cutoffs for interpreting the value differ by measure. Guidelines are as follows.
〇 Cramér's V and correlation ratio
value | Correlation |
---|---|
>= 0.5 | Very strong correlation |
>= 0.25 | Strong correlation |
>= 0.1 | Slightly weak correlation |
< 0.1 | No correlation |
〇 Pearson's product-moment correlation coefficient (absolute value)
value | Correlation |
---|---|
>= 0.7 | Very strong correlation |
>= 0.4 | Strong correlation |
>= 0.2 | Slightly weak correlation |
< 0.2 | No correlation |
In addition to the correlation coefficient, for nominal scales you can also visualize the relationship by graphing a crosstab.
Now, let's look at the correlation coefficient and a graph for each Titanic variable. Create a new notebook in Titanic and define the "Cramér's V" and "correlation ratio" functions above. Then load and prepare the training data with the following code.
import matplotlib.pyplot as plt
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
##############################
# Data preprocessing
# Fill missing values
##############################
# Convert Age NaN to -1
df = df.fillna({'Age':-1})
# Convert Embarked NaN to the string 'null'
df = df.fillna({'Embarked':'null'})
##############################
# Data preprocessing
# Encode string labels as numbers
##############################
from sklearn.preprocessing import LabelEncoder
# Encode Sex and Embarked using LabelEncoder
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
encoder_embarked = LabelEncoder()
df['Embarked'] = encoder_embarked.fit_transform(df['Embarked'].values)
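A brief note on what LabelEncoder does (a toy illustration, not the actual Titanic column): it sorts the unique labels and maps each to its index, so 'female' becomes 0 and 'male' becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# fit_transform learns the sorted classes, then encodes each value
codes = enc.fit_transform(['male', 'female', 'female', 'male'])
print(codes.tolist())         # [1, 0, 0, 1]
print(enc.classes_.tolist())  # ['female', 'male']
```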
The ticket class (Pclass) is an ordinal scale. Check it with the correlation ratio.
######################################################
# Data analysis 1
# Examine the correlation between Survived and Pclass (ordinal scale)
######################################################
CorrelationV(df['Survived'], df['Pclass'])
0.11456941170524215
At 0.11, that counts as a "slightly weak correlation". Let's graph the crosstab.
cross_pclass = pandas.crosstab(df['Survived'], df['Pclass'])
cross_pclass.T.plot(kind='bar', stacked=True)
plt.show()
In class 1, survival "1" exceeds 50%; by class 3, only about a quarter survived. The coefficient is a weak 0.11, but looking at the graph there seems to be a fair correlation.
I'll skip Name for now. The order may change, but "observing the data" is also necessary, and I would like to come back to Name in a later "observing the data" section.
Sex is a "nominal scale". Check it with Cramér's V.
######################################################
# Data analysis 2
# Examine the correlation between Survived and Sex (nominal scale)
######################################################
cramersV(df['Survived'], df['Sex'])
0.5433513740027712
That is a "very strong correlation". Let's graph the crosstab.
cross_sex = pandas.crosstab(df['Survived'], df['Sex'])
cross_sex.T.plot(kind='bar', stacked=True)
plt.show()
Certainly, the results differ greatly between men and women.
Age is a quantitative variable (ratio scale). Check it with the correlation ratio.
######################################################
# Data analysis 3
# Examine the correlation between Survived and Age (ratio scale)
######################################################
CorrelationV(df['Survived'], df['Age'])
0.0001547299039139638
"No correlation". Let's graph the crosstab anyway, grouping ages into 10-year bins.
cross_age = pandas.crosstab(df['Survived'], round(df['Age'],-1))
cross_age.T.plot(kind='bar', stacked=True)
plt.show()
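For reference, `round(x, -1)` rounds to the nearest multiple of 10, which is how the 10-year bins above are formed (exact halfway values round to the even multiple). A made-up example:

```python
import pandas

# Hypothetical ages, not rows from train.csv
ages = pandas.Series([4.0, 22.0, 38.0, 71.0])
print(round(ages, -1).tolist())  # [0.0, 20.0, 40.0, 70.0]
```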
It looks like there are few survivors ("1") from the 50s onward, but the result is "no correlation". It was surprising that age had so little effect.
SibSp is a quantitative variable (ratio scale). Check it with the correlation ratio.
######################################################
# Data analysis 4
# Examine the correlation between Survived and SibSp (ratio scale)
######################################################
CorrelationV(df['Survived'], df['SibSp'])
0.0012476789275327471
"No correlation". Let's graph the crosstab.
cross_sibsp = pandas.crosstab(df['Survived'], df['SibSp'])
cross_sibsp.T.plot(kind='bar', stacked=True)
plt.show()
No significant correlation can be seen by looking at the graph.
Parch is a quantitative variable (ratio scale). Check it with the correlation ratio.
######################################################
# Data analysis 5
# Examine the correlation between Survived and Parch (ratio scale)
######################################################
CorrelationV(df['Survived'], df['Parch'])
0.006663360100801152
"No correlation". Let's graph the crosstab.
cross_parch = pandas.crosstab(df['Survived'], df['Parch'])
cross_parch.T.plot(kind='bar', stacked=True)
plt.show()
No significant correlation can be seen here either.
I will also skip the ticket number this time and come back to it in the later "observing the data" section.
Fare is a quantitative variable (ratio scale). Standardize it first, then check the correlation ratio. (I think the result would be the same without standardization.)
##############################
# Data preprocessing
# Standardize numeric columns
##############################
from sklearn.preprocessing import StandardScaler
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Sex', 'Fare']]), columns=['Pclass', 'Sex', 'Fare'])
# Replace Fare with its standardized values
df['Fare'] = df_std['Fare']
######################################################
# Data analysis 6
# Examine the correlation between Survived and Fare (ratio scale)
######################################################
CorrelationV(df['Survived'], df['Fare'])
0.06620664646184327
"No correlation". Let's graph the crosstab anyway. As is, the fare values are too fine-grained, so group them into 0.2-wide bins.
######################################
#         x < -0.8 ⇒ -1.0
# -0.8 <= x < -0.6 ⇒ -0.8
# -0.6 <= x < -0.4 ⇒ -0.6
# -0.4 <= x < -0.2 ⇒ -0.4
# -0.2 <= x <  0   ⇒ -0.2
#    0 <= x <  0.2 ⇒  0.0
#  0.2 <= x <  0.4 ⇒  0.2
#  0.4 <= x <  0.6 ⇒  0.4
#  0.6 <= x <  0.8 ⇒  0.6
#  0.8 <= x <  1.0 ⇒  0.8
#  1.0 <= x        ⇒  1.0
######################################
def one_fifth(x):
    if x < -0.8:
        return -1.0
    elif -0.8 <= x < -0.6:
        return -0.8
    elif -0.6 <= x < -0.4:
        return -0.6
    elif -0.4 <= x < -0.2:
        return -0.4
    elif -0.2 <= x < 0:
        return -0.2
    elif 0 <= x < 0.2:
        return 0.0
    elif 0.2 <= x < 0.4:
        return 0.2
    elif 0.4 <= x < 0.6:
        return 0.4
    elif 0.6 <= x < 0.8:
        return 0.6
    elif 0.8 <= x < 1.0:
        return 0.8
    else:
        return 1.0
df['Fare_convert'] = df['Fare'].apply(one_fifth)
cross_fare = pandas.crosstab(df['Survived'], df['Fare_convert'])
cross_fare.T.plot(kind='bar', stacked=True)
plt.show()
As the fare goes up, the share of survival "1" increases. The coefficient is low, but there may be a correlation.
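Incidentally, the same left-edge, 0.2-wide binning can be done in one vectorized step instead of an if/elif chain. A sketch of my own (hypothetical fare values, not from train.csv):

```python
import numpy
import pandas

fares = pandas.Series([-0.95, -0.3, 0.05, 0.55, 2.4])
# Clamp to [-1.0, 1.0], then snap each value down to the left edge
# of its 0.2-wide bin. (Values sitting exactly on a bin edge can
# land in the neighboring bin due to floating point.)
binned = numpy.floor(fares.clip(-1.0, 1.0) * 5) / 5
print(binned.tolist())  # [-1.0, -0.4, 0.0, 0.4, 1.0]
```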
The cabin number will also be skipped this time; I would like to come back to it in the later "observing the data" section.
The port of embarkation is a "nominal scale". Check it with Cramér's V.
######################################################
# Data analysis 7
# Examine the correlation between Survived and Embarked (nominal scale)
######################################################
cramersV(df['Survived'], df['Embarked'])
0.18248384812341217
It comes close, but at 0.18 it is only a "slightly weak correlation" by the guideline above. Let's graph the crosstab.
cross_embarked = pandas.crosstab(df['Survived'], df['Embarked'])
cross_embarked.T.plot(kind='bar', stacked=True)
plt.show()
There seems to be some correlation... or then again, maybe not.
The variables that showed a correlation are Pclass (ticket class) and Sex (sex). Fare does not reach the guideline value, but the graph suggests a slight correlation.
Based on the correlation coefficients and the crosstab graphs, I would like to use Pclass (ticket class), Sex (sex), and Fare (fare) as input features. Next comes model selection, but that is for next time.
Calculating the relationship between variables of various scales (Python): https://qiita.com/shngt/items/45da2d30acf9e84924b7
Calculating Cramér's V: https://qiita.com/canard0328/items/5ea4115d964b448903ba
2019/12/25 First edition released
2019/12/29 Added link to the next article