The first step when you receive data for analysis is to get an overview of its contents: load the data into a table and understand what features it has and what type each one holds. At the same time, investigate missing values. Since the presence or absence of missing data affects how the data can be manipulated, first check whether any values are missing, and then check how many. In this article, we'll see how to do this in Python and R.
(The programming environment is Jupyter Notebook + Python 3.5.2 and Jupyter Notebook + IRkernel (R 3.2.3).)
We use the "Titanic" dataset provided by Kaggle. Many readers will have seen this data before: the task is to classify passengers as "survived" / "did not survive" based on their characteristics. As described later, this dataset contains missing values.
First, load the data into pandas DataFrames with Python.
import pandas as pd

def load_data(fn1='./data/train.csv', fn2='./data/test.csv'):
    train = pd.read_csv(fn1)
    test = pd.read_csv(fn2)
    return train, test

train, test = load_data()
train.head()
(Figure: the first five rows of train.csv)
As shown in the figure above, 'NaN' already appears in the Cabin column within the first five rows of the data. Let's look at test.csv as well.
(Figure: the first five rows of test.csv)
Similarly, 'NaN' values line up in the Cabin column. Incidentally, the datasets are not particularly large: train has shape (891, 12) and test has shape (418, 11).
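For reference, the shapes and column types can be confirmed directly; here is a minimal check using standard pandas attributes (the variable names assume the load_data() call above).

# confirm dataset sizes as (rows, columns)
print(train.shape)   # (891, 12)
print(test.shape)    # (418, 11)

# dtypes and non-null counts per column, useful for a first overview
train.info()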
Next, let's check which features (columns) contain missing values.
# check if NA exists in each column
train.isnull().any(axis=0)
# output
'''
PassengerId False
Survived False
Pclass False
Name False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare False
Cabin True
Embarked True
dtype: bool
'''
isnull() determines, element by element, whether a value is NA, and any() aggregates multiple values and returns a single boolean. any() takes an axis option: to aggregate per "column" (i.e., scan along the "row" direction), set axis=0 (this can be omitted, since it is the default). Similarly, let's look at test.
# check if NA exists in each column
test.isnull().any()
# output
'''
PassengerId False
Pclass False
Name False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare True
Cabin True
Embarked False
dtype: bool
'''
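Incidentally, any() with axis=1 aggregates per row instead, which is handy for pulling out the samples that contain a missing value. A small sketch, not part of the original walkthrough:

# rows of train that contain at least one missing value
rows_with_na = train[train.isnull().any(axis=1)]
print(rows_with_na.shape[0])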
As shown above, the train data has missing values in ['Age', 'Cabin', 'Embarked'], and the test data in ['Age', 'Fare', 'Cabin']. Based on this, we can start forming ideas such as "as a prototype of the target classifier, let's build a model that does not use the features with missing values ('Age', 'Fare', 'Cabin', 'Embarked')", while bearing in mind that "'Age' probably affects the classification (survived or not)".
Next, count the number of missing values. First, the train data.
# count NA samples
train.isnull().sum()
# output
'''
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
'''
Similarly, the test data.
# count NA samples
test.isnull().sum()
# output
'''
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
'''
From these results, we can note the following points.

- 'Age' has a certain fraction of missing values in both train and test.
- 'Cabin' has a large number of missing values in both train and test.
- 'Embarked' is missing only in train, and only in 2 rows.
- 'Fare' is missing only in test, in just 1 row.
For the classification task, this lets us set a policy such as "for now, remove 'Cabin' from the features before modeling" or "since 'Embarked' is missing in only two rows, just dropna() them".
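As a concrete sketch of that policy (illustrative only; drop() and dropna() are standard pandas methods):

# drop 'Cabin', which is missing for most passengers
train_clean = train.drop('Cabin', axis=1)
# drop the two rows where 'Embarked' is missing
train_clean = train_clean.dropna(subset=['Embarked'])
print(train_clean.shape)   # (889, 11): one column and two rows fewer than train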
Now let's do in R what we just did in Python. First, read the files into data frames.
# Read without converting strings to factors
train <- read.csv("./data/train.csv", header=T, stringsAsFactors=F)
test <- read.csv("./data/test.csv", header=T, stringsAsFactors=F)
header is an option that specifies how the header line is handled, and stringsAsFactors specifies whether character strings are converted to the factor type. After reading the files as above, the train data frame looks as follows.
(Figure: the first rows of the train data frame in R)
The Age of 'Moran, Mr. James' (PassengerId = 6) is 'NA'. Next, check whether each column contains missing values.
# check whether each column contains any NA
is_na_train <- sapply(train, function(y) any(is.na(y)))
is_na_test <- sapply(test, function(y) any(is.na(y)))
Here, any() is used just as in the Python code. Next, count the number of missing values.
# count na
na_count_train <- sapply(train, function(y) sum(is.na(y)))
na_count_train
# output
# PassengerId 0
# Survived 0
# Pclass 0
# Name 0
# Sex 0
# Age 177
# SibSp 0
# Parch 0
# Ticket 0
# Fare 0
# Cabin 0
# Embarked 0
Notice anything? This differs from the result obtained with Python above. Let's look at the test data as well.
# count na
na_count_test <- sapply(test, function(y) sum(is.na(y)))
# output
# PassengerId 0
# Pclass 0
# Name 0
# Sex 0
# Age 86
# SibSp 0
# Parch 0
# Ticket 0
# Fare 1
# Cabin 0
# Embarked 0
Here, too, the counts are much lower (especially for 'Cabin') than the NA counts obtained in Python. Why?
The reason for this difference between Python and R lies in how empty fields ("") are treated.
(Figure: train.csv; in Python, the cells in the red frame were already NaN at read time.)
Python's pandas converts empty fields ("") to NaN when read_csv() parses the file, so isnull() counts them as missing; R's read.csv(), by contrast, keeps them as empty strings, and is.na() does not treat "" as NA. This is why the R counts come out lower.
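To see the pandas side of this concretely, here is a minimal demonstration with an inline CSV (my own example, not from the original article; keep_default_na is a standard read_csv option):

import io
import pandas as pd

csv_text = "PassengerId,Cabin\n1,C85\n2,\n3,B42\n"

# default behavior: the empty field in row 2 is parsed as NaN
df = pd.read_csv(io.StringIO(csv_text))
print(df['Cabin'].isnull().sum())   # 1

# keep_default_na=False mirrors R's default: the field stays an empty string
df2 = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df2['Cabin'].isnull().sum())  # 0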
In the Titanic data, 'Cabin' indicates a cabin ID, so a blank presumably means there was simply no record (this is speculation). Moreover, blank entries are likely to be handled separately from rows that do have 'Cabin' data in the analysis flow, so it is preferable for the program to count blanks as NA. Let's therefore change the R script to treat empty strings ("") as NA, as the Python code does.
# Reread data with the na.strings option
train <- read.csv("./data/train.csv", header=T, stringsAsFactors=F,
na.strings=(c("NA", "")))
test <- read.csv("./data/test.csv", header=T, stringsAsFactors=F,
na.strings=(c("NA", "")))
By setting the na.strings option of read.csv() to na.strings=c("NA", ""), empty strings ("") are converted to NA on read. With that in place, the NAs are counted as before.
# Counting NA
na_count_train <- sapply(train, function(y) sum(is.na(y)))
na_count_test <- sapply(test, function(y) sum(is.na(y)))
Output result:
# --- Train dataset ---
# PassengerId 0
# Survived 0
# Pclass 0
# Name 0
# Sex 0
# Age 177
# SibSp 0
# Parch 0
# Ticket 0
# Fare 0
# Cabin 687
# Embarked 2
# --- Test dataset ---
# PassengerId 0
# Pclass 0
# Name 0
# Sex 0
# Age 86
# SibSp 0
# Parch 0
# Ticket 0
# Fare 1
# Cabin 327
# Embarked 0
This now agrees with the Python result. So we have seen that the definition of NA differs between Python (pandas) and R. Comparatively, R draws a stricter distinction between null / NaN / NA; in pandas, blanks (converted at read time), NA, and NaN are all judged missing by isnull(), which seems unproblematic in practice. (Put less charitably, pandas' treatment is "ambiguous".)
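For what it's worth, isnull() itself does distinguish an explicit empty string from NaN/None; the lumping together happens at read time. A small check of my own:

import numpy as np
import pandas as pd

print(pd.isnull(np.nan))   # True
print(pd.isnull(None))     # True
print(pd.isnull(''))       # False -- an explicit empty string is not null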
The following is quoted from the Python pandas documentation (http://pandas.pydata.org/pandas-docs/version/0.18.1/missing_data.html).
Note: The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas.
Blanks are not mentioned there, but pandas explains that the current implementation was chosen for simplicity and performance. I have personally never run into a case where missing data had to be strictly separated into blank / NA / NaN, so the Python handling described in this article (and, on the R side, how to convert blanks to NA) is worth remembering.
(For reference, regarding blank-to-NA conversion: I confirmed that fread() in the R package {data.table} performs the same conversion via the same option, na.strings.)
Kaggle's Titanic is a tutorial-style competition, but a look at the leaderboard shows scores ranging from excellent to mediocre. Presumably, one key to raising the score is tuning the classifier's parameters, and another is the method of imputing the missing values, especially 'Age'. There are still some days left before the deadline (2016-12-31), so I'd like to take this opportunity to try the Titanic competition again. (The top group has achieved an accuracy of 1.0; how do they do that...)