The first step when you receive data for analysis is to get an overview of its contents: load the data into a table and understand what features it has and what type each one holds. At the same time, investigate missing values. Since the presence or absence of missing data affects how the data can be manipulated, first check whether any values are missing, and then check how many. In this article, we'll see how to do this in Python and R.
(The programming environment is Jupyter Notebook + Python 3.5.2 and Jupyter Notebook + IRkernel (R 3.2.3).)
We use the "Titanic" dataset provided by Kaggle. Many readers will have seen this data before: the task is to classify passengers as "survived" / "did not survive" based on their characteristics. As described later, this dataset contains missing values.
First, load the data into pandas DataFrames with Python.
import pandas as pd

def load_data(fn1='./data/train.csv', fn2='./data/test.csv'):
    train = pd.read_csv(fn1)
    test = pd.read_csv(fn2)
    return train, test

train, test = load_data()
train.head()
(Figure: the first five rows of train.csv)
As shown in the figure above, 'NaN' already appears in the Cabin column within the first five rows of the data. Let's look at test.csv as well.
(Figure: the first five rows of test.csv)
Similarly, 'NaN' values line up in the Cabin column. Incidentally, the datasets are not particularly large: train has shape (891, 12) and test has shape (418, 11).
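For reference, the shapes and column types can be confirmed directly; here is a minimal check using standard pandas attributes (the variable names assume the load_data() call above).

# confirm dataset sizes as (rows, columns)
print(train.shape)   # (891, 12)
print(test.shape)    # (418, 11)

# dtypes and non-null counts per column, useful for a first overview
train.info()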
Next, let's check which features (columns) contain missing values.
# check if NA exists in each column
train.isnull().any(axis=0)
# output
'''
PassengerId False
Survived False
Pclass False
Name False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare False
Cabin True
Embarked True
dtype: bool
'''
isnull() determines, element by element, whether a value is NA, and any() aggregates multiple values and returns a single boolean. any() takes an axis option: to aggregate per "column" (i.e., scan along the "row" direction), set axis=0 (this can be omitted, since it is the default). Similarly, let's look at test.
# check if NA exists in each column
test.isnull().any()
# output
'''
PassengerId False
Pclass False
Name False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare True
Cabin True
Embarked False
dtype: bool
'''
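Incidentally, any() with axis=1 aggregates per row instead, which is handy for pulling out the samples that contain a missing value. A small sketch, not part of the original walkthrough:

# rows of train that contain at least one missing value
rows_with_na = train[train.isnull().any(axis=1)]
print(rows_with_na.shape[0])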
As shown above, the train data has missing values in ['Age', 'Cabin', 'Embarked'], and the test data in ['Age', 'Fare', 'Cabin']. Based on this, we can start forming ideas such as "as a prototype of the target classifier, let's build a model that does not use the features with missing values ('Age', 'Fare', 'Cabin', 'Embarked')", while bearing in mind that "'Age' probably affects the classification (survived or not)".
Next, count the number of missing values. First, the train data.
# count NA samples
train.isnull().sum()
# output
'''
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
'''
Similarly, the test data.
# count NA samples
test.isnull().sum()
# output
'''
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
'''
From these results, we can note the following points.

- 'Age' has a certain fraction of missing values in both train and test.
- 'Cabin' has a large number of missing values in both train and test.
- 'Embarked' is missing only in train, and only in 2 rows.
- 'Fare' is missing only in test, in just 1 row.
For the classification task, this lets us set a policy such as "for now, remove 'Cabin' from the features before modeling" or "since 'Embarked' is missing in only two rows, just dropna() them".
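As a concrete sketch of that policy (illustrative only; drop() and dropna() are standard pandas methods):

# drop 'Cabin', which is missing for most passengers
train_clean = train.drop('Cabin', axis=1)
# drop the two rows where 'Embarked' is missing
train_clean = train_clean.dropna(subset=['Embarked'])
print(train_clean.shape)   # (889, 11): one column and two rows fewer than train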
Now let's do in R what we just did in Python. First, read the files into data frames.
# Read without converting strings to factors
train <- read.csv("./data/train.csv", header=T, stringsAsFactors=F)
test <- read.csv("./data/test.csv", header=T, stringsAsFactors=F)
header is an option that specifies how the header line is handled, and stringsAsFactors specifies whether character strings are converted to the factor type. After reading the files as above, the train data frame looks as follows.
(Figure: the first rows of the train data frame in R)
The Age of 'Moran, Mr. James' (PassengerId = 6) is 'NA'. Next, check whether each column contains missing values.
# check whether each column contains any NA
is_na_train <- sapply(train, function(y) any(is.na(y)))
is_na_test <- sapply(test, function(y) any(is.na(y)))
Here, any() is used just as in the Python code. Next, count the number of missing values.
# count na
na_count_train <- sapply(train, function(y) sum(is.na(y)))
na_count_train
# output
# PassengerId 0
# Survived 0
# Pclass 0
# Name 0
# Sex 0
# Age 177
# SibSp 0
# Parch 0
# Ticket 0
# Fare 0
# Cabin 0
# Embarked 0
Notice anything? This differs from the result obtained with Python above. Let's look at the test data as well.
# count na
na_count_test <- sapply(test, function(y) sum(is.na(y)))
# output
# PassengerId 0
# Pclass 0
# Name 0
# Sex 0
# Age 86
# SibSp 0
# Parch 0
# Ticket 0
# Fare 1
# Cabin 0
# Embarked 0
Here, too, the counts are much lower (especially for 'Cabin') than the NA counts obtained in Python. Why?
The reason for this difference between Python and R lies in how empty fields ("") are treated.
(Figure: train.csv; in Python, the cells in the red frame were already NaN at read time.)
Python's pandas converts empty fields ("") to NaN when read_csv() parses the file, so isnull() counts them as missing; R's read.csv(), by contrast, keeps them as empty strings, and is.na() does not treat "" as NA. This is why the R counts come out lower.
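To see the pandas side of this concretely, here is a minimal demonstration with an inline CSV (my own example, not from the original article; keep_default_na is a standard read_csv option):

import io
import pandas as pd

csv_text = "PassengerId,Cabin\n1,C85\n2,\n3,B42\n"

# default behavior: the empty field in row 2 is parsed as NaN
df = pd.read_csv(io.StringIO(csv_text))
print(df['Cabin'].isnull().sum())   # 1

# keep_default_na=False mirrors R's default: the field stays an empty string
df2 = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df2['Cabin'].isnull().sum())  # 0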
In the Titanic data, 'Cabin' indicates a cabin ID, so a blank presumably means there was simply no record (this is speculation). Moreover, blank entries are likely to be handled separately from rows that do have 'Cabin' data in the analysis flow, so it is preferable for the program to count blanks as NA. Let's therefore change the R script to treat empty strings ("") as NA, as the Python code does.
# Reread data with the na.strings option
train <- read.csv("./data/train.csv", header=T, stringsAsFactors=F,
na.strings=(c("NA", "")))
test <- read.csv("./data/test.csv", header=T, stringsAsFactors=F,
na.strings=(c("NA", "")))
By setting the na.strings option of read.csv() to na.strings=c("NA", ""), empty strings ("") are converted to NA on read. With that in place, the NAs are counted as before.
# Counting NA
na_count_train <- sapply(train, function(y) sum(is.na(y)))
na_count_test <- sapply(test, function(y) sum(is.na(y)))
Output result:
# --- Train dataset ---
# PassengerId 0
# Survived 0
# Pclass 0
# Name 0
# Sex 0
# Age 177
# SibSp 0
# Parch 0
# Ticket 0
# Fare 0
# Cabin 687
# Embarked 2
# --- Test dataset ---
# PassengerId 0
# Pclass 0
# Name 0
# Sex 0
# Age 86
# SibSp 0
# Parch 0
# Ticket 0
# Fare 1
# Cabin 327
# Embarked 0
This now agrees with the Python result. So we have seen that the definition of NA differs between Python (pandas) and R. Comparatively, R draws a stricter distinction between null / NaN / NA; in pandas, blanks (converted at read time), NA, and NaN are all judged missing by isnull(), which seems unproblematic in practice. (Put less charitably, pandas' treatment is "ambiguous".)
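For what it's worth, isnull() itself does distinguish an explicit empty string from NaN/None; the lumping together happens at read time. A small check of my own:

import numpy as np
import pandas as pd

print(pd.isnull(np.nan))   # True
print(pd.isnull(None))     # True
print(pd.isnull(''))       # False -- an explicit empty string is not null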
The following is quoted from the Python pandas documentation (http://pandas.pydata.org/pandas-docs/version/0.18.1/missing_data.html).
Note: The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas.
Blanks are not mentioned there, but pandas explains that the current implementation was chosen for simplicity and performance. I have personally never run into a case where missing data had to be strictly separated into blank / NA / NaN, so the Python handling described in this article (and, on the R side, how to convert blanks to NA) is worth remembering.
(For reference, regarding blank-to-NA conversion: I confirmed that fread() in the R package {data.table} performs the same conversion via the same option, na.strings.)
Kaggle's Titanic is a tutorial-style competition, but a look at the leaderboard shows scores ranging from excellent to mediocre. Presumably, one key to raising the score is tuning the classifier's parameters, and another is the method of imputing the missing values, especially 'Age'. There are still some days left before the deadline (2016-12-31), so I'd like to take this opportunity to try the Titanic competition again. (The top group has achieved an accuracy of 1.0; how do they do that...)