Summary of pre-processing practices for Python beginners (Pandas dataframe)

I'm new to Python. There are plenty of articles that explain individual pandas data frame operations, but I could not find one that explains the point and purpose of preprocessing, so I wrote this one as a learning memo.

Assumed reader

- Python beginners (myself included)
- Those who have just started using pandas

What you can do after reading this article

- When you read data into a data frame with the pandas library, you will understand both the purpose of preprocessing and the concrete steps to take first.
- In particular, you will be able to handle the processing that follows reading a CSV file with ease.

Premise

- The code in this article assumes the following has already been run. Replace df with your own data frame as appropriate.
- The example imitates the Titanic passenger data often used in introductory statistics material, but the values shown are fictional and were made up for this article.
- How to create or read a data frame, and how to edit rows and columns, are not covered here. I plan to publish that at a later date.

import pandas as pd
df = pd.read_csv("hogehoge/test.csv", usecols = ['PassengerId','Sex','Age'], header = 1)

Main article I | Overview of data

1. Visual confirmation

- Visually check the contents of the data using the head and tail methods.
- Check the column names and row names using the columns and index attributes.
- Purpose: confirm that the right file was read and that the data was read as expected.

# Show the first / last two rows. The argument is the number of rows to show (5 if omitted).
print(df.head(2))
print(df.tail(2))
print("Column name:", df.columns)
print("Row name (index):", df.index)

"""
Displayed as ↓:
# head
   PassengerId     Sex   Age
0            1  female  23.0
1            2    male  48.0

# tail
     PassengerId     Sex   Age
998          999  female  41.0
999         1000    male  15.0

Column name: Index(['PassengerId', 'Sex', 'Age'], dtype='object')

Row name (index): RangeIndex(start=0, stop=1000, step=1)

"""

- From this result you can confirm, for example:
  - Sex is stored as a string.
  - The row labels are returned as a RangeIndex, so the rows have only a serial-number index (no specific names).
  - RangeIndex(start=0, stop=1000, step=1) means "start at 0, increase by 1, stop before 1000", so the rows are numbered 0 to 999 and there are 1000 rows.
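The row count read off the RangeIndex can also be cross-checked directly. This is a minimal sketch, assuming a small made-up frame named `sample` (a hypothetical stand-in, not the article's data):

```python
import pandas as pd

# Hypothetical stand-in for the article's df (fictional values)
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["female", "male", "female"],
    "Age": [23.0, 48.0, None],
})

# A default RangeIndex starts at 0 and stops before the row count
n_rows, n_cols = sample.shape
print(sample.index)    # RangeIndex(start=0, stop=3, step=1)
print(n_rows, n_cols)  # 3 3
print(len(sample))     # 3
```

`shape` returns (rows, columns), so it confirms the row count even when the index is not a plain serial number.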

2. Data type confirmation

- Use the dtypes attribute.
  - An attribute is accessed by appending `.hoge` to the data frame, like a method but without parentheses.
- Purpose: depending on the library used, calculations on mixed data types may raise errors, so find them now and unify them later (described below).

print(df.dtypes)

"""
It will be displayed as below
PassengerId      int64
Sex             object
Age            float64
"""

- This result raises the following issues, for example:
  - 1) Sex is stored as a character string such as male or female. Wouldn't it be better to assign a dummy value such as 0/1 so it can be used in calculations?
  - 2) Age is float (floating point type), while PassengerId is int (integer type). If both are used in calculations, it would be better to unify them to one type.
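With many columns, reading the dtypes output by eye gets tedious. As a rough sketch, the two issues above could be listed programmatically with `select_dtypes`. The frame `sample` and the variable names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the article's df (fictional values)
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["female", "male", "female"],
    "Age": [23.0, 48.0, 31.0],
})

# Columns that are not numeric need conversion before calculation
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

# Numeric columns that are not float64 are candidates for unification
numeric = sample.select_dtypes(include="number")
not_float = [c for c in numeric.columns if numeric[c].dtype != "float64"]

print(non_numeric)  # ['Sex']
print(not_float)    # ['PassengerId']
```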

3. Confirmation and replacement of missing values (NaN)

- Use the isnull method together with the any method.
  - Combining them detects the columns that contain at least one NaN.
- Purpose: missing values distort the overall calculation results, so find them now and handle them later (described below).

print(df.isnull().any())

"""
The result will be displayed as below
PassengerId    False
Sex            False
Age             True
dtype: bool

"""

- The takeaway here is that NaN exists in the Age column, so it needs to be dealt with.
- How to deal with it (delete the rows containing NaN, replace NaN with 0, delete the Age column itself, and so on) depends on the case.
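The three options mentioned above could look like the following. This is only a sketch on a made-up frame `sample`; which option is appropriate depends on your data. It also uses `isnull().sum()` to count the NaN values, since `any()` only reports whether at least one exists.

```python
import pandas as pd

# Hypothetical stand-in for the article's df (fictional values)
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["female", "male", "female", "male"],
    "Age": [23.0, None, 41.0, None],
})

# Number of NaN per column
nan_counts = sample.isnull().sum()
print(nan_counts["Age"])  # 2

# Option 1: delete the rows where Age is NaN
dropped_rows = sample.dropna(subset=["Age"])
# Option 2: replace NaN in Age with 0
filled = sample.fillna({"Age": 0})
# Option 3: delete the Age column itself
dropped_col = sample.drop(columns=["Age"])

print(len(dropped_rows))             # 2
print(filled["Age"].tolist())        # [23.0, 0.0, 41.0, 0.0]
print(dropped_col.columns.tolist())  # ['PassengerId', 'Sex']
```

Each call returns a new data frame and leaves `sample` unchanged, so the options can be compared side by side.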

4. Confirmation of basic statistics

- Check the basic statistics using the describe method.
  - It reports the count of non-missing values, arithmetic mean, standard deviation, minimum, maximum, and quartiles of each numeric column.
- Purpose: get an overview of the data to be analyzed and check for outliers.

print(df.describe())
"""
       PassengerId         Age
count  1000.000000  884.000000
mean    446.000000   29.699118
std     257.353842   14.526497
min       1.000000    3.100000
25%     215.500000   20.125000
50%     430.000000   27.000000
75%     703.500000   39.000000
max    1000.000000   80.000000
"""

- Suggestions obtained:
  - The min of Age is 3.1, yet head / tail suggest that ages are recorded as whole numbers (even though the column is a floating point type). Could this 3.1 be a mistyped 31 by whoever recorded the data? This needs to be confirmed.
- Be careful how you read the statistics:
  - The PassengerId (passenger number) statistics are meaningless.
  - Since the Sex column is an object type, it is automatically excluded.
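One way to find suspicious values like the 3.1 above is to look for ages that have a fractional part. This is a sketch under the assumption that ages are meant to be whole numbers. The frame `sample` is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the article's df (fictional values)
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [23.0, 3.1, None, 48.0],
})

# Keep rows whose Age is present but is not a whole number
age = sample["Age"]
suspicious = sample[age.notna() & (age % 1 != 0)]
print(suspicious["PassengerId"].tolist())  # [2]
```

The `notna()` condition is needed because NaN would otherwise pass the `!= 0` comparison and be reported as well.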

Main article II | Perform basic processing

1. Handle missing values

- In this case, suppose we decide: "Set NaN in Age to 0, and when calculating the average age later, analyze only the values other than 0." So we convert NaN to 0.
- With loc, select the Age column of the rows where Age is NaN, and assign 0.

# Replace NaN with 0 in the Age column, where NaN was found in the previous chapter.
df.loc[df['Age'].isnull(), 'Age'] = 0

#Check if the process was done correctly
print(df.isnull().any())

"""
It will be displayed as follows. Compare with the output of step 3 in the previous chapter.
PassengerId    False
Sex            False
Age            False
dtype: bool
"""
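Replacing NaN with 0 changes what a plain mean returns, which is why the 0 values have to be excluded afterwards. A small sketch with made-up numbers (the frame `sample` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the Age column (fictional values)
sample = pd.DataFrame({"Age": [20.0, None, 40.0, None]})

# mean() skips NaN by default
mean_skipping_nan = sample["Age"].mean()

# Same replacement as in the article: NaN becomes 0
sample.loc[sample["Age"].isnull(), "Age"] = 0

# The zeros now pull the mean down
mean_with_zeros = sample["Age"].mean()

# Excluding 0 restores the intended average
mean_excluding_zeros = sample.loc[sample["Age"] != 0, "Age"].mean()

print(mean_skipping_nan, mean_with_zeros, mean_excluding_zeros)  # 30.0 15.0 30.0
```

Note that filtering on `!= 0` would also drop a genuine age of 0, so this approach assumes no such value exists in the data.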

2. Unify data types

- Based on the previous chapter, unify the data types.
- Convert the data type column by column using the astype method.
- In this case you need to (1) change PassengerId to float64, and (2) assign 0/1 to Sex as a dummy variable (and also make it float64).

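# Change PassengerId to float64
df['PassengerId'] = df['PassengerId'].astype('float64')

# Assign dummy values to Sex (0 for male, 1 for female) and convert to float64.
# map is used instead of df.Sex[...] = 0, which is chained assignment and may not update df.
# Any value other than 'male' / 'female' becomes NaN.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype('float64')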

#Check if the process was done correctly
print(df.dtypes)

"""
It should look like this:
PassengerId    float64
Sex            float64
Age            float64

"""
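As an alternative to assigning 0/1 by hand, pandas also provides `get_dummies`. This is a different technique from the one used in this article, shown only as a sketch on a made-up frame `sample`. Note that the resulting column is `Sex_male` (1 for male), which is the opposite coding from the one above.

```python
import pandas as pd

# Hypothetical stand-in for the article's df (fictional values)
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["female", "male", "female"],
})

# drop_first=True keeps a single column; categories are sorted,
# so 'female' is dropped and 'Sex_male' remains
encoded = pd.get_dummies(sample, columns=["Sex"], drop_first=True, dtype=float)

print(encoded.columns.tolist())      # ['PassengerId', 'Sex_male']
print(encoded["Sex_male"].tolist())  # [0.0, 1.0, 0.0]
```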

In conclusion

- This article summarized the basic pre-processing flow and procedure. Whatever data you analyze, you will almost certainly need this kind of pre-processing. I would appreciate your feedback. I am a beginner too, so I will keep studying.
- 3/27 postscript: I actually tried this pre-processing procedure on the Titanic data in a separate article. Please have a look if you like!
