I'm new to Python. There are plenty of articles that explain individual Pandas data frame operations, but I could not find one that explains the key points and purposes of preprocessing, so I decided to write this as a learning memo.
--For Python beginners (myself included)
--For those who have just started using Pandas
--After reading this, you will understand both the purpose of preprocessing and the concrete steps to take first when you load a data frame with the pandas library.
--In particular, you will be able to carry out the processing that follows reading a CSV file with ease.
--The code in this article assumes the two lines below have already been run. Replace df with your own data frame as appropriate.
--The data is modeled on the Titanic passenger data often used in introductory statistics, but the values shown are fictional and were made up for this article.
--This article does not cover how to create or read the data frame itself, or how to edit its rows and columns. I plan to publish that at a later date.
import pandas as pd
df = pd.read_csv("hogehoge/test.csv", usecols = ['PassengerId','Sex','Age'], header = 1)
--Visually check the contents of the data using the head and tail methods.
--Check the row and column names using the columns and index attributes.
--Purpose: confirm that the right file was read and that the data was read as expected.
# Show the first and last two rows. The argument (2 here) is the number of rows to show (if omitted, 5 is used).
print(df.head(2))
print(df.tail(2))
print("Column name:",df.columns)
print("Line name(index):", df.index)
"""
Displayed as ↓:
# head
PassengerId Sex Age
0 1 female 23.0
1 2 male 48.0
# tail
PassengerId Sex Age
998 999 female 41.0
999 1000 male 15.0
Column name: Index(['PassengerId', 'Sex', 'Age'], dtype='object')
Line name(index): RangeIndex(start=0, stop=1000, step=1)
"""
--From this result, for example, the following can be confirmed:
--Sex is stored as a string.
--The row labels are returned as a RangeIndex, so the rows only have a serial-number index (no specific names), and there are 1000 rows of data.
--RangeIndex(start=0, stop=1000, step=1) means "integers starting at 0, increasing by 1, and stopping before 1000", so the rows are indexed from 0 to 999, which is 1000 rows.
--Use the dtypes attribute.
--An attribute is accessed by writing `.hoge` after the data frame, like a method but without parentheses.
--Purpose: depending on the library used, calculations on mixed data types may raise an error, so the types are unified later (described below).
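The row count read off the RangeIndex above can also be obtained directly. The following is a minimal sketch on a tiny made-up frame (a hypothetical stand-in for df, not the article's data):

```python
import pandas as pd

# Hypothetical stand-in for df: three made-up rows, not the article's data.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["female", "male", "female"],
    "Age": [23.0, 48.0, None],
})

n_rows = len(df.index)          # number of rows, taken from the index
n_rows_alt, n_cols = df.shape   # (rows, columns) in one call
# True when the rows only have the default serial-number index
index_is_default = isinstance(df.index, pd.RangeIndex)

print(n_rows, n_cols, index_is_default)
```

Checking the index type this way is just one option; reading the printed RangeIndex by eye, as in the article, gives the same information.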
print(df.dtypes)
"""
It will be displayed as below
PassengerId int64
Sex object
Age float64
"""
--From this result, you can raise the following issues, for example:
--1) Sex is stored as a string such as male or female. Wouldn't it be better to assign a dummy value such as 0/1 so that it can be used in calculations?
--2) Age is float (floating point), while PassengerId is int (integer). If both are used in calculations, it would be better to unify them to one type.
--Use the isnull method combined with the any method.
--Combining these detects columns that contain at least one NaN.
--Purpose: missing values distort the overall calculation results, so they are dealt with later (described below).
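When there are many columns, reading the dtypes output by eye gets tedious. As a rough sketch (again on made-up data, with variable names chosen only for illustration), select_dtypes can list which columns are numeric and which are not:

```python
import pandas as pd

# Hypothetical stand-in for df: made-up rows, not the article's data.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["female", "male", "female"],
    "Age": [23.0, 48.0, 31.0],
})

# Columns that cannot be used in arithmetic as they are
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
# Columns that are already int or float
numeric = df.select_dtypes(include="number").columns.tolist()

print("non-numeric:", non_numeric)
print("numeric:", numeric)
```

The non-numeric list is the set of columns that would need a dummy value or some other conversion before calculation.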
print(df.isnull().any())
"""
The result will be displayed as below
PassengerId False
Sex False
Age True
dtype: bool
"""
--The takeaway here is that NaN exists in the Age column, so it needs to be handled.
--How to handle it (delete the rows where NaN exists, replace NaN with 0, delete the Age column itself, etc.) depends on the case.
--Check the basic statistics using the describe method.
--It reports the count, arithmetic mean, standard deviation, minimum, quartiles, and maximum of each numeric column.
--Purpose: get an overview of the data to be analyzed and check for outliers.
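The three handling options mentioned above can be sketched as follows. This is only an illustration on made-up data; which option is appropriate depends on the analysis. Each call returns a new data frame and leaves df itself unchanged.

```python
import pandas as pd

# Hypothetical stand-in for df: made-up rows with one missing Age.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["female", "male", "female"],
    "Age": [23.0, None, 48.0],
})

# How many NaN values each column has (any() only says whether there are some)
nan_counts = df.isnull().sum()

dropped_rows = df.dropna(subset=["Age"])   # option 1: delete rows where Age is NaN
filled = df.fillna({"Age": 0})             # option 2: replace NaN in Age with 0
dropped_col = df.drop(columns=["Age"])     # option 3: delete the Age column itself

print(nan_counts)
print(len(dropped_rows), filled["Age"].tolist(), dropped_col.columns.tolist())
```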
print(df.describe())
"""
PassengerId Age
count 1000.000000 884.000000
mean 446.000000 29.699118
std 257.353842 14.526497
min 1.000000 3.100000
25% 215.500000 20.125000
50% 430.000000 27.000000
75% 703.500000 39.000000
max 1000.000000 80.000000
"""
--Suggestions obtained:
--The min of Age is 3.1, yet head / tail suggest that ages are recorded as whole numbers (although the type is floating point). Could this 3.1 be a data-entry mistake for 31? This needs to be confirmed.
--Be careful how you read the statistics:
--The PassengerId (passenger number) statistics are meaningless.
--Since the Sex column is an object type, it is automatically excluded.
--In this case, suppose we decide: "Set NaN ages to 0, and when calculating the average age later, analyze only values other than 0." We therefore convert NaN to 0.
--With loc, select the Age column of the rows where Age is NaN (awkward to put into words) and assign 0.
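The suspicion about a non-integer age can be checked mechanically rather than by eye. Below is a minimal sketch, assuming ages are supposed to be whole numbers; the data is made up and the variable names are only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for df: made-up rows, one with a fractional age.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [23.0, 3.1, None, 48.0],
})

age = df["Age"]
# Rows whose Age is present but has a fractional part
suspicious = df[age.notnull() & (age % 1 != 0)]

print(suspicious)
```

This only lists candidates. Whether 3.1 is really a typo for 31 still has to be confirmed with whoever collected the data.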
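To see why the 0 placeholder has to be excluded later, here is a small sketch on made-up data where 0 stands for an age that was originally missing:

```python
import pandas as pd

# Hypothetical data: 0.0 marks ages that were NaN before the replacement.
df = pd.DataFrame({"Age": [23.0, 0.0, 48.0, 0.0]})

# Naive mean: the placeholder zeros pull the average down
mean_with_zeros = df["Age"].mean()
# Mean over real ages only: select rows where Age is not 0
mean_without_zeros = df.loc[df["Age"] != 0, "Age"].mean()

print(mean_with_zeros, mean_without_zeros)
```

Note that this approach assumes no passenger has a genuine age of 0. If that can happen, keeping NaN (which mean() skips by default) may be the safer choice.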
#Perform a transformation on the column where the presence of NaN was confirmed in the previous chapter.
df.loc[df['Age'].isnull(), 'Age'] = 0
#Check if the process was done correctly
print(df.isnull().any())
"""
It will be displayed as follows. Compare with the result of the earlier NaN check.
PassengerId False
Sex False
Age False
dtype: bool
"""
--Based on the earlier dtypes check, unify the data types.
--Convert the data type column by column using the astype method.
--In this case, you need to (1) change PassengerId to float64, and (2) assign 0/1 as a dummy variable to Sex (and also make it float64).
# PassengerId type change
df['PassengerId'] = df['PassengerId'].astype('float64')
# Sex dummy value assignment (0 for male, 1 for female), then float64.
# map on the whole column avoids chained assignment (df.Sex[mask] = ...),
# which pandas may apply to a temporary copy instead of df.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype('float64')
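The 0/1 encoding takes for granted that Sex only contains 'male' and 'female'. As a hedged sketch on made-up data, it may be worth checking for unexpected values such as a different capitalization before converting:

```python
import pandas as pd

# Hypothetical data: the last value has an unexpected capitalization.
df = pd.DataFrame({"Sex": ["male", "female", "Female"]})

# map returns NaN for any value that is not a key of the dictionary
codes = df["Sex"].map({"male": 0, "female": 1})
# Original values that did not get a code
unmapped = df.loc[codes.isnull(), "Sex"].unique().tolist()

print(unmapped)
```

If unmapped is not empty, those values need to be cleaned or added to the mapping before the column is converted.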
#Check if the process was done correctly
print(df.dtypes)
"""
It should look like this:
PassengerId float64
Sex float64
Age float64
"""
--This article summarized the basic preprocessing flow and procedure. Whatever data you analyze, some preprocessing of this kind will almost always be needed. I would appreciate your feedback. I am a beginner myself, so I will keep studying.
--3/27 postscript: I actually tried this preprocessing procedure in a separate post (titanic). Please have a look if you like!