CSV
CSV (comma-separated values) is a file format that stores values separated by commas. It is widely used because it is very easy to handle in data analysis.
To read a CSV file with Pandas, use the read_csv() function.
read_csv("path to the CSV file", header=...)
Rules for the header option:
If the header option is omitted, the first line of the file is used as the column names.
If you specify header=None, Pandas automatically assigns integer column names (0, 1, 2, ...).
If you specify a line number such as header=1, the data in the second line of the file is used as the column names, and reading starts from the line after that. Since line numbers start from 0, the first line of the file is line number 0.
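As a minimal sketch of these three patterns (assuming a hypothetical file data.csv in the current directory):
import pandas as pd
# Default: the first line of the file becomes the column names
df = pd.read_csv("data.csv")
# header=None: Pandas assigns integer column names 0, 1, 2, ...
df = pd.read_csv("data.csv", header=None)
# header=1: the second line (line number 1) becomes the column names,
# and reading starts from the line after it
df = pd.read_csv("data.csv", header=1)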
As an example, read a wine dataset that has no column-name information from a directory. Since we cannot tell what each numerical value represents as it is, add column names (columns) that describe the contents of the values.
import pandas as pd
df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)
# Add column names to show what each number represents
df.columns = ["", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium","Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins","Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"]
df
The ./ in the file path indicates the current directory, that is, the working directory where Python is currently running.
Next, create a CSV data file using the csv library that comes standard with Python 3. The following outputs the cities, years, and seasons of the past 10 Olympic Games as a CSV data file.
import csv

# Use the with statement
# Open csv0.csv in write mode ("w") as the variable csvfile
with open("./4050_data_cleansing_data/csv0.csv", "w") as csvfile:
    # Pass the variable csvfile and the line feed code "\n" to csv.writer
    writer = csv.writer(csvfile, lineterminator="\n")
    # Add rows using writerow(list)
    writer.writerow(["city", "year", "season"])
    writer.writerow(["Nagano", 1998, "winter"])
    writer.writerow(["Sydney", 2000, "summer"])
    writer.writerow(["Salt Lake City", 2002, "winter"])
    writer.writerow(["Athens", 2004, "summer"])
    writer.writerow(["Torino", 2006, "winter"])
    writer.writerow(["Beijing", 2008, "summer"])
    writer.writerow(["Vancouver", 2010, "winter"])
    writer.writerow(["London", 2012, "summer"])
    writer.writerow(["Sochi", 2014, "winter"])
    writer.writerow(["Rio de Janeiro", 2016, "summer"])

# Output
# Open csv0.csv in read mode ("r") as the variable csvfile
with open("./4050_data_cleansing_data/csv0.csv", "r") as csvfile:
    print(csvfile.read())
When executed, a CSV data file called csv0.csv is created and its contents are displayed. To find out where the CSV file was created, run:
import os
print(os.getcwd())
You can also create CSV data using Pandas, without the csv library. Pandas is more convenient when converting data that is already in Pandas DataFrame format into CSV.
To create a CSV file in Pandas, use the to_csv() method.
to_csv("name of the CSV file to create")
As a DataFrame example, output the same data as in the previous sample (the city, year, and season of the past 10 Olympic Games) as a CSV data file.
import pandas as pd
data = {"city": ["Nagano", "Sydney", "Salt Lake City", "Athens", "Torino", "Beijing", "Vancouver", "London", "Sochi", "Rio de Janeiro"],
"year": [1998, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016],
"season": ["winter", "summer", "winter", "summer", "winter", "summer", "winter", "summer", "winter", "summer"]}
df = pd.DataFrame(data)
df.to_csv("4050_data_cleansing_data/csv1.csv")
# Running this creates a file called csv1.csv in the 4050_data_cleansing_data directory.
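Note that to_csv() also writes the DataFrame's row index as the first column by default. If you only want the data columns, pass index=False:
df.to_csv("4050_data_cleansing_data/csv1.csv", index=False)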
If a field in the data being read is empty, it is recognized as the missing value NaN (Not a Number). To remove missing NaN values and improve data quality, use the dropna() method.
# First, create a table of random values and intentionally make part of it missing.
import numpy as np
from numpy import nan as NA
import pandas as pd
sample_data_frame = pd.DataFrame(np.random.rand(10, 4))
# Intentionally introduce missing values
sample_data_frame.iloc[1, 0] = NA
sample_data_frame.iloc[2, 2] = NA
sample_data_frame.iloc[5:, 3] = NA
sample_data_frame
Deleting an entire row (or column) that contains a NaN is called listwise deletion. Use the dropna() method to remove all rows that contain at least one NaN. If you specify axis=1 as an argument, every column containing at least one NaN is removed instead.
sample_data_frame.dropna()
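For reference, here is the column-wise variant mentioned above. With this sample data, columns 0, 2, and 3 each contain a NaN, so only column 1 remains:
# Remove every column that contains at least one NaN
sample_data_frame.dropna(axis=1)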
If listwise deletion removes too many rows and leaves too little data, another option is to use only the available data: keep the columns with few missing values (for example, columns 0 and 1) and delete the rows containing NaN only within them. This is called pairwise deletion.
sample_data_frame[[0, 1]].dropna()
Besides removing missing values, you can also improve data quality by assigning alternative data to them. To substitute (replace) alternative data for the missing value NaN, use the fillna() method.
Again, create a table of random values with part of it intentionally missing.
import numpy as np
from numpy import nan as NA
import pandas as pd
sample_data_frame = pd.DataFrame(np.random.rand(10, 4))
# Intentionally introduce missing values
sample_data_frame.iloc[1, 0] = NA
sample_data_frame.iloc[2, 2] = NA
sample_data_frame.iloc[5:, 3] = NA
# fillna() assigns the value given as an argument to every NaN. This time we fill with 0.
sample_data_frame.fillna(0)
# If you specify "ffill" for the method argument, each NaN is filled with the value from the previous row.
sample_data_frame.fillna(method="ffill")
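As a side note, recent pandas versions deprecate fillna(method=...) in favor of dedicated methods, so depending on your version the same forward fill can also be written as:
# Forward fill: each NaN takes the value from the previous row
sample_data_frame.ffill()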
Filling in missing values with the mean value of the column (or row) is called mean imputation. Calculate the mean with the mean() method.
import numpy as np
from numpy import nan as NA
import pandas as pd
sample_data_frame = pd.DataFrame(np.random.rand(10, 4))
# Intentionally introduce missing values
sample_data_frame.iloc[1, 0] = NA
sample_data_frame.iloc[2, 2] = NA
sample_data_frame.iloc[5:, 3] = NA
#Substitute the mean value of the column for the NaN part using fillna
sample_data_frame.fillna(sample_data_frame.mean())
Statistics can be divided into representative values (measures of central tendency) and measures of dispersion. A representative value is a single value that summarizes the basic characteristics of the data, such as the mean, maximum, or minimum.
import pandas as pd
df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)
df.columns=["", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium","Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins","Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"]
# Calculate the mean of the "Alcohol" column of the DataFrame df
df["Alcohol"].mean()
#Output result
13.000617977528091
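The maximum and minimum mentioned above are obtained the same way, and describe() conveniently shows several summary statistics at once:
df["Alcohol"].max()
df["Alcohol"].min()
# Count, mean, standard deviation, quartiles, etc. in one call
df["Alcohol"].describe()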
If the data contains duplicates, delete them to improve data quality. Prepare a DataFrame that actually contains duplicate data and try extracting and deleting the duplicates.
import pandas as pd
from pandas import DataFrame
dupli_data = DataFrame({"col1":[1, 1, 2, 3, 4, 4, 6, 6]
,"col2":["a", "b", "b", "b", "c", "c", "b", "b"]})
dupli_data
The duplicated() method generates Series-type data that returns True for duplicated rows, allowing you to extract the duplicate data.
#Extract duplicate data
dupli_data.duplicated()
#Output result
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 True
dtype: bool
dtype stands for "Data Type" and indicates the data type of the element.
Use the drop_duplicates() method to remove duplicate data.
dupli_data.drop_duplicates()
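drop_duplicates() also takes optional arguments; as a brief sketch, subset restricts the comparison to specific columns and keep controls which occurrence is retained:
# Treat rows as duplicates based on col2 only, keeping the last occurrence
dupli_data.drop_duplicates(subset=["col2"], keep="last")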
What is mapping? For data that shares a common key, it is the process of looking up the value corresponding to that key in another table. Let's actually prepare a DataFrame and try the mapping process.
import pandas as pd
from pandas import DataFrame
attri_data1 = {"ID": ["100", "101", "102", "103", "104", "106", "108", "110", "111", "113"]
,"city": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo", "Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"]
,"birth_year" :[1990, 1989, 1992, 1997, 1982, 1991, 1988, 1990, 1995, 1981]
,"name" :["Hiroshi", "Akiko", "Yuki", "Satoru", "Steeve", "Mituru", "Aoi", "Tarou", "Suguru", "Mitsuo"]}
attri_data_frame1 = DataFrame(attri_data1)
attri_data_frame1
Create dictionary-type data that maps each city to its region name.
city_map ={"Tokyo":"Kanto"
,"Hokkaido":"Hokkaido"
,"Osaka":"Kansai"
,"Kyoto":"Kansai"}
city_map
Using the city column of attri_data_frame1 prepared at the beginning as the key, look up the corresponding region name in city_map and add it as a new column. This is the mapping process. If you are familiar with Excel, think of something like the VLOOKUP function. Perform the mapping with the map() method and add region to attri_data_frame1 as a new column.
attri_data_frame1["region"] = attri_data_frame1["city"].map(city_map)
attri_data_frame1
Looking at the output, you can see that the region names have been added in the region column. Elements that have no corresponding entry in city_map are filled with NaN.
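If you would rather not leave NaN in the new column, you can combine map() with the fillna() method from earlier (a sketch; "Other" is an arbitrary placeholder label):
attri_data_frame1["region"] = attri_data_frame1["city"].map(city_map).fillna("Other")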
Binning (bin division) is the process of roughly dividing numerical data into categories, for example dividing ages into groups such as 0-9, 10-19, and 20-29. Prepare a list of bin edges in advance and process the data with the pandas cut() function.
import pandas as pd
from pandas import DataFrame
attri_data1 = {"ID": [100,101,102,103,104,106,108,110,111,113]
,"city": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo", "Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"]
,"birth_year" :[1990, 1989, 1992, 1997, 1982, 1991, 1988, 1990, 1995, 1981]
,"name" :["Hiroshi", "Akiko", "Yuki", "Satoru", "Steeve", "Mituru", "Aoi", "Tarou", "Suguru", "Mitsuo"]}
attri_data_frame1 = DataFrame(attri_data1)
Specify the granularity of the division in the list and perform the bin division. Here we focus on birth_year.
#Create a list of partition granularity
birth_year_bins = [1980,1985,1990,1995,2000]
#Do a bin split
birth_year_cut_data = pd.cut(attri_data_frame1.birth_year,birth_year_bins)
birth_year_cut_data
#Output result
0 (1985, 1990]
1 (1985, 1990]
2 (1990, 1995]
3 (1995, 2000]
4 (1980, 1985]
5 (1990, 1995]
6 (1985, 1990]
7 (1985, 1990]
8 (1990, 1995]
9 (1980, 1985]
Name: birth_year, dtype: category
Categories (4, interval[int64]): [(1980, 1985] < (1985, 1990] < (1990, 1995] < (1995, 2000]]
"()" Does not include that value, and "[]" means that it does. For example, (1985, 1990] does not include 1985, but includes 1990.
If you want to count the number of elements in each bin, use value_counts().
pd.value_counts(birth_year_cut_data)
#Output result
(1985, 1990] 4
(1990, 1995] 3
(1980, 1985] 2
(1995, 2000] 1
Name: birth_year, dtype: int64
You can also give each bin a name.
group_names = ["first1980", "second1980", "first1990", "second1990"]
birth_year_cut_data = pd.cut(attri_data_frame1.birth_year,birth_year_bins,labels = group_names)
pd.value_counts(birth_year_cut_data)
#Output result
second1980 4
first1990 3
first1980 2
second1990 1
Name: birth_year, dtype: int64
It is also possible to divide by simply specifying the number of bins in advance. In that case cut() creates bins of equal width, so you can use it to split the value range into pieces of about the same size. Specify the number of bins as the second argument of the cut() function.
pd.cut(attri_data_frame1.birth_year,2)
#Output result
0 (1989.0, 1997.0]
1 (1980.984, 1989.0]
2 (1989.0, 1997.0]
3 (1989.0, 1997.0]
4 (1980.984, 1989.0]
5 (1989.0, 1997.0]
6 (1980.984, 1989.0]
7 (1989.0, 1997.0]
8 (1989.0, 1997.0]
9 (1980.984, 1989.0]
Name: birth_year, dtype: category
Categories (2, interval[float64]): [(1980.984, 1989.0] < (1989.0, 1997.0]]
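As a related technique not shown above, if you want each bin to contain roughly the same number of elements rather than spanning the same width, pandas also provides the qcut() function:
# Split birth_year into 2 bins with roughly equal numbers of elements
pd.qcut(attri_data_frame1.birth_year, 2)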