Aidemy 2020/9/21
Hello, this is Yope! I'm a liberal arts student, but I was interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I'd like to share what I learned there, so I'm summarizing it on Qiita. My last article was my first post, and I'm very happy that so many people read it. Thank you! This second post is about data cleansing. Nice to meet you.
What to learn this time
・About CSV
・Handling missing values (NaN)
・Processing data for each key in a dictionary
1 CSV
・CSV is a data file format often used in data analysis. First, load it with Pandas. __pd.read_csv("data file name", header=row to use as the header)__
import pandas as pd
# Load wine.csv
df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)
#Set the column name
df.columns = [abridgement]
#output
df
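To see what the `header` argument actually does, here is a minimal sketch; the file name `sample.csv` and its contents are made up for illustration.

```python
import pandas as pd

# Create a tiny hypothetical CSV file with a header row
with open("sample.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# header=0 (the default): the first row becomes the column names
df_h = pd.read_csv("sample.csv", header=0)
print(list(df_h.columns))  # ['a', 'b']

# header=None: no header row; columns are numbered 0, 1, ...
df_n = pd.read_csv("sample.csv", header=None)
print(list(df_n.columns))  # [0, 1]
```

With `header=None`, the first data row stays as data, which is why the wine.csv example above assigns `df.columns` by hand afterwards.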
・CSV can also be created with the "csv library" built into Python. Open the file with __with open("file name", "w") as csvfile:__, create a writer inside it with __writer = csv.writer(csvfile, lineterminator="\n")__ (csv.writer takes the file object and the line feed code as arguments), then write data with __writer.writerow([data])__.
import csv
with open("file name","w") as csvfile:
writer=csv.writer(csvfile, lineterminator="\n")
writer.writerow(["version","year","gen"]) #The first line is the column name
writer.writerow(["Red-green",1996,1])
writer.writerow(["Gold and silver",1999,2])
writer.writerow(["RS",2002,3])
#Read
with open("file name","r") as csvfile:
print(csvfile.read())
# version,year,gen
# Red-green,1996,1
# Gold and silver,1999,2
# RS,2002,3
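The example above reads the whole file back as one string; the csv library can also parse it row by row with csv.reader. A minimal sketch, using a hypothetical file name `versions.csv`:

```python
import csv

# Write a small hypothetical CSV file first
with open("versions.csv", "w") as csvfile:
    writer = csv.writer(csvfile, lineterminator="\n")
    writer.writerow(["version", "year", "gen"])
    writer.writerow(["Red-green", 1996, 1])

# csv.reader parses each line into a list of strings
with open("versions.csv", "r") as csvfile:
    rows = list(csv.reader(csvfile))
print(rows)  # [['version', 'year', 'gen'], ['Red-green', '1996', '1']]
```

Note that csv.reader returns every field as a string, so numbers must be converted back by hand.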
・Converting a DataFrame to CSV is even easier. __DataFrame.to_csv("file name")__
data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999,2002],"gen":[1,2,3]}
df=pd.DataFrame(data)
df.to_csv("file name")
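A quick sketch of the round trip: to_csv writes the row index as an extra column by default, so `index=False` is passed here to get the data back unchanged (the file name `versions_df.csv` is made up for illustration).

```python
import pandas as pd

data = {"version": ["Red-green", "Gold and silver", "RS"],
        "year": [1996, 1999, 2002], "gen": [1, 2, 3]}
df = pd.DataFrame(data)

# index=False omits the row-index column from the file
df.to_csv("versions_df.csv", index=False)

# Reading it back reproduces the original DataFrame
df2 = pd.read_csv("versions_df.csv")
print(df.equals(df2))  # True
```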
2 Missing values (NaN)
・Deleting rows or columns that contain NaN is called __"listwise deletion"__. Listwise deletion is done with the dropna() function. With no argument, rows are deleted; with axis=1, columns are deleted. __DataFrame.dropna()__
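A minimal sketch of listwise deletion (the DataFrame here is made up, with NaN filled in via numpy):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"version": ["Red-green", "Gold and silver", "RS"],
                   "year": [1996, 1999, np.nan],
                   "gen": [1, np.nan, np.nan]})

# No argument: drop every row that contains at least one NaN
rows_kept = df.dropna()
print(len(rows_kept))  # 1: only the "Red-green" row is complete

# axis=1: drop every column that contains at least one NaN
cols_kept = df.dropna(axis=1)
print(list(cols_kept.columns))  # ['version']
```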
・If listwise deletion would leave too little data, it is better to keep only the columns with few missing values and then delete the rows containing NaN from those. This is called __"pairwise deletion"__. __DataFrame[[columns to keep]].dropna()__
import numpy as np
import pandas as pd
data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999,np.nan],"gen":[1,np.nan,np.nan]}
df=pd.DataFrame(data)
# Listwise deletion would leave too little data, so keep only the version and year columns and drop rows containing NaN from those
df[["version","year"]].dropna()
# version year
# Red-green 1996.0
# Gold and silver 1999.0
・Instead of deleting missing values, they can be completed by substituting alternative data. Use the fillna() function, passing either the alternative value or method="ffill" (carry the previous row's value forward). __DataFrame.fillna()__
import numpy as np
import pandas as pd
data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999,np.nan],"gen":[1,np.nan,np.nan]}
df=pd.DataFrame(data)
# Replace each NaN with the previous row's value
df.fillna(method="ffill")
# version year gen
# Red-green 1996.0 1.0
# Gold and silver 1999.0 1.0
# RS 1999.0 1.0
・By passing mean() as the argument of fillna(), NaN can be replaced with the mean value of its column. __DataFrame.fillna(DataFrame.mean())__
import numpy as np
import pandas as pd
data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999,np.nan],"gen":[1,np.nan,np.nan]}
df=pd.DataFrame(data)
# Replace each NaN with the mean of its column (numeric_only=True skips the string column)
df.fillna(df.mean(numeric_only=True))
# version year gen
# Red-green 1996.0 1.0
# Gold and silver 1999.0 1.0
# RS 1997.5 1.0
3 Processing data for each key
・The mean, maximum, and other statistics for each key of a dictionary or CSV (DataFrame) can be calculated with statistical functions such as mean() and max(). __df[key].statistical function__
df = pd.read_csv("./4050_data_cleansing_data/wine.csv", header=None)
df.columns = ["","Alcohol",(abridgement)]
# Calculate the mean of the "Alcohol" column
df["Alcohol"].mean()
# 13.000617977528091
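Since wine.csv is not bundled here, the same key-based statistics can be sketched with a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"version": ["Red-green", "Gold and silver", "RS"],
                   "year": [1996, 1999, 2002]})

# Select a column by key, then apply a statistical function to it
print(df["year"].mean())  # 1999.0
print(df["year"].max())   # 2002
```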
・Use the duplicated() method to extract rows that are exact duplicates, and the drop_duplicates() method to delete them. __DataFrame.duplicated()__ __DataFrame.drop_duplicates()__
data={"version":["Red-green","Gold and silver","RS","Red-green"],"year":[1996,1999,2002,1996],"gen":[1,2,3,1]}
df=pd.DataFrame(data)
#Delete duplicate data
df.drop_duplicates()
# version,year,gen
# Red-green,1996,1
# Gold and silver,1999,2
# RS,2002,3
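The duplicated() method mentioned above only flags the duplicates instead of deleting them; a minimal sketch with the same made-up data:

```python
import pandas as pd

df = pd.DataFrame({"version": ["Red-green", "Gold and silver", "RS", "Red-green"],
                   "year": [1996, 1999, 2002, 1996],
                   "gen": [1, 2, 3, 1]})

# duplicated() marks each row that exactly repeats an earlier one
flags = df.duplicated()
print(flags.tolist())  # [False, False, False, True]

# The boolean flags can then be used to inspect just the duplicate rows
print(df[flags]["version"].tolist())  # ['Red-green']
```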
・Creating a new key that corresponds to an existing key is called __"mapping"__. __DataFrame[new key] = DataFrame[reference key].map(dictionary-type correspondence table)__
data={"version":["Red-green","Gold and silver","RS"],"year":[1996,1999,2002],"gen":[1,2,3]}
df=pd.DataFrame(data)
#Map after creating a dictionary-type correspondence table
version_map={"Red-green":"Kanto","Gold and silver":"Johto","RS":"Hoenn"}
df["region"] = df["version"].map(version_map)
df
# version,year,gen,region
# Red-green,1996,1,Kanto
# Gold and silver,1999,2,Johto
# RS,2002,3,Hoenn
・Setting ranges such as "0-5" and "6-10" and dividing or aggregating the data by range is called __"bin division"__. Bin division is done with pd.cut().
__pd.cut(data[key to divide], [list of bin edges], labels=[list of names for each bin])__ #bin division
・There is also a way to specify just the number of bins and have the ranges divided automatically.
__pd.cut(data[key to divide], number of bins)__
__pd.value_counts(binned data)__ #count the data in each bin
data={"version":["Red-green","Gold and silver","RS","DP"],"year":[1996,1999,2002,2006],"gen":[1,2,3,4]}
df=pd.DataFrame(data)
# Divide by year: first create the lists of bin edges and bin names
# (include_lowest=True keeps the leftmost value, 1996, inside the first bin)
bin_edges = [1996,2000,2007]
bin_names = ["1996-2000","2000-2007"]
cut_df = pd.cut(df["year"], bin_edges, labels=bin_names, include_lowest=True)
pd.value_counts(cut_df)
# 1996-2000 2
# 2000-2007 2
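The automatic variant, where only the number of bins is given, can be sketched like this (the data is made up; the bin edges are chosen by pandas as equal-width intervals over the observed range):

```python
import pandas as pd

years = pd.Series([1996, 1999, 2002, 2006])

# Passing an integer splits the value range into that many equal-width bins
cut2 = pd.cut(years, 2)

# sort=False keeps the bins in range order rather than count order
counts = cut2.value_counts(sort=False)
print(counts.tolist())  # [2, 2]: two values fall into each half of the range
```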
Summary
・A data file format called CSV is often used in data analysis. CSV can be created with the standard Python library or with Pandas.
・Missing values are handled by deleting them with dropna() or completing them with fillna().
・In dictionaries and DataFrames, you can calculate statistics by referring to keys, delete duplicate data (drop_duplicates), create new keys (map), and divide data into ranges (pd.cut).
That's all. Thank you for reading this far.