**Create and understand decision trees from scratch in Python**
1. Overview - 2. Python Program Basics - 3. Data Analysis Library Pandas
This part explains the pandas library, which is used to create the decision tree.
# Import pandas and make it available in the program under the name pd.
import pandas as pd
3.2 DataFrame, Series
pandas works with two data structures, DataFrame and Series. When data is laid out like an Excel table, as shown in the following figure, with each row holding one record and each column holding one attribute of the data, a DataFrame represents the entire table and a Series represents a single row.
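To make the distinction concrete, here is a minimal sketch on a tiny made-up table (the two-row data and the variable names are illustrative only, not the article's Excel file). It also shows that a single column, not just a single row, comes back as a Series.

```python
import pandas as pd

# A tiny two-row table used only for illustration.
df = pd.DataFrame({"weather": ["Fine", "Cloudy"], "golf": ["×", "○"]})

row = df.iloc[0]      # one row of the table
col = df["weather"]   # one column of the table

print(type(df).__name__)   # DataFrame
print(type(row).__name__)  # Series
print(type(col).__name__)  # Series
print(row["golf"])         # ×
```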
Reading and writing an Excel file: read_excel, [ExcelWriter](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html)
# Place the Excel file in the same location as this ipynb file.
df0 = pd.read_excel("data_golf.xlsx")
#Display the DataFrame as an HTML table.
from IPython.display import HTML
html = "<div style='font-family:\"Meiryo\";'>"+df0.to_html()+"</div>"
HTML(html)
# Save to an Excel file (the with statement closes f automatically).
with pd.ExcelWriter("data_golf2.xlsx") as f:
    df0.to_excel(f)
Creating a DataFrame from a dictionary (associative array): the dictionary organizes the data by column.
# Create from a dictionary: the data is grouped by column.
d = {
"weather":["Fine","Fine","Cloudy","rain","rain","rain","Cloudy","Fine","Fine","rain","Fine","Cloudy","Cloudy","rain"],
"temperature":["Hot","Hot","Hot","Warm","Ryo","Ryo","Ryo","Warm","Ryo","Warm","Warm","Warm","Hot","Warm"],
"Humidity":["High","High","High","High","usually","usually","usually","High","usually","usually","usually","High","usually","High"],
"Wind":["Nothing","Yes","Nothing","Nothing","Nothing","Yes","Yes","Nothing","Nothing","Nothing","Yes","Yes","Nothing","Yes"],
"golf":["×","×","○","○","○","×","○","×","○","○","○","○","○","×"],
}
df0 = pd.DataFrame(d)
Creating a DataFrame from an array: the array organizes the data by row.
# Create from an array: the data is grouped by row.
d = [["Fine","Hot","High","Nothing","×"],
["Fine","Hot","High","Yes","×"],
["Cloudy","Hot","High","Nothing","○"],
["rain","Warm","High","Nothing","○"],
["rain","Ryo","usually","Nothing","○"],
["rain","Ryo","usually","Yes","×"],
["Cloudy","Ryo","usually","Yes","○"],
["Fine","Warm","High","Nothing","×"],
["Fine","Ryo","usually","Nothing","○"],
["rain","Warm","usually","Nothing","○"],
["Fine","Warm","usually","Yes","○"],
["Cloudy","Warm","High","Yes","○"],
["Cloudy","Hot","usually","Nothing","○"],
["rain","Warm","High","Yes","×"],
]
df0 = pd.DataFrame(d,columns=["weather","temperature","Humidity","Wind","golf"])
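As a quick check that the two construction styles describe the same table, the following sketch builds a three-row excerpt both ways and compares the results. The names `by_column` and `by_row` are illustrative, and the data is a shortened version of the table above.

```python
import pandas as pd

# Column-oriented: each key is a column name, each list holds that column's values.
by_column = pd.DataFrame({
    "weather": ["Fine", "Cloudy", "rain"],
    "golf": ["×", "○", "○"],
})

# Row-oriented: each inner list is one row; column names are passed separately.
by_row = pd.DataFrame(
    [["Fine", "×"], ["Cloudy", "○"], ["rain", "○"]],
    columns=["weather", "golf"],
)

# equals() compares shape, labels and values.
print(by_column.equals(by_row))  # True
```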
# Get table information.
# Number of rows and columns
print(df0.shape) # Output (14, 5)
# Number of rows
print(df0.shape[0]) # Output 14
# Column names
print(df0.columns) # Output Index(['weather', 'temperature', 'Humidity', 'Wind', 'golf'], dtype='object')
# Row names (the row names of df0 are an automatically assigned index)
print(df0.index) # Output RangeIndex(start=0, stop=14, step=1)
# Get values
# Get a single value by specifying the row and the column.
# Get the Humidity of row number 1 (the second record).
print(df0.loc[1,"Humidity"]) # Output High
# Specify multiple rows and columns as lists to get several values at once.
# Get the weather and golf values of rows 1, 2 and 4 together; the result is also a DataFrame.
df = df0.loc[[1,2,4],["weather","golf"]]
print(df)
# Output
#    weather  golf
# 1     Fine     ×
# 2   Cloudy     ○
# 4     rain     ○
print(type(df)) #output<class 'pandas.core.frame.DataFrame'>
# Slices (the notation for extracting a range from an array) can also be used to specify rows and columns.
# Get all columns of rows 1 to 4. loc selects by name (label), so 1:4 includes 4.
df = df0.loc[1:4,:]
print(df)
# Output
#    weather  temperature  Humidity     Wind  golf
# 1     Fine          Hot      High      Yes     ×
# 2   Cloudy          Hot      High  Nothing     ○
# 3     rain         Warm      High  Nothing     ○
# 4     rain          Ryo   usually  Nothing     ○
# iloc selects rows and columns by position. Positions are counted from 0.
# Get every column except the last one (golf) for rows 1 to 3. iloc selects by position, so 1:4 does not include 4.
df = df0.iloc[1:4,:-1]
print(df)
# Output
#    weather  temperature  Humidity     Wind
# 1     Fine          Hot      High      Yes
# 2   Cloudy          Hot      High  Nothing
# 3     rain         Warm      High  Nothing
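The difference between label-based and position-based slicing is easier to see when the row labels do not coincide with the positions. The sketch below assumes a hypothetical index of 10, 20, 30, 40 (not used anywhere in the article) purely to separate the two notions.

```python
import pandas as pd

df = pd.DataFrame(
    {"weather": ["Fine", "Cloudy", "rain", "rain"]},
    index=[10, 20, 30, 40],  # hypothetical row labels, chosen to differ from positions
)

by_label = df.loc[20:30]     # labels 20 through 30; the end label is included
by_position = df.iloc[1:3]   # positions 1 and 2; the end position is excluded

print(by_label.equals(by_position))  # True
print(len(df.loc[10:30]))            # 3
print(len(df.iloc[0:2]))             # 2
```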
# Get a value from a single row (Series).
# Get the data of the first row. s is a Series.
s = df0.iloc[0,:]
# As with a dictionary, s["column name"] returns the value.
print(s["weather"]) # Output Fine
# Get all values as an array (numpy.ndarray).
print(df0.values)
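As a small illustration of what the array form looks like, the sketch below uses a two-row excerpt (illustrative data). Rows and columns of the resulting ndarray are addressed by position only, since the column names are not carried over.

```python
import pandas as pd

df = pd.DataFrame({"weather": ["Fine", "rain"], "golf": ["×", "○"]})

arr = df.values  # df.to_numpy() is the equivalent method call

print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2)
print(arr[1, 0])           # rain
```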
# Loop over the data to process it sequentially.
# Loop over rows: look at the data one row at a time.
for i,row in df0.iterrows():
    # i is the row name (row index), row is a Series
    print(i,row)
# Loop over columns: look at the data one column at a time.
# (iteritems() was removed in pandas 2.0; items() is the equivalent.)
for i,col in df0.items():
    # i is the column name, col is a Series
    print(i,col)
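The two loops can do more than print. As a minimal sketch on illustrative four-row data, the row loop below collects the labels of the rows where golf is "○", and the column loop counts the distinct values in each column. It uses `items()`, which is available in current pandas releases.

```python
import pandas as pd

df = pd.DataFrame({
    "weather": ["Fine", "Cloudy", "rain", "Fine"],
    "golf": ["×", "○", "○", "○"],
})

# Row-wise: i is the row label, row is a Series holding that row's values.
played = [i for i, row in df.iterrows() if row["golf"] == "○"]
print(played)  # [1, 2, 3]

# Column-wise: name is the column name, col is a Series holding that column.
distinct = {name: col.nunique() for name, col in df.items()}
print(distinct)  # {'weather': 3, 'golf': 2}
```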
# Frequency (the number of times each value appears)
# Get all the data of the weather column. s is a Series.
s = df0.loc[:,"weather"]
# Get which values appear and how many times each one appears.
print(s.value_counts())
# Output
# Fine      5
# rain      5
# Cloudy    4
# Name: weather, dtype: int64
# (pandas 2.x labels the index weather and prints Name: count instead.)
# For example, get the number of Fine records.
print(s.value_counts()["Fine"]) #Output 5
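Two related details may be useful here, shown as a sketch on illustrative data: indexing the counts with a value that never occurs raises a `KeyError`, so `get` with a default is a safer lookup, and `normalize=True` returns proportions instead of counts. The value "Snow" is a made-up example of an absent category.

```python
import pandas as pd

s = pd.Series(["Fine", "Fine", "rain", "Cloudy"], name="weather")

counts = s.value_counts()
print(counts["Fine"])         # 2
print(counts.get("Snow", 0))  # 0 ("Snow" never occurs, so the default is returned)

# Proportion of each value instead of the raw count.
ratios = s.value_counts(normalize=True)
print(ratios["Fine"])         # 0.5
```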
# Extract specific data.
# Get the records where the weather is Fine.
print(df0.query("weather=='Fine'"))
# Output
#     weather  temperature  Humidity     Wind  golf
# 0      Fine          Hot      High  Nothing     ×
# 1      Fine          Hot      High      Yes     ×
# 7      Fine         Warm      High  Nothing     ×
# 8      Fine          Ryo   usually  Nothing     ○
# 10     Fine         Warm   usually      Yes     ○
# Get the records where the weather is Fine and golf is played.
print(df0.query("weather=='Fine' and golf=='○'"))
# Output
#     weather  temperature  Humidity     Wind  golf
# 8      Fine          Ryo   usually  Nothing     ○
# 10     Fine         Warm   usually      Yes     ○
# Get the records where the weather is Fine or golf is played.
print(df0.query("weather=='Fine' or golf=='○'"))
# Output
#     weather  temperature  Humidity     Wind  golf
# 0      Fine          Hot      High  Nothing     ×
# 1      Fine          Hot      High      Yes     ×
# 2    Cloudy          Hot      High  Nothing     ○
# 3      rain         Warm      High  Nothing     ○
# 4      rain          Ryo   usually  Nothing     ○
# 6    Cloudy          Ryo   usually      Yes     ○
# 7      Fine         Warm      High  Nothing     ×
# 8      Fine          Ryo   usually  Nothing     ○
# 9      rain         Warm   usually  Nothing     ○
# 10     Fine         Warm   usually      Yes     ○
# 11   Cloudy         Warm      High      Yes     ○
# 12   Cloudy          Hot   usually  Nothing     ○
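The same selections can also be written with boolean masks instead of a query string. This is an alternative notation rather than the approach the article uses. The sketch below, on illustrative four-row data, checks that both forms return the same rows.

```python
import pandas as pd

df = pd.DataFrame({
    "weather": ["Fine", "Fine", "Cloudy", "rain"],
    "golf": ["×", "○", "○", "×"],
})

# "and" in query corresponds to & between masks; each comparison needs parentheses.
via_query = df.query("weather == 'Fine' and golf == '○'")
via_mask = df[(df["weather"] == "Fine") & (df["golf"] == "○")]
print(via_query.equals(via_mask))  # True
print(via_query.index.tolist())    # [1]

# "or" in query corresponds to | between masks.
either = df[(df["weather"] == "Fine") | (df["golf"] == "○")]
print(either.index.tolist())       # [0, 1, 2]
```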