I kept digging through the author's GitHub code every time I needed it, so I've put the Python parts together here for quick reference.
[Complete Preprocessing [SQL/R/Python Practical Techniques for Data Analysis]](https://www.amazon.co.jp/%E5%89%8D%E5%87%A6%E7%90%86%E5%A4%A7%E5%85%A8%EF%BC%BB%E3%83%87%E3%83%BC%E3%82%BF%E5%88%86%E6%9E%90%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AESQL-R-Python%E5%AE%9F%E8%B7%B5%E3%83%86%E3%82%AF%E3%83%8B%E3%83%83%E3%82%AF%EF%BC%BD-%E6%9C%AC%E6%A9%8B-%E6%99%BA%E5%85%89-ebook/dp/B07C3JFK3V)
https://github.com/ghmagazine/awesomebook
The Python portions below are extracted from the GitHub repository above. See the book for detailed explanations.
These source codes are licensed under the BSD 3-Clause License.
BSD 3-Clause License
Copyright (c) 2018, Tomomitsu Motohashi All rights reserved.
# Imports used throughout
import pandas as pd
import numpy as np

# Load data via the book's preprocess library
from preprocess.load_data.data_loader import load_hotel_reserve
customer_tb, hotel_tb, reserve_tb = load_hotel_reserve()

# Read from CSV
reserve_tb = pd.read_csv('./data/reserve.csv', encoding='UTF-8')
# Verify the data types
reserve_tb.dtypes
print(type(reserve_tb))
# Convert people_num to float64
reserve_tb['people_num'] = reserve_tb['people_num'].astype('float64')
# Avoid selecting rows/columns by position with the iloc function; positional indexing is a breeding ground for bugs.
# However, KFold uses iloc (a sketch follows the selection examples below).
reserve_tb[['reserve_id','hotel_id','customer_id','reserve_datetime','checkin_date','checkin_time','checkout_date']]
reserve_tb.loc[:, ['reserve_id','hotel_id','customer_id','reserve_datetime','checkin_date','checkin_time','checkout_date']]
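As a minimal sketch of the KFold case mentioned above (scikit-learn's KFold is assumed; its split function yields positional row indices, which is exactly where iloc belongs):
from sklearn.model_selection import KFold

# KFold.split yields positional indices, so iloc is appropriate here
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(reserve_tb):
    train = reserve_tb.iloc[train_idx]
    test = reserve_tb.iloc[test_idx]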
# Set axis=1 to drop columns
# With inplace=True, reserve_tb itself is rewritten
reserve_tb.drop(['people_num', 'total_price'], axis=1, inplace=True)
reserve_tb.query('"2016-10-13" <= checkout_date <= "2016-10-14"')
# Join conditions with & for AND
# Join conditions with | for OR
# Writing @ before a variable name (e.g. @var_name) lets you reference Python variables in memory from within the query string
# The query function does not support in, so use the isin function instead (see the sampling example below)
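A small sketch of the query syntax described above (the threshold value is illustrative, not from the book):
max_people = 2
reserve_tb.query('people_num <= @max_people & "2016-10-13" <= checkout_date')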
# Sample customer IDs and extract the transactions (reservations) belonging to the sampled IDs
# reserve_tb['customer_id'].unique() returns the customer_id values with duplicates removed
# Convert to a pandas.Series (a pandas list-like object) so the sample function can be used
# Sample customer IDs with the sample function
target = pd.Series(reserve_tb['customer_id'].unique()).sample(frac=0.5)
# Use the isin function to extract rows whose customer_id matches any of the sampled customer IDs
reserve_tb[reserve_tb['customer_id'].isin(target)]
# Sample 50% of the rows of reserve_tb
reserve_tb.sample(frac=0.5)
When you want to aggregate per hotel_id, or count unique customer_id values per hotel_id:
# Use the agg function to specify several aggregations at once
# Apply the count function to reserve_id
# Apply the nunique function to customer_id
result = reserve_tb \
.groupby('hotel_id') \
.agg({'reserve_id': 'count', 'customer_id': 'nunique'})
# Renumber the index with the reset_index function (inplace=True updates result directly)
result.reset_index(inplace=True)
result.columns = ['hotel_id', 'rsv_cnt', 'cus_cnt']
# Reference
# Goal: store the count of records where the reserve_id/customer_id pair is duplicated in a 'dup' column
# Use transform('count') when you want to append the duplicate-row count as a column of the data frame
reserve_tb['dup'] = reserve_tb.groupby(['reserve_id','customer_id']).transform('count')
# If there is only one aggregation, it is simpler to write it without the agg function
# Specify the combination of hotel_id and people_num as the aggregation unit
# Take total_price out of the grouped data and apply the sum function to compute the total sales amount
result = reserve_tb \
.groupby(['hotel_id', 'people_num'])['total_price'] \
.sum().reset_index()
# The total-sales column is still named total_price, so rename it to price_sum
result.rename(columns={'total_price': 'price_sum'}, inplace=True)
The groupby -> agg combo:
# A percentile aggregation (here q=20) cannot be specified as a string in the agg function, so a lambda expression is used instead
# Apply the max/min/mean/median functions to total_price
# Specify a Python lambda expression as an aggregation in the agg function
# Inside the lambda, use np.percentile to compute the percentile value (here the 20th percentile)
result = reserve_tb \
.groupby('hotel_id') \
.agg({'total_price': ['max', 'min', 'mean', 'median',
lambda x: np.percentile(x, q=20)]}) \
.reset_index()
result.columns = ['hotel_id', 'price_max', 'price_min', 'price_mean',
'price_median', 'price_20per']
# Apply the var and std functions to total_price to compute the variance and standard deviation
result = reserve_tb \
.groupby('hotel_id') \
.agg({'total_price': ['var', 'std']}).reset_index()
result.columns = ['hotel_id', 'price_var', 'price_std']
# When a group has only one row, the variance and standard deviation are NA, so replace them with 0
# fillna replaces every NA in the DataFrame, so be careful not to overwrite unrelated values
result.fillna(0, inplace=True)
# Round to the nearest 1000 with the round function, then compute the mode with the mode function
reserve_tb['total_price'].round(-3).mode()
# Convert from string to timestamp type so the rank function can sort
# (explained in "Chapter 10: Date and Time Types")
reserve_tb['reserve_datetime'] = pd.to_datetime(
reserve_tb['reserve_datetime'], format='%Y-%m-%d %H:%M:%S'
)
# Add log_no as a new column
# Use groupby to specify the aggregation unit
# Within each customer, rank by reserve_datetime using the rank function
# ascending=True ranks in ascending order (False for descending)
reserve_tb['log_no'] = reserve_tb \
.groupby('customer_id')['reserve_datetime'] \
.rank(ascending=True, method='first')
#Make the data as small as possible before the join process
pd.merge(reserve_tb.query('people_num == 1'),
hotel_tb.query('is_business'),
on='hotel_id', how='inner')
# When you want to shift rows by n and join (e.g., to add past data as features)
# shift function: shifts data up or down by n rows
# Sort by reserve_datetime within each customer
# Apply the apply function after the groupby function to sort within each group
# Sort with the sort_values function; axis=0 sorts rows, axis=1 sorts columns
result = reserve_tb \
    .groupby('customer_id') \
    .apply(lambda group:
           group.sort_values(by='reserve_datetime', axis=0, inplace=False))
# result is already grouped by customer_id
# Save the total_price from two reservations earlier as before_price for each customer
# The shift function shifts the rows down by the number given in periods
result['before_price'] = \
pd.Series(result['total_price'].shift(periods=2))
- Python code gets longer than SQL here, so SQL is preferable for processing that needs window functions.
- SQL is overwhelmingly recommended for processing that joins with past data over a given time window.
- The rolling function can only select rows relative to its own data row.
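For reference, a minimal sketch of the rolling function mentioned above (the window size is illustrative, and the rows are assumed to be sorted by reserve_datetime within each customer):
# Rolling mean of total_price over the current and two previous rows per customer
roll_mean = reserve_tb \
    .groupby('customer_id')['total_price'] \
    .rolling(window=3, min_periods=1).mean()
# Drop the customer_id index level so the result aligns with reserve_tb's index
reserve_tb['price_roll_mean'] = roll_mean.reset_index(level=0, drop=True)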
from preprocess.load_data.data_loader import load_monthly_index
monthly_index_tb = load_monthly_index()
# The book's code starts from the line below
# Specify the start row number of the first training data in train_window_start
train_window_start = 1
# Specify the end row number of the first training data in train_window_end
train_window_end = 24
# Specify the number of validation rows in horizon
horizon = 12
# Set the number of rows to slide in skip
skip = 12
# Sort the data by year/month (sort_values returns a new DataFrame by default, so assign the result back)
monthly_index_tb = monthly_index_tb.sort_values(by='year_month')
while True:
    # Compute the end row number of the validation data
    test_window_end = train_window_end + horizon

    # Slice the training data out of the original data by row number
    # (fixing train_window_start at 1 turns this into an expanding-window validation with growing training data)
    train = monthly_index_tb[train_window_start:train_window_end]

    # Slice the validation data out of the original data by row number
    test = monthly_index_tb[(train_window_end + 1):test_window_end]

    # Stop once the end row number of the validation data reaches the number of rows in the original data
    if test_window_end >= len(monthly_index_tb.index):
        break

    # Slide the windows
    train_window_start += skip
    train_window_end += skip
# Summarize the results of the cross-validation
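The book leaves the actual training and summarizing to the reader; here is a minimal, hedged sketch in which evaluate_fold is a hypothetical placeholder for your own train-and-score logic:
# Hypothetical placeholder: train a model on train and return a score on test.
# It returns the fold size here only so that the sketch runs end to end.
def evaluate_fold(train, test):
    return len(test)

scores = []
train_window_start, train_window_end = 1, 24
while True:
    test_window_end = train_window_end + horizon
    train = monthly_index_tb[train_window_start:train_window_end]
    test = monthly_index_tb[(train_window_end + 1):test_window_end]
    if test_window_end >= len(monthly_index_tb.index):
        break
    scores.append(evaluate_fold(train, test))
    train_window_start += skip
    train_window_end += skip

print('mean score over folds:', np.mean(scores))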
# Log transformation with the apply function
reserve_tb['total_price_log'] = \
reserve_tb['total_price'].apply(lambda x: np.log(x / 1000 + 1))
# There is no package that bundles the major outlier-removal methods; you have to implement them yourself (here: keep rows within 3 standard deviations of the mean)
reserve_tb = reserve_tb[
(abs(reserve_tb['total_price'] - np.mean(reserve_tb['total_price'])) /
np.std(reserve_tb['total_price']) <= 3)
].reset_index()
# You could also use datetime64[D] as a date type, but it has many inconveniences, such as datetime64[ns] not being convertible to datetime64[D],
# so it is more convenient to convert to datetime64[ns] first and then extract the date and time elements.
# Convert to datetime64[ns] with the to_datetime function
pd.to_datetime(reserve_tb['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
pd.to_datetime(reserve_tb['checkin_date'] + reserve_tb['checkin_time'],
format='%Y-%m-%d%H:%M:%S')
# Extract the date from a datetime64[ns] value
pd.to_datetime(reserve_tb['reserve_datetime'],
format='%Y-%m-%d %H:%M:%S').dt.date
pd.to_datetime(reserve_tb['checkin_date'], format='%Y-%m-%d').dt.date
# Convert reserve_datetime to datetime64[ns]
reserve_tb['reserve_datetime'] = \
pd.to_datetime(reserve_tb['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
#Get the year
reserve_tb['reserve_datetime'].dt.year
#Get the month
reserve_tb['reserve_datetime'].dt.month
#Get the day
reserve_tb['reserve_datetime'].dt.day
# Get the day of the week as a number (0 = Monday, ..., 6 = Sunday)
reserve_tb['reserve_datetime'].dt.dayofweek
# Get the hour
reserve_tb['reserve_datetime'].dt.hour
#Get the minute of the time
reserve_tb['reserve_datetime'].dt.minute
#Get the seconds of the time
reserve_tb['reserve_datetime'].dt.second
#Convert to a string in the specified format
reserve_tb['reserve_datetime'].dt.strftime('%Y-%m-%d %H:%M:%S')
# Subtraction between datetime64[ns] values
# Convert reserve_datetime to datetime64[ns]
reserve_tb['reserve_datetime'] = \
pd.to_datetime(reserve_tb['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
# Convert checkin_date + checkin_time to a datetime64[ns] checkin_datetime
reserve_tb['checkin_datetime'] = \
pd.to_datetime(reserve_tb['checkin_date'] + reserve_tb['checkin_time'],
format='%Y-%m-%d%H:%M:%S')
# Compute the difference in years (elements below the year, such as the month, are ignored)
reserve_tb['reserve_datetime'].dt.year - \
reserve_tb['checkin_datetime'].dt.year
# Compute the difference in months (elements below the month, such as the day, are ignored)
(reserve_tb['reserve_datetime'].dt.year * 12 +
reserve_tb['reserve_datetime'].dt.month) \
- (reserve_tb['checkin_datetime'].dt.year * 12 +
reserve_tb['checkin_datetime'].dt.month)
#Calculate the difference on a daily basis
(reserve_tb['reserve_datetime'] - reserve_tb['checkin_datetime']) \
.astype('timedelta64[D]')
#Calculate the difference on an hourly basis
(reserve_tb['reserve_datetime'] - reserve_tb['checkin_datetime']) \
.astype('timedelta64[h]')
#Calculate the difference in minutes
(reserve_tb['reserve_datetime'] - reserve_tb['checkin_datetime']) \
.astype('timedelta64[m]')
#Calculate the difference in seconds
(reserve_tb['reserve_datetime'] - reserve_tb['checkin_datetime']) \
.astype('timedelta64[s]')
# Note
# The book says that converting a difference to days/hours/minutes/seconds with timedelta64[D/h/m/s]
# "rounds up" the fractional part (unlike SQL and R); for example, a difference of 2 days and 3 hours
# converted to days would return 3 (days).
# ??? Looking at the pandas documentation, however, it appears to be truncated instead:
# https://pandas.pydata.org/docs/user_guide/timedeltas.html
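A quick one-off check of that behavior (assuming the pandas version used elsewhere in this post, where astype('timedelta64[D]') is permitted):
# A difference of 2 days and 3 hours, converted to days
diff = pd.Series([pd.Timestamp('2016-10-03 03:00:00')
                  - pd.Timestamp('2016-10-01 00:00:00')])
print(diff.astype('timedelta64[D]'))  # 2.0 would confirm truncation, not rounding up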
#Load the datetime library for timedelta
import datetime
# Convert reserve_datetime to datetime64[ns]
reserve_tb['reserve_datetime'] = \
pd.to_datetime(reserve_tb['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
# Extract the date from reserve_datetime
reserve_tb['reserve_date'] = reserve_tb['reserve_datetime'].dt.date
# Add 1 day to reserve_datetime
reserve_tb['reserve_datetime'] + datetime.timedelta(days=1)
# Add 1 day to reserve_date
reserve_tb['reserve_date'] + datetime.timedelta(days=1)
# Add 1 hour to reserve_datetime
reserve_tb['reserve_datetime'] + datetime.timedelta(hours=1)
# Add 1 minute to reserve_datetime
reserve_tb['reserve_datetime'] + datetime.timedelta(minutes=1)
# Add 1 second to reserve_datetime
reserve_tb['reserve_datetime'] + datetime.timedelta(seconds=1)
# Convert reserve_datetime to datetime64[ns]
reserve_tb['reserve_datetime'] = pd.to_datetime(
reserve_tb['reserve_datetime'], format='%Y-%m-%d %H:%M:%S'
)
#A function that converts month numbers to seasons
def to_season(month_num):
    season = 'winter'
    if 3 <= month_num <= 5:
        season = 'spring'
    elif 6 <= month_num <= 8:
        season = 'summer'
    elif 9 <= month_num <= 11:
        season = 'autumn'
    return season
#Convert to season
reserve_tb['reserve_season'] = pd.Categorical(
reserve_tb['reserve_datetime'].dt.month.apply(to_season),
categories=['spring', 'summer', 'autumn', 'winter']
)