Continuing from the last time, I will explain 100 knocks of data science. Data science 100 knock commentary (P001 ~ 020) Data science 100 knock commentary (P021 ~ 040)
For introduction, please refer to this article (* Docker is handled on Mac)
Basically, it is an explanation of the answer, but it also describes another solution.
P-041: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) for each date (sales_ymd), and calculate the increase or decrease in the sales amount from the previous day. It is sufficient to display 10 calculation results.
P-041
#Sales amount (amount) dated (sales)_ymd) Aggregate by(groupby method)
# reset_index()Re-index with.
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
#Date sales amount (amount) for comparison_Combine the copy of ymd) with one line moved down.
# concat([df1, df2], axis=1)Combine horizontally with. shift()Move down one line with
df_sales_amount_by_date = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1)
#Change the column name
df_sales_amount_by_date.columns = ['sales_ymd','amount','lag_ymd','lag_amount']
#Increase / decrease in sales amount(diff_amount)To add
df_sales_amount_by_date['diff_amount'] = df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']
df_sales_amount_by_date.head(10)
Reference: concat that concatenates pandas.DataFrame, Series Reference: Shift data in row / column (vertical / horizontal) direction with pandas
P-042: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) for each date (sales_ymd), and combine the data of 1 day ago, 2 days ago, and 3 days ago with the data of each date. Only 10 results should be displayed.
P-042
#(Vertical case)
#Sales amount (amount) dated (sales)_ymd) Aggregate by(groupby)
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
#Repeat with a for statement. range(1, 4)>>>1, 2,3 enters i.
for i in range(1, 4):
# i==When it is 1, it is connected in the horizontal direction. Move down one line with shift (1 day ago)
if i == 1:
df_lag = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1)
#If i is other than 1, add it to the data frame.
else:
df_lag = df_lag.append(pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift(i)], axis=1))
#Change the column name
df_lag.columns = ['sales_ymd', 'amount', 'lag_sales_ymd', 'lag_amount']
#Exclude missing values NaN(dropna())And sort(sort_values)。
df_lag.dropna().sort_values('sales_ymd').head(10)
P-042
#Horizontal case
#Sales amount (amount) dated (sales)_ymd) Aggregate by(groupby)
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
#Repeat with a for statement. range(1, 4)>>>1, 2,3 enters i.
for i in range(1, 4):
#When i is 1, df connected in the horizontal direction_Create a lag.
if i == 1:
df_lag = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift(1)], axis=1)
#If i is other than 1, it is already df_Since lag is created, df_Connect with lag.
else:
df_lag = pd.concat([df_lag, df_sales_amount_by_date.shift(i)], axis=1)
#Change the column name
df_lag.columns = ['sales_ymd', 'amount', 'lag1_sales_ymd', 'lag1_amount',
'lag2_sales_ymd', 'lag2_amount', 'lag3_sales_ymd', 'lag3_amount']
#Exclude missing values NaN(dropna())And sort(sort_values)。
df_lag.dropna().sort_values('sales_ymd').head(10)
Reference: Exclude (delete) / replace (fill in) / extract missing value NaN with pandas
P-043: Sales summary data frame (df_sales_summary) that combines the receipt detail data frame (df_receipt) and the customer data frame (df_customer) and totals the sales amount (amount) for each gender (gender) and age (calculated from age). ). Gender is 0 for male, 1 for female, and 9 for unknown.
However, the item composition should be four items: age, sales amount for women, sales amount for men, and sales amount for unknown gender (cross tabulation of age vertically and gender horizontally). Also, the age group should be every 10 years old.
P-043
#Receipt detail data frame (df_Receipt) and customer data frame (df_customer) join
# merge(df1, df2, on='Key name', how='inner')
df_tmp = pd.merge(df_receipt, df_customer, on='customer_id', how='inner')
#Make the age group every 10 years.
# math.floor:Truncate after the decimal point. ex)In case of 22 22/10 * 10 = 2(2.Truncation of 2) *10 = 20
df_tmp['era'] = df_tmp['age'].apply(lambda x: math.floor(x/10)*10)
#Create a pivot table(pivot_table()function)Details below
df_sales_summary = pd.pivot_table(df_tmp, index='era', columns='gender_cd', values='amount', aggfunc='sum').reset_index()
#Change the column name
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary
__pivot_table () function __ -Data (first argument): Data frame to be referenced ・ Index: Specify the line name -Columns: Specify the column name -Values: If you specify the column name of the referenced data frame, only the result for that column is calculated. -Aggfunc: Specify the calculation method of the result value Reference: Calculate statistics for each category with the pandas pivot table
P-044: The sales summary data frame (df_sales_summary) created in the previous question was a horizontal sales of gender. Let's hold the gender vertically from this data frame and convert it into 3 items: age, gender code, and sales amount. However, the gender code is '00' for men, '01' for women, and '99' for unknowns.
P-044
# set_Assign an existing column to the index index (row name, row label) with index
# stack()Pivot from column to row.
# replace()Replace the string with
# rename()Change arbitrary row name / column name with method
df_sales_summary.set_index('era').stack().reset_index().replace(
{'female': '01', 'male': '00', 'unknown': '99'}).rename(
columns={'level_1': 'gender_cd', 0: 'amount'})
Reference: set_index assigning columns of pandas.DataFrame to index (row name) Reference: Format data using stack, unstack, pivot in pandas Reference: Replace string with Python (replace, translate, re.sub, re.subn) Reference: Change row name / column name of pandas.DataFrame
P-045: The date of birth (birth_day) of the customer data frame (df_customer) holds the data in date type (Date). Convert this to a character string in YYYYMMDD format and extract it together with the customer ID (customer_id). Only 10 data can be extracted.
P-045
#Customer ID (customer)_id) and date of birth (birth) converted to a character string in YYYYMMDD format_day) is combined.
# concat([df1, df2], axis=1)Join in the horizontal direction with.
# pd.to_Datetime64 for strings[ns]Convert to type.
# dt.strftime()Convert columns to strings in any format at once with.
pd.concat([df_customer['customer_id'], pd.to_datetime(df_customer['birth_day']) \
.dt.strftime('%Y/%m/%d')], axis=1).head(10)
Reference: Process date / time columns with pandas (string conversion, date extraction, etc.)
P-046: The application date (application_date) of the customer data frame (df_customer) holds the data in the character string type in YYYYMMDD format. Convert this to a date type (date or datetime) and extract it along with the customer ID (customer_id). Only 10 data can be extracted.
P-046
#Customer ID (customer)_application date (application) converted to id) and date type (date or datetime)_date)
# P-Refer to 045
pd.concat([df_customer['customer_id'], pd.to_datetime(df_customer['application_date'])], axis=1).head(10)
Reference: Process date / time columns with pandas (string conversion, date extraction, etc.)
P-047: The sales date (sales_ymd) of the receipt detail data frame (df_receipt) holds the data in the numeric type of YYYYMMDD format. Convert this to a date type (date or datetime) and extract it together with the receipt number (receipt_no) and receipt sub number (receipt_sub_no). Only 10 data can be extracted.
P-047
#Sales date_ymd) to_datetime()Convert to date type with
# pandas.concat()Combine horizontally with
# astype()Convert to string str type with method, convert to string in standard format
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
pd.to_datetime(df_receipt['sales_ymd'].astype('str'))],axis=1).head(10)
# (Another solution)
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no'],
pd.to_datetime(df_receipt['sales_ymd']).dt.strftime('%Y-%m-%d')]], axis=1).head(10)
Reference: Process date / time columns with pandas (string conversion, date extraction, etc.)
P-048: The sales epoch seconds (sales_epoch) of the receipt detail data frame (df_receipt) holds the data in numeric UNIX seconds. Convert this to a date type (date or datetime) and extract it together with the receipt number (receipt_no) and receipt sub number (receipt_sub_no). Only 10 data can be extracted.
P-048
# pd.concat()Sales epoch seconds using sales_epoch) data frame and receipt number(receipt_no), Receipt sub number (receipt_sub_Combine no
#Sales epoch seconds (sales_epoch) to date type(to_datetime(df, unit='s')Convert with)
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
pd.to_datetime(df_receipt['sales_epoch'].astype(int), unit='s')], axis=1).head(10)
Reference: How to use the to_datetime function to convert time and date data with Pandas
P-049: Convert the sales epoch seconds (sales_epoch) of the receipt detail data frame (df_receipt) to the date type (timestamp type), extract only the "year" and extract it together with the receipt number (receipt_no) and receipt sub number (receipt_sub_no). Let's do it. Only 10 data can be extracted.
P-049
#Sales epoch seconds (sales_epoch) to date type (timestamp type)
# "Year"Just take out(dt.year)
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
pd.to_datetime(df_receipt['sales_epoch'], unit='s').dt.year], axis=1).head(10)
Reference: How to use the to_datetime function to convert time and date data with Pandas
P-050: Convert the sales epoch seconds (sales_epoch) of the receipt detail data frame (df_receipt) to the date type (timestamp type), extract only the "month" and extract it together with the receipt number (receipt_no) and receipt sub number (receipt_sub_no). Let's do it. In addition, "month" should be extracted with 0 padding and 2 digits. Only 10 data can be extracted.
P-050
#Sales epoch seconds (sales_epoch) to date type(to_datetime(, unit='s'))
# "Month"Just take out(Strftime to retrieve with 0 padded 2 digits('%m'))
# pd.Combine dataframes with concat
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
pd.to_datetime(df_receipt['sales_epoch'], unit='s') \
.dt.strftime('%m')], axis=1).head(10)
Reference: Python --Set date format
P-051: Convert the sales epoch seconds (sales_epoch) of the receipt detail data frame (df_receipt) to the date type (timestamp type), extract only the "day" and extract it together with the receipt number (receipt_no) and receipt sub number (receipt_sub_no). Let's do it. In addition, "day" should be extracted with 0 padding and 2 digits. Only 10 data can be extracted.
P-051
#Sales epoch seconds (sales_epoch) to date type(to_datetime(, unit='s'))
#Extract only the day(dt.strftime('%d'))
# pd.concat()Combine with
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
pd.to_datetime(df_receipt['sales_epoch'], unit='s') \
.dt.strftime('%d')], axis=1).head(10)
Reference: Python --Set date format
P-052: Total sales amount (amount) of receipt detail data frame (df_receipt) for each customer ID (customer_id), and 0 for 2000 yen or less and 1 for 2000 yen or more for the total sales amount. Display 10 items together with the customer ID and the total sales amount. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.
P-052
#When using lambda
#Customer ID"Z"Exclude those starting with(Search by query, except by not, str.startswith('Z'))
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
#Customer ID (customer)_Group by id). Total sales amount (amount)(sum)
df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
#Binarize 2000 yen or less to 0 and over 2000 yen to 1(apply(lambda)Apply the condition row by row to the specified column in)
df_sales_amount['sales_flg'] = df_sales_amount['amount'].apply(lambda x: 1 if x>2000 else 0)
df_sales_amount.head(10)
# (Another solution: np.where)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
# np.where(Conditional expression, x(If true), y(If false))
df_sales_amount['sales_flg'] = np.where(df_sales_amount['amount']>2000, 1, 0)
df_sales_amount.head(10)
# (Another solution)
df_sales_amount = df_receipt[~df_receipt['customer_id'].str.startswith('Z')].groupby('customer_id').amount.sum().reset_index()
df_sales_amount.loc[df_sales_amount['amount']<=2000, 'threshold'] = 0
df_sales_amount.loc[df_sales_amount['amount']>2000, 'threshold'] = 1
df_sales_amount.head(10)
Reference: How to use apply + lambda to apply a function row by row to a specified column in pandas
P-053: Binar Tokyo (the first 3 digits are 100 to 209) to 1 and the other ones to 0 for the postal code (postal_cd) of the customer data frame (df_customer). In addition, combine it with the receipt detail data frame (df_receipt) and count the number of customers who have a shopping record for the entire period for each of the created two values.
P-053
#Postal code (postal)_cd) is binarized(Tokyo:1,Other:0)
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = df_tmp['postal_cd'].apply(lambda x: 1 if 100 <= int(x[0:3]) <=209 else 0)
#Receipt detail data frame (df_receipt) and combine(pd.merge(df1, df2, on='Key', how='inner'))
#Number of unique elements(Non-overlapping number)The customer_Calculated for each id(pandas.DataFrame.nunique())
pd.merge(df_tmp, df_receipt, on='customer_id', how='inner') \
.groupby('postal_flg').agg({'customer_id': 'nunique'})
# (Another solution) np.How to use where is P-Refer to 052
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = np.where(df_tmp['postal_cd'].str[0:3].astype(int)
.between(100, 209), 1, 0)
pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'). \
groupby('postal_flg').agg({'customer_id':'nunique'})
P-054: The address of the customer data data frame (df_customer) is one of Saitama prefecture, Chiba prefecture, Tokyo metropolitan area, and Kanagawa prefecture. Create a code value for each prefecture and extract it along with the customer ID and address. The values should be 11 for Saitama prefecture, 12 for Chiba prefecture, 13 for Tokyo, and 14 for Kanagawa prefecture. You can display 10 results.
P-054
#Extract the prefecture part of the address and assign a price to each
# map()The argument of the dictionary dict ({key: value}) Is specified, the element that matches the key is replaced with value.
pd.concat([df_customer[['customer_id', 'address']], df_customer['address'].str[0:3] \
.map({'Saitama': '11', 'Chiba': '12', 'Tokyo': '13', 'Kanagawa': '14'})], axis=1).head(10)
Reference: Replace column elements with map method of pandas.Series
P-055: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), and find the quartile of the total amount. Then, create a category value for the total sales amount for each customer based on the following criteria, and display both the customer ID and sales amount. Category values are 1 to 4 in order from the top. You can display 10 results.
P-055
#Customer ID (customer)_Group by id)(groupby)And add up the sales amount (amount)(sum)
df_sales_amount = df_receipt[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
#Find the quartile for each sales amount (amount).(25th percentile: 25pct,50th percentile: 50pct,75th percentile: 75pct)
pct25 = np.quantile(df_sales_amount['amount'], 0.25)
pct50 = np.quantile(df_sales_amount['amount'], 0.5)
pct75 = np.quantile(df_sales_amount['amount'], 0.75)
#Create and apply a function of category values
def pct_group(x):
if x < pct25:
return 1
elif pct25 <= x < pct50:
return 2
elif pct50 <= x < pct75:
return 3
elif pct75 <= x:
return 4
#pct using apply_Apply group to each row
df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(lambda x: pct_group(x))
df_sales_amount
Reference: Binning process with pandas cut and qcut function (binning)
P-056: Calculate the age in 10-year increments based on the age of the customer data frame (df_customer), and extract it together with the customer ID (customer_id) and date of birth (birth_day). However, all people over the age of 60 should be in their 60s. The category name indicating the age is arbitrary. The first 10 items should be displayed.
P-056
#Calculate the age in 10-year increments
# math.Truncate on the floor. min(, 60)60 or more is not output.
df_customer_era = pd.concat([df_customer[['customer_id', 'birth_day']],
df_customer['age'].apply(lambda x: math.floor(x / 10) * 10, 60)], axis=1)
df_customer_era.head(10)
# (Another solution)
#Age(Under teens, teens, 20s, 30s, 40s, 50s, 60s and above)Divide by.(flg_age)
def age_group(x):
if x < 10:
return 'Under teens'
elif 10 <= x <20:
return '10's'
elif 20 <= x < 30:
return '20's'
elif 30 <= x < 40:
return '30s'
elif 40 <= x < 50:
return 'Forties'
elif 50 <= x < 60:
return '50s'
elif 60 <= x:
return '60s and over'
df_customer['flg_age'] = df_customer['age'].apply(lambda x: age_group(int(x)))
#Customer ID (customer)_id), date of birth (birth)_extract with day)
df_customer[['customer_id', 'birth_day', 'flg_age']].head(10)
P-057: Combine the extraction result of the previous question and gender, and create new category data that represents the combination of gender and age. The value of the category representing the combination is arbitrary. The first 10 items should be displayed.
P-057
#Category data(Gender x age)'gender_era'To create
#age is int type so astype(str)Convert with
df_customer_era['gender_era'] = df_customer['gender_cd'] + df_customer_era['age'].astype(str)
df_customer_era.head(10)
P-058: Make the gender code (gender_cd) of the customer data frame (df_customer) a dummy variable and extract it together with the customer ID (customer_id). You can display 10 results.
P-058
#Gender code (gender_cd) as a dummy variable(get_dummies)
#Specify the column name of the column you want to dummy in the argument columns in a list
pd.get_dummies(df_customer[['customer_id', 'gender_cd']], columns=['gender_cd'])
Reference: Convert categorical variables to dummy variables with pandas (get_dummies)
P-059: Total sales amount (amount) of receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is standardized to average 0 and standard deviation 1 to total customer ID and sales amount. Display with. The standard deviation used for standardization may be either unbiased standard deviation or sample standard deviation. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.
P-059
#Customer ID"Z"Exclude those starting with, customer ID (customer)_Add up the sales amount (amount) for each id)
df_sales_amount = df_receipt.query("not customer_id.str.startswith('Z')", engine='python') \
.groupby('customer_id').agg({'amount': 'sum'}).reset_index()
#Standardize the amount of sales
df_sales_amount['amount_ss'] = preprocessing.scale(df_sales_amount['amount'])
df_sales_amount.head(10)
# (Another solution)
#Code example 2 (By performing fit, different data can be standardized with the same mean and standard deviation)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
groupby('customer_id').agg({'amount':'sum'}).reset_index()
scaler = preprocessing.StandardScaler()
scaler.fit(df_sales_amount[['amount']])
df_sales_amount['amount_ss'] = scaler.transform(df_sales_amount[['amount']])
df_sales_amount.head(10)
Reference: Normalize / Standardize with Python (list, NumPy array, pandas.DataFrame)
P-060: The sales amount (amount) of the receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is normalized to the minimum value 0 and the maximum value 1 to the customer ID and sales amount. Display with the total. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.
P-060
#The sales amount (amount) is the customer ID (customer)_id) total
df_sales_amount = df_receipt.query("not customer_id.str.startswith('Z')", engine='python') \
.groupby('customer_id').agg({'amount': 'sum'}).reset_index()
df_sales_amount
#Normalize sales amount to minimum value 0 and maximum value 1(preprocessing.minmax_scale)
df_sales_amount['amount_mm'] = preprocessing.minmax_scale(df_sales_amount['amount'])
df_sales_amount.head(10)
# (Another solution)
#Code example 2 (By performing fit, different data can be standardized with the same mean and standard deviation)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
groupby('customer_id').agg({'amount':'sum'}).reset_index()
scaler = preprocessing.MinMaxScaler()
scaler.fit(df_sales_amount[['amount']])
df_sales_amount['amount_mm'] = scaler.transform(df_sales_amount[['amount']])
df_sales_amount.head(10)
Reference: Normalize / Standardize with Python (list, NumPy array, pandas.DataFrame)
100 knocks of data science How to run 100 data science knocks on Mac
Difficulty increased after 40. Since it is okay to copy sutras, verbalize what is written.
Recommended Posts