Following the previous, I will explain 100 knocks of data science. For introduction, please refer to this article (* Docker is handled on Mac)
Basically, it is an explanation of the answer, but it also describes another solution.
P-021: Count the number of receipt details data frame (df_receipt).
P-021
#The number is len()Use the method.
len(df_receipt)
P-022: Count the number of unique cases for the customer ID (customer_id) of the receipt detail data frame (df_receipt).
P-022
#The number of unique cases is unique()Use the method.
# df_receipt['customer_id'].unique() >>>Returns a list of unique element values in a NumPy array ndarray
len(df_receipt['customer_id'].unique())
Reference: https://note.nkmk.me/python-pandas-value-counts/
P-023: Sum the sales amount (amount) and sales quantity (quantity) for each store code (store_cd) for the receipt detail data frame (df_receipt).
P-023
#with the groupby method'store_cd'Grouping by
# agg()Aggregate the data with. Dictionary type sum of sales amount (amount) and sales quantity (quantity)
# reset_index()Re-index with
df_receipt.groupby('store_cd').agg({'amount': 'sum', 'quantity': 'sum'}).reset_index()
#(Another solution)
df_receipt[['amount', 'quantity', 'store_cd']].groupby('store_cd', as_index=False).sum()
Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/ Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/
P-024: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.
P-024
#Customer ID (customer) with groupby_Group by id).
#Sales date_ymd) newest(max()Get in)
df_receipt.groupby('customer_id').sales_ymd.max().reset_index().head(10)
# (Another solution)
#Customer ID (customer) with loc_Sales dates (sales) grouped by id)_ymd) newest(idxmax()Get in)
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmax()].head(10)
Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/
P-025: For the receipt detail data frame (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.
P-025
# P-See 023. The oldest sales date is agg({'sales_ymd': min})Can be expressed by
df_receipt.groupby('customer_id').agg({'sales_ymd': min}).reset_index().head(10)
# (Another solution)
#Customer ID (customer) with loc_Sales dates (sales) grouped by id)_ymd) oldest(idxmin()Get in)
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmin()].head(10)
P-026: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date for each customer ID (customer_id), and display 10 different data.
P-026
#Customer ID (customer)_Sales date (sales) for each id)_ymd) new sales date(max)And old sales date(min)Create a data frame for
df_tmp = df_receipt.groupby('customer_id').agg({'sales_ymd':['max','min']}).reset_index()
#Change column(Details are explained below)
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]
#Both query different data()Search by.
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)
Since df.columns becomes MultiIndex ([('customer_id',''), ('sales_ymd','max'), ('sales_ymd','min')],), extract them one by one with a for statement. Join the characters in parentheses with "_" .join ().
P-027: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt statement data frame (df_receipt), and display the TOP5 in descending order.
P-027
#Receipt detail data frame (df_Receipt) to store code (store_Group by cd).
# agg({'amount':'mean'})Calculate the average sales amount (amount) at
# reset_index()Re-index with and sort_values('amount', ascending=False)Sort by sales amount (amount) in descending order
df_receipt.groupby('store_cd').agg({'amount':'mean'}) \
.reset_index().sort_values('amount', ascending=False).head(5)
Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/ Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/
P-028: Calculate the median sales amount (amount) for each store code (store_cd) for the receipt statement data frame (df_receipt), and display the TOP5 in descending order.
P-028
# P-Refer to 027
#Median is median
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)
Reference (median): https://note.nkmk.me/python-statistics-mean-median-mode-var-stdev/
P-029: Find the mode of the product code (product_cd) for each store code (store_cd) for the receipt statement data frame (df_receipt).
P-029
#The first half is P-Refer to 027
#Product code (product_cd) to function(Lambda expression)To apply.
df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()
# df.groupby('grouping_content').What you want to find the mode.apply(lambda x: x.mode())
# (Wrong answer)
df_receipt.groupby('store_cd').agg({'product_cd':'mode'}).reset_index()
# >>>AttributeError: 'SeriesGroupBy' object has no attribute 'mode'
Reference (lambda expression): https://note.nkmk.me/python-lambda-usage/
P-030: For the receipt detail data frame (df_receipt), calculate the sample variance of the sales amount (amount) for each store code (store_cd), and display the TOP5 in descending order.
P-030
#Store code (store_Sample distribution of sales amount (amount) of cd)(var(ddof=0))
# reset_Re-index and sort by index(sort_values),descending order(ascending=False)
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)
Reference: https://deepage.net/features/numpy-var.html Reference (sample variance and unbiased variance): https://bellcurve.jp/statistics/course/8614.html
P-031: Calculate the sample standard deviation of the sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.
P-031
#Store code (store_cd) sample standard deviation of sales amount (amount)(std(ddof=0))
# reset_Re-index and sort by index(sort_values),descending order(ascending=False)
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head()
Reference: https://deepage.net/features/numpy-var.html Reference (sample variance and unbiased variance): https://bellcurve.jp/statistics/course/8614.html
P-032: Find the percentile value of the sales amount (amount) of the receipt detail data frame (df_receipt) in 25% increments.
P-032
#Percentile value of sales amount (amount)(quantile)
# np.arange(5)/4 >>> [0, 0.25, 0.5, 0.75, 1]
df_receipt.amount.quantile(q=np.arange(5)/4)
# (Another solution)
np.percentile(df_receipt['amount'], q=[25, 50, 75,100])
Reference (quantile): https://note.nkmk.me/python-pandas-quantile/
P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract 330 or more.
P-033
#Store code (store_Group by cd).
#Average sales amount (amount)(mean)Is calculated and reset_index()Re-index with
# query()More than 300 sales amount (amount) by method
df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')
# (Another solution)
df_receipt.groupby('store_cd').amount.mean()[df_receipt.groupby('store_cd').amount.mean() >= 300]
P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.
P-034
#How to write without using query
# ~Is Not. , Customer ID (customer_id) starts with Z(str.startswith("Z"))
#Customer ID (customer)_The average of all customers by summing the sales amount (amount) for each id)
df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()
#How to write using query
df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').amount.sum().mean()
P-035: For the receipt statement data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) to find the average of all customers, and extract the customers who shop above the average. .. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Only 10 data items need to be displayed.
P-035
#Customer ID"Z"Exclude.'customer_id'Divide each group. (2547.742234529256) Average for all customers
amount_mean = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()
#Customer ID (customer)_Sum the sales amount (amount) for each id) (data frame type)
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
# amount_mean(Average for all customers)Display the above 10 items
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
P-036: Internally combine the receipt detail data frame (df_receipt) and the store data frame (df_store), and display all items of the receipt detail data frame and 10 store names (store_name) of the store data frame.
P-036
# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)
Reference: https://note.nkmk.me/python-pandas-merge-join/
P-037: Internally combine the product data frame (df_product) and the category data frame (df_category), and display all items of the product data frame and 10 subcategory names (category_small_name) of the category data frame.
P-037
# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
pd.merge(df_product
, df_category[['category_small_cd','category_small_name']]
, how='inner', on='category_small_cd').head(10)
Reference: https://note.nkmk.me/python-pandas-merge-join/
P-038: Find the total sales amount for each customer from the customer data frame (df_customer) and receipt detail data frame (df_receipt). However, for customers who have no shopping record, the sales amount should be displayed as 0. In addition, customers should target those whose gender code (gender_cd) is female (1), and exclude non-members (customer IDs starting with'Z'). Only 10 results need to be displayed.
P-038
#Total sales amount for each customer.
#Group by customer ID and total sales amount (amount)(sum)
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
#Customer is gender code (gender_For those whose cd) is female (1), non-member (customer ID is'Z'Excludes those starting with)
df_tmp = df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')
# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
# fillna(0)For customers who have no shopping record at, the sales amount is 0
pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id').fillna(0).head(10)
P-039: Extract the top 20 customers with the highest number of sales days and the top 20 customers with the highest total sales amount from the receipt detail data frame (df_receipt), and perform a complete outer join. However, non-members (customer IDs starting with'Z') should be excluded.
P-039
#Customer ID('customer_id')Group by and total sales amount (amount)(sum)
df_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
# customer_id.str.startswith("Z")(Except for customer IDs that start with Z)
df_sum = df_sum.query('not customer_id.str.startswith("Z")', engine='python')
#Sort the total sales amount (amount) and extract the top 20
df_sum = df_sum.sort_values('amount', ascending=False).head(20)
#Number of sales days(sales_ymd)To extract duplicates.
df_cnt = df_receipt[~df_receipt.duplicated(subset=['customer_id', 'sales_ymd'])]
# customer_id.str.startswith("Z")(Except for customer IDs that start with Z)
df_cnt = df_cnt.query('not customer_id.str.startswith("Z")', engine='python')
#Customer ID('customer_id')Group by and the number of sales days(sales_ymd)Number of cases(count)
df_cnt = df_cnt.groupby('customer_id').sales_ymd.count().reset_index()
#Number of sales days(sales_ymd)Sort by the number of cases, in descending order(ascending=False)The top 20 items sorted by are extracted.
df_cnt = df_cnt.sort_values('sales_ymd', ascending=False).head(20)
#Outer merge with merge
pd.merge(df_sum, df_cnt, how='outer', on='customer_id')
Reference (duplicate): https://note.nkmk.me/python-pandas-duplicated-drop-duplicates/ Reference (outer join): https://note.nkmk.me/python-pandas-merge-join/
P-040: I would like to investigate how many data will be obtained by combining all stores and all products. Calculate the number of direct products of stores (df_store) and products (df_product).
P-040
#Store (df_make a copy of store)
df_store_tmp = df_store.copy()
#Goods (df_make a copy of product)
df_product_tmp = df_product.copy()
#Key to combine(Column)が必要なのでそれぞれColumnを追加する。
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0
#Outer join and check the number with len method
len(pd.merge(df_store_tmp, df_product_tmp, on='key', how='outer'))
Reference: https://note.nkmk.me/python-pandas-merge-join/
100 knocks of data science How to run 100 data science knocks on Mac
Recommended Posts