This is a struggle record of knocking 100 eggs without knowing the data scientist's egg. It is a mystery whether I can finish the race. ~~ Even if it disappears on the way, please think that it is not given to Qiita. ~~
100 knock articles 100 Knock Guide
** Be careful if you are trying to do it as it includes spoilers **
Sad news: The problem does not progress (41st at the time of writing)
This is hard to see! This way of writing is dangerous! If you have any questions, please let me know. ~~ I will use it as food while suffering damage to my heart.
This solution is wrong! This interpretation is different! Please comment if you have any.
This time from 36 to 40. [Last time] 33-35 [First time with table of contents]
P-036: Internally combine the receipt detail data frame (df_receipt) and the store data frame (df_store), and display all items of the receipt detail data frame and 10 store names (store_name) of the store data frame.
Finally came table join
In SQL terms
SQLteki.sql
SELECT *
FROM receipt r INNER JOIN store s
ON r.___ = s.___
What a place, such as. (Ignore projection)
For pandas (reference)
pd.maerge(df_receipt,df_store,on='store_cd',how='inner')
Because it seems to be
mine36.py
df=pd.merge(df_receipt,df_store[['store_cd','store_name']],on='store_cd',how='inner')
df.head(10)
'''Model answer'''
pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)
P-037: Internally combine the product data frame (df_product) and the category data frame (df_category), and display all items of the product data frame and 10 subcategory names (category_small_name) of the category data frame.
mine37.py
df=pd.merge(df_product,df_category[['category_major_cd', 'category_medium_cd','category_small_cd','category_small_name']]
,on=['category_major_cd', 'category_medium_cd','category_small_cd'],how='inner')
df.head(10)
'''Model answer'''
pd.merge(df_product
, df_category[['category_major_cd', 'category_medium_cd','category_small_cd','category_small_name']]
, how='inner', on=['category_major_cd', 'category_medium_cd','category_small_cd']).head(10)
Well, this is.
P-038: Find the total sales amount for each customer from the customer data frame (df_customer) and receipt detail data frame (df_receipt). However, for customers who have no shopping record, the sales amount should be displayed as 0. In addition, customers should target those whose gender code (gender_cd) is female (1), and exclude non-members (customer IDs starting with'Z'). Only 10 results need to be displayed.
I'm used to erasing the leading "Z". What we do this time is to do a left outer join and set the Null (Nan) value to 0.
Use .fillna (0)
to set Nan to 0 OK
mine38.py
df_cst=df_customer[df_customer.customer_id.str.contains('^[^Z]')].query("gender_cd == '1'")
df_rct=df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df=pd.merge(df_cst,df_rct,on='customer_id',how='left').fillna(0)
df.head(10)
'''Model answer'''
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_tmp = df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')
pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id').fillna(0).head(10)
In the model answer, I make a calculation and then select (Z start exclusion and female), but I make a selection first and then calculate.
P-039: Extract the top 20 customers with the highest number of sales days and the top 20 customers with the highest total sales amount from the receipt detail data frame (df_receipt), and perform a complete outer join. However, non-members (customer IDs starting with'Z') should be excluded.
This problem was completely wrong and took a long time. the first,
miss39.py
df=df_receipt[df_receipt.customer_id.str.contains('^[^Z]')]
df_day=df.groupby('customer_id').agg({'sales_ymd':'count'}).sort_values('sales_ymd',ascending=False).reset_index().head(20)
df_amo=df_amo.groupby('customer_id').agg({'amount':'sum'}).sort_values('amount',ascending=False).reset_index().head(20)
pd.merge(df_amo,df_day,on='customer_id',how='outer')
I wrote it like this and answered with full confidence.
As the content
--Top 20 customers with the most sales days in df_day
--Top 20 customers with the highest total sales amount in df_amo
I intended to.
But for df_receipt
―― "Same person" "Same day" "Different things"
I forgot that I had some data I bought, so I had to put it together in one line.
So, (Reference) Delete duplicate lines
df_day=df[~df.duplicated(subset=['customer_id', 'sales_ymd'])]
Add
mine39.py
df=df_receipt[df_receipt.customer_id.str.contains('^[^Z]')]
df_day=df[~df.duplicated(subset=['customer_id', 'sales_ymd'])]
df_day=df_day.groupby('customer_id').agg({'sales_ymd':'count'}).sort_values('sales_ymd',ascending=False).reset_index().head(20)
df_amo=df_amo.groupby('customer_id').agg({'amount':'sum'}).sort_values('amount',ascending=False).reset_index().head(20)
pd.merge(df_amo,df_day,on='customer_id',how='outer')
'''Model answer'''
df_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_sum = df_sum.query('not customer_id.str.startswith("Z")', engine='python')
df_sum = df_sum.sort_values('amount', ascending=False).head(20)
df_cnt = df_receipt[~df_receipt.duplicated(subset=['customer_id', 'sales_ymd'])]
df_cnt = df_cnt.query('not customer_id.str.startswith("Z")', engine='python')
df_cnt = df_cnt.groupby('customer_id').sales_ymd.count().reset_index()
df_cnt = df_cnt.sort_values('sales_ymd', ascending=False).head(20)
pd.merge(df_sum, df_cnt, how='outer', on='customer_id')
have become. (I did it. It was shorter than the model answer)
P-040: I would like to investigate how many data will be obtained by combining all stores and all products. Calculate the number of direct products of stores (df_store) and products (df_product).
mine40.py
len(df_store)*len(df_product)
** Oh no? ** **
mohan40.py
```Model answer```
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0
len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))