This is a struggle record of knocking 100 eggs without knowing the data scientist's egg. It is a mystery whether I can finish the race. ~~ Even if it disappears on the way, please think that it is not given to Qiita. ~~
100 knock articles 100 Knock Guide
** Be careful if you are trying to do it as it includes spoilers **
The link of the page that I referred to first has been changed to the first time only.
This is hard to see! This way of writing is dangerous! If you have any questions, please let me know. ~~ I will use it as food while suffering damage to my heart.
This solution is wrong! This interpretation is different! Please comment if you have any.
This time from 33 to 35. [Last time] 29-32 [First time with table of contents]
P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract 330 or more.
mine33.py
df=df_receipt
df=df.groupby('store_cd').agg({'amount':'mean'}).query("amount>=330")
df
'''Model answer'''
df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')
Problems like reviewing so far. Specify the range of calculation data with guroupby and Add a line of operation with agg and Specify the condition with query.
If you think it's just, the model answer is .amount.mean ()
I see. However, it seems that it may not work if you write the column specifications by connecting them with .
, so be careful.
~~ I gave in to query ~~ I think it would be nice to be able to use the convenient part of the model answer by summarizing it in this way.
P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.
mine34.py
df=df_receipt
df=df[df['customer_id'].str.contains('^[^Z]')].reset_index()
df.groupby('customer_id').amount.sum().mean()
'''Model answer'''
#How to write without using query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()
#How to write using query
df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').amount.sum().mean()
This time it was a question with two patterns of model answers. I use contains
and usestart with
in the model answer. If you are used to regular expressions to some extent, contains
is easier to write, but is startwith
looking only ahead, so the response is faster?
Also, although I have specified "other than those that start with Z" in the regular expression, both model answers define "things that start with Z" and are "other than that". If the regular expression is too long, it will take time to read it, so if it is too long, I wonder if this is the case.
If you write this problem in SQL
py34.sql
SELECT SUM(amount)/COUNT(DISTINCT customer_id)
FROM receipt
WHERE NOT customer_id LIKE 'Z%'
I think it will feel like (operation confirmed for the time being)
P-035: For the receipt statement data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) to find the average of all customers, and extract the customers who shop above the average. .. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Only 10 data items need to be displayed.
mine35.py
df=df_receipt
df=df[df['customer_id'].str.contains('^[^Z]')].reset_index()
df=df.groupby('customer_id').amount.sum().reset_index()
df[df['amount'].mean() <= df['amount']].head(10)
'''Model answer'''
amount_mean = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
Finally, the code for the model answer has become longer. Thanks to the two letters ~~ df ~~
When writing this time, I was very conscious of which calculation was done on the data field.
The model answer is also divided into ʻamount_mean and
df_amount_sum`.
df=df_receipt
df=df[df['customer_id'].str.contains('^[^Z]')].reset_index()
--Collect costomer_id
s other than the first" Z "df=df.groupby('customer_id').amount.sum().reset_index()
--Calculate ʻamount total for each
costomer_id`df[df['amount'].mean() <= df['amount']].head(10)
--Narrow down to data above the average of 3I didn't refer to the average except for the comparison of 4, and I didn't like to make two dfs, so I wrote this this time. Of course, the problem says to find the average first, so if you want to use the average in other processes, this writing may be wrong.
In any case, we will compare the "average of the total" and the "total", so we will make that comparison in 4.
The frequent occurrence of ~~ .reset_index ()
also has the meaning of magic ~~
The table join will start from the next time, and it will be a little short because it is halfway, so I will separate it here. ~~ For that reason, I earned even a little with question sentences and model answers ~~