This is a struggle record of knocking 100 eggs without knowing the data scientist's egg. It is a mystery whether I can finish the race. ~~ Even if it disappears on the way, please think that it is not given to Qiita. ~~
100 knock articles 100 Knock Guide
** Be careful if you are trying to do it as it includes spoilers **
I'm scared because the way to write the model answer is too different
This is hard to see! This way of writing is dangerous! If you have any questions, please let me know. ~~ I will use it as food while suffering damage to my heart.
This solution is wrong! This interpretation is different! Please comment if you have any.
This time from 41 to 44. [Last time] 36-40 [First time with table of contents]
P-041: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) for each date (sales_ymd), and calculate the increase or decrease in the sales amount from the previous day. It is sufficient to display 10 calculation results.
mine41.py
df_day=df_receipt.groupby('sales_ymd').agg({'amount':'sum'}).reset_index()
df_Yday=pd.concat([df_day['sales_ymd']+1,df_day['amount']],axis=1).rename(columns={'amount': 'Y_amount'})
df=pd.merge(df_day.head(10),df_Yday,on='sales_ymd',how='outer')
df['sa']=df['amount']-df['Y_amount']
df.head(10)
'''Model answer'''
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
df_sales_amount_by_date = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1)
df_sales_amount_by_date.columns = ['sales_ymd','amount','lag_ymd','lag_amount']
df_sales_amount_by_date['diff_amount'] = df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']
df_sales_amount_by_date.head(10)
It is a total of one day shift As my method
df_Y
[date (one day later), aggregate]I made it as. I don't think this is wrong, but I thought it would be better to pull from the table date instead of "1 day".
――For example, if there is a shop closed on Saturdays and Sundays, the date will fly
After seeing the model answer, If you find out how to use .shift ()
I thought it would be very easy, so I will use it from the next problem. It was to be
P-042: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) for each date (sales_ymd), and combine the data of 1 day ago, 2 days ago, and 3 days ago with the data of each date. Only 10 results should be displayed.
mine42.py
df_day=df_receipt.groupby('sales_ymd').agg({'amount':'sum'})
df=df_day
for i in range(1,4):
df=pd.merge( df,df_day.shift(i).reset_index(),how='outer',on='sales_ymd')
df.columns=['sales_ymd','amount_sum_0','amount_sum_1','amount_sum_2','amount_sum_3']
df.head(10)
'''Model answer'''
#Code example 2:Horizontal holding case
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
for i in range(1, 4):
if i == 1:
df_lag = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift(i)],axis=1)
else:
df_lag = pd.concat([df_lag, df_sales_amount_by_date.shift(i)],axis=1)
df_lag.columns = ['sales_ymd', 'amount', 'lag_ymd_1', 'lag_amount_1', 'lag_ymd_2', 'lag_amount_2', 'lag_ymd_3', 'lag_amount_3']
df_lag.dropna().sort_values(['sales_ymd']).head(10)
At first I tried to merge four tables at once without turning with a for statement, but apparently pd.merge ()
can only merge two
Reference
So I made one source and turned it with a for statement.
Personally, it was the most fun data manipulation these days.
By the way, it seems that the data does not change with .shift (0)
.
df=pd.merge( df_day.shift(0).reset_index() , df_day.shift(1).reset_index() , df_day.shift(2).reset_index() , df_day.shift(3).reset_index() ,how='outer',on='sales_ymd')
I wanted to write like this
P-043: Sales summary data frame (df_sales_summary) that combines the receipt detail data frame (df_receipt) and the customer data frame (df_customer) and totals the sales amount (amount) for each gender (gender) and age (calculated from age). ). Gender is 0 for male, 1 for female, and 9 for unknown. However, the item composition should be four items: age, sales amount for women, sales amount for men, and sales amount for unknown gender (cross tabulation of age vertically and gender horizontally). Also, the age group should be every 10 years old.
The next problem is
.groupby
I took the stage
mine43.py
df=pd.merge(df_receipt,df_customer,how='inner',on='customer_id')
df_bins=pd.cut(df.age,range(0,100,10))
df=pd.concat([df[['gender_cd','amount']],df_bins],axis=1)
df=df.groupby(['age','gender_cd']).agg({'amount':'sum'}).reset_index()
df_cross=pd.merge( df.query("gender_cd=='0'")[['age','amount']]
,df.query("gender_cd=='1'")[['age','amount']]
,how='outer',on='age')
df_cross=pd.merge( df_cross
,df.query("gender_cd=='9'")[['age','amount']]
,how='outer',on='age')
df_cross.columns=['age','male','female','unkown']
df_cross
'''Model answer'''
df_tmp = pd.merge(df_receipt, df_customer, how ='inner', on="customer_id")
df_tmp['era'] = df_tmp['age'].apply(lambda x: math.floor(x / 10) * 10)
df_sales_summary = pd.pivot_table(df_tmp, index='era', columns='gender_cd', values='amount', aggfunc='sum').reset_index()
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary
If you think and see the model answer, do not divide it into bins
.apply(lambda x: math.floor(x / 10) * 10)
Is it divided by 10 (integer type) and multiplied by 10?
I don't know how to use lambda
.
And it's a pivot table ... I see. Is it possible to use it like that ...
I will study the pivot table (
P-044: The sales summary data frame (df_sales_summary) created in the previous question was a horizontal sales of gender. Let's hold the gender vertically from this data frame and convert it into 3 items: age, gender code, and sales amount. However, the gender code is '00' for men, '01' for women, and '99' for unknowns.
Since this is a continuous problem, the df summarized last time is used as it is. Since it means to change it to vertical holding, I divided the df into three and arranged them vertically. Or rather, I changed the vertical to horizontal last time, so the opposite was easy ...
mine44.py
df_cross_M=df_cross[['age','male']].rename(columns={'male':'sum'})
df_cross_M['gender']='00'
df_cross_F=df_cross[['age','female']].rename(columns={'female':'sum'})
df_cross_F['gender']='01'
df_cross_U=df_cross[['age','unkown']].rename(columns={'unkown':'sum'})
df_cross_U['gender']='99'
df=pd.concat([df_cross_M,df_cross_F,df_cross_U])
df
'''Model answer'''
df_sales_summary = df_sales_summary.set_index('era'). \
stack().reset_index().replace({'female':'01',
'male':'00',
'unknown':'99'}).rename(columns={'level_1':'gender_cd', 0: 'amount'})
The model answer is finished with a very long sentence.
Did you use setindex ()
to save the chronological columns to the index and unify all the columns?
I think it's amazing that you can change the name at once with .replace ()
. I definitely want to master it
In any case, there are more and more ways that are different from my own, so it's very educational.
Thank you to those who always stack up. If there are people who can see their progress instead of registering individual articles, we will send you an update notification every time you update the table of contents of part1. Therefore, if you stock part1 without stocking all the articles, you can see your progress, so I think that you will not have to bother. (I just think)
I would like to thank everyone who has always seen it, this time for the first time, and everyone. Thank you for browsing.