This is a record of a fledgling data scientist struggling through the 100 knocks without really knowing what they are doing. Whether I can make it to the finish line is a mystery. ~~If I vanish partway through, please just assume I stopped posting to Qiita.~~
100 Knocks article, 100 Knocks guide
**This post contains spoilers, so be careful if you plan to work through the problems yourself.**
The reason I'm rambling here is to pad out about a page so the spoilers stay off screen (lol)
This time I kept going off on tangents and lost track of what I was trying to look up (problem 30 was the culprit).
If something is hard to read, or a way of writing code is risky, please let me know. ~~I'll use it as fuel to improve while my heart takes the damage.~~
This time covers problems 29 to 32. [Previous post] 23-28 [First post, with table of contents]
**There was a lot I didn't understand this time.**
P-029: Find the mode of the product code (product_cd) for each store code (store_cd) for the receipt detail data frame (df_receipt).
I knew that the mode is the most frequent value, but if I carry on from last time and write
groupby('store_cd').agg({'product_cd':['mode']})
it fails. Or rather, agg doesn't seem to have 'mode' at all?
Even after checking help and the usual site, I couldn't get it to work. Why? Article
As the following site also explains, to find the mode after a groupby you need to combine value_counts and apply. It is in English, but the code example is simple and easy to follow, so please refer to it. https://github.com/pandas-dev/pandas/issues/11562
df.groupby('grouping_column').target_column.apply(lambda x: x.mode())
**~~Lambda......~~**
mine29.py
df=df_receipt
df=df.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()
df.head(5)
**Yeah, I don't get it.**
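To make the lambda a little less mysterious, here is a minimal sketch on a made-up toy table (the `toy` frame and its values are my own assumption, not the real df_receipt). Each group's product_cd values are handed to the lambda as a Series `x`, and `x.mode()` returns every most-frequent value, so a tie produces more than one row for that store.

```python
import pandas as pd

# Toy stand-in for df_receipt (hypothetical values, not the real data)
toy = pd.DataFrame({
    "store_cd":   ["S1", "S1", "S1", "S2", "S2", "S2", "S2"],
    "product_cd": ["P1", "P1", "P2", "P3", "P4", "P3", "P4"],
})

# For each store, the lambda receives that store's product_cd values
# as a Series x; x.mode() returns a Series of the most frequent value(s).
modes = toy.groupby("store_cd")["product_cd"].apply(lambda x: x.mode())

# S1 has a single mode (P1); S2 has a tie (P3 and P4), so it gets two rows.
print(modes.reset_index())
```

That tie behavior is probably why an extra index level shows up after reset_index(): it numbers the modes within each store.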
P-030: For the receipt detail data frame (df_receipt), calculate the sample variance of the sales amount (amount) for each store code (store_cd), and display the TOP5 in descending order.
~~The fact that I'm just pasting the problem statements one after another should tell you how it went.~~
First of all, my math stops at around second-year high school level, and that was more than 10 years ago, so I die when a sigma shows up with letters and big symbols written above, below, and on both sides of it. ~~Even when I need a standard deviation in Excel, I just copy and paste.~~
And in that state, the words "sample variance" showed up.
**Variance**
Just seeing that written makes my eyes roll back. Doesn't it for you?
The sample variance formula looks like this
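The formula itself did not survive here. Going by the program that follows (squared deviations from the mean, summed and divided by n), it is presumably the divide-by-n form, which I write as a reconstruction rather than a quote:

$$
s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$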
~~Dying~~ For now, I squinted at it and wrote it out like a program to organize my thoughts.
ikinokoru.py
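def Sno2jo(X_list):
    # Variance as the mean of squared deviations (divides by n)
    n = len(X_list)
    X_ave = sum(X_list) / n  # mean of the values
    ret = 0.0
    for X in X_list:  # loop over the values themselves
        ret += (X - X_ave) ** 2
    ret = ret / n
    return ret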
Something like this...? (Arrays start at 0) (Not tested)
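As a sanity check on the idea, here is a small sketch that writes the same divide-by-n calculation as a standalone function and compares it with NumPy. The function name `population_variance` and the sample numbers are my own, chosen so the arithmetic is easy to follow by hand (mean 5, squared deviations summing to 32, divided by 8).

```python
import numpy as np

def population_variance(values):
    """Mean of squared deviations from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data for illustration

by_hand = population_variance(sample)
by_numpy = np.var(sample)  # np.var uses ddof=0 by default, i.e. divides by n

print(by_hand, by_numpy)
```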
**In program form, I can handle it.**
After that, while looking at various sites and a [video](https://www.youtube.com/watch?v=lD35jzfrxaU), I more or less understood it, but since I had gone far off track, I wrapped it up and went back to the knocks.
This time I figured it would be easier to just use a statistics function (that's what I'm using Python for) and solve it.
**Reference**
mine30.py
df=df_receipt
df= df.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount',ascending=False).head(5)
df
'''Model answer'''
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)
'''Failure example'''
df=pd.concat([df['store_cd'],df.groupby('store_cd').amount.var(ddof=0)],axis=0)
In the failed example, seeing how the result came out, I assumed I had to join the columns back together and couldn't get them to combine properly. Then I realized the groupby result was fine on its own, and I managed to answer it by myself.
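My guess at why the failed attempt looks off, sketched on made-up data (the `toy` frame is an assumption, not the real df_receipt): the groupby result is already a Series indexed by store_cd, so pd.concat with axis=0 just stacks it underneath the original store_cd column instead of lining anything up. reset_index() is enough to turn the index back into a column.

```python
import pandas as pd

# Toy stand-in for df_receipt (hypothetical values)
toy = pd.DataFrame({
    "store_cd": ["S1", "S1", "S2", "S2", "S3", "S3"],
    "amount":   [100, 300, 200, 200, 50, 450],
})

# The groupby result is a Series whose index is store_cd
var_by_store = toy.groupby("store_cd")["amount"].var(ddof=0)

# axis=0 stacks the two Series vertically: 6 rows + 3 rows = 9 rows
stacked = pd.concat([toy["store_cd"], var_by_store], axis=0)

# reset_index() moves store_cd into a column, then sort and take the top 5
top = (
    var_by_store.reset_index()
    .sort_values("amount", ascending=False)
    .head(5)
)
print(top)
```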
Man, that took a long time.
mine31.py
df=df_receipt
df= df.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount',ascending=False).head()
df
I cleared this one by copying my answer to 30 as is.
When I wrote ddof=False, I was told to pass an int, so I set it to 0.
Going by this site and the video, is ddof the difference between a consistent estimator and an unbiased estimator?
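What ddof does mechanically is simple: the divisor becomes n - ddof. So ddof=0 divides by n, and ddof=1 divides by n - 1, which is the usual unbiased variance estimate. Note that pandas defaults to ddof=1 while NumPy defaults to ddof=0. A minimal sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data; squared deviations sum to 32

var_n = s.var(ddof=0)    # divides by n = 8
var_n1 = s.var(ddof=1)   # divides by n - 1 = 7 (the pandas default)
std_n = s.std(ddof=0)    # square root of var_n

print(var_n, var_n1, std_n)
```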
I didn't really understand what a percentile value was (well, I sort of did), so I looked it up and then went straight to the reference site (cheating in one smooth motion).
**Reference**
mine32.py
df=df_receipt
df['amount'].quantile([0.0,0.25,0.5,0.75,1.0])
'''Model answer 1'''
np.percentile(df_receipt['amount'], q=[25, 50, 75,100])
'''Model answer 2'''
df_receipt.amount.quantile(q=np.arange(5)/4)
My approach is close to model answer 2, but I thought using np.arange was a nice idea.
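One thing worth seeing side by side: pandas quantile takes fractions between 0 and 1, while np.percentile takes values between 0 and 100. A small sketch on made-up numbers (not the real amount column), where np.arange(5)/4 produces 0, 0.25, 0.5, 0.75, 1:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])  # hypothetical data for illustration

q = np.arange(5) / 4  # array([0.  , 0.25, 0.5 , 0.75, 1.  ])

by_pandas = s.quantile(q)            # fractions in [0, 1]
by_numpy = np.percentile(s, q * 100) # percentages in [0, 100]

print(by_pandas)
print(by_numpy)
```

Both use linear interpolation by default, so they should agree.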
For those who want to play with the CSVs used this time: they are saved under \100knocks-preprocess\docker\work, where you will find the various CSVs.