Data science 100 knocks ~ Battle for less than beginners part5

This is a record of my struggle through the 100 knocks, written by someone who doesn't even qualify as a fledgling data scientist. Whether I can make it to the end is a mystery. ~~ Even if this series vanishes partway, please assume I just stopped posting it to Qiita. ~~

100 knock articles 100 Knock Guide

** This post contains spoilers, so be careful if you plan to try the problems yourself **

I'm writing all of this up here to fill about one screen, so the spoilers stay out of sight ()

This time I kept going off on tangents and lost track of what I was trying to look up (problem 30 was the cause).

Hard to read! This way of writing is dangerous! If you spot anything like that, please let me know. ~~ I will take the damage to my heart and use it as nourishment. ~~

This time covers 29 to 32. [Last time] covered 23-28. [First time, with table of contents]

** There were many things I didn't understand this time **

29th

P-029: For the receipt detail data frame (df_receipt), find the mode of the product code (product_cd) for each store code (store_cd).

I knew that the most frequent value is called the mode, but when I carried on from last time and wrote groupby('store_cd').agg({'product_cd':['mode']}), it fell over. Or rather, 'mode' doesn't seem to be available in agg?

Even after checking help and the usual site, I couldn't get it to work. Why? Article

As the following page also says, if you want the mode after a groupby, you need to combine value_counts and apply. The page is in English, but the code example is simple and easy to follow, so please take a look. https://github.com/pandas-dev/pandas/issues/11562

df.groupby('grouping column').<column you want the mode of>.apply(lambda x: x.mode())

** ~~ Lambda …… ~~ **

mine29.py


df=df_receipt
df=df.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()
df.head(5)

** Yeah, I don't know **
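To make the lambda a little less mysterious, here is a minimal sketch on a made-up miniature of df_receipt (the `toy` frame and its values are assumptions for illustration, not the real data). It shows that `x.mode()` hands back every value tied for first place, which is why the result can have more than one row per store, and it also tries the value_counts route mentioned above.

```python
import pandas as pd

# Hypothetical miniature of df_receipt; the values are made up.
toy = pd.DataFrame({
    "store_cd":   ["S1", "S1", "S1", "S2", "S2", "S2", "S2"],
    "product_cd": ["P1", "P1", "P2", "P3", "P4", "P3", "P4"],
})

# x.mode() returns a Series with every value tied for most frequent,
# so apply() produces one row per (store, mode) pair.
modes = (
    toy.groupby("store_cd")["product_cd"]
       .apply(lambda x: x.mode())
       .reset_index()
)
print(modes)

# The value_counts route: counts are sorted in descending order,
# so the first index label is a most frequent value.
# With ties (store S2 here), which tied value comes first is not
# something to rely on.
top1 = toy.groupby("store_cd")["product_cd"].agg(
    lambda x: x.value_counts().index[0]
)
print(top1)
```

S1 has a single mode, while S2 has two products tied, so the mode() version keeps both and the value_counts version keeps only one of them.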

30th

P-030: For the receipt detail data frame (df_receipt), calculate the sample variance of the sales amount (amount) for each store code (store_cd), and display the TOP5 in descending order.

~~ You can guess how it went from the fact that I'm just pasting problem statements one after another ~~

First of all, my math gave out in the second year of high school, and more than 10 years later I start dying whenever I see a sigma with letters and big symbols written above, below, left, and right of it. ~~ Even when I need a standard deviation in Excel, I copy and paste it ~~

Then, in that state, the words "sample variance" showed up.

** Variance **

Just seeing that written makes my eyes roll back. Doesn't it?

The sample variance formula looks like this

variance178.png
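In case the image is not visible: assuming it shows the usual form that divides by n (the form the program-style code in this post computes), it can be written as follows. The symbols are an assumption for illustration: n is the number of values, x_i is each value, and x̄ is their mean.

```latex
s^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```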

~~ Dying ~~ Squinting at it, I wrote it out like a program for the time being to get it straight in my head.

ikinokoru.py


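def Sno2jo(X_list):
    # Variance that divides by n (not n - 1)

    n = len(X_list)
    X_ave = sum(X_list) / n  # mean of the values
    ret = 0.0

    # Loop over the values themselves; range(X_list) raises TypeError for a list
    for X in X_list:
        ret += (X - X_ave)**2  # accumulate squared deviations from the mean

    ret = ret / n
    return ret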

Something like this...? (Arrays start at 0) (I haven't actually run it)

** As a program, I can follow it **
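One way to gain some confidence in a hand-written version is to compare it with pandas on a tiny input. This is only a sketch: the helper name `variance_ddof0` and the numbers are made up for illustration.

```python
import pandas as pd

def variance_ddof0(values):
    """Hand-rolled variance with divisor n (hypothetical helper name)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

# Made-up numbers, chosen so that the mean is a whole number (5).
data = [2, 4, 4, 4, 5, 5, 7, 9]

by_hand = variance_ddof0(data)
by_pandas = pd.Series(data).var(ddof=0)  # ddof=0 also divides by n
print(by_hand, by_pandas)
```

The squared deviations here add up to 32, and 32 / 8 = 4, so both lines should agree on a variance of 4.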

After that, looking at various sites and a [video](https://www.youtube.com/watch?v=lD35jzfrxaU), I more or less understood it, but since I had gone far off track, I wrapped it up and went back to the knocks.

This time I decided it would be easier to just use a statistical function (that's what I'm using Python for) and solve it that way.

** Reference **

mine30.py


df=df_receipt
df= df.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount',ascending=False).head(5)
df

'''Model answer'''
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)

'''Failure example'''
df=pd.concat([df['store_cd'],df.groupby('store_cd').amount.var(ddof=0)],axis=0)

In the failure example, the result came out like this (image.png), and I wondered why the two wouldn't combine properly. Then I figured the groupby result would be fine on its own, and I managed to reach the answer by myself.

Well, that took a long time
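A possible reading of the failure example, sketched on made-up data (the `toy` frame is an assumption, and image.png is not reproduced here): with axis=0, pd.concat appends the second Series underneath the first instead of lining them up as columns. The groupby result already carries store_cd as its index, so reset_index() alone gives the two-column table.

```python
import pandas as pd

# Hypothetical miniature of df_receipt; the values are made up.
toy = pd.DataFrame({
    "store_cd": ["S1", "S1", "S1", "S2", "S2"],
    "amount":   [100, 200, 300, 50, 150],
})

var_by_store = toy.groupby("store_cd")["amount"].var(ddof=0)

# axis=0 stacks vertically: 5 store codes followed by 2 variances.
stacked = pd.concat([toy["store_cd"], var_by_store], axis=0)
print(len(stacked))

# store_cd is already the index of the groupby result,
# so turning it back into a column is all that is needed.
table = var_by_store.reset_index().sort_values("amount", ascending=False)
print(table)
```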

31st

mine31.py


df=df_receipt
df= df.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount',ascending=False).head()
df

I got through this one by copying 30 as-is. When I wrote ddof=False, I was told to pass an int, so I set it to 0. Judging from here and the video, is ddof the switch between the consistent estimator and the unbiased estimator?
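As a small check on what ddof changes: in pandas the divisor is n - ddof, so ddof=0 divides by n and ddof=1 (the pandas default) divides by n - 1. The sketch below uses made-up numbers and only shows the size of the difference, not which estimator is the right one for a given problem.

```python
import math
import pandas as pd

amount = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up values
n = len(amount)

std_n = amount.std(ddof=0)          # divisor n
std_n_minus_1 = amount.std(ddof=1)  # divisor n - 1 (pandas default)
print(std_n, std_n_minus_1)

# The two differ only by the factor sqrt(n / (n - 1)),
# which shrinks toward 1 as n grows.
print(math.isclose(std_n_minus_1, std_n * math.sqrt(n / (n - 1))))
```

Note that NumPy's np.std and np.var default to ddof=0, the opposite of pandas, so it is safer to pass ddof explicitly when comparing results.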

32nd

I didn't really understand what a percentile value means (well, I sort of did), so I looked it up and then went straight to the reference site (a smooth bit of cheating).

** Reference **

mine32.py


df=df_receipt
df['amount'].quantile([0.0,0.25,0.5,0.75,1.0])

'''Model answer 1'''
np.percentile(df_receipt['amount'], q=[25, 50, 75,100])

'''Model answer 2'''
df_receipt.amount.quantile(q=np.arange(5)/4)

My approach is close to model answer 2, but using np.arange there struck me as a nice idea.
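To see what np.arange(5)/4 actually produces, and that quantile and np.percentile agree once the scale is matched (fractions from 0 to 1 versus percentages from 0 to 100), here is a minimal sketch on made-up amounts. Note that model answer 1 as quoted starts at 25, so it leaves out the 0th percentile (the minimum).

```python
import numpy as np
import pandas as pd

amount = pd.Series([100, 200, 300, 400, 500])  # made-up values

# Five evenly spaced fractions: 0, 0.25, 0.5, 0.75, 1
q = np.arange(5) / 4
print(q)

by_quantile = amount.quantile(q)                # expects fractions 0..1
by_percentile = np.percentile(amount, q * 100)  # expects percentages 0..100
print(by_quantile.to_numpy(), by_percentile)
```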

That's all for this time

For those who want to play with these CSVs: ~~ they are saved under \100knocks-preprocess\docker\work ~~ There are various CSVs here
