This is a struggle record of knocking 100 eggs without knowing the data scientist's egg. It is a mystery whether I can finish the race. ~~ Even if it disappears on the way, please think that it is not given to Qiita. ~~
100 knock articles 100 Knock Guide
** Be careful if you are trying to do it as it includes spoilers **
I don't understand the problem of statistics. (59 ~)
This is hard to see! This way of writing is dangerous! If you have any questions, please let me know. ~~ I will use it as food while suffering damage to my heart.
This solution is wrong! This interpretation is different! Please comment if you have any.
This time from 52 to 56. [Last time] 45-51 [First time with table of contents]
P-052: Total sales amount (amount) of receipt detail data frame (df_receipt) for each customer ID (customer_id), and 0 for 2000 yen or less and 1 for 2000 yen or more for the total sales amount. Display 10 items together with the customer ID and the total sales amount. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.
mine52.py
df=df_receipt[df_receipt['customer_id'].str.contains('^[^Z]')].reset_index()
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df['amount']=df['amount'].apply(lambda x : 0 if x<=2000 else 1)
df.head(10)
'''Model answer'''
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
df_sales_amount['sales_flg'] = df_sales_amount['amount'].apply(lambda x: 1 if x > 2000 else 0)
df_sales_amount.head(10)
After doing groupby
, add up and divide it into 0 and 1 while comparing them.
If anything, it's a question of how to master lambda
, so it was a python problem after a long time.
Python isn't the only problem, but it's really annoying to have to solve it with type-aware sentences in sentences that aren't type-aware.
In this case, it took a long time to understand which hierarchy type of data is contained in x
oflambda x: ~
.
P-053: Binar Tokyo (the first 3 digits are 100 to 209) to 1 and the other ones to 0 for the postal code (postal_cd) of the customer data frame (df_customer). In addition, combine it with the receipt detail data frame (df_receipt) and count the number of customers who have a shopping record for the entire period for each of the created two values.
mine53.py
df=df_customer.copy()
df['postal_cd']=df.postal_cd.apply(lambda x: 1 if 100<=int(x[:3])<=209 else 0)
df=pd.merge(df,df_receipt,how='inner',on='customer_id')
df.groupby('postal_cd').agg({'customer_id':'nunique'})
'''Model answer'''
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = df_tmp['postal_cd'].apply(lambda x: 1 if 100 <= int(x[0:3]) <= 209 else 0)
pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'). \
groupby('postal_flg').agg({'customer_id':'nunique'})
It is too convenient to be able to write the form 100 <= x <= 209 in the if statement in lambda
. It's a secret that I forgot to write ʻif because I concentrated too much on ʻelse
.
P-054: The address of the customer data data frame (df_customer) is one of Saitama prefecture, Chiba prefecture, Tokyo metropolitan area, and Kanagawa prefecture. Create a code value for each prefecture and extract it along with the customer ID and address. The values should be 11 for Saitama prefecture, 12 for Chiba prefecture, 13 for Tokyo, and 14 for Kanagawa prefecture. You can display 10 results.
mine54.py
df=df_customer.copy()
df['to_cd']=df['address'].apply(lambda x:11 if x.startswith('Saitama')\
else 12 if x.startswith('Chiba')\
else 13 if x.startswith('Tokyo') \
else 14 if x.startswith('Kanagawa Prefecture') else 0)
df[['customer_id','address','to_cd']].head(10)
'''Model answer'''
pd.concat([df_customer[['customer_id', 'address']], df_customer['address'].str[0:3].map({'Saitama': '11',
'Chiba':'12',
'Tokyo':'13',
'Kanagawa':'14'})], axis=1).head(10)
Well, it's convenient for start with ()
(I forgot to spell contains
)
What you have to be careful about is to return with bool
type.
P-055: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), and find the quartile of the total amount. Then, create a category value for the total sales amount for each customer based on the following criteria, and display both the customer ID and sales amount. Category values are 1 to 4 in order from the top. You can display 10 results.
-More than the minimum and less than the first quartile --From the first quartile to less than the second quartile --From the second quartile to less than the third quartile --Third quartile and above
mine55.py
df=df_receipt.copy()
df=df.groupby('customer_id')['amount'].sum().reset_index()
si=df.quantile(q=[0.25,0.5,0.75]).T
#float(si[0.25])
df['sibun']=df['amount'].apply(lambda x:1 if x<float(si[0.25])
else 2 if x<float(si[0.5])
else 3 if x<float(si[0.75])
else 4)
df.head(10)
'''Model answer'''
#Code example 1
df_sales_amount = df_receipt[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
pct25 = np.quantile(df_sales_amount['amount'], 0.25)
pct50 = np.quantile(df_sales_amount['amount'], 0.5)
pct75 = np.quantile(df_sales_amount['amount'], 0.75)
def pct_group(x):
if x < pct25:
return 1
elif pct25 <= x < pct50:
return 2
elif pct50 <= x < pct75:
return 3
elif pct75 <= x:
return 4
df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(lambda x: pct_group(x))
df_sales_amount.head(10)
#Code example 2
df_temp = df_receipt.groupby('customer_id')[['amount']].sum()
df_temp['quantile'], bins = pd.qcut(df_receipt.groupby('customer_id')['amount'].sum(), 4, retbins=True)
display(df_temp.head())
print('quantiles:', bins)
Get a quartile from the total and divide by less than or equal to ...
However, in code example 2, the total is put in pd.cut
as it is and it is specified as 4 divisions.
The quartile was what I was doing on the 32nd run, but I forgot as expected. Fukushu Shinakan
P-056: Calculate the age in 10-year increments based on the age of the customer data frame (df_customer), and extract it together with the customer ID (customer_id) and date of birth (birth_day). However, all people over the age of 60 should be in their 60s. The category name indicating the age is arbitrary. The first 10 items should be displayed.
mine56.py
df=df_customer.copy()
df_bins=pd.cut(df.age,[10,20,30,40,50,60,150],right=False,labels=[10,20,30,40,50,60])
df=pd.concat([df[['customer_id','birth_day']],df_bins],axis=1)
df.head(10)
'''Model answer'''
#Code example 1
df_customer_era = pd.concat([df_customer[['customer_id', 'birth_day']],
df_customer['age'].apply(lambda x: min(math.floor(x / 10) * 10, 60))],
axis=1)
df_customer_era.head(10)
#Code example 2
df_customer['age_group'] = pd.cut(df_customer['age'], bins=[0, 10, 20, 30, 40, 50, 60, np.inf], right=False)
df_customer[['customer_id', 'birth_day', 'age_group']].head(10)
This time it's the 43rd review.
I didn't know the existence of np, inf
, so I put in the number 150, which is obviously impossible. Also, I used df.age.min ()
to get the minimum age and confirmed that 10 ~ was fine, and then pulled out 0.
There is an item called labels
in pd.cut
, so if you put it in, you will get a beautiful number.
Also, right = False will take the range 10 <= x <20
.