Data Science 100 Knock ~ Battle for less than beginners part11

This is a record of a fledgling data scientist struggling through the 100 Knocks without really knowing what is going on. Whether I can make it to the end is a mystery. ~~If I disappear partway through, please assume I just haven't posted it to Qiita.~~

100 knock articles 100 Knock Guide

**This post contains spoilers, so be careful if you plan to work through the problems yourself.**

I may not be able to update for a while. Sorry if I disappear.

If something is hard to read or a way of writing is dangerous, please tell me. ~~I will take the damage to my heart and use it as nourishment.~~

If a solution is wrong or an interpretation is off, please leave a comment.

This time covers problems 57 to 62. [Last time] 52-56 [First post, with table of contents]

57th

P-057: Combine the extraction result of the previous question and gender, and create new category data that represents the combination of gender and age. The value of the category representing the combination is arbitrary. The first 10 items should be displayed.

Here is only my program from the previous question and the first row of its output.

df=df_customer.copy()
df_bins=pd.cut(df.age,[10,20,30,40,50,60,150],right=False,labels=[10,20,30,40,50,60])
df=pd.concat([df[['customer_id','birth_day']],df_bins],axis=1)
df.head(10)


|customer_id|birth_day|age|
|--:|--:|--:|
|CS021313000114|1981-04-29|30|


#### **`mine57.py`**
```py
df=pd.concat([df_customer[['customer_id','birth_day','gender_cd']],df_bins],axis=1)
df['age_gen']=df.gender_cd.astype('str')+df.age.astype('str')
df.head(10)

'''Model answer'''
df_customer_era['era_gender'] = df_customer['gender_cd'] + df_customer_era['age'].astype('str')
df_customer_era.head(10)
```

Since I end up calling pd.concat anyway, I felt there was no real need to reuse the result from last time.

Anyway, the goal this time is to put the gender digit in front of the value in the `age` column, e.g. `1 (female) + 30 (age) = 130`.
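The combination can be tried on a tiny frame first. This is only a sketch: the three customers below are invented for illustration, and `gender_cd` is assumed to be stored as a string, as the model answer's plain `+` suggests.

```python
import pandas as pd

# Hypothetical mini version of df_customer (values invented for illustration)
df = pd.DataFrame({
    'customer_id': ['A001', 'A002', 'A003'],
    'gender_cd': ['1', '0', '9'],
    'age': [37, 24, 65],
})

# Bin ages into decades; right=False makes each bin [low, high)
era = pd.cut(df['age'], [10, 20, 30, 40, 50, 60, 150],
             right=False, labels=[10, 20, 30, 40, 50, 60])

# Concatenate as strings: gender digit first, then the age decade
df['age_gen'] = df['gender_cd'].astype(str) + era.astype(str)
print(df)
```

Because both parts are strings, the `+` joins them instead of adding numbers, so a male customer in his twenties becomes `020` rather than `20`.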

At first I didn't understand what the question was asking, and wrote this:

miss57.py


df=pd.concat([df_customer[['customer_id','birth_day','gender_cd']],df_bins],axis=1)
df=df.groupby(['age','gender_cd']).agg({'customer_id':'count'})
pd.pivot_table(df,index='age',columns='gender_cd')

~~I accidentally did a cross-tabulation.~~

58th

P-058: Make the gender code (gender_cd) of the customer data frame (df_customer) a dummy variable and extract it together with the customer ID (customer_id). You can display 10 results.

mine58.py


df=df_customer.copy()
pd.concat([df['customer_id'],pd.get_dummies(df['gender_cd'])],axis=1).head(10)

'''Model answer'''
pd.get_dummies(df_customer[['customer_id', 'gender_cd']], columns=['gender_cd']).head(10)

I wondered what a dummy variable was, so I looked it up. It seems that one column is created for each value of the item, and each row shows whether it has that value with `true` or `false`.

Or rather, it's faster to look at the table

|male|female|unknown|
|--:|--:|--:|
|0|1|0|
|0|0|1|
|0|1|0|
|0|1|0|
|0|1|0|

Something like this.
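To see the same thing happen in code, here is a minimal sketch on invented data. The customer IDs are made up, and the codes 0 (male), 1 (female), 9 (unknown) follow the table above. Depending on the pandas version the flags come back as booleans or integers, so they are cast to `int` here to print as 0 and 1.

```python
import pandas as pd

# Hypothetical customers (invented for illustration)
df = pd.DataFrame({
    'customer_id': ['A001', 'A002', 'A003', 'A004'],
    'gender_cd': ['1', '9', '1', '0'],
})

# One new column per distinct code; each row has exactly one flag set
dummies = pd.get_dummies(df['gender_cd'], prefix='gender_cd').astype(int)

# Put the customer ID back next to the flags
result = pd.concat([df['customer_id'], dummies], axis=1)
print(result)
```

The resulting columns are `gender_cd_0`, `gender_cd_1` and `gender_cd_9`, which correspond to male, female and unknown.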

59th

P-059: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), **standardize** the total to mean 0 and standard deviation 1, and display it together with the customer ID and the total sales amount. The standard deviation used for standardization may be either the unbiased standard deviation or the sample standard deviation. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.

……

………

What is standardization?

I read various sites and tried to understand, but ~~I skipped math back in the day, so~~ my understanding couldn't keep up.

Maybe I should ask my seniors to recommend a reference site. Meanwhile I wrote things like

df['hyou1'] =df['amount_sum'] - df.amount_sum.mean()

and got it wrong.

So I looked up `preprocessing.scale` from the model answer and worked backward to try to understand it.

https://note.nkmk.me/python-list-ndarray-dataframe-normalize-standardize/ Second half

Normalization / standardization of pandas.DataFrame and pandas.Series Use pandas methods ~ Omitted ~

In the program:

print(((df.T - df.T.mean()) / df.T.std()).T)
#    col1  col2  col3
# a  -1.0   0.0   1.0
# b  -1.0   0.0   1.0
# c  -1.0   0.0   1.0

This is it. In other words, it is `(data - data.mean()) / data.std()`. If so:

mine59.py


df=df_receipt.copy()
df=df.query('not customer_id.str.startswith("Z")',engine='python')
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()

df['hyou1'] =(df['amount'] - df.amount.mean()) / df.amount.std()
df.head(10)

'''Model answer'''
# Uses sklearn's preprocessing.scale, so this is calculated with the sample standard deviation (not the unbiased one)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_ss'] = preprocessing.scale(df_sales_amount['amount'])
df_sales_amount.head(10)

It matched.
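One caveat on "it matched": pandas' `Series.std()` defaults to the unbiased standard deviation (`ddof=1`), while `preprocessing.scale` divides by the `ddof=0` standard deviation, which the problem statement calls the sample standard deviation. The two results are therefore close but not identical unless `ddof=0` is passed. The sketch below uses invented totals and plain pandas in place of scikit-learn to show the gap.

```python
import numpy as np
import pandas as pd

# Invented per-customer totals, for illustration only
amount = pd.Series([100, 200, 300, 400, 1000])

# pandas default: unbiased standard deviation (divides by n - 1)
z_unbiased = (amount - amount.mean()) / amount.std()

# ddof=0 divides by n, which is what preprocessing.scale does
z_ddof0 = (amount - amount.mean()) / amount.std(ddof=0)

print(z_unbiased.round(4).tolist())
print(z_ddof0.round(4).tolist())

# Both versions have mean 0; only the scale differs slightly
print(np.isclose(z_unbiased.mean(), 0.0), np.isclose(z_ddof0.mean(), 0.0))
```

With many customers the difference between dividing by n and by n - 1 becomes tiny, which is probably why the first 10 rows look the same at a glance.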

60th

P-060: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), normalize the total to a minimum of 0 and a maximum of 1, and display it together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.

On the same site there is

print((df - df.min()) / (df.max() - df.min()))
#    col1  col2  col3
# a   0.0   0.0   0.0
# b   0.5   0.5   0.5
# c   1.0   1.0   1.0

so I reuse this.

mine60.py


df=df_receipt.copy()
df=df.query('not customer_id.str.startswith("Z")',engine='python')
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df['minmax'] =(df['amount'] - df.amount.min()) / (df.amount.max()-df.amount.min())
df.head(10)

'''Model answer'''
# Uses sklearn's preprocessing.minmax_scale
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_mm'] = preprocessing.minmax_scale(df_sales_amount['amount'])
df_sales_amount.head(10)
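The min-max formula is easy to check by hand on a few numbers. The totals below are invented for illustration, and the second part is just an assumption-free edge case worth knowing about: if every value is the same, the denominator is 0 and the result is NaN.

```python
import pandas as pd

# Invented per-customer totals, for illustration only
amount = pd.Series([100, 200, 300, 400, 1000])

# Smallest value maps to 0, largest to 1, the rest in between
scaled = (amount - amount.min()) / (amount.max() - amount.min())
print(scaled.tolist())

# Edge case: all values equal, so max - min is 0 and 0 / 0 gives NaN
flat = pd.Series([5, 5, 5])
flat_scaled = (flat - flat.min()) / (flat.max() - flat.min())
print(flat_scaled.tolist())
```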

61st and 62nd

P-061: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), convert the total to its common logarithm (base 10), and display it together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.

P-062: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), convert the total to its natural logarithm (base e), and display it together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.

Taking the logarithm can be done with the log functions in the math module.

mine61_62.py


import math

df=df_receipt.copy()
df=df.query('not customer_id.str.startswith("Z")',engine='python')
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()

#61, common logarithm
df['jouyou']=df.amount.apply(lambda x: math.log10(x))
#62, natural logarithm
df['shizen']=df.amount.apply(lambda x: math.log(x))

df.head(10)

This gives the values.

mohan61_62.py


# Common logarithm (base 10) of amount + 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_log10'] = np.log10(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)

# Natural logarithm (base e) of amount + 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_loge'] = np.log(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)

………

What is + 1?

That's all for this time.

Logarithms are where I started to get lost in high school math, so I'm really sorry. If you know what this + 1 is for, please comment. I really don't get it.
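On the + 1: one common reason, offered here as a likely explanation rather than a confirmed one for the model answer, is that the logarithm of 0 is undefined. NumPy returns `-inf` for it, and adding 1 first maps a total of 0 to exactly 0 instead. The sketch below uses invented totals that include a zero. NumPy also has `np.log1p`, which computes the natural log of `x + 1` directly.

```python
import numpy as np

# Invented totals, including a zero, for illustration only
amounts = np.array([0, 9, 99])

# Without the + 1, the zero becomes -inf (warning silenced here)
with np.errstate(divide='ignore'):
    plain = np.log10(amounts)

# With the + 1, the zero maps to 0 and everything stays finite
shifted = np.log10(amounts + 1)

print(plain)
print(shifted)

# np.log1p(x) is the natural-log version of log(x + 1)
print(np.allclose(np.log(amounts + 1), np.log1p(amounts)))
```

For large totals the + 1 barely changes the value, so the results stay close to what `math.log10(x)` gives on its own.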
