This is a record of my struggle through the data science 100 Knocks, written by someone who isn't even a fledgling data scientist yet. Whether I can make it to the end is a mystery. ~~If the series stops partway, please assume I just haven't posted the rest to Qiita.~~
100 Knocks article / 100 Knocks guide
**Be careful if you plan to work through the problems yourself, because this post contains spoilers.**
I may not be able to update for a while. If the series goes quiet, sorry.
If something is hard to read or a way of writing code is dangerous, please point it out. ~~I'll use it as nourishment while taking damage to my heart.~~
If a solution is wrong or you read a question differently, please leave a comment.
This time covers P-057 to P-062. [Previous post] P-052 to P-056 [First post, with table of contents]
P-057: Combine the extraction result of the previous question and gender, and create new category data that represents the combination of gender and age. The value of the category representing the combination is arbitrary. The first 10 items should be displayed.
From the previous question, I'll show only my program and the first row of its output:
```py
df = df_customer.copy()
df_bins = pd.cut(df.age, [10, 20, 30, 40, 50, 60, 150], right=False, labels=[10, 20, 30, 40, 50, 60])
df = pd.concat([df[['customer_id', 'birth_day']], df_bins], axis=1)
df.head(10)
```

>|customer_id |birth_day |age|
|--:|--:|--:|
|CS021313000114 |1981-04-29 |30|
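#### **`mine57.py`**
```py
# df_bins is the age-decade Series built with pd.cut in the previous question
df = pd.concat([df_customer[['customer_id', 'birth_day', 'gender_cd']], df_bins], axis=1)
# Join the gender code and the age decade as strings, e.g. '1' + '30' -> '130'
df['age_gen'] = df.gender_cd.astype('str') + df.age.astype('str')
df.head(10)

'''Model answer'''
df_customer_era['era_gender'] = df_customer['gender_cd'] + df_customer_era['age'].astype('str')
df_customer_era.head(10)
```
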
Since I ran `pd.concat` again anyway, I feel like I didn't really need to reuse anything from the previous question.
Still, the goal this time is to put the gender digit in front of the value in the `age` column, for example `1 (female) + 30 (age decade) = 130`.
Before I understood that, I wrote this:
#### **`miss57.py`**
```py
df = pd.concat([df_customer[['customer_id', 'birth_day', 'gender_cd']], df_bins], axis=1)
df = df.groupby(['age', 'gender_cd']).agg({'customer_id': 'count'})
pd.pivot_table(df, index='age', columns='gender_cd')
```

~~I accidentally made a cross-tabulation.~~
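
To make the goal concrete, here is a minimal sketch on a made-up three-row frame. `df_demo` and its values are invented for illustration and only stand in for `df_customer`; the bins are the same ones used in my program.

```py
import pandas as pd

# Hypothetical stand-in for df_customer (values are made up)
df_demo = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "gender_cd": ["0", "1", "9"],
    "age": [25, 37, 64],
})

# Bin ages into decades; right=False makes each bin [lower, upper)
era = pd.cut(
    df_demo["age"],
    [10, 20, 30, 40, 50, 60, 150],
    right=False,
    labels=[10, 20, 30, 40, 50, 60],
)

# String concatenation: gender code first, then the age decade
df_demo["age_gen"] = df_demo["gender_cd"].astype(str) + era.astype(str)
print(df_demo)
```

Customer B (gender `1`, age 37) ends up as `130`, which is the kind of category value the question is after.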
P-058: Make the gender code (gender_cd) of the customer data frame (df_customer) a dummy variable and extract it together with the customer ID (customer_id). You can display 10 results.
#### **`mine58.py`**
```py
df = df_customer.copy()
pd.concat([df['customer_id'], pd.get_dummies(df['gender_cd'])], axis=1).head(10)

'''Model answer'''
pd.get_dummies(df_customer[['customer_id', 'gender_cd']], columns=['gender_cd']).head(10)
```

"What is a dummy variable?" I wondered, so I looked it up.
It seems that one column is created for each category value, and each cell shows with `true` or `false` (1 or 0) whether the row has that value.
Honestly, it's faster to just look at a table:

| Male | Female | Unknown |
|---|---|---|
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 1 | 0 |
| 0 | 1 | 0 |

Something like this.
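
Here is a small sketch of the same idea on invented data. `df_demo` is a hypothetical frame, not the real `df_customer`. Note that in pandas 2.0 and later `get_dummies` returns `True`/`False` by default, so `dtype=int` is passed here to get the 0/1 form shown in the table.

```py
import pandas as pd

# Hypothetical data: gender codes 0 (male), 1 (female), 9 (unknown)
df_demo = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "gender_cd": ["1", "9", "0"],
})

# One new column per distinct gender_cd value; customer_id is left as-is
dummies = pd.get_dummies(df_demo, columns=["gender_cd"], dtype=int)
print(dummies)
```

Each row has exactly one 1 among the `gender_cd_*` columns, because every customer has exactly one gender code.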
P-059: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), **standardize** the totals to a mean of 0 and a standard deviation of 1, and display them together with the customer ID and the total sales amount. Either the unbiased standard deviation or the sample standard deviation may be used for the standardization. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.
…
……
………
I read various sites and tried to understand, but ~~I skipped math back then, so~~ my understanding couldn't keep up.
Maybe I should ask my senior colleagues to recommend a reference site. I also wrote
`df['hyou1'] = df['amount_sum'] - df.amount_sum.mean()`
which was a mistake, because it only subtracts the mean and never divides by the standard deviation.
So I looked up `preprocessing.scale` from the model answer and tried to understand it by working backward.
https://note.nkmk.me/python-list-ndarray-dataframe-normalize-standardize/ (second half of the page)
"Normalization / standardization of pandas.DataFrame and pandas.Series", "Use pandas methods" (excerpt omitted)
In the program:

```py
print((df.T - df.T.mean()) / df.T.std())
#    col1  col2  col3
# a  -1.0   0.0   1.0
# b  -1.0   0.0   1.0
# c  -1.0   0.0   1.0
```

This is what I needed.
In other words, the calculation is
`(data - data.mean()) / data.std()`
so:
#### **`mine59.py`**
```py
df = df_receipt.copy()
# Exclude non-members (customer IDs starting with "Z")
df = df.query('not customer_id.str.startswith("Z")', engine='python')
df = df.groupby('customer_id').agg({'amount': 'sum'}).reset_index()
# Standardize: subtract the mean, then divide by the standard deviation
df['hyou1'] = (df['amount'] - df.amount.mean()) / df.amount.std()
df.head(10)

'''Model answer'''
# Uses sklearn's preprocessing.scale, so the sample standard deviation (divide by n) is used
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_ss'] = preprocessing.scale(df_sales_amount['amount'])
df_sales_amount.head(10)
```

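It matched the model answer, at least approximately: `Series.std()` defaults to the unbiased standard deviation (`ddof=1`), while `preprocessing.scale` divides by n, so the last digits can differ slightly.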
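
To see how much the choice of standard deviation matters, here is a rough sketch on four made-up totals (the Series `s` is invented, not taken from `df_receipt`). It compares the pandas default with `ddof=0`, which is the divide-by-n version.

```py
import pandas as pd

s = pd.Series([100, 200, 300, 400], name="amount")  # made-up totals

# pandas default: unbiased standard deviation (divide by n - 1)
z_unbiased = (s - s.mean()) / s.std()

# ddof=0: divide by n, the same convention np.std uses by default
z_population = (s - s.mean()) / s.std(ddof=0)

print(z_unbiased.round(4).tolist())
print(z_population.round(4).tolist())
```

The two results differ only by a constant factor of sqrt(n / (n - 1)). With four rows that is clearly visible, but with thousands of customers the factor is very close to 1, which would explain why the columns look the same at a glance.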
P-060: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), normalize the totals to a minimum of 0 and a maximum of 1, and display them together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.
The same site also has

```py
print((df - df.min()) / (df.max() - df.min()))
#    col1  col2  col3
# a   0.0   0.0   0.0
# b   0.5   0.5   0.5
# c   1.0   1.0   1.0
```

so I adapted it.
#### **`mine60.py`**
```py
df = df_receipt.copy()
# Exclude non-members (customer IDs starting with "Z")
df = df.query('not customer_id.str.startswith("Z")', engine='python')
df = df.groupby('customer_id').agg({'amount': 'sum'}).reset_index()
# Min-max normalization: the smallest total maps to 0, the largest to 1
df['minmax'] = (df['amount'] - df.amount.min()) / (df.amount.max() - df.amount.min())
df.head(10)

'''Model answer'''
# Uses sklearn's preprocessing.minmax_scale
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_mm'] = preprocessing.minmax_scale(df_sales_amount['amount'])
df_sales_amount.head(10)
```

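
As a quick check of the min-max formula, here is a tiny sketch on three made-up totals (the Series `s` is invented for illustration).

```py
import pandas as pd

s = pd.Series([100, 200, 400], name="amount")  # made-up totals

# Shift so the minimum becomes 0, then divide by the range so the maximum becomes 1
scaled = (s - s.min()) / (s.max() - s.min())
print(scaled.tolist())
```

The smallest value becomes 0, the largest becomes 1, and 200 lands a third of the way between them. One thing to watch for: if every value were identical, the range would be 0 and this formula would give NaN.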
P-061: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), convert the totals to the common logarithm (base 10), and display them together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.
P-062: Total the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), convert the totals to the natural logarithm (base e), and display them together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 results is enough.
The log transforms can be computed with the functions in `math`:
#### **`mine61_62.py`**
```py
import math

df = df_receipt.copy()
# Exclude non-members (customer IDs starting with "Z")
df = df.query('not customer_id.str.startswith("Z")', engine='python')
df = df.groupby('customer_id').agg({'amount': 'sum'}).reset_index()
# P-061: common logarithm (base 10)
df['jouyou'] = df.amount.apply(lambda x: math.log10(x))
# P-062: natural logarithm (base e)
df['shizen'] = df.amount.apply(lambda x: math.log(x))
df.head(10)
```

This produces both columns.
#### **`mohan61_62.py`**
```py
# P-061 model answer: common logarithm with np.log10 (note the + 1)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_log10'] = np.log10(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)

# P-062 model answer: natural logarithm with np.log (note the + 1)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_loge'] = np.log(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)
```

………
What is the `+ 1` for?
Logarithms are where I started to get lost in high school math, so I'm really sorry.
If you know what this `+ 1` is for, please comment. I really have no idea.
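
One common reason for a `+ 1` like this, offered as a guess rather than the model answer's stated intent: the logarithm of 0 is undefined (NumPy returns `-inf`), so adding 1 first keeps a total of 0 from breaking the column. The sketch below uses made-up totals; the array `amounts` is not from the real data.

```py
import numpy as np

amounts = np.array([0, 9, 99])  # made-up totals; the 0 is the problem case

# Without the shift, log10(0) is -inf (NumPy also emits a warning, silenced here)
with np.errstate(divide="ignore"):
    plain = np.log10(amounts)

# With the shift, log10(0 + 1) = 0, so a zero total maps cleanly to 0
shifted = np.log10(amounts + 1)

print(plain)
print(shifted)
```

For the natural logarithm, NumPy also provides `np.log1p(x)`, which computes `log(1 + x)` directly. Whether the dataset actually contains zero totals is something I haven't checked.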