Let's rewrite the reconstruction node that converts vertically held data to horizontally held with SPSS Modeler with Python pandas.
From the following POS data with ID, let's aggregate (1) the total purchase amount for each product category and (2) the purchase ratio for each product category for each customer.
■ Before processing We use ID-attached POS data that records who (CUSTID) purchased when (SDATE) and what (PRODUCTID, L_CLASS product major classification, M_CLASS product middle classification) and how much (SUBTOTAL).
■ After processing For each customer (CUSTID), the total purchase amount for each product category (1) and the purchase ratio for each product category (L_CLASS) are totaled.
There are three major product categories: BAG, COMETICS, and SHOES. The 100001 customer has a purchase of 7243 yen for BAG, 10273 yen for COSMETICS, and 26870 yen for SHOES. If you calculate it as a percentage of the amount, it will be BAG 16.3%, COSMETICS 23.1%, SHOES 60.5%. When such aggregation is performed, the characteristics of the customer become apparent.
First, (1) find up to the total purchase amount for each product category. Reconfigure nodes are commonly used in combinations of data type nodes, record summaries, and even replacement nodes. In addition, this time we will also combine record aggregation and record join nodes to calculate the total purchase price.
First, the data type node recognizes what category value is in the product classification (L_CLASS). When you perform "Read Value" on the data type node, all the major categories are automatically recognized.
Next, select the value of the field you want to hold horizontally in the reconfiguration node. In the example below, BAG, COMETICS, SHOES are expanded as columns for the product major classification (L_CLASS), and subtotals (SUBSTOTAL) are set for the values.
A preview of the result of this setting looks like below. You can see that the subtotal (SUBSTOTAL) has been sorted into the columns of each product major classification (L_CLASS).
Next, use the "Record Aggregation" node to aggregate the subtotal values of BAG, COMETICS, and SHOES into one record by the customer (uncheck because the record frequency is unnecessary).
With this, the total purchase amount for each BAG, COMETICS, and SHOES product category for each customer has been aggregated. It was calculated that the customer number 100001 had a purchase of 7243 yen for BAG, 10273 yen for COSMETICS, and 26870 yen for SHOES.
It's almost complete, but it has some null values. This means that this customer has never made a purchase in this product category. If this is left as it is, it will be difficult to use for calculation, so replace NULL with 0.
It is the replacement node that does that. Select the node with the total value of BAG, COMETICS, SHOES and set it to 0 if it is null.
Nulls have been converted to 0 as shown below.
Next, the total purchase amount for each customer is calculated by adding the record aggregation and record combination nodes.
First, the record aggregation node calculates the total subtotal (SUBTOTAL) of each customer (CUSTID), that is, the total purchase amount (again, the record frequency is not required).
The 100001 customer purchases a total of 44,386 yen.
Combine this aggregation result with the aggregation of the total amount of BAG, COMETICS, and SHOES earlier with the record join node.
Then, the total purchase amount could be combined after the total amount of BAG, COMETICS, and SHOES for each customer.
At the end, I shortened the long column name in the filter node (this work is not mandatory).
The final data looks like this:
From here, it is a little applied usage as a reconfiguration node, but I will introduce it because I often want to calculate a "ratio" that is easy to compare with other customers.
Let's take a look at the data we created earlier.
For the 100001 customer, the purchase amount ratio can be calculated by dividing BAG by 7243 yen ÷ 44,386 yen, COSMETICS by 10273 yen ÷ 44,386 yen, and SHOES by 26870 yen ÷ 44,386 yen. Do this calculation on the field creation node.
Set multiple modes and select BAG_SUM, COSMETICS_SUM, SHOES_SUM in the field list. Then you can use @FIELD in the CLEM expression to refer to the BAG_SUM, COSMETICS_SUM, and SHOES_SUM columns. here @FIELD/ALL_SUM By specifying Three ratios of BAG_SUM / ALL_SUM, COSMETICS_SUM / ALL_SUM, and SHOES_SUM / ALL_SUM are calculated.
With this, the purchase ratio for each major product category of BAG, COMETICS, and SHOES for each customer has been aggregated. The purchase ratio of customer No. 100001 is BAG 16.3%, COSMETICS 23.1%, SHOES 60.5%.
I will try the same data processing with pandas. Actually (1) Calculation of the total purchase amount for each product category can be done very easily with one method called pivot_table in pandas.
pivot_tran_df= df.pivot_table(
index=['CUSTID'],columns=['L_CLASS'],values=['SUBTOTAL'],
aggfunc='sum',
fill_value=0,
margins=True, margins_name='ALL')
--index = ['CUSTID'], columns = ['L_CLASS'], values = ['SUBTOTAL'] and expand SUBTOTAL for each major category of BAG, COMETICS, SHOES included in L_CLASS using CUSTID as a key. .. This is the image of the reconfiguration node settings. --Indicates that the total value of SUBTOTAL is calculated with aggfunc ='sum'. Here is an image of the aggregation node. --fill_value = 0 indicates that the part that became NULL is filled with 0. Here is an image of the replacement node. --margins = True, margins_name ='ALL' indicates to add column totals and row totals. This serves as an aggregation node and a record join node.
The purchase amount and total purchase amount of BAG, COMETICS, SHOES can be calculated for each CUSTID as shown below.
--Reference
The column total column is required, but the row total row is not needed this time, so we will delete it below.
pivot_tran_df=pivot_tran_df[:-1]
--Reference --python — How to remove the last line of data in a pandas data frame https://www.it-swarm.dev/ja/python/pandas%E3%83%87%E3%83%BC%E3%82% BF% E3% 83% 95% E3% 83% AC% E3% 83% BC% E3% 83% A0% E3% 81% AE% E3% 83% 87% E3% 83% BC% E3% 82% BF% E3% 81% AE% E6% 9C% 80% E5% BE% 8C% E3% 81% AE% E8% A1% 8C% E3% 82% 92% E5% 89% 8A% E9% 99% A4% E3% 81% 99% E3% 82% 8B% E6% 96% B9% E6% B3% 95/1050572731 /
If you use pivot_table, the columns will be hierarchical and multi-column.
If it is a multi-column, it is difficult to combine and handle it, so convert it to a flat snake format (concatenated with underscore) column name. According to the Modeler naming convention, it is as follows.
pivot_tran_df.columns = [
pivot_tran_df.columns.names[1]+"_"+levels[1]+"_"+levels[0]
for levels in pivot_tran_df.columns]
I also used a regular expression here and changed it to a shorter column name like BAG_SUM.
import re
pivot_tran_df=pivot_tran_df.rename(
columns= lambda str:
re.sub('L_CLASS_(.+)_SUBTOTAL',r'\1_SUM',str))
(2) In Modeler, the purchase ratio for each product category was calculated by @ FIELD / ALL_SUM.
pandas uses the div method. Close to the image of the Replace node in Modeler, all columns are divided by ALL_SUM and replaced. ALL_SUM itself is also divided to 1.0.
pivot_tran_ratio_df=pivot_tran_df.div(pivot_tran_df["ALL_SUM"], axis=0)
--Reference
Drop ALL_SUM and rename it because it's RATIO (percentage) instead of SUM (total).
import re
pivot_tran_ratio_df=pivot_tran_ratio_df\
.drop(columns=['ALL_SUM'])\
.rename(columns= lambda str:re.sub('_SUM','_RATIO',str))
Finally, (1) Rejoin the DataFrame of the total purchase amount for each product category to complete.
pivot_tran_df=pivot_tran_df.join(pivot_tran_ratio_df)
The sample is placed below.
stream https://github.com/hkwd/200611Modeler2Python/blob/master/Restructure/Restructure.str?raw=true notebook https://github.com/hkwd/200611Modeler2Python/blob/master/Restructure/restructure.ipynb data https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv
■ Test environment Modeler 18.2.1 Windows 10 64bit Python 3.6.9 pandas 0.24.1
[Relay serialization] My recommended node --Awesome technician "reconstruction node" that generates features from POS with ID and IoT time series data | IBM Solution Blog https://www.ibm.com/blogs/solutions/jp -ja / spssmodeler-push-node-4 /
Recommended Posts