SPSS Modeler provides the Sample node for sampling. This article explains how the node works and reproduces it with Python pandas.

There are two types of sampling: ① simple sampling and ② complex sampling that reflects trends in the data. The previous article explained ① simple sampling. This article explains ② complex sampling.

① Simple sampling: ①-1. First N cases; ①-2. Random sampling
② Complex sampling ← This article: ②-1. Stratified sampling; ②-2. Cluster sampling

0. Raw data

The target is ID-attached POS data, which records who (CUSTID) purchased what (PRODUCTID; L_CLASS, the product major classification; M_CLASS, the product middle classification), when (SDATE), and for how much (SUBTOTAL).

The data has 6 fields and 28,599 records.

1m. ②-1. Stratified sampling Modeler version

Random sampling reflects the trends of the full data as long as enough records are drawn. However, when the distribution is heavily skewed, some values exist only in small proportions, and a small sample may fail to reflect them.

For example, look at the distribution of M_CLASS (product middle classification) in this data. SHOES01 was sold 631 times, which is only 2.21% of the total.

After sampling this data at 0.2%, SHOES01 has disappeared from the M_CLASS distribution. The proportions of the other values also differ from the original distribution.
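How likely it is for SHOES01 to vanish can be estimated with a few lines of arithmetic. The sketch below assumes the 0.2% sample is a simple random sample without replacement of round(28,599 × 0.002) = 57 records; the counts come from the article, while the variable names are only illustrative. Under that assumption the probability of drawing no SHOES01 record at all comes out at roughly 28%, so the disappearance seen above is not an unusual accident.

```python
from math import comb

N = 28599             # records in the data (from the article)
K = 631               # records with M_CLASS == SHOES01 (from the article)
n = round(N * 0.002)  # size of a 0.2% sample

# Probability that none of the n sampled records is SHOES01
# (hypergeometric: all n records are chosen from the N - K other records)
p_missing = comb(N - K, n) / comb(N, n)

# With-replacement approximation, for comparison
p_missing_approx = (1 - K / N) ** n

print(n, round(p_missing, 3), round(p_missing_approx, 3))
```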

Ideally the sample size should be increased in such a case, but when a small sample is unavoidable, for example for verification data, stratified sampling can be used.

This method samples the data separately within each stratum. In this example, the idea is to sample within each value of M_CLASS (product middle classification).

Stratified sampling is also performed with the Sample node. Set the sample method to Complex and specify the sample size; here, 0.002 (0.2%) is specified. Then click the Cluster and Stratify button to specify the stratification variables. Here, M_CLASS (product middle classification) is specified as the stratification variable.

Also, the random seed setting is checked so that sampling can be reproduced.

The result has a column called SampleWeight, which contains the weights used internally during sampling. The value is the same within each M_CLASS. It is normally not needed, so you can remove it with a Filter node.
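To see what such a weight could look like outside Modeler, the sketch below builds a comparable column in pandas. It assumes the weight is the number of original rows that each sampled row stands for (stratum size divided by sampled rows in that stratum); whether this matches Modeler's internal calculation exactly is an assumption. The frame demo_df and its strata A, B and C are made up, and DataFrameGroupBy.sample requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical data: three strata of different sizes
demo_df = pd.DataFrame({"M_CLASS": ["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000})

# 1% sample inside each stratum (pandas 1.1 or later)
sampled = demo_df.groupby("M_CLASS").sample(frac=0.01, random_state=1)

# Rows per stratum in the full data and in the sample
n_all = demo_df["M_CLASS"].value_counts()
n_sampled = sampled["M_CLASS"].value_counts()

# Assumed weight definition: original rows represented by each sampled row
weights = n_all / n_sampled
sampled = sampled.assign(SampleWeight=sampled["M_CLASS"].map(weights))

# The weight is constant within each stratum
print(sampled.groupby("M_CLASS")["SampleWeight"].agg(["min", "max", "count"]))
```

Summing the weights gives back the size of the full data, which is what makes such a column useful when totals are estimated from the sample.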

In the M_CLASS distribution of the sampled result, SHOES01 is present, and every value is close to its share in the original distribution.

Note that SQL pushback does not work for stratified sampling. The node turns purple and the generated SQL appears to check the stratification column for empty strings, but the sampling itself is not converted to SQL.

1p. ②-1. Stratified sampling pandas version

To do stratified sampling in pandas, combine groupby and sample. First, group by 'M_CLASS'. Specifying group_keys=False keeps the group keys out of the index, so the result does not get a MultiIndex.

Then, for the block of rows of each M_CLASS, a lambda expression calls sample to draw a 0.2% random sample.

# Draw a 0.2% random sample within each M_CLASS group
Stratified_df = df.groupby('M_CLASS', group_keys=False)\
    .apply(lambda x: x.sample(frac=0.002, random_state=1))

The result is a 0.2% sample drawn within each M_CLASS group.
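One caveat with a small fraction: pandas rounds frac times the group size, so a stratum with fewer than 250 rows gets round(0.002 × n) = 0 rows and drops out even under stratified sampling. The sketch below shows this on made-up data (the strata BIG and RARE and their sizes are hypothetical) and adds a variant that keeps at least one row per stratum. The variant is an illustration of one possible workaround, not something the Modeler node is known to do. DataFrameGroupBy.sample requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical data: "RARE" is too small to survive a 0.2% draw
demo_df = pd.DataFrame(
    {"M_CLASS": ["BIG"] * 20000 + ["SHOES01"] * 631 + ["RARE"] * 100}
)
frac = 0.002

# Per-group 0.2% sample, the same idea as groupby + apply + sample
plain = demo_df.groupby("M_CLASS").sample(frac=frac, random_state=1)

# Variant: sample each stratum separately and keep at least one row
at_least_one = pd.concat(
    [
        g.sample(n=max(1, round(len(g) * frac)), random_state=1)
        for _, g in demo_df.groupby("M_CLASS")
    ]
)

print(plain["M_CLASS"].value_counts())
print(at_least_one["M_CLASS"].value_counts())
```

Forcing a minimum of one row slightly over-represents tiny strata, so whether it is appropriate depends on what the sample is used for.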

Reference
python-Panda layered sampling
https://www.366service.com/jp/qa/2a7f0f8e735384ecbe3b386f1715396e
Stratified sampling of python Pandas
https://www.it-swarm-ja.tech/ja/python/pandas%E3%81%AE%E5%B1%A4%E5%88%A5%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AA%E3%83%B3%E3%82%B0/832430095/

Another option is StratifiedShuffleSplit from scikit-learn. This object performs stratified sampling when splitting data into training data and test data.

The sample sizes of the training data and test data are set with the train_size and test_size arguments of StratifiedShuffleSplit, and random_state is the random seed. Because the object is designed for splitting training and test data, it takes both sizes; here the training part is the sample that is kept.

Calling the split method of the instantiated object (sample) with the DataFrame (df) and the column to stratify on (df['M_CLASS']) returns the row positions (train_, test_) of the training data and test data. These positions are then used to create a new DataFrame (StratifiedShuffleSplit_df).

from sklearn.model_selection import StratifiedShuffleSplit
sample = StratifiedShuffleSplit(n_splits=1, train_size=0.002, test_size=0.01, random_state=1)
for train_, test_ in sample.split(df, df['M_CLASS']):
    # split yields positional indices, so select the rows with iloc
    StratifiedShuffleSplit_df = df.iloc[train_]
#    chunk_test = df.iloc[test_]

Reference
How to create test data using Stratified Shuffle Split of scikit-learn
https://www.randomlyforest.com/entry/2019/01/14/215102
sklearn.model_selection.StratifiedShuffleSplit — scikit-learn 0.23.1 documentation
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
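For a one-off stratified sample, scikit-learn's train_test_split with the stratify argument is a shorter route to the same kind of split. This is a different function from the one used above, shown only as an alternative. The frame demo_df and the class names other than SHOES01 are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with one small class (2% of the rows)
demo_df = pd.DataFrame(
    {"M_CLASS": ["SHOES01"] * 400 + ["CLASS_A"] * 11600 + ["CLASS_B"] * 8000}
)

# Keep 0.2% as the stratified sample; the remaining rows come back as the second frame
strat_df, rest_df = train_test_split(
    demo_df,
    train_size=0.002,
    stratify=demo_df["M_CLASS"],
    random_state=1,
)

print(strat_df["M_CLASS"].value_counts())
```

Because train_test_split returns DataFrames directly, there is no index handling to get wrong, at the cost of less control than StratifiedShuffleSplit offers when several splits are needed.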

Comparing the M_CLASS distribution of the full data, the stratified samples, and the simple random sample shows that SHOES01 is missing from the simple random sample, which therefore fails to reflect the distribution of the full data.
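A comparison like this can be put into a single table. The sketch below uses randomly generated stand-in data (the class names other than SHOES01 and all shares are assumptions, not the article's POS data) and a small hypothetical helper, share, to line up the relative frequencies. DataFrameGroupBy.sample requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the POS data, with one rare class
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "M_CLASS": rng.choice(
            ["SHOES01", "CLASS_A", "CLASS_B", "CLASS_C"],
            size=28599,
            p=[0.02, 0.38, 0.35, 0.25],
        )
    }
)


def share(frame, name):
    """Relative frequency of each M_CLASS value, as a named Series."""
    return frame["M_CLASS"].value_counts(normalize=True).rename(name)


simple_df = demo_df.sample(frac=0.002, random_state=1)
stratified_df = demo_df.groupby("M_CLASS").sample(frac=0.002, random_state=1)

# One row per class, one column per dataset; classes absent from a sample show 0
comparison = (
    pd.concat(
        [share(demo_df, "all"), share(stratified_df, "stratified"), share(simple_df, "simple")],
        axis=1,
    )
    .fillna(0)
    .sort_index()
)
print(comparison.round(3))
```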

2m. ②-2. Cluster sampling Modeler version

The data in this article consists of purchase transactions. Random sampling from the whole data thins out the items purchased by each customer. The number of purchases and the purchase amount per person become small, which makes it hard to capture the purchasing tendency of, for example, people who buy SHOES often. You can still analyze which products sell best across all transactions, but the data is no longer suitable for customer-oriented analysis.
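The thinning effect is easy to see on a toy example. In the sketch below, every customer in a made-up frame has exactly 20 transactions; after a 10% row-level sample, the customers that still appear are left with only a few rows each. The data and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: 300 customers with 20 rows each
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"CUSTID": np.repeat(np.arange(300), 20)})
demo_df["SUBTOTAL"] = rng.integers(100, 10000, size=len(demo_df))

# Row-level 10% random sample
row_sample = demo_df.sample(frac=0.1, random_state=1)

# Average number of transactions per customer, before and after
per_cust_all = demo_df.groupby("CUSTID").size().mean()
per_cust_sample = row_sample.groupby("CUSTID").size().mean()
print(per_cust_all, per_cust_sample)
```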

In such a case, use cluster sampling (sampling by aggregated ID), which samples at the customer ID level. Because sampling is done by customer ID, all transactions of each extracted customer ID are retained, so analysis along the customer axis remains possible.

Cluster sampling is also performed with the Sample node. Set the sample method to Complex and specify the sample size; here, 0.1 (10%) is specified. Then click the Cluster and Stratify button to specify the cluster variable. Here, CUSTID is specified as the cluster.

Also, the random seed setting is checked so that sampling can be reproduced.

10% of all CUSTIDs were randomly sampled and the transactions of the extracted CUSTIDs were kept, giving 2,652 records. A SampleWeight column has also been added, but I don't think it is used for complex sampling.

However, SQL pushback does not work when cluster sampling is performed with this function of the Sample node. It is therefore recommended to sample the CUSTIDs with an Aggregate node and simple random sampling, and then join the result back to the original data.

First, create a dataset with one record per CUSTID using the Aggregate node.

Next, in the Sample node, set the sample method to Simple and specify a Random % of 10.

Then join the sampled CUSTIDs to the transactions of the original data.

With this method, SQL pushback works. Random sampling is done by RAND(2743707) < 1.0000000000000001e-01, and the transactions are joined by WHERE (T0.CUSTID = T1.CUSTID).

[2020-08-12 12:58:45] Previewing SQL: SELECT T1.SDATE AS SDATE, T1.PRODUCTID AS PRODUCTID, T1."L_CLASS" AS "L_CLASS", T1."M_CLASS" AS "M_CLASS", T1.SUBTOTAL AS SUBTOTAL, T0.CUSTID AS CUSTID FROM (SELECT T0.CUSTID AS CUSTID FROM (SELECT T0.CUSTID AS CUSTID FROM SAMPLETRANDEPT4EN2019S T0 GROUP BY T0.CUSTID) T0 WHERE RAND(2743707) < 1.0000000000000001e-01) T0, (SELECT T0.CUSTID AS CUSTID, T0.SDATE AS SDATE, T0.PRODUCTID AS PRODUCTID, T0."L_CLASS" AS "L_CLASS", T0."M_CLASS" AS "M_CLASS", T0.SUBTOTAL AS SUBTOTAL FROM SAMPLETRANDEPT4EN2019S T0) T1 WHERE (T0.CUSTID = T1.CUSTID)
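Read literally, the SQL above keeps each CUSTID independently with probability 0.1, so the number of sampled customers is only approximately 10%. The three steps can be mirrored in pandas as sketched below. The data is made up, and the seed value is reused from the SQL purely as a seed: NumPy's generator will not reproduce the database's random numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: 1,000 customers with 1 to 5 rows each
rng = np.random.default_rng(0)
cust_ids = np.repeat(np.arange(1000), rng.integers(1, 6, size=1000))
demo_df = pd.DataFrame(
    {"CUSTID": cust_ids, "SUBTOTAL": rng.integers(100, 10000, size=len(cust_ids))}
)

# Step 1 (GROUP BY CUSTID): one row per customer
customers = demo_df[["CUSTID"]].drop_duplicates()

# Step 2 (WHERE RAND(seed) < 0.1): keep each customer independently with probability 0.1
keep = np.random.default_rng(2743707).random(len(customers)) < 0.1
sampled_customers = customers[keep]

# Step 3 (WHERE T0.CUSTID = T1.CUSTID): inner join back to the transactions
cluster_df = demo_df.merge(sampled_customers, on="CUSTID", how="inner")

print(len(sampled_customers), "customers,", len(cluster_df), "rows")
```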

2p. ②-2. Cluster sampling pandas version

For cluster sampling with pandas, use the unique, sample, and isin functions. The process is the same as using the Aggregate node, Sample node, and Merge node in Modeler.

unique creates the set of distinct CUSTIDs, sample draws a random sample from it, and isin extracts from the original transactions only the rows whose CUSTID was sampled.

import pandas as pd
# Sample 10% of the distinct CUSTIDs, then keep only the transactions of those customers
df_custid = pd.Series(df['CUSTID'].unique()).sample(frac=0.1, random_state=1)
df[df['CUSTID'].isin(df_custid)]

This returns all transactions of the sampled customers, which completes the cluster sampling.
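The same steps can be wrapped in a small function and checked on toy data. In the sketch below, cluster_sample is a hypothetical helper name and demo_df is made up; the check at the end confirms that every sampled customer keeps all of their transactions.

```python
import numpy as np
import pandas as pd


def cluster_sample(df, key, frac, random_state=None):
    """Sample `frac` of the distinct `key` values and return all of their rows."""
    sampled_keys = pd.Series(df[key].unique()).sample(frac=frac, random_state=random_state)
    return df[df[key].isin(sampled_keys)]


# Hypothetical transactions: 500 customers with 1 to 7 rows each
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"CUSTID": np.repeat(np.arange(500), rng.integers(1, 8, size=500))})
demo_df["SUBTOTAL"] = rng.integers(100, 10000, size=len(demo_df))

cluster_df = cluster_sample(demo_df, "CUSTID", frac=0.1, random_state=1)

# Transactions per sampled customer are identical before and after sampling
before = demo_df.groupby("CUSTID").size()
after = cluster_df.groupby("CUSTID").size()
print(cluster_df["CUSTID"].nunique(), "customers,", len(cluster_df), "rows")
print((before.loc[after.index] == after).all())
```

Unlike the RAND-based SQL, sample(frac=0.1) fixes the number of sampled customers at 10% of the distinct IDs, so the two approaches can return slightly different sample sizes.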

3. Sample

The sample files are available below.

stream: https://github.com/hkwd/200611Modeler2Python/raw/master/sample/sample.str
notebook: https://github.com/hkwd/200611Modeler2Python/blob/master/sample/sampling.ipynb
data: https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv

■ Test environment
Modeler 18.2.1
Windows 10 64bit
Python 3.6.9
pandas 0.24.1

4. Reference information

Random sampling - Wikipedia
https://en.wikipedia.org/wiki/%E7%84%A1%E4%BD%9C%E7%82%BA%E6%8A%BD%E5%87%BA#%E7%B5%B1%E8%A8%88%E8%AA%BF%E6%9F%BB%E3%81%AB%E3%81%8A%E3%81%91%E3%82%8B%E7%84%A1%E4%BD%9C%E7%82%BA%E6%8A%BD%E5%87%BA%E3%81%AE%E6%89%8B%E6%B3%95
The explanation of the stratified sampling method and the cluster sampling method is easy to understand.

Sampling node https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.1/modeler_mainhelp_client_ddita/clementine/mainwindow_navigationstreamsoutputtab.html

Rewrite the sampling node of SPSS Modeler with Python (2): Layered sampling, cluster sampling