When a dataset has a large number of records, analysis takes time. In such cases it is common to create a subset (sample) of the cases, reduce the data to a processable size, and then analyze it. Even when the final analysis will use all cases, a small subset is often used first for a lightweight preliminary analysis. Making a small sample from the full data in this way is called sampling.
It is the sampling node that performs sampling in SPSS Modeler. This article explains the sampling node and rewrites its behavior with Python pandas.
There are two types of sampling: (1) simple sampling and (2) complex sampling that reflects trends in the data. Each is explained in its own article, in two parts.
The target is the following ID-attached POS data, which records who (CUSTID) purchased what (PRODUCTID, with L_CLASS as the product major classification and M_CLASS as the product middle classification), when (SDATE), and for how much (SUBTOTAL).
There are 28,599 records with 6 fields.
Extract the first 100 records. This is a very light operation, and it is also useful for previewing data that is known to be large. It is the default mode of Modeler's sampling node.
■ Image after processing The order is unchanged, and the first 100 records have been extracted.
Use the sampling node.
The sample method is "Simple", and the mode is "First n cases" with 100 cases specified.
Incidentally, SQL pushback works for the sampling node's "First n cases". The following is an example of the SQL pushed back to Db2; FETCH FIRST 100 ROWS ONLY is used.
SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM (SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM SAMPLETRANDEPT4EN2019S T0 FETCH FIRST 100 ROWS ONLY) T0
With file input, repeatedly executing "first n cases" returns the same result every time. An RDB, however, does not guarantee row order, so the same result may not always be returned. If you need reproducible results, insert a sort node or similar to fix the order before the sampling node.
To get the first n rows in pandas, use the head method.
df.head(100)
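Putting this together with the ordering caveat above: a minimal sketch that fixes the order with sort_values before taking the first n rows. The tiny inline DataFrame is a stand-in for the POS data (column names follow the article's dataset):

```python
import pandas as pd

# Stand-in for the ID-attached POS data (a few rows, same column names)
df = pd.DataFrame({
    "CUSTID": [100003, 100001, 100002, 100001],
    "SDATE": ["2019-01-05", "2019-01-02", "2019-01-03", "2019-01-04"],
    "SUBTOTAL": [300, 500, 800, 200],
})

# Fix the order first, then take the first n rows; this makes
# "first n cases" reproducible even if the input order varies
first_n = df.sort_values(["CUSTID", "SDATE"]).head(2)
print(first_n["CUSTID"].tolist())  # → [100001, 100001]
```

This mirrors the sort node + sampling node combination recommended above for RDB sources.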
Randomly extract n% of all cases. Taking the first N records is a light method, but only safe when the original data is not ordered in a biased way. In practice, data is often recorded in order of occurrence, and its characteristics often differ by position: with one day's worth of data, for example, morning records may show different trends from evening ones. Random sampling (simple random sampling) is the most common sampling method and avoids this problem.
Note that this process in effect rolls a die for every record, so it is a heavy process when the amount of data is large.
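Conceptually, that per-record die roll can be sketched as a Boolean mask (a sketch of the idea only, not how Modeler implements it):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"SUBTOTAL": range(10_000)})  # stand-in for the POS data

# "Roll a die" once per record: keep a row with probability 0.2%
mask = rng.random(len(df)) < 0.002
sampled = df[mask]
```

Every record is visited once, which is why the cost grows linearly with the data size.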
Again, this is done with the sampling node.
The sample method is "Simple", and the mode is "Random %", set to 0.2%.
Also check the random seed setting. With it checked, repeated executions of the sampling return the same result. It is usually left checked, because the analysis becomes unstable if the data changes between runs. If you want a different sampling result, click the Generate seed button to change the seed for the next run.
■ Image after processing 0.2% of the rows were randomly selected: 68 records out of 28,599. Unlike "First n cases", the selection is not limited to the first customers in the data, and each customer's transactions are picked sparsely.
Incidentally, SQL pushback also works for the sampling node's "Random %". The following is an example of the SQL pushed back to Db2; TABLESAMPLE BERNOULLI(0.200000) REPEATABLE(377028) is used. Other RDBs may use different syntax, and some may not support seeds (in which case the sampling result is not reproducible).
SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM (SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM SAMPLETRANDEPT4EN2019S T0 TABLESAMPLE BERNOULLI(0.200000) REPEATABLE(377028)) T0
As noted above, random sampling processes all cases and can be heavy, so it is better to cache the sampling result, or write it out to a file or table, to make subsequent analysis more efficient.
You may also want to random-sample a fixed number of cases. In that case, calculate the ratio as (number of samples) / (total number) * 100%. To randomly sample 100 records from this data, that is 100 / 28,599 * 100 ≈ 0.35%.
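The ratio is simple arithmetic, which can be checked in Python:

```python
# Percentage needed to draw roughly `target` records from `total`
total = 28599   # record count of this dataset
target = 100
pct = target / total * 100
print(round(pct, 2))  # → 0.35
```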
If you set Random % to 0.35 and specify a maximum sample size of 100, the output is capped at 100 records.
■ Image after processing 100 records were randomly sampled.
To do random sampling in pandas, use the sample method. Set the proportion with frac, which takes a fraction rather than a percentage (1.0 means 100%); here 0.2% is given as 0.002. Specify the random seed with random_state.
df.sample(frac=0.002, random_state=1)
To reassign the index, call reset_index(drop=True).
df.sample(frac=0.002, random_state=1).reset_index(drop=True)
To specify the number of cases instead of a ratio, pass n instead of frac.
df.sample(n=100, random_state=1)
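As with the Modeler seed setting, a fixed random_state makes the pandas sample reproducible. A minimal check on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})  # toy stand-in for the POS data

# The same random_state always returns the same rows
a = df.sample(n=100, random_state=1)
b = df.sample(n=100, random_state=1)
c = df.sample(n=100, random_state=2)  # a different seed gives a different sample

print(a.equals(b), a.equals(c))
```

This is the pandas counterpart of checking the random seed box in the sampling node.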
The sample files are available below.
stream https://github.com/hkwd/200611Modeler2Python/raw/master/sample/sample.str
notebook https://github.com/hkwd/200611Modeler2Python/blob/master/sample/sampling.ipynb
data https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv
■ Test environment Modeler 18.2.1 Windows 10 64bit Python 3.6.9 pandas 0.24.1
Sampling node https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.1/modeler_mainhelp_client_ddita/clementine/mainwindow_navigationstreamsoutputtab.html