In this article we extract features from time-series sensor data using the SPSS Modeler "field creation" node (the Derive node), which creates new fields by applying functions to existing data. We then rewrite the same processing with Python pandas.
SPSS Modeler provides nodes for many kinds of data processing, and the "field creation" node is a general-purpose one with a high degree of freedom.
The processing pattern is selected from the "Derived" list. "Derived" may be hard to picture; it is a translation of the English "derive" and means a processing pattern that creates a new field derived from the original data. I will explain the patterns in the order in which I personally use them most.
All of them process records from top to bottom. For the count type and state type in particular, it is essential to be aware of the record processing order.
Since it is a general-purpose node, various processing can be considered, but this time we will use it for the purpose of extracting features from time-series sensor data.
Raw time-series sensor data does not carry much information as it is, so the key to analysis is to process it and create effective features. It would be easy if a simple rule such as "an error occurs when the power exceeds 200W" held. In most cases, however, meaningful analysis requires information about how the sensor value has actually changed: for example, whether the power is rising rapidly, or whether it keeps zigzagging up and down instead of staying stable.
The data to be analyzed has the following columns.
M_CD: Machine code
UP_TIME: Startup time
POWER: Power
TEMP: Temperature
ERR_CD: Error code
For each machine code, changes in power and temperature along the startup time, and any errors are recorded in chronological order.
This time, we will create the following features from this data.
① Conditional: Power difference from one hour ago
② Flag type: A flag that catches the zigzag in which power alternately increases and decreases
③ Count type: Cumulative number of zigzag occurrences
④ State type: A state indicating whether zigzags are occurring frequently or not
In each case, the record order is required, so use the sort node to sort by each machine code and startup time.
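In pandas, the equivalent preparation might look like the sketch below. The small DataFrame is made-up sample data, and the code assumes the startup-time column is named UP_TIME. `reset_index(drop=True)` is included so that the row labels run from 0 to n-1 again after sorting.

```python
import pandas as pd

# Made-up sample rows, deliberately out of order
df = pd.DataFrame({
    "M_CD":    [209, 204, 209, 204],
    "UP_TIME": [1, 1, 0, 0],
    "POWER":   [500, 991, 502, 993],
})

# Sort by machine code, then by startup time, like the sort node
df = df.sort_values(["M_CD", "UP_TIME"]).reset_index(drop=True)
print(df)
```

Resetting the index matters for any later code that addresses rows by position, such as a loop that compares row `index` with row `index - 1`.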
Let's make a feature of "difference from the power one hour ago".
Set "Derived: Conditional" on the field creation node. Input items that mirror the structure of an IF statement then appear. The same IF statement can also be written with "Derived: CLEM expression", but "Derived: Conditional" is recommended for readability.
First, enter @DIFF1(POWER) in Then:. @DIFF1 is one of Modeler's built-in functions (CLEM functions) and calculates the difference from the previous row. This gives the difference from the power one hour ago.
Next, set If: to M_CD = @OFFSET(M_CD, 1) and Else: to undef. @OFFSET refers to the value N rows before; here it refers to the previous row. undef means NULL. In other words, if M_CD is the same as in the previous row, @DIFF1(POWER) is calculated; if it differs, the difference would be taken against another machine's power, which is meaningless, so NULL is stored instead.
The result is as follows. There is a new derived column called POWER_DIFF that contains the current POWER minus the POWER in the previous row. In the example on line 930, it contains 988W - 991W = -3W.
Also, line 941 contains $null$. The data up to line 940 belongs to the machine with M_CD = 204 and the data from line 941 belongs to the machine with M_CD = 209, so a power difference across the two would be meaningless and NULL is stored.
By the way, let's look at two machines, M_CD = 1000 and M_CD = 229, in a time series graph.
M_CD = 1000 decreases monotonically by -1W or -2W from the beginning and never increases. At the end, there are relatively large drops of -5W and -6W.
For M_CD = 229, the differences swing considerably between positive and negative, and the power repeatedly increases and decreases.
In pandas, we group by M_CD, apply diff(1) to POWER to get the difference from the previous row (one hour ago), and put the result in a new column, df['POWER_DIFF'].
#Power difference 1 hour ago
df['POWER_DIFF'] = df.groupby(['M_CD'])['POWER'].diff(1)
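As a cross-check, the Modeler conditional can also be translated literally with shift and where. The sketch below uses made-up sample data, and the column name POWER_DIFF_COND is hypothetical; it only illustrates that both forms leave the first row of each machine empty.

```python
import pandas as pd

# Made-up sample data: two machines, already sorted by M_CD and time
df = pd.DataFrame({
    "M_CD":  [204, 204, 204, 209, 209],
    "POWER": [993, 991, 988, 500, 503],
})

# Grouped version: diff within each machine
df["POWER_DIFF"] = df.groupby(["M_CD"])["POWER"].diff(1)

# Literal translation of the conditional:
# IF M_CD = @OFFSET(M_CD, 1) THEN @DIFF1(POWER) ELSE undef
same_machine = df["M_CD"].eq(df["M_CD"].shift(1))
df["POWER_DIFF_COND"] = df["POWER"].diff(1).where(same_machine)

print(df)
```

The grouped form is shorter, while the literal form makes the machine-boundary check explicit.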
If the power repeatedly goes up and down as on the machine with M_CD = 229, there may be something wrong with the power supply. A single value of "the difference from the power one hour ago" (for example, -5W) cannot capture this zigzag. So we create a feature that indicates that the power difference has changed from positive to negative, or from negative to positive.
Set "Derived: Flag type" on the field creation node. Since I wanted to display the field in the time series graph later, I set the data type to continuous, with 1 for true and 0 for false. If you just want a flag, you can leave the data type as flag. Set the true condition to POWER_DIFF * @OFFSET(POWER_DIFF,1) < 0. This multiplies "the current power difference" by "the power difference one hour ago" and checks whether the product is negative. Plus times minus is minus, while plus times plus or minus times minus is plus. Therefore, the flag is set when the sign is inverted, that is, when a zigzag occurs.
The result is as follows. There is a new derived column called FLUCTUATION, which contains 1 if POWER_DIFF and the POWER_DIFF in the previous row have different signs. In line 1195, the power increased by 5W one hour ago and increased by 5W again this time, so it is increasing monotonically and the flag is 0. On the other hand, in line 1197 the power increased by 5W one hour ago but changed by -1W this time, so a zigzag is occurring and the flag is 1. The zigzag, which previously could only be seen by looking at the graph, can now be recognized from the single record on line 1197.
Let's look at two machines, M_CD = 1000 and M_CD = 229, again in a time series graph.
Since the power of M_CD = 1000 decreases monotonically from the beginning, there is no zigzag.
For M_CD = 229, you can see that small increases and decreases are repeated.
Creating the zigzag flag in pandas is slightly more involved. First, create a column for the POWER_DIFF of one hour ago. Grouping by M_CD, shift(1) refers to the previous row's POWER_DIFF, and the result is put in a new column, df['PREV_POWER_DIFF'].
# Add a column holding the POWER_DIFF of one hour ago
df['PREV_POWER_DIFF'] = df.groupby(['M_CD'])['POWER_DIFF'].shift(1)
This column isn't created in Modeler because it's not needed in the end, but it's needed for calculations in pandas.
Next, define the function func_fluctuation. The IF statement in this function, if x.POWER_DIFF * x.PREV_POWER_DIFF < 0:, multiplies "the current power difference" by "the power difference one hour ago" and checks whether the product is negative.
Then call this function for each row with a lambda and put the result in a new column, df['FLUCTUATION']. Note that axis=1 makes apply pass each row (as a pandas.Series) to the function instead of each column.
# Function that detects a sign inversion between plus and minus
def func_fluctuation(x):
    if x.POWER_DIFF * x.PREV_POWER_DIFF < 0:
        return 1
    else:
        return 0

# Call the sign-inversion function on each row
df['FLUCTUATION'] = df.apply(lambda x: func_fluctuation(x), axis=1)
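The same flag can probably be computed without a row-wise apply by comparing the product of the two columns directly. The following is a sketch on made-up sample data, not the article's original method. It relies on the fact that a comparison involving NaN evaluates to False, so the first rows of each machine become 0.

```python
import pandas as pd

# Made-up sample data: one zigzagging machine and one monotonic machine
df = pd.DataFrame({
    "M_CD":  [229] * 5 + [1000] * 3,
    "POWER": [100, 105, 110, 109, 112, 300, 299, 297],
})
df["POWER_DIFF"] = df.groupby("M_CD")["POWER"].diff(1)
df["PREV_POWER_DIFF"] = df.groupby("M_CD")["POWER_DIFF"].shift(1)

# Product < 0 means the sign of the difference flipped.
# NaN * x is NaN and NaN < 0 is False, so boundary rows become 0.
product = df["POWER_DIFF"] * df["PREV_POWER_DIFF"]
df["FLUCTUATION"] = (product < 0).astype(int)

print(df)
```

On large data the column-wise form usually runs much faster than apply with axis=1, since it avoids calling a Python function once per row.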
I was able to generate it as follows.
A machine with many zigzags, like M_CD = 229, may have a power supply problem, whereas a few zigzags may not indicate any problem. Let's create a feature that holds the cumulative number of zigzags since startup.
Set "Derived: Count type" on the field creation node. Set the increment condition to FLUCTUATION = 1 and the increment to 1, which means the counter goes up by one each time a zigzag occurs. In addition, set M_CD /= @OFFSET(M_CD, 1) as the reset condition so that the counter returns to 0 when the machine changes.
The result is as follows. There is a new derived column called FLUC_COUNT, which is incremented by one whenever FLUCTUATION is 1. On line 1194, FLUC_COUNT is 1 because a FLUCTUATION has occurred. It stays at 1 until FLUCTUATION occurs again on line 1197, where it increases to 2.
Now let's look at two machines with M_CD = 104 and M_CD = 229 in a time series graph.
M_CD = 104 has two zigzags after 40 hours, after which the power is steadily decreasing. So FLUC_COUNT will remain at 2 after about 50 hours.
When M_CD = 229, the increase / decrease was repeated finely, and the zigzag state was repeated 40 times or more.
In pandas, this can be calculated with cumsum(), which computes a cumulative sum. Grouping by M_CD, the cumulative sum of FLUCTUATION is calculated with cumsum() and put in a new column, df['FLUC_COUNT'].
#Cumulative number of zigzag
df['FLUC_COUNT'] = df.groupby(['M_CD'])['FLUCTUATION'].cumsum()
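A small worked example may help show that grouping takes the place of Modeler's reset condition. The data below is made up for illustration.

```python
import pandas as pd

# Made-up flags for two machines, already sorted by M_CD and time
df = pd.DataFrame({
    "M_CD":        [104, 104, 104, 104, 229, 229, 229],
    "FLUCTUATION": [0,   1,   0,   1,   0,   1,   1],
})

# The cumulative sum restarts from 0 for each machine
df["FLUC_COUNT"] = df.groupby(["M_CD"])["FLUCTUATION"].cumsum()

print(df)
```

Here the count for M_CD = 104 ends at 2 and the count for M_CD = 229 starts again from 0, without any explicit reset logic.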
I was able to generate it as follows.
If zigzags occur only occasionally, there may not be a big problem. On the other hand, if zigzags repeat within a short period, the effect may linger even after they subside. "Derived: State type" can express such a complicated situation.
Set "Derived: State type" on the field creation node. Since I wanted to display the field in the time series graph later, I set the data type to continuous, with 1 for "on" and 0 for "off". If you just want a flag, you can leave the data type as flag. Set the switch "on" condition to FLUCTUATION = 1 and @OFFSET(FLUCTUATION,1) = 1. This means that a zigzag occurred now and also one hour ago; in other words, zigzags occurred for two hours in a row.
Next, set the switch "off" condition to @SINCE(FLUCTUATION = 1) >= 5 or M_CD /= @OFFSET(M_CD,1). @SINCE returns how many rows ago the expression given as its argument last held. @SINCE(FLUCTUATION = 1) >= 5 means that the last zigzag occurred 5 or more rows ago; in other words, the power is considered stable because there has been no zigzag for 5 or more hours in a row.
Also, M_CD /= @OFFSET(M_CD, 1) is a reset condition that returns the state to off when the machine changes.
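The idea behind @SINCE can be approximated in pandas as a per-machine "rows since the last zigzag" counter. This is only a sketch on made-up data. The column name ROWS_SINCE_FLUC is hypothetical, grouping by M_CD replaces the explicit reset condition, and rows with no earlier zigzag are left as NaN, which may differ from what Modeler's @SINCE returns in that situation.

```python
import pandas as pd

# Made-up flags for two machines, already sorted by M_CD and time
df = pd.DataFrame({
    "M_CD":        [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "FLUCTUATION": [0, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Row position within each machine: 0, 1, 2, ...
pos = df.groupby("M_CD").cumcount()

# Position of the most recent zigzag, carried forward within each machine
last_fluc_pos = pos.where(df["FLUCTUATION"].eq(1)).groupby(df["M_CD"]).ffill()

# Rows elapsed since that zigzag (NaN if none has occurred yet)
df["ROWS_SINCE_FLUC"] = pos - last_fluc_pos

print(df)
```

A condition such as `df["ROWS_SINCE_FLUC"] >= 5` would then play the role of the "no zigzag for 5 or more rows" test.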
This is similar to the flag type, but the state type allows the on and off conditions to be asymmetric. Here, the state turns on when the unstable zigzag occurs twice in a row, but it does not return to off until the stable condition has continued five times in a row.
The result is as follows. There is a new derived column called UNSTABILITY. First, on line 902, FLUCTUATION is 1 for the second consecutive record, so UNSTABILITY becomes 1. Because the zigzag continued for 2 hours in a row, the machine is judged to be unstable.
Next, from line 903 to line 906, FLUCTUATION is 0 and no zigzag occurs, but UNSTABILITY remains 1. Then on line 907, the last FLUCTUATION of 1 was 5 records ago, that is, FLUCTUATION has been 0 for 5 records in a row, so UNSTABILITY returns to 0. Because no zigzag occurred for 5 hours, the machine is judged to be stable.
Now let's look at two machines with M_CD = 204 and M_CD = 229 in a time series graph.
M_CD = 204 has two zigzags after 49 hours, after which the power decreases steadily. So UNSTABILITY becomes 1 and returns to 0 five hours later.
For M_CD = 229, the power keeps increasing and decreasing in small steps and UNSTABILITY is 1 for long periods, but there are three stretches with no zigzag for 5 hours in a row, and UNSTABILITY is 0 during those stretches.
Since it is difficult to express such a complicated condition with pandas, I thought about serial processing with loop processing.
# The first row gets the initial (stable) state
# Note: this code assumes the index runs from 0 to len(df)-1
df.at[0, 'UNSTABILITY'] = 0
stable_seq_count = 0
# Loop from the second row (index=1)
for index in range(1, len(df)):
    # By default, keep the previous state
    df.at[index, 'UNSTABILITY'] = df.at[index-1, 'UNSTABILITY']
    # If there is a fluctuation
    if df.at[index, 'FLUCTUATION'] == 1:
        # Reset the consecutive stability count
        stable_seq_count = 0
        # Judge as unstable when fluctuation occurs twice in a row
        if df.at[index-1, 'FLUCTUATION'] == 1:
            df.at[index, 'UNSTABILITY'] = 1
    # If there is no fluctuation, increase the consecutive stability count
    elif df.at[index, 'FLUCTUATION'] == 0:
        stable_seq_count += 1
    # Judge as stable when the stability count reaches 5 or the machine changes
    if stable_seq_count >= 5 or df.at[index, 'M_CD'] != df.at[index-1, 'M_CD']:
        df.at[index, 'UNSTABILITY'] = 0
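The same state logic could also be wrapped in a small function and applied per machine, which removes the explicit machine-change check and the dependence on a 0-based index. The sketch below uses made-up data. The function name unstability_states and its off_after parameter are hypothetical, and it is intended to follow the same rules as the loop above rather than to reproduce Modeler exactly.

```python
import pandas as pd

def unstability_states(flags, off_after=5):
    """Return the 0/1 state for one machine's FLUCTUATION sequence."""
    state, prev, stable_count = 0, 0, 0
    states = []
    for flag in flags:
        if flag == 1:
            stable_count = 0
            if prev == 1:        # two zigzags in a row: switch on
                state = 1
        else:
            stable_count += 1
            if stable_count >= off_after:   # long enough calm: switch off
                state = 0
        prev = flag
        states.append(state)
    return states

# Made-up flags for two machines, already sorted by M_CD and time
df = pd.DataFrame({
    "M_CD":        [1] * 9 + [2] * 3,
    "FLUCTUATION": [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Each machine starts from the "off" state because the function runs per group
df["UNSTABILITY"] = df.groupby("M_CD")["FLUCTUATION"].transform(
    lambda s: pd.Series(unstability_states(s), index=s.index)
)

print(df)
```

In this example machine 1 turns on at its third row, stays on through four calm rows, and turns off at the fifth calm row, which mirrors the asymmetric on and off conditions of the state type.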
I was able to generate it as follows.
The samples are available below.
stream: https://github.com/hkwd/200611Modeler2Python/raw/master/derive/derive3.str
notebook: https://github.com/hkwd/200611Modeler2Python/blob/master/derive/derive.ipynb
data: https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/Cond4n_e.csv
■ Test environment
Modeler 18.2.2
Windows 10 64bit
Python 3.6.9
pandas 0.24.1
Field creation node https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_mainhelp_client_ddita/clementine/derive_overview.html