Time series data is hard to handle, isn't it? And the more variables you have, the more disheartening it gets. Still, I suspect many people feel that "if you can just extract features from the time series data, you can manage the rest."
This time, I'll introduce **tsfresh**, a library that looks useful for feature engineering of multidimensional time series data.
I referred to the following articles:

- Library tsfresh that automatically extracts features from time series data
- Easy statistical processing of time series data with tsfresh
I installed it via pip. It wouldn't install with an old pip, so upgrade pip first:

pip install --upgrade pip

Then install tsfresh:

pip install tsfresh

Also, change your pandas version like so:

pip install pandas==0.21
For reference, here are the versions I'm using.
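If you want to check the versions yourself, these libraries expose a `__version__` attribute; a minimal sketch:

```python
import pandas as pd
import tsfresh

# Print the installed versions (your output will differ).
print(pd.__version__)
print(tsfresh.__version__)
```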
Finding multidimensional time series data was a hassle, so this time we'll use a dataset that can be downloaded through tsfresh, pseudo-transformed for our purposes. (If you already have your own data, feel free to skip ahead.)
This pseudo data is enough to grasp the procedure, but the results it produces are not interesting at all, so if you have your own data, I recommend using that instead.
The UEA & UCR Time Series Classification Repository also seems to hold plenty of interesting time series data...
First, load the data.
In[1]
import pandas as pd
import numpy as np
from tsfresh.examples.har_dataset import download_har_dataset, load_har_dataset
download_har_dataset()
df = load_har_dataset()
print(df.shape)
df.head()
Checking this shows that the data has 7352 sample points and 128 variables (128 dimensions).
Next, we cut it down to just 100 sample points and 50 variables.
In[2]
df = df.iloc[0:100, 0:50]
print(df.shape)
df.head()
This time, "There are five subjects, and 10 variables of time-series data are acquired from sensors attached to the body to classify whether the subjects are children or adults." Imagine a situation like this.
The purpose of this time is to see through the flow of feature engineering. The correspondence of values is messed up. Of course, it cannot be classified by this data.
In[3]
# 5 subjects (ids), 10 variables each
# Assign each block of 10 columns to one individual (subject).
df_s1 = df.iloc[:,0:10].copy()
df_s2 = df.iloc[:,10:20].copy()
df_s3 = df.iloc[:,20:30].copy()
df_s4 = df.iloc[:,30:40].copy()
df_s5 = df.iloc[:,40:50].copy()
# Create a column holding each individual's id.
df_s1['id'] = 'sub1'
df_s2['id'] = 'sub2'
df_s3['id'] = 'sub3'
df_s4['id'] = 'sub4'
df_s5['id'] = 'sub5'
# Rename the columns of each frame.
columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'id']
df_s1.columns = columns
df_s2.columns = columns
df_s3.columns = columns
df_s4.columns = columns
df_s5.columns = columns
df_s1.head()
This gives us five data frames like the one above.
According to the official documentation, **extract_features()**, the main function in this article, expects its input in a specified format: a pandas DataFrame in one of three layouts.

- Flat DataFrame
- Stacked DataFrame (data frames stacked vertically)
- Dictionary of flat DataFrames

This time, we'll format the data into the first of these:
id time x y
A t1 x(A, t1) y(A, t1)
A t2 x(A, t2) y(A, t2)
A t3 x(A, t3) y(A, t3)
B t1 x(B, t1) y(B, t1)
B t2 x(B, t2) y(B, t2)
B t3 x(B, t3) y(B, t3)
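As a minimal sketch of this flat format, a tiny hand-built DataFrame (the ids A/B and variables x, y mirror the docs example above) might look like this:

```python
import pandas as pd

# One row per (id, time) pair, one column per time series variable.
flat_df = pd.DataFrame({
    'id':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': [1, 2, 3, 1, 2, 3],
    'x':    [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    'y':    [10.0, 20.0, 30.0, 11.0, 21.0, 31.0],
})
# extract_features(flat_df, column_id='id', column_sort='time')
# would accept a frame shaped like this.
print(flat_df)
```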
Continuing from earlier,
In[4]
df = pd.concat([df_s1, df_s2, df_s3, df_s4, df_s5], axis=0)
print(df['id'].nunique())
df.head()
When we concatenate like this, the number of unique ids comes out as 5, so the five frames were stacked without any problems.
We can now pass this to the extract_features() function. Applying it to the data frame above:
In[5]
from tsfresh import extract_features
df_features = extract_features(df, column_id='id')
df_features.head()
The features are calculated as shown above. There are 754 features per variable. That is quite a lot, but it feels worthwhile when you consider that it dissolves the usual difficulty of handling time series data.
What each feature means is described in the documentation's [Overview on extracted features](https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html). Features (statistics) that take parameters are computed for multiple parameter settings.
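Since the extracted columns follow tsfresh's `<variable>__<feature>__<parameters>` naming convention (e.g. `x1__mean`), you can count how many features were generated per input variable; a quick sketch reusing `df_features` from above:

```python
# Each column name begins with the source variable, e.g. "x1__mean".
# Counting the prefixes should show 754 features for each of x1..x10.
prefixes = [col.split('__')[0] for col in df_features.columns]
print(pd.Series(prefixes).value_counts())
```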
After extracting features as above, the documentation recommends [filtering them](https://tsfresh.readthedocs.io/en/latest/text/feature_filtering.html). The function for this feature filtering is **select_features()**. It uses statistical hypothesis tests to keep only the features that are likely to have a statistically significant relationship with the target.
Before that, run:
In[6]
from tsfresh.utilities.dataframe_functions import impute
df_features = impute(df_features)
This imputes the unmanageable values such as NaN and infinity in the features we obtained earlier.
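To confirm that impute() did its job, a quick check like the following should report zero NaNs and zero infinities (a sketch, reusing `df_features` and the `np` import from above):

```python
# After impute() there should be no NaN or +/-inf left.
print(df_features.isna().sum().sum())      # expect 0
print(np.isinf(df_features.values).sum())  # expect 0
```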
Also, select_features() narrows down the features based on a dependent variable y, so we prepare one in a pseudo manner. As you can see in the [source code](https://tsfresh.readthedocs.io/en/latest/_modules/tsfresh/feature_selection/selection.html), this y must be a pandas.Series or a numpy.array. This time,
In[7]
from tsfresh import select_features
X = df_features
y = [0, 0, 0, 1, 1]
y = np.array(y)
we prepare X, the data frame of extracted features, and y, a numpy array with labels assigned arbitrarily.
And
In[8]
X_selected = select_features(X, y)
print(X_selected.shape)
X_selected
Running this gives... well, nothing. Not a single feature survives. That's only natural, since the data is arbitrary... With proper data, you should be able to select features well. (If you have managed, or failed, to select features on your own data, I'd be overjoyed if you left a comment.)
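If you want to see why nothing survived, tsfresh also exposes the per-feature test results through calculate_relevance_table() in tsfresh.feature_selection.relevance; with these arbitrary labels, every p-value should be unremarkable. A sketch (converting y to a pandas.Series indexed like X, to be safe):

```python
import pandas as pd
from tsfresh.feature_selection.relevance import calculate_relevance_table

# One row per feature, with the hypothesis test's p-value and the
# boolean "relevant" flag that select_features() acts on.
relevance_table = calculate_relevance_table(X, pd.Series(y, index=X.index))
print(relevance_table[['p_value', 'relevant']].sort_values('p_value').head())
```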
Since it ends there, this time I tried dimensionality reduction with PCA instead of feature selection via statistical hypothesis testing. The flow is to create a pca instance, fit it, and transform the data using that instance.
In[9]
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca.fit(df_features)
X_PCA = pca.transform(df_features)
When this is transformed into a data frame and displayed,
In[10]
X_PCA = pd.DataFrame(X_PCA)
X_PCA.head()
we get the frame above.
Furthermore, referring to "Python: Principal component analysis (PCA) with scikit-learn", if we compute the explained variance ratio and the cumulative explained variance ratio:
In[11]
print('Contribution rate of each dimension: {0}'.format(pca.explained_variance_ratio_))
print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))
out[11]
Contribution rate of each dimension: [0.30121012 0.28833114 0.22187195 0.1885868 ]
Cumulative contribution rate: 0.9999999999999999
The four components explain essentially all the variance. (With only 5 samples, PCA yields at most 4 meaningful components, so this is expected.) Perhaps, then, the features can be compressed with PCA instead of being selected.
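One caveat with this brute-force PCA: tsfresh features live on wildly different scales, so standardizing them before PCA is usually a good idea. A minimal sketch with a scikit-learn pipeline (keeping 4 components only to mirror the example above):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean / unit variance first,
# so no single large-scale feature dominates the components.
pca_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=4)),
])
X_pca_scaled = pca_pipeline.fit_transform(df_features)
print(pca_pipeline.named_steps['pca'].explained_variance_ratio_)
```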
It's a bit of a brute-force approach, but with this series of steps you can escape the difficulties peculiar to time series data.
From there you can do all sorts of things, such as building a baseline model as in [Kaggle] Baseline model construction, Pipeline processing.
Next, I will try it with actual multidimensional time series data.