tslearn is a powerful package for clustering time series data; this post is a memorandum from checking how it works. It looks like it could be useful at work, so I tried it out a little.
--Implementation period: October 2020
--Environment: Ubuntu 18.04 LTS
Create a new virtual environment for this check, following the procedure in my Miniconda install memorandum, then install the required packages:
conda install -c conda-forge tslearn
conda install -c conda-forge h5py
As the tslearn installation procedure also notes, the following packages are required as well: scikit-learn, numpy, scipy.
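To confirm the installation worked, a quick import check is enough (just a sanity-check sketch):

# Sanity check: import tslearn and print its version
import tslearn
print(tslearn.__version__)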
Three clustering methods are implemented in tslearn's clustering module: TimeSeriesKMeans, KernelKMeans, and K-Shape. This time, I will use K-Shape. An overview of K-Shape can be found in this blog, which also links to the original paper.
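As a rough idea of what K-Shape does: it clusters with a shape-based distance (SBD) built on normalized cross-correlation, so two waveforms that match after a time shift are considered close. Below is a minimal NumPy sketch of that distance for two 1-D series (an illustration of the idea, not tslearn's internal implementation):

import numpy as np

def sbd(x, y):
    # Shape-based distance: 1 - max normalized cross-correlation over all shifts
    cc = np.correlate(x, y, mode="full")  # cross-correlation at every lag
    ncc = cc / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - ncc.max()  # approaches 0 when the shapes align perfectly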
This time, we will try clustering arrhythmia waveforms based on tslearn's official sample code. For the data, I used the ECG Heartbeat Categorization Dataset from Kaggle. Since this is not a DNN, only mitbih_train.csv needs to be downloaded. Each row holds 187 points of waveform data sampled at 125 Hz, and the last column is a label from 0 to 4: '0' is the normal waveform, and the other values correspond to different symptoms. There are 87,554 rows in total, but I shuffle them and use only 100. I also set the labels aside so that I can check later whether the clustering succeeded.
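Before sampling, it can be worth checking the label distribution of the whole file; a short sketch (assuming the same CSV path used below, with the label in the last column):

import pandas as pd

df = pd.read_csv('/home/hoge/mitbih_train.csv', header=None)
print(df.iloc[:, -1].value_counts())  # how many rows each of labels 0-4 has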
Import the required libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tslearn.clustering import KShape
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
Format the data, then scale it as the final step.
seed = 0
np.random.seed(seed)  # fix the seed so the shuffle (and KShape below) is reproducible

train_df = pd.read_csv('/home/hoge/mitbih_train.csv', header=None)
trainArr = train_df.values
np.random.shuffle(trainArr)

# Take the first 100 shuffled rows: 187 waveform points, label in the last column
trainArr_X = trainArr[:100, :187]
trainArr_y = trainArr[:100, -1:]
trainArr_X = trainArr_X.reshape([100, 187, 1])
print(trainArr_X.shape)  # (100, 187, 1)
print(trainArr_y.shape)  # (100, 1)
sz = trainArr_X.shape[1]

# For this method to operate properly, prior scaling is required
trainArr_X = TimeSeriesScalerMeanVariance().fit_transform(trainArr_X)
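TimeSeriesScalerMeanVariance rescales each series to zero mean and unit variance by default, which is what K-Shape expects. A quick check that the scaling took effect:

# Each series should now have mean ~0 and std ~1 along the time axis
print(trainArr_X.mean(axis=1)[:3].ravel())  # ~0
print(trainArr_X.std(axis=1)[:3].ravel())   # ~1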
Perform clustering and view the results. The dataset has 5 label classes, so I set n_clusters=5.
# kShape clustering
ks = KShape(n_clusters=5, verbose=True, random_state=seed)
y_pred = ks.fit_predict(trainArr_X)

# Plot each cluster's member waveforms (blue) with its centroid (red)
plt.figure(figsize=(10, 10), tight_layout=True)
for yi in range(5):
    plt.subplot(5, 1, 1 + yi)
    for xx in trainArr_X[y_pred == yi]:
        plt.plot(xx.ravel(), "b-", alpha=.2)
    plt.plot(ks.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-8, 8)
    plt.title("Cluster %d" % (yi + 1))
plt.show()
This produces the figure below. The red line in each panel is the centroid that K-Shape computed for that cluster (ks.cluster_centers_), not necessarily an actual member waveform.
(I really wanted to color-code the 100 lines by trainArr_y, i.e., by disease label, but I couldn't write that right away due to my limited Python skills, so I will add it at a later date.) For reference, the label counts of the 100 original data points used here (unrelated to the cluster numbers in the figure above) are: '0': 86, '1': 2, '2': 4, '3': 2, '4': 6.
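One simple way to compare labels against clusters in the meantime is a contingency table; a sketch using the y_pred and trainArr_y from above:

# Cross-tabulate true labels (rows) against predicted clusters (columns)
import pandas as pd

ct = pd.crosstab(trainArr_y.ravel(), y_pred,
                 rownames=['label'], colnames=['cluster'])
print(ct)

The same pairing of y_pred and labels could also drive per-label line colors in the plot above.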
Since the label counts and the cluster sizes clearly differ, it seems the method cannot be used as-is without some tuning, so I will investigate what parameters are available. One other note: the raw data is zero-padded in the latter half of its 187 points. I fed it in unchanged, but that does not seem to have been a problem.
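As one starting point for that investigation: KShape's constructor exposes, among others, n_clusters, max_iter, tol, n_init, and random_state (as documented for tslearn; treat the exact set as version-dependent). For example, running several random initializations and keeping the best:

# Multiple random initializations; the best run is kept automatically
ks = KShape(n_clusters=5, n_init=5, max_iter=100,
            verbose=True, random_state=seed)
y_pred = ks.fit_predict(trainArr_X)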
Kaggle also has code that classifies this dataset with a CNN, and its accuracy is outstandingly good.