Introduction

About EDM posted in Previous article, it is a story of implementing Simplex Projection in Python in order to challenge the rumination of learning contents and temporal extrapolation. ..

Regarding pyEDM implemented in the laboratory of the proposer Sugihara et al., We will implement it because it does not support prediction outside the input DataFrame. (Reference).

For this implementation, we refer to Blog by Professor Ushio of Kyoto University.

Simplex Projection

0. Data preparation

From the pyEDM sample dataset [Tent Map](https://en.wikipedia.org/wiki/%E3%83%86%E3%83%B3%E3%83%88%E5%86%99%E5% 83% 8F) data is used.

!pip install pyEDM

import pyEDM
tentmap = pyEDM.sampleData['TentMap']

print(tentmap.head())

  Time	TentMap
0    1	-0.09920
1    2	-0.60130
2    3	 0.79980
3    4	-0.79441
4    5	 0.79800

Time-delay Embedding

Its own lag for the variable $ y_t , such as \ { {y_ {t}, y_ {t-\ tau}, ..., y_ {t-(E-1) \ tau}} $ } Use to reconstruct the original dynamics. In fact, you can use other related variables in addition to the target variable. This time, for the sake of simplicity, let's set one variable and $ E = 2, \ tau = 1 $. Actually, it is a variable that greatly depends on the reconstruction accuracy of the dynamics, so it is necessary to estimate the data on hand in advance.

In the case of pyEDM, ʻEmbed () is prepared ([doc]([ʻEmbed ()](https://sugiharalab.github.io/EDM_Documentation/edm_functions/#embed))).

Implemented

def time_delay_embedding(objective_df, target_var, E=2, tau=1):
    return_df = pd.DataFrame(objective_df[target_var].values,
                             columns=[target_var+"(t-0)"],
                             index=objective_df.index)    
    
    for i in range(tau, E*tau-1, tau):
        colname = target_var + "(t-" + str(i) + ")"
        return_df[colname] = objective_df[[target_var]].shift(i)
    
    return return_df.dropna()

emb_df = time_delay_embedding(tentmap, "TentMap", E=2, tau=1)
print(emb_df.head())

   TentMap(t-0)	 TentMap(t-1)
1      -0.60130      -0.09920
2       0.79980	     -0.60130
3      -0.79441	      0.79980
4       0.79800	     -0.79441
5      -0.81954	      0.79800

2. Search for Nearest Neighbor

In this implementation, the latest value of the data on hand is targeted and predicted after $ 1 $ time.

target_pt = emb_df.iloc[-1,:].values #The latest value that serves as the basis for forecasting

diff = emb_df.values - target_pt #Difference between all points of data on hand from the reference point
l2_distance = np.linalg.norm(diff, ord=2, axis=1) #Calculation of L2 norm from difference
l2_distance = pd.Series(l2_distance, index=emb_df.index, name="l2") #I want to sort so pandas.Convert to Series

nearest_sort = l2_distance.iloc[:-1].sort_values(ascending=True) #end(=The latest value that is the standard)Sort in ascending order except

print(nearest_sort.head(3))

index    distance
  124    0.003371
  177    0.018171
  163    0.018347
Name: l2

3. Movement of Nearest Neighbor

Assuming that the point that is the basis of prediction is $ y_ {t} $ and the extracted neighborhood points are \ {$ y_ {t_1}, y_ {t_2}, y_ {t_3} $ } Use \ {$ y_ {t_1 + 1}, y_ {t_2 + 1}, y_ {t_3 + 1} $ } to calculate the point $ y_ {t + 1} $ you want to predict. Since the index of the neighborhood point obtained in the previous section is \ {124, 177, 163 }, You will use the points that correspond to \ {125, 178, 164 }.

There is no limit to the number of neighborhood points that can be referenced, but $ E + 1 $ is used for the embedded dimension $ E $.

knn = int(emb_df.shape[1] + 1)
nn_index = np.array(nearest_sort.iloc[:knn,:].index.tolist())
nn_future_index = nn_index + pt

print(nn_index, "-->", nn_future_index)

nn_future_points = lib_df.loc[nn_future_index].values
print(nn_future_points)

[124 177 163] --> [125 178 164]

[[0.16743 0.91591]
 [0.15932 0.91998]
 [0.1335  0.93295]]

4. Calculation of prediction points

\ {$ Y_ {t_1 + 1}, y_ {t_2 +, depending on the distance to the reference points $ y_ {t} $ and \ {$ y_ {t_1}, y_ {t_2}, y_ {t_3} $ } Calculate the weights \ {$ w_ {t_1 + 1}, w_ {t_2 + 1}, w_ {t_3 + 1} $ } for 1}, y_ {t_3 + 1} $ }.

nn_distances = nearest_sort.loc[nn_index].values #From the reference point yt1, yt2,Distance to yt3
nn_weights = np.exp(-nn_distances/nn_distances[0]) #Calculate weight
total_weight = np.sum(nn_weights)

print("nn_distances")
print("nn_weights")
print("total_weight")

[0.00337083 0.01817143 0.01834696]
[0.36787944 0.00455838 0.00432709]
0.376764916711149

Calculate the prediction point $ y_ {t + 1} $ using the weights. The formula is a simple weighted average, $ y_{t+1} = \frac{w_{t_1+1} \times y_{t_1+1} + w_{t_2+1} \times y_{t_2+1} + w_{t_3+1} \times y_{t_3+1}}{w_{total}}$

forecast_point = list()
for yt in nn_future_points.T:
    forecast_point.append(np.sum((yt * nn_weights) / total_weight))

print(forecast_point)

[0.16694219792961462, 0.9161549438807427]

By the way, the correct answer in this case is ground_truth: [0.16928 0.91498] Therefore, a value that is fairly close is required.

Finally, I plotted the state of the calculation.

Overall view

Expansion of reference points $ y_ {t} $ and \ {$ y_ {t_1}, y_ {t_2}, y_ {t_3} $ }

Expansion of predicted and true values

The entire function that combines 1. to 4. 2.

def forecast_simplex_projection(input_df=None,
                                    forecast_tp=1, knn=None
                                   ):
        if input_df is None:
            raise Exception("Invalid argument: is None: lib_df<pandas.DataFrame> is None")

        if knn is None:
            knn = int(input_df.shape[1] + 1) #Embedded dimension+1
    
        #Get variable name from input DataFrame
        input_cols = input_df.columns.tolist()

        #DeepCopy data for recursive processing in the subsequent stage
        lib_df = input_df.copy()
    
        # forecast_Recursively execute one-step prediction up to the step specified by tp
        forecast_points = list()
        for pt in range(1, forecast_tp+1):
            #Distance calculation from the reference point to each point in the library
            target_pt = lib_df.iloc[-1,:].values
            lib_pt = lib_df.values

            diff = lib_pt - target_pt

            l2_distance = np.linalg.norm(diff, ord=2, axis=1)
            l2_distance = pd.Series(l2_distance, index=lib_df.index, name="l2")

            #Sort distances in ascending order
            nearest_sort = l2_distance.iloc[:-1].sort_values(ascending=True)

            #Index of neighborhood points with respect to reference point and t+Index of 1
            nn_index = np.array(nearest_sort.iloc[:knn].index.tolist())
            nn_future_index = nn_index + 1
    
            #Calculate weight according to the distance of neighboring points
            nn_distances = nearest_sort.loc[nn_index].values
            nn_distances = nn_distances + 1.0e-6
        
            nn_weights = np.exp(-nn_distances/nn_distances[0])
            total_weight = np.sum(nn_weights)
        
            #Calculate predicted value for each embedded dimension
            forecast_value = list()
            for yt in nn_future_points.values.T:
                forecast_value.append(np.sum((yt * nn_weights) / total_weight))
        
            #Add prediction results to library
            forecast_points.append(forecast_value)
            series_forecast = pd.Series(forecast_value, index=input_df.columns)
            lib_df = lib_df.append(series_forecast, ignore_index=True)
        
        forecast_points = np.array(forecast_points)
    
        return forecast_points

At the end

This time, I implemented Simplex Projection, which is the simplest of EDM, in Python. Other methods include S-Map and Multiview Embedding. It is also possible to use different lags with multiple variables for dynamic reconstruction, which affects prediction accuracy. From the next time onward, I will challenge such contents.