I had some decaying time-series data, and when I drew a trend line in EXCEL, a power approximation seemed to fit well, so for work reasons I decided to reproduce it in Python instead of EXCEL. When I actually tried, though, the curve did not fit for some reason... While looking into why, I came across the following information: Exponentiation by Python. That site answers most of the problem, but I am writing this article as a memo for myself.
The power expression is defined as follows.

$y = b x^{a}$

So what about EXCEL? EXCEL's power approximation does not fit this equation directly; instead it performs a linear regression on the log-transformed data. Taking the logarithm of both sides, the above equation can be converted as follows.

$\ln y = a \ln x + \ln b$

In other words, $\ln y$ is treated as a linear function of $\ln x$ with slope $a$ and intercept $\ln b$.
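For reference, here is a minimal sketch of this EXCEL-style approach: fitting a straight line in log-log space with np.polyfit. The x and y values below are made-up illustrative numbers, not the data used later in this article.

import numpy as np

# Minimal sketch: an EXCEL-style power fit is a straight-line fit in log-log space.
# x and y are synthetic values chosen only for illustration.
x = np.arange(1.0, 16.0)
y = 2.0 * x**-0.5                                # exact power law with b=2, a=-0.5

a, ln_b = np.polyfit(np.log(x), np.log(y), 1)    # slope = a, intercept = ln(b)
b = np.exp(ln_b)
print('a : {}, b : {}'.format(a, b))             # recovers a = -0.5, b = 2.0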
Let's see what actually happens in Python. For the data, we use "Changes in the Number of Infant Deaths and the Infant Mortality Rate" from the Cabinet Office's "2011 White Paper on Children and Youth".
read_data.py
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv('Infant mortality rate.csv',encoding='cp932') #encoding='cp932' (Windows Shift-JIS) to prevent garbled characters
display(df)
adjust_data.py
df=df.iloc[3:18,:].rename(columns={'1st-1-Figure 6 Changes in infant mortality and mortality rate':'Annual'\
,'Unnamed: 1':'Infant mortality (persons)'\
,'Unnamed: 2':'Infant mortality (thousands)'\
,'Unnamed: 3':'Infant mortality rate'})
#Create serial number columns for later processing
rank=range(1,len(df)+1)
df['rank']=rank
#All columns are read as object type, so convert the numeric columns explicitly
df['Infant mortality rate']=df['Infant mortality rate'].astype(float)
df['Infant mortality (persons)']=df['Infant mortality (persons)'].str.replace(',','').astype(int) #np.int is deprecated in recent NumPy, so plain int is used
display(df)
plot.py
x=df['Annual']
y=df['Infant mortality rate']
ax=plt.subplot(1,1,1)
ax.plot(x,y)
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality rate')
plt.show()
The infant mortality rate is far lower than it was 60 years ago; medical progress is remarkable. Now let's find the approximation parameters.
func.py
def exp_func(x, a, b):
    #Power function: y = b * x**a
    return b*(x**a)

def exp_fit(val1_quan, val2_quan):
    #maxfev: maximum number of function calls; check_finite: if True, a ValueError is raised when NaN is included
    l_popt, l_pcov = curve_fit(exp_func, val1_quan, val2_quan, maxfev=10000, check_finite=False)
    return exp_func(val1_quan, *l_popt), l_popt
Find the parameters $ a $ and $ b $ of exp_func using exp_fit.
culc_params.py
x=df['Annual']
x2=df['rank']
y=df['Infant mortality rate']
y_fit,l_popt=exp_fit(x2,y)
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality rate')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1])) #check the obtained parameters a and b
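As a side note (this check is my addition, not part of the original workflow), the quality of the fit can also be quantified with the coefficient of determination, reusing y and y_fit from the block above.

# Coefficient of determination R^2 for the fitted curve (uses y and y_fit defined above)
ss_res = np.sum((y - y_fit)**2)        # residual sum of squares
ss_tot = np.sum((y - np.mean(y))**2)   # total sum of squares
print('R^2 : {}'.format(1 - ss_res/ss_tot))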
Looks like a good fit. Next, let's reproduce the EXCEL-style approach: take the logarithm and fit a straight line.
func2.py
def exp_func_log(x, a, b):
    #Log-transformed power function: ln(y) = a*ln(x) + ln(b)
    return a*np.log(x) + np.log(b)

def exp_func_log_fit(val1_quan, val2_quan):
    #Fit ln(y) as a linear function of ln(x)
    l_popt, l_pcov = curve_fit(exp_func_log, val1_quan, np.log(val2_quan), maxfev=10000, check_finite=False)
    return exp_func_log(val1_quan, *l_popt), l_popt

def log_to_exp(x, a, b):
    #Convert the fitted values back from the log scale
    return np.exp(a*np.log(x) + np.log(b))
Find the parameters $a$ and $b$ of exp_func_log using exp_func_log_fit. Since the values approximated with the obtained parameters $a$ and $b$ are $\ln y$, they are converted back from the logarithm by log_to_exp.
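Concretely, converting back from the log scale simply recovers the original power form:

$\exp(a \ln x + \ln b) = e^{\ln b}\, x^{a} = b x^{a}$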
culc_params2.py
x=df['Annual']
x2=df['rank']
y=df['Infant mortality rate']
y_fit,l_popt=exp_func_log_fit(x2,y)
y_fit=log_to_exp(x2,l_popt[0],l_popt[1])
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality rate')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1])) #check the obtained parameters a and b
This also looks reasonable, but to my eye the direct non-linear regression fits this data better.
I'm not sure which one is the right choice in general. However, if you ever run into a situation of "EXCEL can fit it, but Python doesn't give the same result!", this difference may be worth remembering.
When the data values are large and fluctuate strongly, a non-linear regression tends to be pulled toward those large fluctuations. From the point of view of generalization, it may be better to apply a logarithmic transformation and use linear regression.
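One middle ground (my own suggestion, not something the original comparison covers) is to keep the non-linear fit but down-weight large observations by passing curve_fit's sigma argument, here set proportional to y so that roughly relative errors are minimized. This reuses exp_func, x2 and y defined above.

# Sketch: relative-error weighting for the non-linear fit (uses exp_func, x2, y from above)
# sigma proportional to y means large values no longer dominate the least-squares loss.
l_popt_w, l_pcov_w = curve_fit(exp_func, x2, y, sigma=y, maxfev=10000)
print('a : {}, b : {}'.format(l_popt_w[0], l_popt_w[1]))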
dummydata.py
df=pd.read_csv('Infant mortality rate.csv',encoding='cp932')
df=df.iloc[3:18,:].rename(columns={'1st-1-Figure 6 Changes in infant mortality and mortality rate':'Annual'\
,'Unnamed: 1':'Infant mortality (persons)'\
,'Unnamed: 2':'Infant mortality (thousands)'\
,'Unnamed: 3':'Infant mortality rate'})
#Create serial number columns for later processing
rank=range(1,len(df)+1)
df['rank']=rank
#All columns are read as object type, so convert the numeric columns explicitly
df['Infant mortality rate']=df['Infant mortality rate'].astype(float)
df['Infant mortality (persons)']=df['Infant mortality (persons)'].str.replace(',','').astype(int) #np.int is deprecated in recent NumPy, so plain int is used
#Insert dummy data
df2=df.copy()
df2.loc[df2['Annual']=='Heisei 2', 'Infant mortality (persons)']=60000
df2.loc[df2['Annual']=='13', 'Infant mortality (persons)']=40000
df2.loc[df2['Annual']=='15', 'Infant mortality (persons)']=20000
df2.loc[df2['Annual']=='18', 'Infant mortality (persons)']=10000
display(df2)
x=df2['Annual']
y=df2['Infant mortality (persons)']
ax=plt.subplot(1,1,1)
ax.plot(x,y)
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality (persons)')
ax.set_title('Dummy values inserted in Heisei 2, 13, 15 and 18')
plt.show()
dummydata.py
#Nonlinear regression
x=df2['Annual']
x2=df2['rank']
y=df2['Infant mortality (persons)']
y_fit,l_popt=exp_fit(x2,y)
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality (persons)')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1]))
#Logarithmic transformation linear regression
x=df2['Annual']
x2=df2['rank']
y=df2['Infant mortality (persons)']
y_fit,l_popt=exp_func_log_fit(x2,y)
y_fit=log_to_exp(x2,l_popt[0],l_popt[1])
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality (persons)')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1]))
Figures: the approximation by non-linear regression, and the approximation by linear regression after logarithmic transformation.
Clearly, the fit obtained by non-linear regression is pulled by the fluctuations introduced by the dummy values. It seems important to use each method appropriately depending on the situation.