I had some decaying time-series data, and when I drew a trend line in EXCEL, a power approximation seemed to fit well, so for work reasons I decided to reproduce it in Python instead of EXCEL. When I actually tried, though, the curve did not fit for some reason... While looking into why, I came across the following information: Exponentiation by Python. That site answers most of the problem, but I am writing this article as a memo for myself.
The power expression is defined as follows.

$y = b x^{a}$

So what about EXCEL? EXCEL's power approximation does not fit this equation directly; instead it performs a linear regression on the log-transformed data. Taking the logarithm of both sides, the above equation can be converted as follows.

$\ln y = a \ln x + \ln b$

In other words, $\ln y$ is treated as a linear function of $\ln x$ with slope $a$ and intercept $\ln b$.
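For reference, here is a minimal sketch of this EXCEL-style approach: fitting a straight line in log-log space with np.polyfit. The x and y values below are made-up illustrative numbers, not the data used later in this article.

import numpy as np

# Minimal sketch: an EXCEL-style power fit is a straight-line fit in log-log space.
# x and y are synthetic values chosen only for illustration.
x = np.arange(1.0, 16.0)
y = 2.0 * x**-0.5                                # exact power law with b=2, a=-0.5

a, ln_b = np.polyfit(np.log(x), np.log(y), 1)    # slope = a, intercept = ln(b)
b = np.exp(ln_b)
print('a : {}, b : {}'.format(a, b))             # recovers a = -0.5, b = 2.0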
Let's see what actually happens in Python. For the data, we use "Changes in the Number of Infant Deaths and the Infant Mortality Rate" from the Cabinet Office's "2011 White Paper on Children and Youth".
read_data.py
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv('Infant mortality rate.csv',encoding='cp932') #encoding='cp932' (Windows Shift-JIS) to prevent garbled characters
display(df)
adjust_data.py
df=df.iloc[3:18,:].rename(columns={'1st-1-Figure 6 Changes in infant mortality and mortality rate':'Annual'\
,'Unnamed: 1':'Infant mortality (persons)'\
,'Unnamed: 2':'Infant mortality (thousands)'\
,'Unnamed: 3':'Infant mortality rate'})
#Create serial number columns for later processing
rank=range(1,len(df)+1)
df['rank']=rank
#All columns are read as object type, so convert the numeric columns explicitly
df['Infant mortality rate']=df['Infant mortality rate'].astype(float)
df['Infant mortality (persons)']=df['Infant mortality (persons)'].str.replace(',','').astype(int) #np.int is deprecated in recent NumPy, so plain int is used
display(df)
plot.py
x=df['Annual']
y=df['Infant mortality rate']
ax=plt.subplot(1,1,1)
ax.plot(x,y)
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality rate')
plt.show()
The infant mortality rate is far lower than it was 60 years ago; medical progress is remarkable. Now let's find the approximation parameters.
func.py
def exp_func(x, a, b):
    #Power function: y = b * x**a
    return b*(x**a)

def exp_fit(val1_quan, val2_quan):
    #maxfev: maximum number of function calls; check_finite: if True, a ValueError is raised when NaN is included
    l_popt, l_pcov = curve_fit(exp_func, val1_quan, val2_quan, maxfev=10000, check_finite=False)
    return exp_func(val1_quan, *l_popt), l_popt
Find the parameters $ a $ and $ b $ of exp_func using exp_fit.
culc_params.py
x=df['Annual']
x2=df['rank']
y=df['Infant mortality rate']
y_fit,l_popt=exp_fit(x2,y)
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality rate')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1])) #check the obtained parameters a and b
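As a side note (this check is my addition, not part of the original workflow), the quality of the fit can also be quantified with the coefficient of determination, reusing y and y_fit from the block above.

# Coefficient of determination R^2 for the fitted curve (uses y and y_fit defined above)
ss_res = np.sum((y - y_fit)**2)        # residual sum of squares
ss_tot = np.sum((y - np.mean(y))**2)   # total sum of squares
print('R^2 : {}'.format(1 - ss_res/ss_tot))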
Looks like a good fit. Next, let's reproduce the EXCEL-style approach: take the logarithm and fit a straight line.
func2.py
def exp_func_log(x, a, b):
    #Log-transformed power function: ln(y) = a*ln(x) + ln(b)
    return a*np.log(x) + np.log(b)

def exp_func_log_fit(val1_quan, val2_quan):
    #Fit ln(y) as a linear function of ln(x)
    l_popt, l_pcov = curve_fit(exp_func_log, val1_quan, np.log(val2_quan), maxfev=10000, check_finite=False)
    return exp_func_log(val1_quan, *l_popt), l_popt

def log_to_exp(x, a, b):
    #Convert the fitted values back from the log scale
    return np.exp(a*np.log(x) + np.log(b))
Find the parameters $a$ and $b$ of exp_func_log using exp_func_log_fit. Since the values approximated with the obtained parameters $a$ and $b$ are $\ln y$, they are converted back from the logarithm by log_to_exp.
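Concretely, converting back from the log scale simply recovers the original power form:

$\exp(a \ln x + \ln b) = e^{\ln b}\, x^{a} = b x^{a}$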
culc_params2.py
x=df['Annual']
x2=df['rank']
y=df['Infant mortality rate']
y_fit,l_popt=exp_func_log_fit(x2,y)
y_fit=log_to_exp(x2,l_popt[0],l_popt[1])
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality rate')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1])) #check the obtained parameters a and b
This also looks reasonable, but to my eye the direct non-linear regression fits this data better.
I'm not sure which one is the right choice in general. However, if you ever run into a situation of "EXCEL can fit it, but Python doesn't give the same result!", this difference may be worth remembering.
When the data values are large and fluctuate strongly, a non-linear regression tends to be pulled toward those large fluctuations. From the point of view of generalization, it may be better to apply a logarithmic transformation and use linear regression.
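One middle ground (my own suggestion, not something the original comparison covers) is to keep the non-linear fit but down-weight large observations by passing curve_fit's sigma argument, here set proportional to y so that roughly relative errors are minimized. This reuses exp_func, x2 and y defined above.

# Sketch: relative-error weighting for the non-linear fit (uses exp_func, x2, y from above)
# sigma proportional to y means large values no longer dominate the least-squares loss.
l_popt_w, l_pcov_w = curve_fit(exp_func, x2, y, sigma=y, maxfev=10000)
print('a : {}, b : {}'.format(l_popt_w[0], l_popt_w[1]))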
dummydata.py
df=pd.read_csv('Infant mortality rate.csv',encoding='cp932')
df=df.iloc[3:18,:].rename(columns={'1st-1-Figure 6 Changes in infant mortality and mortality rate':'Annual'\
,'Unnamed: 1':'Infant mortality (persons)'\
,'Unnamed: 2':'Infant mortality (thousands)'\
,'Unnamed: 3':'Infant mortality rate'})
#Create serial number columns for later processing
rank=range(1,len(df)+1)
df['rank']=rank
#All columns are read as object type, so convert the numeric columns explicitly
df['Infant mortality rate']=df['Infant mortality rate'].astype(float)
df['Infant mortality (persons)']=df['Infant mortality (persons)'].str.replace(',','').astype(int) #np.int is deprecated in recent NumPy, so plain int is used
#Insert dummy data
df2=df.copy()
df2.loc[df2['Annual']=='Heisei 2', 'Infant mortality (persons)']=60000
df2.loc[df2['Annual']=='13', 'Infant mortality (persons)']=40000
df2.loc[df2['Annual']=='15', 'Infant mortality (persons)']=20000
df2.loc[df2['Annual']=='18', 'Infant mortality (persons)']=10000
display(df2)
x=df2['Annual']
y=df2['Infant mortality (persons)']
ax=plt.subplot(1,1,1)
ax.plot(x,y)
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality (persons)')
ax.set_title('Dummy values inserted in Heisei 2, 13, 15 and 18')
plt.show()
dummydata.py
#Nonlinear regression
x=df2['Annual']
x2=df2['rank']
y=df2['Infant mortality (persons)']
y_fit,l_popt=exp_fit(x2,y)
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality (persons)')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1]))
#Logarithmic transformation linear regression
x=df2['Annual']
x2=df2['rank']
y=df2['Infant mortality (persons)']
y_fit,l_popt=exp_func_log_fit(x2,y)
y_fit=log_to_exp(x2,l_popt[0],l_popt[1])
ax=plt.subplot(1,1,1)
ax.plot(x,y,label='obs')
ax.plot(x,y_fit,label='model')
ax.set_xlabel('Annual')
ax.set_ylabel('Infant mortality (persons)')
plt.legend()
plt.show()
print('a : {}, b : {}'.format(l_popt[0],l_popt[1]))
Figures: the approximation by non-linear regression, and the approximation by linear regression after logarithmic transformation.
Clearly, the fit obtained by non-linear regression is pulled by the fluctuations introduced by the dummy values. It seems important to use each method appropriately depending on the situation.