Thank you for taking the time to read this article!

Let me start by introducing myself. I'm a working adult who enjoys learning Python in my spare time. I was new not just to programming but to PCs in general, and I started studying Python on September 1st, working through Progate, PyQ, and Aidemy, so it has been about two months since I began programming. Having finished Aidemy's data analysis course, I wanted to produce some output, so I decided to write this article.
Like many people who are working or in school, I can't set aside a lot of time for learning programming. As I wrote above, I am still very much a beginner, so I would be happy if you treat this article as one sample of how much can be done in about two months.
Environment: Python 3 / MacBook Air / Jupyter Notebook
The goal: create a SARIMA model (a type of time series model) that predicts Japan's GDP, and plot the actual and predicted values on a graph.
import csv
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from datetime import datetime
from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools
Data: real GDP from e-stat (the Japanese government statistics portal).
__1. Pre-process data __ Now that the goal is clear, we work backward from it and shape the raw data.
#①
df = pd.read_csv('gaku-jg2022 (1).csv',encoding="shift-jis")
df = df.drop(range(0,6)) #Erase unnecessary lines
df = df.drop([110,111,112])
df = df.drop(df.columns[range(2, 30)], axis=1) #Erase unnecessary columns
df = df.reset_index(drop=True) #Renumber lines
df = df.rename(columns={'Substantial original series': 'Date'}) #Retitle the column
df = df.rename(columns={'Unnamed: 1': 'RealGDP'})
#②
#Process the data in the Time column
j = 1994
k = 0
for i in range(len(df["Date"])):
    df.loc[i, "Date"] = j
    k += 1
    if k % 4 == 0:  #Advance the year every four quarters
        j += 1
df["Date"] #Check the Date column (displays in the notebook)
index = pd.date_range("1994","2020",freq = "Q")#Separate data quarterly
df.index = index
del df["Date"]
#③
#Process Real GDP
i = 0
for x in df["RealGDP"]:
    x = x.replace(',', '')  #Strip the thousands separators
    df.iloc[i, 0] = float(x)
    i += 1
In ①, I drop the columns the raw data doesn't need (such as the private final consumption expenditure column), the rows that contain non-numeric values such as variable names and blanks, and the FY2020 rows, which are anomalous. In ②, I overwrite the existing Date values and then index the frame by time (everything from the `index` variable onward). In ③, since the values in the GDP column are strings containing ",", I convert them to float so they can be plotted.
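As an aside, the comma-stripping loop in ③ can also be written as a single vectorized call. A minimal sketch on a toy column (made-up values, not the e-stat file):

```python
import pandas as pd

# Toy frame mimicking the comma-formatted GDP strings
df = pd.DataFrame({"RealGDP": ["111,694", "112,106", "112,644"]})

# Strip the thousands separators and convert the whole column at once
df["RealGDP"] = pd.to_numeric(df["RealGDP"].str.replace(",", "", regex=False))

print(df["RealGDP"].tolist())  # → [111694, 112106, 112644]
```

This avoids the explicit index bookkeeping and keeps the column's dtype numeric from the start.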
__2. Graph display __ Displaying the data processed in step 1 as a graph with the following code:
#Represent the data as a line graph
#Set the title of the graph
plt.title("quarterly-RealGDP_in_Japan")
#Graph x-axis and y-axis naming
plt.xlabel("date")
plt.ylabel("GDP")
#Data plot
plt.plot(df)
plt.show()
The horizontal axis is time and the vertical axis is GDP. The series swings up and down in the short term and trends upward in the long term. One exception: GDP drops sharply around 2008 due to the Lehman shock (the global financial crisis). Analyzing it myself, that is about as much as I can read from the chart. What patterns will the machine find in this data, and what kind of prediction will it make? I'm looking forward to it!
__3. Determine parameters __ The SARIMA model requires seven parameters: one determined visually from the graph, and the other six chosen by a function.

The first is the period s: the number of time steps it takes for the repeating pattern in the data to complete one cycle. Looking at the graph displayed above, the up-and-down movement repeats four times in four years, so one cycle (one unit of the pattern) takes one year. Since the data we are dealing with this time is quarterly, four data points correspond to one year, which gives s = 4. Next, the remaining six parameters are selected by the following function.
#Determine the parameters of the SARIMA model by a BIC grid search
def selectparameter(DATA, s):
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    #Note: the seasonal period is hardcoded to 12 here, so the s argument is not
    #actually used; replacing 12 with s would search the quarterly period 4 instead
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
    parameters = []
    BICs = np.array([])
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(DATA,
                                                order=param,
                                                seasonal_order=param_seasonal)
                results = mod.fit()
                parameters.append([param, param_seasonal, results.bic])
                BICs = np.append(BICs, results.bic)
            except:
                continue
    best = parameters[np.argmin(BICs)]
    print(best)
    return best
selectparameter(df["RealGDP"].values.astype(float), 4)
Output: [(0, 1, 0), (1, 1, 0, 12), 1641.6840970980422] (CPU times: user 19.1 s, sys: 8.17 s, total: 27.3 s; wall time: 14.2 s)
Set the parameters to (0, 1, 0), (1, 1, 0, 12) from the output result.
__4. Model Fitting and Prediction __
#Model fit
SARIMA_df = sm.tsa.statespace.SARIMAX(df.astype("float64"), order=(0, 1, 0), seasonal_order=(1, 1, 0, 12)).fit()
#Substitute prediction data for pred
pred = SARIMA_df.predict("2015-03-31", "2022-12-31")
#Visualization of pred data and original time series data
plt.plot(df)
plt.plot(pred, color="r")
plt.show()
The model predicts GDP from March 31, 2015 through December 31, 2022; the actual values are plotted in blue and the predicted values in red. The graph looks like this:
Since blue and red overlap quite closely, the prediction can be called a good fit. However, because the model does not account for the economic impact of the new coronavirus, the predicted values from then on can be expected to diverge considerably from the actual values. I would like to wait for the future measured values and compare.
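"Blue and red overlap quite closely" can also be put into a number, for example as the mean absolute percentage error (MAPE) over the period where actual and predicted values overlap. A sketch on toy arrays (hypothetical numbers, not the GDP output):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

actual = [530.0, 534.0, 529.0, 536.0]     #e.g. quarterly values (toy numbers)
predicted = [528.0, 535.0, 531.0, 534.0]
print(round(mape(actual, predicted), 2))  # → 0.33
```

A small MAPE (well under a few percent) would back up the visual impression that the fit is good.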
Even without a rigorous understanding of everything, two months ago I couldn't even touch-type, so being able to train a machine and make predictions leaves me a little moved. I still have courses remaining in my Aidemy plan, so I will study other topics and come back here to write up output beyond data analysis.
Thank you very much for reading to the end!
References: e-stat / Aidemy Data Analysis Course / Population Trends in Japan by Machine Learning / [Big data analysis methods and the "SARIMA model" that predict the future](https://deepage.net/bigdata/2016/10/22/bigdata-analytics.html#sarima%E3%83%A2%E3%83%87%E3%83%AB)