I have been studying deep learning for about half a year, and recently I have become interested in analyzing time series data. On the site below, Mr. Iwahashi, a former Tableau employee, walks through this kind of analysis, and I would like to follow along while also studying Python / Chainer.
URL: Deep learning starting from data understanding by Tableau
Purpose: Create a model that predicts power consumption from meteorological data

--Explanatory variables: Meteorological data + α
--Explained variable: Power consumption
This is still very much beginner work, so I don't claim it is a smart approach; I would welcome comments like "you should do it this way" or "this would be more fun!" (laughs). It is a little long, but I hope you will read it.
The data can be downloaded as a csv file from the following TEPCO and Japan Meteorological Agency sites.
--TEPCO: Download past power usage record data
--Japan Meteorological Agency: Past weather data download
From here, get the data for 2018 (one year's worth). For the meteorological data, I decided to use the following variables (the same ones as Mr. Iwahashi):

--Temperature
--Precipitation
--Daylight hours
--Solar radiation
--Relative humidity
--Local barometric pressure
After downloading the data, let's load it with pandas. I used Google Colaboratory for the analysis. (I started with JupyterLab, but drawing graphs took an unusually long time, so I switched to Google Colaboratory partway through.)
To read the files, upload the downloaded csv files to Google Drive and access them by connecting Google Drive to Google Colaboratory.
First, import the libraries. After this, I import additional libraries as needed (you could import everything at once, but since I was working by trial and error, I imported each one when I needed it).
#Library import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
You can use the code below to connect Google Colaboratory to Google Drive. A quick web search will turn up detailed instructions, so please refer to those.
Reference: The story that mounting on Google Drive in Colaboratory was easy
#Connect with Google Drive
from google.colab import drive
drive.mount('/content/drive')
Once you get this far, it's time to read the file. The "xxx" below can be any folder / file name on your Google Drive.
# Read the file and display the DataFrame
df = pd.read_csv('drive/My Drive/xxx/xxx.csv')
df
**Preprocessing with Python**

Mr. Iwahashi, mentioned above, does the preprocessing and visualization with Tableau (Desktop / Prep), but since I am studying Python this time, I would like to do the same (or similar) work using only Python.
First, let's take a look at the meteorological data (data.csv) downloaded from the Japan Meteorological Agency website. Opened in Excel, it looks like the following; there are rows and columns we don't need.
Extract the rows and columns you want to use from this dataset. (To be honest, with a dataset of this size it would be easier to just edit it in Excel, but here I deliberately extract the data I want in Python, since the point is to practice Python.)
df_weather = pd.read_csv('xxx\weather_data_1_2.csv', header=[0], skiprows=3, engine='python') #,index_col=0
df_weather = df_weather.iloc[1:,[0, 1,4,8,12,18,21]]
df_weather.head()
Date and time | temperature(℃) | Precipitation(mm) | Daylight hours(time) | Insolation(MJ/㎡) | Relative humidity(%) | Local barometric pressure(hPa) |
---|---|---|---|---|---|---|
2018/1/1 1:00:00 | 1.5 | 0.0 | 0.0 | 0.0 | 82.0 | 1009.6 |
2018/1/1 2:00:00 | 1.0 | 0.0 | 0.0 | 0.0 | 83.0 | 1009.5 |
2018/1/1 3:00:00 | 1.2 | 0.0 | 0.0 | 0.0 | 80.0 | 1009.0 |
2018/1/1 4:00:00 | 0.6 | 0.0 | 0.0 | 0.0 | 85.0 | 1008.6 |
2018/1/1 5:00:00 | 1.9 | 0.0 | 0.0 | 0.0 | 80.0 | 1008.8 |
The output looks like the table above.
In the first line of code, **index specification (index_col=), header specification (header=), and skipping unnecessary rows (skiprows)** are used to get the data roughly into the shape we want. In the second line, **.iloc[]** is used to pick out the required rows and columns.
Next, I would like to change the column names, because Japanese column names are awkward to work with. This can be done in one go by creating a variable called columns holding a dictionary of the form {'old name': 'new name'} and passing it to rename(), as follows.
#Convert column names to English
columns = {'Date and time':'Date_Time',
'temperature(℃)':'Temperature',
'Precipitation(mm)':'Precipitation',
'Daylight hours(time)':'Daylighthours',
'Insolation(MJ/㎡)':'SolarRadiation',
'Relative humidity(%)':'Humidity',
'Local barometric pressure(hPa)':'Airpressure'}
df_weather.rename(columns=columns, inplace=True)
df_weather.head(3)
Next, let's check the data. I always check the following (please let me know if there is anything else worth checking).
# Check the shape, data types, and summary statistics
print(df_weather.shape)
print(df_weather.dtypes)
df_weather.info()
print(df_weather.describe())
When downloading data from the Japan Meteorological Agency, there is a limit on how much data can be downloaded at once, so a full year has to be downloaded in several batches (in my case, I could only get about two months of data per file). So the multiple files need to be joined together.
Here, the six DataFrames read from those files are named df_weather1 through df_weather6.
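Reading each of the six files works the same way as the first one; a minimal sketch, assuming the csv files are named weather_data_1.csv through weather_data_6.csv and sit in the same Drive folder (both assumptions on my part):

# Read the six csv files with the same header / skiprows / column selection as before (file names are assumptions)
weather_dfs = []
for i in range(1, 7):
    df_tmp = pd.read_csv('drive/My Drive/xxx/weather_data_' + str(i) + '.csv', header=[0], skiprows=3, engine='python')
    df_tmp = df_tmp.iloc[1:, [0, 1, 4, 8, 12, 18, 21]]
    df_tmp.rename(columns=columns, inplace=True)  # reuse the Japanese-to-English column mapping defined earlier
    weather_dfs.append(df_tmp)
df_weather1, df_weather2, df_weather3, df_weather4, df_weather5, df_weather6 = weather_dfs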
df_weather_integrated = pd.concat([df_weather1, df_weather2, df_weather3, df_weather4, df_weather5, df_weather6], ignore_index=True)
I named the combined DataFrame df_weather_integrated (it's a bit long...) and used **pd.concat()** to concatenate the six DataFrames vertically.
Next is checking for missing values. Check for them with the familiar **dataframe.isnull().any()**, which returns a boolean (True or False) for each column.
df_weather_integrated.isnull().any()
Column | Has missing values |
---|---|
Date_Time | False |
Temperature | False |
Precipitation | False |
Daylighthours | True |
SolarRadiation | True |
Humidity | True |
Airpressure | False |
Now that we know there are missing values in Daylighthours, SolarRadiation, and Humidity, let's check how many there are with **.isnull().sum()**.
df_weather_integrated.isnull().sum()
Column | Number of missing values |
---|---|
Date_Time | 0 |
Temperature | 0 |
Precipitation | 0 |
Daylighthours | 1 |
SolarRadiation | 1 |
Humidity | 27 |
Airpressure | 0 |
Now you know the number of missing values in the meteorological data. Later, we will process the missing values together with the power consumption data.
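(The actual handling of these missing values is planned for the next article; purely as a preview, a minimal sketch of one common option, linear interpolation, might look like the code below. Whether interpolation is the right choice for this data is my assumption, not something decided here.)

# Preview only: fill the gaps by linear interpolation (the choice of method is an assumption)
df_weather_filled = df_weather_integrated.copy()
for col in ['Daylighthours', 'SolarRadiation', 'Humidity']:
    df_weather_filled[col] = pd.to_numeric(df_weather_filled[col], errors='coerce')
    df_weather_filled[col] = df_weather_filled[col].interpolate()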
Next, let's take a look at the power consumption data downloaded from the TEPCO website. Opened in Excel, it looks like the following.
df_Elec = pd.read_csv('xxx\elec_2018.csv',header=[0],skiprows=1, engine='python')
Here, as before, header and skiprows are used to remove unnecessary rows. For the power consumption data, change the column names (Japanese >> English), check the data types, and check for missing values (details omitted), just as with the meteorological data.
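For reference, a minimal sketch of what that rename and those checks might look like; the Japanese column names below are placeholders I made up for illustration, not the actual headers of the TEPCO csv:

# Rename the power consumption columns (the Japanese keys here are assumptions, not the real TEPCO headers)
columns_elec = {'日付': 'DATE',
                '時刻': 'TIME',
                '実績(万kW)': 'kW'}
df_Elec.rename(columns=columns_elec, inplace=True)
print(df_Elec.dtypes)
print(df_Elec.isnull().sum())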
[After conversion]
DATE | TIME | kW | |
---|---|---|---|
0 | 2018/1/1 | 0:00 | 2962 |
1 | 2018/1/1 | 1:00 | 2797 |
2 | 2018/1/1 | 2:00 | 2669 |
・ ・ ・ | ・ ・ ・ | ・ ・ ・ | ・ ・ ・ |
As you may have noticed, row 0 is 2018/1/1 0:00. The weather data starts at 1:00 on January 1, 2018, so delete row 0 and reset the index.
df_Elec = df_Elec.drop(df_Elec.index[0])
df_Elec =df_Elec.reset_index(drop=True)
df_Elec.head(3)
[After reconversion]
DATE | TIME | kW | |
---|---|---|---|
0 | 2018/1/1 | 1:00 | 2962 |
1 | 2018/1/1 | 2:00 | 2797 |
2 | 2018/1/1 | 3:00 | 2669 |
・ ・ ・ | ・ ・ ・ | ・ ・ ・ | ・ ・ ・ |
Next, since the date and time in the meteorological data are combined into a single column, combine the first and second columns of the power consumption data so that it has the same shape as the meteorological data. Here, **pd.to_datetime** is used to create a new datetime-typed column (DATE_TIME).
# Combine DATE and TIME to create a new DATE_TIME column
df_Elec['DATE_TIME'] = pd.to_datetime(df_Elec['DATE'] +' '+ df_Elec['TIME'])
df_Elec = df_Elec[['DATE_TIME', 'kW']]
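One thing to watch out for: if Date_Time in the weather DataFrame was left as a string when it was read from csv, it will not match the datetime-typed DATE_TIME key in the merge below. A minimal sketch of converting it first, assuming it is in fact still a string at this point:

# Make sure the weather-side key is also datetime type before merging (assumes Date_Time is still a string)
df_weather_integrated['Date_Time'] = pd.to_datetime(df_weather_integrated['Date_Time'])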
You are now ready to combine the weather and power consumption data. Let's combine the two DataFrames using **pd.merge()**.
#Combine two files
df = pd.merge(df_weather_integrated, df_Elec, left_on='Date_Time', right_on='DATE_TIME', how='left')
The two files are now merged (horizontally)! Looking at this DataFrame, an unnecessary column (DATE_TIME) remains and the last row has NaN in kW, so we deal with these at the end.
# Keep only the columns we use
df = df[['Date_Time','Temperature', 'Precipitation', 'Daylighthours', 'SolarRadiation', 'Humidity', 'Airpressure', 'kW']]
# Drop the last row (index 8759), where the power consumption data and meteorological data do not line up
df.drop(8759, inplace=True)
Now you can finally analyze the data! (It was long...)
Next time, I would like to analyze the preprocessed data with chainer!
This time I covered the following four points:

--Data acquisition
--Data reading & confirmation
--Data shaping
--Data combination
Next time, I would like to handle missing values.