Using "regression", which is the basis of deep learning, I would like to make a program to predict the house price. I will write from a beginner's point of view as much as possible. The previous article is here .
Regression is the task of __predicting numerical values based on feature data__. This time we will create a program to predict house prices, but the same approach can also predict price movements of stocks and FX (foreign exchange trading).
Import
The imports are shown below. This time, let's also import pandas so we can inspect the data.
As an aside, pandas is very convenient to work with but quite slow. It is common to use numpy for training and pandas for visual checks and data preprocessing.
from tensorflow.keras.datasets import boston_housing
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
This time, we will use a dataset called boston_housing to predict house prices. boston_housing contains feature information on homes in Boston, USA, together with the correct labels. The features (hereinafter referred to as explanatory variables) include things such as the area's crime rate and accessibility.
Obviously, if the explanatory variables contain sloppy information, prediction accuracy will be poor. For example, adding the number of pachinko parlors in the neighborhood as an explanatory variable would only matter to pachinko players, so it would just interfere with the prediction. We have to add features that everyone finds valuable.
The most difficult part of regression prediction is defining these explanatory variables. This time it is easy, because they are already included in the downloaded data.
The download is done with the code below. The downloaded explanatory variables and correct labels are split into training and validation sets: (train_data, train_labels) is for training and (test_data, test_labels) is for validation. This part is the same as the classification task in the previous article.
(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()
Check the shapes. There are 404 training samples and 102 validation samples. Compared to the classification task in the previous article, that is far less data. It is a sample size small enough to make me wonder whether we can really make predictions.
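A quick way to confirm this is to print the shapes of the arrays loaded above (the counts below are what boston_housing actually returns):
#Check the shapes of the training and validation data
print(train_data.shape, train_labels.shape) #(404, 13) (404,)
print(test_data.shape, test_labels.shape) #(102, 13) (102,)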
Next comes the simplest yet most important task: data preprocessing. These house prices are not time-series data, so there is no continuity between samples. When predicting this kind of data, it is safer to shuffle it first. Use np.random.random() to generate random numbers and np.argsort() to turn them into a shuffled index for sorting.
order = np.argsort(np.random.random(train_labels.shape))
train_data = train_data[order]
train_labels = train_labels[order]
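As an aside, np.random.permutation() can produce the same kind of shuffled index directly (my own variation, not the original code):
#Equivalent shuffle using a random permutation of the indices
order = np.random.permutation(len(train_labels))
train_data = train_data[order]
train_labels = train_labels[order]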
Next, normalize the explanatory variables that determine the house price. Here, normalization means transforming each explanatory variable so that it has a mean of 0 and a variance of 1.
In regression prediction, explanatory variables with large values can dominate the prediction, and it is said that normalizing in this way helps.
Normalization is calculated by subtracting the mean from the data you want to normalize and dividing by the standard deviation. Note that the test data is also normalized with the training data's mean and standard deviation. The code is below.
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std
Use pandas to make sure the explanatory variables are normalized.
#Checking the data after preprocessing the dataset
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df = pd.DataFrame(train_data, columns=column_names)
df.head()
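To double-check numerically (a quick sketch using the DataFrame created above), each column's mean should now be roughly 0 and its standard deviation roughly 1:
#Verify the normalization: means should be ~0 and standard deviations ~1
print(df.describe().loc[['mean', 'std']].round(3))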
Create the neural network model. This time, we will prepare three fully connected (Dense) layers.
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(13,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer=Adam(learning_rate=0.001), metrics=['mae']) #compile
I will explain each one.
First, create a sequential model with model = Sequential().
model.add(Dense(64, activation='relu', input_shape=(13,))) is the input layer.
Dense: fully connected layer, number of units: 64, activation function: ReLU, explanatory variables: 13
model.add(Dense(64, activation='relu')) is a hidden layer.
Dense: fully connected layer, number of units: 64, activation function: ReLU
model.add(Dense(1)) is the output layer.
Since this is a numerical prediction, there is only one unit (one output).
Compile with model.compile(loss='mse', optimizer=Adam(learning_rate=0.001), metrics=['mae']).
With loss='mse', we set the __loss function that measures the error between the predicted and actual values__. Let's use mse (mean squared error), which is said to be well suited to regression.
With optimizer=Adam(learning_rate=0.001), we set Adam as the __optimization function that reduces the error__, with a learning rate of 0.001.
With metrics=['mae'], we set mae (mean absolute error) as the __evaluation function that assesses the model's performance__.
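If you want to see the resulting architecture at a glance (just a convenience check, not part of the original steps), model.summary() prints the layers and parameter counts:
#Show the layers and parameter counts
model.summary()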
This time I would like to train using EarlyStopping. With EarlyStopping, if no improvement is seen within the specified number of epochs, training stops automatically. Here, training will stop if there is no improvement for 30 epochs, matching patience=30 in the code below.
For training, set the maximum number of epochs to 500, use 20% of the training data for validation with validation_split=0.2, and enable EarlyStopping with callbacks=[early_stop].
#Prepare for Early Stopping
early_stop = EarlyStopping(monitor='val_loss', patience=30)
#Execution of learning
history = model.fit(train_data, train_labels, epochs=500,
validation_split=0.2, callbacks=[early_stop])
Here is an explanation of the values shown during training. For all of them, the closer to 0, the better the result.
__loss__ is the error (mse) on the training data.
__mae__ is the mean absolute error on the training data.
__val_loss__ is the error (mse) on the validation data.
__val_mae__ is the mean absolute error on the validation data.
Although the maximum number of epochs is set to 500, training was probably cut short because no further improvement was seen along the way.
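You can check how many epochs actually ran (a small check of my own; history.history stores one value per completed epoch):
#Number of epochs actually executed before EarlyStopping stopped training
print(len(history.history['loss']))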
Plot the training results saved in history.history with matplotlib.
plt.plot(history.history['mae'], label='train mae')
plt.plot(history.history['val_mae'], label='val mae')
plt.xlabel('epoch')
plt.ylabel('mae [1000$]')
plt.legend(loc='best')
plt.ylim([0,5])
plt.show()
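Out of curiosity, the loss (mse) curves can be plotted the same way using the same history object, just with different keys:
#Plot the training and validation loss as well
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('loss [mse]')
plt.legend(loc='best')
plt.show()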
Evaluate the model on the validation data with model.evaluate().
test_loss, test_mae = model.evaluate(test_data, test_labels)
print('loss:{:.3f}\nmae: {:.3f}'.format(test_loss, test_mae))
The result is a little worse than on the training data, but almost the same. It's impressive that with only about 400 samples we can get numbers this close. Perhaps the explanatory variables are simply well defined.
Finally, let's output some predictions and check them. Display the correct labels first, then run inference. Since the output of model.predict() is two-dimensional, convert it to one dimension with flatten().
#Display correct label
print(np.round(test_labels[0:10]))
#Display of inferred price
test_predictions = model.predict(test_data[0:10]).flatten()
print(np.round(test_predictions))
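To line the two up side by side (a small convenience of my own, reusing the pandas imported earlier):
#Correct labels and predicted prices side by side
comparison = pd.DataFrame({'label': test_labels[0:10], 'prediction': np.round(test_predictions, 1)})
print(comparison)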
It looks like we can get numbers close to the correct labels.
Properties selling below this predicted price may be cheaper than the market rate. However, they may also be cheap for reasons that the explanatory variables cannot capture (such as being haunted). It would be dangerous to buy or sell based on this prediction alone, but I think it can serve as a useful reference.