Use PyCaret to predict the price of pre-owned apartments in Tokyo!

§ What is PyCaret?

In this article, I would like to walk through the steps of predicting the price of a pre-owned condominium by building a prediction model with PyCaret, which has become a hot topic on Twitter and elsewhere.

PyCaret lets you perform a series of machine learning steps, such as building predictive models, tuning them, and making predictions, with a minimum of Python code.

For those who are about to start machine learning, I would like to proceed while introducing what machine learning is through PyCaret. Please note that, since I emphasize intuitive understanding, some expressions may differ from the strict definitions.

With PyCaret, the various processes required to build a machine learning prediction model can be executed with simple code. It is an excellent tool, and I was surprised at how convenient it is to use. And it's free!

That said, if you don't understand how laborious the machine learning process is in the first place, you may not appreciate how valuable this is.

 

§ What is supervised machine learning?

Creating a predictive model and improving its accuracy is a really difficult task.

First of all, I would like to review the work required for "supervised machine learning".

In supervised machine learning, a rule (model) is derived from past data for which the result (the answer) is already known, and that rule is then applied to predict results we want to know.

For example, if you have a set of property conditions and prices for pre-owned condominiums sold in the past, you can predict the price of a property once you know conditions such as the nearest station, the floor area in square meters, and the building's age. Having a machine derive this rule using various algorithms is machine learning.

The steps of supervised machine learning are roughly divided as follows.

① Divide the data into training data and validation data

Split the data into training data and validation data so that the model is validated with data that was not used to train it. Evaluating the model on its own training data would be like cheating. Normally, so that the training data and the data used to verify the model's accuracy are used evenly and fairly, the data is split and the roles are swapped in turn (cross-validation).

② Model creation (learning) and hyperparameter tuning

Decide on the algorithm to use for prediction, train it on the training data to create a model, and tune it to improve accuracy.

③ Model evaluation

Evaluate the accuracy of the trained model: how large its error is and how useful it is.

④ Prediction of unknown values

Once the evaluation is complete, prediction results are obtained by feeding data whose outcome is unknown into the prediction model. The image here is that we know only the conditions of a condominium property, not its price, and we predict that price.

 

§ Let's actually try PyCaret

Now let's carry out this flow using PyCaret.

First, let's install PyCaret with pip.

This time I'm running pip from the Anaconda PowerShell.
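For reference, the installation is a single command:

```
pip install pycaret
```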

First, let's load the data. As before, we use the prices of pre-owned condominiums along the Tokyo Chuo Line, used in "Data Science Cram School Starting with Tableau", as the teacher (training) data. The data itself has the following contents.

The explanation of each item is as follows.

| Item name | Description | Data type |
| --- | --- | --- |
| uid | Unique ID of the property | String |
| years | Age of the building | Integer |
| minutes | Walking distance (minutes) from the nearest station | Integer |
| sqrm | Floor area of the property in square meters | Integer |
| distance | Straight-line distance between Tokyo Station and the nearest station (how far from the city center?) | Floating point |
| renovate | Renovation flag | Integer (0 or 1) |
| express | Flag for whether the express train stops | Integer (0 or 1) |
| price | Transaction price (objective variable) | Integer (unit: 1 yen) |
Load the data into pandas and drop the unique ID, which is not needed for learning.
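As a minimal sketch of this step (the CSV file name is an assumption, not the actual file used here):

```python
import pandas as pd

# Hypothetical file name; substitute the path to your own teacher data
df = pd.read_csv("chuo_line_used_mansion.csv")

# Drop the unique ID, which is not needed for learning
df = df.drop(columns=["uid"])
df.head()
```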

Import PyCaret. Since this is a regression model that predicts a price, import from pycaret.regression.

Set the objective variable (the variable you want to predict) to the apartment price "price" and run the setup() function. It automatically infers whether each field is a numeric column or a categorical column; if everything looks right, press the Enter key to confirm.

The necessary preprocessing is then performed.
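The import and setup step would look roughly like this (session_id is optional and is only added here for reproducibility):

```python
from pycaret.regression import *

# Declare "price" as the objective variable; setup() infers whether each
# column is numeric or categorical, asks you to confirm with Enter,
# and then runs the required preprocessing
exp = setup(data=df, target="price", session_id=123)
```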

compare_models() evaluates and compares a variety of models and sorts them in order of smallest error. Internally, cross-validation (swapping the teacher data and the validation data evenly and fairly) is carried out; by default, fold = 10 (the data is split into 10 parts and the training and validation sets are rotated), and the average scores are displayed.
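A sketch of the comparison step (in recent PyCaret versions the best-scoring model is also returned):

```python
# 10-fold cross-validated comparison of the available regressors;
# the results table is sorted by error
best = compare_models()
```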

This time, the "CatBoost Regressor" scored best. Since the RMSE is about 6.3 million yen, you can roughly interpret this as an average error of about 6.3 million yen per condominium.

Execute the create_model() function to actually train the model you want to use. Shift + Tab shows guidance on what to pass as an argument; here we specify the "CatBoost Regressor", which had the best score.
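A sketch of this step ('catboost' is the estimator ID corresponding to the CatBoost Regressor):

```python
# Train a CatBoost Regressor with default hyperparameters (10-fold CV)
catboost = create_model("catboost")
```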

When a model is created with create_model(), its hyperparameters (parameters governing model behavior that are fixed before training) are left at their defaults. tune_model() searches for appropriate hyperparameters within a preset range; a random search seems to be used here as the parameter search method.
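A sketch of the tuning step (in recent PyCaret versions, tune_model() takes the trained model object):

```python
# Random search over a preset hyperparameter space, again with cross-validation
tuned_catboost = tune_model(catboost)
```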

Visualize and check the accuracy of the model with plot_model().

In the Error Plot, if the predictions equal the actual values, the points fall along the straight line y = x.

The Feature Importance Plot shows which variables contribute most strongly to the prediction. In this case, you can see that the influence is strong in the order of age (years), floor area (sqrm), distance (straight-line distance from Tokyo Station to the nearest station), and minutes (walking minutes from the nearest station).
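Both plots can be produced with plot_model(); a sketch:

```python
# Prediction error plot: predicted vs. actual values (ideally along y = x)
plot_model(tuned_catboost, plot="error")

# Feature importance plot: which explanatory variables drive the prediction
plot_model(tuned_catboost, plot="feature")
```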

Finally, execute finalize_model() to finalize the model. Printing the result here shows the optimized hyperparameters.
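A sketch of the finalization step:

```python
# Retrain the tuned model on the full dataset and inspect its hyperparameters
final_catboost = finalize_model(tuned_catboost)
print(final_catboost)
```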

Now that we have a trained model, we prepare apartment price information scraped from a real estate listings page on the Web, in the same format as the data used to build the model, and put it in a data frame named "data_unseen".

 

Let's feed this "data_unseen" into the prediction model we just created and have it predict the price.

This operation is very easy: extract the explanatory variables in the same form as the training data and use the predict_model() function to compute the predicted values.
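A sketch of the prediction step (in the PyCaret version used here, the predictions are added as a "Label" column):

```python
# data_unseen holds the scraped listings with the same explanatory variables
predictions = predict_model(final_catboost, data=data_unseen)
predictions.head()
```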

Below, you can see that the predicted values have been added as "Label". Normally we would not know the objective variable "price" for unknown data, but this time the condominium prices are already published on the Web, so "price" is included in the data (note that the unit of the published information is 10,000 yen). Comparing the "price" published on the Web with the value "Label" predicted by the model, you can see that the predictions come out reasonably well. (Roughly 20 to 40 million yen seems to be the typical range of apartment prices in Tokyo.)

This data can be exported to CSV, so let's write it to CSV and then connect it to Tableau for visualization.
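A sketch of the export (the file name is an assumption):

```python
# Write the predictions to CSV for visualization in Tableau
predictions.to_csv("predictions.csv", index=False)
```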

If you compare the published price "price" on the Web with the "Label" calculated by the prediction model, you may find "bargain properties" whose actual price is cheaper than predicted, and vice versa. (From here on, the way the estimated price is used in Tableau is the same as in this link.)

Define the "bargain" as a calculated field as shown below: if the predicted price is higher than the posted price, the property is "cheaper than expected!", so the bargain value is larger; conversely, if the posted price is higher than predicted, it is "more expensive than expected!", so the bargain value is smaller.
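The actual calculated field is defined in Tableau, but as a hypothetical pandas equivalent of the same logic (assuming, as noted above, that the predicted "Label" is in yen and the published "price" is in units of 10,000 yen):

```python
# Hypothetical equivalent of the Tableau calculated field:
# positive = cheaper than predicted (a bargain), negative = more expensive
predictions["bargain"] = predictions["Label"] - predictions["price"] * 10000
```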

If you draw a box plot for each station and visualize it, it will be as follows.

There are discoveries such as: Yotsuya has too wide a spread to be reliable, Kichijoji has low bargains, and Nishi-Kokubunji has high bargains.

This suggests that, in addition to the explanatory variable for the distance from Tokyo Station to the nearest station (how close it is to the city center), the results cannot be explained unless new factors such as the "popularity of the nearest station" are taken into account. (*Note: whether a property is actually a bargain depends on individual values; this article is written on the premise that it cannot be judged unconditionally.)

As a further development of data utilization, latitude and longitude are geocoded from the scraped data and the bargain properties are plotted on a map. Below, orange means the property is a bargain, blue means it is not, and the size of the circle represents the property's price. By clicking a property's point on the map, you can check the property's image information and the surrounding townscape through integration with Google Street View. It is not shown here, of course, but it is also possible to display property information in the dashboard by linking to the posted web page.

As described above, PyCaret runs the entire process normally required to build a machine learning model, so it was a shock to be able to create prediction models and tune hyperparameters (for free) with almost no Python coding or debugging.

Even if you later tune in Python for more accuracy, using PyCaret as a starting point can significantly reduce the initial workload.

However, it still seems necessary to have a minimum knowledge of pandas and a basic understanding of what is being executed in the machine learning workflow.

In addition, the value of prediction data can be multiplied by visualizing errors with Tableau and the like, checking them against domain knowledge, considering new explanatory variables to add, and making use of the prediction results in an interactive dashboard.

In addition to the regression covered this time, PyCaret seems to be able to handle classification, clustering, anomaly detection, natural language processing, and association analysis, so I would like to add practical examples as I use them in the future.

Classification / Regression / Clustering / Anomaly Detection / Natural Language Processing / Association Rule Mining

Please give it a try.

References:

Announcing PyCaret 1.0.0

CatBoost Regressor

Qiita article:

Parameter tuning struggle for GBDT machine learning models ~ CatBoost vs LightGBM vs XGBoost vs Random Forests ~ Part 1 (@KYORO)
