Since I started studying machine learning I had been absorbing knowledge steadily, so I thought: time to try Kaggle! I tried, and promptly hit a wall. I had no idea how to approach anything, and my motivation to keep learning sank.
Figuring nothing good would come of staying stuck, I went looking for something fun and found this article: [For super beginners] Python environment setup & scraping & machine learning & practical application, fun to run just by copy-pasting [Let's find a good rental property with SUUMO!]
I happened to be looking for a rental property at the time, so it was a perfect fit and I gave it a try.
I implemented it with that article as a base, and in this post I'd like to introduce the various improvements I made on top of it.
This post is for self-proclaimed machine learning beginners like me: people who have absorbed plenty of input but don't know what to do with it next. I don't explain basic machine learning terms or methods, so please bear with me on that.
Environment: Windows 10 Home, Python 3.7.3, Jupyter Notebook (for testing)
Source code https://github.com/pattatto/scraping
Let me give you an overview first. The source is organized into five scripts, run in this order:

1. suumo_getdata.py — scrape the listing data from SUUMO
2. Preprocessing.py — clean and encode the scraped data
3. Feature_value.py — engineer features
4. model_lightgbm.py — train and validate a LightGBM model
5. Create_Otoku_data.py — score all listings and output the bargains

Originally this was a single piece of code, but I modularized it into these stages. Each stage writes its result to CSV, and the next stage reads that CSV whenever it needs it (see the runner sketch below).
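Because each stage only depends on the previous stage's CSV, the whole pipeline can be driven by a trivial runner. This helper is my own addition, not part of the repository; it simply runs each stage in order:

```python
# run_all.py -- hypothetical convenience runner, not part of the repo
import subprocess

for script in ['suumo_getdata.py', 'Preprocessing.py', 'Feature_value.py',
               'model_lightgbm.py', 'Create_Otoku_data.py']:
    subprocess.run(['python', script], check=True)  # stop if a stage fails
```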
suumo_getdata.py
```python
from bs4 import BeautifulSoup
import re
import requests
import time
import pandas as pd
from pandas import Series

url = input()
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, 'html.parser')

# Get the total number of result pages
summary = soup.find("div", {'id': 'js-bukkenList'})
body = soup.find("body")
pages = body.find_all("div", {'class': 'pagination pagination_set-nav'})
pages_text = str(pages)
pages_split = pages_text.split('</a></li>\n</ol>')
pages_split0 = pages_split[0]
pages_split1 = pages_split0[-3:]
pages_split2 = pages_split1.replace('>', '')  # a stray '>' remains when the page count has two digits, so remove it
pages_split3 = int(pages_split2)

urls = []
urls.append(url)

# From the second page on, '&page=N' is appended to the URL
for i in range(pages_split3 - 1):
    pg = str(i + 2)
    url_page = url + '&page=' + pg
    urls.append(url_page)

names = []
addresses = []
buildings = []
locations0 = []
locations1 = []
locations2 = []
ages = []
heights = []
floors = []
rent = []
admin = []
others = []
floor_plans = []
areas = []
detail_urls = []

for url in urls:
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c, 'html.parser')
    summary = soup.find("div", {'id': 'js-bukkenList'})
    apartments = summary.find_all("div", {'class': 'cassetteitem'})

    for apartment in apartments:
        room_number = len(apartment.find_all('tbody'))
        name = apartment.find('div', class_='cassetteitem_content-title').text
        address = apartment.find('li', class_='cassetteitem_detail-col1').text
        building = apartment.find('span', class_='ui-pct ui-pct--util1').text

        # Append the building-level fields once per room in the building
        for i in range(room_number):
            names.append(name)
            addresses.append(address)
            buildings.append(building)

        sublocation = apartment.find('li', class_='cassetteitem_detail-col2')
        cols = sublocation.find_all('div')
        for i in range(len(cols)):
            text = cols[i].find(text=True)
            # Likewise, append each access entry once per room
            for j in range(room_number):
                if i == 0:
                    locations0.append(text)
                elif i == 1:
                    locations1.append(text)
                elif i == 2:
                    locations2.append(text)

        age_and_height = apartment.find('li', class_='cassetteitem_detail-col3')
        age = age_and_height('div')[0].text
        height = age_and_height('div')[1].text
        for i in range(room_number):
            ages.append(age)
            heights.append(height)

        table = apartment.find('table')
        rows = []
        rows.append(table.find_all('tr'))  # one <tr> per room
        data = []
        for row in rows:
            for tr in row:
                cols = tr.find_all('td')  # the <td> cells hold the per-room details
                if len(cols) != 0:
                    _floor = cols[2].text
                    _floor = re.sub('[\r\n\t]', '', _floor)
                    _rent_cell = cols[3].find('ul').find_all('li')
                    _rent = _rent_cell[0].find('span').text    # rent
                    _admin = _rent_cell[1].find('span').text   # management fee
                    _deposit_cell = cols[4].find('ul').find_all('li')
                    _deposit = _deposit_cell[0].find('span').text
                    _reikin = _deposit_cell[1].find('span').text
                    _others = _deposit + '/' + _reikin         # deposit / key money
                    _floor_cell = cols[5].find('ul').find_all('li')
                    _floor_plan = _floor_cell[0].find('span').text
                    _area = _floor_cell[1].find('span').text
                    _detail_url = cols[8].find('a')['href']
                    _detail_url = 'https://suumo.jp' + _detail_url
                    text = [_floor, _rent, _admin, _others, _floor_plan, _area, _detail_url]
                    data.append(text)
        for row in data:
            floors.append(row[0])
            rent.append(row[1])
            admin.append(row[2])
            others.append(row[3])
            floor_plans.append(row[4])
            areas.append(row[5])
            detail_urls.append(row[6])

    time.sleep(3)  # wait between pages so as not to hammer the server

names = Series(names)
addresses = Series(addresses)
buildings = Series(buildings)
locations0 = Series(locations0)
locations1 = Series(locations1)
locations2 = Series(locations2)
ages = Series(ages)
heights = Series(heights)
floors = Series(floors)
rent = Series(rent)
admin = Series(admin)
others = Series(others)
floor_plans = Series(floor_plans)
areas = Series(areas)
detail_urls = Series(detail_urls)

suumo_df = pd.concat([names, addresses, buildings, locations0, locations1, locations2,
                      ages, heights, floors, rent, admin, others, floor_plans, areas,
                      detail_urls], axis=1)
suumo_df.columns = ['Apartment name', 'Street address', 'Building type', 'Location 1',
                    'Location 2', 'Location 3', 'Age', 'Building height', 'Floor',
                    'Rent', 'Management fee', 'Deposit/Key money', 'Floor plan',
                    'Occupied area', 'Detailed URL']
suumo_df.to_csv('suumo.csv', sep='\t', encoding='utf-16', header=True, index=False)
```
In addition to the original article, I also scrape the building type (condominium, apartment, and so on). When I ran the original code, quite a few of the top "bargains" were apartments, and under otherwise equal conditions an apartment is naturally cheaper than a condominium.
Preprocessing.py
```python
import pandas as pd
from sklearn import preprocessing
import pandas_profiling as pdp

df = pd.read_csv('otokuSearch/data/suumo.csv', sep='\t', encoding='utf-16')

# Split each access string on '歩' ("walk") into the station part and the walking minutes
splitted1 = df['Location 1'].str.split('歩', expand=True)
splitted1.columns = ['Location 11', 'Location 12']
splitted2 = df['Location 2'].str.split('歩', expand=True)
splitted2.columns = ['Location 21', 'Location 22']
splitted3 = df['Location 3'].str.split('歩', expand=True)
splitted3.columns = ['Location 31', 'Location 32']
splitted4 = df['Deposit/Key money'].str.split('/', expand=True)
splitted4.columns = ['Security deposit', 'Key money']
df = pd.concat([df, splitted1, splitted2, splitted3, splitted4], axis=1)
df.drop(['Location 1', 'Location 2', 'Location 3', 'Deposit/Key money'], axis=1, inplace=True)

df = df.dropna(subset=['Rent'])

# Strip the Japanese units and markers: '万円' (10,000 yen), '円' (yen),
# '新築' (newly built), '築' (built), '年' (years), '分' (minutes), '㎡' (square metres)
df['Rent'] = df['Rent'].str.replace('万円', '')
df['Security deposit'] = df['Security deposit'].str.replace('万円', '')
df['Key money'] = df['Key money'].str.replace('万円', '')
df['Management fee'] = df['Management fee'].str.replace('円', '')
df['Age'] = df['Age'].str.replace('新築', '0')
df['Age'] = df['Age'].str.replace('99年以上', '0')  # note: 99+ years old is also mapped to 0 here
df['Age'] = df['Age'].str.replace('築', '')
df['Age'] = df['Age'].str.replace('年', '')
df['Occupied area'] = df['Occupied area'].str.replace('㎡', '')
df['Location 12'] = df['Location 12'].str.replace('分', '')
df['Location 22'] = df['Location 22'].str.replace('分', '')
df['Location 32'] = df['Location 32'].str.replace('分', '')
df['Management fee'] = df['Management fee'].replace('-', 0)
df['Security deposit'] = df['Security deposit'].replace('-', 0)
df['Key money'] = df['Key money'].replace('-', 0)

# Split the station part into the line and the station
splitted5 = df['Location 11'].str.split('/', expand=True)
splitted5.columns = ['Route 1', 'Station 1']
splitted5['1 walk from the station'] = df['Location 12']
splitted6 = df['Location 21'].str.split('/', expand=True)
splitted6.columns = ['Route 2', 'Station 2']
splitted6['2 walk from the station'] = df['Location 22']
splitted7 = df['Location 31'].str.split('/', expand=True)
splitted7.columns = ['Route 3', 'Station 3']
splitted7['3 walk from the station'] = df['Location 32']
df = pd.concat([df, splitted5, splitted6, splitted7], axis=1)
df.drop(['Location 11', 'Location 12', 'Location 21', 'Location 22', 'Location 31', 'Location 32'],
        axis=1, inplace=True)

df['Rent'] = pd.to_numeric(df['Rent'])
df['Management fee'] = pd.to_numeric(df['Management fee'])
df['Security deposit'] = pd.to_numeric(df['Security deposit'])
df['Key money'] = pd.to_numeric(df['Key money'])
df['Age'] = pd.to_numeric(df['Age'])
df['Occupied area'] = pd.to_numeric(df['Occupied area'])
df['Rent'] = df['Rent'] * 10000             # convert from units of 万円 to yen
df['Security deposit'] = df['Security deposit'] * 10000
df['Key money'] = df['Key money'] * 10000
df['1 walk from the station'] = pd.to_numeric(df['1 walk from the station'])
df['2 walk from the station'] = pd.to_numeric(df['2 walk from the station'])
df['3 walk from the station'] = pd.to_numeric(df['3 walk from the station'])

# 'Floor' holds strings like '2階' or '1-2階'; keep the lower floor as a number
splitted8 = df['Floor'].str.split('-', expand=True)
splitted8.columns = ['Floor 1', 'Floor 2']
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace('階', '')  # 階 = floor
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace('B', '-')  # basement floors become negative
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace('M', '')   # drop the mezzanine marker
splitted8['Floor 1'] = pd.to_numeric(splitted8['Floor 1'])
df = pd.concat([df, splitted8], axis=1)

# 'Building height' looks like '10階建' ("10 storeys"), possibly prefixed with
# '地下N地上' ("N floors below ground"); '平屋' means single-storey
df['Building height'] = df['Building height'].str.replace(r'地下\d+地上', '', regex=True)
df['Building height'] = df['Building height'].str.replace('平屋', '1')
df['Building height'] = df['Building height'].str.replace('階建', '')
df['Building height'] = pd.to_numeric(df['Building height'])

df = df.reset_index(drop=True)

# One-hot flags for the letters in the floor plan; what remains is the room count
df['Floor plan DK'] = 0
df['Floor plan K'] = 0
df['Floor plan L'] = 0
df['Floor plan S'] = 0
df['Floor plan'] = df['Floor plan'].str.replace('ワンルーム', '1')  # ワンルーム = studio

for x in range(len(df)):
    if 'DK' in df['Floor plan'][x]:
        df.loc[x, 'Floor plan DK'] = 1
df['Floor plan'] = df['Floor plan'].str.replace('DK', '')

for x in range(len(df)):
    if 'K' in df['Floor plan'][x]:
        df.loc[x, 'Floor plan K'] = 1
df['Floor plan'] = df['Floor plan'].str.replace('K', '')

for x in range(len(df)):
    if 'L' in df['Floor plan'][x]:
        df.loc[x, 'Floor plan L'] = 1
df['Floor plan'] = df['Floor plan'].str.replace('L', '')

for x in range(len(df)):
    if 'S' in df['Floor plan'][x]:
        df.loc[x, 'Floor plan S'] = 1
df['Floor plan'] = df['Floor plan'].str.replace('S', '')

df['Floor plan'] = pd.to_numeric(df['Floor plan'])

splitted9 = df['Street address'].str.split('区', expand=True)  # 区 = ward
splitted9.columns = ['Municipalities']
#splitted9['Ward'] = splitted9['Ward'] + '区'
#splitted9['Ward'] = splitted9['Ward'].str.replace('東京都', '')
df = pd.concat([df, splitted9], axis=1)

# The station string may still contain bus access, e.g. '川口駅バス9分(バス停)本郷中学校'
splitted10 = df['Station 1'].str.split('バス', expand=True)  # バス = bus
splitted10.columns = ['Station 1', 'Bus 1']
splitted11 = df['Station 2'].str.split('バス', expand=True)
splitted11.columns = ['Station 2', 'Bus 2']
splitted12 = df['Station 3'].str.split('バス', expand=True)
splitted12.columns = ['Station 3', 'Bus 3']
splitted13 = splitted10['Bus 1'].str.split(r'分\(バス停\)', expand=True)  # 分(バス停) = "min (bus stop)"
splitted13.columns = ['Bus time 1', 'Bus stop 1']
splitted14 = splitted11['Bus 2'].str.split(r'分\(バス停\)', expand=True)
splitted14.columns = ['Bus time 2', 'Bus stop 2']
splitted15 = splitted12['Bus 3'].str.split(r'分\(バス停\)', expand=True)
splitted15.columns = ['Bus time 3', 'Bus stop 3']
splitted16 = pd.concat([splitted10, splitted11, splitted12, splitted13, splitted14, splitted15], axis=1)
splitted16.drop(['Bus 1', 'Bus 2', 'Bus 3'], axis=1, inplace=True)
df.drop(['Station 1', 'Station 2', 'Station 3'], axis=1, inplace=True)
df = pd.concat([df, splitted16], axis=1)

# ... or car access, e.g. '車15分(4.8km)'
splitted17 = df['Station 1'].str.split('車', expand=True)  # 車 = car
splitted17.columns = ['Station 1', 'Car 1']
splitted18 = df['Station 2'].str.split('車', expand=True)
splitted18.columns = ['Station 2', 'Car 2']
splitted19 = df['Station 3'].str.split('車', expand=True)
splitted19.columns = ['Station 3', 'Car 3']
splitted20 = splitted17['Car 1'].str.split('分', expand=True)
splitted20.columns = ['Car time 1', 'Vehicle distance 1']
splitted21 = splitted18['Car 2'].str.split('分', expand=True)
splitted21.columns = ['Car time 2', 'Vehicle distance 2']
splitted22 = splitted19['Car 3'].str.split('分', expand=True)
splitted22.columns = ['Car time 3', 'Vehicle distance 3']
splitted23 = pd.concat([splitted17, splitted18, splitted19, splitted20, splitted21, splitted22], axis=1)
splitted23.drop(['Car 1', 'Car 2', 'Car 3'], axis=1, inplace=True)
df.drop(['Station 1', 'Station 2', 'Station 3'], axis=1, inplace=True)
df = pd.concat([df, splitted23], axis=1)
df['Vehicle distance 1'] = df['Vehicle distance 1'].str.replace(r'\(', '', regex=True)
df['Vehicle distance 1'] = df['Vehicle distance 1'].str.replace(r'km\)', '', regex=True)
df['Vehicle distance 2'] = df['Vehicle distance 2'].str.replace(r'\(', '', regex=True)
df['Vehicle distance 2'] = df['Vehicle distance 2'].str.replace(r'km\)', '', regex=True)
df['Vehicle distance 3'] = df['Vehicle distance 3'].str.replace(r'\(', '', regex=True)
df['Vehicle distance 3'] = df['Vehicle distance 3'].str.replace(r'km\)', '', regex=True)

df[['Route 1', 'Route 2', 'Route 3', 'Station 1', 'Station 2', 'Station 3',
    'Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']] = \
    df[['Route 1', 'Route 2', 'Route 3', 'Station 1', 'Station 2', 'Station 3',
        'Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']].fillna("NAN")
# Missing bus times would break the feature computation later, so fill them with 0
df[['Bus time 1', 'Bus time 2', 'Bus time 3']] = df[['Bus time 1', 'Bus time 2', 'Bus time 3']].fillna(0)
df['Bus time 1'] = df['Bus time 1'].astype(float)
df['Bus time 2'] = df['Bus time 2'].astype(float)
df['Bus time 3'] = df['Bus time 3'].astype(float)

# Label encode the categorical columns
oe = preprocessing.OrdinalEncoder()
df[['Building type', 'Route 1', 'Route 2', 'Route 3', 'Station 1', 'Station 2', 'Station 3',
    'Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']] = \
    oe.fit_transform(df[['Building type', 'Route 1', 'Route 2', 'Route 3', 'Station 1', 'Station 2',
                         'Station 3', 'Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']].values)

df['Rent+Management fee'] = df['Rent'] + df['Management fee']

# Keep a full copy for assembling the human-readable output later
df_for_search = df.copy()

# Cap the total price
df = df[df['Rent+Management fee'] < 300000]

df = df[['Apartment name', 'Building type', 'Rent+Management fee', 'Age', 'Building height', 'Floor 1',
         'Occupied area', 'Route 1', 'Route 2', 'Route 3', 'Station 1', 'Station 2', 'Station 3',
         '1 walk from the station', '2 walk from the station', '3 walk from the station',
         'Floor plan', 'Floor plan DK', 'Floor plan K', 'Floor plan L', 'Floor plan S',
         'Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3', 'Bus time 1', 'Bus time 2', 'Bus time 3']]
df.columns = ['name', 'building', 'real_rent', 'age', 'height', 'level', 'area',
              'route_1', 'route_2', 'route_3', 'station_1', 'station_2', 'station_3',
              'station_walk_1', 'station_walk_2', 'station_walk_3', 'room_number',
              'DK', 'K', 'L', 'S', 'address',
              'bus_stop1', 'bus_stop2', 'bus_stop3', 'bus_time1', 'bus_time2', 'bus_time3']
#pdp.ProfileReport(df)
df.to_csv('otokuSearch/Preprocessing/Preprocessing.csv', sep='\t', encoding='utf-16', header=True, index=False)
df_for_search.to_csv('otokuSearch/Preprocessing/df_for_search.csv', sep='\t', encoding='utf-16', header=True, index=False)
```
This preprocessing was quite a struggle. The improvement concerns the station columns. The original article extracted only the station and the walking distance to it. In reality, though, the station string can also contain bus access (bus travel time and bus stop) or the driving time to the nearest station.

Like this: **Kawaguchi Station Bus 9 min (bus stop) Motogo Junior High School Walk 1 min**. When that string goes through the original preprocessing, it becomes:

| Station 1 | 1 walk from the station |
|---|---|
| Kawaguchi Station Bus 9 min (bus stop) Motogo Junior High School | 1 |

In other words, the Station 1 column swallows the whole string "Kawaguchi Station Bus 9 min (bus stop) Motogo Junior High School", and the walk column keeps only the trailing "1", as if the property were a one-minute walk from the station. Worse, because the bus information is embedded in the station name, label encoding later treats it as a completely different station from plain Kawaguchi Station. So I split this string into **bus stop**, **bus time**, and **car travel time**.
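To make the splitting concrete, here is a minimal sketch on a toy string shaped to match the patterns used in the code above (real SUUMO strings may differ slightly in spacing; pandas ≥ 1.4 is assumed for the explicit `regex` flag):

```python
import pandas as pd

# Toy access string: 'Kawaguchi Station, 9 min by bus, (bus stop) Motogo Junior High School'
station = pd.Series(['川口駅バス9分(バス停)本郷中学校'])

# 1) split off everything after 'バス' (bus)
parts = station.str.split('バス', expand=True)
parts.columns = ['Station', 'Bus']          # -> '川口駅' | '9分(バス停)本郷中学校'

# 2) split the bus part into travel time and bus stop name
bus = parts['Bus'].str.split(r'分\(バス停\)', expand=True, regex=True)
bus.columns = ['Bus time', 'Bus stop']      # -> '9' | '本郷中学校'

print(parts['Station'][0], bus['Bus time'][0], bus['Bus stop'][0])
# 川口駅 9 本郷中学校
```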
The newly added building type and the bus stops are label encoded as well; a minimal sketch of what that encoding does follows.
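For reference, this is what the label encoding does, shown on toy building-type values (sklearn's OrdinalEncoder assigns integers in sorted category order):

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy values: 'アパート' (apartment) and 'マンション' (condominium)
oe = OrdinalEncoder()
X = [['マンション'], ['アパート'], ['マンション']]

print(oe.fit_transform(X))  # [[1.] [0.] [1.]] -- one integer per category
print(oe.categories_)       # [array(['アパート', 'マンション'], dtype=object)]
```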
Feature_value.py
```python
import pandas as pd

df = pd.read_csv('otokuSearch/Preprocessing/Preprocessing.csv', sep='\t', encoding='utf-16')

# Interaction features
df["per_area"] = df["area"] / df["room_number"]
df["height_level"] = df["height"] * df["level"]
df["area_height_level"] = df["area"] * df["height_level"]

# Distance-to-station features combining the walk and bus data
df["distance_station_1"] = df["station_1"] * df["station_walk_1"] + df["bus_stop1"] * df["bus_time1"]
df["distance_station_2"] = df["station_2"] * df["station_walk_2"] + df["bus_stop2"] * df["bus_time2"]
df["distance_station_3"] = df["station_3"] * df["station_walk_3"] + df["bus_stop3"] * df["bus_time3"]

df.to_csv('otokuSearch/Featurevalue/Feature_value.csv', sep='\t', encoding='utf-16', header=True, index=False)
```
Using the newly extracted bus data, I created a new feature for the distance to the station.
At first, I had also created this feature:

```python
df["per_real_rent"] = df["real_rent"]/df["area"]
```

However, on closer inspection I deleted it, because it is built from the objective variable (the rent to be predicted).
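A toy illustration (synthetic numbers, nothing to do with the scraped data) of why that feature is leakage:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({'area': rng.uniform(15.0, 60.0, 100)})
toy['real_rent'] = toy['area'] * 2500 + rng.normal(0, 5000, 100)

# The deleted feature: rent per unit area, built FROM the target
toy['per_real_rent'] = toy['real_rent'] / toy['area']

# Anything that sees both columns can reconstruct the target exactly,
# so validation scores look spectacular while predicting nothing new
print(np.allclose(toy['per_real_rent'] * toy['area'], toy['real_rent']))  # True
```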
The importance of each feature is visualized later, during training.
At first I was delighted, thinking I had engineered a great feature, because its importance was off the charts...
model_lightgbm.py
```python
# Data analysis libraries
import pandas as pd
import numpy as np
# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Gradient boosting library (LightGBM)
import lightgbm as lgb
# Splits training and evaluation data for cross-validation
from sklearn.model_selection import KFold
# Evaluation metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# Needed to save the model
import pickle

# Plot predicted values against true values
def True_Pred_map(pred_df):
    RMSE = np.sqrt(mean_squared_error(pred_df['true'], pred_df['pred']))
    R2 = r2_score(pred_df['true'], pred_df['pred'])
    plt.figure(figsize=(8, 8))
    ax = plt.subplot(111)
    ax.scatter('true', 'pred', data=pred_df)
    ax.set_xlabel('True Value', fontsize=15)
    ax.set_ylabel('Pred Value', fontsize=15)
    ax.set_xlim(pred_df.min().min() - 0.1, pred_df.max().max() + 0.1)
    ax.set_ylim(pred_df.min().min() - 0.1, pred_df.max().max() + 0.1)
    x = np.linspace(pred_df.min().min() - 0.1, pred_df.max().max() + 0.1, 2)
    y = x
    ax.plot(x, y, 'r-')
    plt.text(0.1, 0.9, 'RMSE = {}'.format(str(round(RMSE, 5))), transform=ax.transAxes, fontsize=15)
    plt.text(0.1, 0.8, 'R^2 = {}'.format(str(round(R2, 5))), transform=ax.transAxes, fontsize=15)

df = pd.read_csv('otokuSearch/Featurevalue/Feature_value.csv', sep='\t', encoding='utf-16')

# kf: the splitting strategy -- 10 folds, with the data shuffled
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# predicted_df: a dummy one-row frame that the per-fold predictions are concatenated onto
predicted_df = pd.DataFrame({'index': 0, 'pred': 0}, index=[1])

# The parameters have not been tuned
lgbm_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 80
}

# 10-fold cross-validation, so the loop runs 10 times.
# kf yields the row indices of the training fold and the validation fold.
for train_index, val_index in kf.split(df.index):
    # Split into training/validation data and explanatory/objective variables
    X_train = df.drop(['real_rent', 'name'], axis=1).iloc[train_index]
    y_train = df['real_rent'].iloc[train_index]
    X_test = df.drop(['real_rent', 'name'], axis=1).iloc[val_index]
    y_test = df['real_rent'].iloc[val_index]

    # Wrap in LightGBM's dataset format for speed
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test)

    # Build the LightGBM model
    gbm = lgb.train(lgbm_params,
                    lgb_train,
                    valid_sets=(lgb_train, lgb_eval),
                    num_boost_round=10000,
                    early_stopping_rounds=100,
                    verbose_eval=50)

    # Predict on the validation fold
    predicted = gbm.predict(X_test)

    # temp_df: keep the original row index so predictions can be matched to true values
    temp_df = pd.DataFrame({'index': X_test.index, 'pred': predicted})

    # Append this fold's predictions (the initial dummy row is dropped below)
    predicted_df = pd.concat([predicted_df, temp_df], axis=0)

predicted_df = predicted_df.sort_values('index').reset_index(drop=True).drop(index=[0]).set_index('index')
predicted_df = pd.concat([predicted_df, df['real_rent']], axis=1).rename(columns={'real_rent': 'true'})
True_Pred_map(predicted_df)
print(r2_score(y_test, predicted))  # R^2 of the last fold only
lgb.plot_importance(gbm, figsize=(12, 6))
plt.show()

# Save the model
with open('otokuSearch/model/model.pickle', mode='wb') as fp:
    pickle.dump(gbm, fp)
```
As the original article itself admits, the training data and the data to be predicted are almost identical, so the model is effectively in a "cheating" state.
That's why I implemented cross-validation. I borrowed the code from the article below, which explains it very clearly: https://rin-effort.com/2019/12/31/machine-learning-8/
To briefly explain cross-validation: the data is split into K folds, and each fold takes one turn as the evaluation set while the model trains on the remaining folds. Every row is therefore used for both training and evaluation, but the model is never evaluated on rows it was trained on.
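A minimal sketch of the mechanism on dummy row indices (not the property data):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Each of the 10 iterations trains on 9/10 of the rows and validates on the
# held-out 1/10, so every row gets exactly one out-of-fold prediction
for fold, (train_idx, val_idx) in enumerate(kf.split(np.arange(20))):
    print(fold, val_idx)
```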
I also added saving of the trained model, because retraining it every time is a hassle.
Running it just like the original code outputs a true-versus-predicted map and a plot of feature importances. Quite a strong correlation! The newly acquired building type doesn't contribute much, though.
Create_Otoku_data.py
```python
import pandas as pd
import lightgbm as lgb
import pickle

# Load the feature data
df = pd.read_csv('otokuSearch/Featurevalue/Feature_value.csv', sep='\t', encoding='utf-16')

# Load the trained model
with open('otokuSearch/model/model.pickle', mode='rb') as fp:
    gbm = pickle.load(fp)

# Build the bargain-property data
y = df["real_rent"]
X = df.drop(['real_rent', "name"], axis=1)
pred = list(gbm.predict(X, num_iteration=gbm.best_iteration))
pred = pd.Series(pred, name="Predicted value")
diff = pd.Series(df["real_rent"] - pred, name="Difference from the predicted value")
df_for_search = pd.read_csv('otokuSearch/Preprocessing/df_for_search.csv', sep='\t', encoding='utf-16')
df_for_search['Rent+Management fee'] = df_for_search['Rent'] + df_for_search['Management fee']
df_search = pd.concat([df_for_search, diff, pred], axis=1)
# Most negative difference first: actual rent far below the predicted rent = bargain
df_search = df_search.sort_values("Difference from the predicted value")
df_search = df_search[["Apartment name", 'Rent+Management fee', 'Predicted value',
                       'Difference from the predicted value', 'Detailed URL', 'Floor plan',
                       'Occupied area', 'Floor', 'Station 1', '1 walk from the station',
                       'Floor plan DK', 'Floor plan K', 'Floor plan L']]
df_search.to_csv('otokuSearch/Otoku_data/otoku.csv', sep='\t', encoding='utf-16')
```
No major changes here: it just loads the trained model and adds a few columns to the output data.
And here is the property that shines in first place in the output file! A 2-minute walk from the station, 3LDK, and all that for 82,000 yen in rent, wonderful! The train line is a bit rural, though...
It is now possible to generate data on reasonably priced properties. However, I did no EDA (exploratory data analysis) at all, so more analysis is needed to improve accuracy. It would also help to scrape more attributes; for example, whether a listing has city gas or an air conditioner. And there is no end to work like tuning hyperparameters. Well, I did what I set out to do, so good enough!
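As a cheap first pass at that EDA, the pandas_profiling import already sitting in Preprocessing.py (its ProfileReport call is commented out there) could be pointed at the preprocessed CSV. Assuming a recent pandas_profiling, something like this; the report filename is my own choice:

```python
import pandas as pd
import pandas_profiling as pdp

df = pd.read_csv('otokuSearch/Preprocessing/Preprocessing.csv', sep='\t', encoding='utf-16')
# Writes an HTML overview of distributions, correlations and missing values
pdp.ProfileReport(df).to_file('preprocessing_report.html')
```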
Wrangling all this code finally made me understand why git is necessary. I got a lot out of this project: getting comfortable with Atom, learning how to use git, and realizing that the fastest way to learn programming is to build something real.
Next, I'd like to turn this model into an actual system using Django: paste in the URL of a listing page, and it predicts the rent and tells you whether the listing is a bargain. I'll write that up if I manage to build it.