Result excerpt

Simultaneous posting on the blog: https://leoluistudio.com/blog/24/python%e3%82%92%e4%bd%bf%e3%81%a3%e3%81%a6%e6%9d%b1%e4%ba%ac%e9%83%bd%e5%ae%b6%e8%b3%83%e3%81%ab%e3%81%a4%e3%81%84%e3%81%a6%e3%81%ae%e7%a0%94%e7%a9%b6-4%e3%81%ae2/

Statistical method 1

Data collection

--Data source: https://suumo.jp --Library used: requests, BeautifulSoup, re, time, sqlite3 --The code is omitted because the collection part is just a request and put it in the database.

Database

Database structure

info
id	1 (PRIMARY KEY)
municipal	Chiyoda Ward
train	7 (Walk to the station)
type	apartment
date	1983
structure	Steel rebar
floor	8
carpark	0 (0 is none, 1 is yes)

price
id	1 (PRIMARY KEY)
pid	160000 (Circle)
1 (=info id)	7 (Walk to the station)
area	42.9 (Square meter)
date	east

First 5th row for each table

Programs and results

First install the necessary libraries

import sqlite3
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import plot

from sklearn.decomposition import PCA

Connect the database

#If the result is output only
conn = sqlite3.connect(‘info.db’)
c = conn.cursor()
cursor = c.execute(“SQL code”)
for row in cursor:
    print(result)
conn.close()

#For processing by saving the result
conn = sqlite3.connect(‘info.db’)
df = pd.read_sql_query(“SQL code”, conn)
conn.close()

First look at the distribution of the data

#number of info lines
SELECT COUNT(id) FROM info
#Result: 82,812

#Number of lines in price
SELECT COUNT(id) FROM price
#Result: 624,499

#Number of "apartments" and "apartments"
SELECT type,COUNT(type) FROM info GROUP BY type
#Result: Apartment 33,110 /Mansion 49,702

Distribution of completion dates

# SQL
SELECT date AS year,COUNT(date) AS count FROM info WHERE date > 0 GROUP BY date

#Graph
fig = px.bar(df, x=’year’, y=’count’, height=500, width=1000)

Distribution of building structure

# SQL
SELECT structure,COUNT(structure) AS count FROM info WHERE structure != 0 GROUP BY structure

#Graph
fig = px.bar(df, x=structure, y=’count’, height=500, width=1000)

Rank distribution

# SQL
SELECT floor,COUNT(floor) AS count FROM info WHERE floor != 0 GROUP BY floor

#Graph
fig = px.bar(df, x=floor, y=’count’, height=500, width=1000)

In the table above, the distribution on the 20th floor and above cannot be seen at all, so let's take a closer look here.

# SQL
SELECT floor,COUNT(floor) AS count FROM info WHERE floor > 20 GROUP BY floor

#Graph
fig = px.bar(df, x=floor, y=’count’, height=500, width=1000)

You must group the continuous variables (e.g. price) before you analyze them.

def pricegroup(df):
    if df[‘price’] < 30000:
      return ‘<30,000’
    elif df[‘price’] < 60000:
      return ‘30,000-60,000’
    ……
    else:
      return ‘>270,000’

pricegroup_list = [‘<30,000’,
    ‘30,000-60,000’,
    ‘60,000-90,000’,
    ……
    ‘240,000-270,000’,
    ‘>270,000’]

Price distribution

# SQL
SELECT price FROM price

#Dataframe processing df[‘pricegroup’] = df.apply(pricegroup, axis=1)
dfcount = df.groupby([‘pricegroup’]).count()

#Graph
fig = px.bar(dfcount, x=dfcount.index, y=’price’, height=500, width=1000)
fig.update_layout(xaxis={‘categoryorder’:’array’, ‘categoryarray’:pricegroup_list}, yaxis_title=’count’)

The area distribution is the same

def pricegroup(df):
    if df[‘area’] < 5:
      return ‘<5’
    elif df[‘area’] < 10:
      return ‘5-10’
    ……
    else:
      return ‘>45’

pricegroup_list = [‘<5′, ’5-10′, ’10-15′, ’15-20′,
    ’20-25′, ’25-30′, ’30-35′, ’35-40′,
    ’40-45′,’>45′]

# SQL
SELECT area FROM price

#Dataframe processing
df[‘areagroup’] = df.apply(areagroup, axis=1)
dfcount = df.groupby([‘areagroup’]).count()

#Graph
fig = px.bar(dfcount, x=dfcount.index, y=’area’, height=500, width=1000)
fig.update_layout(xaxis={‘categoryorder’:’array’, ‘categoryarray’:areagroup_list}, yaxis_title=’count’)

Directional distribution

# SQL
SELECT direction,COUNT(direction) AS count FROM price WHERE direction != ‘-‘ GROUP BY direction

#Graph
fig = px.bar(df, x=direction, y=’count’, height=500, width=1000)

Before analyzing the wards, cities, towns and villages, divide them into 23 wards and cities.

m23_list = [‘Chiyoda Ward’,'Chuo-ku','Minato-ku','Shinjuku ward',‘Bunkyo Ward’,'Taito',‘Sumida Ward’,
‘Koto Ward’,‘Shinagawa Ward’,'Meguro','Ota-ku','Setagaya','Shibuya Ward','Nakano',
'Suginami','Toshima ward',‘Kita Ward’,'Arakawa',‘Itabashi Ward’,‘Nerima Ward’,'Adachi Ward',
'Katsushika',‘Edogawa Ward’]

municipal_dict = {}
conn = sqlite3.connect(‘info.db’)
c = conn.cursor()
cursor = c.execute(“SELECT id,municipal FROM info”)
for row in cursor:
    municipal_dict.update({row[0]:row[1]})
conn.close()

def municipal(df):
    return municipal_dict[df[‘pid’]]

def municipal23(df):
    if df[‘municipal’] in m23_list:
      return ‘Special Wards’
    else:
      return ‘Non Special Wards’

Relationship between price and area (divided into 23 wards and city)

# SQL
SELECT pid,price,area FROM price

#Dataframe processing
df[‘municipal’] = df.apply(municipal, axis=1)
df[‘municipal23’] = df.apply(municipal23, axis=1)
dfmedian = df.groupby([‘pid’, ‘municipal23’])[‘price’, ‘area’].median()
dfmedian_reset = dfmedian.reset_index(level=’municipal23′)

#Graph
fig = px.scatter(dfmedian_reset, x=’area’, y=’price’, color=’municipal23′, labels={‘municipal23’: ‘Special Wards’}, height=500, width=1000)

Study on Tokyo Rent Using Python (3-2)

Result excerpt

Statistical method 1

Data collection

Database

Programs and results