Nice to meet you. My name is S.I., and I am a third-year university student in the Department of Computer Science. My experience with Python amounts to little more than some university coursework.
In the data science division of Cacco Inc., where I am an intern, the task for the trial period is to build a crawler to collect data, then process and visualize it. In this post I briefly describe what I learned.
A friend of mine from university is about to start living alone. However, when he looks at real estate websites, there are far too many properties to choose from. I was asked to solve this with data analysis, under the following condition:
Within 60 minutes of commuting time from JR Kanamachi Station
Drawing on my own experience of searching for a place, I decided that the "information you want to know" when choosing a property is really the set of "conditions for finding a property" that you would give to a real estate agent, and chose to work those conditions out through data analysis.
This time I crawled Sumaity's "commuting / school time search", with the search limited to properties within 60 minutes of Kanamachi Station.
The crawling code looks like this:
crawling.py
import requests
from bs4 import BeautifulSoup
import time
import os
import datetime


def crawling():
    # Directory for saving the HTML files
    dirname = './html_files'
    if not os.path.exists(dirname):
        # Create the directory if it does not exist
        os.mkdir(dirname)

    # Fetch the first page
    url = "https://sumaity.com/chintai/commute_list/list.php?search_type=c&text_from_stname%5B%5D=%E9%87%91%E7%94%BA&cost_time%5B%5D=60&price_low=&price_high="
    response = requests.get(url)
    time.sleep(1)

    # Save it to a file
    page_count = 1  # page counter
    with open('./html_files/page{}.html'.format(page_count), 'w', encoding='utf-8') as file:
        file.write(response.text)

    # Get the total number of properties (theoretical value), used later as the acceptance condition
    soup = BeautifulSoup(response.content, "lxml")
    num_bukken = int(soup.find(class_='searchResultHit').contents[1].text.replace(',', ''))
    print("Total number of properties within 60 minutes of commuting time:", num_bukken)

    # Save the total number of properties to a text file so the acceptance condition can be checked when scraping
    path = './data.txt'
    with open(path, mode='w') as f:
        f.write("{}\n".format(num_bukken))

    # Crawl the second and subsequent pages, continuing until there is no next page
    while True:
        page_count += 1
        # Find the URL of the next page
        next_url = soup.find("li", class_="next")
        # Stop when there is no next page
        if next_url is None:
            print("Total number of pages:", page_count - 1)
            with open(path, mode='a') as f:
                f.write("{}\n".format(page_count - 1))
            break
        # Fetch the next page and save it as an HTML file
        url = next_url.a.get('href')
        response = requests.get(url)
        time.sleep(1)
        with open('./html_files/page{}.html'.format(page_count), 'w', encoding='utf-8') as file:
            file.write(response.text)
        # Parse the page so that the next URL can be found on the following iteration
        soup = BeautifulSoup(response.content, "lxml")
        # Progress output
        if page_count % 10 == 0:
            print(page_count, 'pages fetched')


# Main entry point
if __name__ == "__main__":
    date_now = datetime.datetime.now()
    print("Start crawling:", date_now)
    crawling()
    date_now = datetime.datetime.now()
    print("Finished crawling:", date_now)
The variables scraped this time correspond to the columns of the CSV header in the code below.
The scraping code looks like this:
scraping.py
from bs4 import BeautifulSoup
import datetime
import csv
import re

# Regular expression for splitting a Japanese address into prefecture, city, and the rest
pat = '(...??[都道府県])((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村|宮古|富良野|別府|佐伯|黒部|小諸|塩尻|玉野|周南)市|(?:余市|高市|[^市]{2,3}?)郡(?:玉村|大町|.{1,5}?)[町村]|(?:.{1,4}市)?[^町]{1,4}?区|.{1,7}?[市町村])(.+)'
def scraping(total_page, room_num):
    # Property counter
    room_count = 0

    # Prepare the CSV file (write the header)
    with open('room_data.csv', 'w', newline='', encoding='CP932') as file:
        header = ['No', 'building_name', 'category', 'prefecture', 'city', 'station_num', 'station', 'method', 'time', 'age', 'total_stairs', 'stairs', 'layout', 'room_num', 'space', 'south', 'corner', 'rent', 'unit_price', 'url']
        writer = csv.DictWriter(file, fieldnames=header)
        writer.writeheader()

    for page_num in range(total_page):
        # Progress output
        if page_num % 10 == 0:
            print(page_num, '/', total_page)

        # Open the saved HTML file and parse it with Beautiful Soup
        with open('./html_files/page{}.html'.format(page_num + 1), 'r', encoding='utf-8') as file:
            page = file.read()
        soup = BeautifulSoup(page, "lxml")

        # Get the information for each building
        building_list = soup.find_all("div", class_="building")
        for building in building_list:
            # Building category: condominium, apartment, or detached house
            buildingCategory = building.find(class_="buildingCategory").getText()
            # Building name ("新着" is the "new arrival" label)
            buildingName = building.find(class_="buildingName").h3.getText().replace("{}".format(buildingCategory), "").replace("新着", "")

            # Candidate nearest stations and the travel time to each of them
            traffic = building.find("ul", class_="traffic").find_all("li")
            # Number of candidate stations
            station_num = len(traffic)

            # Pick the station with the shortest travel time
            min_time = 1000000  # initialize the minimum travel time
            for j in range(station_num):
                traffic[j] = traffic[j].text
                figures = re.findall(r'\d+', traffic[j])
                time = 0
                for figure in figures:
                    # Sum up the travel time
                    time += int(figure)
                # Keep the minimum travel time and its index
                if time < min_time:
                    min_time = time
                    index = j

            # If there is station / route information
            if len(traffic[index].split(' ')) > 1:
                # Route (line) name
                line = traffic[index].split(' ')[0]
                # Nearest station ("駅" = station)
                station = traffic[index].split(' ')[1].split('駅')[0]
                # Means of transport to the station ("バス" = bus, "車" = car, otherwise walking)
                if len(traffic[index].split(' ')) > 2:
                    if "バス" in traffic[index].split(' ')[1]:
                        method = "bus"
                    elif "車" in traffic[index].split(' ')[2]:
                        method = "car"
                    else:
                        method = "walk"
                # No information about the means of transport
                else:
                    method = None
            # No station / route information at all
            else:
                station = None
                line = None
                method = None
                time = None
            # Address, split into prefecture and city
            address = building.find(class_="address").getText().replace('\n', '')
            address = re.split(pat, address)
            if len(address) < 3:
                # Fallback when the split fails: 東京都 = Tokyo, 足立区 = Adachi Ward
                prefecture = "東京都"
                city = "足立区"
            else:
                prefecture = address[1]
                city = address[2]

            # Building details (age, structure, total number of floors)
            building_detail = building.find(class_="detailData").find_all("td")
            for j in range(len(building_detail)):
                building_detail[j] = building_detail[j].text

            # ---- Keep only the number for the building age ----
            # Age unknown ("築不明")
            if '築不明' == building_detail[0]:
                building_detail[0] = None
            # Less than one year old ("未満")
            elif '未満' in building_detail[0]:
                building_detail[0] = 0
            # Normal value
            else:
                building_detail[0] = int(re.findall(r'\d+', building_detail[0])[0])

            # Keep only the number for the total number of floors
            building_detail[2] = int(re.findall(r'\d+', building_detail[2])[0])
            # ---- Get the details of each room ----
            rooms = building.find(class_="detail").find_all("tr",
                                                            {'class': ['estate applicable', 'estate applicable gray']})
            for j in range(len(rooms)):
                # Count the number of properties
                room_count += 1

                # ---- Floor number ----
                stairs = rooms[j].find("td", class_="roomNumber").text
                # Keep only the number (drop "階" = "floor", treat "-" as missing)
                if "-" == stairs:
                    stairs = None
                else:
                    stairs = int(re.findall(r'\d+', stairs)[0])

                # Convert the rent to an integer ("万" = 10,000 yen)
                price = rooms[j].find(class_="roomPrice").find_all("p")[0].text
                price = round(10000 * float(price.split('万')[0]))

                # Management fee
                kanri_price = rooms[j].find(class_="roomPrice").find_all("p")[1].text
                # Normalize the notation (strip "円" = yen, treat "-" and "0円" as 0)
                if "-" in kanri_price or "0円" == kanri_price:
                    kanri_price = 0
                else:
                    kanri_price = int(kanri_price.split('円')[0].replace(',', ''))

                # Room type (floor plan); "ワンルーム" (studio) is normalized to "1R"
                room_type = rooms[j].find(class_="type").find_all("p")[0].text
                if room_type == "ワンルーム":
                    room_type = "1R"
                # Number of rooms
                num_of_rooms = int(re.findall(r'\d+', room_type)[0])

                # Room area, dropping the unit "m2"
                room_area = rooms[j].find(class_="type").find_all("p")[1].text
                room_area = float(room_area.split('m')[0])

                # South-facing / corner-room flags ("南向き" = south-facing, "角部屋" = corner room)
                special = rooms[j].find_all("span", class_="specialLabel")
                south = 0
                corner = 0
                for label in range(len(special)):
                    if "南向き" in special[label].text:
                        south = 1
                    if "角部屋" in special[label].text:
                        corner = 1

                # URL of the detail page
                room_url = rooms[j].find("td", class_="btn").a.get('href')

                # Rent = rent + management fee
                rent = price + kanri_price
                # Rent per square metre (unit price)
                unit_price = rent / room_area

                # Append to the CSV file: the default encoding is "utf-8"; use "cp932" for Japanese on Windows
                with open('room_data.csv', 'a', newline='', encoding='CP932') as file:
                    writer = csv.DictWriter(file, fieldnames=header)
                    writer.writerow(
                        {'No': room_count, 'building_name': buildingName, 'category': buildingCategory, 'prefecture': prefecture, 'city': city, 'station_num': station_num, 'station': station,
                         'method': method, 'time': min_time, 'age': building_detail[0], 'total_stairs': building_detail[2], 'stairs': stairs,
                         'layout': room_type, 'room_num': num_of_rooms, 'space': room_area, 'south': south, 'corner': corner, 'rent': rent, 'unit_price': unit_price, 'url': room_url})
print("{}We have acquired the property data.".format(room_count))
#Confirmation of acceptance conditions
if room_count == room_num:
print("Clear acceptance conditions")
else:
print("{}There are differences. The acceptance conditions have not been cleared.".format(abs(room_count-room_num)))
if __name__ == "__main__":
    date_now = datetime.datetime.now()
    print("Start scraping:", date_now)
    # Pass the total number of pages and the number of properties (acceptance condition) to the scraping function
    path = './data.txt'
    with open(path) as f:
        data = f.readlines()
    scraping(int(data[1].replace("\n", "")), int(data[0].replace("\n", "")))
    date_now = datetime.datetime.now()
    print("Finished scraping:", date_now)
First of all, I plotted a histogram of how the rent is distributed and removed the properties whose rent was too high to be realistic for someone living alone.
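As a reference, a minimal pandas/matplotlib sketch of this step might look like the following; the column names come from the CSV header in scraping.py, while the 150,000 yen cutoff is only an assumed example value, since the actual threshold is not stated here.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV written by scraping.py (cp932 because it was written on Windows)
df = pd.read_csv('room_data.csv', encoding='cp932')

# Overall rent distribution
df['rent'].hist(bins=50)
plt.xlabel('rent [yen]')
plt.ylabel('number of properties')
plt.show()

# Drop properties that are too expensive for living alone
# (150,000 yen is an assumed cutoff for illustration)
df = df[df['rent'] <= 150000]
```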
From here, let's see how each variable affects rent.
Let's look at the number of properties and the distribution of rent for each floor plan.
A bar graph of the number of properties for each floor plan shows that the floor plans from 1R to 3LDK account for 98% of the total. A violin plot of the rent for those floor plans shows that the rent distribution differs from plan to plan, so the floor plan is likely to be a variable that affects rent.
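A sketch of these two plots, continuing with the `df` DataFrame from above (the list of layouts kept for the violin plot is an assumption based on the 1R-to-3LDK range mentioned here):

```python
import seaborn as sns

# Number of properties per floor plan
df['layout'].value_counts().plot.bar()
plt.ylabel('number of properties')
plt.show()

# Rent distribution per floor plan (assumed list covering 1R to 3LDK)
major_layouts = ['1R', '1K', '1DK', '1LDK', '2K', '2DK', '2LDK', '3K', '3DK', '3LDK']
sns.violinplot(x='layout', y='rent', data=df[df['layout'].isin(major_layouts)], order=major_layouts)
plt.show()
```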
Let's see where there are many properties.
By prefecture, most of the properties were in Tokyo and Chiba, with Saitama accounting for only about 3%. Looking at individual cities in more detail, Adachi-ku, Katsushika-ku, Matsudo-shi, Kashiwa-shi, and Arakawa-ku each have more than 1,000 properties, so they seem to be good places to look. Let's look at the distribution of rent in each of these districts.
The rent histogram by prefecture shows that although Tokyo has many properties, many of them are expensive, while Chiba has many cheaper ones. A closer look at the rent box plot for each city shows that the green boxes for the Chiba cities sit toward the bottom: it seems you can find cheap properties in Matsudo, Kashiwa, Nagareyama, Ichikawa, Abiko, Yoshikawa, and Soka. Since the box plot shows that the rent distribution differs from district to district, where the property is located also affects the rent.
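The counts and the per-city box plot could be produced along these lines (taking the ten cities with the most properties is an assumption):

```python
# Share of properties by prefecture, and the cities with the most properties
print(df['prefecture'].value_counts(normalize=True))
print(df['city'].value_counts().head(10))

# Rent distribution for the cities with the most properties
top_cities = df['city'].value_counts().head(10).index
sns.boxplot(x='city', y='rent', data=df[df['city'].isin(top_cities)], order=top_cities)
plt.xticks(rotation=90)
plt.show()
```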
There is a weak negative correlation between the time to the nearest station and rent: the longer the travel time, the cheaper the rent tends to be. I also illustrated how rent differs depending on the means of transport, such as bus or walking. The rent for properties reached on foot (blue) is higher than for those reached by bus, so both the means of transport and the travel time are likely to affect rent.
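A sketch of the correlation check and the colour-coded scatter plot, where `time` is the time to the nearest station and `method` the means of transport as produced by scraping.py:

```python
# Correlation between time to the nearest station and rent
print(df[['time', 'rent']].corr())

# Rent vs. time, colour-coded by means of transport
for method, group in df.groupby('method'):
    plt.scatter(group['time'], group['rent'], s=5, alpha=0.3, label=str(method))
plt.xlabel('time to nearest station [min]')
plt.ylabel('rent [yen]')
plt.legend()
plt.show()
```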
We grouped building age into five-year bins and drew a box plot of rent for each bin.
You can see that rent gradually becomes cheaper for properties more than 15 years old, so age is also likely to be a variable that affects rent.
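A sketch of the binning, assuming ages up to 50 years (the actual upper bound in the data is not stated):

```python
# Group building age into 5-year bins and compare the rent distributions
df['age_bin'] = pd.cut(df['age'], bins=range(0, 55, 5), right=False)
sns.boxplot(x='age_bin', y='rent', data=df)
plt.xticks(rotation=90)
plt.show()
```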
Let's look at the distribution of rent by the total number of floors of the building and the histogram of the total number of floors.
Looking at the total number of floors against the rent distribution, properties in buildings of up to two storeys appear to be cheap, and the histogram of floor counts shows that most two-storey properties are apartments. In addition, 95% of the properties are in buildings of 10 storeys or fewer, so although you may dream of a high-rise for your first time living alone, such properties are hard to come by. From these results, information about the building also affects rent.
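These observations could be checked roughly as follows, continuing with `df`:

```python
# Rent distribution by total number of floors, and the histogram of floor counts
sns.boxplot(x='total_stairs', y='rent', data=df)
plt.show()
df['total_stairs'].hist(bins=30)
plt.show()

# Building category of two-storey properties, and the share of buildings with at most 10 floors
print(df.loc[df['total_stairs'] == 2, 'category'].value_counts())
print((df['total_stairs'] <= 10).mean())
```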
Next, I checked whether facing south, a characteristic often highlighted for a property, affects the rent, by drawing histograms of rent for south-facing properties and for the rest.
The two distributions look similar, so we tested whether the difference in rent was significant. An F-test for homoscedasticity did not reject equal variances, so a t-test assuming equal variances was performed. The difference in mean rent between south-facing and non-south-facing properties was significant, with south-facing properties about 1,500 yen cheaper. So whether a property faces south does affect the rent.
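A sketch of this test procedure with scipy, wrapped in a hypothetical `compare_groups` helper (the two-sided variance-ratio F-test and the 5% significance level are assumptions about the exact procedure used):

```python
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """F-test for equal variances, then the matching two-sample t-test."""
    f = a.var(ddof=1) / b.var(ddof=1)
    df_a, df_b = len(a) - 1, len(b) - 1
    p_f = 2 * min(stats.f.cdf(f, df_a, df_b), stats.f.sf(f, df_a, df_b))
    equal_var = p_f >= alpha  # equal variances not rejected -> Student's t-test
    t, p_t = stats.ttest_ind(a, b, equal_var=equal_var)
    return {'p_f': p_f, 'equal_var': equal_var, 't': t, 'p_t': p_t,
            'mean_diff': a.mean() - b.mean()}

south_rent = df.loc[df['south'] == 1, 'rent']
other_rent = df.loc[df['south'] == 0, 'rent']
print(compare_groups(south_rent, other_rent))
```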
Similarly, we will look at the impact on corner rooms.
Here the F-test rejected homoscedasticity, so a t-test that does not assume equal variances (Welch's t-test) was performed. The difference turned out to be significant, with corner rooms about 2,000 yen more expensive. So being a corner room also affects the rent.
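With the same hypothetical `compare_groups` helper, the corner-room comparison would be:

```python
corner_rent = df.loc[df['corner'] == 1, 'rent']
other_rent = df.loc[df['corner'] == 0, 'rent']
# equal variances are rejected here, so ttest_ind falls back to Welch's t-test
print(compare_groups(corner_rent, other_rent))
```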
Based on the results so far, I will analyze the data again and pin down concrete property conditions to recommend to my university friend who is actually in trouble.
The relationship between the number of properties and the average rent is plotted for each city.
You can see that Matsudo City has many properties and the average rent is low.
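This plot could be drawn roughly like so:

```python
# Number of properties and average rent for each city
city_stats = df.groupby('city')['rent'].agg(['count', 'mean'])
plt.scatter(city_stats['count'], city_stats['mean'], s=10)
plt.xlabel('number of properties')
plt.ylabel('average rent [yen]')
plt.show()
```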
The number of properties and average rent for each floor plan are as follows.
The average rent for 1R, 2K, and 3K is lower, but the number of 1K properties is overwhelming. Someone living alone does not need that much space, so a 1K floor plan seems like a good choice. The market rent for a 1K property in Matsudo, taken as the median, was 56,000 yen.
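The per-layout summary and the Matsudo 1K median could be computed as follows; '松戸市' is assumed to be the exact city string produced by the address split in scraping.py:

```python
# Number of properties and average rent for each floor plan
print(df.groupby('layout')['rent'].agg(['count', 'mean']))

# Market rent for 1K properties in Matsudo City, taken as the median
matsudo_1k = df[(df['city'] == '松戸市') & (df['layout'] == '1K')]
print(matsudo_1k['rent'].median())
```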
Next is a box plot of how the rent distribution of 1K properties in Matsudo City changes with building age.
You can see that there are many properties below the market price once the building is 15 years old or more, so it seems a good idea to look at buildings of around 15 years or older.
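Continuing with the `matsudo_1k` slice and the `age_bin` column from the sketches above:

```python
# Rent distribution of Matsudo 1K properties by 5-year age bin
sns.boxplot(x='age_bin', y='rent', data=matsudo_1k)
plt.xticks(rotation=90)
plt.show()
```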
I made a bar graph, color-coded by means of transport, of how long it takes to reach the station.
It shows that 95% of the properties suited to living alone are within a 20-minute walk. I also plotted the number of properties and the average rent for each travel time to the station: extending the limit to within 15 minutes makes the rent noticeably cheaper.
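A sketch of these last two figures; whether they are drawn for the whole data set or only for the Matsudo 1K slice is not stated, so the slice is assumed here:

```python
# Number of properties by time to the station, colour-coded by means of transport
pd.crosstab(matsudo_1k['time'], matsudo_1k['method']).plot.bar(stacked=True)
plt.xlabel('time to station [min]')
plt.ylabel('number of properties')
plt.show()

# Number of properties and average rent for each walking time
walk = matsudo_1k[matsudo_1k['method'] == 'walk']
print(walk.groupby('time')['rent'].agg(['count', 'mean']))
```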
Based on the above, the conditions for finding a property to tell my friend are as follows.
A 1K floor plan in Matsudo City
Building age of about 15 years or more
Within a 15-minute walk of the station
Rent (including management fee) of around 56,000 yen or less
Searching under the above conditions leads to **apartment**-type rooms. With these conditions, I think my friend can find a good room by visiting a real estate agent in Matsudo.
My main point of reflection is that I started analyzing the data without first deciding on a policy. That is fine as a hobby, but at work it leaves you with charts you do not know how to use, which is a huge waste of time, so data analysis has to be done with a clear purpose.
I feel this assignment was a valuable experience because it was not just analysis: I also created materials and gave a presentation to receive feedback. I realized how long it takes to collect and analyze data and how hard it is to convey the results to others, and I would like to apply these lessons in the future.
That's all.