Nice to meet you. I'm moving next year. "I'm bad with trains, so ideally I'd live within walking distance of work." With that in mind I started hunting for an apartment, but... **there is no feature that shows the distance from a specified destination to each listing!** SUUMO does have a map search, but it's hard to use because you can't see listing details in the results list. So, in the spirit of "if it doesn't exist, build it yourself," here we go. I'm a complete beginner who has only just touched scraping, so please bear with me. The code is horribly messy, but forgive me > < If you spot something I should be doing differently, I'd love to hear it!
For every listing within x minutes' walk of a specified address, collect: the walking time, building name, rent (including management and common-service fees), floor, floor plan, floor area, the rental site it came from, and a Google Maps URL. Also, without a feel for market rates you risk getting ripped off, so the plan is to eventually gather the same listing's information from multiple rental sites.
Note that I don't explain the code in much detail here. I'll write that up properly later.
Windows 10 version 20H2
Python 3.7.4
Jupyter Notebook
pip install beautifulsoup4
pip install selenium
pip install openpyxl
pip install xlwt
You'll need this to drive Chrome for scraping. You can find it by searching for "ChromeDriver download". Download the driver that matches your version of Chrome.
This time I'll use SUUMO as the rental site; I plan to add other sites later. Travel time is calculated with Google Maps.
Work location: Tokyo Skytree
Area: Sumida Ward
Commute: within a 15-minute walk
Floor plan: 1K
Rent: 80,000 yen or less

I'll write the code assuming a search under these conditions.
First, import the modules you need.
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
Next, define your destination and how many minutes you can walk.
#Destination Tokyo Sky Tree
DESTINATION = '1-1-2 Oshiage, Sumida-ku, Tokyo'
#How many minutes to allow on foot
DURATION = 15
Then access the SUUMO site.
#SUUMO scraping
suumo_br = webdriver.Chrome('C:\\Users\\hogehoge\\chromedriver') #For Windows, pass the path to chromedriver
# suumo_br = webdriver.Chrome() #For Mac
suumo_br.implicitly_wait(3)
#URL of suumo property search results
url_suumo = "https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13107&cb=0.0&ct=8.0&co=1&et=9999999&md=02&cn=9999999&mb=0&mt=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&fw2="
suumo_br.get(url_suumo)
time.sleep(5)
print('I visited SUUMO')
For the variable url_suumo, paste the URL you get after narrowing the search down to your preferred conditions. In this article we filter to 1K listings in Sumida-ku, Tokyo at 80,000 yen or less.
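If you're curious what that narrowed-down URL actually encodes, the query string can be unpacked with the standard library. (Reading `cb`/`ct` as the lower/upper rent bounds in units of 10,000 yen is my guess from matching the search form; SUUMO doesn't document these keys.)

```python
from urllib.parse import urlparse, parse_qs

url_suumo = "https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13107&cb=0.0&ct=8.0&co=1&et=9999999&md=02&cn=9999999&mb=0&mt=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&fw2="

# Each value comes back as a list of strings
params = parse_qs(urlparse(url_suumo).query, keep_blank_values=True)
print(params['ct'])  # ['8.0'] -- appears to be the upper rent bound (8 x 10,000 yen)
```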
The SUUMO results page is now open. Next, parse its HTML to get the list of property addresses.
soup = BeautifulSoup(suumo_br.page_source, 'html.parser')
#List of property addresses
addresses = [c.get_text() for c in soup.find_all('li', class_='cassetteitem_detail-col1')]
print(addresses)
Output result
['Ryogoku 2 in Sumida-ku, Tokyo', 'Midori 4 Sumida-ku, Tokyo', '2 Higashimukojima, Sumida-ku, Tokyo', '2 Higashimukojima, Sumida-ku, Tokyo', '5 Yahiro, Sumida-ku, Tokyo', 'Midori 4 Sumida-ku, Tokyo', '3 Chitose, Sumida-ku, Tokyo', '1 Kyojima, Sumida-ku, Tokyo', '6 Higashimukojima, Sumida-ku, Tokyo', '6 Higashimukojima, Sumida-ku, Tokyo', '4 Tachibana, Sumida-ku, Tokyo', '5 Higashimukojima, Sumida-ku, Tokyo', '6 Yahiro, Sumida-ku, Tokyo', '4 Tatekawa, Sumida-ku, Tokyo', 'Ryogoku 2 in Sumida-ku, Tokyo', '2 Yahiro, Sumida-ku, Tokyo', '2 Sumida, Sumida-ku, Tokyo', '4 Tachibana, Sumida-ku, Tokyo', '4 Tachibana, Sumida-ku, Tokyo', '1 Tachibana, Sumida-ku, Tokyo', '1 Tachibana, Sumida-ku, Tokyo', '5 Mukojima, Sumida-ku, Tokyo', '1 Kikukawa, Sumida-ku, Tokyo', '6 Higashimukojima, Sumida-ku, Tokyo', '5 Yahiro, Sumida-ku, Tokyo', '1 Higashimukojima, Sumida-ku, Tokyo', '1 Higashimukojima, Sumida-ku, Tokyo', '2 Bunka, Sumida-ku, Tokyo', '5 Mukojima, Sumida-ku, Tokyo', '5 Yahiro, Sumida-ku, Tokyo']
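If you want to check this parsing logic without hitting the live site, you can run the same `find_all` call against a minimal HTML snippet shaped like SUUMO's listing markup. (The snippet below is a hand-made stand-in, not real SUUMO output.)

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for SUUMO's listing markup
sample_html = """
<div class="cassetteitem">
  <ul><li class="cassetteitem_detail-col1">Tokyo, Sumida-ku, Ryogoku 2</li></ul>
</div>
<div class="cassetteitem">
  <ul><li class="cassetteitem_detail-col1">Tokyo, Sumida-ku, Midori 4</li></ul>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
addresses = [c.get_text() for c in soup.find_all('li', class_='cassetteitem_detail-col1')]
print(addresses)  # ['Tokyo, Sumida-ku, Ryogoku 2', 'Tokyo, Sumida-ku, Midori 4']
```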
Next, get the building names. On SUUMO, a station name plus floors and age sometimes appears in place of a building name, but we'll let that slide this time.
#Collect the table of units for each building (used below to count them)
properties = soup.find_all('table', class_='cassetteitem_other')
#Get the building name
buildings = [c.get_text() for c in soup.find_all('div', class_='cassetteitem_content-title')]
print(buildings)
Output result
['Exclusive ID Ryogoku', 'Soara Plaza Kinshicho', 'Tokyo Mito Street', 'Tobu Isesaki Line Hikifune Station 11 stories 16 years old', 'Dolce Forest', 'Katsu Palace', 'Higuchi Heights', 'Keisei Oshiage Line Keisei Hikifune Station 5 stories 30 years old', 'Graceful Place', 'Keisei Oshiage Line Yahiro Station 2 stories 8 years old', 'Lyric Court Hiraibashi', 'Crayno Bonnur II', 'Prosperity Sky Tree', 'Like Kikukawa East', 'JR Sobu Line Ryogoku Station 7 stories 12 years old', 'Rigale Sumida Levante', 'Tobu Isesaki Line Kanegafuchi Station 3 stories new construction', 'Tobu Kamedo Line Higashi Azuma Station 3 stories 13 years old', 'Rilassante Tachibana', 'Stall house', 'Tobu-Kameido Line Omurai Station 4 stories 2 years old', 'Live City Mukojima', 'Bonnard', 'Mallage Nine', 'Beakasa Hikifune', 'Belfort', 'Tobu Isesaki Line Hikifune Station 3 stories 3 years old', 'El Viento Earth Sumida Azuma', 'Tobu Isesaki Line Hikifune Station 3 stories 6 years old', 'Keisei Oshiage Line Yahiro Station 3 stories 15 years old']
The XPath differs depending on whether a building has multiple listed units or just one, so to handle both cases we first get the number of units listed per building.
#Count the number of units listed for each building
properties_num_list = []
for prop in properties:
    prop = str(prop)
    properties_num_list.append(prop.count('<tbody>'))
print(properties_num_list)
# [1, 12, 8, 8, 1, 2, 3, 1, 3, 3, 1, 1, 5, 4, 1, 4, 1, 1, 1, 1, 1, 5, 1, 1, 1, 2, 2, 3, 2, 1]
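Counting the substring `'<tbody>'` in the serialized HTML works, but it is a little fragile (a tag with attributes, like `<tbody class="...">`, would be missed). Counting the parsed tags directly is equivalent on clean markup and safer; a standalone check on a hand-made snippet:

```python
from bs4 import BeautifulSoup

# Hand-made snippet with two units under one building
sample_table = BeautifulSoup(
    '<table class="cassetteitem_other">'
    '<tbody><tr><td>room 1</td></tr></tbody>'
    '<tbody><tr><td>room 2</td></tr></tbody>'
    '</table>', 'html.parser')

# Same number either way on clean markup
count_by_string = str(sample_table).count('<tbody>')
count_by_tags = len(sample_table.find_all('tbody'))
print(count_by_string, count_by_tags)  # 2 2
```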
Next, open Google Maps.
browser = webdriver.Chrome('C:\\Users\\hogehoge\\chromedriver')
# browser = webdriver.Chrome() #For Mac
browser.implicitly_wait(3)
#googlemap url
url_map = "https://www.google.co.jp/maps/dir///@35.7130112,139.8029662,14.95z?hl=ja"
browser.get(url_map)
time.sleep(3)
print('I visited Google Map')
Google Maps returns several route candidates with different travel times, so define a function that picks the shortest one.
#Function to pick the shortest travel time among the Google Maps route candidates
def shortest_path(travel_times):
    min_travel_time = float('inf')  #initialize once, before the loop, so every candidate is compared
    for travel_time in travel_times:
        #With hl=ja, Google Maps renders durations like '15 分'
        travel_time = int(travel_time.replace('分', ''))
        if travel_time < min_travel_time:
            min_travel_time = travel_time
    return min_travel_time
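This function quietly depends on where `min_travel_time` is initialized (if it is reset inside the loop, only the last candidate is ever compared), so it is worth sanity-checking in isolation. A standalone version, redefined here so the snippet runs on its own, with made-up duration strings in the `'N分'` format that Google Maps shows under `hl=ja` (adjust the `replace` target to whatever your locale actually renders):

```python
def shortest_path(travel_times):
    # Initialize once, outside the loop, so all candidates are compared
    min_travel_time = float('inf')
    for travel_time in travel_times:
        travel_time = int(travel_time.replace('分', ''))
        if travel_time < min_travel_time:
            min_travel_time = travel_time
    return min_travel_time

print(shortest_path(['18分', '12分', '25分']))  # 12
print(shortest_path([]))  # inf -- no route found; filtered out later by the > DURATION check
```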
Now get the travel time to Skytree for each address in the list. The browser is driven automatically to fetch each travel time.
element = browser.find_element_by_xpath('/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[2]/div/div[3]/div[1]/div[1]/div[2]/div/div/input')
element.clear()
element.send_keys(DESTINATION)

#Calculate the travel time from each address to the destination
min_travel_times = []
map_url = []
for i, address in enumerate(addresses):
    element = browser.find_element_by_xpath('/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[2]/div/div[3]/div[1]/div[2]/div[2]/div/div/input')
    element.clear()
    element.send_keys(address)
    search_button = browser.find_element_by_xpath('/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[2]/div/div[3]/div[1]/div[2]/div[2]/button[1]')
    search_button.click()
    time.sleep(3)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    #List of travel-time candidates to the destination
    travel_times = [c.get_text() for c in soup.find_all('div', class_='section-directions-trip-duration')]
    #Keep only the shortest one
    min_travel_times.append(shortest_path(travel_times))
    #Save the google map url
    map_url.append(browser.current_url)
Define the XPath fragments used to pull each piece of listing information.

#Listing url
#When a building has multiple listed units
path_1 = '//*[@id="js-bukkenList"]/ul['
path_2 = ']/li['
path_3 = ']/div/div[2]/table/tbody['
path_4 = ']/tr/td[9]/a'
#When a building has only one listed unit
path_mono_1 = '//*[@id="js-bukkenList"]/ul['
path_mono_2 = ']/li['
path_mono_3 = ']/div/div[2]/table/tbody/tr/td[9]/a'
#Number of floors
#When a building has multiple listed units
path_floor = ']/tr/td[3]'
#When a building has only one listed unit
path_mono_floor = ']/div/div[2]/table/tbody[1]/tr/td[3]'
#Rent
#When a building has multiple listed units
path_rent = ']/tr/td[4]/ul/li[1]/span/span'
#When a building has only one listed unit
path_mono_rent = ']/div/div[2]/table/tbody/tr/td[4]/ul/li[1]/span/span'
#Management fee
#When a building has multiple listed units
path_fee = ']/tr/td[4]/ul/li[2]/span'
#When a building has only one listed unit
path_mono_fee = ']/div/div[2]/table/tbody[1]/tr/td[4]/ul/li[2]/span'
#Floor plan
#When a building has multiple listed units
path_plan = ']/tr/td[6]/ul/li[1]/span'
#When a building has only one listed unit
path_mono_plan = ']/div/div[2]/table/tbody[1]/tr/td[6]/ul/li[1]/span'
#Occupied area
#When a building has multiple listed units
path_area = ']/tr/td[6]/ul/li[2]/span'
#When a building has only one listed unit
path_mono_area = ']/div/div[2]/table/tbody[1]/tr/td[6]/ul/li[2]/span'
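These fragments get glued together with `str(i)`, `str(j)`, `str(k)` later; wrapping that concatenation in a tiny helper keeps the loop body readable. (The helper name is mine, not from the original code.)

```python
# Hypothetical helper: interleave xpath fragments and integer indices
def build_xpath(*parts):
    return ''.join(str(p) for p in parts)

path_1 = '//*[@id="js-bukkenList"]/ul['
path_2 = ']/li['
path_3 = ']/div/div[2]/table/tbody['
path_4 = ']/tr/td[9]/a'

path = build_xpath(path_1, 2, path_2, 3, path_3, 1, path_4)
print(path)  # //*[@id="js-bukkenList"]/ul[2]/li[3]/div/div[2]/table/tbody[1]/tr/td[9]/a
```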
Write a function that adds up the rent and management costs.
#A function that adds rent and management fee together
def calc_rent(rent, fee):
    str_rent = rent.replace('万円', '')   #rent is shown like '8.5万円' (万円 = 10,000 yen)
    float_rent = float(str_rent) * 10000
    str_fee = fee.replace('円', '')       #fee is shown like '5000円'
    float_fee = float(str_fee)
    return float_rent + float_fee
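A standalone check of this conversion, redefined here so it runs on its own. The sample strings assume SUUMO's formatting of rent in 万円 and fees in 円; the values themselves are made up.

```python
def calc_rent(rent, fee):
    # '8.5万円' -> 85000.0, '5000円' -> 5000.0
    float_rent = float(rent.replace('万円', '')) * 10000
    float_fee = float(fee.replace('円', ''))
    return float_rent + float_fee

print(calc_rent('8.5万円', '5000円'))  # 90000.0
```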
Next, collect each piece of information via those XPaths and put it into a DataFrame. (Sorry the code is messy > <)
df = pd.DataFrame(columns=['Building name', 'Commuting time', 'rent', 'Number of floors', 'Floor plan', 'Occupied area', 'map', 'url'])

i, j = 1, 1
for prop_info in zip(min_travel_times, properties_num_list, buildings, map_url):
    if prop_info[0] > DURATION:
        #Skip if longer than the allowed walking time
        print('Out of Duration')
        j += 1
        if j % 6 == 0:
            i += 1
            j = 1
        continue
    if prop_info[1] == 1:
        # url
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_3
        prop_url = suumo_br.find_element_by_xpath(path).get_attribute('href')
        #Number of floors
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_floor
        prop_floor = suumo_br.find_element_by_xpath(path).text
        #rent
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_rent
        temp_rent = suumo_br.find_element_by_xpath(path).text
        #Management fee
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_fee
        temp_fee = suumo_br.find_element_by_xpath(path).text
        #Floor plan
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_plan
        prop_plan = suumo_br.find_element_by_xpath(path).text
        #Occupied area
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_area
        prop_area = suumo_br.find_element_by_xpath(path).text
        prop_rent = calc_rent(temp_rent, temp_fee)
        print(prop_url)
        df = df.append({'Building name': prop_info[2], 'Commuting time': prop_info[0], 'rent': prop_rent, 'Number of floors': prop_floor, 'Floor plan': prop_plan, 'Occupied area': prop_area, 'map': prop_info[3], 'url': prop_url}, ignore_index=True)
    else:
        for k in range(1, prop_info[1] + 1):
            # url
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_4
            prop_url = suumo_br.find_element_by_xpath(path).get_attribute('href')
            #Number of floors
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_floor
            prop_floor = suumo_br.find_element_by_xpath(path).text
            #rent
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_rent
            temp_rent = suumo_br.find_element_by_xpath(path).text
            #Management fee
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_fee
            temp_fee = suumo_br.find_element_by_xpath(path).text
            #Floor plan
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_plan
            prop_plan = suumo_br.find_element_by_xpath(path).text
            #Occupied area
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_area
            prop_area = suumo_br.find_element_by_xpath(path).text
            prop_rent = calc_rent(temp_rent, temp_fee)
            print(prop_url)
            df = df.append({'Building name': prop_info[2], 'Commuting time': prop_info[0], 'rent': prop_rent, 'Number of floors': prop_floor, 'Floor plan': prop_plan, 'Occupied area': prop_area, 'map': prop_info[3], 'url': prop_url}, ignore_index=True)
    #Each <ul> holds 5 <li> items, so move to the next <ul> after the 5th
    j += 1
    if j % 6 == 0:
        i += 1
        j = 1
I will check the contents.
df.head()
Finally, write it out to an Excel file and we're done.
df.to_excel('sample.xlsx', encoding='utf_8_sig', index=False)
At the moment only the first results page is scraped, so I'll improve it to load all pages. I also plan to scrape other rental sites. If you know of other useful features or listing information worth collecting, please let me know!
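As a first step toward loading every page, the result URL can be extended with a page parameter. Whether SUUMO actually uses `page=` for pagination is an assumption you should verify against the site's real "next page" links; this sketch only covers the URL handling, no browser needed.

```python
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def with_page(url, page):
    # Replace (or add) the page parameter in a search-result URL.
    # NOTE: 'page' as the parameter name is an assumption about SUUMO's pagination.
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query['page'] = str(page)
    return urlunparse(parts._replace(query=urlencode(query)))

print(with_page('https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&ta=13', 2))
# https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&ta=13&page=2
```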