Since I'm thinking of moving at some point, I wanted to find out what kinds of properties are out there and whether there are any bargains. Checking by hand every time is tedious, so I decided to apply the scraping I tried last time to collect the property listings.
Eventually I'd like to plot the collected information on a map, but I'll start by just collecting the property information.
To put it simply, scraping means "collecting information on the Internet with a program." It is done in the following two steps.
① Fetch the HTML → ② Extract the data you need
First, regarding ①: a web page is built with a language called HTML. In Google Chrome, open the menu in the upper-right corner and choose "More tools" → "Developer tools", and the page's code appears on the right side of the screen. That code is what draws the screen, and scraping pulls it down to your own computer. As for ②, HTML has a nested structure, and each element is distinguished by its tag and labeled with attributes such as class and id. You can therefore pick out just the data you need by specifying a tag or label.
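The two steps can be sketched with requests and BeautifulSoup. In this sketch a small inline HTML snippet (modeled on the class names that appear later in this article) stands in for the fetched page; step ① would normally be a requests.get call, shown here only as a comment.

```python
from bs4 import BeautifulSoup
# import requests  # step (1) would be: html = requests.get(url).content

# A tiny HTML snippet stands in for a fetched page (illustrative markup)
html = """
<div id="js-bukkenList">
  <div class="cassetteitem">
    <div class="cassetteitem_content-title">Sample Mansion</div>
  </div>
</div>
"""

# (2) Extract the data you need by selecting tags and labels
soup = BeautifulSoup(html, 'html.parser')
summary = soup.find('div', {'id': 'js-bukkenList'})
name = summary.find('div', {'class': 'cassetteitem_content-title'}).text
print(name)  # Sample Mansion
```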
The execution environment is as follows.
For the implementation, I referred to other articles. The code in those articles works as-is to get scraping results, but it takes quite a long time to run, so I rewrote parts of it.
The whole code is here.
output.py
from bs4 import BeautifulSoup
import requests
import csv
import time

#URL (please enter the URL here)
url = 'https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13101&cb=0.0&ct=9999999&mb=0&mt=9999999&et=9999999&cn=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&sngz=&po1=09&pc=50'

# Fetch the first page of results
result = requests.get(url)
soup = BeautifulSoup(result.content, 'html.parser')

# Read the total number of pages out of the pagination block
body = soup.find("body")
pages = body.find_all("div", {'class': 'pagination pagination_set-nav'})
pages_text = str(pages)
pages_split = pages_text.split('</a></li>\n</ol>')
pages_split0 = pages_split[0]
pages_split1 = pages_split0[-3:]            # the last characters hold the page count
pages_split2 = pages_split1.replace('>', '')
pages_split3 = int(pages_split2)

# Build the URL of every results page
urls = [url]
print('get all url...')
for i in range(pages_split3 - 1):
    pg = str(i + 2)
    url_page = url + '&page=' + pg
    urls.append(url_page)
print('num all urls is {}'.format(len(urls)))

# Scrape each page and append one CSV row per room
f = open('output.csv', 'a')
writer = csv.writer(f)
for url in urls:
    print('get data of url({})'.format(url))
    result = requests.get(url)
    soup = BeautifulSoup(result.content, 'html.parser')
    summary = soup.find("div", {'id': 'js-bukkenList'})
    apartments = summary.find_all("div", {'class': 'cassetteitem'})
    for apart in apartments:
        room_number = len(apart.find_all('tbody'))   # one tbody per room
        name = apart.find("div", {'class': 'cassetteitem_content-title'}).text
        address = apart.find("li", {'class': 'cassetteitem_detail-col1'}).text
        age_and_height = apart.find('li', class_='cassetteitem_detail-col3')
        age = age_and_height('div')[0].text
        height = age_and_height('div')[1].text
        money = apart.find_all("span", {'class': 'cassetteitem_other-emphasis ui-text--bold'})
        madori = apart.find_all("span", {'class': 'cassetteitem_madori'})
        menseki = apart.find_all("span", {'class': 'cassetteitem_menseki'})
        floor = apart.find_all("td")
        for i in range(room_number):
            write_list = [name, address, age, height,
                          money[i].text, madori[i].text, menseki[i].text,
                          floor[2 + i * 9].text.replace('\t', '').replace('\r', '').replace('\n', '')]
            writer.writerow(write_list)
    time.sleep(10)   # pause between requests to be polite to the server
f.close()
In the code above, the part
#URL (please enter the URL here)
url = 'https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13101&cb=0.0&ct=9999999&mb=0&mt=9999999&et=9999999&cn=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&sngz=&po1=09&pc=50'
is where you enter the URL of the SUUMO property listing you want to scrape. Then run the script; if output.csv is produced, it has succeeded.
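One note on the code: the page count is recovered by slicing the pagination HTML as a raw string, which breaks easily if the markup changes even slightly. A more robust sketch reads the page numbers straight from the links instead (the pagination markup below is assumed for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for the pagination block on the listing page (illustrative markup)
html = """
<div class="pagination pagination_set-nav">
  <ol><li><a>1</a></li><li><a>2</a></li><li><a>11</a></li></ol>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
pagination = soup.find('div', {'class': 'pagination pagination_set-nav'})
# Take the largest page number shown, instead of slicing raw HTML text
page_numbers = [int(a.text) for a in pagination.find_all('a') if a.text.isdigit()]
last_page = max(page_numbers)
print(last_page)  # 11
```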
The output should look like the following in output.csv.
output.csv (excerpt)
Tokyo Metro Hanzomon Line Jimbocho Station 7 stories 16 years old,2 Kanda Jimbocho, Chiyoda-ku, Tokyo,16 years old,7 stories,69,000 yen,Studio,13.04m2,4th floor
Tokyo Metro Hanzomon Line Jimbocho Station 7 stories 16 years old,2 Kanda Jimbocho, Chiyoda-ku, Tokyo,16 years old,7 stories,77,000 yen,Studio,16.64m2,4th floor
Kudan Flower Home,4 Kudankita, Chiyoda-ku, Tokyo,42 years old,9 stories,75,000 yen,Studio,21.07m2,5th floor
Villa Royal Sanbancho,Sanbancho, Chiyoda-ku, Tokyo,44 years old,8 stories,85,000 yen,Studio,23.16m2,4th floor
Villa Royal Sanbancho,Sanbancho, Chiyoda-ku, Tokyo,44 years old,8 stories,85,000 yen,Studio,23.16m2,4th floor
The elements are comma-separated and correspond, in order, to:
[Building name],[Address],[Building age],[Number of stories],[Rent],[Floor plan],[Area],[Floor of the room]
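As a sketch of how these rows can be read back with labeled fields, the snippet below parses one sample line in place of output.csv (the field names are my own labels, chosen to match the column order above):

```python
import csv
import io

# Column names in the order the script writes them (labels are illustrative)
FIELDS = ['name', 'address', 'age', 'stories', 'rent', 'floor_plan', 'area', 'floor']

# One sample line stands in for output.csv; quoted fields keep embedded commas intact
sample_csv = io.StringIO(
    'Kudan Flower Home,"4 Kudankita, Chiyoda-ku, Tokyo",42 years old,'
    '9 stories,"75,000 yen",Studio,21.07m2,5th floor\n'
)

for values in csv.reader(sample_csv):
    row = dict(zip(FIELDS, values))
    print(row['name'], row['rent'])  # Kudan Flower Home 75,000 yen
```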
I have confirmed that listings for Chiyoda Ward and Setagaya Ward can be retrieved from SUUMO this way.
I scraped SUUMO and collected the property information. It was a lot of fun to do something unrelated to my usual work. Ultimately, I think it will be even more fun to plot these listings on a map and run various analyses, so I will try that next.