I tried searching on my smartphone for a good meal in an unfamiliar town, but ...
- It's hard to pick out a highly rated restaurant from Tabelog's list
- With Google Maps, you have to click each search result one by one to see its reviews
- Review sites don't publish negative opinions, so it's hard to tell whether a place is really good
Even if you take out your smartphone and search for a good restaurant in town, it is hard to spot a highly rated one at a glance. **I want to know the highly rated shops around my current location!** In such a case, the highly rated shop map introduced on this page comes in handy.
By scraping, we collect information on the shops featured in Tabelog's 100 Famous Shops (Hyakumeiten). By importing that information into Google My Maps, we create a map of highly rated shops.
Extracting data from the Web and turning it into structured data that can be analyzed is called scraping, but there are a few points to keep in mind.
If the collected data contains copyrighted material, you must respect the copyright. Do not hand data collected by scraping over to others or start a business based on it. On the other hand, copying for private use appears to be permitted.
In many cases, collecting structured data means sending many requests to the web server. Be aware that sending a large number of requests in a short period can overload the server. In Japan, there is even a past case where someone was arrested over unintended server load (the Okazaki Municipal Central Library incident, linked at the end of this article).
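As a rough illustration, pacing requests can be as simple as sleeping between them; the URLs below are placeholders, and the main script later in this article uses the same `sleep(5)` pattern:

```python
import time
import requests

# Hypothetical pages to fetch; placeholders, not real targets
urls = ["https://example.com/page1", "https://example.com/page2"]
for u in urls:
    requests.get(u, timeout=10)
    time.sleep(5)  # pause between requests to avoid hammering the server
```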
Please read the terms of use carefully, as scraping may be prohibited. Websites also have a text file called robots.txt for controlling crawlers. I won't go into the details of this file, but before scraping, check the robots.txt of the target website and follow what it specifies (a minimal way to check it is sketched below).
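For reference, a minimal robots.txt check using only Python's standard library might look like this sketch (robots.txt lives at the root of each host, so the exact URL here is an assumption based on the site's domain):

```python
from urllib.robotparser import RobotFileParser

# robots.txt is served from the root of each host; this URL is an assumption
rp = RobotFileParser()
rp.set_url("https://award.tabelog.com/robots.txt")
rp.read()

# Check whether a generic crawler may fetch the target page
print(rp.can_fetch("*", "https://award.tabelog.com/hyakumeiten/tonkatsu"))
```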
I have only touched on these points briefly, and since this is my own summary it may well be incomplete. Do your own research and make sure there are no problems before you scrape.
The implementation uses Python with the requests and BeautifulSoup libraries. They are the most widely used tools for scraping and have plenty of reference articles, so they are easy to handle even for beginners.
main.py
```python
import requests
from bs4 import BeautifulSoup
from time import sleep
import pandas

# Fetch the genre page of Tabelog's 100 Famous Shops (Hyakumeiten 2019)
url = "https://award.tabelog.com/hyakumeiten/tonkatsu"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# Collect the links to the individual shop pages
shoplinks = soup.find_all("a", class_="list-shop__link-page")

rowdata = []
for shoplink in shoplinks:
    url = shoplink.get("href")
    rshop = requests.get(url)
    soup = BeautifulSoup(rshop.content, "html.parser")
    print("------------------")
    print(url)

    # Shop name (select_one returns None instead of raising
    # if the page structure is missing, unlike chained find() calls)
    shopname = soup.select_one("div.rdheader-rstname h2 span")
    if shopname is not None:
        shopname = shopname.get_text().strip()
    print(shopname)

    # Address
    address = soup.find("p", class_="rstinfo-table__address")
    if address is not None:
        address = address.get_text()
    print(address)

    # Rating score
    point = soup.find("span", class_="rdheader-rating__score-val-dtl")
    if point is not None:
        point = point.get_text().strip()
    print(point)

    # Regular holiday (keep only the first 10 characters)
    regholiday = soup.find("dd", class_="rdheader-subinfo__closed-text")
    if regholiday is not None:
        regholiday = regholiday.get_text().strip()[0:10]
    print(regholiday)

    rowdata.append([shopname, address, point, regholiday, url])
    sleep(5)  # wait between requests so as not to overload the server

print(rowdata)

# Save the collected rows as a CSV that Google My Maps can import
df = pandas.DataFrame(
    rowdata, columns=["shopname", "address", "point", "regular holiday", "url"])
df.to_csv("tonkatsu" + ".csv", index=False)
```
This code targets the 100 Famous Shops 2019. When I checked Hyakumeiten 2020, I could not retrieve the data because the structure of the web pages had changed.
To run it, just install Python together with the requests, beautifulsoup4, and pandas packages and execute this script. Running it saves the data of the tonkatsu shops selected for the 100 Famous Shops to a CSV file with the columns shopname, address, point, regular holiday, and url. If you want gyoza (dumpling) data instead, change the end of the url to gyoza and the "tonkatsu" on the last line to gyoza.
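For example, here is a rough sketch of switching the genre via a command-line argument; this is my own illustration, not the unpublished argument-handling script mentioned later:

```python
import sys

# Hypothetical sketch: read the genre slug from the command line,
# e.g.  python main.py gyoza   (defaults to "tonkatsu")
genre = sys.argv[1] if len(sys.argv) > 1 else "tonkatsu"
url = "https://award.tabelog.com/hyakumeiten/" + genre
csv_name = genre + ".csv"
print(url, csv_name)
```

With `url` and the CSV name built this way, the rest of main.py can stay unchanged.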
All that is left is to import the generated csv into Google My Maps. After importing the csv files for all genres, the 100 Famous Shops should be plotted on the map as shown below.
In this way, we were able to scrape the shop information featured in Tabelog's 100 Famous Shops into structured data and put it on a map. My Maps can be viewed on Android, so you can instantly find the 100 Famous Shops closest to your current location.
It's a script I wrote a long time ago, and looking at it again, the code is so bad that being a Python beginner is no excuse... I actually have a version that takes the genre as a runtime argument, but its quality makes me hesitate to publish it. If the following are resolved, I may publish it on GitHub:
- Make it an executable file
- Logging
- Read the url from a config file
- Repository structure based on The Hitchhiker's Guide to Python
- Tests
- Design review
- https://docs.pyq.jp/column/crawler.html
- https://www.cric.or.jp/qa/hajime/hajime8.html
- https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6 (Okazaki Municipal Central Library incident)