** "Use scraping to find a delicious and powerful restaurant regardless of the tabelog score! 』**
Because the score is an index that reflects users' voices, it rises as more high ratings are collected from influential users. For example, if the reviewers' influence is the same, a store with one hundred 5-point ratings will get a higher score than a store with only two 5-point ratings.
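To get a feel for why the number of high ratings matters even when reviewer influence is equal, here is a purely illustrative sketch. Tabelog's actual formula is not public; `illustrative_score`, the prior values, and the influence weights below are all my own assumptions, used only to show how a prior-weighted average rises as more high ratings accumulate.

```python
# Purely hypothetical sketch -- Tabelog's real formula is not public.
# A prior-weighted (Bayesian-style) average pulls scores toward a prior mean,
# so a store needs many high ratings before its score approaches 5.0.
def illustrative_score(ratings, influence, prior_mean=3.0, prior_weight=10.0):
    weighted_sum = sum(r * w for r, w in zip(ratings, influence))
    weight_total = sum(influence)
    return (prior_weight * prior_mean + weighted_sum) / (prior_weight + weight_total)

print(illustrative_score([5.0] * 100, [1.0] * 100))  # ~4.82: one hundred 5-point ratings
print(illustrative_score([5.0] * 2,   [1.0] * 2))    # ~3.33: only two 5-point ratings
```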
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import math
import time
# Fetch the Tabelog top page and collect the area-switcher links
root_url = 'https://tabelog.com/'
res = requests.get(root_url)
soup = BeautifulSoup(res.content, 'html.parser')
a = soup.find_all('a', class_='rsttop-area-search__target js-area-swicher-target')

# Build a nested dict: prefecture -> {area name: area URL}
area_url = {}
for area in a:
    area_dict = {}
    splitted = area.get('data-swicher-area-list').split('"')
    for i in range(int((len(splitted)-1)/8)):
        area_dict[splitted[i*8+3]] = splitted[i*8+7]
    area_url[area.get('data-swicher-city').split('"')[3]] = area_dict
# Narrow down to the areas we want to visit and collect the sub-area URLs
visit_areas = ['Shibuya / Ebisu / Daikanyama']
url_dict = {}
for visit_area in visit_areas:
    url = area_url['Tokyo'][visit_area]
    time.sleep(1)
    res = requests.get(root_url + url[1:])
    soup = BeautifulSoup(res.content, 'html.parser')
    a = soup.find_all('a', class_='c-link-arrow')
    for area in a:
        href = area['href']
        # Keep only links that belong to the selected area
        if href[-21:-8] != url:
            continue
        else:
            url_dict[area.text] = href
max_page = 20
restaurant_data = []

for area, url in url_dict.items():
    time.sleep(1)
    res_area = requests.get(url)
    soup_area = BeautifulSoup(res_area.content, 'html.parser')
    rc_count = int(soup_area.find_all('span', class_='list-condition__count')[0].text)
    print('There are ' + str(rc_count) + ' restaurants in ' + area)

    # Walk the restaurant list pages (20 restaurants per page, 60-page site limit)
    for i in range(1, min(math.ceil(rc_count/20)+1, max_page+1, 61)):
        print('Processing... ' + str(i) + '/' + str(min(math.ceil(rc_count/20)+1, max_page+1, 61)-1))
        url_rc = url + 'rstLst/RC/' + str(i) + '/?Srt=D&SrtT=nod'
        res_rc = requests.get(url_rc)
        soup_rc = BeautifulSoup(res_rc.content, 'html.parser')

        for rc_div in soup_rc.find_all('div', class_='list-rst__wrap js-open-new-window'):
            # Restaurant name, detail-page URL and Tabelog score (-1 if no score is shown)
            rc_name = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name').text
            rc_url = rc_div.find('a', class_='list-rst__rst-name-target cpy-rst-name')['href']
            rc_score = rc_div.find('span', class_='c-rating__val c-rating__val--strong list-rst__rating-val')
            if rc_score is None:
                rc_score = -1.
            else:
                rc_score = float(rc_score.text)
            rc_review_num = rc_div.find('em', class_='list-rst__rvw-count-num cpy-review-count').text

            if rc_review_num != ' - ':
                # Collect the individual dinner review scores page by page
                page = 1
                score = []
                while True:
                    rc_url_pg = rc_url + 'dtlrvwlst/COND-2/smp0/?smp=0&lc=2&rvw_part=all&PG=' + str(page)
                    time.sleep(1)
                    res_pg = requests.get(rc_url_pg)
                    soup_pg = BeautifulSoup(res_pg.content, 'html.parser')
                    if "I can't find the page I'm looking for" in soup_pg.find('title').text:
                        break
                    # Nearest station (fall back to the plain text field, then to '')
                    try:
                        station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('span', class_='linktree__parent-target-text').text
                    except:
                        try:
                            station = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[0].find('dd', class_='rdheader-subinfo__item-text').text.replace('\n', '').replace(' ', '')
                        except:
                            station = ''
                    genre = '/'.join([genre_.text for genre_ in soup_pg.find_all('dl', class_='rdheader-subinfo__item')[1].find_all('span', class_='linktree__parent-target-text')])
                    price = soup_pg.find_all('dl', class_='rdheader-subinfo__item')[2].find('p', class_='rdheader-budget__icon rdheader-budget__icon--dinner').find('a', class_='rdheader-budget__price-target').text
                    score = score + [score_.next_sibling.next_sibling.text for score_ in soup_pg.find_all('span', class_='c-rating__time c-rating__time--dinner')]
                    page += 1
                    # 100 reviews per page, so stop once every review page has been read
                    if page == math.ceil(int(rc_review_num)/100)+1:
                        break
                restaurant_data.append([area, rc_count, rc_name, rc_url, rc_score, rc_review_num, station, genre, price, score])
For explanations and details of the code, see "Scraping and Tabelog ~ I want to find a good restaurant! ~ (Work)"; please have a look if you are interested.
The results are as follows! The targets are the 400 most recently opened restaurants in the Shibuya / Ebisu / Daikanyama area.
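As a minimal sketch (not necessarily the article's own plotting code), the scraped `restaurant_data` list can be packed into a DataFrame and drawn as a scatter plot; the column names and the choice of axes (number of reviews vs. Tabelog score) are my assumptions.

```python
# Minimal sketch: pack the scraped rows into a DataFrame and draw a scatter plot.
# Column names and the review-count-vs-score axes are assumptions.
df = pd.DataFrame(restaurant_data,
                  columns=['area', 'rc_count', 'name', 'url', 'score',
                           'review_num', 'station', 'genre', 'price', 'review_scores'])
df = df[df['score'] > 0]                    # drop rows where no score was shown (-1)
df['review_num'] = df['review_num'].astype(int)

plt.scatter(df['review_num'], df['score'], alpha=0.5)
plt.xlabel('number of reviews')
plt.ylabel('Tabelog score')
plt.show()
```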
What you can read from this scatter plot is ...
From here on, I'll dig into this a little further, led by my own curiosity.
Looking at the earlier scatter plot, I was curious that the distribution of Tabelog scores is distorted: restaurants are concentrated at certain scores. Looking at the distribution itself, it is striking not only that restaurants cluster at certain scores, but also that there are extremely few restaurants at other specific scores.
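For reference, a histogram of the scores can be drawn from the same hypothetical DataFrame sketched above; the 3.0 to 4.0 range and the 0.01-wide bins are assumptions chosen to expose clustering and gaps at particular score values.

```python
# Sketch: histogram of the scraped Tabelog scores, reusing the df from above.
# Range and bin width are assumptions, not values taken from the article.
plt.hist(df['score'], bins=np.arange(3.0, 4.01, 0.01))
plt.xlabel('Tabelog score')
plt.ylabel('number of restaurants')
plt.show()
```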
I investigated the relationship with the number of reviews, but could not find a result that explains this... Perhaps the key lies in data not collected this time, such as the number of days since opening. On a rumor basis, there is also a theory that the annual membership fee a restaurant pays to Tabelog caps its Tabelog score, which may have an effect. Viewed differently, though, a restaurant whose Tabelog score is held down by such a cap while its individual review scores are high is exactly the kind of good restaurant we want, and with the approach introduced this time **you can find stores that have not become more popular than they deserve**.
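To make that idea concrete, here is a hedged sketch that reuses the hypothetical `df` from above: compute the mean of each restaurant's individual review scores and surface stores whose reviews are high while the official score stays modest. The helper `mean_review` and both thresholds are my own assumptions, not values from the article.

```python
# Sketch: flag restaurants whose individual dinner review scores are high
# even though the official Tabelog score is modest. Thresholds are arbitrary.
def mean_review(review_scores):
    # The scraped review scores are strings such as '3.5' or '-'.
    vals = [float(s) for s in review_scores if s not in ('-', '')]
    return sum(vals) / len(vals) if vals else float('nan')

df['mean_review'] = df['review_scores'].apply(mean_review)
hidden_gems = df[(df['mean_review'] >= 3.5) & (df['score'] <= 3.2)]
print(hidden_gems[['name', 'score', 'mean_review', 'review_num']]
      .sort_values('mean_review', ascending=False))
```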