[First data science ⑥] I tried to visualize the market price of restaurants in Tokyo

Nice to meet you. I'm N.D., a fourth-year university student in the Department of Physics. My Python experience is self-taught and fairly limited, and this was my first time scraping and crawling.

I am currently an intern in the Data Science Division of Cacco Inc. During the trial period, the task is to build a crawler to collect, process, and visualize data, and to briefly discuss what was learned.

Task

Theme

Visualize and examine the market prices of restaurants throughout Tokyo. In addition, acquire other variables that seem obtainable and analyze them in comparison with the budget.

_Sub-theme_

Since the theme is abstract, I set up the following concrete situation. Situation: use data to objectively show a friend coming to Tokyo "what the market price of restaurants in Tokyo is, and which genre is the most popular in that price range."

Other findings

Alongside the sub-theme, visualize and present what was learned by comparing the budget with the other variables.

Policy

  1. Crawl the gourmet site "Hot Pepper Gourmet" and get the URL of each shop's detail page.
  2. Save an HTML file from each shop's detail-page URL (number of shops = number of saved HTML files)
  3. Scrape each variable from the saved HTML files
  4. Visualization / data analysis
  5. Present the answer to the theme

_Crawling_

This time, we crawl the search results for "shops throughout Tokyo that accept online reservations" on the Hot Pepper Gourmet site. 16,475 shops were acquired on Wednesday, October 16, 2019.

**Crawling procedure**

  1. Before crawling, get the number of shops from the first page so the acceptance result can be checked later
  2. [1st page] Read the URL of each shop's detail page (hereinafter, shop URL) and store it in a Python list
  3. [Move to next page] Get the URL of the next page from the pagination and move to it
  4. Read the shop URLs from that page and add them to the Python list
  5. Repeat steps 3-4 until there are no more pages
  6. Visit each shop URL saved in the list and save its HTML file one by one
  7. Finally, get the number of shops by scraping the last page in the same way as step 1

The crawling code looks like this:

crawling.py


from bs4 import BeautifulSoup
import requests
import time
import os
# timer
t1 = time.time()

# function
# get number of shop
def get_num(soup):
    num = soup.find('p', {'class':'sercheResult fl'}).find('span', {'class':'fcLRed bold fs18 padLR3'}).text
    print('num:{}'.format(num))

# get url of shop
def get_shop_urls(tags):
    shop_urls = []
    # ignore the first shop because it is PR
    tags = tags[1:]
    for tag in tags:
        shop_url = tag.a.get('href')
        shop_urls.append(shop_url)
    return shop_urls

def save_shop_urls(shop_urls, dir_path=None, test=False):
    # make the directory if it does not exist
    if test:
        if dir_path is None:
            dir_path = './html_dir_test'
    elif dir_path is None:
        dir_path = './html_dir'

    if not os.path.isdir(dir_path):
        os.mkdir(dir_path)

    for i, shop_url in enumerate(shop_urls):
        time.sleep(1)
        shop_url = 'https://www.hotpepper.jp' + shop_url
        r = requests.get(shop_url).text
        file_path = 'shop{:0>5}_url.html'.format(i)
        with open(dir_path + '/' + file_path, 'w') as f:
            f.write(r)
    # return last shop number
    return len(shop_urls)


start_url = 'https://www.hotpepper.jp/yoyaku/SA11/'
response = requests.get(start_url).text
soup = BeautifulSoup(response, 'html.parser')
tags = soup.find_all('h3', {'class':'detailShopNameTitle'})

# get last page number
last_page = soup.find('li', {'class':'lh27'}).text.replace('1/', '').replace('page', '')
last_page = int(last_page)
print('last page num:{}'.format(last_page))

# get the number of shops before crawling
get_num(soup)

# first page crawling
start_shop_urls = get_shop_urls(tags)

# from 2nd page
shop_urls = []
# last_page = 10  # uncomment to limit the crawl to 10 pages for testing
for p in range(last_page-1):
    time.sleep(1)
    url = start_url + 'bgn' + str(p+2) + '/'
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    tags = soup.find_all('h3', {'class':'detailShopNameTitle'})
    shop_urls.extend(get_shop_urls(tags))
    # progress indicator
    if p % 100 == 0:
        percent = p/last_page*100
        print('{:.2f}% Done'.format(percent))

start_shop_urls.extend(shop_urls)
shop_urls = start_shop_urls

t2 = time.time()
elapsed_time = t2 - t1
print('time(get_page):{:.2f}s'.format(elapsed_time))
print('num(shop_num):{}'.format(len(shop_urls)))

# download and save the html of each shop
last_num = save_shop_urls(shop_urls) # html_dir

# get the number of shops after crawling
get_num(soup)

t3 = time.time()
elapsed_time = t3 - t1
print('time(get_html):{:.2f}s'.format(elapsed_time))
print('num(shop_num):{}'.format(last_num))

Scraping

The variables scraped this time are as follows. (Figure: scraping_var.png)

Procedure

  1. Scrape the above nine variables for each shop
  2. Once all the variables are obtained, add them as a record to a pandas DataFrame
  3. Check that the number of records is consistent with the number of shops acquired by crawling (a sketch of this check appears after the script below)

The scraping code looks like this:

scraping.py


from bs4 import BeautifulSoup
import glob
import requests
import time
import os
import pandas as pd
from tqdm import tqdm
import numpy as np


def get_shopinfo(category, soup):
    shopinfo_th = soup.find('div', {'class':'shopInfoDetail'}).find_all('th')
    # get 'category' from 'shopinfo_th'
    category_value = list(filter(lambda x: category in x , shopinfo_th))
    if not category_value:
        category_value = None
    else:
        category_value = category_value[0]
        category_index = shopinfo_th.index(category_value)
        shopinfo_td = soup.find('div', {'class':'shopInfoDetail'}).find_all('td')
        category_value = shopinfo_td[category_index].text.replace('\n', '').replace('\t', '')
    return category_value

# return the tag text (whitespace stripped) if the tag exists, otherwise NaN
def judge(category):
    if category is not None:
        category = category.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# return the text of the inner <a> tag if the tag exists, otherwise NaN
def judge_atag(category):
    if category is not None:
        category = category.a.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# return the text of the inner <p> tag if the tag exists, otherwise NaN
def judge_ptag(category):
    if category is not None:
        category = category.p.text.replace('\n', '').replace('\t', '')
    else:
        category = np.nan
    return category

# return the text of the inner <span> tag if the tag exists, otherwise 0
def judge_spantag(category):
    if category is not None:
        category = category.span.text.replace('\n', '').replace('\t', '')
    else:
        category = 0
    return category

# available=1, not=0
def available(strlist):
    available_flg = 0
    # guard against None (category not found on the page)
    if strlist is not None and 'available' in strlist:
        available_flg = 1
    return available_flg

# convert a price label to its index in the price_range list
def category2index(category, price_range):
    if category in price_range:
        category = price_range.index(category)
    return category

def scraping(html, df, price_range):
    soup = BeautifulSoup(html, 'html.parser')
    dinner = soup.find('span', {'class':'shopInfoBudgetDinner'})
    dinner = judge(dinner)
    dinner = category2index(dinner, price_range)
    lunch = soup.find('span', {'class':'shopInfoBudgetLunch'})
    lunch = judge(lunch)
    lunch = category2index(lunch, price_range)
    genre_tag = soup.find_all('dl', {'class':'shopInfoInnerSectionBlock cf'})[1]
    genre = genre_tag.find('p', {'class':'shopInfoInnerItemTitle'})
    genre = judge_atag(genre)
    area_tag = soup.find_all('dl', {'class':'shopInfoInnerSectionBlock cf'})[2]
    area = area_tag.find('p', {'class':'shopInfoInnerItemTitle'})
    area = judge_atag(area)
    rating = soup.find('div', {'class':'ratingInfo'})
    rating = judge_ptag(rating)
    review = soup.find('p', {'class':'review'})
    review = judge_spantag(review)
    f_meter = soup.find_all('dl', {'class':'featureMeter cf'})
    # if 'f_meter' is nan, 'size'='customer'='people'='peek'=nan
    if f_meter == []:
        size = np.nan
        customer = np.nan
        people = np.nan
        peek = np.nan
    else:
        meterActive = f_meter[0].find('span', {'class':'meterActive'})
        size = f_meter[0].find_all('span').index(meterActive)
        meterActive = f_meter[1].find('span', {'class':'meterActive'})
        customer = f_meter[1].find_all('span').index(meterActive)
        meterActive = f_meter[2].find('span', {'class':'meterActive'})
        people = f_meter[2].find_all('span').index(meterActive)
        meterActive = f_meter[3].find('span', {'class':'meterActive'})
        peek = f_meter[3].find_all('span').index(meterActive)
    credits = get_shopinfo('credit card', soup)
    credits = available(credits)
    emoney = get_shopinfo('Electronic money', soup)
    emoney = available(emoney)
    data = [lunch, dinner, genre, area, float(rating), review, size, customer, people, peek, credits, emoney]
    s = pd.Series(data=data, index=df.columns, name=str(i))
    df = df.append(s)
    return df

columns = ['budget(Noon)', 'budget(Night)', "Genre", "area", 'Evaluation', 'Number of reviews', 'Shop size'
           , 'Customer base', 'Number of people/set', 'Peak hours', 'credit card', 'Electronic money']
base_url = 'https://www.hotpepper.jp/SA11/'
response = requests.get(base_url).text
soup = BeautifulSoup(response, 'html.parser')
# GET range of price
price_range = soup.find('ul', {'class':'samaColumnList'}).find_all('a')
price_range = [p.text for p in price_range]
# price_range = ['~500 yen', '501-1000 yen', '1001-1500 yen', '1501-2000 yen', '2001-3000 yen', '3001-4000 yen', '4001-5000 yen'
#             , '5001 to 7000 yen', '7001-10000 yen', '10001-15000 yen', '15001 ~ 20000 yen', '20001-30000 yen', '30001 yen ~']

num = 16475  # number of data
# num = 1000 # test
df = pd.DataFrame(data=None, columns=columns)

for i in range(num):
# for i in tqdm(range(num)):  # alternative with a progress bar
    html = './html_dir/shop{:0>5}_url.html'.format(i)
    with open(html,"r", encoding='utf-8') as f:
        shop_html = f.read()

    df = scraping(shop_html, df, price_range)
    if i % 1600 == 0:
        percent = i/num*100
        print('{:.3f}% Done'.format(percent))

df.to_csv('shop_info.csv', encoding='shift_jis')
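Step 3 of the procedure above (checking that the number of scraped records matches the number of crawled shops) is not part of the script itself; a minimal sketch of such a check, assuming the HTML files saved by crawling.py and the CSV written above, could look like this:

```python
import glob
import pandas as pd

# number of HTML files saved by the crawler
n_html = len(glob.glob('./html_dir/shop*_url.html'))

# number of records actually written by the scraper
df_check = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
n_records = len(df_check)

print('saved html files : {}'.format(n_html))
print('scraped records  : {}'.format(n_records))
print('OK' if n_html == n_records else 'mismatch: some pages may have failed')
```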

_Acceptance result_

The acceptance results are as follows. (Figure: スクリーンショット 2019-11-26 11.41.30.png)

Crawling took a little under an hour, and the site was updated during that time, so you can see a difference between the number of shops at the start and the number of shops after crawling.

Results for the sub-theme

_Confirming the sub-theme_

"Visualize the market prices of restaurants in Tokyo, Clarify which genre of shops are the most popular in that price range. "

Conclusions for the sub-theme

- The market price for dinner is "**2,000-4,000 yen**".
- The market price for lunch is "**500-1,000 yen**".
- The genre with the highest share in both the dinner and lunch price ranges is "**Izakaya**".
- Also, at lunch, an izakaya at "500-1,000 yen" is likely a **"double-cropping" shop**, i.e. one that runs a lunch business by day and an izakaya by night.

Here, the market price of the budget is defined as the mode, not the mean.
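Because the market price is defined as the mode rather than the mean, it can be read directly from the scraped data. Below is a minimal sketch assuming the shop_info.csv produced above; note that the budget columns hold indices into the price_range list from scraping.py, so the printed values are indices rather than yen labels:

```python
import pandas as pd

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

# the "market price" is the mode (most frequent value), not the mean;
# the budget columns store indices into price_range from scraping.py
print('dinner market price (price-range index):', df['budget(Night)'].mode().iloc[0])
print('lunch market price (price-range index):', df['budget(Noon)'].mode().iloc[0])
```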

The underlying data are shown below in order.

_Budget market price_

We visualized the market price of the budget separately for dinner and lunch. (Figure: スクリーンショット 2019-11-26 11.56.37.png)
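The chart itself is a screenshot, but a rough version of the budget distribution can be reproduced along the following lines; this is a sketch assuming the shop_info.csv produced above and the numeric price-range indices generated by category2index:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, col, title in zip(axes, ['budget(Night)', 'budget(Noon)'], ['Dinner', 'Lunch']):
    # coerce to the numeric price-range index produced by category2index
    idx = pd.to_numeric(df[col], errors='coerce').dropna().astype(int)
    # count shops per price range, keeping the ranges in ascending order
    idx.value_counts().sort_index().plot.bar(ax=ax)
    ax.set_title('Budget distribution ({})'.format(title))
    ax.set_xlabel('price-range index')
    ax.set_ylabel('number of shops')
plt.tight_layout()
plt.show()
```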

Genre by price range

From the above results we now know the rough market price of restaurants in Tokyo, so let's visualize the genres by price range. (Figures: スクリーンショット 2019-11-27 15.07.47.png, スクリーンショット 2019-11-27 15.09.30.png)

**Genres included in "Other"**: for both dinner and lunch, the following genres with small totals are grouped into "Other": [Okonomiyaki / Monja / Cafe / Sweets / Ramen / Korean / International / Western / Creative / Other Gourmet]
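A normalized stacked bar chart is one way to reproduce this kind of breakdown. The sketch below assumes the same CSV; the cutoff used to lump minor genres into "Other" (nlargest(8)) is an illustrative choice, not the article's exact grouping:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
df['dinner_idx'] = pd.to_numeric(df['budget(Night)'], errors='coerce')
df = df.dropna(subset=['dinner_idx', 'Genre'])

# keep the most common genres and lump the rest together as 'Other'
top_genres = df['Genre'].value_counts().nlargest(8).index
df['genre_grouped'] = df['Genre'].where(df['Genre'].isin(top_genres), 'Other')

# share of each genre within each dinner price range
share = pd.crosstab(df['dinner_idx'], df['genre_grouped'], normalize='index')
share.plot.bar(stacked=True, figsize=(10, 5))
plt.xlabel('dinner price-range index')
plt.ylabel('genre share')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
```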

I thought that an "izakaya" in the "500-1,000 yen" price range was too cheap for lunch, so I dig deeper into this here.

_What is an izakaya at "500-1,000 yen"?_

As shown below, these shops call themselves "izakaya" but offer a lunch menu during the day. (Figure: スクリーンショット 2019-11-26 12.10.38.png)

Other findings

Conclusion

- The customer base of shops in the "**7,000 yen and up**" dinner price range tends to include more male than female customers, while for both dinner and lunch, shops in the "**1,000-3,000 yen**" price range tend to have more female than male customers.
- For both dinner and lunch, shops tend to be **rated more highly** as the **price range rises**.
- In the **higher price ranges**, more shops accept **credit cards**.
- Shops in the "**2,000-4,000 yen**" dinner price range tend to have **large capacity**.

The supporting data are shown below.

Customer base by price range

We compared the customer base across price ranges. (Figure: スクリーンショット 2019-11-26 12.17.30.png)

From this, it can be said that at dinner, the customer base of shops in the "7,000 yen and up" price range tends to include more male than female customers, while for both dinner and lunch, shops in the "1,000-3,000 yen" price range tend to have more female than male customers.
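A rough sketch of this comparison, assuming the 'Customer base' column holds the 5-level meter value scraped above (the code does not tell us which end of the meter corresponds to male or female customers, so interpreting the levels requires checking the site):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
df['dinner_idx'] = pd.to_numeric(df['budget(Night)'], errors='coerce')
df = df.dropna(subset=['dinner_idx', 'Customer base'])

# distribution of the 5-level customer-base meter within each dinner price range
ct = pd.crosstab(df['dinner_idx'], df['Customer base'], normalize='index')
ct.plot.bar(stacked=True, colormap='coolwarm', figsize=(10, 5))
plt.xlabel('dinner price-range index')
plt.ylabel('share of shops')
plt.legend(title='customer-base meter level', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
```

The same normalized-crosstab approach also applies to the shop-size and peak-hour meters discussed further below.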

Evaluation by price range

We plot the ratings for each dinner and lunch price range. Because many shops in the same price range share the same rating, we applied jittering and intentionally offset the plotted points. The results of the t-test are shown below the graphs.

**Definition of the t-test groups**
Dinner: shops under 4,000 yen and shops over 4,000 yen
Lunch: shops under 2,000 yen and shops over 2,000 yen

(Figures: スクリーンショット 2019-11-26 12.31.33.png, スクリーンショット 2019-11-26 12.32.18.png)

You can see that for both dinner and lunch, the higher the price range, the higher the rating tends to be. From the t-test results, it can be said that there is a difference in **rating** between the high and low price ranges.
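The jittered plot and the t-test can be sketched as follows with matplotlib and scipy. The THRESHOLD mapping the price-range index back to the 4,000 yen boundary is an assumption that depends on the price_range list obtained in scraping.py, and Welch's variant of the t-test is used here since the article does not specify which form was applied:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
df['dinner_idx'] = pd.to_numeric(df['budget(Night)'], errors='coerce')
df = df.dropna(subset=['dinner_idx', 'Evaluation'])

# jittering: add small horizontal noise so overlapping points stay visible
x = df['dinner_idx'].values
jitter = np.random.uniform(-0.2, 0.2, size=len(x))
plt.figure(figsize=(10, 5))
plt.scatter(x + jitter, df['Evaluation'], s=5, alpha=0.4)
plt.xlabel('dinner price-range index')
plt.ylabel('rating')
plt.show()

# Welch's t-test between the low and high price groups.
# THRESHOLD is the index assumed to correspond to the 4,000 yen boundary.
THRESHOLD = 5
low = df.loc[df['dinner_idx'] <= THRESHOLD, 'Evaluation']
high = df.loc[df['dinner_idx'] > THRESHOLD, 'Evaluation']
t, p = stats.ttest_ind(low, high, equal_var=False)
print('t = {:.3f}, p = {:.3g}'.format(t, p))
```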

_Credit card usage by price range_

We compared credit card acceptance by price range. (Figure: スクリーンショット 2019-11-27 15.22.17.png)

As intuition suggests, a large percentage of shops in the **higher price ranges** accept **credit cards**. The "10,000 yen and up" lunch price range is not shown because only 4 cases were obtained, which is not enough for evaluation.
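Because the 'credit card' column is already a 0/1 flag, the acceptance rate per price range is a simple group-by mean; a minimal sketch assuming the same CSV:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('shop_info.csv', encoding='shift_jis', index_col=0)
df['dinner_idx'] = pd.to_numeric(df['budget(Night)'], errors='coerce')

# proportion of shops accepting credit cards within each dinner price range
rate = df.dropna(subset=['dinner_idx']).groupby('dinner_idx')['credit card'].mean()
rate.plot.bar(figsize=(8, 4))
plt.xlabel('dinner price-range index')
plt.ylabel('share of shops accepting credit cards')
plt.ylim(0, 1)
plt.tight_layout()
plt.show()
```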

Store size by price range

We compared shop sizes, rated on a 5-point scale, by price range. A conclusion could be drawn only for dinner, so only that chart is shown. The darker the blue, the larger the shop. (Figure: スクリーンショット 2019-11-26 13.04.13.png) You can see that shops in the "2,000-4,000 yen" dinner price range tend to have large capacity. Since izakaya account for a large share of this price range, this is probably because izakaya tend to have large capacity.

In conclusion

_Reflections_

I realized firsthand how difficult it is to "collect the information obtained by scraping and visualize it so that the conclusion gets across to the other person."

**If I did it again**: set a clear purpose for the analysis before writing any code, and plan the process by working backwards from that purpose.

_What I learned from feedback_

**Code review received**: I received the following points, and I intend to improve on them going forward.

- Write code with the Python style convention PEP 8 in mind.
- Tidy up unnecessary line breaks and commented-out code before submitting.

**After the presentation review**: the issue was how to present graphs so that the message comes across more easily. We received feedback that creating "intuitive graphs" is important, for example in how the price ranges are ordered and in expressing density with jittering. I also learned that presenting the conclusions as a story helps the other person understand. In my analytical work going forward, I will stay aware of how to connect the results obtained to real problems.
