Around 2020 the specifications of the i-Townpage site changed, so I wrote a script that works with the new layout. The goal is a script that does the following:
Enter a keyword and an area, search on i-Townpage → extract the store names and addresses from the search results and output them in CSV format.
The i-Townpage terms prohibit, among other things:

- Acts that seriously impact the i-Townpage service
- Repeatedly accessing i-Townpage using a program that performs automated access
- Placing load on the server using malicious programs or scripts
The program introduced in this article does not access the site continuously at a rate far beyond what a normal user would produce, so it should not fall under these prohibitions (I believe).
Also, since reproducing the site for viewing in an environment accessible to third parties is prohibited, no screenshots of the site are included in this explanation.
The site was updated around 2020: as you scroll down the search results, a ** Show more ** button appears, and you can no longer get all the search results (up to 1000 are displayed) without pressing it many times.
For now, here is a brief overview and the whole program. (A detailed explanation follows later.)
- Create an input interface with PySimpleGUI (optional; skip it if you prefer)
- Launch Chrome (or Firefox) with the Selenium webdriver, open the relevant page, and press all the Show more buttons
- Use BeautifulSoup to extract the required elements (two kinds this time: store name and address)
- Shape the data with pandas
main.py
#This is a Python 3 app
#install selenium, beautifulsoup4, pandas, PySimpleGUI with pip3
#download a browser driver (geckodriver for firefox, chromedriver for chrome)
from selenium import webdriver
#from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import csv
import PySimpleGUI as sg
#please download a browser driver and write its path here
#bdriverpath = './chromedriver'
bdriverpath = r"C:\chromedriver.exe"
#make popup window
layout = [
    [sg.Text('Area >> ', size=(15,1)), sg.InputText('Machida')],
    [sg.Text('Keyword >> ', size=(15,1)), sg.InputText('convenience store')],
    [sg.Submit(button_text='OK')]
]
window = sg.Window('Area and Keyword', layout)
#popup
while True:
    event, values = window.read()
    if event is None:
        print('exit')
        break
    if event == 'OK':
        show_message = "Area is " + values[0] + "\n"
        show_message += "Keyword is " + values[1] + "\n"
        print(show_message)
        sg.popup(show_message)
        break
window.close()
area = values[0]
keyword = values[1]
#initialize webdriver
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, executable_path=bdriverpath)
#search page with keyword and area
driver.get('https://itp.ne.jp')
driver.find_element_by_id('keyword-suggest').find_element_by_class_name('a-text-input').send_keys(keyword)
driver.find_element_by_id('area-suggest').find_element_by_class_name('a-text-input').send_keys(area)
driver.find_element_by_class_name('m-keyword-form__button').click()
time.sleep(5)
#find & click readmore button
try:
    while driver.find_element_by_class_name('m-read-more'):
        button = driver.find_element_by_class_name('m-read-more')
        button.click()
        time.sleep(1)
except NoSuchElementException:
    pass
res = driver.page_source
driver.quit()
#output with html
with open(area + '_' + keyword + '.html', 'w', encoding='utf-8') as f:
    f.write(res)
#parse with beautifulsoup
soup = BeautifulSoup(res, "html.parser")
shop_names = [n.get_text(strip=True) for n in soup.select('.m-article-card__header__title')]
shop_locates = [n.get_text(strip=True) for n in soup.find_all(class_='m-article-card__lead__caption', text=re.compile("Street address"))]
#incorporation lists with pandas
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()
#output with csv
df.to_csv(area + '_' + keyword + '.csv', quoting=csv.QUOTE_NONE, index=False, encoding='utf_8_sig')
sg.popup("finished")
Below are the libraries imported this time. All of them can be installed with pip3. The commented-out line selects between Chrome and Firefox, so rewrite it to suit your preference and environment.
import.py
from selenium import webdriver
#from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import csv
import PySimpleGUI as sg
driver
To use the webdriver described below, you need chromedriver for Chrome or geckodriver for Firefox. Download the appropriate one from the following sites:
https://github.com/mozilla/geckodriver/releases
https://chromedriver.chromium.org/downloads
Also, at the time of writing, it will not work unless the versions of all three of the ** browser, Python, and driver ** you are using match up.
- First of all, use the latest browser.
- Then download a driver to match.
- With geckodriver, the latest release should work (probably).
- With Chrome, the driver version numbers are tied to Chrome's version numbers, so pick the matching one.
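If you would rather not keep the versions in sync by hand, the third-party webdriver-manager package (not used in this article; installed separately with pip3 install webdriver-manager) can download a chromedriver that matches your installed Chrome automatically. A minimal sketch:

webdriver_manager_example.py
#sketch: let webdriver-manager fetch a chromedriver matching the
#installed Chrome, instead of downloading one manually
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver.get('https://itp.ne.jp')
print(driver.title)
driver.quit()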
driver.py
#please download a browser driver and write its path here
#bdriverpath = './chromedriver'
bdriverpath = r"C:\chromedriver.exe"
PySimpleGUI
Reference: "If you use Tkinter, try using PySimpleGUI"
Decide on the layout and write in the default inputs (Machida, convenience store).
layout.py
#make popup window
layout = [
    [sg.Text('Area >> ', size=(15,1)), sg.InputText('Machida')],
    [sg.Text('Keyword >> ', size=(15,1)), sg.InputText('convenience store')],
    [sg.Submit(button_text='OK')]
]
window.py
window = sg.Window('Area and Keyword', layout)
#popup
while True:
    event, values = window.read()
    if event is None:
        print('exit')
        break
    if event == 'OK':
        show_message = "Area is " + values[0] + "\n"
        show_message += "Keyword is " + values[1] + "\n"
        print(show_message)
        sg.popup(show_message)
        break
window.close()
area = values[0]
keyword = values[1]
webdriver (selenium) is a library for operating a normal browser (firefox, chrome, etc.) programmatically.
First, add `--headless` to the startup options. This option runs the browser in the background. If you want to watch the browser operate, comment out `options.add_argument('--headless')`. Then launch Chrome with `driver = webdriver.Chrome()`, passing in the options and the driver path at the same time: `options=options, executable_path=bdriverpath`.
init.py
#initialize webdriver
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, executable_path=bdriverpath)
Go to the top page of i-Townpage with `driver.get()`. Find the input boxes for the keyword and the area with `driver.find_element_by_...`, and type into them with `.send_keys()`. Find the search button in the same way and press it with `.click()`.
search.py
#search page with keyword and area
driver.get('https://itp.ne.jp')
driver.find_element_by_id('keyword-suggest').find_element_by_class_name('a-text-input').send_keys(keyword)
driver.find_element_by_id('area-suggest').find_element_by_class_name('a-text-input').send_keys(area)
driver.find_element_by_class_name('m-keyword-form__button').click()
time.sleep(5)
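As an aside, this article uses the Selenium 3 API. Newer Selenium releases (4.x) removed the `find_element_by_*` helpers (and the `executable_path` argument), so on Selenium 4 the same lookups would look roughly like this sketch:

search_selenium4.py
#Selenium 4 style: the same lookups written with By
#(executable_path is likewise replaced by a Service object in 4.x)
import time
from selenium.webdriver.common.by import By

driver.get('https://itp.ne.jp')
driver.find_element(By.ID, 'keyword-suggest').find_element(By.CLASS_NAME, 'a-text-input').send_keys(keyword)
driver.find_element(By.ID, 'area-suggest').find_element(By.CLASS_NAME, 'a-text-input').send_keys(area)
driver.find_element(By.CLASS_NAME, 'm-keyword-form__button').click()
time.sleep(5)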
For example, in the markup below, the keyword input box has an id of `keyword-suggest`, and the input field inside it has a class of `a-text-input`.
keyword.html
<div data-v-1wadada="" id="keyword-suggest" class="m-suggest" data-v-1dadas14="">
<input data-v-dsadwa3="" type="text" autocomplete="off" class="a-text-input" placeholder="Enter a keyword" data-v-1bbdb50e="">
<!---->
</div>
Use a loop to keep pressing the Show more button (`class_name='m-read-more'`) for as long as one can be found. If you look for the next button immediately after clicking, it will not have loaded yet and the loop would stop partway through, so insert a wait with `time.sleep(1)`. Once no button is left, the webdriver raises an error that would end the program, so anticipate it with an `except` clause. After the except, execution simply continues: store the fetched HTML (with everything expanded) in `res`, then shut the webdriver down with `driver.quit()`.
button.py
from selenium.common.exceptions import NoSuchElementException
#find & click readmore button
try:
    while driver.find_element_by_class_name('m-read-more'):
        button = driver.find_element_by_class_name('m-read-more')
        button.click()
        time.sleep(1)
except NoSuchElementException:
    pass
res = driver.page_source
driver.quit()
Just in case, the fetched HTML is also written out to a file. This step is optional.
html.py
#output with html
with open(area + '_' + keyword + '.html', 'w', encoding='utf-8') as f:
    f.write(res)
Pass the HTML obtained earlier to BeautifulSoup. Find the elements with `soup.select` and extract just the store name text with `.get_text()`. A bare `get_text()` would include line breaks and spaces, but adding the `strip=True` option gives you only the characters you want. As for the address, the i-Townpage site uses the class `m-article-card__lead__caption` not only for the address but also for the phone number and the nearest station, so the addresses are singled out by matching their text with `text=re.compile("Street address")`.
parse.py
#parse with beautifulsoup
soup = BeautifulSoup(res, "html.parser")
shop_names = [n.get_text(strip=True) for n in soup.select('.m-article-card__header__title')]
shop_locates = [n.get_text(strip=True) for n in soup.find_all(class_='m-article-card__lead__caption', text=re.compile("Street address"))]
pandas is used to organize the data. BeautifulSoup returned the results as two lists, so combine them into a single DataFrame. That alone leaves the data in landscape orientation (one row per list), so use `transpose()` to make it portrait (one column per list).
pandas.py
#incorporation lists with pandas
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()
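As written, the two columns have no names, so the first row of the CSV will be `0,1`. If you want readable headers instead, you could name the columns yourself; a hypothetical tweak (the column names are my own, not from the article):

pandas_columns.py
#optional: give the columns readable names so the CSV header
#is not just "0,1" (the names here are illustrative)
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()
df.columns = ['shop_name', 'address']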
This time the result is output in CSV format. The user-entered `area` and `keyword` are used for the file name. When pandas writes the data it numbers the rows down the left side; that gets in the way, so it is suppressed with `index=False`. There is also a problem where the output is garbled when opened in Excel, which is avoided with `encoding='utf_8_sig'`.
csv.py
#output with csv
df.to_csv(area + '_' + keyword + '.csv', quoting=csv.QUOTE_NONE, index=False, encoding='utf_8_sig')
I tried web scraping with selenium, but my impression is that it does not run very stably. Because an actual browser is running, there is no guarantee of the page's state right after loading or after a button press. This time I worked around it with `time.sleep`. (I originally tried selenium's implicit/explicit waits, but they did not work for me.)
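For reference, an explicit wait for this page might look like the sketch below: it polls for up to 10 seconds until a Show more button becomes clickable instead of sleeping for a fixed time. This is only a sketch of the general technique, not the code the author ran:

wait_example.py
#sketch of an explicit wait: poll up to 10 seconds for the
#"Show more" button to become clickable, instead of time.sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    while True:
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'm-read-more'))
        )
        button.click()
except TimeoutException:
    pass  #no more buttons appeared within the timeout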
Also, the webdriver I had downloaded was, for some reason, an old version, and I wrestled with the resulting error for about two days without noticing, so I was furious (at myself).