Around 2020 the specifications of the i-Townpage site changed, so I wrote a script that works with the new layout. The goal is a script that does the following:
Enter a keyword and an area, search on i-Townpage → extract the store names and addresses from the search results and output them in CSV format.
The i-Townpage terms prohibit, among other things:

- Acts that seriously impact the i-Townpage service
- Repeatedly accessing i-Townpage using a program that performs automated access
- Placing load on the server using malicious programs or scripts
The program introduced in this article does not access the site continuously at a rate far beyond what a normal user would produce, so it should not fall under these prohibitions (I believe).
Also, since reproducing the site for viewing in an environment accessible to third parties is prohibited, no screenshots of the site are included in this explanation.
The site was updated around 2020: as you scroll down the search results, a ** Show more ** button appears, and you can no longer get all the search results (up to 1000 are displayed) without pressing it many times.
For now, here is a brief overview and the whole program. (A detailed explanation follows later.)
- Create an input interface with PySimpleGUI (optional; skip it if you prefer)
- Launch Chrome (or Firefox) with the Selenium webdriver, open the relevant page, and press all the Show more buttons
- Use BeautifulSoup to extract the required elements (two kinds this time: store name and address)
- Shape the data with pandas
main.py
#This is a Python 3 app
#install selenium, beautifulsoup4, pandas, PySimpleGUI with pip3
#download a browser driver (geckodriver for firefox, chromedriver for chrome)
from selenium import webdriver
#from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import csv
import PySimpleGUI as sg
#please download a browser driver and write its path here
#bdriverpath = './chromedriver'
bdriverpath = r"C:\chromedriver.exe"
#make popup window
layout = [
    [sg.Text('Area >> ', size=(15,1)), sg.InputText('Machida')],
    [sg.Text('Keyword >> ', size=(15,1)), sg.InputText('convenience store')],
    [sg.Submit(button_text='OK')]
]
window = sg.Window('Area and Keyword', layout)
#popup
while True:
    event, values = window.read()
    if event is None:
        print('exit')
        break
    if event == 'OK':
        show_message = "Area is " + values[0] + "\n"
        show_message += "Keyword is " + values[1] + "\n"
        print(show_message)
        sg.popup(show_message)
        break
window.close()
area = values[0]
keyword = values[1]
#initialize webdriver
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, executable_path=bdriverpath)
#search page with keyword and area
driver.get('https://itp.ne.jp')
driver.find_element_by_id('keyword-suggest').find_element_by_class_name('a-text-input').send_keys(keyword)
driver.find_element_by_id('area-suggest').find_element_by_class_name('a-text-input').send_keys(area)
driver.find_element_by_class_name('m-keyword-form__button').click()
time.sleep(5)
#find & click readmore button
try:
    while driver.find_element_by_class_name('m-read-more'):
        button = driver.find_element_by_class_name('m-read-more')
        button.click()
        time.sleep(1)
except NoSuchElementException:
    pass
res = driver.page_source
driver.quit()
#output with html
with open(area + '_' + keyword + '.html', 'w', encoding='utf-8') as f:
    f.write(res)
#parse with beautifulsoup
soup = BeautifulSoup(res, "html.parser")
shop_names = [n.get_text(strip=True) for n in soup.select('.m-article-card__header__title')]
shop_locates = [n.get_text(strip=True) for n in soup.find_all(class_='m-article-card__lead__caption', text=re.compile("Street address"))]
#incorporation lists with pandas
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()
#output with csv
df.to_csv(area + '_' + keyword + '.csv', quoting=csv.QUOTE_NONE, index=False, encoding='utf_8_sig')
sg.popup("finished")
Below are the libraries imported this time. All of them can be installed with pip3. The commented-out line selects between Chrome and Firefox, so rewrite it to suit your preference and environment.
import.py
from selenium import webdriver
#from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import csv
import PySimpleGUI as sg
driver
To use the webdriver described below, you need chromedriver for Chrome or geckodriver for Firefox. Download the appropriate one from the following sites:
https://github.com/mozilla/geckodriver/releases
https://chromedriver.chromium.org/downloads
Also, at the time of writing, it will not work unless the versions of all three of the ** browser, Python, and driver ** you are using match up.
- First of all, use the latest browser.
- Then download a driver to match.
- With geckodriver, the latest release should work (probably).
- With Chrome, the driver version numbers are tied to Chrome's version numbers, so pick the matching one.
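If you would rather not keep the versions in sync by hand, the third-party webdriver-manager package (not used in this article; installed separately with pip3 install webdriver-manager) can download a chromedriver that matches your installed Chrome automatically. A minimal sketch:

webdriver_manager_example.py
#sketch: let webdriver-manager fetch a chromedriver matching the
#installed Chrome, instead of downloading one manually
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver.get('https://itp.ne.jp')
print(driver.title)
driver.quit()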
driver.py
#please download a browser driver and write its path here
#bdriverpath = './chromedriver'
bdriverpath = r"C:\chromedriver.exe"
PySimpleGUI
Reference: "If you use Tkinter, try using PySimpleGUI"
Decide on the layout and write in the default inputs (Machida, convenience store).
layout.py
#make popup window
layout = [
    [sg.Text('Area >> ', size=(15,1)), sg.InputText('Machida')],
    [sg.Text('Keyword >> ', size=(15,1)), sg.InputText('convenience store')],
    [sg.Submit(button_text='OK')]
]
window.py
window = sg.Window('Area and Keyword', layout)
#popup
while True:
    event, values = window.read()
    if event is None:
        print('exit')
        break
    if event == 'OK':
        show_message = "Area is " + values[0] + "\n"
        show_message += "Keyword is " + values[1] + "\n"
        print(show_message)
        sg.popup(show_message)
        break
window.close()
area = values[0]
keyword = values[1]
webdriver (selenium) is a library for operating a normal browser (firefox, chrome, etc.) programmatically.
First, add `--headless` to the startup options. This option runs the browser in the background. If you want to watch the browser operate, comment out `options.add_argument('--headless')`. Then launch Chrome with `driver = webdriver.Chrome()`, passing in the options and the driver path at the same time: `options=options, executable_path=bdriverpath`.
init.py
#initialize webdriver
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, executable_path=bdriverpath)
Go to the top page of i-Townpage with `driver.get()`. Find the input boxes for the keyword and the area with `driver.find_element_by_...`, and type into them with `.send_keys()`. Find the search button in the same way and press it with `.click()`.
search.py
#search page with keyword and area
driver.get('https://itp.ne.jp')
driver.find_element_by_id('keyword-suggest').find_element_by_class_name('a-text-input').send_keys(keyword)
driver.find_element_by_id('area-suggest').find_element_by_class_name('a-text-input').send_keys(area)
driver.find_element_by_class_name('m-keyword-form__button').click()
time.sleep(5)
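As an aside, this article uses the Selenium 3 API. Newer Selenium releases (4.x) removed the `find_element_by_*` helpers (and the `executable_path` argument), so on Selenium 4 the same lookups would look roughly like this sketch:

search_selenium4.py
#Selenium 4 style: the same lookups written with By
#(executable_path is likewise replaced by a Service object in 4.x)
import time
from selenium.webdriver.common.by import By

driver.get('https://itp.ne.jp')
driver.find_element(By.ID, 'keyword-suggest').find_element(By.CLASS_NAME, 'a-text-input').send_keys(keyword)
driver.find_element(By.ID, 'area-suggest').find_element(By.CLASS_NAME, 'a-text-input').send_keys(area)
driver.find_element(By.CLASS_NAME, 'm-keyword-form__button').click()
time.sleep(5)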
For example, in the markup below, the keyword input box has an id of `keyword-suggest`, and the input field inside it has a class of `a-text-input`.
keyword.html
<div data-v-1wadada="" id="keyword-suggest" class="m-suggest" data-v-1dadas14="">
<input data-v-dsadwa3="" type="text" autocomplete="off" class="a-text-input" placeholder="Enter a keyword" data-v-1bbdb50e="">
<!---->
</div>
Use a loop to keep pressing the Show more button (`class_name='m-read-more'`) for as long as one can be found. If you look for the next button immediately after clicking, it will not have loaded yet and the loop would stop partway through, so insert a wait with `time.sleep(1)`. Once no button is left, the webdriver raises an error that would end the program, so anticipate it with an `except` clause. After the except, execution simply continues: store the fetched HTML (with everything expanded) in `res`, then shut the webdriver down with `driver.quit()`.
button.py
from selenium.common.exceptions import NoSuchElementException
#find & click readmore button
try:
    while driver.find_element_by_class_name('m-read-more'):
        button = driver.find_element_by_class_name('m-read-more')
        button.click()
        time.sleep(1)
except NoSuchElementException:
    pass
res = driver.page_source
driver.quit()
Just in case, the fetched HTML is also written out to a file. This step is optional.
html.py
#output with html
with open(area + '_' + keyword + '.html', 'w', encoding='utf-8') as f:
    f.write(res)
Pass the HTML obtained earlier to BeautifulSoup. Find the elements with `soup.select` and extract just the store name text with `.get_text()`. A bare `get_text()` would include line breaks and spaces, but adding the `strip=True` option gives you only the characters you want. As for the address, the i-Townpage site uses the class `m-article-card__lead__caption` not only for the address but also for the phone number and the nearest station, so the addresses are singled out by matching their text with `text=re.compile("Street address")`.
parse.py
#parse with beautifulsoup
soup = BeautifulSoup(res, "html.parser")
shop_names = [n.get_text(strip=True) for n in soup.select('.m-article-card__header__title')]
shop_locates = [n.get_text(strip=True) for n in soup.find_all(class_='m-article-card__lead__caption', text=re.compile("Street address"))]
pandas is used to organize the data. BeautifulSoup returned the results as two lists, so combine them into a single DataFrame. That alone leaves the data in landscape orientation (one row per list), so use `transpose()` to make it portrait (one column per list).
pandas.py
#incorporation lists with pandas
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()
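As written, the two columns have no names, so the first row of the CSV will be `0,1`. If you want readable headers instead, you could name the columns yourself; a hypothetical tweak (the column names are my own, not from the article):

pandas_columns.py
#optional: give the columns readable names so the CSV header
#is not just "0,1" (the names here are illustrative)
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()
df.columns = ['shop_name', 'address']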
This time the result is output in CSV format. The user-entered `area` and `keyword` are used for the file name. When pandas writes the data it numbers the rows down the left side; that gets in the way, so it is suppressed with `index=False`. There is also a problem where the output is garbled when opened in Excel, which is avoided with `encoding='utf_8_sig'`.
csv.py
#output with csv
df.to_csv(area + '_' + keyword + '.csv', quoting=csv.QUOTE_NONE, index=False, encoding='utf_8_sig')
I tried web scraping with selenium, but my impression is that it does not run very stably. Because an actual browser is running, there is no guarantee of the page's state right after loading or after a button press. This time I worked around it with `time.sleep`. (I originally tried selenium's implicit/explicit waits, but they did not work for me.)
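For reference, an explicit wait for this page might look like the sketch below: it polls for up to 10 seconds until a Show more button becomes clickable instead of sleeping for a fixed time. This is only a sketch of the general technique, not the code the author ran:

wait_example.py
#sketch of an explicit wait: poll up to 10 seconds for the
#"Show more" button to become clickable, instead of time.sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    while True:
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'm-read-more'))
        )
        button.click()
except TimeoutException:
    pass  #no more buttons appeared within the timeout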
Also, the webdriver I had downloaded was, for some reason, an old version, and I wrestled with the resulting error for about two days without noticing, so I was furious (at myself).