An Excel macro called "Kenshakun," which comprehensively collected specific information from i-Townpage, used to be distributed, but it can no longer be used because of a specification change on the i-Townpage side in November 2019. So, as scraping practice, I, a Python beginner (I touched the language briefly in a university lecture), took on the challenge on momentum alone. The ultimate goal is to "get the store name, address, and category for a specific subcategory."
The i-Townpage terms of use (https://itp.ne.jp/guide/web/notice/) prohibit the following two acts:
・Acts that significantly impact the i-Townpage service
・Repeatedly accessing i-Townpage with a program that accesses it automatically
This time it is just a crappy script that keeps pressing the button that loads the next chunk of the page until it reaches the bottom, so I assume it falls under normal use rather than repeated access (if even this is out, the spec would mean normal use could never reach the bottom either……).
In the Anaconda Prompt:
pip install selenium
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/Users/*****/*****/Selenium/chromedriver/chromedriver.exe')
driver.get('https://itp.ne.jp/genre/?area=13&genre=13&subgenre=177&sort=01&sbmap=false')
After creating the driver, open the specified URL in the browser. Since I was searching for pachinko parlors this time, this is the URL for a search with the area "Tokyo" and the category "indoor amusement."
while True:
    try:
        driver.find_element_by_class_name('m-read-more__text').click()
    except:
        print('☆ Finished repeatedly clicking "Show more" ☆')
        break
Press the "Show more" button repeatedly until you get an error (= to the bottom line). This will display all the hit store information on the HTML.
elist = []
elems = driver.find_elements_by_class_name("m-article-card__header__category")
for e in elems:
    elist.append(e.text)
print(elist)

str_ = '\n'.join(elist)
print(str_)
with open("str_.txt", 'w') as f:
    f.write(str_)
Create an empty list and push into it the innerText of every element whose class name is m-article-card__header__category. Then join the list into a string with one element per line and write it out as a text file.
flist = []
elems2 = driver.find_elements_by_class_name("m-article-card__header__title__link")
for e in elems2:  # loop variable renamed so it does not shadow the file handle below
    flist.append(e.text)
print(flist)

str2_ = '\n'.join(flist)
print(str2_)
with open("str2_.txt", 'w') as f:
    f.write(str2_)
glist = []
elems3 = driver.find_elements_by_class_name("m-article-card__lead__caption")  # stray "elems2 =" double assignment removed
for g in elems3:
    glist.append(g.text)
print(glist)

str3_ = '\n'.join(glist)
print(str3_)
with open("str3_.txt", 'w') as f:
    f.write(str3_)
print('success')
driver.quit()
The titles and captions (address, phone number, nearest station) are written out in the same way.
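Since the three blocks above repeat the same pattern, they could also be folded into one helper. A minimal sketch (the function name dump_texts and its return value are my own additions):

def dump_texts(driver, class_name, filename):
    # Collect the innerText of every element with the given class name
    # and write it to a file, one element per line.
    texts = [e.text for e in driver.find_elements_by_class_name(class_name)]
    with open(filename, 'w') as f:
        f.write('\n'.join(texts))
    return texts

dump_texts(driver, "m-article-card__header__category", "str_.txt")
dump_texts(driver, "m-article-card__header__title__link", "str2_.txt")
dump_texts(driver, "m-article-card__lead__caption", "str3_.txt")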
・Forgot to add a : (colon) after for
・Didn't know how to loop infinitely → while True: did it (I also got stuck until I realized True starts with a capital letter)
・Didn't know how to write to a file → the cause was that I hadn't enclosed the file name in ""
・Didn't know how to specify the location of ChromeDriver → I had specified only the folder containing it; of course, you have to point at chromedriver.exe itself
・Even after running pip install selenium at the command prompt, it couldn't be used from Spyder → it has to be run on the Anaconda Prompt side
……and many others
・Didn't know what pip is
・Didn't understand the structure of the HTML → I never really figured it out, so I decided to drive the browser with Selenium instead
・The caption contains not only the [address] but also the [phone number] and [nearest station] → this is fairly fatal, and it would probably be better to write each store out as one group (this time I cleaned it up in Excel; see the per-store sketch below). I plan to rewrite it in earnest if the need arises
・The URL that displays all stores nationwide: https://itp.ne.jp/genre/
・The URL that displays stores in Tokyo: https://itp.ne.jp/genre/?area=13
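For reference, here is one way the search URL could be assembled from those parameters. I am only generalizing from the two examples above plus the pachinko URL used earlier, so treat the parameter combinations as assumptions:

def genre_url(area=None, genre=None, subgenre=None):
    # Build an i-Townpage genre-search URL from the parameters observed above.
    # Combinations beyond the observed examples are assumptions.
    base = 'https://itp.ne.jp/genre/'
    params = {'area': area, 'genre': genre, 'subgenre': subgenre}
    query = '&'.join(f'{k}={v}' for k, v in params.items() if v is not None)
    return base + ('?' + query if query else '')

print(genre_url())         # all stores nationwide
print(genre_url(area=13))  # stores in Tokyo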
I noticed that it is easier to format the output by collecting per store, via class="o-result-article-list__item", rather than collecting the categories, titles, and addresses separately; a sketch follows.
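A sketch of that per-store approach, assuming each o-result-article-list__item card nests the same category/title/caption elements used above (I have not verified the nesting, and stores.txt is a file name I made up):

rows = []
for card in driver.find_elements_by_class_name("o-result-article-list__item"):
    # Look up each field inside the card so one store stays together as one row.
    category = card.find_element_by_class_name("m-article-card__header__category").text
    title = card.find_element_by_class_name("m-article-card__header__title__link").text
    caption = card.find_element_by_class_name("m-article-card__lead__caption").text
    rows.append('\t'.join([title, caption, category]))

with open("stores.txt", 'w') as f:
    f.write('\n'.join(rows))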