An Excel macro called "Kenshakun," which comprehensively collected specific information from i-Townpage, used to be distributed, but it can no longer be used because of a specification change on the i-Townpage side in November 2019. So, as scraping practice, I, a Python beginner (I touched the language briefly in a university lecture), took on the challenge on momentum alone. The ultimate goal is to "get the store name, address, and category for a specific subcategory."
The i-Townpage terms of use (https://itp.ne.jp/guide/web/notice/) prohibit the following two acts:
・Acts that significantly impact the i-Townpage service
・Repeatedly accessing i-Townpage with a program that accesses it automatically
This time it is just a crappy script that keeps pressing the button that loads the next chunk of the page until it reaches the bottom, so I assume it falls under normal use rather than repeated access (if even this is out, the spec would mean normal use could never reach the bottom either……).
In the Anaconda Prompt:
pip install selenium
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/Users/*****/*****/Selenium/chromedriver/chromedriver.exe')
driver.get('https://itp.ne.jp/genre/?area=13&genre=13&subgenre=177&sort=01&sbmap=false')
After creating the driver, open the specified URL in the browser. Since I was searching for pachinko parlors this time, this is the URL for a search with the area "Tokyo" and the category "indoor amusement."
while True:
    try:
        driver.find_element_by_class_name('m-read-more__text').click()
    except:
        print('☆ Finished repeatedly clicking "Show more" ☆')
        break
Press the "Show more" button repeatedly until you get an error (= to the bottom line). This will display all the hit store information on the HTML.
elist = []
elems = driver.find_elements_by_class_name("m-article-card__header__category")
for e in elems:
    elist.append(e.text)
print(elist)

str_ = '\n'.join(elist)
print(str_)
with open("str_.txt", 'w') as f:
    f.write(str_)
Create an empty list and push into it the innerText of every element whose class name is m-article-card__header__category. Then join the list into a string with one element per line and write it out as a text file.
flist = []
elems2 = driver.find_elements_by_class_name("m-article-card__header__title__link")
for e in elems2:  # loop variable renamed so it does not shadow the file handle below
    flist.append(e.text)
print(flist)

str2_ = '\n'.join(flist)
print(str2_)
with open("str2_.txt", 'w') as f:
    f.write(str2_)
glist = []
elems3 = driver.find_elements_by_class_name("m-article-card__lead__caption")  # stray "elems2 =" double assignment removed
for g in elems3:
    glist.append(g.text)
print(glist)

str3_ = '\n'.join(glist)
print(str3_)
with open("str3_.txt", 'w') as f:
    f.write(str3_)
print('success')
driver.quit()
The titles and captions (address, phone number, nearest station) are written out in the same way.
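Since the three blocks above repeat the same pattern, they could also be folded into one helper. A minimal sketch (the function name dump_texts and its return value are my own additions):

def dump_texts(driver, class_name, filename):
    # Collect the innerText of every element with the given class name
    # and write it to a file, one element per line.
    texts = [e.text for e in driver.find_elements_by_class_name(class_name)]
    with open(filename, 'w') as f:
        f.write('\n'.join(texts))
    return texts

dump_texts(driver, "m-article-card__header__category", "str_.txt")
dump_texts(driver, "m-article-card__header__title__link", "str2_.txt")
dump_texts(driver, "m-article-card__lead__caption", "str3_.txt")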
・Forgot to add a : (colon) after for
・Didn't know how to loop infinitely → while True: did it (I also got stuck until I realized True starts with a capital letter)
・Didn't know how to write to a file → the cause was that I hadn't enclosed the file name in ""
・Didn't know how to specify the location of ChromeDriver → I had specified only the folder containing it; of course, you have to point at chromedriver.exe itself
・Even after running pip install selenium at the command prompt, it couldn't be used from Spyder → it has to be run on the Anaconda Prompt side
……and many others
・Didn't know what pip is
・Didn't understand the structure of the HTML → I never really figured it out, so I decided to drive the browser with Selenium instead
・The caption contains not only the [address] but also the [phone number] and [nearest station] → this is fairly fatal, and it would probably be better to write each store out as one group (this time I cleaned it up in Excel; see the per-store sketch below). I plan to rewrite it in earnest if the need arises
・The URL that displays all stores nationwide: https://itp.ne.jp/genre/
・The URL that displays stores in Tokyo: https://itp.ne.jp/genre/?area=13
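For reference, here is one way the search URL could be assembled from those parameters. I am only generalizing from the two examples above plus the pachinko URL used earlier, so treat the parameter combinations as assumptions:

def genre_url(area=None, genre=None, subgenre=None):
    # Build an i-Townpage genre-search URL from the parameters observed above.
    # Combinations beyond the observed examples are assumptions.
    base = 'https://itp.ne.jp/genre/'
    params = {'area': area, 'genre': genre, 'subgenre': subgenre}
    query = '&'.join(f'{k}={v}' for k, v in params.items() if v is not None)
    return base + ('?' + query if query else '')

print(genre_url())         # all stores nationwide
print(genre_url(area=13))  # stores in Tokyo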
I noticed that it is easier to format the output by collecting per store, via class="o-result-article-list__item", rather than collecting the categories, titles, and addresses separately; a sketch follows.
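A sketch of that per-store approach, assuming each o-result-article-list__item card nests the same category/title/caption elements used above (I have not verified the nesting, and stores.txt is a file name I made up):

rows = []
for card in driver.find_elements_by_class_name("o-result-article-list__item"):
    # Look up each field inside the card so one store stays together as one row.
    category = card.find_element_by_class_name("m-article-card__header__category").text
    title = card.find_element_by_class_name("m-article-card__header__title__link").text
    caption = card.find_element_by_class_name("m-article-card__lead__caption").text
    rows.append('\t'.join([title, caption, category]))

with open("stores.txt", 'w') as f:
    f.write('\n'.join(rows))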