Good evening. I'm writing this while watching The Bachelor Japan on Amazon Prime, and I'd like to be fought over too.
At work I suddenly needed to scrape data from an external site for a data analysis task, so I did. I had some trouble with a site that uses lazy loading, so for now I'm writing up that part as a memo. It took quite a while to figure out.
The environment is as follows.
macOS 10.2.3
python 3.6.0
phantomjs 2.1.1
selenium 3.0.2
I've been hooked on anime lately, so for practice I decided to scrape the anime genres from d Anime Store.
lazy_load_scrape.py
import lxml.html as lh
import requests as rq
import cssselect
from selenium import webdriver
import time
#Time to wait for lazy load to load
pause = 5
#Get root
t_url = 'https://anime.dmkt-sp.jp/animestore/gen_sel_pc'
t_html = rq.get(t_url).text
root = lh.fromstring(t_html)
#Get genre text and links
ls = []
for i in root.cssselect('.btnList > a'):
    ls.append({'genre': i.text_content(), 'child_url': 'https://anime.dmkt-sp.jp/animestore' + i.attrib['href']})
#Current ls
# [{'genre': '\nSF/Fantasy(733)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=11'}, {'genre': '\n robot/Mecha(214)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=12'}, {'genre': '\n action/battle(606)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=13'}, {'genre': '\n comedy/gag(466)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=14'}, {'genre': '\n romance/Romantic comedy(370)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=15'}, {'genre': '\n everyday/Heartwarming(112)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=23'}, {'genre': '\n sports/Competition(122)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=16'}, {'genre': '\n horror/Suspense/Detective(160)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=17'}, {'genre': '\n history/Senki(75)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=18'}, {'genre': '\n war/military(55)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=19'}, {'genre': '\n drama/Youth(556)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=20'}, {'genre': '\n short(218)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=22'}, {'genre': '\n stage/live/etc.(87)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=24'}]
#Collect the results as {genre: [titles]}
results = {}
#For statement to access each child page
for l in ls:
    genre = l['genre']
    c_url = l['child_url']
    #Specify PhantomJS as the selenium driver and feed it the url
    driver = webdriver.PhantomJS()
    driver.get(c_url)
    #Get root for the child page before scrolling
    croot1 = lh.fromstring(driver.page_source)
    #Elements present before scrolling (not used below; only the initially loaded part)
    t_element = croot1.cssselect('.webkit2LineClamp')
    #Scroll down to load the lazy loaded part
    lastHeight = driver.execute_script("return document.body.scrollHeight") #Page height before scrolling
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") #Scroll down
        time.sleep(pause) #Wait for it to load
        #If the height did not change, nothing more was loaded
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            break
        lastHeight = newHeight
    #Get root when everything is loaded
    croot = lh.fromstring(driver.page_source)
    #Get the title elements using cssselect
    ts = croot.cssselect('.webkit2LineClamp')
    #Store the titles of the works in a list
    c_elements = [t.text_content() for t in ts]
    #Associate the genre with its list of works
    results[genre] = c_elements
    #Close the browser before moving on to the next genre
    driver.quit()
For now, this gives you a list of works for each genre.
The main thing I got stuck on was the asynchronous loading. If you compare the page height before and after each scroll, you can tell whether more content was loaded, so you keep scrolling down until the height stops changing. This seems to work for the time being.
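The height comparison can be pulled out into a small helper, roughly as sketched below. This is only an illustration under some assumptions: the name `scroll_until_stable` and the `max_rounds` safety limit are mine and do not appear in the script above, and the only thing it relies on is that the driver has an `execute_script` method, as the selenium driver does.

```python
import time


def scroll_until_stable(driver, pause=5.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is anything with an `execute_script(script)` method,
    for example a selenium WebDriver. `max_rounds` is a hypothetical
    safety limit so that an endlessly growing page cannot loop forever.
    Returns the last page height that was observed.
    """
    # Page height before the first scroll
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Scroll to the current bottom so the lazy loader fires
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the asynchronous load time to finish
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height unchanged: nothing more was loaded
            break
        last_height = new_height
    return last_height
```

With a helper like this, the body of the genre loop would reduce to calling `scroll_until_stable(driver, pause)` after `driver.get(c_url)` and then parsing `driver.page_source`. Note that a fixed `pause` is still a guess: if the site takes longer than that to load, the loop stops too early.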
I have confirmed that this retrieves the titles, but the results may also contain items other than the ones I want.
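As a first cleanup step, something like the following might help. It is a rough sketch, not part of the original script: the function names `parse_genre_label` and `clean_titles` are made up for illustration, and the label format `name(count)` is assumed from the `ls` output shown in the comment above. It only normalizes whitespace and removes blanks and duplicates; it cannot tell which titles are actually unwanted.

```python
import re

# Genre labels in the scraped list look like '\nSF/Fantasy(733)\n\n'
_LABEL = re.compile(r"^(.*?)\s*\((\d+)\)$")


def parse_genre_label(raw):
    """Split a raw genre label into (name, count).

    Returns (text, None) when the label has no trailing '(number)'.
    """
    # Collapse newlines and repeated spaces into single spaces
    text = " ".join(raw.split())
    m = _LABEL.match(text)
    if m is None:
        return text, None
    return m.group(1), int(m.group(2))


def clean_titles(titles):
    """Normalize whitespace, drop empty strings and duplicates, keep order."""
    seen = set()
    cleaned = []
    for t in titles:
        title = " ".join(t.split())
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    return cleaned
```

For example, `parse_genre_label('\nSF/Fantasy(733)\n\n')` returns `('SF/Fantasy', 733)`, and the count could be compared with the number of titles collected for that genre as a sanity check.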
There aren't many sites that go to the trouble of dividing anime into genres, are there?
While I was writing this, my favorite contestant, the pretty girl with the nice collarbones, was eliminated from The Bachelor Japan. I'm so shocked I could cry.