[Python, Selenium, PhantomJS] A story about scraping a website with lazy loading

Good evening. I'm writing this while watching The Bachelor Japan on Amazon Prime, and I want my heart stolen too.

My data analysis work suddenly required external data, so I scraped it. A website that uses lazy loading gave me some trouble, so I'm writing that part down as a memo for now. The solution took quite a while to track down.

The environment is as follows.

macOS 10.2.3
python 3.6.0
phantomjs 2.1.1
selenium 3.0.2

Lazy Load processing

I've been hooked on anime lately, so for practice I decided to scrape the anime genres from d Anime Store.

lazy_load_scrape.py



import lxml.html as lh
import requests as rq
import cssselect
from selenium import webdriver
import time

#Seconds to wait for lazy-loaded content after each scroll
pause = 5


#Get root
t_url = 'https://anime.dmkt-sp.jp/animestore/gen_sel_pc'
t_html = rq.get(t_url).text
root = lh.fromstring(t_html)

#Get genre text and links
ls = []
for i in root.cssselect('.btnList > a'):
    ls.append({'genre': i.text_content(), 'child_url': 'https://anime.dmkt-sp.jp/animestore' + i.attrib['href']})

#Current ls
# [{'genre': '\nSF/Fantasy(733)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=11'}, {'genre': '\n robot/Mecha(214)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=12'}, {'genre': '\n action/battle(606)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=13'}, {'genre': '\n comedy/gag(466)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=14'}, {'genre': '\n romance/Romantic comedy(370)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=15'}, {'genre': '\n everyday/Heartwarming(112)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=23'}, {'genre': '\n sports/Competition(122)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=16'}, {'genre': '\n horror/Suspense/Detective(160)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=17'}, {'genre': '\n history/Senki(75)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=18'}, {'genre': '\n war/military(55)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=19'}, {'genre': '\n drama/Youth(556)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=20'}, {'genre': '\n short(218)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=22'}, {'genre': '\n stage/live/etc.(87)\n\n', 'child_url': 'https://anime.dmkt-sp.jp/animestore/gen_pc?genreCd=24'}]

#Loop over the genres and visit each genre page
for l in ls:
  genre = l['genre']
  c_url = l['child_url']

  #Use PhantomJS as the Selenium driver and open the genre page
  driver = webdriver.PhantomJS()
  driver.get(c_url)

  #Get root for child resource
  croot1 = lh.fromstring(driver.page_source)

  #Get elements using cssselect
  t_element = croot1.cssselect('.webkit2LineClamp')
  
  #Scroll down to load the lazy loaded part
  lastHeight = driver.execute_script("return document.body.scrollHeight")  #Page height before scrolling
  while True:
      driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  #Scroll down
      
      time.sleep(pause)  #Wait for it to load

      #If the page height did not grow, nothing more was loaded, so stop
      newHeight = driver.execute_script("return document.body.scrollHeight")
      if newHeight == lastHeight:
          break
      lastHeight = newHeight

  #Get root when everything is loaded
  croot = lh.fromstring(driver.page_source)

  #Get by specifying an element using cssselect
  ts = croot.cssselect('.webkit2LineClamp')

  #Store the title of the work you wanted in the list
  c_elements = [t.text_content() for t in ts]

  #Genre and the list of works that belong to it
  print({genre: c_elements})

  #Close the browser before moving on to the next genre
  driver.quit()

For the time being, this will give you a list of genres and works.
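
The genre labels in the sample output above still carry newlines and a work count, for example '\nSF/Fantasy(733)\n\n'. Here is a minimal sketch of how they could be cleaned up, assuming every label has the form "name(count)" as in that sample. The names `GENRE_PATTERN` and `parse_genre` are made up for this illustration and are not part of the original script.

```python
import re

# Assumed label format: optional whitespace, a name, then a count in parentheses,
# e.g. "\nSF/Fantasy(733)\n\n". GENRE_PATTERN is a hypothetical helper name.
GENRE_PATTERN = re.compile(r"^\s*(?P<name>.*?)\s*\((?P<count>\d+)\)\s*$", re.DOTALL)


def parse_genre(raw):
    """Split a raw genre label into (name, count); count is None if no count is found."""
    match = GENRE_PATTERN.match(raw)
    if match is None:
        # The label does not follow the assumed format, so only trim whitespace.
        return raw.strip(), None
    return match.group("name"), int(match.group("count"))


if __name__ == "__main__":
    print(parse_genre("\nSF/Fantasy(733)\n\n"))
```

Calling `parse_genre(l['genre'])` inside the loop would give a tidy name to use as the dictionary key instead of the raw text.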

Summary

The main thing I got stuck on was the asynchronous part. If you compare the page height before and after each scroll like this, you can keep scrolling down until nothing more loads. This seems to work for the time being.
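
The height comparison can also be pulled out into a small reusable function. The following is only a sketch under a few assumptions: the name `scroll_until_stable` and the `max_rounds` safety limit are additions for illustration and do not appear in the script above, and `driver` is assumed to be any object with Selenium's `execute_script()` method.

```python
import time


def scroll_until_stable(driver, pause=5.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` only needs an execute_script() method, as Selenium drivers have.
    `max_rounds` is a hypothetical safety limit for pages that never stop growing.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Scroll down to trigger the next lazy load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the lazy-loaded content to arrive

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # The height did not change, so nothing more was loaded.
            return rounds
        last_height = new_height
    return max_rounds
```

In the script above, the `while True:` loop could then be replaced by `scroll_until_stable(driver, pause)`. The fixed wait is still a guess: if the site takes longer than `pause` seconds to respond, the loop stops too early.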

I have confirmed that this retrieves the titles, but the results may also contain data other than what I wanted.

There aren't many sites that put this much effort into dividing anime into genres, are there?

Finally

While I was writing this, my favorite contestant, the pretty one with the lovely collarbones, was eliminated from The Bachelor Japan. I'm so shocked I could cry.
