Try using Selenium

First experience with Selenium

I had to scrape sites with dynamic elements, so I had no choice but to start learning Selenium.

pip install selenium

Since I want to use Chrome as the browser, I downloaded ChromeDriver and placed it inside the virtual environment. I moved it to /bin.

https://sites.google.com/a/chromium.org/chromedriver/downloads
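
Before launching the browser, it can help to confirm that the driver binary is actually where the script will look for it. The sketch below is only an illustration: `find_chromedriver` is a hypothetical helper, not part of Selenium, and it assumes the binary is named `chromedriver` and sits either next to the script or in a directory on PATH.

```python
import os
import shutil


def find_chromedriver(script_dir, name='chromedriver'):
    """Return a usable path to the driver binary, or None if not found.

    Hypothetical helper for illustration; not part of Selenium.
    """
    # First look in the given directory (typically the script's own folder).
    candidate = os.path.join(script_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Otherwise fall back to searching the directories on PATH
    # (for example the virtualenv's bin directory when it is activated).
    return shutil.which(name)
```

Calling `find_chromedriver(os.path.dirname(os.path.abspath(__file__)))` at the top of test.py and stopping with a clear message when it returns None would make a missing driver easier to diagnose.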

Let's check right away that it works. I use Yahoo! JAPAN as the test URL.

test.py


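import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the chromedriver binary, resolved relative to this script.
DRIVER_PATH = os.path.join(os.path.dirname(__file__), 'chromedriver')
# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service(DRIVER_PATH))
browser.get('https://www.yahoo.co.jp')

try:
    # find_element returns the first element with the given class name.
    elem_1 = browser.find_element(By.CLASS_NAME, 'emphasis')
    print('<{}>Discover!'.format(elem_1.text))
    time.sleep(3)
except NoSuchElementException:
    print('No')
finally:
    # Close the browser even if the lookup failed.
    browser.quit()

(flaskworks) $ python test.py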
<GDP year 1.0%Downward revision to increase
Contradiction photo NEW to the Prime Minister's answer
Uncle angry testimony photo of British terrorist suspect
Mt. Fuji in Gunma?Misleading station name photo NEW
Former Idol Bartender No.1 photo
Tanaka learn the language Commentator apology photo NEW
Honda Photograph of passive smoking immediately after the game
Yamazaki Anna Photograph admitted to dating with Obata NEW>Discover!

Confirmed that it works without problems. Next I will try moving to another page.

test.py


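import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the chromedriver binary, resolved relative to this script.
DRIVER_PATH = os.path.join(os.path.dirname(__file__), 'chromedriver')
browser = webdriver.Chrome(service=Service(DRIVER_PATH))
browser.get('https://www.yahoo.co.jp')

try:
    # Follow the "See more" link, then read a headline on the new page.
    link_elem = browser.find_element(By.LINK_TEXT, 'See more')
    link_elem.click()
    text_elem = browser.find_element(By.CLASS_NAME, 'ttl')
    print(text_elem.text)
    time.sleep(3)
except NoSuchElementException:
    print('No')
finally:
    # Close the browser even if a lookup failed.
    browser.quit()

(flaskworks) $ python test.py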
North Korea launches unknown projectile

Huh? Only one item came back. That is because find_element returns only the first matching element.

When I look up text_elem by the class name 'list' instead of 'ttl', I get this:

(flaskworks) $ python test.py
North Korea launches unknown projectile
international
6/8(Thu) 7:42
Nishikori's defeat regret is a tiebreaker
Sports
6/8(Thu) 5:10
Nishikori reversal defeated French Open 4 not strong
Sports
6/8(Thu) 2:12
North Korean ballistic missile launch signs
international
... (rest omitted)

I see. Maybe this is easier than Beautiful Soup.
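
The single result earlier came from `find_element`, which returns only the first match. Selenium also has a plural `find_elements` that returns a list of every match, so another way to collect all the headlines is to loop over that list. This is a minimal sketch, assuming the `ttl` class from the earlier script marks each headline. `collect_texts` is a hypothetical helper name, and `driver` is any Selenium WebDriver such as the `browser` object above.

```python
def collect_texts(driver, by, value):
    """Return the visible text of every element matching the locator.

    Hypothetical helper: `driver` is a Selenium WebDriver (or anything with
    a compatible find_elements method); `by` and `value` form the locator.
    """
    # find_elements returns a list, which is empty when nothing matches,
    # so no exception handling is needed for the "not found" case.
    elements = driver.find_elements(by, value)
    # Skip elements whose text is blank.
    return [e.text.strip() for e in elements if e.text.strip()]
```

With the scripts in this post it could be called as `collect_texts(browser, 'class name', 'ttl')`. Selenium's `By.CLASS_NAME` constant is that same string, so passing `By.CLASS_NAME` works as well.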

Customizing it to crawl multiple pages automatically

This just crudely appends a page parameter to the URL. It still clicks "next" on every pass, so the process is not elegant. There is probably a better way, but as a beginner this is as far as I can get for now.

test.py


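import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the chromedriver binary, resolved relative to this script.
DRIVER_PATH = os.path.join(os.path.dirname(__file__), 'chromedriver')
BASE_URL = 'https://news.yahoo.co.jp/list/?c=domestic&p='

browser = webdriver.Chrome(service=Service(DRIVER_PATH))

try:
    for page in range(1, 6):
        try:
            # Load list page `page`, then click "next", so the text printed
            # below comes from the page that follows the one just loaded.
            browser.get(BASE_URL + str(page))
            link_elem = browser.find_element(By.LINK_TEXT, 'next')
            link_elem.click()
            text_elem = browser.find_element(By.CSS_SELECTOR, '.list')
            print(text_elem.text)
            time.sleep(3)
        except NoSuchElementException:
            print('No')
finally:
    # Close the browser once all pages have been visited.
    browser.quit()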
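
One possibly tidier approach is to build each page URL directly from the `c` and `p` query parameters seen above and skip the click on "next" altogether. This is only a sketch under assumptions: `page_urls` is a hypothetical helper, and it assumes the site accepts any page number in `p` and that numbering starts at 1, neither of which was verified here.

```python
from urllib.parse import urlencode

LIST_URL = 'https://news.yahoo.co.jp/list/'


def page_urls(category, pages, base=LIST_URL):
    """Yield the list-page URL for page 1 through `pages` of a category."""
    for page in range(1, pages + 1):
        # urlencode builds the query string (c=...&p=...) and escapes
        # any characters that are not URL-safe.
        yield base + '?' + urlencode({'c': category, 'p': page})
```

Each URL from `page_urls('domestic', 5)` could then be passed to `browser.get(...)` in a loop, reading the `.list` element on every page without clicking anything.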
