Running a headless browser locally on macOS is easy because PhantomJS still works fine there. I wanted to do the same on Cloud Run, but when I tried it with the official Python image it turned out to be unexpectedly complicated, so I am writing it down.
PhantomJS is apparently no longer being updated, so I switched to headless Chrome. I didn't want to spend much time on it, so I looked at other people's articles, but I couldn't follow them, so I ended up working it out myself.
It's not particularly difficult, and it works once the following conditions are met:
Download Chrome itself
Download the driver version that matches the installed Chrome
Set the startup options appropriately
Dockerfile
Base image used:
# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7
Next, install Chrome itself. Be sure to check which version actually gets installed, because the driver has to match it.
RUN sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN apt update
RUN apt install google-chrome-stable -y
Next, find the driver that is closest to your version of Chrome. The downloads are listed at https://chromedriver.storage.googleapis.com/ and the latest driver for a given major version is named in a LATEST_RELEASE file. In my case Chrome was in the 80 series, so I looked at https://chromedriver.storage.googleapis.com/LATEST_RELEASE_80
Download and unzip the driver with the version number you found:
RUN wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
RUN unzip chromedriver_linux64.zip -d /usr/bin/
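The version lookup above can also be scripted. The following is only a minimal sketch, not something from the original setup: the helper names `major_version`, `latest_release_url` and `driver_zip_url` are hypothetical, and the URL patterns are simply the ones used in this memo.

```python
BASE = "https://chromedriver.storage.googleapis.com"


def major_version(chrome_version: str) -> str:
    """Return the major part of a version such as '80.0.3987.106'."""
    return chrome_version.strip().split(".")[0]


def latest_release_url(chrome_version: str) -> str:
    """URL of the text file naming the newest driver for this major version."""
    return f"{BASE}/LATEST_RELEASE_{major_version(chrome_version)}"


def driver_zip_url(driver_version: str) -> str:
    """URL of the Linux 64-bit driver archive for an exact driver version."""
    return f"{BASE}/{driver_version}/chromedriver_linux64.zip"


if __name__ == "__main__":
    print(latest_release_url("80.0.3987.106"))
    print(driver_zip_url("80.0.3987.106"))
```

Note that this download location only lists drivers up to Chrome 114; drivers for newer Chrome versions are published through Chrome for Testing instead.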
At this stage, make sure both binaries can be found on the PATH.
which chromedriver
which google-chrome
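The same check can be done from Python, for example when the container starts. This is a sketch under assumptions: `find_binaries`, `major_of` and `same_major` are hypothetical helpers, and the version strings in the usage lines are only examples of what `google-chrome --version` and `chromedriver --version` typically print.

```python
import re
import shutil


def find_binaries(names=("chromedriver", "google-chrome")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def major_of(version_output: str):
    """Extract the major version from text like 'Google Chrome 80.0.3987.106'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def same_major(chrome_output: str, driver_output: str) -> bool:
    """True when both outputs report the same major version."""
    chrome, driver = major_of(chrome_output), major_of(driver_output)
    return chrome is not None and chrome == driver


if __name__ == "__main__":
    for name, path in find_binaries().items():
        print(name, "->", path or "NOT FOUND")
    print(same_major("Google Chrome 80.0.3987.106",
                     "ChromeDriver 80.0.3987.106"))
```

Comparing only the major version mirrors the way the driver was picked from the LATEST_RELEASE file for the 80 series.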
That is all the setup needed. After that, just use it. Here is a usage example for now.
app.py
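from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

URL = "https://example.jp"


def get_trends():
    options = Options()
    options.add_argument('--headless')               # run without a display
    options.add_argument('--no-sandbox')             # Chrome runs as root in the container
    options.add_argument('--disable-dev-shm-usage')  # /dev/shm is small in containers
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(URL)
        html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
        soup = BeautifulSoup(html, "lxml")
        # Returning the parsed page is an assumption; extract what you need here.
        return soup
    finally:
        driver.quit()  # always release the browser process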
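For reference, the three startup options can be kept in one place together with the usual reason for each. This is only a sketch: `CONTAINER_FLAGS`, `apply_flags` and `RecordingOptions` are hypothetical names, and the helper accepts any object with an `add_argument` method, which is the method called on the selenium `Options` object in app.py.

```python
# Flags used in this memo, with the usual reason for each in a container.
CONTAINER_FLAGS = {
    "--headless": "run Chrome without a display",
    "--no-sandbox": "Chrome refuses to start as root with the sandbox enabled",
    "--disable-dev-shm-usage": "use /tmp instead of /dev/shm, which is small in containers",
}


def apply_flags(options, flags=CONTAINER_FLAGS):
    """Add every flag to an options object and return it."""
    for flag in flags:
        options.add_argument(flag)
    return options


class RecordingOptions:
    """Stand-in for selenium's Options, so the sketch runs without a browser."""

    def __init__(self):
        self.arguments = []

    def add_argument(self, argument):
        self.arguments.append(argument)


if __name__ == "__main__":
    print(apply_flags(RecordingOptions()).arguments)
```

In real use, the selenium `Options()` instance would be passed in place of `RecordingOptions()`.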
These are rough notes.