I want to scrape pages that load their content with JavaScript

A headless browser makes this easy, and PhantomJS still works comfortably for doing it locally on macOS. I wanted to do the same on Cloud Run, but when I tried it out with the official Python image it turned out to be unexpectedly complicated, so here is a note.

There is too much information out there and I couldn't make sense of it

Apparently PhantomJS is no longer being updated, so I decided to just switch to headless Chrome. I didn't want to spend many man-hours on it, so I looked up other people's articles. I still couldn't follow them, so I decided to work it out myself.

Conclusion

It's not particularly difficult, and it works easily if the following conditions are met.

Install Chrome itself
Download the ChromeDriver version that matches the installed Chrome
Set the startup options appropriately

Base image used in the Dockerfile:

# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

When installing Chrome itself, be sure to check which version actually gets installed, because the driver has to match it.

RUN sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN apt-get update && apt-get install -y google-chrome-stable

Next, find the driver release that matches your version of Chrome. The downloads are listed at https://chromedriver.storage.googleapis.com/
To get the latest driver version for your Chrome major version, open the matching LATEST_RELEASE file. This time Chrome was in the 80 series, so: https://chromedriver.storage.googleapis.com/LATEST_RELEASE_80
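The lookup above can be sketched as a small helper that builds both URLs from a version string. This is only an illustration: the function names are made up for this note, and it assumes the version text looks like the output of `google-chrome --version` (for example "Google Chrome 80.0.3987.106").

```python
import re

# Base URL taken from the article; everything else here is a hypothetical helper.
BASE_URL = "https://chromedriver.storage.googleapis.com"


def chrome_major_version(version_text):
    """Extract the major version from text such as 'Google Chrome 80.0.3987.106'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError("no Chrome version found in: %r" % version_text)
    return match.group(1)


def latest_release_url(version_text):
    """URL of the file that names the latest driver for this Chrome major version."""
    return "%s/LATEST_RELEASE_%s" % (BASE_URL, chrome_major_version(version_text))


def driver_zip_url(driver_version):
    """URL of the Linux driver archive for an exact driver version."""
    return "%s/%s/chromedriver_linux64.zip" % (BASE_URL, driver_version)


if __name__ == "__main__":
    print(latest_release_url("Google Chrome 80.0.3987.106"))
    print(driver_zip_url("80.0.3987.106"))
```

The first URL returns a plain version number, and that number is what goes into the second URL.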

Download and unzip the driver using the version number you found.

RUN wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
RUN unzip chromedriver_linux64.zip -d /usr/bin/

Of course, make sure both binaries are on the PATH at this stage.

which chromedriver
which google-chrome
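The same PATH check can also be done from Python, which may be handy as a startup check inside the container. A minimal sketch, assuming the binaries are named `chromedriver` and `google-chrome` as above; `missing_binaries` is a made-up helper name.

```python
import shutil


def missing_binaries(names=("chromedriver", "google-chrome")):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("not found on PATH: " + ", ".join(missing))
    else:
        print("chromedriver and google-chrome are both on PATH")
```

`shutil.which` does the same lookup as the `which` command, so an empty list means both are visible.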

That's all it takes. After that you just use it. Here is a usage example for the time being.

`app.py`


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup


URL = "https://example.jp"


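def get_trends():
    # Flags needed to run Chrome inside a container.
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(URL)
        html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
        # Parse the rendered HTML; the "lxml" parser needs the lxml package installed.
        return BeautifulSoup(html, "lxml")
    finally:
        # Always shut the browser down so no Chrome process is left behind.
        driver.quit()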
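As a note on "set startup options appropriately", the three flags used in app.py can be kept in one place together with the reason for each. This is just a sketch: `CHROME_FLAGS` and `chrome_arguments` are names invented for this note, and the reasons are my understanding of what each flag does in a container.

```python
# Flags from app.py, each with a short reason (hypothetical helper, not part of Selenium).
CHROME_FLAGS = [
    ("--headless", "run without a display, since the container has no screen"),
    ("--no-sandbox", "Chrome refuses to start as root with the sandbox enabled"),
    ("--disable-dev-shm-usage", "use /tmp instead of /dev/shm, which is small in containers"),
]


def chrome_arguments(flags=CHROME_FLAGS):
    """Return only the flag strings, in order."""
    return [flag for flag, _reason in flags]


if __name__ == "__main__":
    for flag, reason in CHROME_FLAGS:
        print(flag + ": " + reason)
```

Each string returned by `chrome_arguments()` could then be passed to `options.add_argument` in a loop instead of writing the three calls by hand.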

the end


[DOCKER] Run headless-chrome on a Debian-based image
