Let's scrape a dynamic site with Docker

Introduction

This is the article on the 13th day of Kinki University Advent Calendar 2019.

First of all, note that scraping is basically a last resort and should not be done unless you really need it. This time we will scrape Qiita's tag ranking. Qiita has an API, but it had no endpoint for getting the tag ranking (as of December 8, 2019), so I scraped it instead. If you can get the information you want through an API, use the API. Also, when scraping, set a wait time so that there is an interval between connections.
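As a minimal sketch of the "leave time between connections" advice, the helper below enforces a minimum interval between successive requests. The class name Throttle and its parameters are hypothetical and not part of the original article; the interval you pick is up to you and the site's terms.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests (hypothetical helper)."""

    def __init__(self, min_interval_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_sec = min_interval_sec
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first one

    def wait(self):
        """Sleep just long enough to keep min_interval_sec between calls.

        Returns the number of seconds actually waited.
        """
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval_sec - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


# Typical use: call throttle.wait() right before every page load, e.g.
#   throttle = Throttle(1.0)
#   for url in urls:
#       throttle.wait()
#       browser.get(url)
```

The clock and sleep functions are injectable only so that the behavior can be checked without really sleeping.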

Requirements

Docker

The version I'm using is below. Docker version 19.03.5, build 633a0ea

docker-compose.yml

docker-compose.yml


version: '3'

services:
    selenium-hub:
        image: selenium/hub
        container_name: selenium-hub
        ports:
          - "4444:4444"
      
    chrome:
        image: selenium/node-chrome-debug
        depends_on:
          - selenium-hub
        environment:
          - HUB_PORT_4444_TCP_ADDR=selenium-hub 
          - HUB_PORT_4444_TCP_PORT=4444
    
    python:
        build: .
        container_name: python
        volumes: 
            - .:/workspace
        command: /bin/bash
        tty: true
        stdin_open: true

Dockerfile

FROM python:3.7
WORKDIR /workspace

RUN pip install \
    selenium \
    beautifulsoup4

Explanation of the Dockerfile and docker-compose.yml

I won't explain again what docker-compose and a Dockerfile are. If you are not familiar with them, please refer to my article "I tried using Elixir's Phoenix and PostgreSQL on Docker", where they are covered in detail.

Let's start with docker-compose.yml.

I think Selenium is now the standard choice for scraping dynamic sites.

Containers are created from the selenium/hub and selenium/node-chrome-debug images. Note the environment settings on selenium/node-chrome-debug: HUB_PORT_4444_TCP_ADDR and HUB_PORT_4444_TCP_PORT tell the Chrome node where to find the hub. Without them you cannot scrape.

The Dockerfile builds the Python environment. The RUN instruction installs the required libraries with pip.

Place these files and the code below (saved as qiita.py) in the same directory and run docker-compose up -d --build to launch the containers.

Code example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pprint


class QiitaGetRanking():
    """
    Class to get ranking data from Qiita.
    """

    def get_tag_ranking(self, browser: webdriver.Remote) -> dict:
        """
        Function to get information about tag ranking from Qiita.

        Parameters
        ----------
        browser: webdriver.Remote
            Webdriver object for scraping.

        Returns
        -------
        tag_ranking_data: dict
            Dictionary object containing tag ranking.
        """
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        ra_tag_names = soup.find_all(class_='ra-Tag_name pr-1')
        tag_ranking_data = {}
        # Ranks start at 1; each value is [tag name, tag page URL].
        for i, ra_tag_name in enumerate(ra_tag_names):
            tag_ranking_data[i+1] = [ra_tag_name.text,
            'https://qiita.com/tags/%s'%(ra_tag_name.text.lower())]
        return tag_ranking_data


if __name__ == "__main__":
    # Main routine. The browser should be closed as soon as the HTML is
    # acquired. The same applies when an error occurs.
    browser = None
    try:
        # desired_capabilities is the Selenium 3 style;
        # Selenium 4 takes options=webdriver.ChromeOptions() instead.
        browser = webdriver.Remote(
            command_executor='http://selenium-hub:4444/wd/hub',
            desired_capabilities=DesiredCapabilities.CHROME)
        print("start scrape")
        browser.get('https://qiita.com')
        # Wait until the tag ranking elements rendered by JavaScript are present.
        # Time out if they have not appeared after 15 seconds.
        WebDriverWait(browser, 15).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.ra-Tag_name')))
        print("generate object")
        qgr = QiitaGetRanking()
        ranking_data = qgr.get_tag_ranking(browser)
    finally:
        # quit() ends the session and closes every window, even after an error.
        if browser is not None:
            browser.quit()
    pprint.pprint(ranking_data)

I think the code is fairly simple. When scraping with Docker, running a Selenium server in containers and scraping through it is easier and more lightweight than building a Selenium environment on Ubuntu.
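The dictionary that get_tag_ranking returns maps each rank to a [tag name, URL] pair. As a rough sketch, that mapping step can be isolated into a plain function that needs no browser. The function name build_tag_ranking is hypothetical, and the percent-encoding with urllib.parse.quote is my addition (an assumption that it helps with non-ASCII tag names); the article's code only lower-cases the name.

```python
from urllib.parse import quote

# Same base URL that the article's code builds tag links from.
TAG_URL_BASE = "https://qiita.com/tags/"


def build_tag_ranking(tag_names):
    """Map rank (starting at 1) to [tag name, tag page URL]."""
    ranking = {}
    for rank, name in enumerate(tag_names, start=1):
        # Lower-case the name as the article's code does. Percent-encoding is
        # an addition so that non-ASCII or unusual characters stay URL-safe.
        ranking[rank] = [name, TAG_URL_BASE + quote(name.lower())]
    return ranking


print(build_tag_ranking(["Python", "Vue.js"]))
```

Keeping this step separate from the Selenium session makes it easy to check the URL building without starting any containers.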

For details on webdriver.Remote, please see "2.5. Using Selenium with Remote WebDriver".
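Right after the containers start, the hub may not accept sessions yet, so it can help to poll it before calling webdriver.Remote. The sketch below is only an illustration: the helper names are hypothetical, and both the status path (http://selenium-hub:4444/wd/hub/status) and the value.ready field in the JSON reply are assumptions about the hub image in use, so check them against your own hub.

```python
import json
import time
import urllib.request


def fetch_status(url, timeout=5):
    """Return the decoded JSON body of the hub's status endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_hub(url, fetch=fetch_status, attempts=30, interval_sec=1.0,
                 sleep=time.sleep):
    """Poll the status endpoint until it reports ready.

    Returns True once ready, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            status = fetch(url)
            # Assumed reply shape: {"value": {"ready": true, ...}}
            if status.get("value", {}).get("ready"):
                return True
        except (OSError, ValueError):
            # Hub not reachable yet, or the body was not JSON: retry.
            pass
        if attempt < attempts - 1:
            sleep(interval_sec)
    return False


# Possible use inside the python container, before webdriver.Remote(...):
#   if not wait_for_hub("http://selenium-hub:4444/wd/hub/status"):
#       raise RuntimeError("Selenium hub did not become ready")
```

The fetch and sleep arguments are injectable so the retry logic can be exercised without a running hub.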

Execution result

Confirm with docker ps that three containers are running, then run the program with docker exec -it python python qiita.py.
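If you would rather check the containers from a script than by eye, something like the following could work on the host. It is a sketch with hypothetical function names: it relies on docker ps --format "{{.Names}}" printing one container name per line and looks only for the two names fixed by container_name in docker-compose.yml, because the chrome service gets a generated name.

```python
import subprocess


def running_container_names(ps_output):
    """Split the output of `docker ps --format '{{.Names}}'` into names."""
    return [line.strip() for line in ps_output.splitlines() if line.strip()]


def missing_containers(ps_output, required=("selenium-hub", "python")):
    """Return the required container names that are not running."""
    running = set(running_container_names(ps_output))
    return [name for name in required if name not in running]


def docker_ps_names():
    """Run docker ps on the host and return its raw output."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True)
    return result.stdout


# Possible use on the host:
#   output = docker_ps_names()
#   print("missing:", missing_containers(output))
#   print("running count:", len(running_container_names(output)))
```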

start scrape
generate object
{1: ['Python', 'https://qiita.com/tags/python'],
 2: ['JavaScript', 'https://qiita.com/tags/javascript'],
 3: ['AWS', 'https://qiita.com/tags/aws'],
 4: ['Rails', 'https://qiita.com/tags/rails'],
 5: ['Ruby', 'https://qiita.com/tags/ruby'],
 6: ['Beginner', 'https://qiita.com/tags/Beginner'],
 7: ['Docker', 'https://qiita.com/tags/docker'],
 8: ['PHP', 'https://qiita.com/tags/php'],
 9: ['Vue.js', 'https://qiita.com/tags/vue.js'],
 10: ['Go', 'https://qiita.com/tags/go']}

If the output looks like this, it worked. Good job!

Conclusion

This time I showed you how to scrape a dynamic site using Docker. Enjoy scraping, and keep it within reasonable limits!
