This is the article for day 13 of the Kinki University Advent Calendar 2019.
First of all, an important note: scraping is a last resort and should be avoided when it isn't necessary. This time we will scrape Qiita's tag ranking. Qiita does have an API, but there was no endpoint for the tag ranking (as of December 8, 2019), so I resorted to scraping. If you can get the information you want through an API, use the API. Also, when you do scrape, put a wait between connections so you don't place unnecessary load on the server.
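As a concrete example, here is a minimal sketch of what "wait between connections" means in practice. The fetch_pages helper and the one-second interval are my own illustrative choices, not from any official guideline:

import time
import urllib.request

def fetch_pages(urls, interval=1.0):
    """Fetch each URL in turn, sleeping between requests so as not to hammer the server."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(url) as res:
            pages.append(res.read())
        time.sleep(interval)  # be polite: pause before the next request
    return pages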
Docker
The version I'm using is below.
Docker version 19.03.5, build 633a0ea
docker-compose.yml
docker-compose.yml
version: '3'
services:
  selenium-hub:
    image: selenium/hub
    container_name: selenium-hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome-debug
    depends_on:
      - selenium-hub
    environment:
      - HUB_PORT_4444_TCP_ADDR=selenium-hub
      - HUB_PORT_4444_TCP_PORT=4444
  python:
    build: .
    container_name: python
    volumes:
      - .:/workspace
    command: /bin/bash
    tty: true
    stdin_open: true
Dockerfile
FROM python:3.7
WORKDIR /workspace
RUN pip install \
    selenium \
    beautifulsoup4
I won't explain here what docker-compose and Dockerfiles are; if you are unfamiliar with them, please refer to "I tried using Elixir's Phoenix and PostgreSQL on Docker", where they are covered in detail.
Let's start with docker-compose.yml.
Selenium is the standard choice here because it can scrape dynamic, JavaScript-rendered sites. We create containers from the `selenium/hub` and `selenium/node-chrome-debug` images. Note that `environment` is set on `selenium/node-chrome-debug`: these variables tell the Chrome node where to find the hub, and you cannot scrape without them.
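If the node never registers, `webdriver.Remote` will simply error out later, so it can help to check the grid first. Here is a minimal sketch that polls the hub's `/wd/hub/status` endpoint (part of Selenium Grid 3; the `wait_for_grid` helper is my own, and the exact JSON shape can differ between grid versions):

import json
import time
import urllib.request

def wait_for_grid(url='http://selenium-hub:4444/wd/hub/status', timeout=60):
    """Poll the hub until it reports ready, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as res:
                status = json.load(res)
            if status.get('value', {}).get('ready'):
                return True
        except OSError:
            pass  # hub not reachable yet
        time.sleep(2)
    return False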
The Dockerfile builds the Python environment. The RUN instruction installs the required libraries (selenium and beautifulsoup4).
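To confirm the build worked, you can check from inside the python container that both libraries are importable. This is just a quick sanity check, not part of the article's code:

import selenium
import bs4

# Print the installed versions of the two libraries pulled in by the Dockerfile.
print(selenium.__version__)
print(bs4.__version__)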
Place these files and the code below in the same directory and run `docker-compose up -d --build` to launch the containers.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pprint


class QiitaGetRanking():
    """
    Class to get ranking data from Qiita.
    """

    def get_tag_ranking(self, browser: webdriver.Remote) -> dict:
        """
        Get information about the tag ranking from Qiita.

        Parameters
        ----------
        browser: webdriver.Remote
            WebDriver object for scraping.

        Returns
        -------
        tag_ranking_data: dict
            Dictionary containing the tag ranking.
        """
        html = browser.page_source.encode('utf-8')
        soup = BeautifulSoup(html, "html.parser")
        ra_tag_names = soup.find_all(class_='ra-Tag_name pr-1')
        tag_ranking_data = {}
        for i, ra_tag_name in enumerate(ra_tag_names):
            tag_ranking_data[i + 1] = [ra_tag_name.text,
                                       'https://qiita.com/tags/%s' % (ra_tag_name.text.lower())]
        return tag_ranking_data


if __name__ == "__main__":
    # Main routine. Quit the browser as soon as the HTML has been
    # retrieved, and likewise when an error occurs.
    browser = webdriver.Remote(
        command_executor='http://selenium-hub:4444/wd/hub',
        desired_capabilities=DesiredCapabilities.CHROME)
    try:
        print("start scrape")
        browser.get('https://qiita.com')
        # Wait until the tag ranking has been rendered by JavaScript.
        # Time out if it has not appeared after 15 seconds.
        WebDriverWait(browser, 15).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'ra-Tag_name')))
        print("generate object")
        qgr = QiitaGetRanking()
        ranking_data = qgr.get_tag_ranking(browser)
        pprint.pprint(ranking_data)
    finally:
        browser.quit()
I think the code is fairly simple. When scraping with Docker, running a Selenium server in its own container is easier and lighter than building a full Selenium environment on Ubuntu.
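To isolate just the parsing step: given HTML containing elements with the class the ranking uses, `find_all(class_=...)` pulls them out directly. Here is a minimal sketch on hand-written HTML (the markup below is illustrative, not Qiita's real page):

from bs4 import BeautifulSoup

html = """
<div class="ra-Tag_name pr-1">Python</div>
<div class="ra-Tag_name pr-1">Docker</div>
"""
soup = BeautifulSoup(html, "html.parser")
for i, tag in enumerate(soup.find_all(class_='ra-Tag_name pr-1')):
    print(i + 1, tag.text)  # -> "1 Python", then "2 Docker"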
For `webdriver.Remote`, refer to "2.5. Using Selenium with Remote WebDriver" in the Selenium with Python documentation.
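In its simplest form, `webdriver.Remote` only needs the hub URL and the desired capabilities. Here is a minimal standalone sketch (separate from the article's class) that connects to the hub, loads a page, and prints its title:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Connect to the Selenium hub container instead of a local browser binary.
browser = webdriver.Remote(
    command_executor='http://selenium-hub:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.CHROME)
try:
    browser.get('https://qiita.com')
    print(browser.title)  # title of the fetched page
finally:
    browser.quit()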
Make sure `docker ps` shows all three containers, then run the program with `docker exec -it python python qiita.py`.
start scrape
generate object
{1: ['Python', 'https://qiita.com/tags/python'],
2: ['JavaScript', 'https://qiita.com/tags/javascript'],
3: ['AWS', 'https://qiita.com/tags/aws'],
4: ['Rails', 'https://qiita.com/tags/rails'],
5: ['Ruby', 'https://qiita.com/tags/ruby'],
6: ['Beginner', 'https://qiita.com/tags/Beginner'],
7: ['Docker', 'https://qiita.com/tags/docker'],
8: ['PHP', 'https://qiita.com/tags/php'],
9: ['Vue.js', 'https://qiita.com/tags/vue.js'],
10: ['Go', 'https://qiita.com/tags/go']}
If you see output like this, it's a complete victory. Thank you for your hard work.
This time I showed you how to scrape a dynamic site using Docker. Let's enjoy scraping, while staying within the rules!