[2020 version] Development procedure for personal crawlers and precautions

Introduction

The purpose of this article is to help anyone interested in automating web content collection develop an ideal crawler. To that end, here are 12 steps for developing one. A crawler is an automated program that visits websites and records and collects their content. In my view, the ideal crawler is one that complies with the law and your own ethics and does not interfere with the operation of the target website. Crawlers that fall short of this can be blocked by websites or forcibly redirected to an error screen. Let your computer do the tedious work, stay out of trouble, and increase your free time.

■ Reference site (Python introductory site)

01. The purpose of content collection should be one of "use by individuals or within a family," "providing a web search service," or "information analysis."

All web content is someone's copyrighted work and is therefore subject to copyright law. Under copyright law, the three purposes listed above are cases in which a work may be used freely.

■ Reference site

In particular, the latter two were given a clearer legal basis by the 2018 revision of the Copyright Law. The accompanying restrictions have also been clarified, so please check the details.

■ Reference site

However, even for those purposes, the following points should be noted in common.

02. The target website should contain legal content.

Under the revised Copyright Law (enforced on January 1, 2021, the third year of Reiwa), it is prohibited to download illegally uploaded content while knowing that it is illegal, regardless of the type of content (e.g., text, images, audio, video). It is highly recommended that you target websites whose authors have agreed to provide the content.

■ Reference site

03. When targeting content for members of a website, follow the terms of use of that website.

If you agree to the terms of use in order to become a member of the website, you must comply with those terms of use. If the terms of use prohibit crawlers, the website cannot be crawled.

04. If the website gives instructions to the crawler, follow those instructions.

A website may give instructions to crawlers that apply to the entire site, to a specific web page, or to a specific screen item. Typical mechanisms are robots.txt (site-wide), the robots meta tag (per page), and the rel="nofollow" attribute (per link), and you need to follow them. A minimal robots.txt check is sketched below.
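
As a minimal sketch of how to respect the site-wide instructions, the snippet below checks robots.txt with Python's standard urllib.robotparser before fetching a URL. The site URL and the crawler name are placeholders.

Check robots.txt

from urllib.robotparser import RobotFileParser

#Load the site-wide crawler instructions from robots.txt
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt') #Target site (placeholder)
rp.read()

#Check whether this crawler may fetch the given URL
if rp.can_fetch('MyCrawler', 'https://example.com/some/page'):
  print('Crawling this page is allowed')
else:
  print('Crawling this page is disallowed by robots.txt')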

05. It is not recommended to hide the connection-source IP address with a VPN or proxy server.

The IP address of the connection source is recorded by the destination website and may be used to block access or identify the visitor. To avoid this, you can hide your IP address with a VPN or proxy server (a service that places an intermediary between you and the destination); strictly speaking, the intermediary's IP address is what the destination sees. However, some websites block access from known VPNs and proxy servers. In addition, because all of your traffic passes through the intermediary, using an unreliable service can lead to information leakage. Finally, if the website cannot block your access by IP address, it is more likely to take stronger countermeasures.

06. Use the browser's automatic operation tool "Selenium" as a crawler.

From here, we move on to concrete crawler development. Many websites rely on JavaScript processing before their information is displayed. To obtain the information after that processing, let the browser (more precisely, its rendering engine) do the work. For that reason, it is recommended to use the browser automation tool "Selenium" as the crawler. The following site was very helpful for the installation procedure and basic operation of Selenium.

[Complete version] Cheat sheet that automatically operates (crawling / scraping) the browser with Python and Selenium | Tanuhack

This article uses Firefox as the target browser, but feel free to use another one such as Google Chrome or Microsoft Edge. The explanations below assume Firefox.

Firefox download page

Firefox Web Driver Download Page

It is possible to run the browser in the background (headless mode), but it is recommended to keep the screen visible at first: screen transitions sometimes fail, specified screen items sometimes cannot be found, and a site redesign may force you to rework the crawler completely.

Start Firefox


from selenium import webdriver
from selenium.webdriver.firefox.options import Options

#Configure options
options = Options()
#options.headless = True #Uncomment to run without showing the browser window.

#Launch the browser
driver = webdriver.Firefox(executable_path='Firefox WebDriver path (ex.~/geckodriver)', options=options)

#Configure wait times
driver.implicitly_wait(5) #Maximum seconds to wait for a screen item to appear
driver.set_page_load_timeout(180) #Maximum seconds to wait for a page to load

07. Access with the cookies of the target web service.

Most web services issue a cookie to each visitor's browser and use it to manage visitors. Since most crawlers access a site as if on a first visit (without cookies), simply carrying cookies increases the chance of being treated as an ordinary user. Create a new profile (browser user information) in your browser and access the target website manually.

After that, by specifying the path of that profile in Selenium, you can run the crawler in the state of a revisit (with cookies). Firefox profiles are stored in folders under the following location; if there are multiple profiles, you can identify the newly created one by its modification date and time: C:/Users/username/AppData/Roaming/Mozilla/Firefox/Profiles/

Set profile


#Load the previously created profile
profiler = webdriver.FirefoxProfile('Firefox profile path (ex.~/Firefox/Profiles/FOLDER-NAME)')
#Launch the browser with that profile
driver = webdriver.Firefox(executable_path='Firefox WebDriver path', options=options, firefox_profile=profiler)

08. To transition to any other page on the same site, rewrite the URL of the hyperlink and click it.

Selenium has commands for screen transitions; use the following one when first navigating to the target website. It is equivalent to typing the URL directly into the browser's address bar and transitioning to that screen.

Transition to another site page


driver.get('Any URL')

To move within the same site, use commands that imitate normal operations such as clicking hyperlinks and buttons.

Transition to another page on the same site


#Get the hyperlink
link = driver.find_element_by_css_selector('CSS selector')
#Click the obtained element
link.click()

However, you may want to move to an arbitrary page even within the same site (e.g., after collecting content from the detail screen of product A, a link to product B is shown as a related product and you want to transition to product B's detail screen). In that case, use JavaScript to rewrite the destination URL of a hyperlink or button, then click that item to transition.

Rewrite the URL of the hyperlink and move to another page on the same site


#Get a hyperlink
link = driver.find_element_by_css_selector('CSS selector')
#Scroll so that the obtained element is visible on the screen
driver.execute_script("arguments[0].scrollIntoView()", link)
#Rewrite the element's destination URL to the desired one
driver.execute_script("arguments[0].setAttribute('href','{}')".format('Any URL'), link)
#Click the element
link.click()

This avoids unnatural behavior such as typing a URL directly into the address bar even though the transition is within the same site. Technically speaking, the goal is to make the transition by raising a JavaScript click event with a referrer (the URL of the previous screen) from the same domain.
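
If you want to confirm that the referrer was actually carried over, one quick check (for verification only, not part of the crawling flow itself) is to read document.referrer after the transition.

Check the referrer after the transition

#Read the referrer of the current page via JavaScript
referrer = driver.execute_script("return document.referrer")
print(referrer) #Should be a URL on the same domain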

09. After a screen transition, wait for a random number of seconds.

After each screen transition, add a process that waits for a random number of seconds. This not only reduces the load on the website but also prevents screen-transition problems: depending on the waiting time, the next operation may run before the JavaScript processing has finished and the transition may fail. It is a good idea to tune this waiting time for each website.

Wait for random seconds


from time import sleep
import random

def get_wait_secs():
  """Get screen wait seconds"""
  max_wait = 7.0   #Maximum wait seconds
  min_wait = 3.0   #Minimum wait seconds
  mean_wait = 5.0  #Average wait seconds
  sigma_wait = 1.0 #Standard deviation (blurring width)
  return min([max_wait, max([min_wait, round(random.normalvariate(mean_wait, sigma_wait))])])

sleep(get_wait_secs())

10. When downloading the file, attach the user agent and referrer to the request.

File downloads, especially of image files, may be restricted as described above to prevent hotlinking and indiscriminate downloading. In normal browser operation, the user agent (information about the connection source's OS, browser, etc.) and the referrer (the URL of the previous screen) are set automatically, so attach them to the request as well.

Download image file


import requests
import shutil

#Get the image URL from the img tag
img = driver.find_element_by_css_selector('CSS selector to img tag')
src = img.get_attribute('src')
#Request the image with the user agent and referrer attached
r = requests.get(src, stream=True, headers={'User-Agent': 'User agent', 'Referer': driver.current_url})
if r.status_code == 200:
  #Save the response body to a file
  with open('File save destination path', 'wb') as f:
    r.raw.decode_content = True
    shutil.copyfileobj(r.raw, f)

11. Make the crawler's start time irregular, and limit each run to a few hours.

You can start the crawler manually, but you can also schedule it to start regularly. Each OS provides a standard tool for this (e.g., Task Scheduler on Windows, cron on Linux and macOS).

However, if you keep accessing the website at exactly the same time, the load on the website will concentrate at that time, so it is recommended to build some fluctuation into the access time. One way is to schedule the start and then pause for a random time using the same approach as in step 09.

Likewise, accessing the site continuously for many hours causes the same problem, so it is recommended to add a process that stops the crawler after it has been running for a while. A sketch combining both ideas follows.
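
As one possible sketch combining both ideas, the code below adds a random delay after the scheduled start and stops crawling once a runtime limit is exceeded. The target_urls list, the crawl_one_page function, and the concrete limits are hypothetical placeholders; get_wait_secs is the function from step 09.

Random start delay and runtime limit

from time import sleep, time
import random

#Blur the start time: wait a random 0 to 30 minutes after the scheduled start
sleep(random.uniform(0, 30 * 60))

max_runtime_secs = 3 * 60 * 60 #Stop the crawler after about 3 hours
start_time = time()

for url in target_urls: #target_urls: list of pages to crawl (prepared elsewhere, hypothetical)
  if time() - start_time > max_runtime_secs:
    break #Stop once the runtime limit is exceeded
  crawl_one_page(url) #Hypothetical function wrapping steps 08 to 10
  sleep(get_wait_secs()) #Random wait from step 09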

12. Add crawler operational rules to your ethics.

Beyond the points above, if you come up with operating rules of your own that avoid interfering with the target website, please adopt them. Use the websites you crawl with gratitude.

(As of September 6, 2020)
