The purpose of this article is to help anyone interested in automating web content collection develop an ideal crawler, presented here as 12 steps. A crawler is an automated program that visits websites and records and collects their content. By "ideal crawler" I mean one that complies with the law and with common ethics, and that does not interfere with the operation of the target website. A crawler that falls short of this may be denied access by websites or redirected to an error screen. Let your computer do the tedious work, stay out of trouble, and increase your free time.
■ Reference site (Python introductory site)
All web content is someone's copyrighted work and is therefore subject to copyright law. Under copyright law, the following purposes allow free use of a work.
■ Reference site
In particular, the latter two were given a clearer legal basis in the 2018 revision of the Copyright Law. The conditions and restrictions were also clarified, so please check the details.
■ Reference site
However, even for those purposes, the following points apply in common.
It is prohibited to download content that you know has been uploaded illegally. Under the revised Copyright Law (enforced on January 1, 2021, the third year of Reiwa), this applies regardless of the type of content (e.g., text, images, audio, video). It is highly recommended that you target websites whose authors have agreed to provide their content.
■ Reference site
If you agree to terms of use in order to become a member of a website, you must comply with those terms. If the terms of use prohibit crawlers, the website cannot be crawled.
A website may give crawlers instructions that apply to the entire site, to specific web pages, or to specific page elements. The main mechanisms are the following, and you need to follow them (a minimal check for the first one is sketched after the list).
robots.txt …… Site-wide crawl rules (a text file) ⇒ Robots.txt specifications | Google Search Developer Guide | Google Developers
robots meta tags, data-nosnippet, X-Robots-Tag …… Per-page crawl rules (keywords in the response) ⇒ Robots meta tags, data-nosnippet, and X-Robots-Tag specifications | Google Search Developer Guide
rel attribute of the a tag …… Crawl rules for each hyperlink on a web page (HTML tag attribute) ⇒ Tell Google the relationship of external links - Search Console Help
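As a minimal sketch of the first mechanism, Python's standard urllib.robotparser can tell you whether a given URL may be fetched. The URL and the user-agent name below are placeholders, not values from the original article.

Check robots.txt
from urllib.robotparser import RobotFileParser
#Placeholder URL: replace with the target site's robots.txt
rp = RobotFileParser('https://example.com/robots.txt')
rp.read()  #Fetch and parse robots.txt
#Placeholder crawler name and page URL to check
print(rp.can_fetch('MyCrawler', 'https://example.com/some/page'))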
The IP address of the connecting client is recorded by the destination website and may be used to block access or to identify the client. To avoid this, you can use a VPN or proxy server (a service that places an intermediary between you and the destination) to hide your IP address; strictly speaking, the intermediary's IP address is what the destination sees. However, some websites block access from VPNs and proxy servers. Also, because all communication passes through the intermediary, an untrustworthy service can lead to information leakage. And since the website can no longer block you by IP address, it is more likely to escalate to stronger countermeasures.
From here, we turn to concrete crawler development. Many websites render their information with JavaScript before it is displayed. To obtain the information after that processing, let a browser (more precisely, its rendering engine) do the work. For that reason, the browser automation tool "Selenium" is recommended as the crawler. The following site was very helpful for the installation procedure and basic operation of Selenium.
The author uses Firefox as the target browser, but you can use whichever you prefer, such as Google Chrome or Microsoft Edge. This article assumes Firefox.
Firefox Web Driver Download Page
It is possible to run the browser in the background (headless mode), but it is recommended to show the screen at first. Screen transitions sometimes fail, and a specified screen item sometimes cannot be retrieved. Also, if the website is redesigned, the crawler may need to be reworked substantially.
Start Firefox
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
#Setting options
options = Options()
#options.set_headless() #If you want to hide the screen, uncomment.
#Launch browser
driver = webdriver.Firefox(executable_path='Firefox WebDriver path (ex.~/geckodriver)', options=options)
#Wait time setting
driver.implicitly_wait(5) #Maximum wait seconds for screen items to be displayed
driver.set_page_load_timeout(180) #Maximum waiting seconds until screen display
Most web services issue a cookie to each visitor's browser and use that cookie to manage visitors. Since most crawlers access a site as a first-time visitor (without cookies), simply carrying cookies increases the chance of being treated as an ordinary user. Create a new profile (browser user data) in your browser and access the target website manually.
After that, by specifying the path of that profile in Selenium, you can run the crawler as a revisiting user (with cookies). Firefox profiles are stored in folders under the following location; if there are multiple profiles, you can identify the newly created one by its modification date and time (see the sketch below).
C:/Users/username/AppData/Roaming/Mozilla/Firefox/Profiles/
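As a minimal sketch (assuming the default location above, with 'username' as a placeholder for your own account), you could pick out the most recently modified profile folder like this:

Find the newest profile folder
from pathlib import Path
#Placeholder path: replace 'username' with your own account name
profiles_dir = Path('C:/Users/username/AppData/Roaming/Mozilla/Firefox/Profiles')
#The entry with the newest modification time is usually the most recently created profile
newest_profile = max(profiles_dir.iterdir(), key=lambda p: p.stat().st_mtime)
print(newest_profile)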
Set profile
#Load the previously created profile
profiler = webdriver.FirefoxProfile('Firefox profile path (ex.~/Firefox/Profiles/FOLDER-NAME)')
#Launch browser with that profile
driver = webdriver.Firefox(executable_path='Firefox WebDriver path', options=options, firefox_profile=profiler)
Selenium provides commands for screen transitions; use the following command when first transitioning to the target website. It is equivalent to typing the URL directly into the browser's address bar.
Transition to another site page
driver.get('Any URL')
To move within the same site, use commands that imitate normal operations such as clicking hyperlinks and buttons.
Transition to another page on the same site
#Get hyperlink
link = driver.find_element_by_css_selector('CSS selector')
#Click the previous item
link.click()
However, you may want to jump to an arbitrary page even within the same site (e.g., after collecting content from product A's detail page, product B appears as a related product and you want to transition to product B's detail page). In that case, rewrite the destination URL of a hyperlink or button with JavaScript and then transition by clicking that element.
Rewrite the URL of the hyperlink and move to another page on the same site
#Get hyperlink
link = driver.find_element_by_css_selector('CSS selector')
#Scroll to the position where the previous item can be seen on the screen
driver.execute_script("arguments[0].scrollIntoView()", link)
#Rewrite the transition destination URL of the previous item to an arbitrary one
driver.execute_script("arguments[0].setAttribute('href','{}')".format('Any URL'), link)
#Click the previous item
link.click()
This avoids unnatural behavior such as typing a URL directly into the address bar even though the transition is within the same site. Technically, the goal is to perform the transition by raising a JavaScript click event while a referrer (the URL of the previous screen) from the same domain is set.
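As a quick sanity check (an illustrative snippet, not from the original article), you can ask the destination page which referrer it actually received, using the driver started above:

Check the referrer after the transition
#Returns the referrer the destination page sees (empty string if none was sent)
print(driver.execute_script("return document.referrer;"))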
After each screen transition, add a process that waits a random number of seconds. This not only reduces the load on the website but also prevents transition problems. Depending on the wait time, there were cases where the next operation ran before the page's JavaScript had finished and the transition failed. It is a good idea to tune this wait time per website.
Wait for random seconds
from time import sleep
import random
def get_wait_secs():
    """Get screen wait seconds"""
    max_wait = 7.0  #Maximum wait seconds
    min_wait = 3.0  #Minimum wait seconds
    mean_wait = 5.0  #Average wait seconds
    sigma_wait = 1.0  #Standard deviation (blurring width)
    return min([max_wait, max([min_wait, round(random.normalvariate(mean_wait, sigma_wait))])])
sleep(get_wait_secs())
When downloading files, especially image files, the restrictions described above may be in place to prevent direct linking and indiscriminate downloading. In normal browser operation, the user agent (information about the connecting OS, browser, etc.) and the referrer (the URL of the previous screen) are set automatically, so set them explicitly in the download request as well.
Download image file
import requests
import shutil
img = driver.find_element_by_css_selector('CSS selector to img tag')
src = img.get_attribute('src')
r = requests.get(src, stream=True, headers={'User-Agent': 'User agent', 'Referer': driver.current_url})
if r.status_code == 200:
    with open('Screen save destination path', 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
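One way to fill in the 'User agent' placeholder above, assuming you want the download request to match the browser Selenium is driving, is to read the value from the browser itself (an illustrative snippet, not from the original article):

Get the browser's user agent
#Ask the running browser for its own user agent string
user_agent = driver.execute_script("return navigator.userAgent;")
#Pass this value as the 'User-Agent' header in the requests.get call above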
You can start the crawler manually, but you can also schedule it to start at regular intervals. Depending on the OS, the following tools are provided as standard.
Windows - Task Scheduler [[Windows 10 compatible] Automate regular work with Task Scheduler (1/2): Tech TIPS - @IT](https://www.atmarkit.co.jp/ait/articles/1305/31/news049.html)
macOS/Linux - cron: Cron Configuration Guide
However, if the crawler always accesses the website at exactly the same time, the load on the site will be concentrated at that moment, so it is recommended to add some fluctuation to the access time. One way is to start at the scheduled time and then sleep for a random period, using the same technique as in step 08.
Also, accessing continuously for hours on end puts a similar strain on the site, so it is recommended to add a process that stops the crawler after it has run for a certain time (see the sketch below).
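A minimal sketch of both ideas, with a hypothetical crawl() function and freely chosen limits (the 30-minute fluctuation and 2-hour run time are placeholders, not values from the original article):

Add start-time fluctuation and a run-time limit
from time import sleep, time
import random

def crawl():
    """Hypothetical placeholder for one crawl iteration"""
    pass

#Sleep a random number of seconds after the scheduled start (0 to 30 minutes here)
sleep(random.uniform(0, 30 * 60))

#Stop the crawler after a fixed running time (2 hours here)
start_time = time()
max_run_secs = 2 * 60 * 60
while time() - start_time < max_run_secs:
    crawl()
    sleep(random.uniform(3.0, 7.0))  #Random wait between iterations, as in step 08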
In addition to the above, if you know of other operating rules that do not interfere with the operation of target websites, please share them and I will introduce them here. Thank you for reading, and please use websites responsibly.
(As of September 6, 2020)