The purpose of this article is to help anyone interested in automating web content collection develop an ideal crawler, presented here as 12 steps. A crawler is an automated program that visits websites and records and collects their content. By "ideal crawler" I mean one that complies with the law and with common ethics, and that does not interfere with the operation of the target website. A crawler that falls short of this may be denied access by websites or redirected to an error screen. Let your computer do the tedious work, stay out of trouble, and increase your free time.
■ Reference site (Python introductory site)
All web content is someone's copyrighted work and is therefore subject to copyright law. Under copyright law, the following purposes allow free use of a work.
■ Reference site
In particular, the latter two were given a clearer legal basis in the 2018 revision of the Copyright Law. The conditions and restrictions were also clarified, so please check the details.
■ Reference site
However, even for those purposes, the following points apply in common.
It is prohibited to download content that you know has been uploaded illegally. Under the revised Copyright Law (enforced on January 1, 2021, the third year of Reiwa), this applies regardless of the type of content (e.g., text, images, audio, video). It is highly recommended that you target websites whose authors have agreed to provide their content.
■ Reference site
If you agree to terms of use in order to become a member of a website, you must comply with those terms. If the terms of use prohibit crawlers, the website cannot be crawled.
A website may give crawlers instructions that apply to the entire site, to specific web pages, or to specific page elements. The main mechanisms are the following, and you need to follow them (a minimal check for the first one is sketched after the list).
robots.txt …… Site-wide crawl rules (a text file) ⇒ Robots.txt specifications | Google Search Developer Guide | Google Developers
robots meta tags, data-nosnippet, X-Robots-Tag …… Per-page crawl rules (keywords in the response) ⇒ Robots meta tags, data-nosnippet, and X-Robots-Tag specifications | Google Search Developer Guide
rel attribute of the a tag …… Crawl rules for each hyperlink on a web page (HTML tag attribute) ⇒ Tell Google the relationship of external links - Search Console Help
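As a minimal sketch of the first mechanism, Python's standard urllib.robotparser can tell you whether a given URL may be fetched. The URL and the user-agent name below are placeholders, not values from the original article.

Check robots.txt
from urllib.robotparser import RobotFileParser
#Placeholder URL: replace with the target site's robots.txt
rp = RobotFileParser('https://example.com/robots.txt')
rp.read()  #Fetch and parse robots.txt
#Placeholder crawler name and page URL to check
print(rp.can_fetch('MyCrawler', 'https://example.com/some/page'))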
The IP address of the connecting client is recorded by the destination website and may be used to block access or to identify the client. To avoid this, you can use a VPN or proxy server (a service that places an intermediary between you and the destination) to hide your IP address; strictly speaking, the intermediary's IP address is what the destination sees. However, some websites block access from VPNs and proxy servers. Also, because all communication passes through the intermediary, an untrustworthy service can lead to information leakage. And since the website can no longer block you by IP address, it is more likely to escalate to stronger countermeasures.
From here, we turn to concrete crawler development. Many websites render their information with JavaScript before it is displayed. To obtain the information after that processing, let a browser (more precisely, its rendering engine) do the work. For that reason, the browser automation tool "Selenium" is recommended as the crawler. The following site was very helpful for the installation procedure and basic operation of Selenium.
The author uses Firefox as the target browser, but you can use whichever you prefer, such as Google Chrome or Microsoft Edge. This article assumes Firefox.
Firefox Web Driver Download Page
It is possible to run the browser in the background (headless mode), but it is recommended to show the screen at first. Screen transitions sometimes fail, and a specified screen item sometimes cannot be retrieved. Also, if the website is redesigned, the crawler may need to be reworked substantially.
Start Firefox
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
#Setting options
options = Options()
#options.set_headless() #If you want to hide the screen, uncomment.
#Launch browser
driver = webdriver.Firefox(executable_path='Firefox WebDriver path (ex.~/geckodriver)', options=options)
#Wait time setting
driver.implicitly_wait(5) #Maximum wait seconds for screen items to be displayed
driver.set_page_load_timeout(180) #Maximum waiting seconds until screen display
Most web services issue a cookie to each visitor's browser and use that cookie to manage visitors. Since most crawlers access a site as a first-time visitor (without cookies), simply carrying cookies increases the chance of being treated as an ordinary user. Create a new profile (browser user data) in your browser and access the target website manually.
After that, by specifying the path of that profile in Selenium, you can run the crawler as a revisiting user (with cookies). Firefox profiles are stored in folders under the following location; if there are multiple profiles, you can identify the newly created one by its modification date and time (see the sketch below).
C:/Users/username/AppData/Roaming/Mozilla/Firefox/Profiles/
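As a minimal sketch (assuming the default location above, with 'username' as a placeholder for your own account), you could pick out the most recently modified profile folder like this:

Find the newest profile folder
from pathlib import Path
#Placeholder path: replace 'username' with your own account name
profiles_dir = Path('C:/Users/username/AppData/Roaming/Mozilla/Firefox/Profiles')
#The entry with the newest modification time is usually the most recently created profile
newest_profile = max(profiles_dir.iterdir(), key=lambda p: p.stat().st_mtime)
print(newest_profile)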
Set profile
#Load the previously created profile
profiler = webdriver.FirefoxProfile('Firefox profile path (ex.~/Firefox/Profiles/FOLDER-NAME)')
#Launch browser with that profile
driver = webdriver.Firefox(executable_path='Firefox WebDriver path', options=options, firefox_profile=profiler)
Selenium provides commands for screen transitions; use the following command when first transitioning to the target website. It is equivalent to typing the URL directly into the browser's address bar.
Transition to another site page
driver.get('Any URL')
To move within the same site, use commands that imitate normal operations such as clicking hyperlinks and buttons.
Transition to another page on the same site
#Get hyperlink
link = driver.find_element_by_css_selector('CSS selector')
#Click the previous item
link.click()
However, you may want to jump to an arbitrary page even within the same site (e.g., after collecting content from product A's detail page, product B appears as a related product and you want to transition to product B's detail page). In that case, rewrite the destination URL of a hyperlink or button with JavaScript and then transition by clicking that element.
Rewrite the URL of the hyperlink and move to another page on the same site
#Get hyperlink
link = driver.find_element_by_css_selector('CSS selector')
#Scroll to the position where the previous item can be seen on the screen
driver.execute_script("arguments[0].scrollIntoView()", link)
#Rewrite the transition destination URL of the previous item to an arbitrary one
driver.execute_script("arguments[0].setAttribute('href','{}')".format('Any URL'), link)
#Click the previous item
link.click()
This avoids unnatural behavior such as typing a URL directly into the address bar even though the transition is within the same site. Technically, the goal is to perform the transition by raising a JavaScript click event while a referrer (the URL of the previous screen) from the same domain is set.
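As a quick sanity check (an illustrative snippet, not from the original article), you can ask the destination page which referrer it actually received, using the driver started above:

Check the referrer after the transition
#Returns the referrer the destination page sees (empty string if none was sent)
print(driver.execute_script("return document.referrer;"))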
After each screen transition, add a process that waits a random number of seconds. This not only reduces the load on the website but also prevents transition problems. Depending on the wait time, there were cases where the next operation ran before the page's JavaScript had finished and the transition failed. It is a good idea to tune this wait time per website.
Wait for random seconds
from time import sleep
import random
def get_wait_secs():
    """Get screen wait seconds"""
    max_wait = 7.0  #Maximum wait seconds
    min_wait = 3.0  #Minimum wait seconds
    mean_wait = 5.0  #Average wait seconds
    sigma_wait = 1.0  #Standard deviation (blurring width)
    return min([max_wait, max([min_wait, round(random.normalvariate(mean_wait, sigma_wait))])])
sleep(get_wait_secs())
When downloading files, especially image files, the restrictions described above may be in place to prevent direct linking and indiscriminate downloading. In normal browser operation, the user agent (information about the connecting OS, browser, etc.) and the referrer (the URL of the previous screen) are set automatically, so set them explicitly in the download request as well.
Download image file
import requests
import shutil
img = driver.find_element_by_css_selector('CSS selector to img tag')
src = img.get_attribute('src')
r = requests.get(src, stream=True, headers={'User-Agent': 'User agent', 'Referer': driver.current_url})
if r.status_code == 200:
    with open('Screen save destination path', 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
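One way to fill in the 'User agent' placeholder above, assuming you want the download request to match the browser Selenium is driving, is to read the value from the browser itself (an illustrative snippet, not from the original article):

Get the browser's user agent
#Ask the running browser for its own user agent string
user_agent = driver.execute_script("return navigator.userAgent;")
#Pass this value as the 'User-Agent' header in the requests.get call above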
You can start the crawler manually, but you can also schedule it to start at regular intervals. Depending on the OS, the following tools are provided as standard.
Windows - Task Scheduler [[Windows 10 compatible] Automate regular work with Task Scheduler (1/2): Tech TIPS - @IT](https://www.atmarkit.co.jp/ait/articles/1305/31/news049.html)
macOS/Linux - cron: Cron Configuration Guide
However, if the crawler always accesses the website at exactly the same time, the load on the site will be concentrated at that moment, so it is recommended to add some fluctuation to the access time. One way is to start at the scheduled time and then sleep for a random period, using the same technique as in step 08.
Also, accessing continuously for hours on end puts a similar strain on the site, so it is recommended to add a process that stops the crawler after it has run for a certain time (see the sketch below).
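A minimal sketch of both ideas, with a hypothetical crawl() function and freely chosen limits (the 30-minute fluctuation and 2-hour run time are placeholders, not values from the original article):

Add start-time fluctuation and a run-time limit
from time import sleep, time
import random

def crawl():
    """Hypothetical placeholder for one crawl iteration"""
    pass

#Sleep a random number of seconds after the scheduled start (0 to 30 minutes here)
sleep(random.uniform(0, 30 * 60))

#Stop the crawler after a fixed running time (2 hours here)
start_time = time()
max_run_secs = 2 * 60 * 60
while time() - start_time < max_run_secs:
    crawl()
    sleep(random.uniform(3.0, 7.0))  #Random wait between iterations, as in step 08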
In addition to the above, if you know of other operating rules that do not interfere with the operation of target websites, please share them and I will introduce them here. Thank you for reading, and please use websites responsibly.
(As of September 6, 2020)