The first thing to keep in mind when you start scraping is to comply with the relevant laws and rules, both explicit and implicit. Scraping is so convenient that these rules tend to be neglected (**especially by beginners**). This article is not about scraping rules, so please refer to the following articles for those.
I won't go into the rules here, but I will briefly touch on robots.txt, which is also the subject of this article.
robots.txt is a file that contains instructions for crawlers and scraping programs. By convention, robots.txt is placed directly under the root of the site, but since this is not an obligation, some sites do not provide one at all. For example, Qiita's robots.txt is located here.
Qiita robots.txt
User-agent: *
Disallow: /*/edit$
Disallow: /api/*
Allow: /api/*/docs$
I will explain briefly using Qiita's robots.txt above. User-agent indicates which crawlers the rules are addressed to; * means they apply to everyone. Next, Allow / Disallow indicate whether crawling the specified path is allowed or prohibited. In the example above, you can see that crawling https://qiita.com/api/* is prohibited, while https://qiita.com/api/*/docs$ is allowed. Depending on the site, Crawl-delay may also be set; even when it is not, it is good manners to wait at least **1 second** before sending the next request.
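As a minimal sketch of what respecting that interval can look like (the URL list and the fixed 1-second wait below are illustrative assumptions, not something specified by any particular site):
import time
import urllib.request

# Hypothetical list of pages to fetch, for illustration only
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
]

for url in urls:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    # Wait at least 1 second before the next request as a courtesy
    # to the server (use the site's Crawl-delay value if one is set)
    time.sleep(1)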
If you want to know the robots.txt specification in more detail, please refer to here.
The Python standard library urllib provides urllib.robotparser for reading robots.txt. This time, we will use it to create the program.
For urllib.robotparser, see here.
import urllib.robotparser

# Read robots.txt
robots_txt_url = 'https://qiita.com/robots.txt'
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots_txt_url)
rp.read()

# Check whether the given User-Agent is allowed to crawl the URL,
# based on the robots.txt information
user_agent = '*'
url = 'https://qiita.com/api/*'
result = rp.can_fetch(user_agent, url)
print(result)
Execution result
False
In the above program, we first create a RobotFileParser object, specify the URL of robots.txt with the set_url method, and load it with read. Then, by passing the User-Agent and the URL we want to check to the can_fetch method, we get a boolean value indicating whether access is permitted. As confirmed earlier, crawling https://qiita.com/api/* is not allowed, so False is printed.
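In practice, you would call can_fetch right before issuing a request and skip the URL when it returns False. The following is a minimal sketch of that pattern; the target URL is a hypothetical example chosen only for illustration:
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://qiita.com/robots.txt')
rp.read()

# Hypothetical target URL, used only to show the pattern
target_url = 'https://qiita.com/about'

if rp.can_fetch('*', target_url):
    with urllib.request.urlopen(target_url) as response:
        html = response.read()
    print(f'Fetched {len(html)} bytes from {target_url}')
else:
    print(f'Crawling {target_url} is not allowed by robots.txt')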
Most of the basic part of the program was completed in Step 1, but so far it only calls library functions and is not very useful as a program. So next, I would like to generate the robots.txt URL automatically using a regular expression. Since that alone may be hard to picture, I will explain with a concrete example.
For example, if the URL you want to check for crawl permission is https://qiita.com/api/*, the goal is to generate the link https://qiita.com/robots.txt from this URL. As mentioned earlier, robots.txt is conventionally placed directly under the site root, so if you can extract the https://qiita.com part from https://qiita.com/api/*, you can build the link simply by appending /robots.txt to it.
For the Python regular expression module re, refer to here.
import re

# Extract the site's root URL with a regular expression
def get_root_url(url):
    pattern = r'(?P<root>https?://.*?)\/.*'
    result = re.match(pattern, url)
    if result is not None:
        return result.group('root')

# Generate the robots.txt URL from the site's root URL
def get_robots_txt_path(root_url):
    return root_url + '/robots.txt'

url = 'https://qiita.com/api/*'
root_url = get_root_url(url)
robots_txt_url = get_robots_txt_path(root_url)
print(f'ROOT URL -> {root_url}')
print(f'ROBOTS.TXT URL -> {robots_txt_url}')
Execution result
ROOT URL -> https://qiita.com
ROBOTS.TXT URL -> https://qiita.com/robots.txt
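As an aside, the same robots.txt URL can also be built without a regular expression by using the standard library's urllib.parse; this is only an alternative sketch, not the approach used in this article:
from urllib.parse import urlsplit

def get_robots_txt_url(url):
    # Keep only the scheme and host of the URL and append /robots.txt
    parts = urlsplit(url)
    return f'{parts.scheme}://{parts.netloc}/robots.txt'

print(get_robots_txt_url('https://qiita.com/api/*'))
# -> https://qiita.com/robots.txt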
Based on Step 1 and Step 2, we use urllib.robotparser to add features such as retrieving Crawl-delay, and organize the functions into a class. I could paste the code here, but it is a bit long, so I have put it on GitHub. It is about 60 lines, so please take a look if you want to check the contents.
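As a rough idea of what such a class can look like (this is only a minimal sketch; the class name, defaults, and methods below are illustrative assumptions, not the code on GitHub):
import urllib.robotparser

class RobotsTxtChecker:
    """Minimal sketch: wraps urllib.robotparser for a single site."""

    def __init__(self, root_url, user_agent='*'):
        self.user_agent = user_agent
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.set_url(root_url + '/robots.txt')
        self.parser.read()

    def can_fetch(self, url):
        # True if the User-Agent is allowed to crawl the URL
        return self.parser.can_fetch(self.user_agent, url)

    def crawl_delay(self):
        # Crawl-delay value for the User-Agent, or None if not set
        return self.parser.crawl_delay(self.user_agent)

checker = RobotsTxtChecker('https://qiita.com')
print(checker.can_fetch('https://qiita.com/api/*'))  # -> False
print(checker.crawl_delay())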