The first thing to keep in mind when you start scraping is to comply with the relevant laws and rules, both explicit and implicit. Scraping is so convenient that these rules tend to be neglected (**especially by beginners**). This article is not about scraping rules, so please refer to the following articles for those.
I won't go into the rules here, but I will briefly touch on robots.txt, which is also the subject of this article.
robots.txt is a file that contains instructions for crawlers and scraping programs. By convention, robots.txt is placed directly under the root of the site, but since this is not an obligation, some sites do not provide one at all. For example, Qiita's robots.txt is located here.
Qiita robots.txt
User-agent: *
Disallow: /*/edit$
Disallow: /api/*
Allow: /api/*/docs$
I will explain briefly using Qiita's robots.txt above. User-agent indicates which crawlers the rules are addressed to; * means they apply to everyone. Next, Allow / Disallow indicate whether crawling the specified path is allowed or prohibited. In the example above, you can see that crawling https://qiita.com/api/* is prohibited, while https://qiita.com/api/*/docs$ is allowed. Depending on the site, Crawl-delay may also be set; even when it is not, it is good manners to wait at least **1 second** before sending the next request.
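As a minimal sketch of what respecting that interval can look like (the URL list and the fixed 1-second wait below are illustrative assumptions, not something specified by any particular site):
import time
import urllib.request

# Hypothetical list of pages to fetch, for illustration only
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
]

for url in urls:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    # Wait at least 1 second before the next request as a courtesy
    # to the server (use the site's Crawl-delay value if one is set)
    time.sleep(1)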
If you want to know the robots.txt specification in more detail, please refer to here.
The Python standard library urllib provides urllib.robotparser for reading robots.txt. This time, we will use it to create the program.
For urllib.robotparser, see here.
import urllib.robotparser

# Read robots.txt
robots_txt_url = 'https://qiita.com/robots.txt'
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots_txt_url)
rp.read()

# Check whether the given User-Agent is allowed to crawl the URL,
# based on the robots.txt information
user_agent = '*'
url = 'https://qiita.com/api/*'
result = rp.can_fetch(user_agent, url)
print(result)
Execution result
False
In the above program, we first create a RobotFileParser object, specify the URL of robots.txt with the set_url method, and load it with read. Then, by passing the User-Agent and the URL we want to check to the can_fetch method, we get a boolean value indicating whether access is permitted. As confirmed earlier, crawling https://qiita.com/api/* is not allowed, so False is printed.
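In practice, you would call can_fetch right before issuing a request and skip the URL when it returns False. The following is a minimal sketch of that pattern; the target URL is a hypothetical example chosen only for illustration:
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://qiita.com/robots.txt')
rp.read()

# Hypothetical target URL, used only to show the pattern
target_url = 'https://qiita.com/about'

if rp.can_fetch('*', target_url):
    with urllib.request.urlopen(target_url) as response:
        html = response.read()
    print(f'Fetched {len(html)} bytes from {target_url}')
else:
    print(f'Crawling {target_url} is not allowed by robots.txt')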
Most of the basic part of the program was completed in Step 1, but so far it only calls library functions and is not very useful as a program. So next, I would like to generate the robots.txt URL automatically using a regular expression. Since that alone may be hard to picture, I will explain with a concrete example.
For example, if the URL you want to check for crawl permission is https://qiita.com/api/*, the goal is to generate the link https://qiita.com/robots.txt from this URL. As mentioned earlier, robots.txt is conventionally placed directly under the site root, so if you can extract the https://qiita.com part from https://qiita.com/api/*, you can build the link simply by appending /robots.txt to it.
For the Python regular expression module re, refer to here.
import re

# Extract the site's root URL with a regular expression
def get_root_url(url):
    pattern = r'(?P<root>https?://.*?)\/.*'
    result = re.match(pattern, url)
    if result is not None:
        return result.group('root')

# Generate the robots.txt URL from the site's root URL
def get_robots_txt_path(root_url):
    return root_url + '/robots.txt'

url = 'https://qiita.com/api/*'
root_url = get_root_url(url)
robots_txt_url = get_robots_txt_path(root_url)
print(f'ROOT URL -> {root_url}')
print(f'ROBOTS.TXT URL -> {robots_txt_url}')
Execution result
ROOT URL -> https://qiita.com
ROBOTS.TXT URL -> https://qiita.com/robots.txt
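As an aside, the same robots.txt URL can also be built without a regular expression by using the standard library's urllib.parse; this is only an alternative sketch, not the approach used in this article:
from urllib.parse import urlsplit

def get_robots_txt_url(url):
    # Keep only the scheme and host of the URL and append /robots.txt
    parts = urlsplit(url)
    return f'{parts.scheme}://{parts.netloc}/robots.txt'

print(get_robots_txt_url('https://qiita.com/api/*'))
# -> https://qiita.com/robots.txt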
Based on Step 1 and Step 2, we use urllib.robotparser to add features such as retrieving Crawl-delay, and organize the functions into a class. I could paste the code here, but it is a bit long, so I have put it on GitHub. It is about 60 lines, so please take a look if you want to check the contents.
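As a rough idea of what such a class can look like (this is only a minimal sketch; the class name, defaults, and methods below are illustrative assumptions, not the code on GitHub):
import urllib.robotparser

class RobotsTxtChecker:
    """Minimal sketch: wraps urllib.robotparser for a single site."""

    def __init__(self, root_url, user_agent='*'):
        self.user_agent = user_agent
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.set_url(root_url + '/robots.txt')
        self.parser.read()

    def can_fetch(self, url):
        # True if the User-Agent is allowed to crawl the URL
        return self.parser.can_fetch(self.user_agent, url)

    def crawl_delay(self):
        # Crawl-delay value for the User-Agent, or None if not set
        return self.parser.crawl_delay(self.user_agent)

checker = RobotsTxtChecker('https://qiita.com')
print(checker.can_fetch('https://qiita.com/api/*'))  # -> False
print(checker.crawl_delay())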