HTTP is a protocol designed to be stateless. If you want to maintain state, use cookies. When writing a crawler you do not always have to implement the sending and receiving of cookies yourself: the Session object of the Requests library takes care of it. In addition, the Referer header can also convey state. This is used for implementing logins and similar flows.
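As a minimal sketch, the following logs in with a requests.Session and then reuses the session; the /login endpoint, the form field names, and the URLs are hypothetical.

```python
import requests

session = requests.Session()

# Log in; the Session stores the cookies returned by the server.
session.post(
    "https://example.com/login",
    data={"username": "user", "password": "password"},
)

# Subsequent requests automatically send those cookies, so they are made
# as the logged-in user. A Referer header can also be attached to convey
# where the request came from.
response = session.get(
    "https://example.com/mypage",
    headers={"Referer": "https://example.com/login"},
)
print(response.status_code)
```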
For SPAs and similar sites, the content is not included in the HTML itself, so the crawler has to interpret JavaScript. Tools for driving a browser automatically from a program are available for this, for example (see the sketch after the list):

- Selenium (a tool for automating browser operation from a program)
- Puppeteer (a Node.js library for automating Google Chrome)
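As a rough sketch of the Selenium route, the following drives headless Chrome from Python and reads the HTML after JavaScript has run; the target URL is hypothetical and a matching ChromeDriver is assumed to be available.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/spa-page")  # hypothetical SPA URL
    # page_source returns the DOM after JavaScript has rendered the content.
    html = driver.page_source
    print(html[:200])
finally:
    driver.quit()
```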
A third pattern is crawlers that target a large number of unspecified websites, such as Googlebot. Whichever of these three patterns your crawler falls into, you should be aware of the following points.
- Number of simultaneous connections
- Crawl interval

You have to be mindful of the load your crawler places on the target server.
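As a minimal sketch of keeping the load down, the following crawls a list of URLs one at a time with a fixed interval; the URL list and the one-second interval are placeholder values.

```python
import time

import requests

# Hypothetical list of URLs on the same host.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(1)  # crawl interval: wait between requests instead of hammering the server
```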
robots.txt and robots meta tags are widely used by website administrators to tell crawlers not to crawl particular pages.

- robots.txt: a text file placed in the top directory of a website
- robots meta tag: a meta tag in an HTML page that contains instructions to the crawler
You can read the information in robots.txt using the Python standard library module urllib.robotparser.
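For example, the standard library can answer whether a given user agent may fetch a given URL; the user agent name and URLs below are hypothetical.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether this user agent is allowed to fetch the page.
print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))

# Crawl-delay directive for this user agent, if the site specifies one.
print(rp.crawl_delay("MyCrawler"))
```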
An XML sitemap is an XML file that website administrators use to present crawlers with a list of URLs they want crawled. Crawling with reference to an XML sitemap is efficient, because you only request the pages that actually need to be crawled.
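As a small sketch, the following downloads a sitemap and lists the URLs in it; the sitemap location is hypothetical (in practice it is often advertised in robots.txt).

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical location

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Sitemap entries live in the sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in tree.getroot().findall("sm:url/sm:loc", ns):
    print(loc.text)
```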
You can access a site with an arbitrary string set in the User-Agent header.
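For instance, with Requests the header can be set per request; the crawler name and contact URL below are hypothetical.

```python
import requests

headers = {"User-Agent": "MyCrawler/1.0 (+https://example.com/contact)"}
response = requests.get("https://example.com/", headers=headers)
print(response.request.headers["User-Agent"])
```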
By branching the error handling on the status code, you can implement processing such as retrying when a temporary failure occurs (for example, a network error where the connection cannot be established).
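A minimal sketch of such a retry loop with Requests follows; the retry count and the exponential backoff are arbitrary choices for illustration.

```python
import time

import requests


def fetch_with_retry(url, max_retries=3):
    """Fetch a URL, retrying on connection errors and 5xx responses."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
        except requests.exceptions.ConnectionError:
            # Network error (e.g. could not connect): wait and retry.
            time.sleep(2 ** attempt)
            continue
        if response.status_code >= 500:
            # Temporary server-side error: wait and retry.
            time.sleep(2 ** attempt)
            continue
        return response
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts")
```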