HTTP is a protocol designed to be stateless. If you want to maintain state, use cookies. When writing a crawler you do not always have to implement the sending and receiving of cookies yourself: the Session object of the Requests library takes care of it. In addition, the Referer header can also convey state. This is used for implementing logins and similar flows.
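As a minimal sketch, the following logs in with a requests.Session and then reuses the session; the /login endpoint, the form field names, and the URLs are hypothetical.

```python
import requests

session = requests.Session()

# Log in; the Session stores the cookies returned by the server.
session.post(
    "https://example.com/login",
    data={"username": "user", "password": "password"},
)

# Subsequent requests automatically send those cookies, so they are made
# as the logged-in user. A Referer header can also be attached to convey
# where the request came from.
response = session.get(
    "https://example.com/mypage",
    headers={"Referer": "https://example.com/login"},
)
print(response.status_code)
```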
For SPAs and similar sites, the content is not included in the HTML itself, so the crawler has to interpret JavaScript. Tools for driving a browser automatically from a program are available for this, for example (see the sketch after the list):

- Selenium (a tool for automating browser operation from a program)
- Puppeteer (a Node.js library for automating Google Chrome)
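As a rough sketch of the Selenium route, the following drives headless Chrome from Python and reads the HTML after JavaScript has run; the target URL is hypothetical and a matching ChromeDriver is assumed to be available.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/spa-page")  # hypothetical SPA URL
    # page_source returns the DOM after JavaScript has rendered the content.
    html = driver.page_source
    print(html[:200])
finally:
    driver.quit()
```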
A third pattern is crawlers that target a large number of unspecified websites, such as Googlebot. Whichever of these three patterns your crawler falls into, you should be aware of the following points.
- Number of simultaneous connections
- Crawl interval

You have to be mindful of the load your crawler places on the target server.
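As a minimal sketch of keeping the load down, the following crawls a list of URLs one at a time with a fixed interval; the URL list and the one-second interval are placeholder values.

```python
import time

import requests

# Hypothetical list of URLs on the same host.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(1)  # crawl interval: wait between requests instead of hammering the server
```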
robots.txt and robots meta tags are widely used by website administrators to tell crawlers not to crawl particular pages.

- robots.txt: a text file placed in the top directory of a website
- robots meta tag: a meta tag in an HTML page that contains instructions to the crawler
You can read the information in robots.txt using the Python standard library module urllib.robotparser.
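For example, the standard library can answer whether a given user agent may fetch a given URL; the user agent name and URLs below are hypothetical.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether this user agent is allowed to fetch the page.
print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))

# Crawl-delay directive for this user agent, if the site specifies one.
print(rp.crawl_delay("MyCrawler"))
```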
An XML sitemap is an XML file that website administrators use to present crawlers with a list of URLs they want crawled. Crawling with reference to an XML sitemap is efficient, because you only request the pages that actually need to be crawled.
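As a small sketch, the following downloads a sitemap and lists the URLs in it; the sitemap location is hypothetical (in practice it is often advertised in robots.txt).

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical location

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Sitemap entries live in the sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in tree.getroot().findall("sm:url/sm:loc", ns):
    print(loc.text)
```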
You can access a site with an arbitrary string set in the User-Agent header.
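For instance, with Requests the header can be set per request; the crawler name and contact URL below are hypothetical.

```python
import requests

headers = {"User-Agent": "MyCrawler/1.0 (+https://example.com/contact)"}
response = requests.get("https://example.com/", headers=headers)
print(response.request.headers["User-Agent"])
```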
By branching the error handling on the status code, you can implement processing such as retrying when a temporary failure occurs (for example, a network error where the connection cannot be established).
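A minimal sketch of such a retry loop with Requests follows; the retry count and the exponential backoff are arbitrary choices for illustration.

```python
import time

import requests


def fetch_with_retry(url, max_retries=3):
    """Fetch a URL, retrying on connection errors and 5xx responses."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
        except requests.exceptions.ConnectionError:
            # Network error (e.g. could not connect): wait and retry.
            time.sleep(2 ** attempt)
            continue
        if response.status_code >= 500:
            # Temporary server-side error: wait and retry.
            time.sleep(2 ** attempt)
            continue
        return response
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts")
```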