This is a learning summary of "Python Crawling & Scraping [Enhanced Revised Edition]: A Practical Development Guide for Data Collection and Analysis". Chapter 4, titled "Methods for Practical Use", focuses on points to keep in mind when building crawlers.
- If you want to crawl a site that requires login, build a crawler that handles cookies.
- With Python's Requests library, a Session object automatically sends cookies back to the server, as in the sketch below.
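A minimal sketch with Requests, assuming a hypothetical login endpoint and form field names: the Session object stores the cookies returned at login and sends them with subsequent requests.

```python
import requests

# Log in once; the Session keeps any cookies the server sets.
# URL and form field names are placeholders.
session = requests.Session()
session.post('https://example.com/login',
             data={'username': 'user', 'password': 'pass'})

# The session cookie received at login is sent automatically here.
response = session.get('https://example.com/mypage')
print(response.status_code)
```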
To crawl sites built as SPAs (Single Page Applications), the crawler needs to execute JavaScript. To do this, use tools such as Selenium or Puppeteer to drive a browser automatically. Browsers such as Chrome and Firefox also have a headless mode that runs without a GUI, which is handy when building crawlers; a sketch follows.
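A minimal sketch with Selenium driving headless Chrome, assuming Selenium 4 and a local Chrome installation; the target URL is a placeholder.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a GUI so JavaScript-rendered pages can be
# fetched from a server or CI environment.
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com/')   # placeholder URL
    html = driver.page_source            # HTML after JavaScript has run
    print(html[:200])
finally:
    driver.quit()
```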
Crawlers like Googlebot, which target arbitrary sites, are harder to build than crawlers aimed at a specific site: they need a mechanism that does not depend on the structure of any particular page.
Copyrights to be aware of when building crawlers → the right of reproduction, the right of adaptation, and the right of public transmission. With the 2009 revision of the Copyright Act, reproduction for the purpose of information analysis, and reproduction, adaptation, and automatic public transmission for the purpose of providing search engine services, can be carried out without the copyright holder's permission.
Also, observe each site's terms of service. Personal information must be handled in accordance with the Act on the Protection of Personal Information.
How to avoid putting load on the crawled site: incidents like the [Okazaki Municipal Central Library Case - Wikipedia](https://ja.wikipedia.org/wiki/岡崎市立中央図書館事件) have actually happened.
- Number of simultaneous connections: recent browsers open up to 6 simultaneous connections per host, but a crawler fetches many pages over a long period, so it should use fewer.
- Crawl interval: it is customary to leave an interval of at least 1 second between requests (example: the crawler operated by the National Diet Library); see the sketch below.
- If there is a way to obtain the information other than HTML, such as RSS or XML feeds, use that instead.
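A minimal sketch of spacing out requests, assuming a hypothetical list of URLs on the same host: one session is reused and `time.sleep(1)` keeps at least a 1-second interval between requests.

```python
import time
import requests

# Placeholder URLs on the same host.
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
]

session = requests.Session()
for url in urls:
    response = session.get(url)
    # ... parse the response here ...
    time.sleep(1)  # wait at least 1 second before the next request
```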
`<meta name="robots" content="<attribute value>">`
- Attribute values include nofollow (do not follow links on the page), noarchive (do not cache the page), and noindex (do not let search engines index the page).
- As for netkeiba, which I am always scraping, there seem to be no particular instructions in its robots.txt or meta tags.
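Besides the meta tag, robots.txt itself can be checked with the standard library's urllib.robotparser. A sketch, assuming placeholder URLs and a hypothetical User-Agent name:

```python
from urllib.robotparser import RobotFileParser

# Download and parse robots.txt, then ask whether a URL may be crawled.
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('MyCrawler', 'https://example.com/some/page'):
    print('Allowed to crawl')
else:
    print('Disallowed by robots.txt')
```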
A sitemap is an XML file that tells the crawler which URLs you want it to crawl. It is more efficient than discovering pages by following links. Its location is declared with the Sitemap directive in robots.txt, as illustrated below.
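For illustration, a Sitemap directive in robots.txt and a minimal sitemap.xml might look like this; all URLs and the date are placeholders.

```text
# robots.txt
Sitemap: https://example.com/sitemap.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
</urlset>
```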
Contact information, such as an e-mail address or a URL, can be included in the User-Agent header of the requests the crawler sends (see the sketch below).
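A sketch of setting such a User-Agent with Requests; the crawler name, URL, and e-mail address are placeholders.

```python
import requests

# Identify the crawler and include contact information so the site
# operator can reach you if the crawler causes trouble.
headers = {
    'User-Agent': 'MyCrawler/1.0 (+https://example.com/crawler; contact@example.com)',
}
response = requests.get('https://example.com/', headers=headers)
print(response.status_code)
```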
Error handling is important so that the crawler does not put unnecessary load on the crawled site. When retrying after an error, take measures such as increasing the retry interval exponentially. Retry logic tends to become boilerplate, but the tenacity library lets you write it concisely, as in the sketch below.
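A sketch of exponential backoff with tenacity, assuming a placeholder URL: the wait grows exponentially (capped at 10 seconds) and the function gives up after 3 attempts.

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

# Retry with an exponentially increasing wait, capped at 10 seconds,
# and stop after 3 attempts.
@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, max=10))
def fetch(url: str) -> requests.Response:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an exception on 4xx/5xx
    return response

fetch('https://example.com/')  # placeholder URL
```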
- HTTP cache policy
  - An HTTP server can specify its cache policy in detail by adding cache-related headers to the response.
  - These headers can be divided into two types: "strong cache" and "weak cache".
  - Strong cache → Cache-Control (detailed directives such as whether to cache) and Expires (expiration date of the content). The client does not send a request while the cache is valid and uses the cached response until it expires.
  - Weak cache → Last-Modified (last modified date) and ETag (identifier). The client sends a request every time, but reuses the cached response if the content has not been updated.
  - In Python, the CacheControl library handles cache-related processing concisely: `pip install "CacheControl[filecache]"` (see the sketch below).
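A sketch of wrapping a Requests session with CacheControl and a file-based cache; the cache directory name and URL are placeholders.

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

# Wrap a Requests session so responses are cached according to the
# Cache-Control / Expires / Last-Modified / ETag headers sent by the
# server. FileCache persists the cache to the .webcache directory.
session = CacheControl(requests.session(), cache=FileCache('.webcache'))

response = session.get('https://example.com/')  # fetched from the network
response = session.get('https://example.com/')  # may be served from the cache
print(response.status_code)
```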
- Validate with regular expressions
- Validate with JSON Schema
  - In Python, the jsonschema library lets you write validation rules in a JSON-based format called JSON Schema: `pip install jsonschema` (see the sketch below).
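A sketch of validating a scraped item with jsonschema; the schema and the item are hypothetical examples.

```python
from jsonschema import validate, ValidationError

# Hypothetical rules for a scraped item: "name" must be a non-empty
# string and "price" an integer, and both keys must be present.
schema = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string', 'minLength': 1},
        'price': {'type': 'integer'},
    },
    'required': ['name', 'price'],
}

item = {'name': 'Example', 'price': 1000}

try:
    validate(instance=item, schema=schema)
except ValidationError as e:
    print('Validation failed:', e.message)
```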
If a change in page structure is detected through validation like this, notify yourself by e-mail and stop the crawler.
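A minimal notification sketch using the standard library, assuming an SMTP server on localhost; the addresses are placeholders.

```python
import smtplib
from email.message import EmailMessage

def notify_by_email(body: str) -> None:
    """Send a notification e-mail via a local SMTP server."""
    msg = EmailMessage()
    msg['Subject'] = 'Crawler stopped: page structure changed'
    msg['From'] = 'crawler@example.com'
    msg['To'] = 'me@example.com'
    msg.set_content(body)
    with smtplib.SMTP('localhost') as smtp:
        smtp.send_message(msg)
```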
(The rest of the chapter is omitted.)
My motivation dropped and the gap between posts grew, but for now, take this article as proof that I'm still alive (?)