This is a learning summary of "Python Crawling & Scraping [Enhanced Revised Edition]: A Practical Development Guide for Data Collection and Analysis". Chapter 4, titled "Methods for Practical Use", focuses on points to keep in mind when building crawlers.
- If you want to crawl a site that requires login, build a crawler that handles cookies.
- With Python's Requests library, a Session object automatically sends cookies back to the server, as in the sketch below.
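A minimal sketch with Requests, assuming a hypothetical login endpoint and form field names: the Session object stores the cookies returned at login and sends them with subsequent requests.

```python
import requests

# Log in once; the Session keeps any cookies the server sets.
# URL and form field names are placeholders.
session = requests.Session()
session.post('https://example.com/login',
             data={'username': 'user', 'password': 'pass'})

# The session cookie received at login is sent automatically here.
response = session.get('https://example.com/mypage')
print(response.status_code)
```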
To crawl sites built as SPAs (Single Page Applications), the crawler needs to execute JavaScript. To do this, use tools such as Selenium or Puppeteer to drive a browser automatically. Browsers such as Chrome and Firefox also have a headless mode that runs without a GUI, which is handy when building crawlers; a sketch follows.
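A minimal sketch with Selenium driving headless Chrome, assuming Selenium 4 and a local Chrome installation; the target URL is a placeholder.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a GUI so JavaScript-rendered pages can be
# fetched from a server or CI environment.
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com/')   # placeholder URL
    html = driver.page_source            # HTML after JavaScript has run
    print(html[:200])
finally:
    driver.quit()
```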
Crawlers like Googlebot, which target arbitrary sites, are harder to build than crawlers aimed at a specific site: they need a mechanism that does not depend on the structure of any particular page.
Copyrights to be aware of when building crawlers → the right of reproduction, the right of adaptation, and the right of public transmission. With the 2009 revision of the Copyright Act, reproduction for the purpose of information analysis, and reproduction, adaptation, and automatic public transmission for the purpose of providing search engine services, can be carried out without the copyright holder's permission.
Also, observe each site's terms of service. Personal information must be handled in accordance with the Act on the Protection of Personal Information.
How to avoid putting load on the crawled site: incidents like the [Okazaki Municipal Central Library Case - Wikipedia](https://ja.wikipedia.org/wiki/岡崎市立中央図書館事件) have actually happened.
- Number of simultaneous connections: recent browsers open up to 6 simultaneous connections per host, but a crawler fetches many pages over a long period, so it should use fewer.
- Crawl interval: it is customary to leave an interval of at least 1 second between requests (example: the crawler operated by the National Diet Library); see the sketch below.
- If there is a way to obtain the information other than HTML, such as RSS or XML feeds, use that instead.
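A minimal sketch of spacing out requests, assuming a hypothetical list of URLs on the same host: one session is reused and `time.sleep(1)` keeps at least a 1-second interval between requests.

```python
import time
import requests

# Placeholder URLs on the same host.
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
]

session = requests.Session()
for url in urls:
    response = session.get(url)
    # ... parse the response here ...
    time.sleep(1)  # wait at least 1 second before the next request
```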
`<meta name="robots" content="<attribute value>">`
- Attribute values include nofollow (do not follow links on the page), noarchive (do not cache the page), and noindex (do not let search engines index the page).
- As for netkeiba, which I am always scraping, there seem to be no particular instructions in its robots.txt or meta tags.
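Besides the meta tag, robots.txt itself can be checked with the standard library's urllib.robotparser. A sketch, assuming placeholder URLs and a hypothetical User-Agent name:

```python
from urllib.robotparser import RobotFileParser

# Download and parse robots.txt, then ask whether a URL may be crawled.
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('MyCrawler', 'https://example.com/some/page'):
    print('Allowed to crawl')
else:
    print('Disallowed by robots.txt')
```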
A sitemap is an XML file that tells the crawler which URLs you want it to crawl. It is more efficient than discovering pages by following links. Its location is declared with the Sitemap directive in robots.txt, as illustrated below.
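For illustration, a Sitemap directive in robots.txt and a minimal sitemap.xml might look like this; all URLs and the date are placeholders.

```text
# robots.txt
Sitemap: https://example.com/sitemap.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
</urlset>
```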
Contact information, such as an e-mail address or a URL, can be included in the User-Agent header of the requests the crawler sends (see the sketch below).
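A sketch of setting such a User-Agent with Requests; the crawler name, URL, and e-mail address are placeholders.

```python
import requests

# Identify the crawler and include contact information so the site
# operator can reach you if the crawler causes trouble.
headers = {
    'User-Agent': 'MyCrawler/1.0 (+https://example.com/crawler; contact@example.com)',
}
response = requests.get('https://example.com/', headers=headers)
print(response.status_code)
```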
Error handling is important so that the crawler does not put unnecessary load on the crawled site. When retrying after an error, take measures such as increasing the retry interval exponentially. Retry logic tends to become boilerplate, but the tenacity library lets you write it concisely, as in the sketch below.
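A sketch of exponential backoff with tenacity, assuming a placeholder URL: the wait grows exponentially (capped at 10 seconds) and the function gives up after 3 attempts.

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

# Retry with an exponentially increasing wait, capped at 10 seconds,
# and stop after 3 attempts.
@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, max=10))
def fetch(url: str) -> requests.Response:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an exception on 4xx/5xx
    return response

fetch('https://example.com/')  # placeholder URL
```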
- HTTP cache policy
  - An HTTP server can specify its cache policy in detail by adding cache-related headers to the response.
  - These headers can be divided into two types: "strong cache" and "weak cache".
  - Strong cache → Cache-Control (detailed directives such as whether to cache) and Expires (expiration date of the content). The client does not send a request while the cache is valid and uses the cached response until it expires.
  - Weak cache → Last-Modified (last modified date) and ETag (identifier). The client sends a request every time, but reuses the cached response if the content has not been updated.
  - In Python, the CacheControl library handles cache-related processing concisely: `pip install "CacheControl[filecache]"` (see the sketch below).
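A sketch of wrapping a Requests session with CacheControl and a file-based cache; the cache directory name and URL are placeholders.

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

# Wrap a Requests session so responses are cached according to the
# Cache-Control / Expires / Last-Modified / ETag headers sent by the
# server. FileCache persists the cache to the .webcache directory.
session = CacheControl(requests.session(), cache=FileCache('.webcache'))

response = session.get('https://example.com/')  # fetched from the network
response = session.get('https://example.com/')  # may be served from the cache
print(response.status_code)
```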
- Validate with regular expressions
- Validate with JSON Schema
  - In Python, the jsonschema library lets you write validation rules in a JSON-based format called JSON Schema: `pip install jsonschema` (see the sketch below).
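A sketch of validating a scraped item with jsonschema; the schema and the item are hypothetical examples.

```python
from jsonschema import validate, ValidationError

# Hypothetical rules for a scraped item: "name" must be a non-empty
# string and "price" an integer, and both keys must be present.
schema = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string', 'minLength': 1},
        'price': {'type': 'integer'},
    },
    'required': ['name', 'price'],
}

item = {'name': 'Example', 'price': 1000}

try:
    validate(instance=item, schema=schema)
except ValidationError as e:
    print('Validation failed:', e.message)
```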
If a change in page structure is detected through validation like this, notify yourself by e-mail and stop the crawler.
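A minimal notification sketch using the standard library, assuming an SMTP server on localhost; the addresses are placeholders.

```python
import smtplib
from email.message import EmailMessage

def notify_by_email(body: str) -> None:
    """Send a notification e-mail via a local SMTP server."""
    msg = EmailMessage()
    msg['Subject'] = 'Crawler stopped: page structure changed'
    msg['From'] = 'crawler@example.com'
    msg['To'] = 'me@example.com'
    msg.set_content(body)
    with smtplib.SMTP('localhost') as smtp:
        smtp.send_message(msg)
```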
(The rest of the chapter is omitted.)
My motivation dropped and the gap between posts grew, but for now, take this article as proof that I'm still alive (?)