When building a crawler, you should keep repeated execution in mind, which makes the caching behavior of the target site an important concern.
If the cache still applies when you run the crawler again, viewing the same page yields the same data, so there is no need to download it again.
In this post, I summarize how to handle the HTTP cache in Python.
The HTTP cache is defined in RFC 7234. By adding cache-related headers to a response, an HTTP server can instruct the HTTP client how the content should be cached.
HTTP header | Description |
---|---|
Cache-Control | Gives detailed caching instructions, such as whether the content may be cached |
Expires | Indicates when the content expires |
ETag | Identifies the content; the ETag value changes whenever the content changes |
Last-Modified | Indicates the date and time the content was last updated |
Pragma | Similar to Cache-Control, but now retained only for backward compatibility |
Vary | Lists the request headers whose values cause the server to return a different response |
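As a quick check, a sketch like the following prints these headers for a given page; the URL is a placeholder, and not every server sets all of them:

```python
import requests

# Placeholder URL; substitute the page you want to inspect.
response = requests.get('https://example.com/')

# Headers the server did not send print as None.
for name in ('Cache-Control', 'Expires', 'ETag', 'Last-Modified', 'Pragma', 'Vary'):
    print(f'{name}: {response.headers.get(name)}')
```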
Caching works in two typical ways. In the first, once the client caches a response, it uses the cached copy and sends no request at all until the cache expires.
In the second, the client sends a conditional request the next time, and the server answers with status code 304 and an empty response body if the content has not been updated, as sketched below.
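For illustration, here is a minimal sketch of a conditional request built by hand with requests; the URL is a placeholder, and it assumes the server returns an ETag or Last-Modified header:

```python
import requests

url = 'https://example.com/'  # placeholder URL

# First request: remember the validators the server sent, if any.
first = requests.get(url)
validators = {}
if 'ETag' in first.headers:
    validators['If-None-Match'] = first.headers['ETag']
if 'Last-Modified' in first.headers:
    validators['If-Modified-Since'] = first.headers['Last-Modified']

# Second request: ask the server whether the content has changed.
second = requests.get(url, headers=validators)
if second.status_code == 304:
    # Not modified: the 304 body is empty, so reuse the first response's content.
    body = first.content
else:
    body = second.content
```

Libraries such as CacheControl, used below, handle this bookkeeping automatically.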
```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

session = requests.Session()
# Wrap the Session in a caching session.
# The cache is saved as files in the .webcache directory.
cached_session = CacheControl(session, cache=FileCache('.webcache'))

response = cached_session.get('URL')
# The from_cache attribute shows whether the response was served from the cache.
print(f'from_cache: {response.from_cache}')
print(f'status_code: {response.status_code}')
```
From the second run onward, the cached content is returned.
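Note that some sites send no cache headers at all, in which case CacheControl caches nothing. If it is acceptable for your crawler to treat every page as fresh for a fixed period, the library's heuristics can impose an expiry; this is a sketch under that assumption, with the one-day period chosen arbitrarily:

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache
from cachecontrol.heuristics import ExpiresAfter

# Treat every response as cacheable for one day,
# even when the server sends no cache headers.
cached_session = CacheControl(
    requests.Session(),
    cache=FileCache('.webcache'),
    heuristic=ExpiresAfter(days=1),
)
```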