When building a crawler, you should keep repeated execution in mind, which makes the caching behavior of the target site an important concern.
If the cache still applies when you run the crawler again, viewing the same page yields the same data, so there is no need to download it again.
In this post, I summarize how to handle the HTTP cache in Python.
The HTTP cache is defined in RFC 7234. By adding cache-related headers to a response, an HTTP server can instruct the HTTP client how the content should be cached.
HTTP header | Description |
---|---|
Cache-Control | Gives detailed caching instructions, such as whether the content may be cached |
Expires | Indicates when the content expires |
ETag | Identifies the content; the ETag value changes whenever the content changes |
Last-Modified | Indicates the date and time the content was last updated |
Pragma | Similar to Cache-Control, but now retained only for backward compatibility |
Vary | Lists the request headers whose values cause the server to return a different response |
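As a quick check, a sketch like the following prints these headers for a given page; the URL is a placeholder, and not every server sets all of them:

```python
import requests

# Placeholder URL; substitute the page you want to inspect.
response = requests.get('https://example.com/')

# Headers the server did not send print as None.
for name in ('Cache-Control', 'Expires', 'ETag', 'Last-Modified', 'Pragma', 'Vary'):
    print(f'{name}: {response.headers.get(name)}')
```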
Caching works in two typical ways. In the first, once the client caches a response, it uses the cached copy and sends no request at all until the cache expires.
In the second, the client sends a conditional request the next time, and the server answers with status code 304 and an empty response body if the content has not been updated, as sketched below.
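For illustration, here is a minimal sketch of a conditional request built by hand with requests; the URL is a placeholder, and it assumes the server returns an ETag or Last-Modified header:

```python
import requests

url = 'https://example.com/'  # placeholder URL

# First request: remember the validators the server sent, if any.
first = requests.get(url)
validators = {}
if 'ETag' in first.headers:
    validators['If-None-Match'] = first.headers['ETag']
if 'Last-Modified' in first.headers:
    validators['If-Modified-Since'] = first.headers['Last-Modified']

# Second request: ask the server whether the content has changed.
second = requests.get(url, headers=validators)
if second.status_code == 304:
    # Not modified: the 304 body is empty, so reuse the first response's content.
    body = first.content
else:
    body = second.content
```

Libraries such as CacheControl, used below, handle this bookkeeping automatically.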
```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

session = requests.Session()
# Wrap the Session in a caching session.
# The cache is saved as files in the .webcache directory.
cached_session = CacheControl(session, cache=FileCache('.webcache'))

response = cached_session.get('URL')
# The from_cache attribute shows whether the response was served from the cache.
print(f'from_cache: {response.from_cache}')
print(f'status_code: {response.status_code}')
```
From the second run onward, the cached content is returned.
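Note that some sites send no cache headers at all, in which case CacheControl caches nothing. If it is acceptable for your crawler to treat every page as fresh for a fixed period, the library's heuristics can impose an expiry; this is a sketch under that assumption, with the one-day period chosen arbitrarily:

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache
from cachecontrol.heuristics import ExpiresAfter

# Treat every response as cacheable for one day,
# even when the server sends no cache headers.
cached_session = CacheControl(
    requests.Session(),
    cache=FileCache('.webcache'),
    heuristic=ExpiresAfter(days=1),
)
```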