This article is Day 19 of the Crawler / Web Scraping Advent Calendar 2015.
This time, I will show you how to write a script that lets you browse pages with an ordinary browser and automatically downloads the content whenever the URL matches a specific pattern.
Use mitmproxy

mitmproxy stands for Man-In-The-Middle Proxy. As the name implies, it acts as a proxy between the browser and the server. It is not just a proxy, though: it can also detect requests and responses and process them automatically, or rewrite their contents according to rules you define. Since mitmproxy is written in Python, the processing scripts must also be written in Python.
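As a minimal sketch of that inline-script API (the 2015-era hook signatures that take a context argument, the same style as the script used later in this article), something like the following would log every URL and rewrite response bodies. The rewrite rule here is just an illustration:

# sketch.py: a minimal mitmproxy inline script; run with: mitmproxy -s sketch.py

def request(context, flow):
    # called for every request that passes through the proxy
    context.log(flow.request.url)

def response(context, flow):
    # responses can be rewritten before they reach the browser;
    # this simple example assumes an uncompressed text body
    flow.response.content = flow.response.content.replace("http:", "https:")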
The goal of this article is to have PDF files downloaded automatically while you browse as usual. The following script was created for that purpose.
download.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import datetime
import subprocess

def request(context, flow):
    request = flow.request
    # a simple substring match: fire on any path containing ".pdf"
    if ".pdf" in request.path:
        cmdline = "wget"
        for k, v in request.headers.items():
            if k == "User-Agent":
                cmdline = cmdline + (" --user-agent=\"%s\"" % (v))
            elif k == "Referer":
                cmdline = cmdline + (" --referer=\"%s\"" % (v))
            elif k == "Host":
                # wget sets Host itself from the URL
                pass
            elif k == "Proxy-Connection":
                # meaningful only between the browser and the proxy
                pass
            elif k == "Accept-Encoding":
                # deliberately skipped; see the explanation below
                pass
            else:
                cmdline = cmdline + (" --header \"%s: %s\"" % (k, v.replace(":", " ")))
        url = request.url
        cmdline = cmdline + " \"" + url + "\""
        # prefix the file name with a timestamp to avoid collisions
        now = datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S')
        filename = now + "-" + os.path.basename(request.path)
        cmdline = cmdline + " -O " + filename
        cmdline = "( cd ~/pdf; " + cmdline + " )"
        # cmdline is a single shell string, so shell=True is required
        subprocess.call(cmdline, shell=True)
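The script saves files into ~/pdf, so create that directory beforehand:

$ mkdir -p ~/pdf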
Execute the above script using mitmproxy as follows.
$ mitmproxy -s download.py
mitmproxy will then start acting as an HTTP proxy on port 8080. Change your browser's proxy settings to localhost:8080, and all of the browser's traffic will automatically pass through mitmproxy.
While the script above is running, any request whose path contains .pdf will be downloaded with wget.
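You can check the behavior without a browser by sending a request through the proxy, for example with curl (the URL here is just a placeholder):

$ curl -x http://localhost:8080 http://example.com/sample.pdf -o /dev/null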
With this script, PDF files are collected automatically while you simply browse as usual.
For details on wget's command-line options and how they map to HTTP request headers, please consult other documents; here I will explain only one point, Accept-Encoding. This header tells the server what kind of compression it may apply to the response in order to reduce traffic. If you do not skip Accept-Encoding, the downloaded content will be compressed with gzip and unusable without extra work, a trap that is easy to fall into. By skipping the Accept-Encoding header, files are saved uncompressed, which avoids the trouble.
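To illustrate the pitfall: suppose Accept-Encoding had been forwarded and the server answered with gzip. The saved file (a hypothetical name below) would then need an extra decompression step before it could be used:

import gzip

# hypothetical file name: what wget saved is actually a gzip stream
with gzip.open("20151219120000-paper.pdf") as f:
    data = f.read()  # only now is this the real PDF content
with open("paper.pdf", "wb") as f:
    f.write(data)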