This article is Day 19 of the Crawler / Web Scraping Advent Calendar 2015.
This time, I will show you how to write a script that lets you browse pages with an ordinary browser and automatically downloads the content whenever the URL matches a specific pattern.
Use mitmproxy

mitmproxy stands for Man-In-The-Middle Proxy. As the name implies, it acts as a proxy between the browser and the server. It is not just a proxy, though: it can also detect requests and responses and process them automatically, or rewrite their contents according to rules you define. Since mitmproxy is written in Python, the processing scripts must also be written in Python.
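As a minimal sketch of that inline-script API (the 2015-era hook signatures that take a context argument, the same style as the script used later in this article), something like the following would log every URL and rewrite response bodies. The rewrite rule here is just an illustration:

# sketch.py: a minimal mitmproxy inline script; run with: mitmproxy -s sketch.py

def request(context, flow):
    # called for every request that passes through the proxy
    context.log(flow.request.url)

def response(context, flow):
    # responses can be rewritten before they reach the browser;
    # this simple example assumes an uncompressed text body
    flow.response.content = flow.response.content.replace("http:", "https:")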
The goal of this article is to have PDF files downloaded automatically while you browse as usual. The following script was created for that purpose.
download.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import datetime
import subprocess

def request(context, flow):
    request = flow.request
    # a simple substring match: fire on any path containing ".pdf"
    if ".pdf" in request.path:
        cmdline = "wget"
        for k, v in request.headers.items():
            if k == "User-Agent":
                cmdline = cmdline + (" --user-agent=\"%s\"" % (v))
            elif k == "Referer":
                cmdline = cmdline + (" --referer=\"%s\"" % (v))
            elif k == "Host":
                # wget sets Host itself from the URL
                pass
            elif k == "Proxy-Connection":
                # meaningful only between the browser and the proxy
                pass
            elif k == "Accept-Encoding":
                # deliberately skipped; see the explanation below
                pass
            else:
                cmdline = cmdline + (" --header \"%s: %s\"" % (k, v.replace(":", " ")))
        url = request.url
        cmdline = cmdline + " \"" + url + "\""
        # prefix the file name with a timestamp to avoid collisions
        now = datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S')
        filename = now + "-" + os.path.basename(request.path)
        cmdline = cmdline + " -O " + filename
        cmdline = "( cd ~/pdf; " + cmdline + " )"
        # cmdline is a single shell string, so shell=True is required
        subprocess.call(cmdline, shell=True)
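The script saves files into ~/pdf, so create that directory beforehand:

$ mkdir -p ~/pdf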
Execute the above script using mitmproxy as follows.
$ mitmproxy -s download.py
mitmproxy will then start acting as an HTTP proxy on port 8080. Change your browser's proxy settings to localhost:8080, and all of the browser's traffic will automatically pass through mitmproxy.
While the script above is running, any request whose path contains .pdf will be downloaded with wget.
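You can check the behavior without a browser by sending a request through the proxy, for example with curl (the URL here is just a placeholder):

$ curl -x http://localhost:8080 http://example.com/sample.pdf -o /dev/null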
With this script, PDF files are collected automatically while you simply browse as usual.
For details on wget's command-line options and how they map to HTTP request headers, please consult other documents; here I will explain only one point, Accept-Encoding. This header tells the server what kind of compression it may apply to the response in order to reduce traffic. If you do not skip Accept-Encoding, the downloaded content will be compressed with gzip and unusable without extra work, a trap that is easy to fall into. By skipping the Accept-Encoding header, files are saved uncompressed, which avoids the trouble.
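To illustrate the pitfall: suppose Accept-Encoding had been forwarded and the server answered with gzip. The saved file (a hypothetical name below) would then need an extra decompression step before it could be used:

import gzip

# hypothetical file name: what wget saved is actually a gzip stream
with gzip.open("20151219120000-paper.pdf") as f:
    data = f.read()  # only now is this the real PDF content
with open("paper.pdf", "wb") as f:
    f.write(data)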