Download files whose URLs match certain criteria while browsing

Introduction

This article is Day 19 of the Crawler / Web Scraping Advent Calendar 2015.

This time, I will show you how to write a script that, while you browse pages as usual, automatically downloads any resource whose URL matches a specific pattern.

Use mitmproxy.

What is mitmproxy?

mitmproxy stands for man-in-the-middle proxy. As the name implies, it sits between the browser and the server as a proxy. It is not just a proxy, though: it can hook into requests and responses to run automatic processing, or rewrite them according to rules you define. Since mitmproxy is written in Python, the hook scripts are written in Python as well.
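As an illustration (my own sketch, not from the original article), the inline-script interface works by defining hook functions that mitmproxy calls for each request or response. A minimal hook in the same old 0.x style as the script below, which simply records every URL that passes through the proxy:

```python
# Minimal sketch of a mitmproxy inline script (old 0.x hook API, matching
# the script used in this article). mitmproxy calls request() once for
# every request that passes through the proxy.

seen_urls = []

def request(context, flow):
    # flow.request holds the outgoing HTTP request; record its URL.
    seen_urls.append(flow.request.url)
```

Run it with `mitmproxy -s script.py`; each hook receives the flow object, so it can inspect or rewrite the request in place.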

What I want to do in this article

The purpose of this article is to automatically download matching content in the background while you browse normally.

Created script

The following script was created for the above purpose.

download.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import datetime
import subprocess

def request(context, flow):
  request = flow.request
  # Only act on requests whose path looks like a PDF.
  if ".pdf" in request.path:
    cmdline = "wget"
    for k, v in request.headers.items():
      if k == "User-Agent":
        cmdline = cmdline + (" --user-agent=\"%s\"" % (v))
      elif k == "Referer":
        cmdline = cmdline + (" --referer=\"%s\"" % (v))
      elif k in ("Host", "Proxy-Connection", "Accept-Encoding"):
        # Skip headers that wget should set itself (see the Supplement
        # below for why Accept-Encoding in particular must be dropped).
        pass
      else:
        cmdline = cmdline + (" --header \"%s: %s\"" % (k, v.replace(":", " ")))
    url = request.url
    cmdline = cmdline + " \"" + url + "\""
    # Prefix the file name with a timestamp to avoid collisions.
    now = datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S')
    filename = now + "-" + os.path.basename(request.path)
    cmdline = cmdline + " -O " + filename
    cmdline = "( cd ~/pdf; " + cmdline + " )"
    # shell=True is required because cmdline is a single shell command string.
    subprocess.call(cmdline, shell=True)

Execute the above script using mitmproxy as follows.

$ mitmproxy -s download.py

mitmproxy then starts listening as an HTTP proxy on port 8080 (the default).

Then change your browser's proxy settings to localhost:8080, and all browsing traffic will go through mitmproxy.

With the script loaded, any request whose path contains .pdf triggers a download of that URL with wget.

Commentary

With the above script, any PDF the browser requests while going through the proxy is saved to ~/pdf with a timestamped file name. Because the User-Agent, Referer, and other request headers are copied from the browser's request, the server sees a wget request that looks like the browser's own.
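To make this concrete, the header-to-option mapping from the script can be isolated as a plain function (a sketch of my own; `build_wget_args` is a hypothetical helper name, written list-style so that no shell quoting is needed):

```python
# Sketch: turn browser request headers into a wget argument list, mirroring
# the filtering done in download.py. Host, Proxy-Connection and
# Accept-Encoding are dropped; User-Agent and Referer get dedicated options.
SKIP = {"Host", "Proxy-Connection", "Accept-Encoding"}

def build_wget_args(url, headers):
    args = ["wget"]
    for k, v in headers.items():
        if k in SKIP:
            continue  # let wget supply (or omit) these headers itself
        elif k == "User-Agent":
            args.append("--user-agent=%s" % v)
        elif k == "Referer":
            args.append("--referer=%s" % v)
        else:
            args.append("--header=%s: %s" % (k, v))
    args.append(url)
    return args
```

Passing a list like this to `subprocess.call(args)` avoids `shell=True` and shell-quoting pitfalls entirely, at the cost of giving up the `cd ~/pdf` shell trick.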

Supplement

I will leave the details of wget's command-line options and HTTP request headers to other documentation, but one point deserves explanation: `Accept-Encoding`. This header tells the server what compression the client can accept, in order to reduce traffic. If you do not skip `Accept-Encoding`, the downloaded content will arrive gzip-compressed, and the saved file will be unusable without an extra decompression step — an easy trap to fall into.

By skipping the `Accept-Encoding` header, the file is saved uncompressed, which avoids that trouble.
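To see why this matters, here is what a gzip-encoded response body looks like before and after decompression (a self-contained illustration of my own, not part of the original script):

```python
import gzip

# A server honoring "Accept-Encoding: gzip" sends the body compressed.
body = b"%PDF-1.4 example body"
compressed = gzip.compress(body)

# Saving `compressed` to disk as-is yields an unusable file; it would need
# to be decompressed first. Stripping Accept-Encoding avoids this step.
assert compressed != body
assert gzip.decompress(compressed) == body
```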
