Download files in any format using Python

This post is the December 24th entry in the Crawler / Scraping Advent Calendar 2014.

Introduction

When browsing websites, you sometimes want to download all the files of a given format (zip, pdf, and so on) in one go.

You could download them by hand, but a scripting language such as Python or Ruby makes it easy to automate the process.

This time I wrote a download script in Python.

Library

Strictly speaking, the standard library alone would be enough, but this time I used requests and BeautifulSoup.

Library installation

pip install requests
pip install BeautifulSoup
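
For comparison, here is a minimal sketch of the standard-library-only route mentioned above. It assumes Python 3 (the script in this post is Python 2), and the names LinkCollector, extract_links and fetch_html are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose href contains a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.hrefs.append(href)


def extract_links(html, marker):
    """Return the matching hrefs found in an HTML string, in document order."""
    parser = LinkCollector(marker)
    parser.feed(html)
    return parser.hrefs


def fetch_html(url):
    """Fetch a page with urllib instead of requests (assumes UTF-8 text)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_links(fetch_html(url), "csv.zip") would play the role that requests and BeautifulSoup play in the script below.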

Source code

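The script fetches each page listed in urls, collects the links whose href contains EXTENSION, and downloads the first three matches, pausing one second before each request: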

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
import time

from BeautifulSoup import BeautifulSoup

BASE_URL = u"http://seanlahman.com/"
EXTENSION = u"csv.zip"

urls = [
    u"http://seanlahman.com/baseball-archive/statistics/",
]

for url in urls:

    download_urls = []
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    links = soup.findAll('a')

    # Extract the links whose href contains EXTENSION
    for link in links:

        href = link.get('href')

        if href and EXTENSION in href:
            download_urls.append(href)

    # Download the files (limited to the first 3 for the time being)
    for download_url in download_urls[:3]:

        # Sleep 1 second before each request
        time.sleep(1)

        file_name = download_url.split("/")[-1]

        if BASE_URL in download_url:
            r = requests.get(download_url)
        else:
            r = requests.get(BASE_URL + download_url)
        
        # Save the file (binary mode, because r.content holds raw bytes)
        if r.status_code == 200:
            with open(file_name, 'wb') as f:
                f.write(r.content)
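
The BASE_URL + download_url concatenation above only works when the href is relative to the site root. As a hedged alternative, the sketch below resolves each href against the page it was found on with urljoin. It assumes Python 3 (in Python 2, urljoin and urlsplit live in the urlparse module), and the helper names and the example file names are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def resolve_download_url(page_url, href):
    """Turn an href found on page_url into an absolute URL.

    Absolute hrefs are returned unchanged; root-relative and
    page-relative hrefs are resolved against page_url.
    """
    return urljoin(page_url, href)


def file_name_from_url(url):
    """Take the last path segment as the file name, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


if __name__ == "__main__":
    page = "http://seanlahman.com/baseball-archive/statistics/"
    # Hypothetical hrefs covering the three common cases
    for href in ["/files/example-csv.zip",
                 "example-csv.zip",
                 "http://example.com/data/example-csv.zip?dl=1"]:
        full = resolve_download_url(page, href)
        print(full, file_name_from_url(full))
```

With this approach the if/else on BASE_URL is no longer needed, because a single call handles both absolute and relative links.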

Conclusion

There is still plenty of room for improvement, such as error handling and more robust handling of the download URLs, but for the time being the script can download files of any format (zip, pdf, etc.).
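
As one possible direction for the error handling mentioned above, the following sketch records failed downloads and carries on instead of stopping at the first error. It assumes Python 3 and the requests library. DownloadError, fetch_bytes and download_all are illustrative names, not part of requests or of the original script.

```python
import os
import time


class DownloadError(Exception):
    """Raised when a single file could not be fetched."""


def fetch_bytes(url, timeout=30):
    """Fetch url with requests and return the body as bytes."""
    import requests  # third-party; imported lazily so download_all can be used with another fetcher
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx / 5xx responses as errors
    except requests.RequestException as exc:
        raise DownloadError("%s: %s" % (url, exc))
    return response.content


def download_all(urls, dest_dir=".", fetch=fetch_bytes, delay=1.0):
    """Download every URL, skipping failures.

    Returns (saved_paths, failures), where failures is a list of
    (url, error message) pairs.
    """
    saved, failed = [], []
    for url in urls:
        time.sleep(delay)  # pause before each request, as in the script above
        try:
            data = fetch(url)
        except DownloadError as exc:
            failed.append((url, str(exc)))
            continue
        path = os.path.join(dest_dir, url.split("/")[-1])
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
    return saved, failed
```

Passing the fetch function as a parameter is a design choice made here so the loop can be exercised without network access; in normal use the default fetch_bytes is enough.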

Scraping is very easy with a language like Python, so I think it is worth building up a collection of scripts like this one and adapting them to each site.

Reference link

About Sean Lahman Database
