Collecting cat images at per-second speed to join the Cat Hills tribe

Overview

The other day I wrote a script that automatically picks up cat images with feedparser. Thanks to it, my cat image collection has been progressing day by day ...

However, it runs slowly. It just isn't fast enough. So I'll try modifying the script to see whether it can be made faster.

First, analyze the current situation

First, I measured how slow the current script is. Below is the source with the timing logic added.

`get_cat.py`


# -*- coding: utf-8 -*-
import feedparser
import urllib
import os
import time


def download_picture(q, count=10):
    u"""Fetch count images of q."""
    count = str(count)
    feed = feedparser.parse("https://picasaweb.google.com/data/feed/base/all?q=" + q + "&max-results=" + count)
    # Images are saved in a folder named after the query, next to this script.
    save_dir = os.path.join(os.path.dirname(__file__), q)
    for entry in feed['entries']:
        url = entry.content[0].src
        # The folder's existence is checked for every single image.
        if not os.path.exists(save_dir):
            os.mkdir(save_dir)
        urllib.urlretrieve(url, os.path.join(save_dir, os.path.basename(url)))
        print('download:' + url)

if __name__ == "__main__":
    # Measure the wall-clock time of one run.
    time1 = time.time()
    download_picture("cat", 10)
    time1_2 = str(time.time() - time1)
    print("complete!(" + time1_2 + ")")

`result`


complete!(6.05635690689)

It took 6 seconds to download 10 images. I tried several times, but it always came out at about 6 seconds. At this rate I can only download 144,000 images in 24 hours. That's far from ideal.
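As a rough sketch of the measurement above, the following repeats a timed run a few times and extrapolates the rate to a full day. The helper names `average_seconds` and `images_per_day` are made up for this illustration, and a short `time.sleep` stands in for the real `download_picture("cat", 10)` call so that it runs offline.

```python
import time


def average_seconds(task, runs=3):
    """Run task() several times and return the mean wall-clock seconds."""
    elapsed = []
    for _ in range(runs):
        start = time.time()
        task()
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)


def images_per_day(images_per_run, seconds_per_run):
    """Extrapolate one measured run to 24 hours of back-to-back runs."""
    runs_per_day = 24 * 60 * 60 / float(seconds_per_run)
    return int(images_per_run * runs_per_day)


# Stand-in task (an assumption): replace with download_picture("cat", 10).
mean = average_seconds(lambda: time.sleep(0.01), runs=3)
print("mean seconds per run: %.3f" % mean)

# 10 images every 6 seconds, as measured in this article.
print(images_per_day(10, 6))
```

Averaging a few runs smooths out network jitter, which matters when the runs being compared differ by only a second or so.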

httplib2

I heard through the grapevine about a library called httplib2. If anything, it seems to be more standard than the standard library. Here are its features (↓).

You can use the cache.
You can use gzip compression.
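To get a feel for why these two features help, here is a toy sketch written for Python 3 that uses only the standard library. It is not how httplib2 is implemented: `CachingFetcher` and `fake_fetch` are hypothetical names, no network access happens, and a real HTTP cache decides what to reuse from the server's cache headers, which this toy ignores.

```python
import gzip


class CachingFetcher(object):
    """Toy stand-in for an HTTP client that remembers bodies by URL."""

    def __init__(self, fetch):
        self._fetch = fetch        # function that would really hit the network
        self._cache = {}           # url -> response body
        self.network_calls = 0     # how often we had to "go to the network"

    def get(self, url):
        if url not in self._cache:
            self.network_calls += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]


def fake_fetch(url):
    # Hypothetical, repetitive feed-like body; nothing is downloaded.
    return ("<entry src='%s'/>" % url).encode("utf-8") * 200


fetcher = CachingFetcher(fake_fetch)
first = fetcher.get("http://example.com/cat.jpg")
second = fetcher.get("http://example.com/cat.jpg")  # served from the cache

# gzip shrinks repetitive text such as an XML feed considerably.
compressed = gzip.compress(first)
print(fetcher.network_calls, len(first), len(compressed))
```

Note that gzip mainly pays off for text such as the feed itself. JPEG images are already compressed, so for the image downloads the cache is the feature more likely to matter.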

Isn't it wonderful? Let's use it now.

$ sudo pip install httplib2

Install it quickly, then modify the program.

`get_cat2.py`


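# -*- coding: utf-8 -*-
import feedparser
import httplib2
import os
import time


def download_picture(q, count=10):
    u"""Fetch count images of q."""
    count = str(count)
    feed = feedparser.parse("https://picasaweb.google.com/data/feed/base/all?q=" + q + "&max-results=" + count)
    # Responses are cached on disk in the ".cache" directory.
    http = httplib2.Http(".cache")
    # Create the destination folder once, before the loop.
    save_dir = os.path.join(os.path.dirname(__file__), q)
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    for entry in feed['entries']:
        url = entry.content[0].src
        # http.request() returns (response, content); content is the image body.
        response, content = http.request(url)
        with open(os.path.join(save_dir, os.path.basename(url)), 'wb') as f:
            f.write(content)
        print('download:' + url)

if __name__ == "__main__":
    # Measure the wall-clock time of one run.
    time1 = time.time()
    download_picture("cat", 10)
    time1_2 = str(time.time() - time1)
    print("complete!(" + time1_2 + ")")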

What changed:

The import statement has changed.
urllib.urlretrieve() is now http.request().
The folder's existence is no longer checked for every image.
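The last point can be sketched on its own: make sure the destination folder exists once, then write every file without checking again. This is a minimal illustration under assumptions, not the script itself. `ensure_dir` and `save_all` are hypothetical helpers, the image bodies are dummy bytes, and a temporary directory is used so nothing lands next to your script.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path if it does not exist yet and return it."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


def save_all(bodies, dest_dir):
    """Write each name -> bytes pair into dest_dir; return the saved names."""
    ensure_dir(dest_dir)  # checked once, not once per file
    for name, body in bodies.items():
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(body)
    return sorted(os.listdir(dest_dir))


# Dummy data standing in for downloaded images.
base = tempfile.mkdtemp()
cat_dir = os.path.join(base, "cat")
saved = save_all({"a.jpg": b"aaa", "b.jpg": b"bbb"}, cat_dir)
print(saved)
```

Using `with open(...)` also closes each file as soon as it is written, instead of leaving that to the garbage collector.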

Run it right away!

`Execution result`


First, the original program. About what I expected.
complete!(5.79861092567)

The revised version, first run. Hmm?
complete!(5.06348490715)
The revised version, second run. Oh...
complete!(1.20627403259)
The revised version, third run. Whoa!
complete!(0.768098115921)

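It's fast! This is practical enough. The big drop in the later results is presumably the cache at work: images that were already fetched don't have to be downloaded again. Calling it per-second speed is no exaggeration.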

**Conclusion: Use httplib2 to improve cat image collection.**