I tried to refactor the template code posted in "Getting images from Flickr API with Python" (Part 2)

Preamble

In Part 1, I refactored the template code posted in "Getting images from Flickr API with Python" and got image acquisition with the Flickr API working. However, because everything runs sequentially, it takes a long time to collect many images for many keywords. Here, I would like to switch to parallel processing and see how much the processing speed improves.

Code from last time

from flickrapi import FlickrAPI
import requests
import os, time, sys
import configparser
import time

#Image folder path
imgdir = os.path.join(os.getcwd(), "images")

#Use the Flickr API
def request_flickr(keyword, count=100, license=None):
    #Create a connected client and perform a search
    config = configparser.ConfigParser()
    config.read('secret.ini')

    flickr = FlickrAPI(config["private"]["key"], config["private"]["secret"], format='parsed-json')
    result = flickr.photos.search(
        text = keyword,           #Search keyword
        per_page = count,           #Number of acquired data
        media = 'photos',         #Collect photos
        sort = 'relevance',       #Sort by relevance
        safe_search = 1,          #Avoid violent images
        extras = 'url_l, license' #Extra information to get(URL for download, license)
    )

    return list(filter(lambda x : multiConditionLicenses(int(x["license"]), license), result["photos"]["photo"]))


def multiConditionLicenses(src, license=None):

    dst = []
    if license is None:
        dst.append(lambda x : 0 <= x)
    else :
        license_types = license.split("|")
        for t in license_types:
            if t == "All_Rights_Reserved": #Copywriter
                dst.append(lambda x : x == 0)
            elif t == "NonCommercial": #Non-commercial
                dst.append(lambda x : 1 <= x and x <= 3)
            elif t == "Commercial": #Commercialization
                dst.append(lambda x : 4 <= x and x <= 6)
            elif t == "UnKnown": #Commercialization
                dst.append(lambda x : x == 7)
            elif t == "US_Government_Work": #Commercialization
                dst.append(lambda x : x == 8)
            elif t == "PublicDomain": #Commercialization
                dst.append(lambda x : 9<= x and x <= 10)

    return 0 < sum([item(src) for item in dst])


#Download from image link
def download_img(url, file_name):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(r.content)

if __name__ == "__main__":

    #Start processing time measurement
    start = time.time()

    #Get query
    query = None
    with open("query.txt") as fin:
        query = fin.readlines()
    query = [ q.strip() for q in query]

    #Save folder
    for keyword in query:
        savedir = os.path.join(imgdir, keyword)
        #If not, create a folder
        if not os.path.isdir(savedir):
            os.mkdir(savedir)

        photos = request_flickr(keyword, count=500, license="NonCommercial|Commercial")

        for photo in filter(lambda p : "url_l" in p.keys(),  photos):
            url = photo['url_l']
            filepath = os.path.join(os.path.join(imgdir, keyword), photo['id'] + '.jpg')
            download_img(url, filepath)
            time.sleep(1)

    print('processing time', (time.time() - start), "Seconds")

What I want to fix

--I want to parallelize searching multiple keywords against the Flickr API.
--I want to parallelize downloading from the image links.

Method

For parallel processing I considered concurrent.futures.ThreadPoolExecutor, but joblib is simpler to write, so I will use joblib. With a list comprehension, a parallel call can be written in one line as follows.

Parallel(n_jobs=8)([delayed({callback_func})(param1, param2, ...) for {element} in {list}])
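
As a concrete illustration of this pattern, here is a minimal sketch: the square function is a hypothetical callback used only for the example, not part of the article's code.

from joblib import Parallel, delayed

def square(x):
    #Hypothetical callback used only to illustrate the Parallel/delayed pattern
    return x * x

#Run square() on each element with up to 8 worker processes
results = Parallel(n_jobs=8)([delayed(square)(i) for i in range(10)])
print(results)  #[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]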

Here I layer and parallelize two processes: a parent level, which sends a search request to the Flickr API for each keyword, and a child level, which downloads the multiple image URLs obtained from each keyword's API response.


#Parent-level process
def main_process(keyword, count=100, wait_time=1):
    #Retrieving and storing results
    photos = request_flickr(keyword, count=count)

    #Download the images (caller of the child-level process)
    Parallel(n_jobs=-1)([delayed(sub_process)(photo, keyword=keyword, wait_time=wait_time) for photo in photos])

#Child-level process
def sub_process(src, keyword, wait_time=1):
    url = "https://farm{farm_id}.staticflickr.com/{server_id}/{id}_{secret}.jpg " \
            .format(farm_id=src["farm"],
                    server_id=src["server"],
                    id=src["id"],
                    secret=src["secret"])
    filepath = os.path.join(os.path.join(imgdir, keyword), src['id'] + '.jpg')
    download_img(url, filepath)
    time.sleep(wait_time)

if __name__ == "__main__":
    ...
    query = ["Ikebukuro","Otsuka","Sugamo","Komagome","Tabata"]
    #Request multiple keywords to the Flickr API (caller of the parent-level process)
    Parallel(n_jobs=-1)([delayed(main_process)(keyword, count=500, wait_time=1) for keyword in query])
    ...

The n_jobs parameter specifies the number of worker processes: 1 effectively runs the jobs sequentially, and -1 uses as many workers as there are CPUs.
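
To get a feel for what n_jobs does, here is a minimal sketch; the work function is a hypothetical stand-in that just sleeps for one second.

import os
import time
from joblib import Parallel, delayed

def work(i):
    #Hypothetical task: pretend to do one second of work
    time.sleep(1)
    return i

print("os.cpu_count() =", os.cpu_count())

for n_jobs in (1, -1):
    start = time.time()
    Parallel(n_jobs=n_jobs)([delayed(work)(i) for i in range(8)])
    print("n_jobs =", n_jobs, ":", time.time() - start, "seconds")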

I actually tried it

Preparation

As keywords, I used the station names of the Yamanote Line.

query.txt


Ikebukuro
Otsuka
Sugamo
Komagome
Tabata
Nishinippori
Nippori
Uguisudani
Ueno
Okachimachi
Akihabara
Kanda
Tokyo
Yurakucho
Shimbashi
Hamamatsucho
Tamachi
Shinagawa
Osaki
Gotanda
Meguro
Ebisu
Shibuya
Harajuku
Yoyogi
Shinjuku
Shin-Okubo
Takadanobaba
Mejiro

Whole code


from flickrapi import FlickrAPI
from urllib.request import urlretrieve
import requests
import os, time, sys
import configparser
import time
from joblib import Parallel, delayed

#Image folder path
imgdir = os.path.join(os.getcwd(), "images")
__JOB_COUNT__ = 1

#Use the Flickr API
def request_flickr(keyword, count=100, license=None):
    #Create a connected client and perform a search
    config = configparser.ConfigParser()
    config.read('secret.ini')

    flickr = FlickrAPI(config["private"]["key"], config["private"]["secret"], format='parsed-json')
    result = flickr.photos.search(
        text = keyword,           #Search keyword
        per_page = count,           #Number of acquired data
        media = 'photos',         #Collect photos
        sort = 'relevance',       #Sort by relevance
        safe_search = 1,          #Avoid violent images
        extras = 'license' #Extra information to get (license)
    )
    return list(filter(lambda x : multiConditionLicenses(int(x["license"]), license), result["photos"]["photo"]))


def multiConditionLicenses(src, license=None):

    dst = []
    if license is None:
        dst.append(lambda x : 0 <= x)
    else :
        license_types = license.split("|")
        for t in license_types:
            if t == "All_Rights_Reserved": #Copywriter
                dst.append(lambda x : x == 0)
            elif t == "NonCommercial": #Non-commercial
                dst.append(lambda x : 1 <= x and x <= 3)
            elif t == "Commercial": #Commercialization
                dst.append(lambda x : 4 <= x and x <= 6)
            elif t == "UnKnown": #Commercialization
                dst.append(lambda x : x == 7)
            elif t == "US_Government_Work": #Commercialization
                dst.append(lambda x : x == 8)
            elif t == "PublicDomain": #Commercialization
                dst.append(lambda x : 9<= x and x <= 10)

    return 0 < sum([item(src) for item in dst])


#Download from image link
def download_img(url, file_name):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(r.content)
    else :
        print("not download:{}".format(url))

#Parent-level process
def main_process(keyword, count=100, wait_time=1):
    #Retrieving and storing results
    photos = request_flickr(keyword, count=count)
    
    #Download the images (caller of the child-level process)
    Parallel(n_jobs=__JOB_COUNT__)([delayed(sub_process)(photo, keyword=keyword, wait_time=wait_time) for photo in photos])

#Child-level process
def sub_process(src, keyword, wait_time=1):
    url = "https://farm{farm_id}.staticflickr.com/{server_id}/{id}_{secret}.jpg " \
            .format(farm_id=src["farm"],
                    server_id=src["server"],
                    id=src["id"],
                    secret=src["secret"])
    filepath = os.path.join(os.path.join(imgdir, keyword), src['id'] + '.jpg')
    download_img(url, filepath)
    time.sleep(wait_time)


if __name__ == "__main__":

    #Start processing time measurement
    start = time.time()

    #Get query
    query = None
    with open("query.txt") as fin:
        query = fin.readlines()
    query = [ q.strip() for q in query]

    #Save folder
    for keyword in query:
        savedir = os.path.join(imgdir, keyword)
        #If not, create a folder
        if not os.path.isdir(savedir):
            os.mkdir(savedir)

    #Request multiple keywords to the Flickr API (caller of the parent-level process)
    Parallel(n_jobs=__JOB_COUNT__)([delayed(main_process)(keyword, count=10, wait_time=1) for keyword in query])

    print('Parallel processing', (time.time() - start), "Seconds")

The difference from last time is that each image is now fetched through the link https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}.jpg, which retrieves the image more reliably. (See Flickr API: Photo Source URLs.)

This time no license parameter is passed, since the purpose is to measure download speed, and count is set to 10, so 290 images are downloaded in total (29 keywords × 10 images). The sleep after downloading from each image URL is set to 0.5 seconds. With this setup, I measured the processing time with 1, 2, 4, 8, 16, 24, 32, and max (-1) processes.
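
For the measurement I simply changed the job count and reran the script. Roughly, the sweep looks like the sketch below; run_all is a hypothetical wrapper around the parent-level Parallel call, and main_process and query are assumed to be the ones defined in the script above.

import time
from joblib import Parallel, delayed

def run_all(n_jobs):
    #Hypothetical wrapper: the parent-level Parallel call from the script above,
    #with the number of workers passed in instead of __JOB_COUNT__
    Parallel(n_jobs=n_jobs)([delayed(main_process)(keyword, count=10) for keyword in query])

for n_jobs in (1, 2, 4, 8, 16, 24, 32, -1):
    start = time.time()
    run_all(n_jobs)
    print(n_jobs, "processes:", time.time() - start, "seconds")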

Result

| Number of processes | Processing time (sec) |
|---|---|
| 1 | 360.21357011795044 |
| 2 | 83.60558104515076 |
| 4 | 27.984444856643677 |
| 8 | 11.372981071472168 |
| 16 | 8.048759937286377 |
| 24 | 11.179131984710693 |
| 32 | 11.573050022125244 |
| max (n_jobs=-1) | 25.939302921295166 |

(Figure: raw data)

(Figure: the same data with processing time on a logarithmic scale)

Processing completes 40 to 50 times faster than sequential processing. :scream_cat: Interestingly, a fixed value of 16 is faster than n_jobs = -1, even though -1 is supposed to use the maximum. In my environment `os.cpu_count()` returns 4, so the sweet spot probably depends on the number of CPU cores.

As an aside, the Flickr API is limited to 3600 requests per hour, but that still leaves plenty of room for this kind of loop processing.
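
If you do push it with large loops, one simple precaution is to throttle the API calls on the client side. Below is a minimal sketch: the wrapper name is hypothetical, the interval is derived from the 3600 calls/hour figure above, and note that with process-based workers each process keeps its own _last_call, so this only throttles calls made within a single process.

import time

MIN_INTERVAL = 3600 / 3600  #seconds between calls to stay under 3600 calls/hour

_last_call = 0.0

def throttled_request_flickr(keyword, **kwargs):
    #Hypothetical wrapper around request_flickr() that enforces a minimum
    #interval between consecutive Flickr API calls
    global _last_call
    wait = MIN_INTERVAL - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.time()
    return request_flickr(keyword, **kwargs)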

In conclusion

Fluent Python, Chapter 17, has a sample that downloads national flags in parallel, but the Flickr API used here feels more practical. It also makes a good subject to practice on when using Future or when you want to do more finely tuned parallel processing. :curry:

Links that may be helpful

--joblib.Parallel (official doc)
--I thoroughly investigated parallel and concurrent processing in Python
--About parallel computing with Joblib in Python
--Python parallel processing (multiprocessing and Joblib)
--Fluent Python Chapter 17 Concurrency with futures (Python 3.7 version)
--[Parallel processing with python joblib](http://data-analysis-stats.jp/2019/10/24/python%E3%81%AEjoblib%E3%81%A7%E4%B8%A6%E5%88%97%E5%87%A6%E7%90%86/)
