I want to collect 1,000 images for machine learning. This time, let's collect images with Python 3 using the [Bing Image Search API](https://azure.microsoft.com/ja-jp/services/cognitive-services/bing-image-search-api/), a search engine API provided by Bing (Microsoft).
The Bing Image Search API also has a test tool available here.
First, create a Microsoft account to get the API key needed to call the API. To be honest, this part is tedious.
Microsoft seems to be unifying its various services under the "Cognitive ..." umbrella, and existing services have been renamed, moved, and re-versioned along the way. A new version, Bing Search API v5, was released on July 1, 2016, so even after searching the web it is hard to tell which registration procedure is the current (correct) one.
For now, the following steps seem to be the minimum required, so please try them. (You may also need to register another account.)
By the way, to create an account (i.e., register for Microsoft Azure), credit card registration is required even if you stay within the free tier, just as with Google's Cloud Platform (of course, you will not be charged as long as you stay within the free tier).
Also, new registrations come with a $200 coupon that is valid for only 30 days. For collecting images for fun, this feels like more than enough to stay effectively free.
Beyond the free tier, you are charged $3 per 1,000 transactions (up to 150 images can be acquired per transaction) (* in the case of the lowest API tier, S1). For example, the 1,000 images collected here fit in just 7 transactions. Pricing details are [here](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/search-api/web/). It is cheaper than Google's "Custom Search API", its image search counterpart.
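As a rough back-of-the-envelope sketch of the cost: the $3 per 1,000 transactions figure and the 150-images-per-transaction cap are from the pricing above, and everything else is simple arithmetic.

```python
import math

def estimate_cost(num_images, imgs_per_transaction=150, price_per_1000_tx=3.0):
    """Rough cost estimate based on the S1 pricing quoted above."""
    transactions = math.ceil(num_images / imgs_per_transaction)
    return transactions, transactions / 1000 * price_per_1000_tx

print(estimate_cost(1000))   # (7, 0.021) -> 7 transactions, about 2 cents
```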
Click "Start for free" at the bottom left. Register an account as you are told
After establishing an account, you will land on this page of Microfost Azure.
Click the blue button labeled "Portal" in the upper right.
Jump to the dashboard page that manages the API to be used.
You can set a new API to use from "+ New" on the left menu.
Search for "Bing Search APIs" in the search window and click on the results.
Enter the following information of Create, check confirm and "Create".
If "Create" is successful, a panel with the name of Name will appear on the dashboard, so click on it.
Click because there is "Keys" in the left menu of the clicked destination
Note that the "KEY 1" that appears there is the key needed to hit the API (probably "KEY 2" is also ok)
For now, here is working code (bing_api.py). (The most minimal script needed to call the API is here.)
As an example, let's collect 1,000 images that match the search word "cat" (a Japanese search word also works).
The Python version is 3.5.2, and you just run `python3 bing_api.py`. When executed, directories named `corr_table`, `imgs`, and `pickle_files` will be created under the directory specified by `save_dir_path`, and the data will be generated under them.
bing_api.py
```python
# -*- coding: utf-8 -*-
import http.client
import json
import re
import requests
import os
import math
import pickle
import urllib.parse
import hashlib
import sha3  # pysha3: provides hashlib.sha3_256 on Python 3.5 (built in from 3.6)


def make_dir(path):
    if not os.path.isdir(path):
        os.mkdir(path)


def make_correspondence_table(correspondence_table, original_url, hashed_url):
    """Create reference table of hash value and original URL."""
    correspondence_table[original_url] = hashed_url


def make_img_path(save_dir_path, url):
    """Hash the image url and create the path

    Args:
        save_dir_path (str): Path to save image dir.
        url (str): An url of image.

    Returns:
        Path of hashed image URL.
    """
    save_img_path = os.path.join(save_dir_path, 'imgs')
    make_dir(save_img_path)

    file_extension = os.path.splitext(url)[-1]
    if file_extension.lower() in ('.jpg', '.jpeg', '.gif', '.png', '.bmp'):
        encoded_url = url.encode('utf-8')  # required encoding for hashing
        hashed_url = hashlib.sha3_256(encoded_url).hexdigest()
        full_path = os.path.join(save_img_path, hashed_url + file_extension.lower())

        make_correspondence_table(correspondence_table, url, hashed_url)

        return full_path
    else:
        raise ValueError('Not applicable file extension')


def download_image(url, timeout=10):
    response = requests.get(url, allow_redirects=True, timeout=timeout)
    if response.status_code != 200:
        raise Exception("HTTP status: " + str(response.status_code))

    content_type = response.headers["content-type"]
    if 'image' not in content_type:
        raise Exception("Content-Type: " + content_type)

    return response.content


def save_image(filename, image):
    with open(filename, "wb") as fout:
        fout.write(image)


if __name__ == "__main__":
    save_dir_path = '/path/to/save/dir'
    make_dir(save_dir_path)

    num_imgs_required = 1000  # Number of images you want. The number to be divisible by 'num_imgs_per_transaction'
    num_imgs_per_transaction = 150  # default 30, Max 150
    offset_count = math.floor(num_imgs_required / num_imgs_per_transaction)

    url_list = []
    correspondence_table = {}

    headers = {
        # Request headers
        'Content-Type': 'multipart/form-data',
        'Ocp-Apim-Subscription-Key': 'xxxxxxxxxxxxxxxxxxxxxxxxxxx',  # API key
    }

    for offset in range(offset_count):

        params = urllib.parse.urlencode({
            # Request parameters
            'q': 'Cat',
            'mkt': 'ja-JP',
            'count': num_imgs_per_transaction,
            'offset': offset * num_imgs_per_transaction  # increment offset by 'num_imgs_per_transaction' (for example 0, 150, 300)
        })

        try:
            conn = http.client.HTTPSConnection('api.cognitive.microsoft.com')
            conn.request("POST", "/bing/v5.0/images/search?%s" % params, "{body}", headers)
            response = conn.getresponse()
            data = response.read()

            # save the raw response as a pickle file per transaction
            save_res_path = os.path.join(save_dir_path, 'pickle_files')
            make_dir(save_res_path)
            with open(os.path.join(save_res_path, '{}.pickle'.format(offset)), mode='wb') as f:
                pickle.dump(data, f)

            conn.close()
        except Exception as err:
            print("[Errno {0}] {1}".format(err.errno, err.strerror))

        else:
            decode_res = data.decode('utf-8')
            data = json.loads(decode_res)

            pattern = r"&r=(http.+)&p="  # extract an URL of image

            for values in data['value']:
                unquoted_url = urllib.parse.unquote(values['contentUrl'])
                img_url = re.search(pattern, unquoted_url)
                if img_url:
                    url_list.append(img_url.group(1))

    for url in url_list:
        try:
            img_path = make_img_path(save_dir_path, url)
            image = download_image(url)
            save_image(img_path, image)
            print('saved image... {}'.format(url))
        except KeyboardInterrupt:
            break
        except Exception as err:
            print("%s" % (err))

    correspondence_table_path = os.path.join(save_dir_path, 'corr_table')
    make_dir(correspondence_table_path)

    with open(os.path.join(correspondence_table_path, 'corr_table.json'), mode='w') as f:
        json.dump(correspondence_table, f)
```
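Once the script has run, the saved data can be inspected again. The sketch below assumes the same `save_dir_path` as in bing_api.py; the pickled objects are the raw response bytes saved per transaction, and `corr_table.json` maps each original image URL to its hash.

```python
import json
import os
import pickle

save_dir_path = '/path/to/save/dir'  # same directory as in bing_api.py

# Re-read the raw API response saved for the first transaction (offset 0)
with open(os.path.join(save_dir_path, 'pickle_files', '0.pickle'), mode='rb') as f:
    raw_bytes = pickle.load(f)
res = json.loads(raw_bytes.decode('utf-8'))
print(len(res['value']))  # number of image records returned in this transaction

# Look up which original URL produced a given hashed file name
with open(os.path.join(save_dir_path, 'corr_table', 'corr_table.json')) as f:
    corr_table = json.load(f)  # {original_url: hashed_url}
for original_url, hashed_url in list(corr_table.items())[:3]:
    print(hashed_url, '<-', original_url)
```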
Before running the script, set the following:

- The API key goes in `headers` (the `Ocp-Apim-Subscription-Key` field).
- The search word goes in `q` in `params` (Japanese is also possible).
- The total number of images you want goes in `num_imgs_required`.
- The save destination goes in `save_dir_path`.

The number of images acquired in one transaction is specified with `count` in `params` (default 35 images, max 150 images; see [the official reference](https://msdn.microsoft.com/en-us/library/dn760791.aspx#Anchor_2)). In practice, fewer images than specified seem to be returned.

To acquire the subsequent images, `offset` skips ahead by the number already requested and acquisition resumes from there. `offset` starts at 0 and is incremented in a loop until `num_imgs_required` is covered. Specifically, if you specify 150 for `count` and try to extract a total of 450 images, `offset` takes the values 0, 150, 300 in the loop. (The official explanation of `count` and `offset` is in the same reference.)
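To make the `count`/`offset` loop concrete, here is a tiny sketch of the arithmetic the script performs; the variable names mirror bing_api.py, and the 450-image total is just an illustrative figure.

```python
num_imgs_required = 450          # total images wanted (illustrative)
num_imgs_per_transaction = 150   # 'count' parameter, max 150

offset_count = num_imgs_required // num_imgs_per_transaction
offsets = [i * num_imgs_per_transaction for i in range(offset_count)]
print(offsets)  # [0, 150, 300] -> three transactions of up to 150 images each
```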
For a list of search parameters other than `q`, see [the official reference](https://msdn.microsoft.com/en-us/library/dn760791.aspx#Anchor_2).
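For instance, the `params` in bing_api.py could be extended with additional query parameters. The sketch below is hypothetical; `imageType` and `size` are parameters listed in the reference above, and the values shown are only for illustration.

```python
import urllib.parse

params = urllib.parse.urlencode({
    'q': 'Cat',
    'mkt': 'ja-JP',
    'count': 150,
    'offset': 0,
    'imageType': 'Photo',   # photos only (exclude clipart, line drawings)
    'size': 'Large',        # prefer larger images
})
print(params)
```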
Here, only images with the extensions jpg, jpeg, gif, png, and bmp are targeted for acquisition.
The saved image file names could simply have been serial numbers, but since these images are for machine learning, I want to drop duplicate images as much as possible.
So I first tried using the image's URL as the saved file name, so that duplicates would simply be overwritten on save; however, some URLs are so long that the resulting file names caused problems when saving.
As a countermeasure, the URL of each acquired image is [hashed](http://e-words.jp/w/%E3%83%8F%E3%83%83%E3%82%B7%E3%83%A5%E5%8C%96.html).
Hashing is usually associated with cryptography, but it converts input of any length into a fixed-length string (64 hexadecimal characters for the SHA3-256 used here) and always produces the same string for the same content. This lets us shorten the file names and, at the same time, have image files from identical URLs overwrite each other as duplicates. I used hashlib for hashing in Python, and referred to a blog post for the SHA-3 hashing algorithm; a small code sketch follows the example below.
Example of hash conversion:

- `http://hogehogehogehoge~~~/cat.jpg` (long URL) -> hashing -> produces a 64-character string ①
- `http://fugafugafugafuga~~~/cat2.jpg` (long URL) -> hashing -> produces a different 64-character string ②
- `http://hogehogehogehoge~~~/cat.jpg` (long URL) -> hashing -> produces the same 64-character string as ①
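A minimal sketch of this with hashlib follows; on Python 3.6+ `sha3_256` is built into hashlib, while on 3.5 it is provided by the pysha3 package imported as `sha3` in the script above. The URLs are dummies.

```python
import hashlib

url_a = 'http://hogehogehogehoge/cat.jpg'
url_b = 'http://fugafugafugafuga/cat2.jpg'

digest_a = hashlib.sha3_256(url_a.encode('utf-8')).hexdigest()
digest_b = hashlib.sha3_256(url_b.encode('utf-8')).hexdigest()

print(len(digest_a))         # 64 -> short, fixed-length file name
print(digest_a == digest_b)  # False -> different URLs, different names
print(digest_a == hashlib.sha3_256(url_a.encode('utf-8')).hexdigest())  # True -> same URL, same name
```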
As described above, when I hit the API to collect 1,000 images with the keyword "cat", I ended up with 824 cat images. It falls a little short of 1,000, but I think that is a reasonable loss considering that the acquired extensions are limited and duplicate images are dropped. (And from here the work of looking through the junk images and throwing them away begins...)
However, seen from another angle, it also means you can only get about 800 images even for "cats", probably one of the most abundant subjects on the Internet (I also confirmed that running the script with 2,000 images specified still only yields about 800).
So even if you say "I want 3,000 cat images", actually collecting them by hitting the API seems quite difficult. Combining keywords like "cat Egypt" would probably get you a few more, but it seems there are not that many.
I also wanted to search for more niche words with few images on the net, but for such niche search words, Google's results seemed to contain more of the images I was after than Bing's (though that may just be my impression...). In the end, the amount and kind of images you can obtain may differ depending on the search engine.
Next, as a test, I will also try Google's "Custom Search API" and summarize the results.
(Added 2017-09-25) I wrote it → Collect a large number of images using Google's image search API. (Added 2017-09-26) I also wrote a summary article → Summary of the image collection situation on Yahoo, Bing, and Google.
GitHub
The same content as above is also posted on GitHub. Only the way of specifying the API key and the search engine ID differs, so please refer to the README.