I want to collect 1,000 images for machine learning. This time, let's collect images with Python 3 using the [Bing Image Search API](https://azure.microsoft.com/ja-jp/services/cognitive-services/bing-image-search-api/), a search engine API provided by Bing (Microsoft).
The Bing Image Search API also has a test tool available here.
First, create a Microsoft account to get the API key needed to call the API. To be honest, this part is tedious.
Microsoft seems to be unifying its various services under the "Cognitive ..." umbrella, and existing services have been renamed, moved, and re-versioned along the way. A new version, Bing Search API v5, was released on July 1, 2016, so even after searching the web it is hard to tell which registration procedure is the current (correct) one.
For now, the following steps seem to be the minimum required, so please try them. (You may also need to register another account.)
By the way, to create an account (i.e., register for Microsoft Azure), credit card registration is required even if you stay within the free tier, just as with Google's Cloud Platform (of course, you will not be charged as long as you stay within the free tier).
Also, new registrations come with a $200 coupon that is valid for only 30 days. For collecting images for fun, this feels like more than enough to stay effectively free.
Beyond the free tier, you are charged $3 per 1,000 transactions (up to 150 images can be acquired per transaction) (* in the case of the lowest API tier, S1). For example, the 1,000 images collected here fit in just 7 transactions. Pricing details are [here](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/search-api/web/). It is cheaper than Google's "Custom Search API", its image search counterpart.
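As a rough back-of-the-envelope sketch of the cost: the $3 per 1,000 transactions figure and the 150-images-per-transaction cap are from the pricing above, and everything else is simple arithmetic.

```python
import math

def estimate_cost(num_images, imgs_per_transaction=150, price_per_1000_tx=3.0):
    """Rough cost estimate based on the S1 pricing quoted above."""
    transactions = math.ceil(num_images / imgs_per_transaction)
    return transactions, transactions / 1000 * price_per_1000_tx

print(estimate_cost(1000))   # (7, 0.021) -> 7 transactions, about 2 cents
```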
Click "Start for free" at the bottom left. Register an account as you are told
After establishing an account, you will land on this page of Microfost Azure.
Click the blue button labeled "Portal" in the upper right.
Jump to the dashboard page that manages the API to be used.
You can set a new API to use from "+ New" on the left menu.
Search for "Bing Search APIs" in the search window and click on the results.
Enter the following information of Create, check confirm and "Create".
If "Create" is successful, a panel with the name of Name will appear on the dashboard, so click on it.
Click because there is "Keys" in the left menu of the clicked destination
Note that the "KEY 1" that appears there is the key needed to hit the API (probably "KEY 2" is also ok)
For now, here is working code (bing_api.py). (The most minimal script needed to call the API is here.)
As an example, let's collect 1,000 images that match the search word "cat" (a Japanese search word also works).
The Python version is 3.5.2, and you just run `python3 bing_api.py`. When executed, directories named `corr_table`, `imgs`, and `pickle_files` will be created under the directory specified by `save_dir_path`, and the data will be generated under them.
bing_api.py
```python
# -*- coding: utf-8 -*-
import http.client
import json
import re
import requests
import os
import math
import pickle
import urllib.parse
import hashlib
import sha3  # pysha3: provides hashlib.sha3_256 on Python 3.5 (built in from 3.6)


def make_dir(path):
    if not os.path.isdir(path):
        os.mkdir(path)


def make_correspondence_table(correspondence_table, original_url, hashed_url):
    """Create reference table of hash value and original URL."""
    correspondence_table[original_url] = hashed_url


def make_img_path(save_dir_path, url):
    """Hash the image url and create the path

    Args:
        save_dir_path (str): Path to save image dir.
        url (str): An url of image.

    Returns:
        Path of hashed image URL.
    """
    save_img_path = os.path.join(save_dir_path, 'imgs')
    make_dir(save_img_path)

    file_extension = os.path.splitext(url)[-1]
    if file_extension.lower() in ('.jpg', '.jpeg', '.gif', '.png', '.bmp'):
        encoded_url = url.encode('utf-8')  # required encoding for hashing
        hashed_url = hashlib.sha3_256(encoded_url).hexdigest()
        full_path = os.path.join(save_img_path, hashed_url + file_extension.lower())

        make_correspondence_table(correspondence_table, url, hashed_url)

        return full_path
    else:
        raise ValueError('Not applicable file extension')


def download_image(url, timeout=10):
    response = requests.get(url, allow_redirects=True, timeout=timeout)
    if response.status_code != 200:
        raise Exception("HTTP status: " + str(response.status_code))

    content_type = response.headers["content-type"]
    if 'image' not in content_type:
        raise Exception("Content-Type: " + content_type)

    return response.content


def save_image(filename, image):
    with open(filename, "wb") as fout:
        fout.write(image)


if __name__ == "__main__":
    save_dir_path = '/path/to/save/dir'
    make_dir(save_dir_path)

    num_imgs_required = 1000  # Number of images you want. The number to be divisible by 'num_imgs_per_transaction'
    num_imgs_per_transaction = 150  # default 30, Max 150
    offset_count = math.floor(num_imgs_required / num_imgs_per_transaction)

    url_list = []
    correspondence_table = {}

    headers = {
        # Request headers
        'Content-Type': 'multipart/form-data',
        'Ocp-Apim-Subscription-Key': 'xxxxxxxxxxxxxxxxxxxxxxxxxxx',  # API key
    }

    for offset in range(offset_count):

        params = urllib.parse.urlencode({
            # Request parameters
            'q': 'Cat',
            'mkt': 'ja-JP',
            'count': num_imgs_per_transaction,
            'offset': offset * num_imgs_per_transaction  # increment offset by 'num_imgs_per_transaction' (for example 0, 150, 300)
        })

        try:
            conn = http.client.HTTPSConnection('api.cognitive.microsoft.com')
            conn.request("POST", "/bing/v5.0/images/search?%s" % params, "{body}", headers)
            response = conn.getresponse()
            data = response.read()

            # save the raw response as a pickle file per transaction
            save_res_path = os.path.join(save_dir_path, 'pickle_files')
            make_dir(save_res_path)
            with open(os.path.join(save_res_path, '{}.pickle'.format(offset)), mode='wb') as f:
                pickle.dump(data, f)

            conn.close()
        except Exception as err:
            print("[Errno {0}] {1}".format(err.errno, err.strerror))

        else:
            decode_res = data.decode('utf-8')
            data = json.loads(decode_res)

            pattern = r"&r=(http.+)&p="  # extract an URL of image

            for values in data['value']:
                unquoted_url = urllib.parse.unquote(values['contentUrl'])
                img_url = re.search(pattern, unquoted_url)
                if img_url:
                    url_list.append(img_url.group(1))

    for url in url_list:
        try:
            img_path = make_img_path(save_dir_path, url)
            image = download_image(url)
            save_image(img_path, image)
            print('saved image... {}'.format(url))
        except KeyboardInterrupt:
            break
        except Exception as err:
            print("%s" % (err))

    correspondence_table_path = os.path.join(save_dir_path, 'corr_table')
    make_dir(correspondence_table_path)

    with open(os.path.join(correspondence_table_path, 'corr_table.json'), mode='w') as f:
        json.dump(correspondence_table, f)
```
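Once the script has run, the saved data can be inspected again. The sketch below assumes the same `save_dir_path` as in bing_api.py; the pickled objects are the raw response bytes saved per transaction, and `corr_table.json` maps each original image URL to its hash.

```python
import json
import os
import pickle

save_dir_path = '/path/to/save/dir'  # same directory as in bing_api.py

# Re-read the raw API response saved for the first transaction (offset 0)
with open(os.path.join(save_dir_path, 'pickle_files', '0.pickle'), mode='rb') as f:
    raw_bytes = pickle.load(f)
res = json.loads(raw_bytes.decode('utf-8'))
print(len(res['value']))  # number of image records returned in this transaction

# Look up which original URL produced a given hashed file name
with open(os.path.join(save_dir_path, 'corr_table', 'corr_table.json')) as f:
    corr_table = json.load(f)  # {original_url: hashed_url}
for original_url, hashed_url in list(corr_table.items())[:3]:
    print(hashed_url, '<-', original_url)
```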
Before running the script, set the following:

- The API key goes in `headers` (the `Ocp-Apim-Subscription-Key` field).
- The search word goes in `q` in `params` (Japanese is also possible).
- The total number of images you want goes in `num_imgs_required`.
- The save destination goes in `save_dir_path`.

The number of images acquired in one transaction is specified with `count` in `params` (default 35 images, max 150 images; see [the official reference](https://msdn.microsoft.com/en-us/library/dn760791.aspx#Anchor_2)). In practice, fewer images than specified seem to be returned.

To acquire the subsequent images, `offset` skips ahead by the number already requested and acquisition resumes from there. `offset` starts at 0 and is incremented in a loop until `num_imgs_required` is covered. Specifically, if you specify 150 for `count` and try to extract a total of 450 images, `offset` takes the values 0, 150, 300 in the loop. (The official explanation of `count` and `offset` is in the same reference.)
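To make the `count`/`offset` loop concrete, here is a tiny sketch of the arithmetic the script performs; the variable names mirror bing_api.py, and the 450-image total is just an illustrative figure.

```python
num_imgs_required = 450          # total images wanted (illustrative)
num_imgs_per_transaction = 150   # 'count' parameter, max 150

offset_count = num_imgs_required // num_imgs_per_transaction
offsets = [i * num_imgs_per_transaction for i in range(offset_count)]
print(offsets)  # [0, 150, 300] -> three transactions of up to 150 images each
```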
For a list of search parameters other than `q`, see [the official reference](https://msdn.microsoft.com/en-us/library/dn760791.aspx#Anchor_2).
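For instance, the `params` in bing_api.py could be extended with additional query parameters. The sketch below is hypothetical; `imageType` and `size` are parameters listed in the reference above, and the values shown are only for illustration.

```python
import urllib.parse

params = urllib.parse.urlencode({
    'q': 'Cat',
    'mkt': 'ja-JP',
    'count': 150,
    'offset': 0,
    'imageType': 'Photo',   # photos only (exclude clipart, line drawings)
    'size': 'Large',        # prefer larger images
})
print(params)
```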
Here, only images with the extensions jpg, jpeg, gif, png, and bmp are targeted for acquisition.
The saved image file names could simply have been serial numbers, but since these images are for machine learning, I want to drop duplicate images as much as possible.
So I first tried using the image's URL as the saved file name, so that duplicates would simply be overwritten on save; however, some URLs are so long that the resulting file names caused problems when saving.
As a countermeasure, the URL of each acquired image is [hashed](http://e-words.jp/w/%E3%83%8F%E3%83%83%E3%82%B7%E3%83%A5%E5%8C%96.html).
Hashing is usually associated with cryptography, but it converts input of any length into a fixed-length string (64 hexadecimal characters for the SHA3-256 used here) and always produces the same string for the same content. This lets us shorten the file names and, at the same time, have image files from identical URLs overwrite each other as duplicates. I used hashlib for hashing in Python, and referred to a blog post for the SHA-3 hashing algorithm; a small code sketch follows the example below.
Example of hash conversion:

- `http://hogehogehogehoge~~~/cat.jpg` (long URL) -> hashing -> produces a 64-character string ①
- `http://fugafugafugafuga~~~/cat2.jpg` (long URL) -> hashing -> produces a different 64-character string ②
- `http://hogehogehogehoge~~~/cat.jpg` (long URL) -> hashing -> produces the same 64-character string as ①
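A minimal sketch of this with hashlib follows; on Python 3.6+ `sha3_256` is built into hashlib, while on 3.5 it is provided by the pysha3 package imported as `sha3` in the script above. The URLs are dummies.

```python
import hashlib

url_a = 'http://hogehogehogehoge/cat.jpg'
url_b = 'http://fugafugafugafuga/cat2.jpg'

digest_a = hashlib.sha3_256(url_a.encode('utf-8')).hexdigest()
digest_b = hashlib.sha3_256(url_b.encode('utf-8')).hexdigest()

print(len(digest_a))         # 64 -> short, fixed-length file name
print(digest_a == digest_b)  # False -> different URLs, different names
print(digest_a == hashlib.sha3_256(url_a.encode('utf-8')).hexdigest())  # True -> same URL, same name
```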
As described above, when I hit the API to collect 1,000 images with the keyword "cat", I ended up with 824 cat images. It falls a little short of 1,000, but I think that is a reasonable loss considering that the acquired extensions are limited and duplicate images are dropped. (And from here the work of looking through the junk images and throwing them away begins...)
However, seen from another angle, it also means you can only get about 800 images even for "cats", probably one of the most abundant subjects on the Internet (I also confirmed that running the script with 2,000 images specified still only yields about 800).
So even if you say "I want 3,000 cat images", actually collecting them by hitting the API seems quite difficult. Combining keywords like "cat Egypt" would probably get you a few more, but it seems there are not that many.
I also wanted to search for more niche words with few images on the net, but for such niche search words, Google's results seemed to contain more of the images I was after than Bing's (though that may just be my impression...). In the end, the amount and kind of images you can obtain may differ depending on the search engine.
Next, as a test, I will also try Google's "Custom Search API" and summarize the results.
(Added 2017-09-25) I wrote it → Collect a large number of images using Google's image search API. (Added 2017-09-26) I also wrote a summary article → Summary of the image collection situation on Yahoo, Bing, and Google.
GitHub
The same content as above is also posted on GitHub. Only the way of specifying the API key and the search engine ID differs, so please refer to the README.