This is a memo of what I did.
Studying machine learning often requires a large number of images. Bing seemed the most suitable source for image collection, and I had never used Microsoft Azure, so I tried it as a learning exercise. This is a simple post built around reference URLs, but I hope it helps if you get stuck collecting images.
[Reference URL] A summary of image collection on Yahoo, Bing, and Google: https://qiita.com/ysdyt/items/565a0bf3228e12a2c503
Microsoft: Get a Bing Search API key (see the reference URLs below for how to get one): https://azure.microsoft.com/ja-jp/
Expiration: the free tier is valid for 30 days
- Create an automatic image collection program with the Bing Web Search API: https://blog.wackwack.net/entry/2017/12/27/223755
- Collect a large number of images with Bing's image search API: https://qiita.com/ysdyt/items/49e99416079546b65dfc
- Official: Quickstart: Search for images using the Bing Image Search REST API and Python: https://docs.microsoft.com/ja-jp/azure/cognitive-services/bing-image-search/quickstarts/python
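The quickstart above boils down to one authenticated GET request per page of results. A minimal sketch of how such a request is assembled (the key value is a placeholder, and `build_request` is a helper name of my own, not part of the API):

```python
# Endpoint used throughout this post (Bing Image Search v7)
ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

def build_request(api_key, query, count=10, offset=0, mkt="ja-JP"):
    """Build the headers and query parameters for one Bing Image Search call."""
    headers = {"Ocp-Apim-Subscription-Key": api_key}
    params = {"q": query, "count": count, "offset": offset, "mkt": mkt}
    return headers, params

headers, params = build_request("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "fujisan", count=5)
print(params)

# With a real key you would then fetch and extract the image URLs:
#   r = requests.get(ENDPOINT, headers=headers, params=params)
#   image_urls = [v["contentUrl"] for v in r.json()["value"]]
```

The full script below is just this request repeated per keyword, with the `offset` parameter advanced to page through results.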
- **I wanted to use multiple search words, so I read them from a local file** (each line pairs a search word with the name of the folder the results are stored in)
  - Only this file-reading part was added to the code from the reference URLs.
```python
import requests
import time
import urllib.parse
import hashlib
import sha3  # pysha3; from Python 3.6 on, hashlib provides sha3_256 natively
import os
import csv

# Split the argument f into a file name and an extension (without the ".")
def split_filename(f):
    split_name = os.path.splitext(f)
    file_name = split_name[0]
    extension = split_name[-1].replace(".", "")
    return file_name, extension

def download_img(path, url):
    _, extension = split_filename(url)
    if extension.lower() in ('jpg', 'jpeg', 'gif', 'png', 'bmp'):
        # Hash the URL to get a unique, filesystem-safe file name
        encode_url = urllib.parse.unquote(url).encode('utf-8')
        hashed_name = hashlib.sha3_256(encode_url).hexdigest()
        full_path = os.path.join(path, hashed_name + '.' + extension.lower())
        r = requests.get(url)
        if r.status_code == requests.codes.ok:
            with open(full_path, 'wb') as f:
                f.write(r.content)
            print('saved image...{}'.format(url))
        else:
            print("HttpError:{0} at {1}".format(r.status_code, url))

# Endpoint URL
url = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
# Bing Search API key
APIKey = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Parameters
headers = {'Ocp-Apim-Subscription-Key': APIKey}
count = 10     # maximum number of results per request (default: 30, max: 150)
mkt = "ja-JP"  # market (country) code of the acquisition source
num_per = 2    # number of requests (count * num_per = total number of images)

with open("./list.txt", "r", encoding="utf-8_sig") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        keyword = row[0]
        pathname = row[1]
        # Specify the save destination
        path = "./" + pathname
        # Create the save destination if it does not exist
        if not os.path.exists(path):
            os.makedirs(path)
        for offset_num in range(num_per):
            # Advance the offset by one page (count results) per request
            params = {'q': keyword, 'count': count,
                      'offset': offset_num * count, 'mkt': mkt}
            r = requests.get(url, headers=headers, params=params)
            data = r.json()
            for values in data['value']:
                image_url = values['contentUrl']
                try:
                    download_img(path, image_url)
                except Exception as e:
                    print("failed to download image at {}".format(image_url))
                    print(e)
            time.sleep(0.5)
```
- Input file: search word and storage folder name, tab-separated (list.txt)
- Downloaded images (fujisan)
- Installation: `pip install pysha3`. This failed on Python 3.7 but installed without error on 3.6, so this program is run with Python 3.6.
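As a concrete illustration (the keywords and folder names here are made up), a tab-separated list.txt in the format the script expects can be created and read back like this:

```python
import csv

# Write a hypothetical list.txt: search word <TAB> folder name, one pair per line
rows = [("fujisan", "fujisan"), ("sakura", "sakura_img")]
with open("list.txt", "w", encoding="utf-8_sig", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Read it back the same way the collection script does
with open("list.txt", "r", encoding="utf-8_sig") as f:
    for keyword, pathname in csv.reader(f, delimiter="\t"):
        print(keyword, "->", pathname)
```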
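Note that from Python 3.6 onward, `hashlib` itself provides SHA-3, so the hashed file name used in the script can be reproduced with the standard library alone (the URL below is a made-up example):

```python
import hashlib
import urllib.parse

sample_url = "https://example.com/images/fujisan.jpg"  # hypothetical image URL
encoded = urllib.parse.unquote(sample_url).encode("utf-8")
hashed_name = hashlib.sha3_256(encoded).hexdigest()
print(hashed_name)  # 64 hex characters, used as the saved file name
```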
- Thanks to these references, I avoided stumbling right at the start of studying image-based machine learning. (Thanks!)
- Since the paid rate for MS Azure is not high, I may keep using it after the free tier ends, depending on the situation. Pricing: https://azure.microsoft.com/ja-jp/pricing/details/cognitive-services/search-api/