This is a memo of what I did.
Studying machine learning often requires a large number of images. Bing seemed the most suitable source for image collection, and I had never used Microsoft Azure, so I tried it as a learning exercise. This is a simple post built around reference URLs, but I hope it helps if you get stuck collecting images.
[Reference URL] A summary of image collection on Yahoo, Bing, and Google: https://qiita.com/ysdyt/items/565a0bf3228e12a2c503
Microsoft: Get a Bing Search API key (see the reference URLs below for how to get one): https://azure.microsoft.com/ja-jp/
Expiration: the free tier is valid for 30 days
- Create an automatic image collection program with the Bing Web Search API: https://blog.wackwack.net/entry/2017/12/27/223755
- Collect a large number of images with Bing's image search API: https://qiita.com/ysdyt/items/49e99416079546b65dfc
- Official: Quickstart: Search for images using the Bing Image Search REST API and Python: https://docs.microsoft.com/ja-jp/azure/cognitive-services/bing-image-search/quickstarts/python
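The quickstart above boils down to one authenticated GET request per page of results. A minimal sketch of how such a request is assembled (the key value is a placeholder, and `build_request` is a helper name of my own, not part of the API):

```python
# Endpoint used throughout this post (Bing Image Search v7)
ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

def build_request(api_key, query, count=10, offset=0, mkt="ja-JP"):
    """Build the headers and query parameters for one Bing Image Search call."""
    headers = {"Ocp-Apim-Subscription-Key": api_key}
    params = {"q": query, "count": count, "offset": offset, "mkt": mkt}
    return headers, params

headers, params = build_request("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "fujisan", count=5)
print(params)

# With a real key you would then fetch and extract the image URLs:
#   r = requests.get(ENDPOINT, headers=headers, params=params)
#   image_urls = [v["contentUrl"] for v in r.json()["value"]]
```

The full script below is just this request repeated per keyword, with the `offset` parameter advanced to page through results.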
- **I wanted to use multiple search words, so I read them from a local file** (each line pairs a search word with the name of the folder the results are stored in)
  - Only this file-reading part was added to the code from the reference URLs.
```python
import requests
import time
import urllib.parse
import hashlib
import sha3  # pysha3; from Python 3.6 on, hashlib provides sha3_256 natively
import os
import csv

# Split the argument f into a file name and an extension (without the ".")
def split_filename(f):
    split_name = os.path.splitext(f)
    file_name = split_name[0]
    extension = split_name[-1].replace(".", "")
    return file_name, extension

def download_img(path, url):
    _, extension = split_filename(url)
    if extension.lower() in ('jpg', 'jpeg', 'gif', 'png', 'bmp'):
        # Hash the URL to get a unique, filesystem-safe file name
        encode_url = urllib.parse.unquote(url).encode('utf-8')
        hashed_name = hashlib.sha3_256(encode_url).hexdigest()
        full_path = os.path.join(path, hashed_name + '.' + extension.lower())
        r = requests.get(url)
        if r.status_code == requests.codes.ok:
            with open(full_path, 'wb') as f:
                f.write(r.content)
            print('saved image...{}'.format(url))
        else:
            print("HttpError:{0} at {1}".format(r.status_code, url))

# Endpoint URL
url = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
# Bing Search API key
APIKey = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Parameters
headers = {'Ocp-Apim-Subscription-Key': APIKey}
count = 10     # maximum number of results per request (default: 30, max: 150)
mkt = "ja-JP"  # market (country) code of the acquisition source
num_per = 2    # number of requests (count * num_per = total number of images)

with open("./list.txt", "r", encoding="utf-8_sig") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        keyword = row[0]
        pathname = row[1]
        # Specify the save destination
        path = "./" + pathname
        # Create the save destination if it does not exist
        if not os.path.exists(path):
            os.makedirs(path)
        for offset_num in range(num_per):
            # Advance the offset by one page (count results) per request
            params = {'q': keyword, 'count': count,
                      'offset': offset_num * count, 'mkt': mkt}
            r = requests.get(url, headers=headers, params=params)
            data = r.json()
            for values in data['value']:
                image_url = values['contentUrl']
                try:
                    download_img(path, image_url)
                except Exception as e:
                    print("failed to download image at {}".format(image_url))
                    print(e)
            time.sleep(0.5)
```
- Input file: search word and storage folder name, tab-separated (list.txt)
- Downloaded images (fujisan)
- Installation: `pip install pysha3`. This failed on Python 3.7 but installed without error on 3.6, so this program is run with Python 3.6.
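As a concrete illustration (the keywords and folder names here are made up), a tab-separated list.txt in the format the script expects can be created and read back like this:

```python
import csv

# Write a hypothetical list.txt: search word <TAB> folder name, one pair per line
rows = [("fujisan", "fujisan"), ("sakura", "sakura_img")]
with open("list.txt", "w", encoding="utf-8_sig", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Read it back the same way the collection script does
with open("list.txt", "r", encoding="utf-8_sig") as f:
    for keyword, pathname in csv.reader(f, delimiter="\t"):
        print(keyword, "->", pathname)
```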
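Note that from Python 3.6 onward, `hashlib` itself provides SHA-3, so the hashed file name used in the script can be reproduced with the standard library alone (the URL below is a made-up example):

```python
import hashlib
import urllib.parse

sample_url = "https://example.com/images/fujisan.jpg"  # hypothetical image URL
encoded = urllib.parse.unquote(sample_url).encode("utf-8")
hashed_name = hashlib.sha3_256(encoded).hexdigest()
print(hashed_name)  # 64 hex characters, used as the saved file name
```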
- Thanks to these references, I avoided stumbling right at the start of studying image-based machine learning. (Thanks!)
- Since the paid rate for MS Azure is not high, I may keep using it after the free tier ends, depending on the situation. Pricing: https://azure.microsoft.com/ja-jp/pricing/details/cognitive-services/search-api/