From now on, I would like to study Deep Learning seriously. Before that, I have to think about where to get the large amount of data needed for training.
One method I came up with is to collect images from Twitter image bots; the other is to use image search engines such as Google and Bing. It will take some time to find a good bot, so let's start with a search API. The Bing Search API seems to be shutting down at the end of this year, so I'll go with Google this time.
Create a new search engine with Google Custom Search. The settings are as follows:

① Turn on image search
② Select "Search the entire web"
③ Delete the registered search site
④ Get the search engine ID

The ID has the form "number string:alphabetic string"; the number string seems to be the user ID and the alphabetic string the engine ID.
Enable the Custom Search API in the Google Cloud Platform console (https://console.cloud.google.com/apis) and create an API key under Credentials.
You can then search with a request of the form:

https://www.googleapis.com/customsearch/v1?key=[API_KEY]&cx=[CUSTOM_SEARCH_ENGINE]&q=[search_item]
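As a quick sanity check, here is a minimal single-request sketch using only the standard library. The key, engine ID, and query value are placeholders of my own, not values from this post.

#-*- coding:utf-8 -*-
# Minimal Custom Search request; API_KEY and CUSTOM_SEARCH_ENGINE are placeholders.
import json
import urllib.request
from urllib.parse import quote

API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CUSTOM_SEARCH_ENGINE = "12345648954985648968:xxxxxxxxx"

url = ("https://www.googleapis.com/customsearch/v1"
       + "?key=" + API_KEY
       + "&cx=" + CUSTOM_SEARCH_ENGINE
       + "&q=" + quote("dog"))

with urllib.request.urlopen(url) as res:
    data = json.loads(res.read().decode("utf-8"))

# Each hit is an entry of data["items"]; print title and link for a quick look.
for item in data.get("items", []):
    print(item["title"], item["link"])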
Add searchType=image to search for images, and use num=xx&start=yy for pagination when fetching a large number of images.
According to the reference (https://developers.google.com/custom-search/json-api/v1/reference/cse/list?hl=ja), num must be an integer from 1 to 10. In other words, you can fetch at most 10 results per request.
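To see how the pagination works out, here is a small sketch (my own illustration, not part of the original script) that computes the (start, num) pairs needed to fetch a given total:

# Sketch: (start, num) pairs for fetching total_num results, at most 10 per request.
def pagination_params(total_num):
    params = []
    i = 0
    while i < total_num:
        num = min(10, total_num - i)                 # the last page may be partial
        params.append({"start": i + 1, "num": num})  # start is 1-based
        i += 10
    return params

print(pagination_params(25))
# -> [{'start': 1, 'num': 10}, {'start': 11, 'num': 10}, {'start': 21, 'num': 5}]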
The script is based on tukiyo3's code.
get_image.py
#-*- coding:utf-8 -*-
#[email protected] 2016/11/21
import urllib.request
from urllib.parse import quote
import httplib2
import json
import os

API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CUSTOM_SEARCH_ENGINE = "12345648954985648968:xxxxxxxxx"

def getImageUrl(search_item, total_num):
    """Collect up to total_num image URLs for search_item, 10 per request."""
    img_list = []
    i = 0
    while i < total_num:
        # num is capped at 10, so request only the remainder on the last page
        query_img = ("https://www.googleapis.com/customsearch/v1"
                     + "?key=" + API_KEY
                     + "&cx=" + CUSTOM_SEARCH_ENGINE
                     + "&num=" + str(10 if (total_num - i) > 10 else (total_num - i))
                     + "&start=" + str(i + 1)
                     + "&q=" + quote(search_item)
                     + "&searchType=image")
        print(query_img)
        res = urllib.request.urlopen(query_img)
        data = json.loads(res.read().decode('utf-8'))
        for j in range(len(data["items"])):
            img_list.append(data["items"][j]["link"])
        i = i + 10
    return img_list

def getImage(search_item, img_list):
    """Download each URL in img_list, naming files search_item + index + extension."""
    http = httplib2.Http(".cache")
    for i in range(len(img_list)):
        try:
            # keep the original file extension of the image URL
            fn, ext = os.path.splitext(img_list[i])
            print(img_list[i])
            response, content = http.request(img_list[i])
            with open(search_item + str(i) + ext, 'wb') as f:
                f.write(content)
        except Exception:
            # some URLs will 404 or time out; skip them and keep going
            print("failed to download images.")
            continue

if __name__ == "__main__":
    img_list = getImageUrl("dog", 5)
    print(img_list)
    getImage("dog", img_list)
The code isn't that pretty, but I'll share it anyway. I also put it on GitHub.
The Google Custom Search API is convenient, but the free tier is limited to 100 requests per day, and I used up 70% of that just testing the script. For real use you have to pay. Since I want to collect images for free after all, I'll try some other methods (such as Twitter).
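For reference, a quick back-of-the-envelope sketch (my own arithmetic, based only on the 10-results-per-request and 100-requests-per-day limits mentioned above):

import math

# With at most 10 images per request and 100 free requests per day,
# the free tier tops out around 1,000 image URLs per day.
def requests_needed(total_images, per_request=10):
    return math.ceil(total_images / per_request)

print(requests_needed(500))    # 50 requests
print(requests_needed(1001))   # 101 requests -> over the free daily quota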
Update 2016/11/24: I found a good way to collect images! ↓ http://d.hatena.ne.jp/shi3z/20160309/1457480722 I modified the Python script in the link above to support Python 3. → GitHub