From now on, I would like to study Deep Learning seriously. Before that, I have to think about where to get the large amount of data needed for training.
One method I came up with is to collect images from Twitter image bots; the other is to use image search engines such as Google and Bing. It will take some time to find a good bot, so let's start with a search API. The Bing Search API seems to be shutting down at the end of this year, so I'll go with Google this time.
Create a new search engine with Google Custom Search. The settings are as follows:

① Turn on image search
② Select "Search the entire web"
③ Delete the registered search site
④ Get the search engine ID

The ID has the form "number string:alphabetic string"; the number string seems to be the user ID and the alphabetic string the engine ID.
Enable the Custom Search API in the Google Cloud Platform console (https://console.cloud.google.com/apis) and create an API key under Credentials.
You can then search with a request of the form:

https://www.googleapis.com/customsearch/v1?key=[API_KEY]&cx=[CUSTOM_SEARCH_ENGINE]&q=[search_item]
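As a quick sanity check, here is a minimal single-request sketch using only the standard library. The key, engine ID, and query value are placeholders of my own, not values from this post.

#-*- coding:utf-8 -*-
# Minimal Custom Search request; API_KEY and CUSTOM_SEARCH_ENGINE are placeholders.
import json
import urllib.request
from urllib.parse import quote

API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CUSTOM_SEARCH_ENGINE = "12345648954985648968:xxxxxxxxx"

url = ("https://www.googleapis.com/customsearch/v1"
       + "?key=" + API_KEY
       + "&cx=" + CUSTOM_SEARCH_ENGINE
       + "&q=" + quote("dog"))

with urllib.request.urlopen(url) as res:
    data = json.loads(res.read().decode("utf-8"))

# Each hit is an entry of data["items"]; print title and link for a quick look.
for item in data.get("items", []):
    print(item["title"], item["link"])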
Add searchType=image to search for images, and use num=xx&start=yy for pagination when fetching a large number of images.
According to the reference (https://developers.google.com/custom-search/json-api/v1/reference/cse/list?hl=ja), num must be an integer from 1 to 10. In other words, you can fetch at most 10 results per request.
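To see how the pagination works out, here is a small sketch (my own illustration, not part of the original script) that computes the (start, num) pairs needed to fetch a given total:

# Sketch: (start, num) pairs for fetching total_num results, at most 10 per request.
def pagination_params(total_num):
    params = []
    i = 0
    while i < total_num:
        num = min(10, total_num - i)                 # the last page may be partial
        params.append({"start": i + 1, "num": num})  # start is 1-based
        i += 10
    return params

print(pagination_params(25))
# -> [{'start': 1, 'num': 10}, {'start': 11, 'num': 10}, {'start': 21, 'num': 5}]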
The script is based on tukiyo3's code.
get_image.py
#-*- coding:utf-8 -*-
#[email protected] 2016/11/21
import urllib.request
from urllib.parse import quote
import httplib2
import json
import os

API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CUSTOM_SEARCH_ENGINE = "12345648954985648968:xxxxxxxxx"

def getImageUrl(search_item, total_num):
    """Collect up to total_num image URLs for search_item, 10 per request."""
    img_list = []
    i = 0
    while i < total_num:
        # num is capped at 10, so request only the remainder on the last page
        query_img = ("https://www.googleapis.com/customsearch/v1"
                     + "?key=" + API_KEY
                     + "&cx=" + CUSTOM_SEARCH_ENGINE
                     + "&num=" + str(10 if (total_num - i) > 10 else (total_num - i))
                     + "&start=" + str(i + 1)
                     + "&q=" + quote(search_item)
                     + "&searchType=image")
        print(query_img)
        res = urllib.request.urlopen(query_img)
        data = json.loads(res.read().decode('utf-8'))
        for j in range(len(data["items"])):
            img_list.append(data["items"][j]["link"])
        i = i + 10
    return img_list

def getImage(search_item, img_list):
    """Download each URL in img_list, naming files search_item + index + extension."""
    http = httplib2.Http(".cache")
    for i in range(len(img_list)):
        try:
            # keep the original file extension of the image URL
            fn, ext = os.path.splitext(img_list[i])
            print(img_list[i])
            response, content = http.request(img_list[i])
            with open(search_item + str(i) + ext, 'wb') as f:
                f.write(content)
        except Exception:
            # some URLs will 404 or time out; skip them and keep going
            print("failed to download images.")
            continue

if __name__ == "__main__":
    img_list = getImageUrl("dog", 5)
    print(img_list)
    getImage("dog", img_list)
The code isn't that pretty, but I'll share it anyway. I also put it on GitHub.
The Google Custom Search API is convenient, but the free tier is limited to 100 requests per day, and I used up 70% of that just testing the script. For real use you have to pay. Since I want to collect images for free after all, I'll try some other methods (such as Twitter).
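For reference, a quick back-of-the-envelope sketch (my own arithmetic, based only on the 10-results-per-request and 100-requests-per-day limits mentioned above):

import math

# With at most 10 images per request and 100 free requests per day,
# the free tier tops out around 1,000 image URLs per day.
def requests_needed(total_images, per_request=10):
    return math.ceil(total_images / per_request)

print(requests_needed(500))    # 50 requests
print(requests_needed(1001))   # 101 requests -> over the free daily quota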
Update 2016/11/24: I found a good way to collect images! ↓ http://d.hatena.ne.jp/shi3z/20160309/1457480722 I modified the Python script in the link above to support Python 3. → GitHub