The first thing you need to do to create a machine learning model that recognizes objects in images is to collect a large number of training images. Images of general items such as dogs and cars can be downloaded from services such as ImageNet, but for characters such as Pikachu and Anpanman, for example, no such images are available. So I came up with a way to collect images using Google search. This time, I will introduce how to collect image data for machine learning using the Google Custom Search API.
First, create a custom search engine with CSE.
Start by clicking Add under Edit Search Engine.
Next, fill in the appropriate values in the form. One caveat here: enter some suitable value such as "*.com" for "Search site". If you think "No, I want to search all sites!" and enter "*", you will never be able to proceed. (I was stuck here for quite a while.) We will change the settings later so that everything is searched. Fill in the remaining settings as appropriate and press the create button.
Pressing the create button completes the creation. Then select the control panel.
There are three things to do in this control panel. First, turn on image search.
Next, delete the "*.com" you added earlier from the list of sites to search.
Finally, change "Search only added sites" to "Search the entire web with an emphasis on added sites".
Well done. You have now created a custom search engine. Be sure to make a note of the ID that appears when you press "Search Engine ID".
Then enable the Custom Search API. This is very easy. Go to https://console.developers.google.com (create a project if you don't have one), select Library in the left menu, and choose Custom Search API. On the page that opens, press "Enable" to enable the API.
Now get the API key from Credentials in the left menu. Select API key from the Create Credentials tab.
A key is created as soon as you select it, so be sure to make a note of it.
It took a while, but now everything is ready!
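Before collecting anything in bulk, it may be worth checking that the search engine ID and the API key work together. The following is only a minimal sketch: it builds a single request URL from the same endpoint and parameters that the script in this article uses. `build_search_url` is a hypothetical helper name, and the two credential strings are placeholders, not real values.

```python
from urllib.parse import urlencode

API_PATH = "https://www.googleapis.com/customsearch/v1"


def build_search_url(cx, key, query, start=1, num=10):
    """Return a Custom Search request URL for an image search.

    Hypothetical helper for illustration; the parameter names match
    the PARAMS dictionary used by the script in this article.
    """
    params = {
        "cx": cx,                # Search engine ID
        "key": key,              # API key
        "q": query,              # Search term
        "searchType": "image",   # Restrict results to images
        "start": start,          # 1-based index of the first result
        "num": num,              # Results per request
    }
    return API_PATH + "?" + urlencode(params)


# Placeholder values only; substitute the ID and key you wrote down.
url = build_search_url("YOUR_SEARCH_ENGINE_ID", "YOUR_API_KEY", "Pikachu")
print(url)
```

Opening the printed URL in a browser should return a JSON document. If it contains an "items" list, the setup is probably fine; if it contains an error message instead, I would recheck the two values first.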
Now collect images using the custom search engine created above, the Custom Search API, and the API key.
The script is very simple, as shown below. It saves the images as number.png in a directory called images under the directory where it is executed. For the search engine ID and API key, enter the ones you wrote down above. (Please install the imported libraries with pip as needed.)
correct_image.py
import os
import shutil

import requests

API_PATH = "https://www.googleapis.com/customsearch/v1"
PARAMS = {
    "cx": "999999999999999999:abcdefghi",   # Search engine ID
    "key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # API key
    "q": "Pikachu",                         # Search term
    "searchType": "image",                  # Search type
    "start": 1,                             # Index of the first result
    "num": 10,                              # Results per request (10 by default)
}
LOOP = 100

# Make sure the output directory exists before writing to it
os.makedirs("images", exist_ok=True)

image_idx = 0
for x in range(LOOP):
    # Advance to the next page of results: 1, 11, 21, ...
    PARAMS.update({"start": PARAMS["num"] * x + 1})
    items_json = requests.get(API_PATH, PARAMS).json()["items"]
    for item_json in items_json:
        path = "images/" + str(image_idx) + ".png"
        r = requests.get(item_json["link"], stream=True)
        if r.status_code == 200:
            with open(path, "wb") as f:
                # Decode gzip/deflate transfer encoding before copying the bytes
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            image_idx += 1
When I actually ran this, I got images like the following.
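One thing to keep in mind about the saved files: the script names every file number.png, whatever format the server actually returned. If the extension matters for your training pipeline, one possible approach is to choose it from the response's Content-Type header. This is just a sketch under that assumption; `EXTENSIONS`, `extension_for`, and `image_path` are hypothetical names, and the mapping only covers a few common formats.

```python
# Hypothetical mapping from MIME type to extension; extend as needed.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def extension_for(content_type, default=".png"):
    """Pick a file extension from a Content-Type header value."""
    # Drop parameters such as "; charset=binary" and normalize the case.
    mime = (content_type or "").split(";")[0].strip().lower()
    return EXTENSIONS.get(mime, default)


def image_path(index, content_type, directory="images"):
    """Build a path such as images/3.jpg for the given image index."""
    return f"{directory}/{index}{extension_for(content_type)}"
```

In the script, this could replace the line that builds path, for example `path = image_path(image_idx, r.headers.get("Content-Type"))`, placed after the image request so that the header is available.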
I noticed later that when I tried to collect a large number of images with this method, the script stopped with the following error:
Traceback (most recent call last):
  File "get_image.py", line 31, in <module>
    items_json = requests.get(API_PATH, PARAMS).json()["items"]
KeyError: 'items'
It turned out that no more than 100 images can be acquired. Apparently, the Google Custom Search API does not return results from page 11 onward, that is, beyond the first 100 results at 10 per page. (There was a link that mentioned this, but I lost it.)
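Given that limit, the loop can be made to stop by itself instead of crashing with the KeyError shown above. The following is a rough sketch, not a tested replacement for the script: `start_indices` and `extract_links` are hypothetical helper names, and `MAX_RESULTS` simply encodes the 100-result limit I ran into.

```python
MAX_RESULTS = 100  # Limit observed above; adjust if the API behaves differently


def start_indices(num=10, max_results=MAX_RESULTS):
    """Yield 1-based start values whose pages fit within max_results."""
    start = 1
    while start + num - 1 <= max_results:
        yield start
        start += num


def extract_links(payload):
    """Return image URLs from a response dict, or [] if "items" is missing."""
    return [item["link"] for item in payload.get("items", []) if "link" in item]
```

With these helpers, the outer loop of the script could become `for start in start_indices(PARAMS["num"]):`, setting `PARAMS["start"] = start` on each pass and calling `extract_links(requests.get(API_PATH, PARAMS).json())` to get the image URLs. A response without "items" then yields an empty list rather than an exception.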