This script automatically collects typical images for a query. In recent years, deep learning has become popular in image processing (it took off first in speech processing), and it now appears at many academic conferences and has become an established shared-task topic. However, it needs a huge amount of training data, and everything from collection to annotation takes considerable time and cost.
So let's collect the tagged image data that machine learning methods such as deep learning need! I wrote this script with that purpose in mind.
This time, we automate image collection using Bing's image search. The code below does something like crawling and scraping, but as a learning exercise I implemented it without convenient modules such as BeautifulSoup or urllib.
Although I call it "typical image" collection, it actually just fetches the top N search results.
collect_img.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written for Python 3: the Python 2-only "commands" module is replaced by subprocess.
import re
import subprocess
import sys


# Fetch the HTML of the image search results for the query
def get_HTML(query):
    # Passing the arguments as a list keeps the shell from interpreting "?" in the URL.
    # The query is not URL-encoded here.
    result = subprocess.run(
        ["wget", "-q", "-O", "-", "https://www.bing.com/images/search?q=" + query],
        stdout=subprocess.PIPE)
    return result.stdout.decode("utf-8", errors="replace")


# Extract the URLs of the jpg images
def extract_URL(html):
    url = []
    # Non-greedy group: the match stops at the first '.jpg" class="thumb"'
    ptn = re.compile(r'<a href="(.+?\.jpg)" class="thumb"')
    for sent in html.split('\n'):
        if '<div class="item">' in sent:
            # Each piece after the split is expected to start with the <a> tag
            for element in sent.split('<div class="item">'):
                mtch = ptn.match(element)
                if mtch:
                    url.append(mtch.group(1))
    return url


# Save the images locally
def get_IMG(dir, url):
    for u in url:
        # wget reports failure through its exit status, not through an exception
        if subprocess.call(["wget", "-P", dir, u]) != 0:
            sys.stderr.write("download failed: " + u + "\n")


if __name__ == "__main__":
    argvs = sys.argv  # argvs[1]: image search query, argvs[2]: destination directory (only when saving)
    query = argvs[1]  # e.g. leopard
    html = get_HTML(query)
    url = extract_URL(html)
    for u in url:
        print(u)
    # Uncomment to save the images locally
    # get_IMG(argvs[2], url)
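To make the extraction step easier to follow, here is a minimal sketch that runs the same kind of pattern on a hand-written HTML fragment. The fragment, the `example.com` URLs, and the names `THUMB_LINK`, `extract_jpg_urls`, and `SAMPLE_HTML` are all made up for this illustration; the fragment only mirrors the markup the pattern expects and is not real search output.

```python
import re

# Same idea as the script's pattern: a jpg link that carries class="thumb".
# The non-greedy group stops at the first '.jpg" class="thumb"'.
THUMB_LINK = re.compile(r'<a href="(.+?\.jpg)" class="thumb"')


def extract_jpg_urls(html):
    """Return jpg URLs found right after each <div class="item"> marker."""
    urls = []
    for line in html.split("\n"):
        # After the split, the text that followed each marker is at the start
        # of a piece; the first piece (text before any marker) is skipped.
        for piece in line.split('<div class="item">')[1:]:
            match = THUMB_LINK.match(piece)
            if match:
                urls.append(match.group(1))
    return urls


# Hypothetical fragment, written by hand for this example
SAMPLE_HTML = (
    '<div class="item"><a href="http://example.com/a.jpg" class="thumb">x</a></div>'
    '<div class="item"><a href="http://example.com/b.png" class="thumb">y</a></div>'
    '<div class="item"><a href="http://example.com/c.jpg" class="thumb">z</a></div>\n'
    '<p>no items on this line</p>'
)

if __name__ == "__main__":
    for u in extract_jpg_urls(SAMPLE_HTML):
        print(u)
```

Because `match` anchors at the start of each piece, a link is only picked up when it directly follows the item marker, and the `.png` item is skipped.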
Run it from the command line as follows. Omit the dir argument when get_IMG is not used (that is, when the images are not saved).
collect_img.py
$ python collect_img.py query dir
- query: search word for the images you want (e.g. leopard)
- dir: directory in which to save the images (./img/*)
Here are some of the results collected with the query "leopard". First, the list of acquired image URLs (only a part is shown):
http://images.china.cn/attachement/jpg/site1007/20120720/00016c8b5de01172f9e82e.jpg
http://farm2.static.flickr.com/1254/1174179702_fe9c9a5d2c_b.jpg
http://www.katzen-und-kater.de/Grosskatzen/Leopard/Leopard5.jpg
...
Here are some of the acquired images.
From the above, we can see that the images were acquired properly. Note, however, that the script does not remove noise by computing the similarity between images; it simply fetches the top N results. (Another issue is that it is not implemented to keep collecting a large number of images.)
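Since every image fetched in one run comes from a single query, the query can be recorded as the tag for each image. The following is only a rough sketch of that bookkeeping: the names `label_rows` and `to_csv` and the CSV layout are my own assumptions, and it assumes the saved file name equals the last path segment of the URL.

```python
import csv
import io


def label_rows(urls, label):
    """Pair the file name taken from each URL with the label."""
    rows = []
    for u in urls:
        # Assumption: the saved file is named after the last path segment
        name = u.rsplit("/", 1)[-1]
        if name:
            rows.append((name, label))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["file", "label"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    urls = [
        "http://farm2.static.flickr.com/1254/1174179702_fe9c9a5d2c_b.jpg",
        "http://www.katzen-und-kater.de/Grosskatzen/Leopard/Leopard5.jpg",
    ]
    print(to_csv(label_rows(urls, "leopard")), end="")
```

If two URLs end in the same file name, the rows would collide, so a real run would need to check the names actually written to disk.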
This time, I wrote a script that collects typical images from Bing image search, with the aim of automatically gathering annotated image data for machine learning. For the annotations, I think the query can be used as is. The following two issues remain for the future.
- Collecting an arbitrary number (or an unlimited number) of images
- Deleting noisy images based on criteria such as the similarity between images
Since this script relies on the tendency of image search engines to rank typical images at the top, the second issue above deserves serious thought. I'd like to implement it next time.
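As one possible direction for the second issue, here is a small sketch that flags images that look unlike the rest. It uses an average hash with Hamming distance, which is my own choice for illustration and not something the script implements. It assumes every image has already been decoded and shrunk to a grayscale grid of the same size; the tiny 2x2 grids, the file names, and the threshold are invented values.

```python
def average_hash(pixels):
    """Bit list: 1 where a pixel is brighter than the image's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def drop_outliers(images, max_mean_distance):
    """Keep names whose mean distance to the other images is small enough."""
    hashes = {name: average_hash(px) for name, px in images.items()}
    kept = []
    for name, h in hashes.items():
        others = [hamming(h, o) for n, o in hashes.items() if n != name]
        if not others or sum(others) / len(others) <= max_mean_distance:
            kept.append(name)
    return kept


# Invented grayscale grids (0-255); real images would be resized first
IMAGES = {
    "a.jpg": [[10, 200], [10, 200]],
    "b.jpg": [[20, 220], [30, 210]],
    "c.jpg": [[15, 190], [5, 180]],
    "noise.jpg": [[250, 10], [240, 20]],
}

if __name__ == "__main__":
    print(drop_outliers(IMAGES, max_mean_distance=2))
```

The threshold decides how aggressive the filter is, and it would have to be tuned per query, since some queries naturally return visually diverse images.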