Purpose

This script automatically collects typical images for a query. In recent years, deep learning has become popular in the field of images (originally in the field of audio...); it now appears at many academic conferences and has even been set as a shared task. However, training requires a huge amount of data, and everything from collection to annotation takes considerable time and cost.

So let's collect the tagged image data that machine learning methods such as deep learning need! I wrote this script with that purpose in mind.

Collection of typical images

This time, we will try to automate image collection using Bing's image search. The collect_img.py script below does something like crawling and scraping, but as a learning exercise I implemented it without convenient modules (BeautifulSoup, urllib, etc.).

Although it is billed as collecting typical images, it really just fetches the top N search results.

`collect_img.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import re
import subprocess


# Fetch the HTML of the image search results for the query
def get_HTML(query):

    # Pass the arguments as a list so that the shell never interprets the URL.
    # The query is appended as-is, so it must not need URL encoding (e.g. no spaces).
    proc = subprocess.Popen(
        ["wget", "-q", "-O", "-", "https://www.bing.com/images/search?q=" + query],
        stdout=subprocess.PIPE)
    out, _ = proc.communicate()

    return out.decode("utf-8", "replace")

# Extract the URLs of the jpg images
def extract_URL(html):

    url = []
    sentences = html.split('\n')
    ptn = re.compile(r'<a href="(.+\.jpg)" class="thumb"')

    for sent in sentences:
        if sent.find('<div class="item">') >= 0:
            element = sent.split('<div class="item">')

            for e in element:
                # re.match returns None when the chunk does not start with a thumbnail link
                mtch = ptn.match(e)
                if mtch:
                    url.append(mtch.group(1))

    return url

# Save the images locally
def get_IMG(dir, url):

    for u in url:
        # wget signals failure through its exit status, not a Python exception
        if subprocess.call(["wget", "-P", dir, u]) != 0:
            sys.stderr.write("failed to download: " + u + "\n")


if __name__ == "__main__":

    argvs = sys.argv  # argvs[1]: image search query, argvs[2]: destination directory (only when saving)
    query = argvs[1]  # e.g. leopard

    html = get_HTML(query)

    url = extract_URL(html)

    for u in url:
        print(u)

    # Uncomment when you want to save the images locally
    # get_IMG(argvs[2], url)
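The extraction step can be tried without any network access. The sketch below runs the same regular expression over a made-up HTML fragment. `SAMPLE_HTML` and the function name `extract_jpg_urls` are hypothetical and modelled only on the strings the script searches for; the markup that Bing actually returns may differ.

```python
import re

# Hypothetical fragment shaped like the markup the script looks for
SAMPLE_HTML = (
    '<div class="item"><a href="http://example.com/a.jpg" class="thumb" target="_blank">'
    '<div class="item"><a href="http://example.com/b.png" class="thumb">'
    '<div class="item"><a href="http://example.com/c.jpg" class="thumb">'
)

# Same pattern as in collect_img.py, written as a raw string
PATTERN = re.compile(r'<a href="(.+\.jpg)" class="thumb"')


def extract_jpg_urls(html):
    """Return the .jpg URLs of thumbnail links, one per <div class="item"> chunk."""
    urls = []
    for line in html.split("\n"):
        for chunk in line.split('<div class="item">'):
            match = PATTERN.match(chunk)  # None unless the chunk starts with a .jpg link
            if match:
                urls.append(match.group(1))
    return urls


print(extract_jpg_urls(SAMPLE_HTML))
# ['http://example.com/a.jpg', 'http://example.com/c.jpg']
```

The .png link is skipped because the pattern only accepts URLs ending in .jpg, which is also why the script can miss images served in other formats.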

Run

Execution method

Execute it from the command line as follows. The dir argument can be omitted when get_IMG is not enabled (in that case the images are not saved).

`collect_img.py`


$ python collect_img.py query dir

- query: Search word for the images you want (e.g. leopard)
- dir: Directory in which to save the images (./img/*)
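If the "no convenient modules" constraint is relaxed, the optional dir argument could be made explicit with the standard library's argparse instead of reading sys.argv directly. This is only a sketch; the helper name `parse_args` is hypothetical and not part of the script.

```python
import argparse


def parse_args(argv):
    """Parse 'query [dir]'; dir is None when the images should not be saved."""
    parser = argparse.ArgumentParser(description="Collect image URLs for a query")
    parser.add_argument("query", help="search word, e.g. leopard")
    parser.add_argument("dir", nargs="?", default=None,
                        help="directory to save images in; omit to only print URLs")
    return parser.parse_args(argv)


args = parse_args(["leopard"])
print(args.query, args.dir)
```

With this shape, the script could call get_IMG only when `args.dir` is not None, so there would be no need to comment the call in or out.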

Execution result

Here are some of the results collected for the query "leopard". First, the list of acquired image URLs (only a part is shown):

http://images.china.cn/attachement/jpg/site1007/20120720/00016c8b5de01172f9e82e.jpg
http://farm2.static.flickr.com/1254/1174179702_fe9c9a5d2c_b.jpg
http://www.katzen-und-kater.de/Grosskatzen/Leopard/Leopard5.jpg
...

Here are some of the acquired images (three photos of leopards).

As shown above, the images were acquired properly. Note, however, that the script does not remove noise by computing the similarity between images; it simply fetches the top N results. (Another issue is that it is not implemented to keep collecting a large number of images.)
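The "top N" behavior is implicit in the script: it takes whatever the first results page contains. A minimal sketch of making N explicit, while also dropping duplicate URLs, might look like this. The function name `top_n_unique` is hypothetical.

```python
def top_n_unique(urls, n):
    """Return the first n distinct URLs, preserving the search-result order."""
    seen = set()
    result = []
    for u in urls:
        if len(result) >= n:
            break
        if u not in seen:
            seen.add(u)
            result.append(u)
    return result


print(top_n_unique(["a.jpg", "b.jpg", "a.jpg", "c.jpg"], 2))
# ['a.jpg', 'b.jpg']
```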

Summary

This time, I wrote a script that collects typical images from Bing image search, with the aim of automatically collecting annotated image data for machine learning. For the annotation, I think the query itself can be used as the label. In addition, the following two issues remain for the future.

- Collect an arbitrary number (or an unlimited number) of images.
- Delete images that act as noise, based on criteria such as the similarity between images.

Because this script relies on the tendency of image search engines to rank typical images at the top, I think the second issue above deserves serious thought. I will try implementing it next time.
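As a possible starting point for the second issue, the sketch below keeps only the images whose feature vector is close to the mean vector of the whole set, using cosine similarity. It assumes each image has already been turned into a numeric vector (for example a color histogram), which is not shown here. The feature values, the threshold of 0.8, and the names `cosine` and `filter_noise` are all made up for illustration, not something the script provides.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def filter_noise(features, threshold=0.8):
    """Keep the names whose vector is similar enough to the mean vector of the set."""
    names = list(features)
    dim = len(features[names[0]])
    mean = [sum(features[n][i] for n in names) / len(names) for i in range(dim)]
    return [n for n in names if cosine(features[n], mean) >= threshold]


# Hypothetical feature vectors: three similar images and one outlier
features = {
    "leopard1.jpg": [0.9, 0.1, 0.0],
    "leopard2.jpg": [0.8, 0.2, 0.0],
    "leopard3.jpg": [0.85, 0.15, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
print(filter_noise(features))
# ['leopard1.jpg', 'leopard2.jpg', 'leopard3.jpg']
```

Comparing against the mean only works while typical images are the majority, which matches the assumption that the top search results are mostly typical.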
