The Python library icrawler is handy for collecting image data for machine learning. As the official example shows, it can be installed and used with very little code:
pip install icrawler
or
conda install -c hellock icrawler
from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)
With just these three lines you can already download images.
To make it easier to use, I added the following two features:
① specify multiple search keywords (from an external file)
② avoid duplicate images
For ①, the search keywords are listed one per line in an external text file, which the script reads. For ②, the image's URL is encoded into the file name, so when the same image is saved again its name collides with the existing file and the duplicate is effectively skipped.
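To illustrate ②, here is a minimal sketch of how an image URL maps to a file name in the downloader shown later (the URL below is just a made-up example): the same URL always produces the same name, so a repeated download targets the same file.
import base64
from six.moves.urllib.parse import urlparse

# Hypothetical example URL, for illustration only.
url = 'https://example.com/images/cat_001.jpg'

# Take the path part of the URL and base64-encode it, as the downloader below does.
url_path = urlparse(url)[2]  # '/images/cat_001.jpg'
name = base64.b64encode(url_path.encode()).decode()
print(name + '.jpg')  # identical URLs always yield identical file names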
This works together with a small change to icrawler's built-in storage backend (filesystem.py): the file write is wrapped in try/except so that a FileNotFoundError while saving an image is skipped instead of being raised.
filesystem.py
# Imports as in icrawler's own filesystem.py.
import os
import os.path as osp

import six

from .base import BaseStorage


class FileSystem(BaseStorage):
    """Use filesystem as storage backend.

    The id is filename and data is stored as text files or binary files.
    """

    def __init__(self, root_dir):
        self.root_dir = root_dir

    def write(self, id, data):
        filepath = osp.join(self.root_dir, id)
        folder = osp.dirname(filepath)
        if not osp.isdir(folder):
            try:
                os.makedirs(folder)
            except OSError:
                pass
        mode = 'w' if isinstance(data, six.string_types) else 'wb'
        # Original implementation, kept for reference:
        # with open(filepath, mode) as fout:
        #     fout.write(data)
        try:
            with open(filepath, mode) as fout:
                fout.write(data)
        except FileNotFoundError:
            # Skip a file that cannot be written instead of raising.
            pass
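If you apply this change directly to your installed copy of icrawler, you can locate the file to edit with the small sketch below. It assumes the FileSystem backend lives in icrawler.storage.filesystem, which matches the class above, but check your installed version.
# Print the path of the installed storage backend so the patch above can be applied.
import icrawler.storage.filesystem as fs_module
print(fs_module.__file__)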
With that in place, here is the main script.
img_collection.py
import argparse
import base64
import os

from six.moves.urllib.parse import urlparse

from icrawler import ImageDownloader
from icrawler.builtin import BaiduImageCrawler, BingImageCrawler, GoogleImageCrawler

parser = argparse.ArgumentParser(description='img_collection')
parser.add_argument('--output_dir', default="", type=str, help='output directory path')
parser.add_argument('--N', default=10, type=int, help='max number of images per keyword')
parser.add_argument('--engine', choices=['baidu', 'bing', 'google'], default="bing", type=str, help='search engine')
args = parser.parse_args()


class Base64NameDownloader(ImageDownloader):
    """Name each file after the base64-encoded URL path, so the same image
    always gets the same file name."""

    def get_filename(self, task, default_ext):
        url_path = urlparse(task['file_url'])[2]
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in [
                    'jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm'
            ]:
                extension = default_ext
        else:
            extension = default_ext
        # works for python 3
        filename = base64.b64encode(url_path.encode()).decode()
        return '{}.{}'.format(filename, extension)


def get_crawler(args, dir_name):
    if args.engine == "baidu":
        crawler = BaiduImageCrawler(downloader_cls=Base64NameDownloader,
                                    storage={'root_dir': dir_name})
    elif args.engine == "bing":
        crawler = BingImageCrawler(downloader_cls=Base64NameDownloader,
                                   storage={'root_dir': dir_name})
    elif args.engine == "google":  # currently does not work
        crawler = GoogleImageCrawler(storage={'root_dir': dir_name})
    return crawler


if __name__ == "__main__":
    # Read the keyword list (one search word per line).
    with open('./setting.txt', mode='r', encoding="utf_8") as f:
        read_data = list(f)

    print("SELECTED ENGINE : " + args.engine)

    for i in range(len(read_data)):
        keyword = read_data[i].replace('\n', '')
        print("SEARCH WORD : " + keyword)
        print("NUM IMAGES : " + str(args.N))
        # One output directory per keyword (spaces replaced with underscores).
        dir_name = os.path.join(args.output_dir, keyword.replace(' ', '_'))

        # init crawler
        crawler = get_crawler(args, dir_name)
        crawler.crawl(keyword=keyword, max_num=args.N)
Create setting.txt in the same directory as img_collection.py. Each line of this file is one search keyword; the example below specifies three keywords, and you do not have to run the script separately for each of them.
setting.txt
Cat cat
Cat adult
Cat child
Enter as many search keywords as you like in setting.txt and run the command below.
--N: upper limit on the number of images to download per keyword (max 1000; in practice you will not get the full 1000 because of communication errors and missing pages)
--output_dir: output directory path
--engine: search engine, chosen from bing and baidu (google currently does not work)
python img_collection.py --N 1000 --output_dir D:\hogehoge\WebCrawler\out --engine bing
Even if you request 1000 images, only around 600 actually remain because of communication errors, but for a start you can collect a lot of cat pictures this way. The images are saved in a separate directory for each search keyword, but if you put them all into a single directory, duplicates are essentially merged away because their file names collide.
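As an illustration of that last point, here is a minimal sketch of merging the per-keyword directories into one folder. The destination path is hypothetical, and a file whose (URL-derived) name already exists in the destination is simply skipped as a duplicate.
import os
import shutil

# Hypothetical paths: adjust to your own output_dir.
src_root = r'D:\hogehoge\WebCrawler\out'
dst_dir = r'D:\hogehoge\WebCrawler\merged'
os.makedirs(dst_dir, exist_ok=True)

for keyword_dir in os.listdir(src_root):
    sub = os.path.join(src_root, keyword_dir)
    if not os.path.isdir(sub):
        continue
    for name in os.listdir(sub):
        dst = os.path.join(dst_dir, name)
        if os.path.exists(dst):
            # Same file name means same source URL, so treat it as a duplicate.
            continue
        shutil.copy2(os.path.join(sub, name), dst)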
Images gathered by a web crawler involve delicate rights issues, so be sure to handle them appropriately for your intended use.