The Python library icrawler is handy for collecting image data for machine learning. As the official example shows, it can be installed and used with very little code:
pip install icrawler
or
conda install -c hellock icrawler
from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)
With just these three lines you can already download images.
To make it easier to use, I added the following two features:
① specify multiple search keywords (from an external file)
② avoid duplicate images
For ①, the search keywords are listed one per line in an external text file, which the script reads. For ②, the image's URL is encoded into the file name, so when the same image is saved again its name collides with the existing file and the duplicate is effectively skipped.
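To illustrate ②, here is a minimal sketch of how an image URL maps to a file name in the downloader shown later (the URL below is just a made-up example): the same URL always produces the same name, so a repeated download targets the same file.
import base64
from six.moves.urllib.parse import urlparse

# Hypothetical example URL, for illustration only.
url = 'https://example.com/images/cat_001.jpg'

# Take the path part of the URL and base64-encode it, as the downloader below does.
url_path = urlparse(url)[2]  # '/images/cat_001.jpg'
name = base64.b64encode(url_path.encode()).decode()
print(name + '.jpg')  # identical URLs always yield identical file names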
This works together with a small change to icrawler's built-in storage backend (filesystem.py): the file write is wrapped in try/except so that a FileNotFoundError while saving an image is skipped instead of being raised.
filesystem.py
# Imports as in icrawler's own filesystem.py.
import os
import os.path as osp

import six

from .base import BaseStorage


class FileSystem(BaseStorage):
    """Use filesystem as storage backend.

    The id is filename and data is stored as text files or binary files.
    """

    def __init__(self, root_dir):
        self.root_dir = root_dir

    def write(self, id, data):
        filepath = osp.join(self.root_dir, id)
        folder = osp.dirname(filepath)
        if not osp.isdir(folder):
            try:
                os.makedirs(folder)
            except OSError:
                pass
        mode = 'w' if isinstance(data, six.string_types) else 'wb'
        # Original implementation, kept for reference:
        # with open(filepath, mode) as fout:
        #     fout.write(data)
        try:
            with open(filepath, mode) as fout:
                fout.write(data)
        except FileNotFoundError:
            # Skip a file that cannot be written instead of raising.
            pass
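If you apply this change directly to your installed copy of icrawler, you can locate the file to edit with the small sketch below. It assumes the FileSystem backend lives in icrawler.storage.filesystem, which matches the class above, but check your installed version.
# Print the path of the installed storage backend so the patch above can be applied.
import icrawler.storage.filesystem as fs_module
print(fs_module.__file__)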
With that in place, here is the main script.
img_collection.py
import argparse
import base64
import os

from six.moves.urllib.parse import urlparse

from icrawler import ImageDownloader
from icrawler.builtin import BaiduImageCrawler, BingImageCrawler, GoogleImageCrawler

parser = argparse.ArgumentParser(description='img_collection')
parser.add_argument('--output_dir', default="", type=str, help='output directory path')
parser.add_argument('--N', default=10, type=int, help='max number of images per keyword')
parser.add_argument('--engine', choices=['baidu', 'bing', 'google'], default="bing", type=str, help='search engine')
args = parser.parse_args()


class Base64NameDownloader(ImageDownloader):
    """Name each file after the base64-encoded URL path, so the same image
    always gets the same file name."""

    def get_filename(self, task, default_ext):
        url_path = urlparse(task['file_url'])[2]
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in [
                    'jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm'
            ]:
                extension = default_ext
        else:
            extension = default_ext
        # works for python 3
        filename = base64.b64encode(url_path.encode()).decode()
        return '{}.{}'.format(filename, extension)


def get_crawler(args, dir_name):
    if args.engine == "baidu":
        crawler = BaiduImageCrawler(downloader_cls=Base64NameDownloader,
                                    storage={'root_dir': dir_name})
    elif args.engine == "bing":
        crawler = BingImageCrawler(downloader_cls=Base64NameDownloader,
                                   storage={'root_dir': dir_name})
    elif args.engine == "google":  # currently does not work
        crawler = GoogleImageCrawler(storage={'root_dir': dir_name})
    return crawler


if __name__ == "__main__":
    # Read the keyword list (one search word per line).
    with open('./setting.txt', mode='r', encoding="utf_8") as f:
        read_data = list(f)

    print("SELECTED ENGINE : " + args.engine)

    for i in range(len(read_data)):
        keyword = read_data[i].replace('\n', '')
        print("SEARCH WORD : " + keyword)
        print("NUM IMAGES : " + str(args.N))
        # One output directory per keyword (spaces replaced with underscores).
        dir_name = os.path.join(args.output_dir, keyword.replace(' ', '_'))

        # init crawler
        crawler = get_crawler(args, dir_name)
        crawler.crawl(keyword=keyword, max_num=args.N)
Create setting.txt in the same directory as img_collection.py. Each line of this file is one search keyword; the example below specifies three keywords, and you do not have to run the script separately for each of them.
setting.txt
Cat cat
Cat adult
Cat child
Enter as many search keywords as you like in setting.txt and run the command below.
--N: upper limit on the number of images to download per keyword (max 1000; in practice you will not get the full 1000 because of communication errors and missing pages)
--output_dir: output directory path
--engine: search engine, chosen from bing and baidu (google currently does not work)
python img_collection.py --N 1000 --output_dir D:\hogehoge\WebCrawler\out --engine bing
Even if you request 1000 images, only around 600 actually remain because of communication errors, but for a start you can collect a lot of cat pictures this way. The images are saved in a separate directory for each search keyword, but if you put them all into a single directory, duplicates are essentially merged away because their file names collide.
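As an illustration of that last point, here is a minimal sketch of merging the per-keyword directories into one folder. The destination path is hypothetical, and a file whose (URL-derived) name already exists in the destination is simply skipped as a duplicate.
import os
import shutil

# Hypothetical paths: adjust to your own output_dir.
src_root = r'D:\hogehoge\WebCrawler\out'
dst_dir = r'D:\hogehoge\WebCrawler\merged'
os.makedirs(dst_dir, exist_ok=True)

for keyword_dir in os.listdir(src_root):
    sub = os.path.join(src_root, keyword_dir)
    if not os.path.isdir(sub):
        continue
    for name in os.listdir(sub):
        dst = os.path.join(dst_dir, name)
        if os.path.exists(dst):
            # Same file name means same source URL, so treat it as a duplicate.
            continue
        shutil.copy2(os.path.join(sub, name), dst)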
Images gathered by a web crawler involve delicate rights issues, so be sure to handle them appropriately for your intended use.