I used icrawler to collect images for machine learning, so here is a short introduction.
icrawler is a framework for collecting images by web crawling in Python. You can collect images with just a few lines of code.
pip
$ pip install icrawler
anaconda
$ conda install -c hellock icrawler
For example, to collect 100 cat images from Bing Image Search:

from icrawler.builtin import BingImageCrawler
crawler = BingImageCrawler(storage={"root_dir": './images'})
crawler.crawl(keyword='Cat', max_num=100)
--Specify the directory where you want to save the images in root_dir.
--Specify the keyword for the images you want to collect in keyword.
--Specify the maximum number of images to collect in max_num.
--You can change the BingImageCrawler part to another image crawler; Google and Flickr can also be used (a sketch follows this list).
--Available crawlers → https://icrawler.readthedocs.io/en/latest/builtin.html
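As a minimal sketch based on the icrawler documentation (the directory names, the 'Cat' keyword, and 'your_apikey' are placeholders, not from the original article), the other built-in crawlers use the same interface:

from icrawler.builtin import GoogleImageCrawler, FlickrImageCrawler

# Google: only the class name changes from the Bing example above.
google_crawler = GoogleImageCrawler(storage={'root_dir': './images_google'})
google_crawler.crawl(keyword='Cat', max_num=100)

# Flickr additionally requires a Flickr API key.
flickr_crawler = FlickrImageCrawler('your_apikey',
                                    storage={'root_dir': './images_flickr'})
flickr_crawler.crawl(max_num=100, tags='cat')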
json.decoder.JSONDecodeError when using Google

When this error occurs, edit google.py in the installed package. In an Anaconda environment the file is at, for example:

C:\Users\hoge\anaconda3\envs\env1\Lib\site-packages\icrawler\builtin\google.py
--If you installed with pip, you can likewise find the location of the package and follow the path from there; a quick way to locate it is sketched below.
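This snippet is not from the original article, but printing the module path from Python finds the file regardless of how icrawler was installed:

# Prints the full path of the installed google.py
from icrawler.builtin import google
print(google.__file__)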
Rewrite the parse method of google.py as follows:

--The parse method is around line 144.

def parse(self, response):
    soup = BeautifulSoup(
        response.content.decode('utf-8', 'ignore'), 'lxml')
    # image_divs = soup.find_all('script')
    image_divs = soup.find_all(name='script')
    for div in image_divs:
        # txt = div.text
        txt = str(div)
        # if not txt.startswith('AF_initDataCallback'):
        if 'AF_initDataCallback' not in txt:
            continue
        if 'ds:0' in txt or 'ds:1' not in txt:
            continue
        # txt = re.sub(r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:function\(\){return (.+)}}\);?$",
        #              "\\2", txt, 0, re.DOTALL)
        # meta = json.loads(txt)
        # data = meta[31][0][12][2]
        # uris = [img[1][3][0] for img in data if img[0] == 1]
        # Pull image URLs straight out of the script text instead of parsing JSON
        uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
        return [{'file_url': uri} for uri in uris]
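After saving the edited google.py, the Google crawler should run without the error. A minimal smoke test (the keyword and directory are placeholder examples, not from the original article):

import os
from icrawler.builtin import GoogleImageCrawler

# If the patch worked, images are downloaded instead of raising JSONDecodeError.
crawler = GoogleImageCrawler(storage={'root_dir': './images_google'})
crawler.crawl(keyword='Cat', max_num=10)
print(len(os.listdir('./images_google')), 'files downloaded')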
References: https://github.com/hellock/icrawler / https://github.com/hellock/icrawler/issues/65