I used icrawler to collect images for machine learning, so here is a short introduction.
icrawler is a framework for collecting images by web crawling in Python. You can collect images with just a few lines of code.
pip
$ pip install icrawler
anaconda
$ conda install -c hellock icrawler
For example, to collect 100 cat images from Bing Image Search:

from icrawler.builtin import BingImageCrawler
crawler = BingImageCrawler(storage={"root_dir": './images'})
crawler.crawl(keyword='Cat', max_num=100)
--Specify the directory where you want to save the images in root_dir.
--Specify the keyword for the images you want to collect in keyword.
--Specify the maximum number of images to collect in max_num.
--You can change the BingImageCrawler part to another image crawler; Google and Flickr can also be used (a sketch follows this list).
--Available crawlers → https://icrawler.readthedocs.io/en/latest/builtin.html
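As a minimal sketch based on the icrawler documentation (the directory names, the 'Cat' keyword, and 'your_apikey' are placeholders, not from the original article), the other built-in crawlers use the same interface:

from icrawler.builtin import GoogleImageCrawler, FlickrImageCrawler

# Google: only the class name changes from the Bing example above.
google_crawler = GoogleImageCrawler(storage={'root_dir': './images_google'})
google_crawler.crawl(keyword='Cat', max_num=100)

# Flickr additionally requires a Flickr API key.
flickr_crawler = FlickrImageCrawler('your_apikey',
                                    storage={'root_dir': './images_flickr'})
flickr_crawler.crawl(max_num=100, tags='cat')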
json.decoder.JSONDecodeError when using Google

When this error occurs, edit google.py in the installed package. In an Anaconda environment the file is at, for example:

C:\Users\hoge\anaconda3\envs\env1\Lib\site-packages\icrawler\builtin\google.py
--If you installed with pip, you can likewise find the location of the package and follow the path from there; a quick way to locate it is sketched below.
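This snippet is not from the original article, but printing the module path from Python finds the file regardless of how icrawler was installed:

# Prints the full path of the installed google.py
from icrawler.builtin import google
print(google.__file__)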
Rewrite the parse method of google.py as follows:

--The parse method is around line 144.

def parse(self, response):
    soup = BeautifulSoup(
        response.content.decode('utf-8', 'ignore'), 'lxml')
    # image_divs = soup.find_all('script')
    image_divs = soup.find_all(name='script')
    for div in image_divs:
        # txt = div.text
        txt = str(div)
        # if not txt.startswith('AF_initDataCallback'):
        if 'AF_initDataCallback' not in txt:
            continue
        if 'ds:0' in txt or 'ds:1' not in txt:
            continue
        # txt = re.sub(r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:function\(\){return (.+)}}\);?$",
        #              "\\2", txt, 0, re.DOTALL)
        # meta = json.loads(txt)
        # data = meta[31][0][12][2]
        # uris = [img[1][3][0] for img in data if img[0] == 1]
        # Pull image URLs straight out of the script text instead of parsing JSON
        uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
        return [{'file_url': uri} for uri in uris]
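After saving the edited google.py, the Google crawler should run without the error. A minimal smoke test (the keyword and directory are placeholder examples, not from the original article):

import os
from icrawler.builtin import GoogleImageCrawler

# If the patch worked, images are downloaded instead of raising JSONDecodeError.
crawler = GoogleImageCrawler(storage={'root_dir': './images_google'})
crawler.crawl(keyword='Cat', max_num=10)
print(len(os.listdir('./images_google')), 'files downloaded')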
References: https://github.com/hellock/icrawler / https://github.com/hellock/icrawler/issues/65