・ I'm using Scrapy simply because I wanted to try Scrapy. ・ For a task this small, Beautiful Soup is honestly a better fit than Scrapy.
・ The images to download are on this page: follow the links and download all of the playing-card images.
$ pip install scrapy
...
..
.
$ scrapy version #Check version
Scrapy 1.8.0
2-1.
$ scrapy startproject download_images
The project directory has been created.
$ cd download_images
download_images $ tree
.
├── download_images
│ ├── __init__.py
│ ├── __pycache__
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg
Uncomment DOWNLOAD_DELAY in settings.py and set the interval between requests (in seconds).
If requests are sent at short intervals, it looks like a DoS attack, so be sure to set this.
(Some sites will block you.)
settings.py
...
..
.
DOWNLOAD_DELAY = 3
.
..
...
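As an aside (based on Scrapy's documented defaults, not on the original article): by default RANDOMIZE_DOWNLOAD_DELAY is enabled, so the actual wait is a random value between 0.5× and 1.5× of DOWNLOAD_DELAY, which makes the crawler look less mechanical:

```python
# settings.py (sketch) -- Scrapy defaults shown for reference
DOWNLOAD_DELAY = 3
# RANDOMIZE_DOWNLOAD_DELAY defaults to True: each request waits a random
# time between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```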
Also uncomment the lines that start with HTTPCACHE_.
This caches responses, which saves you from hitting the same pages over and over during trial and error. (With HTTPCACHE_EXPIRATION_SECS = 0, cached pages never expire.)
settings.py
.
..
...
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Create a template
$ scrapy genspider download_images_spider www.irasutoya.com
Run the command in the following directory:
download_images #← here
├── download_images
│ ├── ...
│ ├── ..
│ ├── .
│ └── spiders
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg
Afterwards, the spiders directory looks like this:
download_images
├── download_images
│ ├── ...
│ ├── ..
│ ├── .
│ └── spiders
│ ├── __init__.py
│ ├── __pycache__
│ │ └── __init__.cpython-37.pyc
│ └── download_images_spider.py
└── scrapy.cfg
download_images_spider.py
# -*- coding: utf-8 -*-
import os
import time
import urllib.request

import scrapy


class DownloadImagesSpiderSpider(scrapy.Spider):
    name = 'download_images_spider'
    allowed_domains = ['www.irasutoya.com']
    start_urls = [
        'https://www.irasutoya.com/2010/05/numbercardspade.html',    # Spades (numbers)
        'https://www.irasutoya.com/2017/05/facecardspade.html',      # Spades (face cards)
        'https://www.irasutoya.com/2010/05/numbercardheart.html',    # Hearts (numbers)
        'https://www.irasutoya.com/2017/05/facecardheart.html',      # Hearts (face cards)
        'https://www.irasutoya.com/2010/05/numbercarddiamond.html',  # Diamonds (numbers)
        'https://www.irasutoya.com/2017/05/facecarddiamond.html',    # Diamonds (face cards)
        'https://www.irasutoya.com/2010/05/numbercardclub.html',     # Clubs (numbers)
        'https://www.irasutoya.com/2017/05/facecardclub.html',       # Clubs (face cards)
        'https://www.irasutoya.com/2017/05/cardjoker.html',          # Jokers
        'https://www.irasutoya.com/2017/05/cardback.html',           # Card backs
    ]
    dest_dir = '/Users/~~~/images'  # Download destination directory

    def parse(self, response):
        # Depending on the web page, rewrite the CSS selector as appropriate.
        for image in response.css('div.separator img'):
            # URL of the image to download
            image_url = image.css('::attr(src)').extract_first().strip()
            # File name of the image to download
            file_name = image_url[image_url.rfind('/') + 1:]
            # Create the download destination directory if it does not exist
            if not os.path.exists(self.dest_dir):
                os.mkdir(self.dest_dir)
            # Download
            urllib.request.urlretrieve(image_url, os.path.join(self.dest_dir, file_name))
            time.sleep(1)  # Wait 1 second between downloads
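Incidentally, the file_name slicing in the spider can also be written with the standard library. A small sketch (the sample URL here is made up, not one of the actual image URLs):

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    # Like the slice in the spider, this returns the last path component
    # (everything after the final '/') of the URL's path.
    return os.path.basename(urlparse(url).path)


# Hypothetical example URL for illustration only
print(filename_from_url('https://example.com/img/card_spade_01.png'))  # card_spade_01.png
```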
Run $ scrapy crawl download_images_spider from the project root, and all the playing-card images are downloaded!