・ I'm using Scrapy simply because I wanted to try Scrapy. ・ For a task this small, Beautiful Soup is honestly a better fit than Scrapy.
・ The images to download are on this page: follow the links and download all of the playing-card images.
$ pip install scrapy
...
..
.
$ scrapy version #Check version
Scrapy 1.8.0
2-1.
$ scrapy startproject download_images
The project directory has been created.
$ cd download_images
download_images $ tree
.
├── download_images
│ ├── __init__.py
│ ├── __pycache__
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg
Uncomment DOWNLOAD_DELAY in settings.py and set the interval between requests (in seconds).
If requests are sent at short intervals, it looks like a DoS attack, so be sure to set this.
(Some sites will block you.)
settings.py
...
..
.
DOWNLOAD_DELAY = 3
.
..
...
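As an aside (based on Scrapy's documented defaults, not on the original article): by default RANDOMIZE_DOWNLOAD_DELAY is enabled, so the actual wait is a random value between 0.5× and 1.5× of DOWNLOAD_DELAY, which makes the crawler look less mechanical:

```python
# settings.py (sketch) -- Scrapy defaults shown for reference
DOWNLOAD_DELAY = 3
# RANDOMIZE_DOWNLOAD_DELAY defaults to True: each request waits a random
# time between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```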
Also uncomment the lines that start with HTTPCACHE_.
This caches responses, which saves you from hitting the same pages over and over during trial and error. (With HTTPCACHE_EXPIRATION_SECS = 0, cached pages never expire.)
settings.py
.
..
...
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Create a template
$ scrapy genspider download_images_spider www.irasutoya.com
Run the command in the following directory:
download_images #← here
├── download_images
│ ├── ...
│ ├── ..
│ ├── .
│ └── spiders
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg
Afterwards, the spiders directory looks like this:
download_images
├── download_images
│ ├── ...
│ ├── ..
│ ├── .
│ └── spiders
│ ├── __init__.py
│ ├── __pycache__
│ │ └── __init__.cpython-37.pyc
│ └── download_images_spider.py
└── scrapy.cfg
download_images_spider.py
# -*- coding: utf-8 -*-
import os
import time
import urllib.request

import scrapy


class DownloadImagesSpiderSpider(scrapy.Spider):
    name = 'download_images_spider'
    allowed_domains = ['www.irasutoya.com']
    start_urls = [
        'https://www.irasutoya.com/2010/05/numbercardspade.html',    # Spades (numbers)
        'https://www.irasutoya.com/2017/05/facecardspade.html',      # Spades (face cards)
        'https://www.irasutoya.com/2010/05/numbercardheart.html',    # Hearts (numbers)
        'https://www.irasutoya.com/2017/05/facecardheart.html',      # Hearts (face cards)
        'https://www.irasutoya.com/2010/05/numbercarddiamond.html',  # Diamonds (numbers)
        'https://www.irasutoya.com/2017/05/facecarddiamond.html',    # Diamonds (face cards)
        'https://www.irasutoya.com/2010/05/numbercardclub.html',     # Clubs (numbers)
        'https://www.irasutoya.com/2017/05/facecardclub.html',       # Clubs (face cards)
        'https://www.irasutoya.com/2017/05/cardjoker.html',          # Jokers
        'https://www.irasutoya.com/2017/05/cardback.html',           # Card backs
    ]
    dest_dir = '/Users/~~~/images'  # Download destination directory

    def parse(self, response):
        # Depending on the web page, rewrite the CSS selector as appropriate.
        for image in response.css('div.separator img'):
            # URL of the image to download
            image_url = image.css('::attr(src)').extract_first().strip()
            # File name of the image to download
            file_name = image_url[image_url.rfind('/') + 1:]
            # Create the download destination directory if it does not exist
            if not os.path.exists(self.dest_dir):
                os.mkdir(self.dest_dir)
            # Download
            urllib.request.urlretrieve(image_url, os.path.join(self.dest_dir, file_name))
            time.sleep(1)  # Wait 1 second between downloads
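Incidentally, the file_name slicing in the spider can also be written with the standard library. A small sketch (the sample URL here is made up, not one of the actual image URLs):

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    # Like the slice in the spider, this returns the last path component
    # (everything after the final '/') of the URL's path.
    return os.path.basename(urlparse(url).path)


# Hypothetical example URL for illustration only
print(filename_from_url('https://example.com/img/card_spade_01.png'))  # card_spade_01.png
```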
Run $ scrapy crawl download_images_spider from the project root, and all the playing-card images are downloaded!