Download images from "Irasutoya" using Scrapy

Introduction

・ I used Scrapy simply because I wanted to try Scrapy.
・ For a task this small, Beautiful Soup would honestly be a better fit than Scrapy.

The target is this page. We download all of the playing card images found at the linked pages.

1. Install scrapy

$ pip install scrapy
...
..
.
$ scrapy version #Check version
Scrapy 1.8.0

2. Create a project

2-1. Generate the project

$ scrapy startproject download_images

This creates the following directory structure.

$ cd download_images
download_images $ tree
.
├── download_images
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

2-2. Set the interval between requests

Uncomment DOWNLOAD_DELAY in settings.py and set the interval between requests (in seconds). If requests are sent too frequently, the crawl looks like a DoS attack, so be sure to set this. (Some sites will block you.)

settings.py


...
..
.
DOWNLOAD_DELAY = 3
.
..
...

2-3. Enable the cache.

Just uncomment the lines that start with HTTPCACHE_. This saves you from repeatedly fetching the same pages during trial and error.

settings.py


.
..
...
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. Download the image

3-1. Create a Spider

Create a template

$ scrapy genspider download_images_spider www.irasutoya.com

Run the command from this directory:

download_images #← here
├── download_images
│   ├── ...
│   ├── ..
│   ├── .
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

The spiders directory then looks like this (the contents of the generated template are shown after the tree):

download_images
├── download_images
│   ├── ...
│   ├── ..
│   ├── .
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-37.pyc
│       └── download_images_spider.py
└── scrapy.cfg
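
For reference, the skeleton that genspider puts in download_images_spider.py should look roughly like this (the exact contents depend on your Scrapy version); we fill in start_urls and parse in the next step.

download_images_spider.py


# -*- coding: utf-8 -*-
import scrapy


class DownloadImagesSpiderSpider(scrapy.Spider):
    name = 'download_images_spider'
    allowed_domains = ['www.irasutoya.com']
    start_urls = ['http://www.irasutoya.com/']

    def parse(self, response):
        pass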

3-2. Edit the generated template file

download_images_spider.py


# -*- coding: utf-8 -*-
import os
import time
import urllib.request

import scrapy

from download_images.items import DownloadImagesItem


class DownloadImagesSpiderSpider(scrapy.Spider):
    name = 'download_images_spider'
    allowed_domains = ['www.irasutoya.com']
    start_urls = [
        'https://www.irasutoya.com/2010/05/numbercardspade.html',    # Spades (number cards)
        'https://www.irasutoya.com/2017/05/facecardspade.html',      # Spades (face cards)

        'https://www.irasutoya.com/2010/05/numbercardheart.html',    # Hearts (number cards)
        'https://www.irasutoya.com/2017/05/facecardheart.html',      # Hearts (face cards)

        'https://www.irasutoya.com/2010/05/numbercarddiamond.html',  # Diamonds (number cards)
        'https://www.irasutoya.com/2017/05/facecarddiamond.html',    # Diamonds (face cards)

        'https://www.irasutoya.com/2010/05/numbercardclub.html',     # Clubs (number cards)
        'https://www.irasutoya.com/2017/05/facecardclub.html',       # Clubs (face cards)

        'https://www.irasutoya.com/2017/05/cardjoker.html',          # Jokers

        'https://www.irasutoya.com/2017/05/cardback.html',           # Card backs
    ]
    dest_dir = '/Users/~~~/images'  # Download destination directory

    def parse(self, response):
        # Depending on the web page, you will need to change the CSS selector.
        for image in response.css('div.separator img'):
            # URL of the image to download
            image_url = image.css('::attr(src)').extract_first().strip()

            # File name of the image to download
            file_name = image_url[image_url.rfind('/') + 1:]

            # Create the download destination directory if it does not exist
            if not os.path.exists(self.dest_dir):
                os.mkdir(self.dest_dir)

            # Download the image
            urllib.request.urlretrieve(image_url, os.path.join(self.dest_dir, file_name))

            time.sleep(1)  # Wait 1 second between downloads

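Run the spider from the project root (the directory that contains scrapy.cfg):

$ scrapy crawl download_images_spider

If you want to check a CSS selector before crawling, scrapy shell is handy. A quick sketch using one of the pages above and the selector from parse():

$ scrapy shell 'https://www.irasutoya.com/2017/05/cardjoker.html'
>>> response.css('div.separator img::attr(src)').extract_first()
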
All the playing card images have been downloaded!
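
By the way, Scrapy also ships an ImagesPipeline that lets Scrapy itself download the images instead of calling urllib inside parse(). A minimal sketch, assuming Pillow is installed and that IMAGES_STORE points at your download directory:

settings.py


ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/Users/~~~/images'

download_images_spider.py


    def parse(self, response):
        # ImagesPipeline downloads every URL listed in 'image_urls' into IMAGES_STORE
        yield {'image_urls': response.css('div.separator img::attr(src)').extract()}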
