This article walks through, in an easy-to-follow way, how to run Scrapy, a framework for web scraping. I hope you find it useful as a reference.
Reference: Python - Create a crawler with Scrapy https://qiita.com/naka-j/items/4b2136b7b5a4e2432da8
Time required: 15 minutes
Run the following pip command in the terminal to install Scrapy:
pip install scrapy
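If the installation succeeded, the scrapy command should now be available on your PATH; you can check with, for example:
scrapy version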
Then move to the directory where you want to create the Scrapy project and run the following:
scrapy startproject sake
Since we will be scraping a website about sake, the project is named "sake". The following folder structure is then created under the current directory.
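The layout generated by startproject is roughly as follows (the exact file names may vary slightly with your Scrapy version):
sake/
├── scrapy.cfg
└── sake/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py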
The files above are not enough to scrape a website by themselves, so run the following command to create a spider file in the spiders directory.
#scrapy genspider <file name> <Web URL you want to scrape>
scrapy genspider scrapy_sake https://www.saketime.jp/ranking/
Then you can see that a file called "scrapy_sake.py" is created in the spiders directory. The contents of the created file are as follows.
sake/sake/spiders/scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    allowed_domains = ['https://www.saketime.jp/ranking/']
    start_urls = ['http://https://www.saketime.jp/ranking/']

    def parse(self, response):
        pass
As I will explain in detail later, this "def parse" part is where most of the coding happens. Before coding it, let's first check whether the page can be fetched correctly. Add a print statement to the "def parse" part to see what was retrieved.
sake/sake/spiders/scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    allowed_domains = ['https://www.saketime.jp/ranking/']
    start_urls = ['http://https://www.saketime.jp/ranking/']

    def parse(self, response):
        # Delete pass and add a print statement
        print(response)
If you then run the following command, quite a lot of output is returned, but within it you can confirm that the HTML has indeed been fetched.
Execution command
#scrapy crawl <file name>
scrapy crawl scrapy_sake
Output
<li class="brand_review clearfix">
<div>
<p>
Iso's pride, special brewed raw sake
Click here for today's sake, Iso's proud special brewed raw sake!
Rice... <br>
<span class="brand_review_user">
by
<span>Sue</span>
<span>
<span class="review-star">★</span>
<span>4.5</span>
</span>
<span class="reviewtime">
<span>March 23, 2020</span>
</span>
</span>
</p>
</div>
</li>
</ul>
</a>
</div>
:
:
:
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 4, 9, 3, 23, 24, 461847)}
2020-04-09 12:23:26 [scrapy.core.engine] INFO: Spider closed (finished)
Next, let's extract only the necessary information from here!
Basically, there are only two files you need to edit in Scrapy: items.py and the spider file (scrapy_sake.py).
Let's edit items.py first. When you first open it, it looks like the following.
sake/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class SakeItem(scrapy.Item):
    pass
Register the pieces of information you want to scrape as fields of this class, using any names you like.
sake/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class SakeItem(scrapy.Item):
    # <name of the information you want to get (any name)> = scrapy.Field()
    prefecture_maker = scrapy.Field()
    prefecture = scrapy.Field()
    maker = scrapy.Field()
    brand = scrapy.Field()
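As a side note, a scrapy.Item such as SakeItem behaves much like a Python dict, which is how it will be used in the spider below; a minimal sketch with a placeholder value:
item = SakeItem()
item["brand"] = "example brand"   # placeholder value, not taken from the site
print(item)                       # prints a dict-like representation: {'brand': 'example brand'}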
This is the end of the description of items.py. Next, let's move on to coding scrapy_sake.py.
The completed version is as follows. You can see that the inside of def parse() is much richer than the one we saw earlier.
sake/sake/spiders/scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
# Don't forget to import items.py
from sake.items import SakeItem


class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    #allowed_domains = ['ja.wikipedia.org']
    start_urls = ['https://www.saketime.jp/ranking/']

    def parse(self, response):
        items = []
        # The sake information is stored in <li class="clearfix"> elements
        sakes = response.css("li.clearfix")
        # There are multiple li.clearfix elements on the page; look at each one
        for sake in sakes:
            # Declare a SakeItem object defined in items.py
            item = SakeItem()
            item["prefecture_maker"] = sake.css("div.col-center p.brand_info::text").extract_first()
            # For markup like <div class="headline clearfix">, join the classes with a dot: headline.clearfix
            item["brand"] = sake.css("div.headline.clearfix h2 a span::text").extract_first()
            # Cleanse the acquired data
            if (item["prefecture_maker"] is not None) or (item["brand"] is not None):
                # Delete \n and spaces
                item["prefecture_maker"] = item["prefecture_maker"].replace(' ', '').replace('\n', '')
                # Separate prefecture and maker
                item["prefecture"] = item["prefecture_maker"].split('|')[0]
                item["maker"] = item["prefecture_maker"].split('|')[1]
                items.append(item)
        print(items)

        # Follow pagination with recursive processing
        # Get the href of the <a rel="next"> element
        next_page = response.css('a[rel="next"]::attr(href)').extract_first()
        if next_page is not None:
            # Convert the URL to an absolute path if it is a relative path
            next_page = response.urljoin(next_page)
            # yield a Request; the next page is fetched and the parse above runs again on it
            yield scrapy.Request(next_page, callback=self.parse)
When this is executed, the output looks like the following.
:
:
:
2020-04-10 16:52:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.saketime.jp/ranking/page:110/> from <GET https://www.saketime.jp/ranking/page:110>
2020-04-10 16:52:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.saketime.jp/ranking/page:110/> (referer: https://www.saketime.jp/ranking/page:109/)
[{'brand': 'Orochi's tongue',
'maker': 'Kisuki Brewery',
'prefecture': 'Shimane',
'prefecture_maker': 'Shimane|Kisuki Brewery'}, {'brand': '禱 and Minoru',
'maker': 'Fukumitsuya',
'prefecture': 'Ishikawa',
'prefecture_maker': 'Ishikawa|Fukumitsuya'}, {'brand': 'Kanazawa beauty',
'maker': 'Fukumitsuya',
'prefecture': 'Ishikawa',
'prefecture_maker': 'Ishikawa|Fukumitsuya'}, {'brand': 'Jinkuro',
'maker': 'Hokusetsu Sake Brewery',
'prefecture': 'Niigata',
'prefecture_maker': 'Niigata|Hokusetsu Sake Brewery'}, {'brand': 'Kenroku Sakura',
'maker': 'Nakamura Sake Brewery',
'prefecture': 'Ishikawa',
'prefecture_maker': 'Ishikawa|Nakamura Sake Brewery'}, {'brand': 'birth',
'maker': 'Tohoku Meijo',
'prefecture': 'Yamagata',
'prefecture_maker': 'Yamagata|Tohoku Meijo'}, {'brand': 'SUMMERGODDESS',
'maker': 'Mana Tsuru Sake Brewery',
'prefecture': 'Fukui',
:
:
:
'scheduler/dequeued/memory': 221,
'scheduler/enqueued': 221,
'scheduler/enqueued/memory': 221,
'start_time': datetime.datetime(2020, 4, 10, 7, 51, 13, 756973)}
2020-04-10 16:53:00 [scrapy.core.engine] INFO: Spider closed (finished)
We got 110 pages' worth of sake information, printed in a JSON-like format, in about 20 seconds. Very convenient.
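Note that the spider above only prints the items. If you want Scrapy itself to write the results to a file, one common approach (not used in this article) is to yield each item from parse() instead of collecting them in a list, and then use the built-in feed export from the command line:
scrapy crawl scrapy_sake -o sake.json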
Try scraping the sites you are interested in to get information.
As a basic tip for reading the HTML of the site you want to scrape: in Chrome on macOS you can open the developer tools with Cmd + Option + I. You can also press Cmd + Shift + C and then click an element on the page to see where it appears in the HTML code.
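You can also experiment with CSS selectors from the terminal before writing them into the spider, using Scrapy's interactive shell; a minimal sketch against the same ranking page:
scrapy shell "https://www.saketime.jp/ranking/"
>>> response.css("li.clearfix div.headline.clearfix h2 a span::text").extract_first()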