We collect the store name, latitude/longitude, and available services from the KFC store locator. Since the store information is rendered by JavaScript, Splash is placed between Scrapy and the site to render the pages.
(Reference) List of precautions for web scraping
AWS EC2: Amazon Linux (2016.09 release), t2.micro, Python 2.7.12
python2.7
sudo yum groupinstall "Development tools"
sudo yum install python-devel libffi-devel openssl-devel libxml2-devel libxslt-devel
sudo pip install scrapy
sudo pip install service_identity # already installed by default on Amazon Linux, so this may be unnecessary
sudo yum -y install docker-io
sudo service docker start
sudo chkconfig docker on
sudo pip install scrapy-splash
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
Splash simply runs as a Docker container. Easy.
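Before wiring Splash into Scrapy, it is worth confirming that the container answers. A minimal sketch, assuming the `requests` package is installed (`sudo pip install requests`); the search URL here is only an example:
python2.7
# Quick check that the Splash container is reachable.
# render.html returns the page HTML after JavaScript has executed.
import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://www.kfc.co.jp/', 'wait': 0.5},
)
print(resp.status_code)  # 200 if Splash rendered the page
print(len(resp.text))    # size of the rendered HTML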
(Reference sites) [Note on how to get started with the Python crawler "Scrapy"](http://sechiro.hatenablog.com/entry/2016/04/02/Python%E8%A3%BD%E3%82%AF%E3%83%AD%E3%83%BC%E3%83%A9%E3%83%BC%E3%80%8CScrapy%E3%80%8D%E3%81%AE%E5%A7%8B%E3%82%81%E6%96%B9%E3%83%A1%E3%83%A2), "Easy scraping of JavaScript pages using scrapy-splash", GitHub (scrapy-splash)
Create templates for the project and the spider.
python2.7
export PRJ_NAME=KFCShopSpider
scrapy startproject ${PRJ_NAME}
cd ./${PRJ_NAME}/${PRJ_NAME}/spiders
scrapy genspider ${PRJ_NAME} kfc.co.jp
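The generated layout should look roughly like this (abridged; the spider created by genspider also goes under spiders/):
KFCShopSpider/
├── scrapy.cfg
└── KFCShopSpider/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py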
Define the items you want to extract. This time we extract the store name, address, map_url (latitude and longitude), and the presence or absence of various services.
~/KFCShopSpider/KFCShopSpider/items.py
# -*- coding: utf-8 -*-
import scrapy


class KFCShopspiderItem(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
    map_url = scrapy.Field()
    DriveThrough = scrapy.Field()
    Parking = scrapy.Field()
    Delivery = scrapy.Field()
    Wlan = scrapy.Field()
Be careful not to put unnecessary load on the target site: setting USER_AGENT, ROBOTSTXT_OBEY, and DOWNLOAD_DELAY is a must.
~/KFCShopSpider/KFCShopSpider/settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'KFCShopSpider'
SPIDER_MODULES = ['KFCShopSpider.spiders']
NEWSPIDER_MODULE = 'KFCShopSpider.spiders'
USER_AGENT = 'KFCShopSpider (+http://www.yourdomain.com)'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
~/KFCShopSpider/KFCShopSpider/spiders/KFCShop_spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest
from ..items import KFCShopspiderItem


class KFCShopSpider(CrawlSpider):
    name = "KFCShopSpider"
    allowed_domains = ["kfc.co.jp"]
    start_urls = []
    shop_url_home = 'http://www.kfc.co.jp/search/fuken.html?t=attr_con&kencode='
    # Start from the search result pages of the 47 prefectures.
    for i in range(1, 48):
        prfct_id = '{0:02d}'.format(i)
        url = shop_url_home + prfct_id
        start_urls.append(url)

    # Request the JavaScript-rendered response via Splash.
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                args={'wait': 0.5},
                                )

    def parse(self, response):
        # Select each store entry (= list element).
        stores = response.xpath('//ul[@id="outShop"]/li')
        for store in stores:
            item = KFCShopspiderItem()
            # Use relative paths from each store element.
            item['address'] = store.xpath('./span[@class="scAddress"]/text()[1]').extract()
            item['map_url'] = store.xpath('./ul/li[2]/div/a/@href').extract()
            item['DriveThrough'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check04")]/@alt').extract()
            item['Parking'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check05")]/@alt').extract()
            item['Delivery'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check02")]/@alt').extract()
            item['Wlan'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check03")]/@alt').extract()
            yield item

        # Get the 'next' link of each search result page and call the parse method again.
        next_page = response.xpath('//li[@class="next"]/a/@href')
        if next_page:
            # 'next' appears both above and below the store list, so take only the first element.
            url = response.urljoin(next_page[0].extract())
            yield SplashRequest(url, self.parse)
Chrome's developer tools (F12) are convenient for checking XPaths. In the Elements view, right-click the element you want > Copy > Copy XPath.
python2.7
scrapy shell "http://localhost:8050/render.html?url={The url you want to render}"
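Inside the shell, `response` holds the Splash-rendered page, so the XPath expressions can be tried out before putting them in the spider. A short sketch, assuming the rendered page contains the same `outShop` list as above:
python2.7
# Run inside the scrapy shell session started above.
stores = response.xpath('//ul[@id="outShop"]/li')
print(len(stores))  # number of stores found on this page
# First store's address, using the same relative XPath as the spider.
print(stores[0].xpath('./span[@class="scAddress"]/text()[1]').extract())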
We were able to obtain information on over 1,000 stores in about 10 minutes.
python2.7
scrapy crawl ${PRJ_NAME} -o hoge.csv
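The latitude and longitude can then be pulled out of the map_url column afterwards. The exact URL format depends on the site, but assuming the link carries a "lat,lon" pair somewhere in its query string, a rough post-processing sketch using only the standard library could look like this (adjust the regex to the actual format):
python2.7
# -*- coding: utf-8 -*-
# Extract latitude/longitude from the map_url column of the crawled CSV.
import csv
import re

# Matches a "lat,lon" pair of decimal numbers (assumption about the URL format).
COORD_RE = re.compile(r'(-?\d{1,3}\.\d+),(-?\d{1,3}\.\d+)')

with open('hoge.csv') as f:
    for row in csv.DictReader(f):
        m = COORD_RE.search(row.get('map_url', ''))
        if m:
            lat, lon = m.group(1), m.group(2)
            print('%s %s %s' % (row['address'], lat, lon))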