We collect the store name, latitude/longitude, and available services from the KFC store locator. Since the store information is rendered by JavaScript, Splash is placed between Scrapy and the site to render the pages.
(Reference) List of precautions for web scraping
AWS EC2: Amazon Linux (2016.09 release), t2.micro, Python 2.7.12
python2.7
sudo yum groupinstall "Development tools"
sudo yum install python-devel libffi-devel openssl-devel libxml2-devel libxslt-devel
sudo pip install scrapy
sudo pip install service_identity # already installed by default on Amazon Linux, so this may be unnecessary
sudo yum -y install docker-io
sudo service docker start
sudo chkconfig docker on
sudo pip install scrapy-splash
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
Splash simply runs as a Docker container. Easy.
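Before wiring Splash into Scrapy, it is worth confirming that the container answers. A minimal sketch, assuming the `requests` package is installed (`sudo pip install requests`); the search URL here is only an example:
python2.7
# Quick check that the Splash container is reachable.
# render.html returns the page HTML after JavaScript has executed.
import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://www.kfc.co.jp/', 'wait': 0.5},
)
print(resp.status_code)  # 200 if Splash rendered the page
print(len(resp.text))    # size of the rendered HTML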
(Reference sites) [Note on how to get started with the Python crawler "Scrapy"](http://sechiro.hatenablog.com/entry/2016/04/02/Python%E8%A3%BD%E3%82%AF%E3%83%AD%E3%83%BC%E3%83%A9%E3%83%BC%E3%80%8CScrapy%E3%80%8D%E3%81%AE%E5%A7%8B%E3%82%81%E6%96%B9%E3%83%A1%E3%83%A2), "Easy scraping of JavaScript pages using scrapy-splash", GitHub (scrapy-splash)
Create templates for the project and the spider.
python2.7
export PRJ_NAME=KFCShopSpider
scrapy startproject ${PRJ_NAME}
cd ./${PRJ_NAME}/${PRJ_NAME}/spiders
scrapy genspider ${PRJ_NAME} kfc.co.jp
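The generated layout should look roughly like this (abridged; the spider created by genspider also goes under spiders/):
KFCShopSpider/
├── scrapy.cfg
└── KFCShopSpider/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py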
Define the items you want to extract. This time we extract the store name, address, map_url (latitude and longitude), and the presence or absence of various services.
~/KFCShopSpider/KFCShopSpider/items.py
# -*- coding: utf-8 -*-
import scrapy


class KFCShopspiderItem(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
    map_url = scrapy.Field()
    DriveThrough = scrapy.Field()
    Parking = scrapy.Field()
    Delivery = scrapy.Field()
    Wlan = scrapy.Field()
Be careful not to put unnecessary load on the target site: setting USER_AGENT, ROBOTSTXT_OBEY, and DOWNLOAD_DELAY is a must.
~/KFCShopSpider/KFCShopSpider/settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'KFCShopSpider'
SPIDER_MODULES = ['KFCShopSpider.spiders']
NEWSPIDER_MODULE = 'KFCShopSpider.spiders'
USER_AGENT = 'KFCShopSpider (+http://www.yourdomain.com)'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
~/KFCShopSpider/KFCShopSpider/spiders/KFCShop_spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest
from ..items import KFCShopspiderItem


class KFCShopSpider(CrawlSpider):
    name = "KFCShopSpider"
    allowed_domains = ["kfc.co.jp"]
    start_urls = []
    shop_url_home = 'http://www.kfc.co.jp/search/fuken.html?t=attr_con&kencode='
    # Start from the search result pages of the 47 prefectures.
    for i in range(1, 48):
        prfct_id = '{0:02d}'.format(i)
        url = shop_url_home + prfct_id
        start_urls.append(url)

    # Request the JavaScript-rendered response via Splash.
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                args={'wait': 0.5},
                                )

    def parse(self, response):
        # Select each store entry (= list element).
        stores = response.xpath('//ul[@id="outShop"]/li')
        for store in stores:
            item = KFCShopspiderItem()
            # Use relative paths from each store element.
            item['address'] = store.xpath('./span[@class="scAddress"]/text()[1]').extract()
            item['map_url'] = store.xpath('./ul/li[2]/div/a/@href').extract()
            item['DriveThrough'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check04")]/@alt').extract()
            item['Parking'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check05")]/@alt').extract()
            item['Delivery'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check02")]/@alt').extract()
            item['Wlan'] = store.xpath('./span[@class="scIcon"]/img[contains(./@src,"check03")]/@alt').extract()
            yield item

        # Get the 'next' link of each search result page and call the parse method again.
        next_page = response.xpath('//li[@class="next"]/a/@href')
        if next_page:
            # 'next' appears both above and below the store list, so take only the first element.
            url = response.urljoin(next_page[0].extract())
            yield SplashRequest(url, self.parse)
Chrome's developer tools (F12) are convenient for checking XPaths. In the Elements view, right-click the element you want > Copy > Copy XPath.
python2.7
scrapy shell "http://localhost:8050/render.html?url={The url you want to render}"
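Inside the shell, `response` holds the Splash-rendered page, so the XPath expressions can be tried out before putting them in the spider. A short sketch, assuming the rendered page contains the same `outShop` list as above:
python2.7
# Run inside the scrapy shell session started above.
stores = response.xpath('//ul[@id="outShop"]/li')
print(len(stores))  # number of stores found on this page
# First store's address, using the same relative XPath as the spider.
print(stores[0].xpath('./span[@class="scAddress"]/text()[1]').extract())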
We were able to obtain information on over 1,000 stores in about 10 minutes.
python2.7
scrapy crawl ${PRJ_NAME} -o hoge.csv
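The latitude and longitude can then be pulled out of the map_url column afterwards. The exact URL format depends on the site, but assuming the link carries a "lat,lon" pair somewhere in its query string, a rough post-processing sketch using only the standard library could look like this (adjust the regex to the actual format):
python2.7
# -*- coding: utf-8 -*-
# Extract latitude/longitude from the map_url column of the crawled CSV.
import csv
import re

# Matches a "lat,lon" pair of decimal numbers (assumption about the URL format).
COORD_RE = re.compile(r'(-?\d{1,3}\.\d+),(-?\d{1,3}\.\d+)')

with open('hoge.csv') as f:
    for row in csv.DictReader(f):
        m = COORD_RE.search(row.get('map_url', ''))
        if m:
            lat, lon = m.group(1), m.group(2)
            print('%s %s %s' % (row['address'], lat, lon))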