Collect store latitude / longitude information with scrapy + splash ②

1. Introduction

Following on from last time, we will acquire store latitude / longitude information. This time we collect store information for chain-operating companies from the Mapion Phonebook. To keep the spider general-purpose, the following items can be passed as arguments when running scrapy.

・ genre: genre ID
・ category: category ID
・ chain_store: chain-operating company ID

For example, in the case of Gyoza no Ohsho, the values are genre = M01 (gourmet), category = 002 (ramen / gyoza), chain_store = CA01 (Gyoza no Ohsho).
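
These three IDs are simply concatenated into the Mapion phonebook URL, as the spider's __init__ below shows. A minimal sketch of that composition (the printed URL is an illustration, not verified against the live site):

# Minimal sketch: the three IDs are concatenated into the phonebook start URL,
# mirroring the format string in MapionSpider.__init__ below.
genre = 'M01'         # gourmet
category = '002'      # ramen / gyoza
chain_store = 'CA01'  # Gyoza no Ohsho

start_url = 'http://www.mapion.co.jp/phonebook/{0}{1}{2}/'.format(genre, category, chain_store)
print(start_url)  # -> http://www.mapion.co.jp/phonebook/M01002CA01/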

2. Execution environment / environment construction

The execution environment and environment construction are the same as last time.

3. scrapy

The settings in items.py and settings.py are the same as last time, apart from the spider name and the items to acquire, so they are omitted here.
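
For reference, minimal sketches of what those files contain. The item fields are taken from the parse_item method below; the splash-related settings follow the standard scrapy-splash documentation and assume Splash is running locally on port 8050 (both are sketches, not the article's exact files).

# items.py -- sketch of the item this spider populates (fields taken from parse_item).
import scrapy

class MapionspiderItem(scrapy.Item):
    name = scrapy.Field()       # store name
    latitude = scrapy.Field()   # latitude taken from the map link
    longitude = scrapy.Field()  # longitude taken from the map link

# settings.py -- the usual scrapy-splash settings (per the scrapy-splash docs).
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'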

A list of stores appears on the top page for each chain (example). However, latitude / longitude information cannot be obtained from that page, so the spider follows the link to each store's detail page (example) and gets the store name and latitude / longitude there.

This means every store page has to be crawled individually, which is inefficient and takes much longer than last time (roughly 12 hours for 10,000 stores).

Also, since the Mapion phonebook lists at most 10,000 stores (100 stores / page × 100 pages), not every store can be covered for a company with more than 10,000 stores.

MapionSpider.py


# -*- coding: utf-8 -*-

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest
from ..items import MapionspiderItem

class MapionSpider(CrawlSpider):
    name = 'Mapion_spider'
    allowed_domains = ['mapion.co.jp']

    # Receive genre, category, and chain store ID information as arguments.
    def __init__(self, genre=None, category=None, chain_store=None, *args, **kwargs):
        super(MapionSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.mapion.co.jp/phonebook/{0}{1}{2}/'.format(genre,category,chain_store)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        item = MapionspiderItem()
        shop_info = response.xpath('//*[@id="content"]/section/table/tbody')
        if shop_info:
            item['name'] = shop_info.xpath('tr[1]/td/text()').extract()
            url_path = shop_info.xpath('//a[@id="spotLargMap"]/@href').extract()
            # Extract the latitude / longitude from the store map link (url_path), convert them to lists, and set them on the item.
            url_elements = url_path[0].split(',')
            item['latitude'] = [url_elements[0][4:]]
            item['longitude'] = [url_elements[1]]
            yield item

        # Get the link to each store's detail page from the store list.
        list_size = len(response.xpath('//table[@class="list-table"]/tbody/tr').extract())
        for i in range(2,list_size+1):
            target_url_path = '//table[@class="list-table"]/tbody/tr['+str(i)+']/th/a/@href'
            target = response.xpath(target_url_path)
            if target:
                target_url = response.urljoin(target[0].extract())
                yield SplashRequest(target_url, self.parse_item)

        # Get the link to the next page number at the bottom of the store list.
        next_path = response.xpath('//p[@class="pagination"]/*[contains(@class, "pagination-currnet ")]/following::a[1]/@href')
        if next_path:
            next_url = response.urljoin(next_path[0].extract())
            yield SplashRequest(next_url, self.parse_item)
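
For reference, a standalone sketch of the latitude / longitude parsing done in parse_item. The sample href below is hypothetical; the [4:] slice assumes a 4-character prefix (e.g. "/m2/") before "latitude,longitude,...", which may differ from the live spotLargMap link:

# Standalone sketch of the parsing done in parse_item (sample href is hypothetical).
url_path = ['/m2/34.671889,135.504983,16']  # assumed spotLargMap @href format
url_elements = url_path[0].split(',')
latitude = [url_elements[0][4:]]   # ['34.671889'] -- strips the 4-character prefix
longitude = [url_elements[1]]      # ['135.504983']
print(latitude, longitude)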

4. Run

Arguments are specified with the -a option. For example, to acquire 7-Eleven store information:

scrapy crawl Mapion_spider -o hoge.csv -a genre='M02' -a category='005' -a chain_store='CM01'
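
Similarly, using the Gyoza no Ohsho IDs from the introduction (the output filename is arbitrary):

scrapy crawl Mapion_spider -o ohsho.csv -a genre='M01' -a category='002' -a chain_store='CA01'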
