Serverless scraping on a regular basis with AWS Lambda + Scrapy, Part 1

First post! I wanted to cover the serverless part in the same article, but I ran out of time, so this one covers only the scraping.

What I want to do

I want to automatically scrape web pages whose information is updated regularly!

Target

Get Yahoo! Weather (Tokyo) data every 6 hours.

Method

Python + Scrapy + AWS Lambda + CloudWatch Events seems like a good fit, so I will try that combination for now.

Scraping first

Follow the steps below to create the crawling and scraping parts.

  1. Scrapy installation
  2. Create a Scrapy project
  3. Create spider
  4. Run

1. Scrapy installation

$ python3 -V
Python 3.7.4

$ pip3 install scrapy
...
Successfully installed

$ scrapy version
Scrapy 1.8.0

2. Create a Scrapy project

The project folder is created in the directory where you run the command.

$ scrapy startproject yahoo_weather_crawl
New Scrapy project 'yahoo_weather_crawl'

$ ls
yahoo_weather_crawl

This time I will get one part of the Yahoo! Weather page: the announcement date, date, weather, temperature, and probability of precipitation.

Scrapy has a command-line shell where you can try expressions interactively and check that the target is extracted correctly, so let's verify each step as we go.

Specify the target with XPath. You can copy an element's XPath from the Google Chrome developer tools (opened with F12).


The XPath of the announcement date and time is: //*[@id="week"]/p

Let's pull this out of the response.


#Launch scrapy shell
$ scrapy shell https://weather.yahoo.co.jp/weather/jp/13/4410.html

>>> announcement_date = response.xpath('//*[@id="week"]/p/text()').extract_first()
>>> announcement_date
'Announced at 18:00 on November 29, 2019'

If you append text(), you get only the text content. See the Scrapy documentation listed in References (https://doc.scrapy.org/en/latest/index.html) for more information.

Now that we have the announcement date and time, let's get the rest in the same way.

The other information is inside a table tag, so first get the whole table.



>>> table = response.xpath('//*[@id="yjw_week"]/table')

You now have the table element under id="yjw_week". We will get each value from it.


# A leading "." keeps each query relative to the table selected above
# date
>>> date = table.xpath('.//tr[1]/td[2]/small/text()').extract_first()
>>> date
'December 1st'

# weather
>>> weather = table.xpath('.//tr[2]/td[2]/small/text()').extract_first()
>>> weather
'Cloudy and sometimes sunny'

# temperature
>>> temperature = table.xpath('.//tr[3]/td[2]/small/font/text()').extract()
>>> temperature
['14', '5']

# rainy percent
>>> rainy_percent = table.xpath('.//tr[4]/td[2]/small/text()').extract_first()
>>> rainy_percent
'20'

Now that we know how to get each value, we will create the spider (the main part of the process).

3. Create spider

The structure of the project folder created earlier is as follows.


.
├── scrapy.cfg
└── yahoo_weather_crawl
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

First, define the items to be acquired.

items.py



import scrapy

class YahooWeatherCrawlItem(scrapy.Item):
    announcement_date = scrapy.Field()  #Announcement date and time
    date = scrapy.Field()               #date
    weather = scrapy.Field()            #weather
    temperature = scrapy.Field()        #temperature
    rainy_percent = scrapy.Field()      #rainy percent

Next, create the body of the spider in the spiders folder.

spiders/weather_spider.py


# -*- coding: utf-8 -*-
import scrapy
from yahoo_weather_crawl.items import YahooWeatherCrawlItem

# spider
class YahooWeatherSpider(scrapy.Spider):

    name = "yahoo_weather_crawler"
    allowed_domains = ['weather.yahoo.co.jp']
    start_urls = ["https://weather.yahoo.co.jp/weather/jp/13/4410.html"]

    # Extraction process for the response
    def parse(self, response):
        # Announcement date and time
        yield YahooWeatherCrawlItem(
            announcement_date=response.xpath('//*[@id="week"]/p/text()').extract_first()
        )
        table = response.xpath('//*[@id="yjw_week"]/table')

        # Loop over the day columns (td[2] to td[6])
        for day in range(2, 7):
            # Data extraction; the leading "." keeps each query relative to the table
            yield YahooWeatherCrawlItem(
                date=table.xpath('.//tr[1]/td[%d]/small/text()' % day).extract_first(),
                weather=table.xpath('.//tr[2]/td[%d]/small/text()' % day).extract_first(),
                temperature=table.xpath('.//tr[3]/td[%d]/small/font/text()' % day).extract(),
                rainy_percent=table.xpath('.//tr[4]/td[%d]/small/text()' % day).extract_first(),
            )
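
To make the date loop concrete, the sketch below prints the column part of the XPath that each iteration produces with % formatting. CELL_TEMPLATE is a name introduced here for illustration; it holds only the td[...] portion of the expressions used in the spider.

```python
# Column part of the XPath used in the spider; %d is replaced by the column number.
CELL_TEMPLATE = 'td[%d]/small/text()'

# range(2, 7) yields 2, 3, 4, 5, 6: five day columns.
# td[1] is skipped; in the shell session, td[2] was the first cell holding a date.
columns = list(range(2, 7))
cell_xpaths = [CELL_TEMPLATE % day for day in columns]

for xpath in cell_xpaths:
    print(xpath)
```

Since the upper bound of range is exclusive, widening the loop to one more day would mean changing 7 to 8.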

4. Now run!

$ scrapy crawl yahoo_weather_crawler

2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'announcement_date': 'Announced at 17:00 on December 1, 2019'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 3rd',
 'rainy_percent': '10',
 'temperature': ['17', '10'],
 'weather': 'Sunny'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 4th',
 'rainy_percent': '0',
 'temperature': ['15', '4'],
 'weather': 'Sunny'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 5th',
 'rainy_percent': '0',
 'temperature': ['14', '4'],
 'weather': 'Partially cloudy'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 6th',
 'rainy_percent': '10',
 'temperature': ['11', '4'],
 'weather': 'Cloudy'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': 'December 7th',
 'rainy_percent': '30',
 'temperature': ['9', '3'],
 'weather': 'Cloudy'}

It looks like the data was scraped correctly! While we're at it, let's output it to a file.

By default, Japanese characters are garbled in the output file, so add the encoding setting to settings.py.

settings.py


FEED_EXPORT_ENCODING = 'utf-8'

$ scrapy crawl yahoo_weather_crawler -o weather_data.json
...
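
For JSON output, the garbling is, as far as I can tell from the Scrapy documentation on FEED_EXPORT_ENCODING, a matter of non-ASCII characters being written as \uXXXX escape sequences when the setting is left unset. The standard json module behaves the same way by default, so it makes a convenient stand-in for a quick look. The sample value below is made up for illustration and is not scraped data.

```python
import json

# Sample record; the Japanese value is a made-up example, not real scraped data.
record = {"weather": "晴れ"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same data; only the file's readability differs.
same_data = json.loads(escaped) == json.loads(readable)
print(same_data)
```

In other words, the escaped file is still valid JSON; setting the encoding mainly makes it readable by eye.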

weather_data.json


[
{"announcement_date": "Announced at 17:00 on December 1, 2019"},
{"date": "December 3rd", "weather": "Sunny", "temperature": ["17", "10"], "rainy_percent": "10"},
{"date": "December 4th", "weather": "Sunny", "temperature": ["15", "4"], "rainy_percent": "0"},
{"date": "December 5th", "weather": "Partially cloudy", "temperature": ["14", "4"], "rainy_percent": "0"},
{"date": "December 6th", "weather": "Cloudy", "temperature": ["11", "4"], "rainy_percent": "10"},
{"date": "December 7th", "weather": "Cloudy", "temperature": ["9", "3"], "rainy_percent": "30"}
]

The data was written to the file!
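
All scraped values are strings, so a later step will probably want numbers. The sketch below is one possible way to post-process the file with the standard json module. The function name to_forecasts and the keys high and low are my own choices, and reading the first temperature as the high and the second as the low is an assumption based on the values in the output.

```python
import json

# Three records copied from weather_data.json; in practice, read the file instead.
RAW_JSON = """
[
{"announcement_date": "Announced at 17:00 on December 1, 2019"},
{"date": "December 3rd", "weather": "Sunny", "temperature": ["17", "10"], "rainy_percent": "10"},
{"date": "December 4th", "weather": "Sunny", "temperature": ["15", "4"], "rainy_percent": "0"}
]
"""


def to_forecasts(records):
    """Convert daily records to typed dicts, skipping the announcement record."""
    forecasts = []
    for record in records:
        if "date" not in record:
            continue  # the announcement item has no 'date' field
        # Assumption: the list is ordered [high, low].
        high, low = (int(value) for value in record["temperature"])
        forecasts.append({
            "date": record["date"],
            "weather": record["weather"],
            "high": high,
            "low": low,
            "rainy_percent": int(record["rainy_percent"]),
        })
    return forecasts


forecasts = to_forecasts(json.loads(RAW_JSON))
for forecast in forecasts:
    print(forecast)
```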

Next time, I will combine this process with AWS to run it serverlessly.

References

Scrapy 1.8 documentation https://doc.scrapy.org/en/latest/index.html
Understand in 10 minutes Scrapy https://qiita.com/Chanmoro/items/f4df85eb73b18d902739
Web scraping with Scrapy https://qiita.com/Amtkxa/items/4c1172c932264ae941b4
