This is my first post. I'm a college student who has been teaching myself programming for a little over half a year, and I'd like to start writing up what I learn, bit by bit. There are probably plenty of rough spots, such as how the code is written, so please feel free to point them out.
As the title suggests, the goal is to use AWS Lambda to scrape Yahoo News articles on a schedule (every 6 hours) and to notify my LINE account of newly published articles via the LINE Messaging API. For the development environment I use Cloud9, since it integrates well with Lambda.
The procedure is as follows:
(1) Get 20 article titles and URLs from the news site using Python's BeautifulSoup module, and write the URLs to a csv file stored in AWS S3 (the write itself happens at the end of the program).
(2) Compare the URLs from the previous run (read the csv from S3) with the URLs obtained this time to find the updates.
(3) Send the titles and URLs of the updated articles using the LINE Messaging API.
I referred to the following articles: for the LINE Messaging API, "Monitor web page updates with LINE BOT"; for scraping with BeautifulSoup, "Web scraping with BeautifulSoup"; and for reading and writing csv files on S3, "Python code to simply read CSV file of AWS S3" and [Write a Pandas dataframe to CSV on S3](https://www.jitsejan.com/write-dataframe-to-csv-on-s3.html).
import urllib.request
from bs4 import BeautifulSoup
import csv
import pandas as pd
import io
import boto3
import s3fs
import itertools
from linebot import LineBotApi
from linebot.models import TextSendMessage
def lambda_handler(event, context):
    url = 'https://follow.yahoo.co.jp/themes/051839da5a7baa353480'
    html = urllib.request.urlopen(url)
    # Parse the html
    soup = BeautifulSoup(html, "html.parser")

    def news_scraping(soup=soup):
        """
        Get article titles and urls
        """
        title_list = []
        titles = soup.select('#wrapper > section.content > ul > li:nth-child(n) > a.detailBody__wrap > div.detailBody__cnt > p.detailBody__ttl')
        for title in titles:
            title_list.append(title.string)
        url_list = []
        urls = soup.select('#wrapper > section.content > ul > li:nth-child(n) > a.detailBody__wrap')
        for url in urls:
            url_list.append(url.get('href'))
        return title_list, url_list

    def get_s3file(bucket_name, key):
        """
        Read csv from S3
        """
        s3 = boto3.resource('s3')
        s3obj = s3.Object(bucket_name, key).get()
        return io.TextIOWrapper(io.BytesIO(s3obj['Body'].read()))

    def write_df_to_s3(csv_list):
        """
        Write to S3
        """
        csv_buffer = io.StringIO()
        csv_list.to_csv(csv_buffer, index=False, encoding='utf-8-sig')
        s3_resource = boto3.resource('s3')
        s3_resource.Object('Bucket name', 'file name').put(Body=csv_buffer.getvalue())

    def send_line(content):
        access_token = '********'
        # Fill in your channel access token
        line_bot_api = LineBotApi(access_token)
        line_bot_api.broadcast(TextSendMessage(text=content))

    ex_csv = []
    # The urls from the previous scraping run
    for rec in csv.reader(get_s3file('Bucket name', 'file name')):
        ex_csv.append(rec)
    ex_csv = ex_csv[1:]
    # Even though I wrote index=False, an index "0" (?) ended up at the top of the read csv, so drop the first row
    ex_csv = list(itertools.chain.from_iterable(ex_csv))
    # The csv is read as a two-dimensional array, so flatten it to one dimension
    title, url = news_scraping()
    # Run the scraping
    csv_list = url
    # Compare with ex_csv to extract the updates
    for i in range(20):
        if csv_list[i] in ex_csv[0]:
            # Use "in" because the strings didn't match exactly
            num = i
            # Find which position in csv_list the first article in ex_csv corresponds to
            break
    else:
        num = 'all'
    if num == 'all':
        send_list = [None] * 2 * 20
        send_list[::2] = title
        send_list[1::2] = url
        send_list = "\n".join(send_list)
        # Alternate titles and urls, one per line
    elif num == 0:
        send_list = 'No new news'
    else:
        send_list = [None] * 2 * num
        send_list[::2] = title[:num]
        send_list[1::2] = url[:num]
        send_list = "\n".join(send_list)
        # Alternate titles and urls, one per line
    send_line(send_list)
    csv_list = pd.DataFrame(csv_list)
    # Writing a plain list to S3 caused an error, so convert it to a DataFrame
    write_df_to_s3(csv_list)
    # Write csv_list to the csv on S3 and finish
That's it for the code. (In hindsight, it would have been cleaner to pull the part that builds send_list from num out into its own function.) After that, deploy the function to remote (AWS) and set up periodic execution with Amazon CloudWatch Events; scheduling with a cron expression is convenient for running it at fixed intervals, as sketched below. Be careful here, because you also need to grant the function access to S3 through IAM.
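As a concrete example, here is a minimal sketch of creating such a 6-hourly schedule with boto3 instead of clicking through the console. The rule name, function name, and ARN below are placeholders I made up for illustration, not values from this project.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Create a rule that fires every 6 hours, on the hour (UTC).
rule = events.put_rule(
    Name='yahoo-news-scraper-every-6h',        # hypothetical rule name
    ScheduleExpression='cron(0 0/6 * * ? *)',
    State='ENABLED',
)

# Allow CloudWatch Events to invoke the function, then attach it as a target.
lambda_client.add_permission(
    FunctionName='news-scraper',               # hypothetical function name
    StatementId='allow-cloudwatch-events',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)
events.put_targets(
    Rule='yahoo-news-scraper-every-6h',
    Targets=[{'Id': '1', 'Arn': 'arn:aws:lambda:REGION:ACCOUNT_ID:function:news-scraper'}],
)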
Reading and writing the csv on S3 didn't go quite as smoothly as I expected. In particular, the way it comes back as a two-dimensional array and has to be flattened leaves room for improvement; one possible simplification is sketched below.
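For example, the read side could probably be simplified by letting pandas parse the csv directly, so neither the csv.reader loop nor the itertools flattening is needed. This is only a sketch and I haven't tested it against the actual bucket; 'Bucket name' and 'file name' are the same placeholders used in the code above.

import io

import boto3
import pandas as pd

def read_previous_urls(bucket_name, key):
    # Fetch the object and let pandas parse it (the file was written with utf-8-sig above).
    s3obj = boto3.resource('s3').Object(bucket_name, key).get()
    df = pd.read_csv(io.BytesIO(s3obj['Body'].read()), encoding='utf-8-sig')
    # The single column holds the urls from the previous run; return it as a flat list.
    return df.iloc[:, 0].tolist()

ex_csv = read_previous_urls('Bucket name', 'file name')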
Actually, I had previously practiced scraping with Selenium, headless Chrome, and Lambda (on top of it being my first time with Lambda, I ran into a lot of errors related to the Chrome binary and had a hard time), so this time I was able to write the code relatively quickly. That said, it's still a lot more work than scraping locally. I've omitted how to set up Lambda itself here; it's fairly confusing, so please refer to other articles.
Recently I've been working with Django and the Twitter API, so if I come across anything worth sharing there, I'll write it up as well.
Thank you very much.