I tried sending "Hameln" update notifications using "Beautiful Soup" and "IFTTT"

Introduction

When I made an update notification app for Shōsetsuka ni Narō ("Let's Become a Novelist") using its API, I realized that getting the update information for works from My Page was closer to what I actually wanted to do. So I made a version for the novel posting site Hameln.

What does it do?

It is an application that sends Hameln update information to LINE Notify using BeautifulSoup4 and IFTTT.

Environment

Before preparing: check the difference between an API and scraping

This time we will use scraping. Scraping can be restricted by law, so check the legal issues before you start. The first thing to keep in mind is not to overload the site's servers. As a countermeasure, this code calls time.sleep(1) after every GET or POST to create a waiting time.
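The waiting time can also be wrapped in a small helper so that no request is left without a pause. This is only a minimal sketch and not the code used in this article; polite_call and its delay parameter are hypothetical names I made up for illustration.

```python
import time


def polite_call(func, *args, delay=1.0, **kwargs):
    """Call func, then pause for `delay` seconds before returning its result.

    Hypothetical helper: pass session.get or session.post as func so that
    every request is followed by a waiting time.
    """
    try:
        return func(*args, **kwargs)
    finally:
        # Sleep even if the call raised, so retries are spaced out as well.
        time.sleep(delay)


# Stand-in for a real request function, used only for demonstration.
def fake_get(url):
    return "response for " + url


print(polite_call(fake_get, "https://example.com/", delay=0.1))
```

With a real session, the call would look like polite_call(session.get, url_login), which replaces the separate time.sleep(1) line after each request.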

Preparation (create an Applet with IFTTT)

IFTTT is a service that links different services together. This time, we connect Webhooks and LINE Notify so that notifications are sent to your LINE account.

Procedure

  1. Register with IFTTT.
  2. Click Create at the top right of the screen. The screen changes to one that says "if + This Then That". (screenshot: ifttt_ifttt.PNG)
  3. Click + This. Type Webhooks into the search bar and select it.
  4. Click the box that says "Receive a web request". On the next screen, enter any name you like in "Event Name". It will be used later. (screenshot: ifttt_2.PNG)
  5. Click + That. Select LINE and click the "Send message" box.
  6. Log in to LINE, set the content of the message to "Value1: Value1" (this is optional), and click "Create Action". (screenshot: ifttt_3.PNG)
  7. Check the contents and click Finish.
  8. Click Explore, type Webhooks in the search window, and select the Services tab. Click Webhooks. Maybe you can go from this link ...?
  9. Click Documentation and you should see "Your key is: ~~~~~", so make a note of it. It will be used later. That completes the IFTTT setup.

Source code description

Here is a brief explanation of the source code.

import

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv
import time

post_ifttt()

This function sends a notification to LINE Notify via IFTTT. Use the Event Name entered in step 4 and the Webhooks key noted in step 9 here. I also used it in the update notification app for Shōsetsuka ni Narō.

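# Replace these placeholders with your own values.
EVENT_NAME = "your_event_name"      # Event Name entered when creating the Applet
WEBHOOKS_KEY = "your_webhooks_key"  # key shown on the Webhooks Documentation page

def post_ifttt(json):
    # json: {"value1": "content"}
    url = (
        "https://maker.ifttt.com/trigger/"
        + EVENT_NAME
        + "/with/key/"
        + WEBHOOKS_KEY
    )
    # Send the dictionary as a JSON body.
    requests.post(url, json=json)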
extract()

This is the core function of this code. It is used in the parts described later. Depending on condition, it extracts one of the title, the number of stories, or the URL from the HTML and appends it to the list li. The branching may be a little hard to follow; it might have been better to write the if statements on condition side by side. The if statements that test for "<" and ">" strip the HTML tag and keep only the text inside it, and the ones that test for a double quote pick out the attribute value for the URL.

def extract(info, condition, li):
    for item in info:
        if condition in str(item):
            a = ""
            is_a = 0
            if condition!="href":
                for s in str(item):
                    if s=="<" and is_a==1:
                        is_a = 0
                        li.append(a)
                        break
                    if is_a==1:
                        if condition=="latest":
                            if "0" <= s and s <= "9":
                                a+=s
                        else:
                            a += s
                    if s==">" and is_a==0:
                        is_a = 1
            else:
                if "mode=user" in str(item):
                    continue
                for s in str(item):
                    if s=="\"" and is_a==1:
                        is_a = 0
                        li.append(a)
                        break
                    if is_a==1:
                        a += s
                    if s=="\"" and is_a==0:
                        is_a = 1
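The same three extractions can be written more compactly with the re module. This is a sketch under the same assumptions as extract(): each item is a single a tag when converted to a string, and href is its first quoted attribute. The function names and the sample markup are hypothetical, and this is not the code used in the article.

```python
import re


def tag_text(tag_html):
    """Return the text between the first '>' and the next '<'."""
    match = re.search(r">([^<]*)<", tag_html)
    return match.group(1) if match else None


def first_quoted(tag_html):
    """Return the first double-quoted value (assumed here to be href)."""
    match = re.search(r'"([^"]*)"', tag_html)
    return match.group(1) if match else None


def digits_only(text):
    """Keep only the characters 0-9, as the "latest" branch does."""
    return "".join(ch for ch in text if "0" <= ch <= "9")


# Hypothetical markup, not copied from the real site.
sample_title = '<a href="//syosetu.org/novel/0/">Sample Title</a>'
sample_latest = '<a href="//syosetu.org/novel/0/12.html">Latest: 12</a>'

print(tag_text(sample_title))
print(first_quoted(sample_title))
print(digits_only(tag_text(sample_latest)))
```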

Login

Since scraping is done from Hameln's My Page, you first POST the necessary information to the login screen and log in. The information required for the login process differs from site to site and can be confirmed with the developer tools of your browser; for Hameln it is "id, pass, mode". The mode is "login_entry_end" for everyone. POST this information and log in. The detailed usage of Beautiful Soup is summarized in the article below, so please have a look.

##############################################################
#                           Log in                           #
##############################################################
# id, pass
with open("input.txt") as f:
    """
    input.txt: [ID PASS]
    """
    s = f.read().split()
    ID = s[0]
    PASS = s[1]

session = requests.session()

url_login = "https://syosetu.org/?mode=login"
response = session.get(url_login)
time.sleep(1)

login_info = {
    "id":ID,
    "pass":PASS,
    "mode":"login_entry_end"
}

res = session.post(url_login, data=login_info)
res.raise_for_status() # for error
time.sleep(1)

By the way, input.txt is an input file that contains the ID and the password, in this order, separated by a single half-width space. Example:

input.txt


ID_hoge passwd_hoge 
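Reading this file can be made a little safer by checking that exactly two fields are present. This is a minimal sketch; parse_credentials is a hypothetical helper, and unlike the login code, which simply takes the first two fields, it rejects anything else. Neither version supports a password that contains spaces.

```python
def parse_credentials(text):
    """Split 'ID PASS' into a (user_id, password) tuple."""
    # split() with no argument ignores the trailing space and the newline.
    parts = text.split()
    if len(parts) != 2:
        raise ValueError("expected exactly two fields: ID and password")
    return parts[0], parts[1]


user_id, password = parse_credentials("ID_hoge passwd_hoge \n")
print(user_id)
```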

User name output

The user name is extracted from the HTML of the user information page. Easy.

###############################################################
#                        Print User Name                      #
###############################################################

soup_myage = BeautifulSoup(res.text, "html.parser")

account_href = soup_myage.select_one(".spotlight li a").attrs["href"]
url_account = urljoin(url_login, account_href)

res_account = session.get(url_account)
res_account.raise_for_status()
time.sleep(1)

soup_account = BeautifulSoup(res_account.text, "html.parser")
user_name = str((soup_account.select(".section3 h3"))[0])[4:-5].split("/")[0]

print("Hello "+ user_name + "!")
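The slice [4:-5] in the code above may look cryptic, so here is what it does on a sample string. The markup is hypothetical; I am only assuming that the h3 element has no attributes and that the user name comes before a "/".

```python
def user_name_from_h3(h3_html):
    # "<h3>" is 4 characters and "</h3>" is 5, so [4:-5] leaves the inner text.
    inner = h3_html[4:-5]
    # Keep only the part before the first "/".
    return inner.split("/")[0]


sample = "<h3>hoge/User information</h3>"  # hypothetical markup
print(user_name_from_h3(sample))  # prints "hoge"
```

If the h3 tag ever gained an attribute such as a class, the fixed offsets would no longer line up, which is the main weakness of this approach.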

Get information about your favorite novels from each favorite page

There are multiple favorite pages. So, from each page, the titles, the numbers of stories, and the URLs are stored in the lists titles, latest_no, and ncode, respectively. Updates are checked later and the results are saved to a file.

###############################################################
#                        Page Transition                      #
###############################################################
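a_list = soup_myage.select(".section.pickup a")
favo_a = ""
for a in a_list:
    # Find the link to the favorites list by its text.
    # (The literal is shown in translation; the site itself is in Japanese.)
    if "To favorite list" in a:
        favo_a = a
        break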

url_favo = urljoin(url_login, favo_a.attrs["href"])

res_favo = session.get(url_favo)
res_favo.raise_for_status()
time.sleep(1)

soup_favo = BeautifulSoup(res_favo.text, "html.parser")
bookmark_titles = soup_favo.select(".section3 h3 a")
bookmark_latest = soup_favo.select(".section3 p a")
titles = []
latest_no = []
ncode = []

extract(bookmark_titles, "novel", titles)
extract(bookmark_latest, "latest", latest_no)
extract(bookmark_titles, "href", ncode)
###############################################################
#                     Start Page Transition                   #
###############################################################
number_of_bookmarks_h2 = soup_favo.select_one(".heading h2")
number_of_bookmarks = ""
for s in str(number_of_bookmarks_h2)[4:-5]:
    if s>="0" and s<='9':
        number_of_bookmarks += s
number_of_bookmarks = int(number_of_bookmarks)
number_of_favo_pages = number_of_bookmarks // 10 + 1

for i in range(2,number_of_favo_pages+1):
    url_favo = "https://syosetu.org/?mode=favo&word=&gensaku=&type=&page=" + str(i)
    res_favo = session.get(url_favo)
    res_favo.raise_for_status()
    soup_favo = BeautifulSoup(res_favo.text, "html.parser")
    bookmark_titles = soup_favo.select(".section3 h3 a")
    bookmark_latest = soup_favo.select(".section3 p a")
    extract(bookmark_titles, "novel", titles)
    extract(bookmark_latest, "latest", latest_no)
    extract(bookmark_titles, "href", ncode)
    time.sleep(1)
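One detail about the page count: number_of_bookmarks // 10 + 1 gives one page too many when the number of favorites is an exact multiple of 10 (for example, 20 favorites gives 3). The extra page should simply contain no works, so only one unnecessary request is made. A ceiling division avoids it. This is a minimal sketch that assumes 10 works per page, as the code above does; page_count is a hypothetical name.

```python
def page_count(total_items, per_page=10):
    """Number of pages needed to show total_items, per_page at a time."""
    if total_items <= 0:
        return 0
    # Ceiling division without floating point.
    return (total_items + per_page - 1) // per_page


for n in (9, 10, 11, 20):
    print(n, page_count(n), n // 10 + 1)
```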

Data acquisition

The newly acquired information is stored in bookmark_info, and the previously acquired information is read into data. Then the two are compared to check whether each work has been updated.

###############################################################
#                        Get Latest Data                      #
###############################################################
bookmark_info = []
for i in range(len(titles)):
    bookmark_info.append([titles[i], latest_no[i], ncode[i]])

###############################################################
#                       Get Previous Data                     #
###############################################################
read_file = "hameln.csv"
try:
    with open(read_file, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        data = [row for row in reader]
except FileNotFoundError:
    # First run: there is no previous data yet.
    data = []

###############################################################
#              Check Whether Novels are Updated               #
###############################################################
"""
previous data: data
latest data: bookmark_info
"""
for prev in data:
    for latest in bookmark_info:
        if prev[0] == latest[0]:
            # check
            if prev[1] != latest[1]:
                message = str(latest[0]) + " has been updated.\n" + latest[2]
                print(message)
                json = {"value1": message}
                post_ifttt(json)
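The nested loop compares every previous row with every latest row. With many favorites, a dictionary keyed by title does the same check in a single pass. This is a sketch, not the code used in the article; find_updates is a hypothetical name, and it assumes titles are unique (if two works shared a title, only the last previous entry would be compared).

```python
def find_updates(previous_rows, latest_rows):
    """Return rows from latest_rows whose story count differs from before.

    Each row is [title, latest_no, url], the same layout as bookmark_info.
    """
    previous_by_title = {row[0]: row[1] for row in previous_rows if len(row) >= 2}
    updated = []
    for title, latest_no, url in latest_rows:
        # Titles missing from the previous data are new favorites, not updates.
        if title in previous_by_title and previous_by_title[title] != latest_no:
            updated.append([title, latest_no, url])
    return updated


# Hypothetical data for illustration.
previous = [["A", "3", "url_a"], ["B", "5", "url_b"]]
latest = [["A", "4", "url_a"], ["B", "5", "url_b"], ["C", "1", "url_c"]]
print(find_updates(previous, latest))
```

Each returned row can then be turned into a message and passed to post_ifttt() in the same way as in the loop above.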

Write update information to a file

###############################################################
#                    Write Latest Information                 #
###############################################################
output = "hameln.csv"
with open(output, mode='w', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # Each row is [title, latest_no, url].
    writer.writerows(bookmark_info)
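For reference, here is a small round trip through the csv module using an in-memory buffer instead of hameln.csv. The row is hypothetical data. It shows two things the update check relies on: a title containing a comma survives intact, and the number of stories comes back as a string, so it is compared as a string with the freshly scraped value.

```python
import csv
import io

# Hypothetical row in the same [title, latest_no, url] layout.
rows = [["Sample, Title", "12", "//syosetu.org/novel/0/"]]

# newline="" leaves line endings to the csv module, as in the file version.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored)
```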

GitHub

I uploaded the code to GitHub ( here ). Please take a look if you like.

In closing

The login process was the most interesting thing I learned from making this app: you are not just passing in your ID and password. I also automated the script with Task Scheduler. For more information on how to use Task Scheduler, please see the References section.

References
