I created a tool that gets the title and URL of new blog posts by scraping with Python. GitHub: https://github.com/cerven12/blog_post_getter
When a friend of mine started (restarted?) blogging, I decided to post blog articles too, to consolidate what I learn and improve my writing. I figured it would be more motivating to have two people competing and encouraging each other than to do it alone, so I built this tool as part of that.
It gets the title and URL of newly posted articles. (I'd like to run it regularly and send notifications via the LINE API etc.)
It uses a txt file containing the URLs of existing posts and compares it against the URLs of the latest post list. Changes to the title or content are deliberately not detected (it would be annoying to get a "New post!" notification just because a title was edited!). The exception is when editing an article changes its URL (does that ever happen...?).
I built it for Qiita, so I don't know about other sites, but I think it can be used on any page whose HTML has the following format:

```html
<!-- The a tag has a class. The title is written as the content of the a tag -->
<a class='articles' href='#'>Title</a>
```
Qiita user page: https://qiita.com/takuto_neko_like
Hatena Blog user page: http://atc.hateblo.jp/about

In either case, you specify the selector attached to the `<a>` tag of each article.
```python
import requests, bs4

def new_post_getter(url, selecter, txt):
    '''
    Get the title and URL of each article as bs4 elements.

    1st argument: URL of the page with the post list
    2nd argument: CSS selector attached to the <a> tag of each post (with the leading ".")
    3rd argument: path of the txt file used for recording
    '''
    res = requests.get(url)
    posts = bs4.BeautifulSoup(res.text, 'html.parser').select(selecter)
    now_posts_url = []  # (1) URLs of the fetched article list; compared with past data to identify new posts
    now_posts_url_title_set = []  # (2) "URL>>>title" pairs of the fetched article list
    for post in posts:
        # Extract the URL
        index_first = str(post).find('href=') + 6
        index_end = str(post).find('">')
        url = str(post)[index_first:index_end]
        # Extract the title
        index_first = str(post).find('">') + 2
        index_end = str(post).find('</a')
        title = str(post)[index_first:index_end].replace('\u3000', ' ')  # replace full-width spaces
        now_posts_url.append(url)
        now_posts_url_title_set.append(f"{url}>>>{title}")
    old_post_text = open(txt)
    old_post = old_post_text.read().split(',')  # text file -> list
    # differences: posts in the current list that are not yet in the record (= new posts)
    differences = list(set(now_posts_url) - set(old_post))
    old_post_text.close()
    # Overwrite the recording txt: all_posts = past posts + new posts
    all_posts = ",".join(old_post + differences)
    f = open(txt, mode='w')
    f.writelines(all_posts)
    f.close()
    new_post_info = []
    for new in now_posts_url_title_set:
        for incremental in differences:
            if incremental in new:
                new_post_info.append(new.split(">>>"))
    return new_post_info
```
Specify the article-list page URL, the selector attached to the `<a>` tag of each article, and the path of the txt file that records the posting status as arguments.
Try it out

```python
url = 'https://qiita.com/takuto_neko_like'
selecter = '.u-link-no-underline'
file = 'neko.txt'

my_posts = new_post_getter(url, selecter, file)
print(my_posts)
```
By running the above ...

Result:

```
[['/takuto_neko_like/items/93b3751984e5e3fd3670', '[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~'], ['/takuto_neko_like/items/14e92797fa2b23a64adb', '[Python] What is inherited by multiple inheritance?']]
```

You get a nested list of URLs and titles:

```
[[URL, title], [URL, title], [URL, title], .......]
```
By iterating over this nested list with a for statement and formatting the strings ...
```python
for url, title in my_posts:
    print(f'{title} : {url}')
```
Easy-to-read output ↓

```
[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~ : /takuto_neko_like/items/93b3751984e5e3fd3670
[Python] What is inherited by multiple inheritance? : /takuto_neko_like/items/14e92797fa2b23a64adb
```
The contents of neko.txt look like this:

```
/takuto_neko_like/items/93b3751984e5e3fd3670,/takuto_neko_like/items/14e92797fa2b23a64adb,/takuto_neko_like/items/bb8d0957347636b5bf4f,/takuto_neko_like/items/62aeb4271614f6f0347f,/takuto_neko_like/items/c9c80ff453d0c4fad239,/takuto_neko_like/items/aed9dd5619d8457d4894,/takuto_neko_like/items/6cf9bade3d9515a724c0
```
It contains a comma-separated list of URLs. Let's try deleting the first and last entries ...
```
/takuto_neko_like/items/14e92797fa2b23a64adb,/takuto_neko_like/items/bb8d0957347636b5bf4f,/takuto_neko_like/items/62aeb4271614f6f0347f,/takuto_neko_like/items/c9c80ff453d0c4fad239,/takuto_neko_like/items/aed9dd5619d8457d4894
```
When you run ...

```python
my_posts = new_post_getter(url, selecter, file)
print(my_posts)
```

Result ↓
```
[['/takuto_neko_like/items/c5791f267e0964e09d03', 'Created a tool to get new articles to work hard with friends on blog posts'], ['/takuto_neko_like/items/93b3751984e5e3fd3670', '[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~'], ['/takuto_neko_like/items/6cf9bade3d9515a724c0', '【Python】@What are classmethods and decorators?']]
```
Exactly the deleted entries are detected as new posts! ☺
Below is a description of the code.
1. Get the `<a>` tags of all displayed articles from the article-list page
```python
import requests, bs4

def new_post_getter(url, selecter, txt):
    '''
    Get the title and URL of each article as bs4 elements.

    1st argument: URL of the page with the post list
    2nd argument: CSS selector attached to the <a> tag of each post (with the leading ".")
    3rd argument: path of the txt file used for recording
    '''
    res = requests.get(url)
    posts = bs4.BeautifulSoup(res.text, 'html.parser').select(selecter)
```
We use two third-party libraries here:

1. requests sends a GET request to the URL; the HTML of the response is available via `.text`.
2. Beautiful Soup parses that response HTML, and its `.select` method takes a CSS selector and returns every element matching it.

What is actually acquired in the "Try it out" example above is the list of matching `<a>` elements.
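As a minimal sketch of how these two calls fit together (using the URL and selector from the example above; the printed hrefs are illustrative):

```python
import requests, bs4

# Fetch the article-list page and parse the response HTML
res = requests.get('https://qiita.com/takuto_neko_like')
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# .select returns every element matching the CSS selector
links = soup.select('.u-link-no-underline')
for a in links:
    print(a.get('href'))  # e.g. /takuto_neko_like/items/93b3751984e5e3fd3670
```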
2. Get the title and URL from the obtained `<a>` tags; a set of URL and title is also kept separately
```python
now_posts_url = []  # (1) URLs of the fetched article list; compared with past data to identify new posts
now_posts_url_title_set = []  # (2) "URL>>>title" pairs of the fetched article list
for post in posts:
    # Extract the URL
    index_first = str(post).find('href=') + 6
    index_end = str(post).find('">')
    url = str(post)[index_first:index_end]
    # Extract the title
    index_first = str(post).find('">') + 2
    index_end = str(post).find('</a')
    title = str(post)[index_first:index_end].replace('\u3000', ' ')  # replace full-width spaces
    now_posts_url.append(url)
    now_posts_url_title_set.append(f"{url}>>>{title}")
```
Loop over the acquired `<a>` tag elements with a for statement. `.find()` returns the index at which a given substring starts, so by slicing the string with those indices you can cut out the URL part and the title part.
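For illustration, here is how that slicing behaves on a single made-up `<a>` tag string:

```python
# Made-up example of one <a> tag, as a string
post = '<a class="articles" href="/items/abc123">Sample title</a>'

index_first = post.find('href=') + 6   # index just after href="
index_end = post.find('">')            # index of the "> closing the attribute
print(post[index_first:index_end])     # -> /items/abc123

index_first = post.find('">') + 2      # index where the title text starts
index_end = post.find('</a')           # index where the title text ends
print(post[index_first:index_end])     # -> Sample title
```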
`now_posts_url` is the data used to compare with the post data so far and extract the difference (excluding articles that have dropped off the list screen due to pagination etc.).
New posts are detected by URL, which does not change even when an article is updated; but since the title should be output later as well, the `URL>>>title` pairs are saved now too. So `now_posts_url` is used to compute the diff, and afterwards only the entries of `now_posts_url_title_set` that contain a diff URL are extracted.
3. Compare the URLs obtained in step 2 with the list of existing posts (txt) and extract the difference
```python
old_post_text = open(txt)
old_post = old_post_text.read().split(',')  # text file -> list
# differences: posts in the current list that are not yet in the record (= new posts)
differences = list(set(now_posts_url) - set(old_post))
old_post_text.close()
```
We compare the latest post list against the txt file where the post records so far are saved, and extract the difference. This is a set difference: picturing it as a Venn diagram, A is the set of past posts, B is the latest post list, and the shaded region (B minus A) is the difference, i.e. the completely new posts.
Set operations are easy once the operands are `set` objects. The comma-separated string of URLs recorded in the txt file is turned into a list with `split()`, then both it and the latest post list obtained in step 2 are converted to `set`s and the difference is computed.
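A toy example of this set difference, with made-up URLs:

```python
# Made-up data: one URL has already been recorded, one is new
old_post = ['/items/aaa', '/items/bbb']
now_posts_url = ['/items/bbb', '/items/ccc']

differences = list(set(now_posts_url) - set(old_post))
print(differences)  # -> ['/items/ccc']  (only the post not yet recorded)
```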
4. Overwrite the txt, combining the new post URLs with the existing post record
```python
# Overwrite the recording txt: all_posts = past posts + new posts
all_posts = ",".join(old_post + differences)
f = open(txt, mode='w')
f.writelines(all_posts)
f.close()
```
The txt file also has to be brought up to date so it can be used next time: the difference (the new posts) is appended to the past posts and the file is overwritten.
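Incidentally, the same overwrite could be written with a `with` block, which closes the file even if an exception occurs (a sketch of an alternative, not how the tool is actually written):

```python
# Same overwrite using a context manager; write() suffices for a single string
all_posts = ",".join(old_post + differences)
with open(txt, mode='w') as f:
    f.write(all_posts)
```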
5. Format the new posts' titles and URLs
```python
new_post_info = []
for new in now_posts_url_title_set:
    for incremental in differences:
        if incremental in new:
            new_post_info.append(new.split(">>>"))
return new_post_info
```
From the "URL>>>title" strings saved in step 2, only the entries containing a URL that appears in the difference are kept. Since these are plain strings, the `in` operator's substring check is enough to match them. This yields the URL and title of each new article.
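A toy run of this matching step, with made-up data:

```python
# Made-up data: two recorded pairs, one new URL in the diff
now_posts_url_title_set = ['/items/aaa>>>First post', '/items/bbb>>>Second post']
differences = ['/items/bbb']

new_post_info = []
for new in now_posts_url_title_set:
    for incremental in differences:
        if incremental in new:              # substring check via the in operator
            new_post_info.append(new.split(">>>"))
print(new_post_info)  # -> [['/items/bbb', 'Second post']]
```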
~~I want to be able to notify a chat with friends regularly. Later.~~
2020/03/09 postscript
I used LINE Notify.
```python
def send_line_notify(posts, token):
    '''
    Takes the return value of new_post_getter as its first argument.
    '''
    notice_url = "https://notify-api.line.me/api/notify"
    headers = {"Authorization": "Bearer " + token}
    for url, title in posts:
        if 'http' not in url:
            url = 'https://qiita.com/' + url
        message = f'{title}:{url}'
        payload = {'message': message}
        r = requests.post(notice_url, headers=headers, params=payload)
```
Use it like this:

```python
token = '########'
neko_post = new_post_getter(neko_url, neko_selecter, neko_txt)
send_line_notify(neko_post, token)
```
Passing the return value of the `new_post_getter` function and the token as arguments sends the new posts to LINE Notify.
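Putting it together, a hypothetical send.py (the URL, selector, and file name are the ones used earlier; the token is a placeholder) might look like this:

```python
# Hypothetical send.py combining the two functions above
url = 'https://qiita.com/takuto_neko_like'
selecter = '.u-link-no-underline'
file = 'neko.txt'
token = '########'  # LINE Notify personal access token

new_posts = new_post_getter(url, selecter, file)
if new_posts:  # notify only when something new was found
    send_line_notify(new_posts, token)
```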
I used this page as a reference.
~~I want to run it every minute using PythonAnywhere. Later.~~
2020/03/09: I copied each file to PythonAnywhere and created a .sh like the one below, so that cron can use the virtual environment:

```sh
source /home/<account>/blog_post_notice/venv/bin/activate
python3 /home/<account>/blog_post_notice/send.py
```
Then, when I tried to run the .sh before setting up cron ...

Error:

```
requests.exceptions.ProxyError: HTTPSConnectionPool(host='qiita.com', port=443): Max retries exceeded with url: /takuto_neko_like (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden')))
```
After investigating, it turns out that free PythonAnywhere accounts can only access the external sites on a whitelist, to prevent abuse. So I gave up on PythonAnywhere ...
Next I tried deploying to Heroku. However, files cannot be persisted on Heroku, so overwriting a txt file in the same directory from Python, as this tool does, won't work. I also tried updating the file via the Google Drive and Dropbox APIs from Python: I could get file names and metadata and add new files, but I couldn't work out how to read a file's contents.
Therefore, this time I will set up cron on my PC and run it regularly.
In crontab -e ...

For the time being, run it every minute (`* * * * *` fires every minute; `0 * * * *` would fire only at minute 0 of each hour):

```
* * * * * sh /Users/<username>/dir1/post_notice/notice.sh
```