I created a tool that gets the title and URL of new blog posts by scraping with Python. GitHub: https://github.com/cerven12/blog_post_getter
When a friend of mine started (restarted?) blogging, I decided to post blog articles too, to consolidate what I learn and improve my writing. I figured it would be more motivating to have two people competing and encouraging each other than to do it alone, so I built this tool as part of that.
It gets the title and URL of newly posted articles. (I'd like to run it regularly and send notifications via the LINE API etc.)
It uses a txt file containing the URLs of existing posts and compares it against the URLs of the latest post list. Changes to the title or content are deliberately not detected (it would be annoying to get a "New post!" notification just because a title was edited!). The exception is when editing an article changes its URL (does that ever happen...?).
I built it for Qiita, so I don't know about other sites, but I think it can be used on any page whose HTML has the following format:

```html
<!-- The a tag has a class. The title is written as the content of the a tag -->
<a class='articles' href='#'>Title</a>
```
Qiita user page: https://qiita.com/takuto_neko_like
Hatena Blog user page: http://atc.hateblo.jp/about

In either case, you specify the selector attached to the `<a>` tag of each article.
```python
import requests, bs4

def new_post_getter(url, selecter, txt):
    '''
    Get the title and URL of each article as bs4 elements.

    1st argument: URL of the page with the post list
    2nd argument: CSS selector attached to the <a> tag of each post (with the leading ".")
    3rd argument: path of the txt file used for recording
    '''
    res = requests.get(url)
    posts = bs4.BeautifulSoup(res.text, 'html.parser').select(selecter)
    now_posts_url = []  # (1) URLs of the fetched article list; compared with past data to identify new posts
    now_posts_url_title_set = []  # (2) "URL>>>title" pairs of the fetched article list
    for post in posts:
        # Extract the URL
        index_first = str(post).find('href=') + 6
        index_end = str(post).find('">')
        url = str(post)[index_first:index_end]
        # Extract the title
        index_first = str(post).find('">') + 2
        index_end = str(post).find('</a')
        title = str(post)[index_first:index_end].replace('\u3000', ' ')  # replace full-width spaces
        now_posts_url.append(url)
        now_posts_url_title_set.append(f"{url}>>>{title}")
    old_post_text = open(txt)
    old_post = old_post_text.read().split(',')  # text file -> list
    # differences: posts in the current list that are not yet in the record (= new posts)
    differences = list(set(now_posts_url) - set(old_post))
    old_post_text.close()
    # Overwrite the recording txt: all_posts = past posts + new posts
    all_posts = ",".join(old_post + differences)
    f = open(txt, mode='w')
    f.writelines(all_posts)
    f.close()
    new_post_info = []
    for new in now_posts_url_title_set:
        for incremental in differences:
            if incremental in new:
                new_post_info.append(new.split(">>>"))
    return new_post_info
```
Specify the article-list page URL, the selector attached to the `<a>` tag of each article, and the path of the txt file that records the posting status as arguments.
Try it out

```python
url = 'https://qiita.com/takuto_neko_like'
selecter = '.u-link-no-underline'
file = 'neko.txt'

my_posts = new_post_getter(url, selecter, file)
print(my_posts)
```
By running the above ...

Result:

```
[['/takuto_neko_like/items/93b3751984e5e3fd3670', '[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~'], ['/takuto_neko_like/items/14e92797fa2b23a64adb', '[Python] What is inherited by multiple inheritance?']]
```

You get a nested list of URLs and titles:

```
[[URL, title], [URL, title], [URL, title], .......]
```
By iterating over this nested list with a for statement and formatting the strings ...
```python
for url, title in my_posts:
    print(f'{title} : {url}')
```
Easy-to-read output ↓

```
[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~ : /takuto_neko_like/items/93b3751984e5e3fd3670
[Python] What is inherited by multiple inheritance? : /takuto_neko_like/items/14e92797fa2b23a64adb
```
The contents of neko.txt look like this:

```
/takuto_neko_like/items/93b3751984e5e3fd3670,/takuto_neko_like/items/14e92797fa2b23a64adb,/takuto_neko_like/items/bb8d0957347636b5bf4f,/takuto_neko_like/items/62aeb4271614f6f0347f,/takuto_neko_like/items/c9c80ff453d0c4fad239,/takuto_neko_like/items/aed9dd5619d8457d4894,/takuto_neko_like/items/6cf9bade3d9515a724c0
```
It contains a comma-separated list of URLs. Let's try deleting the first and last entries ...
```
/takuto_neko_like/items/14e92797fa2b23a64adb,/takuto_neko_like/items/bb8d0957347636b5bf4f,/takuto_neko_like/items/62aeb4271614f6f0347f,/takuto_neko_like/items/c9c80ff453d0c4fad239,/takuto_neko_like/items/aed9dd5619d8457d4894
```
When you run ...

```python
my_posts = new_post_getter(url, selecter, file)
print(my_posts)
```

Result ↓
```
[['/takuto_neko_like/items/c5791f267e0964e09d03', 'Created a tool to get new articles to work hard with friends on blog posts'], ['/takuto_neko_like/items/93b3751984e5e3fd3670', '[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~'], ['/takuto_neko_like/items/6cf9bade3d9515a724c0', '【Python】@What are classmethods and decorators?']]
```
Exactly the deleted entries are detected as new posts! ☺
Below is a description of the code.
1. Get the `<a>` tags of all displayed articles from the article-list page
```python
import requests, bs4

def new_post_getter(url, selecter, txt):
    '''
    Get the title and URL of each article as bs4 elements.

    1st argument: URL of the page with the post list
    2nd argument: CSS selector attached to the <a> tag of each post (with the leading ".")
    3rd argument: path of the txt file used for recording
    '''
    res = requests.get(url)
    posts = bs4.BeautifulSoup(res.text, 'html.parser').select(selecter)
```
We use two third-party libraries here:

1. requests sends a GET request to the URL; the HTML of the response is available via `.text`.
2. Beautiful Soup parses that response HTML, and its `.select` method takes a CSS selector and returns every element matching it.

What is actually acquired in the "Try it out" example above is the list of matching `<a>` elements.
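As a minimal sketch of how these two calls fit together (using the URL and selector from the example above; the printed hrefs are illustrative):

```python
import requests, bs4

# Fetch the article-list page and parse the response HTML
res = requests.get('https://qiita.com/takuto_neko_like')
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# .select returns every element matching the CSS selector
links = soup.select('.u-link-no-underline')
for a in links:
    print(a.get('href'))  # e.g. /takuto_neko_like/items/93b3751984e5e3fd3670
```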
2. Get the title and URL from the obtained `<a>` tags; a set of URL and title is also kept separately
```python
now_posts_url = []  # (1) URLs of the fetched article list; compared with past data to identify new posts
now_posts_url_title_set = []  # (2) "URL>>>title" pairs of the fetched article list
for post in posts:
    # Extract the URL
    index_first = str(post).find('href=') + 6
    index_end = str(post).find('">')
    url = str(post)[index_first:index_end]
    # Extract the title
    index_first = str(post).find('">') + 2
    index_end = str(post).find('</a')
    title = str(post)[index_first:index_end].replace('\u3000', ' ')  # replace full-width spaces
    now_posts_url.append(url)
    now_posts_url_title_set.append(f"{url}>>>{title}")
```
Loop over the acquired `<a>` tag elements with a for statement. `.find()` returns the index at which a given substring starts, so by slicing the string with those indices you can cut out the URL part and the title part.
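For illustration, here is how that slicing behaves on a single made-up `<a>` tag string:

```python
# Made-up example of one <a> tag, as a string
post = '<a class="articles" href="/items/abc123">Sample title</a>'

index_first = post.find('href=') + 6   # index just after href="
index_end = post.find('">')            # index of the "> closing the attribute
print(post[index_first:index_end])     # -> /items/abc123

index_first = post.find('">') + 2      # index where the title text starts
index_end = post.find('</a')           # index where the title text ends
print(post[index_first:index_end])     # -> Sample title
```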
`now_posts_url` is the data used to compare with the post data so far and extract the difference (excluding articles that have dropped off the list screen due to pagination etc.).
New posts are detected by URL, which does not change even when an article is updated; but since the title should be output later as well, the `URL>>>title` pairs are saved now too. So `now_posts_url` is used to compute the diff, and afterwards only the entries of `now_posts_url_title_set` that contain a diff URL are extracted.
3. Compare the URLs obtained in step 2 with the list of existing posts (txt) and extract the difference
```python
old_post_text = open(txt)
old_post = old_post_text.read().split(',')  # text file -> list
# differences: posts in the current list that are not yet in the record (= new posts)
differences = list(set(now_posts_url) - set(old_post))
old_post_text.close()
```
We compare the latest post list against the txt file where the post records so far are saved, and extract the difference. This is a set difference: picturing it as a Venn diagram, A is the set of past posts, B is the latest post list, and the shaded region (B minus A) is the difference, i.e. the completely new posts.
Set operations are easy once the operands are `set` objects. The comma-separated string of URLs recorded in the txt file is turned into a list with `split()`, then both it and the latest post list obtained in step 2 are converted to `set`s and the difference is computed.
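A toy example of this set difference, with made-up URLs:

```python
# Made-up data: one URL has already been recorded, one is new
old_post = ['/items/aaa', '/items/bbb']
now_posts_url = ['/items/bbb', '/items/ccc']

differences = list(set(now_posts_url) - set(old_post))
print(differences)  # -> ['/items/ccc']  (only the post not yet recorded)
```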
4. Overwrite the txt, combining the new post URLs with the existing post record
```python
# Overwrite the recording txt: all_posts = past posts + new posts
all_posts = ",".join(old_post + differences)
f = open(txt, mode='w')
f.writelines(all_posts)
f.close()
```
The txt file also has to be brought up to date so it can be used next time: the difference (the new posts) is appended to the past posts and the file is overwritten.
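Incidentally, the same overwrite could be written with a `with` block, which closes the file even if an exception occurs (a sketch of an alternative, not how the tool is actually written):

```python
# Same overwrite using a context manager; write() suffices for a single string
all_posts = ",".join(old_post + differences)
with open(txt, mode='w') as f:
    f.write(all_posts)
```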
5. Format the new posts' titles and URLs
```python
new_post_info = []
for new in now_posts_url_title_set:
    for incremental in differences:
        if incremental in new:
            new_post_info.append(new.split(">>>"))
return new_post_info
```
From the "URL>>>title" strings saved in step 2, only the entries containing a URL that appears in the difference are kept. Since these are plain strings, the `in` operator's substring check is enough to match them. This yields the URL and title of each new article.
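A toy run of this matching step, with made-up data:

```python
# Made-up data: two recorded pairs, one new URL in the diff
now_posts_url_title_set = ['/items/aaa>>>First post', '/items/bbb>>>Second post']
differences = ['/items/bbb']

new_post_info = []
for new in now_posts_url_title_set:
    for incremental in differences:
        if incremental in new:              # substring check via the in operator
            new_post_info.append(new.split(">>>"))
print(new_post_info)  # -> [['/items/bbb', 'Second post']]
```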
~~I want to be able to notify a chat with friends regularly. Later.~~
2020/03/09 postscript
I used LINE Notify.
```python
def send_line_notify(posts, token):
    '''
    Takes the return value of new_post_getter as its first argument.
    '''
    notice_url = "https://notify-api.line.me/api/notify"
    headers = {"Authorization": "Bearer " + token}
    for url, title in posts:
        if 'http' not in url:
            url = 'https://qiita.com/' + url
        message = f'{title}:{url}'
        payload = {'message': message}
        r = requests.post(notice_url, headers=headers, params=payload)
```
Use it like this:

```python
token = '########'
neko_post = new_post_getter(neko_url, neko_selecter, neko_txt)
send_line_notify(neko_post, token)
```
Passing the return value of the `new_post_getter` function and the token as arguments sends the new posts to LINE Notify.
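Putting it together, a hypothetical send.py (the URL, selector, and file name are the ones used earlier; the token is a placeholder) might look like this:

```python
# Hypothetical send.py combining the two functions above
url = 'https://qiita.com/takuto_neko_like'
selecter = '.u-link-no-underline'
file = 'neko.txt'
token = '########'  # LINE Notify personal access token

new_posts = new_post_getter(url, selecter, file)
if new_posts:  # notify only when something new was found
    send_line_notify(new_posts, token)
```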
I used this page as a reference.
~~I want to run it every minute using PythonAnywhere. Later.~~
2020/03/09: I copied each file to PythonAnywhere and created a .sh like the one below, so that cron can use the virtual environment:

```sh
source /home/<account>/blog_post_notice/venv/bin/activate
python3 /home/<account>/blog_post_notice/send.py
```
Then, when I tried to run the .sh before setting up cron ...

Error:

```
requests.exceptions.ProxyError: HTTPSConnectionPool(host='qiita.com', port=443): Max retries exceeded with url: /takuto_neko_like (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden')))
```
After investigating, it turns out that free PythonAnywhere accounts can only access the external sites on a whitelist, to prevent abuse. So I gave up on PythonAnywhere ...
Next I tried deploying to Heroku. However, files cannot be persisted on Heroku, so overwriting a txt file in the same directory from Python, as this tool does, won't work. I also tried updating the file via the Google Drive and Dropbox APIs from Python: I could get file names and metadata and add new files, but I couldn't work out how to read a file's contents.
Therefore, this time I will set up cron on my PC and run it regularly.
In crontab -e ...

For the time being, run it every minute (`* * * * *` fires every minute; `0 * * * *` would fire only at minute 0 of each hour):

```
* * * * * sh /Users/<username>/dir1/post_notice/notice.sh
```