I wanted to introduce Mattermost and try something with it, so I wrote a program that posts RSS feeds. (It's a secret that I only later realized an official project already exists.)
I figured the feeds could be put to other uses as well, so I process the fetched feed with pandas and decided to store it in a DB.
I set up PostgreSQL from the Docker image.
```bash
docker pull postgres:9.5
docker run -p 5432:5432 --name postgres-server -v /var/lib/postgresql:/var/lib/postgresql:rw postgres:9.5
firewall-cmd --permanent --add-port=5432/tcp
firewall-cmd --reload
```
This launches a PostgreSQL container that accepts remote connections, at least for now. Docker is convenient.
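To confirm that the container really accepts connections, a minimal check like the following works. This is just a sketch, not from the original post: it assumes the psycopg2 package is installed and that the container allows the default `postgres` user to connect without a password (older postgres images fall back to trust authentication when no `POSTGRES_PASSWORD` is given, as in the `docker run` above).

```python
# Minimal connection check (a sketch; assumes psycopg2 and passwordless
# access for the default "postgres" user).
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432,
                        user="postgres", dbname="postgres")
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()
```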
Other environment setup is omitted; Python runs in a pyenv environment.
We use a Python library called feedparser. I installed it with pip, referring to this article:
http://qiita.com/shunsuke227ono/items/da52a290f78924c1f485
```python
import feedparser

RSS_URL = "http://b.hatena.ne.jp/hotentry/it.rss"

print("Start get feed from %s" % (RSS_URL))
feed = feedparser.parse(RSS_URL)
```
Now you can get the feed. (Here I'm fetching the hot entries in Hatena Bookmark's technology category.)
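Each element of `feed.entries` behaves like a dictionary, so you can already inspect the results at this point, for example:

```python
# Print the title and link of the first few entries.
for entry in feed.entries[:3]:
    print(entry.title, entry.link)
```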
Map it to a pandas.DataFrame to make later processing easier.
```python
import pandas as pd

entries = pd.DataFrame(feed.entries)
```
...and that's it. pandas is excellent.
In the case of Hatena's RSS feed, 12 columns were obtained, including the id, link, title, summary, and updated columns used below.
At this point, you can manipulate the data freely with pandas functions.
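For example, a quick sketch using the `title` and `updated` columns obtained above (Hatena's dates are ISO 8601 strings, so sorting them lexically happens to work here):

```python
# Show the three most recently updated entries.
latest = entries.sort_values('updated', ascending=False).head(3)
print(latest[['updated', 'title']])
```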
feedparser is very convenient, but it fetches whatever the feed contains at access time, so entries you have already fetched in the past will show up again.
This is where expanding into a DataFrame pays off! The following example uses DataFrame operations to extract and display only the new entries.
```python
import re
import time

import feedparser
import pandas as pd

tag_re = re.compile(r'<[^>]+>')  # for stripping HTML tags from summaries

already_print_feeds = pd.Series()

while True:
    time.sleep(300)
    feed = feedparser.parse(RSS_URL)
    entries = pd.DataFrame(feed.entries)
    # keep only the entries whose id we have not seen yet
    new_entries = entries[~entries['id'].isin(already_print_feeds)]
    if not new_entries.empty:
        for key, row in new_entries.iterrows():
            feedinfo = "[**%s**](%s)\n\n>%s" % (
                row['title'], row['link'], tag_re.sub('', row['summary']))
            print(feedinfo)
        already_print_feeds = already_print_feeds.append(new_entries['id'])
```
It pulls only the new arrivals out of the retrieved RSS feed. `already_print_feeds` is assumed to hold the `id` of every RSS entry obtained so far. For the entries stored in `entries`, `~entries['id'].isin(already_print_feeds)` returns a boolean Series that is True only for the new rows, so using it as a mask on `entries` extracts just the new arrivals.
```python
~entries['id'].isin(already_print_feeds)
# =>
0     False
1      True  # => ★New!
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19     True  # => ★New!
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
Name: id, dtype: bool
```
The `id` of each newly displayed entry is then appended to `already_print_feeds`.
```python
already_print_feeds = already_print_feeds.append(new_entries['id'])
```
:warning: With the code above, however, data accumulates in `already_print_feeds` indefinitely, so it will eventually break down (memory). Flush it once a day, or read the seen ids back from the DB.
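One simple mitigation is to de-duplicate and cap the in-memory history. A sketch, where `MAX_IDS` is an arbitrary illustrative constant, not from the original post:

```python
MAX_IDS = 10000  # illustrative cap, tune to taste
already_print_feeds = (already_print_feeds
                       .drop_duplicates()
                       .tail(MAX_IDS)
                       .reset_index(drop=True))
```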
Now save the obtained RSS feed into PostgreSQL. The columns are narrowed down to the ones shown below, though.
First, create a table in the DB.
```sql
create table feed (
    id      text primary key,
    link    text,
    title   text,
    summary text,
    updated timestamp
);
```
For now, I put a primary key constraint on id, and made updated a timestamp column. (The updated value in Hatena's feed can apparently be INSERTed into a timestamp column as-is.)
```python
from sqlalchemy import create_engine

DATABASE_CONN = "postgresql://xxxx:xxxx@xxxxx:xxx/xxxx"
DATABASE_TABLE = "feed"

# connect to the database
engine = create_engine(DATABASE_CONN)

# store only the columns defined in the table
# (.loc is used here; .ix is deprecated in newer pandas)
stored_entries = new_entries.loc[:, [
    "id", "link", "title", "summary", "updated"]]
stored_entries.to_sql(DATABASE_TABLE, engine, index=False, if_exists='append')
```
Use the DataFrame's to_sql method. With index=False the DataFrame index is not written out as an extra column, and if_exists='append' adds the rows to the table that already exists. (Since id is the primary key, inserting an entry whose id already exists would raise an integrity error, so only genuinely new entries should be passed in.)
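This also suggests a cleaner fix for the memory concern above: the seen ids can be read back from the DB instead of being kept in memory. A minimal sketch, assuming the engine and table defined above:

```python
import pandas as pd

# Rebuild the seen-id history from the feed table.
already_print_feeds = pd.read_sql("select id from feed", engine)['id']
```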
Posting to Mattermost is very easy with requests, a Python library for sending HTTP requests.
```python
import requests
import json

mattermosturl = "Mattermost incoming webhook URL"
username = "Favorite name"

header = {'Content-Type': 'application/json'}
payload = {
    "text": feedinfo,
    "username": username,
}
resp = requests.post(mattermosturl,
                     headers=header, data=json.dumps(payload))
```
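Mattermost's incoming webhook returns HTTP 200 on success, so it's worth checking the response. A small sketch:

```python
# Simple error check on the webhook response.
if resp.status_code != 200:
    print("Failed to post to Mattermost: %s %s" % (resp.status_code, resp.text))
```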
Since the data is already in pandas, I'd like to try some machine learning on it, too.