I wanted to introduce Mattermost and try something with it, so I wrote a program that posts RSS feeds. (It's a secret that I only later realized an official project already exists.)
I figured the feeds could be put to other uses as well, so I process the fetched feed with pandas and decided to store it in a DB.
I set up PostgreSQL from the Docker image.
```bash
docker pull postgres:9.5
docker run -p 5432:5432 --name postgres-server -v /var/lib/postgresql:/var/lib/postgresql:rw postgres:9.5
firewall-cmd --permanent --add-port=5432/tcp
firewall-cmd --reload
```
This launches a PostgreSQL container that accepts remote connections, at least for now. Docker is convenient.
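To confirm that the container really accepts connections, a minimal check like the following works. This is just a sketch, not from the original post: it assumes the psycopg2 package is installed and that the container allows the default `postgres` user to connect without a password (older postgres images fall back to trust authentication when no `POSTGRES_PASSWORD` is given, as in the `docker run` above).

```python
# Minimal connection check (a sketch; assumes psycopg2 and passwordless
# access for the default "postgres" user).
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432,
                        user="postgres", dbname="postgres")
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()
```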
Other environment setup is omitted; Python runs in a pyenv environment.
We use a Python library called feedparser. I installed it with pip, referring to this article:
http://qiita.com/shunsuke227ono/items/da52a290f78924c1f485
```python
import feedparser

RSS_URL = "http://b.hatena.ne.jp/hotentry/it.rss"

print("Start get feed from %s" % (RSS_URL))
feed = feedparser.parse(RSS_URL)
```
Now you can get the feed. (Here I'm fetching the hot entries in Hatena Bookmark's technology category.)
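Each element of `feed.entries` behaves like a dictionary, so you can already inspect the results at this point, for example:

```python
# Print the title and link of the first few entries.
for entry in feed.entries[:3]:
    print(entry.title, entry.link)
```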
Map it to a pandas.DataFrame to make later processing easier.
```python
import pandas as pd

entries = pd.DataFrame(feed.entries)
```
...and that's it. pandas is excellent.
In the case of Hatena's RSS feed, 12 columns were obtained, including the id, link, title, summary, and updated columns used below.
At this point, you can manipulate the data freely with pandas functions.
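For example, a quick sketch using the `title` and `updated` columns obtained above (Hatena's dates are ISO 8601 strings, so sorting them lexically happens to work here):

```python
# Show the three most recently updated entries.
latest = entries.sort_values('updated', ascending=False).head(3)
print(latest[['updated', 'title']])
```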
feedparser is very convenient, but it fetches whatever the feed contains at access time, so entries you have already fetched in the past will show up again.
This is where expanding into a DataFrame pays off! The following example uses DataFrame operations to extract and display only the new entries.
```python
import re
import time

import feedparser
import pandas as pd

tag_re = re.compile(r'<[^>]+>')  # for stripping HTML tags from summaries

already_print_feeds = pd.Series()

while True:
    time.sleep(300)
    feed = feedparser.parse(RSS_URL)
    entries = pd.DataFrame(feed.entries)
    # keep only the entries whose id we have not seen yet
    new_entries = entries[~entries['id'].isin(already_print_feeds)]
    if not new_entries.empty:
        for key, row in new_entries.iterrows():
            feedinfo = "[**%s**](%s)\n\n>%s" % (
                row['title'], row['link'], tag_re.sub('', row['summary']))
            print(feedinfo)
        already_print_feeds = already_print_feeds.append(new_entries['id'])
```
It pulls only the new arrivals out of the retrieved RSS feed. `already_print_feeds` is assumed to hold the `id` of every RSS entry obtained so far. For the entries stored in `entries`, `~entries['id'].isin(already_print_feeds)` returns a boolean Series that is True only for the new rows, so using it as a mask on `entries` extracts just the new arrivals.
```python
~entries['id'].isin(already_print_feeds)
# =>
0     False
1      True  # => ★New!
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19     True  # => ★New!
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
Name: id, dtype: bool
```
The `id` of each newly displayed entry is then appended to `already_print_feeds`.
```python
already_print_feeds = already_print_feeds.append(new_entries['id'])
```
:warning: With the code above, however, data accumulates in `already_print_feeds` indefinitely, so it will eventually break down (memory). Flush it once a day, or read the seen ids back from the DB.
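One simple mitigation is to de-duplicate and cap the in-memory history. A sketch, where `MAX_IDS` is an arbitrary illustrative constant, not from the original post:

```python
MAX_IDS = 10000  # illustrative cap, tune to taste
already_print_feeds = (already_print_feeds
                       .drop_duplicates()
                       .tail(MAX_IDS)
                       .reset_index(drop=True))
```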
Now save the obtained RSS feed into PostgreSQL. The columns are narrowed down to the ones shown below, though.
First, create a table in the DB.
```sql
create table feed (
    id      text primary key,
    link    text,
    title   text,
    summary text,
    updated timestamp
);
```
For now, I put a primary key constraint on id, and made updated a timestamp column. (The updated value in Hatena's feed can apparently be INSERTed into a timestamp column as-is.)
```python
from sqlalchemy import create_engine

DATABASE_CONN = "postgresql://xxxx:xxxx@xxxxx:xxx/xxxx"
DATABASE_TABLE = "feed"

# connect to the database
engine = create_engine(DATABASE_CONN)

# store only the columns defined in the table
# (.loc is used here; .ix is deprecated in newer pandas)
stored_entries = new_entries.loc[:, [
    "id", "link", "title", "summary", "updated"]]
stored_entries.to_sql(DATABASE_TABLE, engine, index=False, if_exists='append')
```
Use the DataFrame's to_sql method. With index=False the DataFrame index is not written out as an extra column, and if_exists='append' adds the rows to the table that already exists. (Since id is the primary key, inserting an entry whose id already exists would raise an integrity error, so only genuinely new entries should be passed in.)
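This also suggests a cleaner fix for the memory concern above: the seen ids can be read back from the DB instead of being kept in memory. A minimal sketch, assuming the engine and table defined above:

```python
import pandas as pd

# Rebuild the seen-id history from the feed table.
already_print_feeds = pd.read_sql("select id from feed", engine)['id']
```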
Posting to Mattermost is very easy with requests, a Python library for sending HTTP requests.
```python
import requests
import json

mattermosturl = "Mattermost incoming webhook URL"
username = "Favorite name"

header = {'Content-Type': 'application/json'}
payload = {
    "text": feedinfo,
    "username": username,
}
resp = requests.post(mattermosturl,
                     headers=header, data=json.dumps(payload))
```
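Mattermost's incoming webhook returns HTTP 200 on success, so it's worth checking the response. A small sketch:

```python
# Simple error check on the webhook response.
if resp.status_code != 200:
    print("Failed to post to Mattermost: %s %s" % (resp.status_code, resp.text))
```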
Since the data is already in pandas, I'd like to try some machine learning on it, too.