arXiv is a site operated by the Cornell University Library where papers in various fields are submitted and can be viewed as PDFs for free.
I figured that if I analyzed the information I want to read and posted my thoughts on Twitter, I could save myself the trouble of searching. As a first step, I decided to tweet the arXiv feed on Twitter.
The target this time was cs.CV, which is the category I am interested in.
I chose a Raspberry Pi because it runs all the time, but as long as Python runs there are no particular restrictions on the device side.
You can find the details by reading the relevant pages in the arXiv help documentation.
The important point is that the feed is updated only once a day, as described in section 3.3.1.1 of the arXiv API User's Manual. Since the information does not change however often you access it, the design has to take the frequency of API calls and a caching mechanism into account.
> Because the arXiv submission process works on a 24 hour submission cycle, new articles are only available to the API on the midnight after the articles were processed. The `<updated>` tag thus reflects the midnight of the day that you are calling the API. This is very important - search results do not change until new articles are added. Therefore there is no need to call the API more than once in a day for the same query. Please cache your results. This primarily applies to production systems, and of course you are free to play around with the API while you are developing your program!
The feed XML can be obtained from the following URL by replacing the category name:
http://export.arxiv.org/rss/cs.CV/rss.xml
A list of categories can be found [here](https://arxiv.org/help/api/user-manual#subject_classifications).
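For example, a small helper that builds the feed URL for any category could look like the sketch below (this helper is not part of the program in this post; the category names are just illustrations):

```python
# Hypothetical helper: build the arXiv RSS feed URL for a given category
def rss_url(category):
    return "http://export.arxiv.org/rss/{}/rss.xml".format(category)

print(rss_url("cs.CV"))  # http://export.arxiv.org/rss/cs.CV/rss.xml
print(rss_url("cs.RO"))  # http://export.arxiv.org/rss/cs.RO/rss.xml
```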
The program was created with reference to the following information.
The program is built with the twython and feedparser libraries (plus the standard sys module).
First, following that reference, the Twitter auth key information is collected in `auth.py`.
```python:auth.py
consumer_key = 'ABCDEFGHIJKLKMNOPQRSTUVWXYZ'
consumer_secret = '1234567890ABCDEFGHIJKLMNOPQRSTUVXYZ'
access_token = 'ZYXWVUTSRQPONMLKJIHFEDCBA'
access_token_secret = '0987654321ZYXWVUTSRQPONMLKJIHFEDCBA'
```
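These keys are then imported in the main script and used to create the Twython client, exactly as in the final program below:

```python
from twython import Twython
from auth import (
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

# Authenticated client used to post the tweets
twitter = Twython(
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)
```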
Set `RSS_URL` to the XML URL of the feed you want to import, and record the update log (the feed's `updated` date and time) in the file specified by `PUBDATE_LOG`.
I wanted the program to check for the file specified by `PUBDATE_LOG` itself, but I have not implemented that, so you need to create an empty file in advance with:

```bash
$ touch cs.CV.log
```
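If you wanted the script to handle this on its own, a minimal sketch (not implemented in the program below) could create the file when it does not exist:

```python
import os

PUBDATE_LOG = "/your LOG dir/cs.CV.log"

# Create an empty log file on the first run,
# so that the later open(PUBDATE_LOG, "r") does not fail
if not os.path.exists(PUBDATE_LOG):
    open(PUBDATE_LOG, "w").close()
```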
**your LOG dir** is the directory where this program is located. If you want to set up automatic execution with cron, specify it as an absolute path.
```python
RSS_URL = "http://export.arxiv.org/rss/cs.CV/rss.xml"
PUBDATE_LOG = "/your LOG dir/cs.CV.log"
```
The feed contents are parsed into `news_dic` in a dictionary-like format, and the necessary information is posted to Twitter with twython. The fields the arXiv feed provides at this point are listed in the comments of the program below.
```python
news_dic = feedparser.parse(RSS_URL)
"""
news_dic.* :
updated_parsed
etag
encoding
version
updated
headers
entries
namespaces
bozo
href
status
feed

print(news_dic.updated_parsed) #time.struct_time(tm_year=2017, tm_mon=8, tm_mday=16, tm_hour=0, tm_min=30, tm_sec=0, tm_wday=2, tm_yday=228, tm_isdst=0)
print(news_dic.etag )      #'"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"'
print(news_dic.encoding )  #us-ascii
print(news_dic.version )   #rss10
print(news_dic.updated )   #Wed, 16 Aug 2017 00:30:00 GMT
print(news_dic.headers )   #{'Expires': 'Thu, 17 Aug 2017 00:00:00 GMT', 'Connection': 'close', 'ETag': '"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"', 'Server': 'Apache', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Type': 'text/xml', 'Content-Length': '15724', 'Date': 'Wed, 16 Aug 2017 06:43:57 GMT', 'Last-Modified': 'Wed, 16 Aug 2017 00:30:00 GMT', 'Content-Encoding': 'gzip'}
print(news_dic.entries )   #CONTENTS OF RSS FEED!!
print(news_dic.namespaces ) #{'': 'http://purl.org/rss/1.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'content': 'http://purl.org/rss/1.0/modules/content/', 'sy': 'http://purl.org/rss/1.0/modules/syndication/', 'dc': 'http://purl.org/dc/elements/1.1/', 'admin': 'http://webns.net/mvcb/', 'taxo': 'http://purl.org/rss/1.0/modules/taxonomy/'}
print(news_dic.bozo )      #0
print(news_dic.href )      #http://export.arxiv.org/rss/cs.CV/rss.xml
print(news_dic.status )    #200
print(news_dic.feed )
"""
```
Check whether the feed has been updated by comparing `pubID` with `lastPubID`; if not, exit the program. If it has been updated, overwrite the file pointed to by `PUBDATE_LOG`.
```python
pubID = news_dic.updated
# Read the pubID recorded on the previous run
with open(PUBDATE_LOG, "r") as rf:
    lastPubID = rf.readline().rstrip("\n")
# Exit if the feed has not been updated since the last run
if (pubID == lastPubID):
    print("Feed not updated")
    sys.exit()
else:
    with open(PUBDATE_LOG, "w") as f:
        f.write(pubID + "\n")
```
Each element of `news_dic.entries` has fields such as `title`, `link`, and `description`.
The information to post is `title` and `link`. However, `title` can be long, so the title is trimmed so that the whole tweet, including the URL link, stays within the 140-character limit.
```python
for i in news_dic.entries:
    if len(i.title) > 100:
        message = i.title[0:100] + "......\n" + i.link
    else:
        message = i.title[0:109] + "\n" + i.link
    #print(len(message))
    #print(message)
    try:
        twitter.update_status(status=message)
    except TwythonError as e:
        print(e)
```
The final program as a result of trial and error is as follows.
```py:twitter_feed_arxiv_cs.CV.py
# coding: utf-8
from twython import Twython, TwythonError
import feedparser
import sys

from auth import (
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

twitter = Twython(
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

RSS_URL = "http://export.arxiv.org/rss/cs.CV/rss.xml"
PUBDATE_LOG = "/<your LOG dir>/cs.CV.log"
"""
Create the empty log file in advance:
    touch cs.CV.log
When running from cron, specify the log file and this script with absolute paths.
"""

news_dic = feedparser.parse(RSS_URL)
"""
news_dic.* :
updated_parsed
etag
encoding
version
updated
headers
entries
namespaces
bozo
href
status
feed

print(news_dic.updated_parsed) #time.struct_time(tm_year=2017, tm_mon=8, tm_mday=16, tm_hour=0, tm_min=30, tm_sec=0, tm_wday=2, tm_yday=228, tm_isdst=0)
print(news_dic.etag )      #'"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"'
print(news_dic.encoding )  #us-ascii
print(news_dic.version )   #rss10
print(news_dic.updated )   #Wed, 16 Aug 2017 00:30:00 GMT
print(news_dic.headers )   #{'Expires': 'Thu, 17 Aug 2017 00:00:00 GMT', 'Connection': 'close', 'ETag': '"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"', 'Server': 'Apache', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Type': 'text/xml', 'Content-Length': '15724', 'Date': 'Wed, 16 Aug 2017 06:43:57 GMT', 'Last-Modified': 'Wed, 16 Aug 2017 00:30:00 GMT', 'Content-Encoding': 'gzip'}
print(news_dic.entries )   #CONTENTS OF RSS FEED!!
print(news_dic.namespaces ) #{'': 'http://purl.org/rss/1.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'content': 'http://purl.org/rss/1.0/modules/content/', 'sy': 'http://purl.org/rss/1.0/modules/syndication/', 'dc': 'http://purl.org/dc/elements/1.1/', 'admin': 'http://webns.net/mvcb/', 'taxo': 'http://purl.org/rss/1.0/modules/taxonomy/'}
print(news_dic.bozo )      #0
print(news_dic.href )      #http://export.arxiv.org/rss/cs.CV/rss.xml
print(news_dic.status )    #200
print(news_dic.feed )
"""

pubID = news_dic.updated
# Read the pubID recorded on the previous run
with open(PUBDATE_LOG, "r") as rf:
    lastPubID = rf.readline().rstrip("\n")
# Exit if the feed has not been updated since the last run
if (pubID == lastPubID):
    print("Feed not updated")
    sys.exit()
else:
    with open(PUBDATE_LOG, "w") as f:
        f.write(pubID + "\n")

for i in news_dic.entries:
    if len(i.title) > 100:
        message = i.title[0:100] + "......\n" + i.link
    else:
        message = i.title[0:109] + "\n" + i.link
    #print(len(message))
    #print(message)
    try:
        twitter.update_status(status=message)
    except TwythonError as e:
        print(e)
```
I ran the following and confirmed that the posts appeared on my Twitter account.
```bash
$ python3 twitter_feed_arxiv_cs.CV.py
```
I also created a separate log file and program for cs.RO, and that worked as well.
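Since the cs.RO version differs only in the category name and log file, the two scripts could also be merged into one that takes the category as an argument; a rough sketch of that idea (not how I actually did it) would be:

```python
import sys

# Hypothetical: take the category (e.g. cs.CV or cs.RO) from the command line
category = sys.argv[1] if len(sys.argv) > 1 else "cs.CV"

RSS_URL = "http://export.arxiv.org/rss/{}/rss.xml".format(category)
PUBDATE_LOG = "/your LOG dir/{}.log".format(category)
```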
The script can be made to tweet once a day with cron. The feed seems to be updated at **00:30:00 GMT**, so I set it to fetch the feed every day at 10:00 (JST).
```bash
$ crontab -e
```
When the editor opens, add an entry that fetches the feed every day at 10:00. **your LOG dir** is the directory where this program is located.
```
00 10 * * * python3 /your LOG dir/twitter_feed_arxiv_cs.CV.py >/dev/null 2>&1
```
For now it can simply tweet the feed, but cs.CV and cs.RO each seem to get more than 50 submissions every day, so to find articles of interest efficiently the submissions need to be narrowed down further.
It looks like this could be done by parsing the `title` and `description` strings; it might even be a good use case for machine learning. A simple keyword filter is sketched below.
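As a first step before anything machine-learned, even a plain keyword match over the feed entries would help; a minimal sketch (the keywords are only examples, not terms from this post):

```python
import feedparser

# Hypothetical keyword filter over the cs.CV feed entries
KEYWORDS = ["segmentation", "detection", "slam"]

news_dic = feedparser.parse("http://export.arxiv.org/rss/cs.CV/rss.xml")
for entry in news_dic.entries:
    # Match against the lower-cased title and description of each submission
    text = (entry.title + " " + entry.description).lower()
    if any(keyword in text for keyword in KEYWORDS):
        print(entry.title + "\n" + entry.link)
```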