arXiv is a site operated by the Cornell University Library where papers in various fields are submitted and can be viewed as PDFs for free.
I figured that if I analyzed the information I want to read and posted my thoughts on Twitter, I could save myself the trouble of searching. As a first step, I decided to tweet the arXiv feed on Twitter.
The target this time was cs.CV, which is the category I am interested in.
I chose a Raspberry Pi because it runs all the time, but as long as Python runs there are no particular restrictions on the device side.
You can find the details by reading the relevant pages in the arXiv help documentation.
The important point is that the feed is updated only once a day, as described in section 3.3.1.1 of the arXiv API User's Manual. Since the information does not change however often you access it, the design has to take the frequency of API calls and a caching mechanism into account.
> Because the arXiv submission process works on a 24 hour submission cycle, new articles are only available to the API on the midnight after the articles were processed. The `<updated>` tag thus reflects the midnight of the day that you are calling the API. This is very important - search results do not change until new articles are added. Therefore there is no need to call the API more than once in a day for the same query. Please cache your results. This primarily applies to production systems, and of course you are free to play around with the API while you are developing your program!
The feed XML can be obtained from the following URL by replacing the category name:
http://export.arxiv.org/rss/cs.CV/rss.xml
A list of categories can be found [here](https://arxiv.org/help/api/user-manual#subject_classifications).
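For example, a small helper that builds the feed URL for any category could look like the sketch below (this helper is not part of the program in this post; the category names are just illustrations):

```python
# Hypothetical helper: build the arXiv RSS feed URL for a given category
def rss_url(category):
    return "http://export.arxiv.org/rss/{}/rss.xml".format(category)

print(rss_url("cs.CV"))  # http://export.arxiv.org/rss/cs.CV/rss.xml
print(rss_url("cs.RO"))  # http://export.arxiv.org/rss/cs.RO/rss.xml
```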
The program was created with reference to the following information.
The program is built with the twython and feedparser libraries (plus the standard sys module).
First, following that reference, the Twitter auth key information is collected in `auth.py`.
```python:auth.py
consumer_key = 'ABCDEFGHIJKLKMNOPQRSTUVWXYZ'
consumer_secret = '1234567890ABCDEFGHIJKLMNOPQRSTUVXYZ'
access_token = 'ZYXWVUTSRQPONMLKJIHFEDCBA'
access_token_secret = '0987654321ZYXWVUTSRQPONMLKJIHFEDCBA'
```
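These keys are then imported in the main script and used to create the Twython client, exactly as in the final program below:

```python
from twython import Twython
from auth import (
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

# Authenticated client used to post the tweets
twitter = Twython(
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)
```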
Set `RSS_URL` to the XML URL of the feed you want to import, and record the update log (the feed's `updated` date and time) in the file specified by `PUBDATE_LOG`.
I wanted the program to check for the file specified by `PUBDATE_LOG` itself, but I have not implemented that, so you need to create an empty file in advance with:

```bash
$ touch cs.CV.log
```
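If you wanted the script to handle this on its own, a minimal sketch (not implemented in the program below) could create the file when it does not exist:

```python
import os

PUBDATE_LOG = "/your LOG dir/cs.CV.log"

# Create an empty log file on the first run,
# so that the later open(PUBDATE_LOG, "r") does not fail
if not os.path.exists(PUBDATE_LOG):
    open(PUBDATE_LOG, "w").close()
```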
**your LOG dir** is the directory where this program is located. If you want to set up automatic execution with cron, specify it as an absolute path.
```python
RSS_URL = "http://export.arxiv.org/rss/cs.CV/rss.xml"
PUBDATE_LOG = "/your LOG dir/cs.CV.log"
```
The feed contents are parsed into `news_dic` in a dictionary-like format, and the necessary information is posted to Twitter with twython. The fields the arXiv feed provides at this point are listed in the comments of the program below.
```python
news_dic = feedparser.parse(RSS_URL)
"""
news_dic.* :
updated_parsed
etag
encoding
version
updated
headers
entries
namespaces
bozo
href
status
feed

print(news_dic.updated_parsed) #time.struct_time(tm_year=2017, tm_mon=8, tm_mday=16, tm_hour=0, tm_min=30, tm_sec=0, tm_wday=2, tm_yday=228, tm_isdst=0)
print(news_dic.etag )      #'"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"'
print(news_dic.encoding )  #us-ascii
print(news_dic.version )   #rss10
print(news_dic.updated )   #Wed, 16 Aug 2017 00:30:00 GMT
print(news_dic.headers )   #{'Expires': 'Thu, 17 Aug 2017 00:00:00 GMT', 'Connection': 'close', 'ETag': '"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"', 'Server': 'Apache', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Type': 'text/xml', 'Content-Length': '15724', 'Date': 'Wed, 16 Aug 2017 06:43:57 GMT', 'Last-Modified': 'Wed, 16 Aug 2017 00:30:00 GMT', 'Content-Encoding': 'gzip'}
print(news_dic.entries )   #CONTENTS OF RSS FEED!!
print(news_dic.namespaces ) #{'': 'http://purl.org/rss/1.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'content': 'http://purl.org/rss/1.0/modules/content/', 'sy': 'http://purl.org/rss/1.0/modules/syndication/', 'dc': 'http://purl.org/dc/elements/1.1/', 'admin': 'http://webns.net/mvcb/', 'taxo': 'http://purl.org/rss/1.0/modules/taxonomy/'}
print(news_dic.bozo )      #0
print(news_dic.href )      #http://export.arxiv.org/rss/cs.CV/rss.xml
print(news_dic.status )    #200
print(news_dic.feed )
"""
```
Check whether the feed has been updated by comparing `pubID` with `lastPubID`; if not, exit the program. If it has been updated, overwrite the file pointed to by `PUBDATE_LOG`.
```python
pubID = news_dic.updated
# Read the pubID recorded on the previous run
with open(PUBDATE_LOG, "r") as rf:
    lastPubID = rf.readline().rstrip("\n")
# Exit if the feed has not been updated since the last run
if (pubID == lastPubID):
    print("Feed not updated")
    sys.exit()
else:
    with open(PUBDATE_LOG, "w") as f:
        f.write(pubID + "\n")
```
Each element of `news_dic.entries` has fields such as `title`, `link`, and `description`.
The information to post is `title` and `link`. However, `title` can be long, so the title is trimmed so that the whole tweet, including the URL link, stays within the 140-character limit.
```python
for i in news_dic.entries:
    if len(i.title) > 100:
        message = i.title[0:100] + "......\n" + i.link
    else:
        message = i.title[0:109] + "\n" + i.link
    #print(len(message))
    #print(message)
    try:
        twitter.update_status(status=message)
    except TwythonError as e:
        print(e)
```
The final program as a result of trial and error is as follows.
```py:twitter_feed_arxiv_cs.CV.py
# coding: utf-8
from twython import Twython, TwythonError
import feedparser
import sys

from auth import (
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

twitter = Twython(
    consumer_key,
    consumer_secret,
    access_token,
    access_token_secret
)

RSS_URL = "http://export.arxiv.org/rss/cs.CV/rss.xml"
PUBDATE_LOG = "/<your LOG dir>/cs.CV.log"
"""
Create the empty log file in advance:
    touch cs.CV.log
When running from cron, specify the log file and this script with absolute paths.
"""

news_dic = feedparser.parse(RSS_URL)
"""
news_dic.* :
updated_parsed
etag
encoding
version
updated
headers
entries
namespaces
bozo
href
status
feed

print(news_dic.updated_parsed) #time.struct_time(tm_year=2017, tm_mon=8, tm_mday=16, tm_hour=0, tm_min=30, tm_sec=0, tm_wday=2, tm_yday=228, tm_isdst=0)
print(news_dic.etag )      #'"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"'
print(news_dic.encoding )  #us-ascii
print(news_dic.version )   #rss10
print(news_dic.updated )   #Wed, 16 Aug 2017 00:30:00 GMT
print(news_dic.headers )   #{'Expires': 'Thu, 17 Aug 2017 00:00:00 GMT', 'Connection': 'close', 'ETag': '"Wed, 16 Aug 2017 00:30:00 GMT", "1502843400"', 'Server': 'Apache', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Type': 'text/xml', 'Content-Length': '15724', 'Date': 'Wed, 16 Aug 2017 06:43:57 GMT', 'Last-Modified': 'Wed, 16 Aug 2017 00:30:00 GMT', 'Content-Encoding': 'gzip'}
print(news_dic.entries )   #CONTENTS OF RSS FEED!!
print(news_dic.namespaces ) #{'': 'http://purl.org/rss/1.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'content': 'http://purl.org/rss/1.0/modules/content/', 'sy': 'http://purl.org/rss/1.0/modules/syndication/', 'dc': 'http://purl.org/dc/elements/1.1/', 'admin': 'http://webns.net/mvcb/', 'taxo': 'http://purl.org/rss/1.0/modules/taxonomy/'}
print(news_dic.bozo )      #0
print(news_dic.href )      #http://export.arxiv.org/rss/cs.CV/rss.xml
print(news_dic.status )    #200
print(news_dic.feed )
"""

pubID = news_dic.updated
# Read the pubID recorded on the previous run
with open(PUBDATE_LOG, "r") as rf:
    lastPubID = rf.readline().rstrip("\n")
# Exit if the feed has not been updated since the last run
if (pubID == lastPubID):
    print("Feed not updated")
    sys.exit()
else:
    with open(PUBDATE_LOG, "w") as f:
        f.write(pubID + "\n")

for i in news_dic.entries:
    if len(i.title) > 100:
        message = i.title[0:100] + "......\n" + i.link
    else:
        message = i.title[0:109] + "\n" + i.link
    #print(len(message))
    #print(message)
    try:
        twitter.update_status(status=message)
    except TwythonError as e:
        print(e)
```
I ran the following and confirmed that the posts appeared on my Twitter account.
```bash
$ python3 twitter_feed_arxiv_cs.CV.py
```
I also created a separate log file and program for cs.RO, and that worked as well.
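Since the cs.RO version differs only in the category name and log file, the two scripts could also be merged into one that takes the category as an argument; a rough sketch of that idea (not how I actually did it) would be:

```python
import sys

# Hypothetical: take the category (e.g. cs.CV or cs.RO) from the command line
category = sys.argv[1] if len(sys.argv) > 1 else "cs.CV"

RSS_URL = "http://export.arxiv.org/rss/{}/rss.xml".format(category)
PUBDATE_LOG = "/your LOG dir/{}.log".format(category)
```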
The script can be made to tweet once a day with cron. The feed seems to be updated at **00:30:00 GMT**, so I set it to fetch the feed every day at 10:00 (JST).
```bash
$ crontab -e
```
When the editor opens, add an entry that fetches the feed every day at 10:00. **your LOG dir** is the directory where this program is located.
```
00 10 * * * python3 /your LOG dir/twitter_feed_arxiv_cs.CV.py >/dev/null 2>&1
```
For now it can simply tweet the feed, but cs.CV and cs.RO each seem to get more than 50 submissions every day, so to find articles of interest efficiently the submissions need to be narrowed down further.
It looks like this could be done by parsing the `title` and `description` strings; it might even be a good use case for machine learning. A simple keyword filter is sketched below.
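As a first step before anything machine-learned, even a plain keyword match over the feed entries would help; a minimal sketch (the keywords are only examples, not terms from this post):

```python
import feedparser

# Hypothetical keyword filter over the cs.CV feed entries
KEYWORDS = ["segmentation", "detection", "slam"]

news_dic = feedparser.parse("http://export.arxiv.org/rss/cs.CV/rss.xml")
for entry in news_dic.entries:
    # Match against the lower-cased title and description of each submission
    text = (entry.title + " " + entry.description).lower()
    if any(keyword in text for keyword in KEYWORDS):
        print(entry.title + "\n" + entry.link)
```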