I want to collect tweets about Aikatsu! every day. Doing that by hand is tedious, so I want to automate it on AWS Lambda so that the collection runs daily.
As a first step, I registered the twitterscraper library (1.4.0) as a Lambda Layer, wrote the following rough code just to verify that it works, and ran a test.
from twitterscraper import query_tweets
import datetime as dt

def lambda_handler(event, context):
    begin_date = dt.date(2020,6,5)
    end_date = dt.date(2020,6,6)
    pool_size = (end_date - begin_date).days
    tweets = query_tweets("Aikatsu", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang="ja")
    tuple_tweet=[(tweet.user_id, tweet.tweet_id, tweet.text.replace("\n","\t"), tweet.timestamp) for tweet in tweets]
    return True
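The handler above hard-codes its dates for the test. For the daily run that is the eventual goal, the range could be derived from the current date instead. The following is only a minimal sketch: daily_window is a hypothetical helper name, not part of twitterscraper, and it assumes that "yesterday to today" is the range wanted, mirroring the one-day range used above.

```python
import datetime as dt

def daily_window(today=None):
    """Return (begin_date, end_date) covering the previous calendar day.

    `today` is injectable so the function can be tested; when omitted it
    falls back to the date of the runtime environment (typically UTC on
    Lambda unless the time zone has been changed).
    """
    if today is None:
        today = dt.date.today()
    begin_date = today - dt.timedelta(days=1)
    return begin_date, today

# Same range as the hard-coded test above.
begin_date, end_date = daily_window(dt.date(2020, 6, 6))
print(begin_date, end_date)
```

The two returned dates can be passed to query_tweets as begindate and enddate in place of the fixed values.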
Running the test on AWS Lambda produced the following error, which says that "pool" is not defined.
{
  "errorMessage": "name 'pool' is not defined",
  "errorType": "NameError",
  "stackTrace": [
    " File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n tweets = query_tweets(\"Aikatsu\", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang=\"ja\")\n",
    " File \"/opt/python/twitterscraper/query.py\", line 246, in query_tweets\n pool.close()\n"
  ]
}
The same code works fine in Jupyter Notebook, so I suspected that something was wrong with the Lambda Layer. But what is the variable "pool" in the first place? I decided to find out.
It turns out to be a variable in twitterscraper's query.py.
query.py
def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang=''):
    no_days = (enddate - begindate).days

    if(no_days < 0):
        sys.exit('Begin date must occur before end date.')

    if poolsize > no_days:
        # Since we are assigning each pool a range of dates to query,
        # the number of pools should not exceed the number of dates.
        poolsize = no_days
    dateranges = [begindate + dt.timedelta(days=elem) for elem in linspace(0, no_days, poolsize+1)]

    if limit and poolsize:
        limit_per_pool = (limit // poolsize)+1
    else:
        limit_per_pool = None

    queries = ['{} since:{} until:{}'.format(query, since, until)
               for since, until in zip(dateranges[:-1], dateranges[1:])]

    all_tweets = []
    try:
        pool = Pool(poolsize)
        logger.info('queries: {}'.format(queries))
        try:
            for new_tweets in pool.imap_unordered(partial(query_tweets_once, limit=limit_per_pool, lang=lang), queries):
                all_tweets.extend(new_tweets)
                logger.info('Got {} tweets ({} new).'.format(
                    len(all_tweets), len(new_tweets)))
        except KeyboardInterrupt:
            logger.info('Program interrupted by user. Returning all tweets '
                        'gathered so far.')
    finally:
        pool.close()
        pool.join()

    return all_tweets
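The shape of this try/finally is one plausible explanation for a NameError about "pool": if the call that creates the pool raises, the name is never bound, and the finally clause then fails on the unbound name and hides the original exception. The sketch below reproduces only that pattern. make_pool, fragile, robust and describe are hypothetical names, and make_pool merely simulates a pool constructor that fails. In this sketch the exception is UnboundLocalError, which is a subclass of NameError.

```python
import errno

def make_pool(size):
    # Hypothetical stand-in for a pool constructor that fails at creation.
    raise OSError(errno.ENOSYS, "Function not implemented")

def fragile():
    try:
        pool = make_pool(1)   # raises before `pool` is bound
    finally:
        pool.close()          # unbound name: this error hides the OSError

def robust():
    pool = make_pool(1)       # created outside try: the failure surfaces as-is
    try:
        pass                  # work with the pool would go here
    finally:
        pool.close()

def describe(func):
    """Return (exception type name, name of the exception it masked or None)."""
    try:
        func()
    except Exception as exc:
        masked = exc.__context__
        return type(exc).__name__, type(masked).__name__ if masked else None

print(describe(fragile))
print(describe(robust))
```

The original exception is still attached as __context__ of the one that escapes, which is why moving the constructor call out of the try block is enough to make the real error visible.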
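Presumably pool = Pool(poolsize) itself is failing, so "pool" is never assigned and the pool.close() in the finally clause raises the NameError, hiding the real error. To see the real error, I moved pool = Pool(poolsize) out of the try clause and ran it on AWS Lambda again.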
{
  "errorMessage": "[Errno 38] Function not implemented",
  "errorType": "OSError",
  "stackTrace": [
    " File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n tweets = query_tweets(\"Aikatsu\", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang=\"ja\")\n",
    " File \"/opt/python/twitterscraper/query.py\", line 233, in query_tweets\n pool = Pool(poolsize)\n",
    " File \"/opt/python/billiard/pool.py\", line 995, in __init__\n self._setup_queues()\n",
    " File \"/opt/python/billiard/pool.py\", line 1364, in _setup_queues\n self._inqueue = self._ctx.SimpleQueue()\n",
    " File \"/opt/python/billiard/context.py\", line 150, in SimpleQueue\n return SimpleQueue(ctx=self.get_context())\n",
    " File \"/opt/python/billiard/queues.py\", line 390, in __init__\n self._rlock = ctx.Lock()\n",
    " File \"/opt/python/billiard/context.py\", line 105, in Lock\n return Lock(ctx=self.get_context())\n",
    " File \"/opt/python/billiard/synchronize.py\", line 182, in __init__\n SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)\n",
    " File \"/opt/python/billiard/synchronize.py\", line 71, in __init__\n sl = self._semlock = _billiard.SemLock(\n"
  ]
}
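The error is "Function not implemented", and the stack trace ends where billiard, the multiprocessing library that twitterscraper uses, tries to create a semaphore lock (SemLock) for the pool's queue. So the real cause is the process pool: it cannot be created on AWS Lambda. According to the AWS article below, Pool and Queue from multiprocessing are not available in the Lambda execution environment because it lacks shared memory (/dev/shm).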
https://aws.amazon.com/es/blogs/compute/parallel-processing-in-python-with-aws-lambda/
The same problem is reported in the issues of twitterscraper.
A pull request on GitHub contains a workaround, so I was able to avoid the error by replacing query.py with the version from that pull request. https://github.com/taspinar/twitterscraper/pull/280/commits/685c5b4f601de58c2b2591219a805839011c5faf
The number of worker processes is set through the "poolsize" argument of the function "query_tweets". The workaround is implemented so that no multiprocessing is used when poolsize is explicitly set to 0, so the Lambda handler should pass poolsize=0.
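The idea of treating a pool size of 0 as "run sequentially" can be sketched as follows. This is not the code from the pull request, only an illustration of the pattern under my own assumptions: run_queries and fake_fetch are hypothetical names, and fake_fetch stands in for a function that fetches the tweets for one query string.

```python
from multiprocessing import Pool

def run_queries(fetch, queries, poolsize):
    """Call fetch(query) for every query and flatten the results.

    poolsize == 0 means: do not create a process pool at all.
    """
    results = []
    if poolsize == 0:
        # Sequential path: no Pool, so no semaphores are needed.
        for query in queries:
            results.extend(fetch(query))
        return results

    # Parallel path: create the pool outside try so that a failure
    # here is reported as-is instead of being hidden by the cleanup.
    pool = Pool(poolsize)
    try:
        for batch in pool.imap_unordered(fetch, queries):
            results.extend(batch)
    finally:
        pool.close()
        pool.join()
    return results

def fake_fetch(query):
    # Hypothetical stand-in for fetching the tweets of one query.
    return [query + "#1", query + "#2"]

print(run_queries(fake_fetch, ["a", "b"], poolsize=0))
```

With poolsize=0 the queries run one after another in the handler's own process, which is slower for long date ranges but is usually enough for a one-day range.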