I tried to use Twitter Scraper on AWS Lambda and it didn't work.

Aikatsu! I want to collect tweets about it every day, and it is troublesome to do it manually, so I want to automate it so that it will be output every day on AWS Lambda.

First, on AWS Lambda, I registered the Twitter Scraper library (1.4.0) in Lambda Layers, implemented the following code roughly at the operation verification level, and executed the test.

from twitterscraper import query_tweets
import datetime as dt

def lambda_handler(event, context):
    
    begin_date = dt.date(2020,6,5)
    end_date = dt.date(2020,6,6)
    pool_size = (end_date - begin_date).days
    
    tweets = query_tweets("Aikatsu", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang="ja")
    tuple_tweet=[(tweet.user_id, tweet.tweet_id, tweet.text.replace("\n","\t"), tweet.timestamp) for tweet in tweets]
      
    return True

Then, the following "pool" is missing error is output on AWS Lambda.

{
  "errorMessage": "name 'pool' is not defined",
  "errorType": "NameError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n    tweets = query_tweets(\"Aikatsu\", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang=\"ja\")\n",
    "  File \"/opt/python/twitterscraper/query.py\", line 246, in query_tweets\n    pool.close()\n"
  ]
}

It works normally on Jupyter Notebook, so I wondered if something in Lambda Layers was wrong, so what is the variable "pool" in the first place? Let's find out that.

Apparently it's a variable in query.py of TwitterScraper.

`query.py`


def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang=''):
    no_days = (enddate - begindate).days
    
    if(no_days < 0):
        sys.exit('Begin date must occur before end date.')
    
    if poolsize > no_days:
        # Since we are assigning each pool a range of dates to query,
		# the number of pools should not exceed the number of dates.
        poolsize = no_days
    dateranges = [begindate + dt.timedelta(days=elem) for elem in linspace(0, no_days, poolsize+1)]

    if limit and poolsize:
        limit_per_pool = (limit // poolsize)+1
    else:
        limit_per_pool = None

    queries = ['{} since:{} until:{}'.format(query, since, until)
               for since, until in zip(dateranges[:-1], dateranges[1:])]

    all_tweets = []
    try:
        pool = Pool(poolsize)
        logger.info('queries: {}'.format(queries))
        try:
            for new_tweets in pool.imap_unordered(partial(query_tweets_once, limit=limit_per_pool, lang=lang), queries):
                all_tweets.extend(new_tweets)
                logger.info('Got {} tweets ({} new).'.format(
                    len(all_tweets), len(new_tweets)))
        except KeyboardInterrupt:
            logger.info('Program interrupted by user. Returning all tweets '
                         'gathered so far.')
    finally:
        pool.close()
        pool.join()

    return all_tweets

Probably pool = Pool (poolsize), remove this variable from the try clause and run AWS Lambda.

{
  "errorMessage": "[Errno 38] Function not implemented",
  "errorType": "OSError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n    tweets = query_tweets(\"Aikatsu\", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang=\"ja\")\n",
    "  File \"/opt/python/twitterscraper/query.py\", line 233, in query_tweets\n    pool = Pool(poolsize)\n",
    "  File \"/opt/python/billiard/pool.py\", line 995, in __init__\n    self._setup_queues()\n",
    "  File \"/opt/python/billiard/pool.py\", line 1364, in _setup_queues\n    self._inqueue = self._ctx.SimpleQueue()\n",
    "  File \"/opt/python/billiard/context.py\", line 150, in SimpleQueue\n    return SimpleQueue(ctx=self.get_context())\n",
    "  File \"/opt/python/billiard/queues.py\", line 390, in __init__\n    self._rlock = ctx.Lock()\n",
    "  File \"/opt/python/billiard/context.py\", line 105, in Lock\n    return Lock(ctx=self.get_context())\n",
    "  File \"/opt/python/billiard/synchronize.py\", line 182, in __init__\n    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)\n",
    "  File \"/opt/python/billiard/synchronize.py\", line 71, in __init__\n    sl = self._semlock = _billiard.SemLock(\n"
  ]
}

Since the error content is "Function not implemented", it is apparently caused by the multi-process billiard library. It seems that multi-process is not available in AWS Lambda.

https://aws.amazon.com/es/blogs/compute/parallel-processing-in-python-with-aws-lambda/

Correspondence

There is a description about the same phenomenon in the issues of twitterscraper.

Since there is a workaround implementation in pullrequest on github, I could avoid it by replacing quert.py with the contents of pullrequest. https://github.com/taspinar/twitterscraper/pull/280/commits/685c5b4f601de58c2b2591219a805839011c5faf

Since the number of multi-processes is set using the variable "poolsize" when passing it to the function "query_tweets", it is implemented so that it will not be multi-processed if it is explicitly set to 0.