Experiment to collect tweets for a long time (program preparation (2))

Until last time

- [x] I'll make a Twitter tweet collection program that will work for 3 months!

Review the source for the time being

The previous program worked surprisingly well considering how casually I threw it together, but since I wrote it by imitating examples I had seen, I can't shake the question of whether it is really right. The official explanation is about one page long, and there is no API reference. Is the Stream API actually ~~neglected for lack of demand~~ not very popular?

Perhaps there is a page somewhere that explains it in detail, but I can hardly read English, so I wouldn't understand it well. As for the people who use Tweepy, do they only look at the main features, or do they consider everything beyond the parts shown in introductions to be self-evident?

If you don't know, you have to read the source.

Anyway, there is far less information about the Stream API than I imagined. Above all, leaving the official documentation neglected is really not great. It scares me that a feature I want might turn out to be "actually already implemented". When it comes to this, you have to check the source directly. Fortunately, Tweepy is open source software released under the MIT License and published on GitHub. In other words, reading it is not impossible, provided you have the time, motivation, and ability ... or so it should be.

Let's look at the source specifically.

There was a file named streaming.py, which is about as direct a name as it gets, so I opened it. It contains three classes:

  • StreamListener
  • ReadBuffer
  • Stream

The two other than ReadBuffer are the ones used in the first program I made. If the ReadBuffer class does exactly what its name suggests, it should be fine to leave it alone for now. Let's take a look at "StreamListener" and "Stream".

StreamListener class

I think it is fair to understand this as the class that receives the information passed on by the Stream class and actually processes it. In practice you inherit from this class and override the parent's methods to implement your own behavior; for anything you don't override, the parent's predefined behavior runs ... I think?

There were quite a few methods that can be overridden. Below is a list (descriptions checked against the reference on dev.twitter.com).

  • on_connect
    Called when connected.
    The default implementation is just "pass" ...... so it does nothing? Why isn't this a return like the others?
  • on_data
    Called when data to be processed arrives.
    It inspects the received data and calls the matching method among those marked with * below.
  • keep_alive
    Called when a keep-alive newline is sent because no data is flowing.
    • on_status *
      Called when you receive a so-called general tweet.
  • on_exception
    Called when an exception occurs while processing on a Stream object (if it is a callable error).
    • on_delete *
      Called when a message has been deleted.
    • on_event * (UserStream only)
      Called for notifications such as RT, likes, and blocks.
    • on_direct_message * (UserStream only)
      Called when DM arrives.
    • on_friends * (Only for User Stream)
      Called when User Stream starts. You will receive a list of your friends' IDs.
    • on_limit *
      Called when tweets that matched the filter could not be delivered because the rate limit was exceeded. The number of undelivered tweets is also provided.
  • on_error
    Called when a status code other than 200 is returned during processing on the Stream object.
  • on_timeout
    Called when the Stream connection times out.
    • on_disconnect *
      Called when disconnected from Twitter. With message.
    • on_warning *
      Called when a processing warning comes from the Twitter side. With message.

Well, look at that: quite a few useful messages are being sent. There is even a proper "connected" notification, which is a bit of a relief. For the methods called from on_data, the connection is apparently closed when the return value is False. on_error is implemented to return False unless overridden. You can't tell any of this without looking at the Stream class and checking the callers.

By the way, the messages sent by the Stream API are published on dev.twitter.com. However, the messages "status_withheld", "user_withheld", "scrub_geo", "for_user", and "control" do not appear to be checked by on_data. Is it because they are not very meaningful? If you want to see these, you apparently have to override on_data, and with it the dispatch to most of the methods above. I'm not sure.
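One possible shape for such an override, as a rough sketch: check for the extra message keys first and hand everything else back to the parent's on_data, so the existing dispatch does not need to be rewritten. `BaseListener` below is a simplified stand-in written for this example so that it runs without Tweepy or a network connection; it is an assumption that mimics the dispatch described above, not Tweepy's actual code, and `on_extra` is a hypothetical hook name.

```python
import json

# Message types that, per the text above, on_data does not dispatch.
EXTRA_KEYS = ("status_withheld", "user_withheld", "scrub_geo", "for_user", "control")


class BaseListener:
    """Simplified stand-in for StreamListener (an assumption, not Tweepy's code)."""

    def on_data(self, raw_data):
        data = json.loads(raw_data)
        if "in_reply_to_status_id" in data:
            if self.on_status(data) is False:
                return False
        elif "delete" in data:
            if self.on_delete(data["delete"]) is False:
                return False
        # Anything else is silently ignored in this stand-in.

    def on_status(self, status):
        return

    def on_delete(self, delete):
        return


class ExtraListener(BaseListener):
    """Handles the extra message types, then falls back to the parent."""

    def __init__(self):
        self.seen = []  # (kind, payload) pairs, kept only for illustration

    def on_data(self, raw_data):
        data = json.loads(raw_data)
        for key in EXTRA_KEYS:
            if key in data:
                return self.on_extra(key, data[key])
        # Not one of the extra types: let the parent do its usual dispatch.
        return super().on_data(raw_data)

    def on_extra(self, kind, payload):
        # Hypothetical hook: record the message and keep the stream alive.
        self.seen.append((kind, payload))
        return True
```

The cost of this approach is that each message is parsed twice (once in the subclass, once in the parent), which seems acceptable for a sketch but may matter at high volume.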

Stream class

It looks like a class that implements the connect, disconnect, and receive loops. The connection is configured in "__init__" (the constructor); calling one of the methods that correspond to the Stream API endpoints ("userstream", "firehose", "retweet", "sample", "filter", "sitestream") goes through "_start" and ends up in "_run".

The loop starting at "while self.running:" does the actual connecting, and the processing loop after a successful connection is handled by "_read_loop". If the connection attempt via "self.session.request()" fails, on_error of the StreamListener class is called; if it returns False the while loop is exited, and otherwise the error counter is incremented and the connection is retried after waiting a certain time ...

**So there is a reconnection function after all!**

Of course: there is no reason for a library not to implement something this tedious. It isn't the kind of thing to leave to the user.

For the most important requirement, "**A function to reconnect in the event of an unexpected disconnection**", the hurdle has suddenly dropped. On the other hand, I'd like to grill someone for a good hour about why the details of such an important feature aren't documented in an easy-to-understand way ...

By the way, "_read_loop" calls "on_data" of the StreamListener class, and if that returns False, the read loop is exited. At that point self.running has been set to False, which the connection loop in the caller "_run" also checks, so the connection loop exits as well and the session is closed.
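To make the two nested loops easier to picture, here is a minimal sketch of the structure as I read it. It is not Tweepy's code: `run_stream`, the `connect` callable (assumed to return a status code and an iterable of lines), and the wait times are all made up for illustration.

```python
import time


def run_stream(connect, listener, max_wait=16.0, sleep=time.sleep):
    """Sketch of the connect / read / reconnect structure described above.

    connect() is a hypothetical callable returning (status_code, lines).
    """
    running = True
    wait = 0.25  # illustrative starting delay, not Tweepy's actual value
    while running:  # the "connection loop"
        status_code, lines = connect()
        if status_code != 200:
            # Connection failed: the listener decides whether to give up.
            if listener.on_error(status_code) is False:
                break
            sleep(wait)
            wait = min(wait * 2, max_wait)  # wait longer after each failure
            continue
        wait = 0.25  # connected, so reset the delay
        listener.on_connect()
        for line in lines:  # the "read loop"
            if listener.on_data(line) is False:
                running = False  # also ends the outer connection loop
                break
        # If the lines simply run out (a disconnect), loop around and reconnect.
```

The point of the sketch is the role of the return values: False from on_error or on_data is the only thing that ends the outer loop, while everything else leads to another connection attempt.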

Look at the source and decide on a policy.

Let me pull myself together. From reading the source:

  1. The methods dispatched from on_data of the StreamListener class disconnect and end the stream when they return False.
    → Override anything that returns False in the default implementation so that it returns True.
  2. on_error and on_timeout end the stream without reconnecting when they return False.
    → Override these and return True.
  3. on_exception cannot be handled this way, so some other means is needed.

Looking at the actual source, nothing falls under (1). That is, the default methods return neither True nor False. ...... Is that okay? *The check is done with "is False", so I suppose it is OK*: when no value is returned the result is None, which does not match in that comparison. As for (2), on_timeout just returns without a value, so it is enough to override only on_error and have it return True. What to do about (3)?
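The "is False" behavior is easy to confirm in plain Python. The helper below is only an illustration of that comparison (`keeps_running` and `on_status_without_value` are names invented here), not a function from Tweepy.

```python
def keeps_running(callback_result):
    """Mimic the `is False` check: only the literal False stops the stream."""
    return callback_result is not False


# A callback that ends with a bare `return` (or no return at all) yields None.
def on_status_without_value(status):
    return


print(keeps_running(on_status_without_value("tweet")))  # None  -> True
print(keeps_running(True))                               # True  -> True
print(keeps_running(False))                              # False -> False
print(keeps_running(0))                                  # 0 is falsy but is not False -> True
```

So a method that returns nothing keeps the stream going, and only an explicit `return False` stops it.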

However, for warnings as well as errors, keeping a log seems likely to be useful later, so I will also override "on_connect", "on_disconnect", "on_limit", "on_timeout", "on_warning", and "on_exception".
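If the goal is a log that can be read back months later, the standard `logging` module may be a better fit than print. The following is only a sketch under my own assumptions: the logger name, the default file name `tweetcheck.log`, and the `LoggingCallbacks` class are all hypothetical, and the class is written as a plain mix-in so it runs without Tweepy.

```python
import logging

logger = logging.getLogger("tweetcheck")  # hypothetical logger name


def setup_logging(handler=None, path="tweetcheck.log"):
    """Attach a timestamped handler; defaults to a file (name is an assumption)."""
    if handler is None:
        handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)


class LoggingCallbacks:
    """Mix-in style sketch: the callbacks listed above, writing to the log."""

    def on_connect(self):
        logger.info("Connected")

    def on_disconnect(self, notice):
        logger.warning("Disconnected: %s", notice)

    def on_limit(self, track):
        logger.warning("Receive limit hit, %s tweets undelivered", track)

    def on_timeout(self):
        logger.warning("Timed out")

    def on_warning(self, notice):
        logger.warning("Warning from Twitter: %s", notice)

    def on_exception(self, exception):
        logger.error("Exception: %r", exception)
```

Passing a handler explicitly (for example a `logging.StreamHandler`) makes it possible to try the callbacks without creating a file.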

Dealing with exceptions

Exceptions are exceptional by nature, so they are hard to deal with. As far as I can tell, an exception inside the Stream class's processing loop is re-raised after on_exception of the StreamListener is called, so it should be possible to catch it in the caller. What if one occurs during the individual processing inside StreamListener instead? Following the call chain, StreamListener is called from that same loop, so everything should end up in the same place in the end ...? And if it still dies with an exception ... maybe call the script in a loop from a batch file?

Source so far

With the top priority "**A function to reconnect in the event of an unexpected disconnection**" in mind, the source so far looks like this.

tweetcheck2.py


#!/usr/bin/env python
# -*- coding:utf-8 -*-

import tweepy

# Get your own credentials and fill them in
CK = ''   # Consumer Key
CS = ''   # Consumer Secret
AT = ''   # Access Token
AS = ''   # Access Token Secret

class Listener(tweepy.StreamListener):
    def on_status(self, status):
        print('Tweet')    #I can't read it anyway, so I'll just tell you that there was a message
        return True

    def on_error(self, status_code):
        print('Error occurred: ' + str(status_code))
        return True

    def on_connect(self):
        print('Connected')
        return

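    def on_disconnect(self, notice):
        # notice is the parsed "disconnect" payload; print it whole
        # rather than assuming attribute access such as notice.code
        print('Disconnected:' + str(notice))
        return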

    def on_limit(self, track):
        print('Receive limit has occurred:' + str(track))
        return

    def on_timeout(self):
        print('time out')
        return True

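    def on_warning(self, notice):
        # notice is the parsed "warning" payload; print it whole
        # rather than assuming attribute access such as notice.message
        print('Warning message:' + str(notice))
        return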

    def on_exception(self, exception):
        print('Exception error:' + str(exception))
        return


#Main processing from here
auth = tweepy.OAuthHandler(CK, CS)
auth.set_access_token(AT, AS)

while True:     # infinite loop
    try:
        listener = Listener()
        stream = tweepy.Stream(auth, listener)

        # Select one and uncomment it.
        #stream.filter(track=['#xxxxxx'])
        stream.sample()
        #stream.userstream()
    except:
        pass    # Ignore all exceptions and loop

It prints every message that could threaten execution, such as errors and warnings, and reconnects when an error occurs. In the unlikely event of an exception, the main section catches it, ignores it, and loops to run again.

And to quit? ...... Ctrl+C ......

(When I actually tried it, I couldn't stop it. Silly me. Is there a way to escape the loop only on Ctrl+C?)
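One way this could be done, as far as I understand Python's exception hierarchy: Ctrl+C raises KeyboardInterrupt, which derives from BaseException rather than Exception, so a bare "except:" swallows it while "except Exception:" does not. The sketch below uses a hypothetical `run_forever` helper and a `start_stream` callable standing in for the Tweepy calls, so it runs without a connection; the retry wait is an arbitrary value.

```python
import time


def run_forever(start_stream, retry_wait=5.0, sleep=time.sleep):
    """Restart start_stream() after ordinary errors, but let Ctrl+C end the loop.

    start_stream is a hypothetical callable that blocks while the stream runs.
    Returns the number of restarts, purely for illustration.
    """
    restarts = 0
    while True:
        try:
            start_stream()
        except KeyboardInterrupt:
            # Ctrl+C raises KeyboardInterrupt; a bare `except:` would swallow it.
            print("Stopped by Ctrl+C")
            break
        except Exception as exc:
            # Ordinary errors: report, wait a little, then loop and reconnect.
            print("Restarting after: " + repr(exc))
            restarts += 1
            sleep(retry_wait)
    return restarts
```

In the script above, `start_stream` would correspond to creating the Listener and Stream objects and calling `stream.sample()`. The short wait before retrying is my own addition, intended to avoid hammering the server if the error repeats immediately.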

The basic form of the Twitter-related processing is probably something like this for now. I do wonder whether it is really okay, so **I welcome any corrections**. Next time, on the MongoDB side, I plan to start on the top-priority "**Storing received data in MongoDB**". (To be continued)
