Hello. When I woke up this morning, I was surprised to find "#Protest against the Public Prosecutor's Office Law Amendment" trending on Twitter, so I decided to analyze it. A Twitter filled with political tweets is not the Twitter I know. The Twitter I know is a world filled with "Kintama Glitter Friday". There's no way there are 2 million political tweets on Twitter. Shinzo Abe... it must be a lie...
First of all, as a premise: my great-grandfather said during the war that you shouldn't take numbers like these at face value. When I saw this trend, my first suspicion was intentional trend manipulation by bots and spam. There is also the possibility of repeated posts by the same user. Fortunately, I applied for Twitter API access a few months ago, so I can now query Twitter search freely from a program. Let's write the code right away.
```python
from requests_oauthlib import OAuth1Session
import json
from datetime import datetime
import csv
import time
from bs4 import BeautifulSoup

# Credentials for the Twitter API
consumer_key = '*****'
consumer_key_secret = '*****'
access_token = '*****'
access_token_secret = '*****'

# Create an authenticated session for the Twitter API
twitter = OAuth1Session(consumer_key, consumer_key_secret, access_token, access_token_secret)

# Search Twitter. que is the search word; max_id is the maximum Tweet ID to retrieve.
def get(que, max_id):
    params = {'q': que, 'count': 100, 'max_id': max_id, 'modules': 'status', 'lang': 'ja'}
    # Access Twitter.
    req = twitter.get("https://api.twitter.com/1.1/search/tweets.json", params=params)
    # If the request succeeds, keep the Tweet information.
    if req.status_code == 200:
        search_timeline = json.loads(req.text)
        limit = req.headers['x-rate-limit-remaining']
        reset = int(req.headers['x-rate-limit-reset'])
        print("API remain: " + limit)
        # Sleep until the rate limit window resets when quota is about to run out.
        if int(limit) == 1:
            print('sleep')
            time.sleep((datetime.fromtimestamp(reset) - datetime.now()).seconds)
    # On 503 (service unavailable), wait and retry once.
    elif req.status_code == 503:
        time.sleep(30)
        req = twitter.get("https://api.twitter.com/1.1/search/tweets.json", params=params)
        if req.status_code == 200:
            search_timeline = json.loads(req.text)
            # Remaining API quota
            limit = req.headers['x-rate-limit-remaining']
            reset = int(req.headers['x-rate-limit-reset'])
            print("API remain: " + limit)
            if int(limit) == 0:
                print('sleep')
                time.sleep((datetime.fromtimestamp(reset) - datetime.now()).seconds)
        else:
            print(req.status_code)
            return [], 0
    # On any other status code, terminate the process.
    else:
        print(req.status_code)
        return [], 0
    if not search_timeline['statuses']:
        return [], 0
    # Strip the HTML anchor tag from 'source', leaving only the client name.
    for status in search_timeline['statuses']:
        status['source'] = BeautifulSoup(status['source'], 'html.parser').text
    # Return the list of Tweets and the ID of the oldest one retrieved.
    return search_timeline['statuses'], search_timeline['statuses'][-1]['id']

def TweetSearch(que, bot, rep):
    max_id = 1259158522710730000 - 1
    tweetList = []
    # Optionally exclude Tweets by bots; always exclude Retweets.
    if bot:
        que = que + ' -bot -rt'
    else:
        que = que + ' -rt'
    for i in range(rep):
        time.sleep(1)
        result, max_id = get(que, max_id)
        if max_id == 0:
            break
        # Continue the next query just below the oldest ID already retrieved.
        max_id -= 1
        tweetList.extend(result)
    return tweetList

word = "#Protest against the Public Prosecutor's Office Law Amendment"
tweetList = TweetSearch(word, False, 200)
head = [key for key in tweetList[0]]

# Output to a CSV file
with open('tweetanalysis_02.csv', 'w', newline='', encoding='utf_8') as f:
    writer = csv.writer(f)
    writer.writerow(head)
    for tweet in tweetList:
        writer.writerow([tweet[key] for key in head])
```
The code is somewhat sloppy in places because I wrote it in a hurry. One thing worth adding: each Tweet is assigned an ID, which you can see in the trailing digits of the Tweet's URL, and the ID grows larger the later the Tweet was posted. By using this property to bound the search with max_id, you can prevent duplicate Tweets from being extracted across multiple queries. (A single query returns at most 100 Tweets.)
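The "later Tweet, larger ID" property comes from Twitter's Snowflake ID scheme, which embeds a millisecond timestamp in the upper bits of every Tweet ID. As a sketch (the epoch constant and bit layout below are Snowflake's published values, not something taken from this article), the creation time can be recovered like this:

```python
from datetime import datetime, timezone

# Twitter's Snowflake epoch in milliseconds (2010-11-04T01:42:54.657Z)
TWITTER_EPOCH_MS = 1288834974657

def snowflake_to_datetime(tweet_id):
    """Recover the creation time encoded in a Tweet ID: the bits above
    the low 22 (worker/sequence bits) are milliseconds since the epoch."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# The max_id base used above decodes to a time in May 2020 (UTC)
print(snowflake_to_datetime(1259158522710730000))
```

This is why subtracting 1 from an observed ID and passing it as max_id cleanly resumes the search just before that Tweet.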
Running this program, I saved 37,935 Tweets posted between 5/9 23:46 and 5/10 2:58 to CSV.
At this point you can see it falls overwhelmingly short of one million. (In fact, by this time the trend was supposed to have exceeded one million Tweets.) Incidentally, the acquired data includes a field called 'retweet_count', which tells you how many times each Tweet was Retweeted. Summing these up gives 391,675, so it seems better to assume that Twitter's trend count includes Retweets. (Tweets posted before 5/9 23:46 presumably also contributed to the trend.) Now let's briefly verify the questions raised at the start: repeated posts by the same user, and bot or spam posts. I'm not someone who can handle CSV data with statistical software like R, so this time I'll take the classic route and do it simply in Excel. (The data set isn't that large, and I figured my PC could handle it.)
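The retweet total mentioned above can be reproduced from the saved CSV with a short script. This is just a sketch; it assumes the 'retweet_count' column written by the code above is intact:

```python
import csv

def total_retweets(path):
    """Sum the 'retweet_count' column of a CSV written by the script above."""
    with open(path, newline='', encoding='utf_8') as f:
        return sum(int(row['retweet_count']) for row in csv.DictReader(f))

# e.g. total_retweets('tweetanalysis_02.csv')
```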
On the [Data] tab, select the 'user' column under [Remove Duplicates] ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/596394/04981035-8aa0-125f-ceee-1b3044ed2dc4.png) and delete! Oh. About a quarter of the Tweets were removed as duplicate users. Of course, this alone doesn't tell us whether one user tweeted an enormous number of times or many users each tweeted a few times, but at least it shows that the number of users tweeting is considerably smaller than it appears.
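The same deduplication could be done in Python on the Tweet dicts returned by TweetSearch, without going through Excel. A minimal sketch, assuming each Tweet carries the API's nested 'user' object with an 'id' field:

```python
def unique_by_user(tweets):
    """Keep the first Tweet seen for each user, mirroring Excel's
    [Remove Duplicates] on the 'user' column."""
    seen = set()
    unique = []
    for tweet in tweets:
        uid = tweet['user']['id']
        if uid not in seen:
            seen.add(uid)
            unique.append(tweet)
    return unique
```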
Next, let's look at the posting source, using the data with duplicate users removed. 'Twitter for iPhone' accounted for more than half. As for the rest, most are official clients such as 'Twitter for iPad' and 'Twitter Web Client' or well-known unofficial clients, so the number of Tweets from automated posting and spam appears negligible. (Incidentally, since you cannot register an app whose source name contains the string 'Twitter', you can conclude with 100% certainty that any Tweet whose source contains 'Twitter' is not a spam post.)
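This client breakdown can also be tallied directly from the Tweet dicts. A rough sketch (the official/unofficial split relies on the 'Twitter'-in-source-name rule noted above):

```python
from collections import Counter

def source_breakdown(tweets):
    """Tally posting clients and the share of sources containing
    'Twitter' (i.e. clients that cannot be spam apps)."""
    counts = Counter(tweet['source'] for tweet in tweets)
    official = sum(n for src, n in counts.items() if 'Twitter' in src)
    return counts, official / max(sum(counts.values()), 1)
```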
It was a simple analysis, but I've looked into a Twitter trend. I uploaded the Python code and the obtained CSV file to GitHub: https://github.com/ogadra/twitter_analysis Next time, I'd like to gather more data and analyze it with R or similar tools.
I'm not certain, but some Tweets seem to be missing from the acquisition. I still don't know whether this is due to the Twitter API or my own mistake, so I'll try fetching the data again.