[Python] Building a data infrastructure on GCP that acquires tweets from the Twitter API and publishes them to BigQuery (with negative/positive scores)

Introduction

This is my first post. We built a data infrastructure on GCP that acquires tweets from the Twitter API for internal use and publishes them to BigQuery, so I am writing it up as a memorandum and a list of future tasks.

This is my first time building a data infrastructure and using GCP, so if you notice anything, I would appreciate your comments.

(Image: datapotal.JPG)

Development environment

・Google Cloud Platform
・Python 3 series

Overview

Based on keywords listed in a spreadsheet, the pipeline fetches tweet content from the Twitter API, calculates a negative/positive score, saves it in BigQuery, and makes it viewable and downloadable with Data Portal.

The logic that stores data in GCS and BigQuery is built with Python on Cloud Functions.

Since Cloud Scheduler is set to run it once a day, tweets are acquired automatically every day after the initial setup.

(Image: gcp-twitter.JPG)

Commentary

① Get tweets with Twitter API

To use the Standard Search API, you need to apply in English. For the application process I referred to: Summary of steps from Twitter API registration (account application method) to approval.

In addition, acquisition via the Standard Search API has the following restrictions:
・Up to 100 tweets per request
・180 requests per 15 minutes
・Only tweets from the last 7 days can be retrieved

In other words, you can get at most 18,000 tweets per 15 minutes; after 15 minutes the quota resets to 180 requests. These limits are manageable if you build your logic around them, but there is no way to get tweets more than 7 days old. If you want to collect tweets in earnest, the paid API lets you fetch almost without limit (it costs money, forever).
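As a minimal sketch of what fetching under these limits can look like (the bearer token, helper name, and paging strategy here are illustrative assumptions, not the actual code):

```python
import requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # hypothetical placeholder


def fetch_tweets(keyword, max_requests=180):
    """Page through search results: 100 tweets per request,
    at most 180 requests per 15-minute window."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {"q": keyword, "count": 100, "result_type": "recent"}
    tweets = []
    for _ in range(max_requests):
        resp = requests.get(SEARCH_URL, headers=headers, params=params)
        resp.raise_for_status()
        statuses = resp.json().get("statuses", [])
        if not statuses:
            break
        tweets.extend(statuses)
        # Page backwards: ask only for tweets older than the oldest one so far.
        params["max_id"] = statuses[-1]["id"] - 1
    return tweets
```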

Regarding the Twitter API, I have read this page more than 30 times: How to get and analyze over 5 million Twitter retweet data.

You may think you should just scrape if there are restrictions, but scraping Twitter is prohibited by its rules, so let's refrain.

② Store tweets in Cloud Storage with Cloud Functions

Cloud Functions is a service that runs JavaScript and Python in the cloud. Since 2019, the Python 3 series is also supported.

The function gets the pre-filled keywords from a spreadsheet on Google Drive and sends them to the Twitter API, then converts the returned JSON to CSV and stores one day's worth of tweets in Cloud Storage.

(Image: gcs.JPG)
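As a rough illustration, the function could look like the sketch below. gspread, the sheet name, the bucket name, and the fetch_tweets helper (sketched in ① above) are all assumptions, not the actual code:

```python
import csv
import io

import gspread                      # assumption: gspread for spreadsheet access
from google.cloud import storage


def store_tweets_as_csv(row):
    gc = gspread.service_account()              # assumes a service-account key file
    sheet = gc.open("keywords").sheet1          # hypothetical spreadsheet name
    keyword = sheet.cell(row, 1).value          # the row number comes from the payload (see ③)
    tweets = fetch_tweets(keyword)              # the helper sketched in ①

    # Flatten the returned JSON into CSV rows.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "created_at", "user", "text", "source"])
    for t in tweets:
        writer.writerow([t["id"], t["created_at"],
                         t["user"]["screen_name"], t["text"], t["source"]])

    # Store the day's file in the data-lake bucket (hypothetical bucket name).
    bucket = storage.Client().bucket("tweet-datalake")
    bucket.blob(f"{keyword}/tweets.csv").upload_from_string(
        buf.getvalue(), content_type="text/csv")
```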

I could have stored the acquired tweets directly in BigQuery, but while studying data infrastructure I learned the concepts of "data lake", "data warehouse", and "data mart", so I store the raw data in Cloud Storage first. If you store only a trimmed-down subset, you may later find yourself saying, "Oh, I wanted that field too!"

For the concept of data infrastructure I referred to: Three classifications of data infrastructure and evolutionary data modeling.

Since a data lake is positioned as a "copy of the original data", I originally wanted to store the tweets in JSON format rather than CSV, but I kept getting errors I couldn't figure out, so I gave up.

③ Periodically execute tweet acquisition with Cloud Scheduler

Cloud Scheduler is an inexpensive and easy-to-use cron service that lets you send messages to Google Pub/Sub topics. In a nutshell, it lets you run Cloud Functions at a specified date and time.

By setting a value in the payload, you can use that value inside Cloud Functions. Currently the payload specifies the row number of the spreadsheet where the keyword is entered, so we create as many Cloud Scheduler jobs as there are keywords we want to fetch.
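For illustration, a Pub/Sub-triggered function can read the Scheduler payload like this; the {"row": ...} payload shape is an assumption for this sketch:

```python
import base64
import json


def main(event, context):
    # Cloud Scheduler's payload arrives base64-encoded in the Pub/Sub message.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    row = payload["row"]  # which spreadsheet row holds the keyword
    print(f"Fetching tweets for the keyword in row {row}")
```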

Also, because of the Twitter API restrictions described in ①, the jobs run at 15-minute intervals, one keyword at a time. If we sign up for the paid Twitter API, this mechanism will change.

④ Add a negative/positive score and store the data from Cloud Storage in BigQuery

Here again we use Cloud Functions. Sentiment analysis is performed in Python and each tweet is scored. Since the data is processed as a DataFrame, to_gbq is used to store it in BigQuery: pandas.DataFrame.to_gbq.

Be careful: if you do not specify the BigQuery schema via the table_schema argument of to_gbq, saving columns that are almost entirely null (geo, etc.) may raise errors.
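A minimal sketch of such a save, using hypothetical project, table, and column names (requires the pandas-gbq package):

```python
import pandas as pd

# A dummy scored row just for illustration.
df = pd.DataFrame([{
    "created_at": pd.Timestamp("2020-01-01 00:00:00"),
    "text": "sample tweet",
    "geo": None,          # almost always null
    "score": 0.8,         # negative/positive score
}])

# Pinning the schema prevents mostly-null columns from being mis-inferred.
schema = [
    {"name": "created_at", "type": "TIMESTAMP"},
    {"name": "text", "type": "STRING"},
    {"name": "geo", "type": "STRING"},
    {"name": "score", "type": "FLOAT"},
]

df.to_gbq("twitter.keyword_table",   # hypothetical dataset.table
          project_id="my-project",   # hypothetical project
          if_exists="append",
          table_schema=schema)
```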

For sentiment analysis I referred to the site below, but the accuracy is still low, so improving it is a future task. I have a dataset of about 500,000 tweets at hand, and I would like to use it to build a sentiment analysis tool specialized for Twitter. Reference: I made a negative / positive analysis app with deep learning (python).

Since sentiment analysis is the main focus this time, BOTs and retweets are excluded. For BOTs, we keep only tweets whose source includes "twitter.com"; for retweets, we exclude tweets whose body begins with "RT @".
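In pandas, that filtering could look like the following sketch, assuming the tweets are already in a DataFrame with source and text columns:

```python
# Keep only tweets posted from official clients (source contains "twitter.com"),
# which drops most BOTs.
df = df[df["source"].str.contains("twitter.com", regex=False)]

# Drop retweets, whose body starts with "RT @".
df = df[~df["text"].str.startswith("RT @")]
```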

If you want to count hashtags for campaigns and the like, you will need to change this logic.

⑤ Execute ④ with Cloud Scheduler

This is the same as what we do in ③: Cloud Scheduler triggers the function after the tweets have been stored in Cloud Storage.

⑥ Store in BigQuery

There is one table per keyword. Since Twitter timestamps are in Coordinated Universal Time (UTC), we store the tweets from 9:00 a.m. the previous day to 9:00 a.m. on the current day. In addition to the tweet text, the tweet's date and time, account information, the negative/positive score, and so on are stored.
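As a rough sketch of that daily window (assuming created_at has already been parsed into UTC datetimes; the column name is a placeholder):

```python
from datetime import datetime, time, timedelta

# Today's 09:00 boundary in UTC, and the same time on the previous day.
today_9am = datetime.combine(datetime.utcnow().date(), time(9, 0))
window_start = today_9am - timedelta(days=1)

# Keep one day's worth of tweets: [previous day 09:00, today 09:00).
df = df[(df["created_at"] >= window_start) & (df["created_at"] < today_9am)]
```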

⑦ Publish on Data Portal

Since the company has a G Suite contract and all employees have Google accounts, we make the tweet content viewable and downloadable with Data Portal. You can download it as CSV, and of course you can also sort it by negative/positive score directly on Data Portal. I have barely touched this part yet; currently you can only specify the period, but I would like to add filters and more fields in the future.

I was also interested in Tableau, but when I applied for the trial I never received the verification email, so I gave up. I will try again some other time.

Future tasks and things I want to do

・When the number of keywords increases, the GCP billing plan and the logic will have to change.
・Improve the accuracy of the sentiment analysis, or use an NLP API.
・Use morphological analysis (MeCab + NEologd) to display words that frequently appear together in tweets.
・Calculate an influence score for influencers from tweet and account information, and classify accounts (private graph, interest graph, heavy "Twitter addict" users).

Finally

Python is the strongest. Searching solves 95% of problems, and 99% if you can read English. Every day I am saved by searching what others have shared, and I am deeply grateful to all humankind.

Since you can work with GCP from a Jupyter Notebook, it's better to test your code in Jupyter first (deploying Cloud Functions is very slow).

There were other parts where I got stuck, so I would like to publish the code in the future.

If there is anything that sounds interesting or doable, please let me know.
