Hello, this is sunfish. We continue our **"Twitter x Corona"** analysis. In Part 1, we morphologically analyzed the tweet text and computed the daily occurrence counts of frequent words.
↓ The 27 selected frequent words
More than half a year has passed since the coronavirus became a social issue. Let's use the tweets to trace what is gaining attention and what is being forgotten. In this second part, we use regression analysis to find words with upward or downward trends.
We use the per-day, per-word occurrence counts created in Part 1. ↓ Data ↓ Visualized
As the days go by, we want to find the words that appear more or less often. In other words, we fit a regression of the form

y (number of tweets for a specific word) = a × x (number of days elapsed) + b

and observe the slope `a` and the correlation coefficient. As a data-preparation step, we need to compute the "days elapsed" from the date column. The approach: within each word, assign serial numbers to the dates in ascending order.
```python
from scipy.spatial.distance import cdist
import numpy as np
import pandas as pd
import statsmodels.api as sm

port_23 = port_22.copy()
model_params = {'method': 'first', 'ascending': True}
# Parse the date column, then rank dates in ascending order within each
# word to get a per-word serial number ("days elapsed").
port_23['Created_At'] = pd.to_datetime(port_23['Created_At'])
port_23['index'] = port_23.groupby(['word'])['Created_At'].rank(**model_params)
port_23['Created_At'] = port_23['Created_At'].dt.date
```
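To see what this rank-based numbering produces, here is a minimal sketch on hypothetical toy data (two words, three dates each; the column names mirror the snippet above):

```python
import pandas as pd

# Hypothetical toy data: two words observed on three consecutive dates.
df = pd.DataFrame({
    'word': ['corona', 'corona', 'corona', 'live', 'live', 'live'],
    'Created_At': pd.to_datetime(
        ['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'count': [10, 12, 9, 3, 4, 6],
})

# Rank dates in ascending order within each word, giving each word its
# own 1-based serial number that can serve as "days elapsed".
df['index'] = df.groupby('word')['Created_At'].rank(method='first',
                                                    ascending=True)
print(df['index'].tolist())  # → [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
```

Each word's dates are numbered independently, so every word starts again from 1.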
↓ Pay attention to the x-axis. Now that we have the number of days elapsed, we are ready for regression analysis.
From the regression results, we will observe how each of the selected words changes with the number of days elapsed. Written in plain Python, this requires a somewhat tedious loop over every word.
```python
group_keys = ['word']
X_columns = ['index']
Y_column = 'count'
groups = port_23.groupby(group_keys)
models = {}
summaries = {}

def corr_xy(X, y):
    """Correlation coefficient between the objective variable and each explanatory variable."""
    X_label = X.columns.tolist()
    X = X.T.values
    y = y.values.reshape(1, -1)
    # cdist with metric='correlation' returns 1 - corr, so invert it.
    corr = 1 - cdist(X, y, metric='correlation')
    corr = pd.DataFrame(
        corr, columns=['Correlation coefficient with the objective variable'])
    corr['Explanatory variable'] = X_label
    return corr

for i, g in groups:
    X = g[X_columns]
    Y = g[Y_column].squeeze()
    corr = corr_xy(X, Y)
    try:
        model = sm.OLS(Y, sm.add_constant(X, has_constant='add')).fit()
        model.X = X.columns
        models[i] = model
        summary = pd.DataFrame(
            {
                # Use the fitted parameter index so the constant term
                # (intercept) gets its own row alongside 'index'.
                'Explanatory variable': model.params.index,
                'coefficient': np.round(model.params, 5),
                'standard deviation': np.round(model.bse, 5),
                't value': np.round(model.tvalues, 5),
                'Pr(>|t|)': np.round(model.pvalues, 5)
            },
            columns=['Explanatory variable', 'coefficient',
                     'standard deviation', 't value', 'Pr(>|t|)'])
        summary = summary.merge(corr, on='Explanatory variable', how='left')
        summaries[i] = summary
    except Exception:
        continue

res = []
for key, value in summaries.items():
    value[group_keys] = key
    res.append(value)
concat_summary = pd.concat(res, ignore_index=True)
port_24 = models
port_25 = concat_summary
```
↓ With nehan, you can avoid this tedious loop processing by using the "Create model for each group" option.
And we have the regression results for each word. Rows with Explanatory variable = const hold the intercept information.
Various interpretations are possible, but here we extract the words whose correlation coefficient is at least 0.4 in absolute value and treat them as correlated.
```python
port_27 = port_25[
    (port_25['Correlation coefficient with the objective variable'] <= -0.4) |
    (port_25['Correlation coefficient with the objective variable'] >= 0.4)]
```
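The two-sided condition can also be written with `abs()`, which reads a little more directly as "|correlation| ≥ 0.4". A minimal sketch on hypothetical summary rows (the words and values are made up for illustration):

```python
import pandas as pd

# Hypothetical regression summary: one positive, one weak, one negative correlation.
summary = pd.DataFrame({
    'word': ['live', 'mask', 'government'],
    'Correlation coefficient with the objective variable': [0.55, 0.10, -0.62],
})

# Keep rows whose correlation magnitude is at least 0.4.
strong = summary[
    summary['Correlation coefficient with the objective variable'].abs() >= 0.4]
print(strong['word'].tolist())  # → ['live', 'government']
```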
Let's take a closer look at these words.
Uptrend words
Downtrend words
↓ **Live**, counts by number of days elapsed ↓ **Government**, counts by number of days elapsed
The threat of corona has not gone away, yet the data show that words we often saw in the news at the height of the crisis appear less frequently, while words tied to activities curtailed by self-restraint, such as events and live performances, appear more often. Of course, this alone does not prove that **"everyone wants to go to live shows!"**, but we will close this theme here as an observation grounded in the data so far.
We hope this has conveyed the appeal of nehan, a programming-free analysis tool that links preprocessed data to a variety of analyses and visualizations.