Hello, this is sunfish. We continue our **"Twitter x Corona"** analysis. In Part 1, we morphologically analyzed the tweet text and computed the daily occurrence counts of frequent words.
↓ The 27 selected frequent words
More than half a year has passed since the coronavirus became a social issue. Let's use the tweets to trace what is gaining attention and what is being forgotten. In this second part, we use regression analysis to find words with upward or downward trends.
We use the per-day, per-word occurrence counts created in Part 1. ↓ Data ↓ Visualized
As the days go by, we want to find the words that appear more or less often. In other words, we fit a regression of the form

y (number of tweets for a specific word) = a × x (number of days elapsed) + b

and observe the slope `a` and the correlation coefficient. As a data-preparation step, we need to compute the "days elapsed" from the date column. The approach: within each word, assign serial numbers to the dates in ascending order.
```python
from scipy.spatial.distance import cdist
import numpy as np
import pandas as pd
import statsmodels.api as sm

port_23 = port_22.copy()
model_params = {'method': 'first', 'ascending': True}
# Parse the date column, then rank dates in ascending order within each
# word to get a per-word serial number ("days elapsed").
port_23['Created_At'] = pd.to_datetime(port_23['Created_At'])
port_23['index'] = port_23.groupby(['word'])['Created_At'].rank(**model_params)
port_23['Created_At'] = port_23['Created_At'].dt.date
```
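To see what this rank-based numbering produces, here is a minimal sketch on hypothetical toy data (two words, three dates each; the column names mirror the snippet above):

```python
import pandas as pd

# Hypothetical toy data: two words observed on three consecutive dates.
df = pd.DataFrame({
    'word': ['corona', 'corona', 'corona', 'live', 'live', 'live'],
    'Created_At': pd.to_datetime(
        ['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'count': [10, 12, 9, 3, 4, 6],
})

# Rank dates in ascending order within each word, giving each word its
# own 1-based serial number that can serve as "days elapsed".
df['index'] = df.groupby('word')['Created_At'].rank(method='first',
                                                    ascending=True)
print(df['index'].tolist())  # → [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
```

Each word's dates are numbered independently, so every word starts again from 1.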
↓ Pay attention to the x-axis. Now that we have the number of days elapsed, we are ready for regression analysis.
From the regression results, we will observe how each of the selected words changes with the number of days elapsed. Written in plain Python, this requires a somewhat tedious loop over every word.
```python
group_keys = ['word']
X_columns = ['index']
Y_column = 'count'
groups = port_23.groupby(group_keys)
models = {}
summaries = {}

def corr_xy(X, y):
    """Correlation coefficient between the objective variable and each explanatory variable."""
    X_label = X.columns.tolist()
    X = X.T.values
    y = y.values.reshape(1, -1)
    # cdist with metric='correlation' returns 1 - corr, so invert it.
    corr = 1 - cdist(X, y, metric='correlation')
    corr = pd.DataFrame(
        corr, columns=['Correlation coefficient with the objective variable'])
    corr['Explanatory variable'] = X_label
    return corr

for i, g in groups:
    X = g[X_columns]
    Y = g[Y_column].squeeze()
    corr = corr_xy(X, Y)
    try:
        model = sm.OLS(Y, sm.add_constant(X, has_constant='add')).fit()
        model.X = X.columns
        models[i] = model
        summary = pd.DataFrame(
            {
                # Use the fitted parameter index so the constant term
                # (intercept) gets its own row alongside 'index'.
                'Explanatory variable': model.params.index,
                'coefficient': np.round(model.params, 5),
                'standard deviation': np.round(model.bse, 5),
                't value': np.round(model.tvalues, 5),
                'Pr(>|t|)': np.round(model.pvalues, 5)
            },
            columns=['Explanatory variable', 'coefficient',
                     'standard deviation', 't value', 'Pr(>|t|)'])
        summary = summary.merge(corr, on='Explanatory variable', how='left')
        summaries[i] = summary
    except Exception:
        continue

res = []
for key, value in summaries.items():
    value[group_keys] = key
    res.append(value)
concat_summary = pd.concat(res, ignore_index=True)
port_24 = models
port_25 = concat_summary
```
↓ With nehan, you can avoid this tedious loop processing by using the "Create model for each group" option.
And we have the regression results for each word. Rows with Explanatory variable = const hold the intercept information.
Various interpretations are possible, but here we extract the words whose correlation coefficient is at least 0.4 in absolute value and treat them as correlated.
```python
port_27 = port_25[
    (port_25['Correlation coefficient with the objective variable'] <= -0.4) |
    (port_25['Correlation coefficient with the objective variable'] >= 0.4)]
```
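The two-sided condition can also be written with `abs()`, which reads a little more directly as "|correlation| ≥ 0.4". A minimal sketch on hypothetical summary rows (the words and values are made up for illustration):

```python
import pandas as pd

# Hypothetical regression summary: one positive, one weak, one negative correlation.
summary = pd.DataFrame({
    'word': ['live', 'mask', 'government'],
    'Correlation coefficient with the objective variable': [0.55, 0.10, -0.62],
})

# Keep rows whose correlation magnitude is at least 0.4.
strong = summary[
    summary['Correlation coefficient with the objective variable'].abs() >= 0.4]
print(strong['word'].tolist())  # → ['live', 'government']
```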
Let's take a closer look at these words.
Uptrend words
Downtrend words
↓ **Live**, counts by number of days elapsed ↓ **Government**, counts by number of days elapsed
The threat of corona has not gone away, yet the data show that words we often saw in the news at the height of the crisis appear less frequently, while words tied to activities curtailed by self-restraint, such as events and live performances, appear more often. Of course, this alone does not prove that **"everyone wants to go to live shows!"**, but we will close this theme here as an observation grounded in the data so far.
We hope this has conveyed the appeal of nehan, a programming-free analysis tool that links preprocessed data to a variety of analyses and visualizations.