Hello, this is sunfish. **"Twitter x Corona"**: this is the second installment in the series. Last time we went no further than counting tweets, but this time we will dig deeper. If the chores of natural language processing, such as installing MeCab or setting up an environment, have worn you out, please take a look.
More than half a year has passed since the coronavirus became a social issue. Let's use tweets to trace which topics are gaining attention and which are being forgotten. In this first part, we run morphological analysis and select the words to analyze.
We use the data produced by the preprocessing in the previous article, i.e. a table of tweet date and tweet text.
In this data, the same tweet text actually appears across multiple records and multiple days, because retweets are included. To remove that retweet bias, this time we analyze one record per unique tweet text.
from collections import Counter
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import statsmodels.api as sm
import re
import MeCab
import dask.dataframe as dd
from multiprocessing import cpu_count
# For each tweet text, keep only the record with the earliest Created_At (1 tweet = 1 record)
port_12['Created_At'] = pd.to_datetime(port_12['Created_At'])
port_13 = port_12.groupby(['Text']).apply(lambda grp: getattr(
    grp, 'nsmallest')(n=1, columns='Created_At', keep='first'))
# Keep only the date part of the timestamp
port_13['Created_At'] = port_13['Created_At'].map(lambda x: x.date())
Now for the heart of the language processing. Since the results are hard to interpret if every part of speech is kept, this time only **general nouns** are analyzed.
def tokenizer(text, pos, only_surface):
    # Return either the surface form or the base form of the current morpheme,
    # replacing any whitespace inside it with '_'
    def _extract():
        if only_surface:
            return re.sub(r'[\s ]+', '_', feature[0])
        else:
            return re.sub(r'[\s ]+', '_', feature[2])
    # ChaSen output format with the mecab-ipadic-NEologd dictionary
    _tagger = MeCab.Tagger(
        '-Ochasen -d {}'.format("/var/lib/mecab/dic/mecab-ipadic-neologd"))
    try:
        result = []
        # The last two lines of the parse output are 'EOS' and an empty string, so skip them
        for feature in _tagger.parse(text).split('\n')[:-2]:
            feature = feature.split('\t')
            if pos:
                # Keep only morphemes whose part of speech is in the given list
                if feature[3] in pos:
                    result.append(_extract())
            else:
                result.append(_extract())
        return ' '.join(result)
    except UnicodeEncodeError:
        return ''
    except NotImplementedError:
        return ''
port_14 = port_13.copy()
port_14['Text_morpheme'] = port_14['Text'].fillna('')
# Parallelize the morphological analysis with dask
ddf = dd.from_pandas(port_14, npartitions=cpu_count()-1)
target_cols = ['Text_morpheme']
pos = ['名詞-一般']  # '名詞-一般' = general noun
for target_col in target_cols:
    ddf[target_col] = ddf[target_col].apply(
        tokenizer, pos=pos, only_surface=True, meta=(f'{target_col}', 'object'))
port_14 = ddf.compute(scheduler='processes')
↓ nehan's morphological analysis joins the extracted morphemes with spaces and stores the result in a new column.
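For intuition, here is a minimal sketch (not part of the nehan flow) of what the tokenizer above returns for a single sentence. The sample text is made up, and the exact tokens depend on the installed dictionary.
# Hypothetical example: apply the tokenizer defined above to one sentence.
# With pos=['名詞-一般'], only general nouns are kept, joined by single spaces.
sample = 'コロナでマスクが手放せない毎日です'  # made-up sample tweet
print(tokenizer(sample, pos=['名詞-一般'], only_surface=True))
# Prints something like 'コロナ マスク' (the exact tokens depend on the dictionary)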
Note that tweets containing no general nouns end up with an empty analysis result, so we remove them with missing-value handling.
port_15 = port_14.copy()
# Tweets with no general nouns have an empty analysis result; treat them as missing and drop them
port_15['Text_morpheme'] = port_15['Text_morpheme'].replace('', pd.NA)
port_15 = port_15.dropna(subset=None, how='any')
Since there is little value in analyzing words that rarely appear, we restrict the analysis to words that appeared at least 1,500 times over the whole period.
# Aggregation of word frequency
port_18 = port_15.copy()
flat_words = list(chain.from_iterable(port_18['Text_morpheme'].str.split(' ')))
c = Counter(flat_words)
res = pd.DataFrame.from_dict(c, orient='index').reset_index()
res.columns = ['word', 'count']
port_18 = res
# Row filter by condition
port_20 = port_18[(port_18['count'] >= 1500.0)]
# Column selection
port_21 = port_20[['word']]
↓ The appearance counts of the 27 selected words look like this.
↓ As a bonus, here is a word cloud of the words before selection.
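As a reference, a similar word cloud could be reproduced outside nehan from the word frequency table port_18. This is only a sketch: it assumes the third-party wordcloud package is installed and that a Japanese-capable font exists at the path shown, which you would need to adjust for your environment.
# Hypothetical sketch: draw a word cloud from the word/count table built above.
# Requires `pip install wordcloud` and a font that can render Japanese (path is a placeholder).
from wordcloud import WordCloud
import matplotlib.pyplot as plt

freqs = dict(zip(port_18['word'], port_18['count']))  # word -> frequency
wc = WordCloud(font_path='/usr/share/fonts/truetype/fonts-japanese-gothic.ttf',
               background_color='white', width=800, height=400)
wc.generate_from_frequencies(freqs)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()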
Having narrowed down the target words in the previous step, we next create the daily data: aggregate word frequencies using Created_At as the key column.
port_16 = port_15.copy()
target_col = 'Text_morpheme'
groupby_cols = ['Created_At']
# Concatenate all analyzed tweets of the same day into one space-separated document
tmp = port_16[groupby_cols+[target_col]]
tmp = tmp.groupby(groupby_cols)[target_col].apply(lambda x: ' '.join(x))
# Count word occurrences per day, then reshape to long format (Created_At, word, count)
vec_counter = CountVectorizer(tokenizer=lambda x: x.split(' '))
X = vec_counter.fit_transform(tmp)
res = pd.DataFrame(X.toarray(), columns=vec_counter.get_feature_names(),
                   index=tmp.index).reset_index().melt(
                       id_vars=groupby_cols, var_name='word', value_name='count')
port_16 = res.sort_values(groupby_cols).reset_index(drop=True)
Here we join the two results: the daily data is restricted to the selected words, and the work of the first part is complete.
# Keep only the daily counts of the selected words
port_22 = pd.merge(port_21, port_16, how='inner',
                   left_on=['word'], right_on=['word'])
↓ As a trial, we filter the resulting data and visualize the daily occurrence counts of some words. Better a smiling face than a crying one.
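The chart in the screenshot comes from nehan, but a comparable check can be sketched in a few lines of pandas and matplotlib. The words below are placeholders, not necessarily the ones plotted above.
# Hypothetical sketch: plot daily counts for a few words from port_22.
# 'マスク' and '笑顔' are placeholder words; substitute whichever words you want to inspect.
import matplotlib.pyplot as plt

words_to_plot = ['マスク', '笑顔']
subset = port_22[port_22['word'].isin(words_to_plot)].sort_values('Created_At')

for word, grp in subset.groupby('word'):
    plt.plot(grp['Created_At'], grp['count'], label=word)

plt.xlabel('Created_At')
plt.ylabel('count')
plt.legend()
plt.show()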
It has been a long read, but we now have the daily counts of frequently appearing words, valuing simplicity over rigor. Analysis like this usually means long, fiddly code, yet nehan does it with 10 nodes (the green circles), and of course I did not write a single program myself. I hope this has made you at least a little curious about nehan.