Hello, this is sunfish. **"Twitter x Corona"**: this is the second installment in the series. Last time we went no further than counting tweets, but this time we will dig deeper. If the chores of natural language processing, such as installing MeCab or setting up an environment, have worn you out, please take a look.
More than half a year has passed since the coronavirus became a social issue. Let's use tweets to trace which topics are gaining attention and which are being forgotten. In this first part, we run morphological analysis and select the words to analyze.
We use the data produced by the preprocessing in the previous article, i.e. a table of tweet date and tweet text.
In this data, the same tweet text actually appears across multiple records and multiple days, because retweets are included. To remove that retweet bias, this time we analyze one record per unique tweet text.
from collections import Counter
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import statsmodels.api as sm
import re
import MeCab
import dask.dataframe as dd
from multiprocessing import cpu_count
# For each tweet text, keep only the record with the earliest Created_At (1 tweet = 1 record)
port_12['Created_At'] = pd.to_datetime(port_12['Created_At'])
port_13 = port_12.groupby(['Text']).apply(lambda grp: getattr(
    grp, 'nsmallest')(n=1, columns='Created_At', keep='first'))
# Keep only the date part of the timestamp
port_13['Created_At'] = port_13['Created_At'].map(lambda x: x.date())
Now for the heart of the language processing. Since the results are hard to interpret if every part of speech is kept, this time only **general nouns** are analyzed.
def tokenizer(text, pos, only_surface):
    # Return either the surface form or the base form of the current morpheme,
    # replacing any whitespace inside it with '_'
    def _extract():
        if only_surface:
            return re.sub(r'[\s ]+', '_', feature[0])
        else:
            return re.sub(r'[\s ]+', '_', feature[2])
    # ChaSen output format with the mecab-ipadic-NEologd dictionary
    _tagger = MeCab.Tagger(
        '-Ochasen -d {}'.format("/var/lib/mecab/dic/mecab-ipadic-neologd"))
    try:
        result = []
        # The last two lines of the parse output are 'EOS' and an empty string, so skip them
        for feature in _tagger.parse(text).split('\n')[:-2]:
            feature = feature.split('\t')
            if pos:
                # Keep only morphemes whose part of speech is in the given list
                if feature[3] in pos:
                    result.append(_extract())
            else:
                result.append(_extract())
        return ' '.join(result)
    except UnicodeEncodeError:
        return ''
    except NotImplementedError:
        return ''
port_14 = port_13.copy()
port_14['Text_morpheme'] = port_14['Text'].fillna('')
# Parallelize the morphological analysis with dask
ddf = dd.from_pandas(port_14, npartitions=cpu_count()-1)
target_cols = ['Text_morpheme']
pos = ['名詞-一般']  # '名詞-一般' = general noun
for target_col in target_cols:
    ddf[target_col] = ddf[target_col].apply(
        tokenizer, pos=pos, only_surface=True, meta=(f'{target_col}', 'object'))
port_14 = ddf.compute(scheduler='processes')
↓ nehan's morphological analysis joins the extracted morphemes with spaces and stores the result in a new column.
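For intuition, here is a minimal sketch (not part of the nehan flow) of what the tokenizer above returns for a single sentence. The sample text is made up, and the exact tokens depend on the installed dictionary.
# Hypothetical example: apply the tokenizer defined above to one sentence.
# With pos=['名詞-一般'], only general nouns are kept, joined by single spaces.
sample = 'コロナでマスクが手放せない毎日です'  # made-up sample tweet
print(tokenizer(sample, pos=['名詞-一般'], only_surface=True))
# Prints something like 'コロナ マスク' (the exact tokens depend on the dictionary)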
Note that tweets containing no general nouns end up with an empty analysis result, so we remove them with missing-value handling.
port_15 = port_14.copy()
# Tweets with no general nouns have an empty analysis result; treat them as missing and drop them
port_15['Text_morpheme'] = port_15['Text_morpheme'].replace('', pd.NA)
port_15 = port_15.dropna(subset=None, how='any')
Since there is little value in analyzing words that rarely appear, we restrict the analysis to words that appeared at least 1,500 times over the whole period.
# Aggregation of word frequency
port_18 = port_15.copy()
flat_words = list(chain.from_iterable(port_18['Text_morpheme'].str.split(' ')))
c = Counter(flat_words)
res = pd.DataFrame.from_dict(c, orient='index').reset_index()
res.columns = ['word', 'count']
port_18 = res
# Row filter by condition
port_20 = port_18[(port_18['count'] >= 1500.0)]
# Column selection
port_21 = port_20[['word']]
↓ The appearance counts of the 27 selected words look like this.
↓ As a bonus, here is a word cloud of the words before selection.
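As a reference, a similar word cloud could be reproduced outside nehan from the word frequency table port_18. This is only a sketch: it assumes the third-party wordcloud package is installed and that a Japanese-capable font exists at the path shown, which you would need to adjust for your environment.
# Hypothetical sketch: draw a word cloud from the word/count table built above.
# Requires `pip install wordcloud` and a font that can render Japanese (path is a placeholder).
from wordcloud import WordCloud
import matplotlib.pyplot as plt

freqs = dict(zip(port_18['word'], port_18['count']))  # word -> frequency
wc = WordCloud(font_path='/usr/share/fonts/truetype/fonts-japanese-gothic.ttf',
               background_color='white', width=800, height=400)
wc.generate_from_frequencies(freqs)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()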
Having narrowed down the target words in the previous step, we next create the daily data: aggregate word frequencies using Created_At as the key column.
port_16 = port_15.copy()
target_col = 'Text_morpheme'
groupby_cols = ['Created_At']
# Concatenate all analyzed tweets of the same day into one space-separated document
tmp = port_16[groupby_cols+[target_col]]
tmp = tmp.groupby(groupby_cols)[target_col].apply(lambda x: ' '.join(x))
# Count word occurrences per day, then reshape to long format (Created_At, word, count)
vec_counter = CountVectorizer(tokenizer=lambda x: x.split(' '))
X = vec_counter.fit_transform(tmp)
res = pd.DataFrame(X.toarray(), columns=vec_counter.get_feature_names(),
                   index=tmp.index).reset_index().melt(
                       id_vars=groupby_cols, var_name='word', value_name='count')
port_16 = res.sort_values(groupby_cols).reset_index(drop=True)
Here we join the two results: the daily data is restricted to the selected words, and the work of the first part is complete.
# Keep only the daily counts of the selected words
port_22 = pd.merge(port_21, port_16, how='inner',
                   left_on=['word'], right_on=['word'])
↓ As a trial, we filter the resulting data and visualize the daily occurrence counts of some words. Better a smiling face than a crying one.
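The chart in the screenshot comes from nehan, but a comparable check can be sketched in a few lines of pandas and matplotlib. The words below are placeholders, not necessarily the ones plotted above.
# Hypothetical sketch: plot daily counts for a few words from port_22.
# 'マスク' and '笑顔' are placeholder words; substitute whichever words you want to inspect.
import matplotlib.pyplot as plt

words_to_plot = ['マスク', '笑顔']
subset = port_22[port_22['word'].isin(words_to_plot)].sort_values('Created_At')

for word, grp in subset.groupby('word'):
    plt.plot(grp['Created_At'], grp['count'], label=word)

plt.xlabel('Created_At')
plt.ylabel('count')
plt.legend()
plt.show()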
It has been a long read, but we now have the daily counts of frequently appearing words, valuing simplicity over rigor. Analysis like this usually means long, fiddly code, yet nehan does it with 10 nodes (the green circles), and of course I did not write a single program myself. I hope this has made you at least a little curious about nehan.