We look back on the past year through a TF-IDF analysis of news articles about the novel coronavirus.

⑴ Document creation

1. Data source

I used the multilingual information site "nippon.com" as the data source.
It publishes information on Japanese politics, economy, society, and culture in 7 languages: Japanese, English, Simplified Chinese, Traditional Chinese, French, Spanish, and Russian.

2. Data acquisition and preprocessing

To create data for TF-IDF analysis from a website, three steps are needed: ➊ get the HTML data and extract the necessary parts, ➋ remove noise, and ➌ use morphological analysis to extract only the desired parts of speech and build a word list.
This article uses text files that have already gone through that processing and been saved to the local PC. For details, see "3. Natural language processing with Python 3-4. A year of corona looking back on TF-IDF [Data creation]".
Extracting the news articles about the novel coronavirus from the beginning of the year to the most recent gave a total of 1216 articles, dated from 1/16 to 12/19. The monthly breakdown is shown below along with the main events of each month.

Month	Number of articles	Main events
1	64	1/6 Ministry of Health, Labour and Welfare issues an alert about "pneumonia of unknown cause in Wuhan, China"; 1/16 First infected person confirmed in Japan, a Chinese man who had traveled to Wuhan
2	210	2/3 Cruise ship with passengers confirmed infected enters Yokohama port; 2/13 First death in Japan, a woman in her 80s living in Kanagawa prefecture
3	88	3/9 Expert meeting calls for avoiding the "Three Cs"; 3/24 Decision to postpone the Tokyo Olympics
4	320	4/7 State of emergency declared for 7 prefectures: the Tokyo metropolitan area, Osaka, Hyogo, and Fukuoka; 4/16 State of emergency expanded nationwide
5	357	5/4 State of emergency extended until May 31; 5/25 State of emergency fully lifted
6	65	6/19 Self-restraint on travel across prefectures eased nationwide; 6/29 Over 500,000 deaths worldwide
7	35	7/3 Over 200 new infections in Japan for the first time in 2 months; 7/22 GoTo Travel starts / 795 new infections in Japan in one day, the highest number so far
8	18	8/17 April-June GDP falls at an annualized rate of 27.8%; 8/20 Countermeasures subcommittee takes the view that the epidemic has reached its peak
9	7	9/5 WHO: "Vaccine distribution will start in the middle of next year"; 9/18 Ban on GoTo Travel reservations to and from Tokyo lifted
10	12	10/1 GoTo Eat starts; 10/12 Rapid spread of infection in Europe
11	25	11/19 Number of infections in Japan hits a record high for the second consecutive day; 11/20 Government subcommittee recommends that the government review GoTo
12	15	12/14 GoTo Travel suspended nationwide; 12/17 Tokyo reports 822 new infections in one day and moves to the highest alert level
Total	1216

3. Upload text file

Upload the 12 monthly text files to Google Colaboratory.

from google.colab import files
uploaded = files.upload()

An upload widget appears. Click [Select File] to open the file dialog, select the files, and click [Open] to start uploading.

4. Reading a text file

#Zero-pad the numbers 1 to 12 to two digits
months = ['{0:02d}'.format(i) for i in range(1,13,1)]

docs = []
for month in months:
    #Generate file name
    file_name = "nipponcom_covid19_2020-" + month + ".txt"
    #Read as text
    with open(file_name, mode='rt', encoding='utf-8-sig') as f:
        text = f.read()
        docs.append(text)

The list docs used for TF-IDF holds 12 documents, one per month, and each document contains only nouns, separated by single-byte spaces.
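
As a small illustration of this format, the snippet below uses a made-up string in place of a real monthly file (the sample text is an assumption, not data from the article). It shows that `split(" ")` and `split()` can return different word lists when the text contains line breaks or consecutive spaces, which affects the counts taken in the next section.

```python
# Hypothetical sample standing in for one month's text: words separated by
# single-byte spaces, with a double space and a line break mixed in.
sample = "infection spread  tokyo\ninfection person"

# Splitting on a single space keeps empty strings and embedded newlines.
by_space = sample.split(" ")

# Splitting on any run of whitespace drops both.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

If the monthly files contain line breaks between articles, `split()` may therefore give cleaner counts than `split(" ")`; whether that applies here depends on how the files were written.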

⑵ Overview of data

First, get a quantitative overview of the data by counting the number of elements and the number of unique words in each month.

1. Monthly number of extracts and vocabulary

import pandas as pd

metrics = []
for doc in docs:
    value = []
    #Split with whitespace as delimiter
    words = pd.Series(doc.split(" "))
    #Count the number of elements
    value.append(len(words))
    #Count the number of unique elements
    value.append(words.nunique())
    metrics.append(value)

#Format as a data frame
names = ["Number of extracts", "Vocabulary number"]
months = ['Month {0}'.format(i) for i in range(1, 13, 1)]
pd.DataFrame(metrics, columns=names, index=months)

2. Top 10 words of monthly appearance frequency

For each month, get the top 10 words in descending order of appearance frequency.

from collections import Counter

rank_frequency = []
for doc in docs:
    value = []
    #Split with whitespace as delimiter
    words = pd.Series(doc.split(" "))
    #Count how many times each word occurs
    cnt = Counter(words)
    v = cnt.most_common(10) #Top 10 (word, count) pairs
    value.append(v)
    rank_frequency.append(value)
    
rank_frequency

Put the results together in a data frame.

import numpy as np

#Get the top 10 words each month
ranking = []
for a in rank_frequency:    
    temp = []
    for i in a:
        for n in range(0,10,1):
            j = i[n]
            temp.append(j[0])
    ranking.append(temp)

#Data frame (rows: rank, columns: month)
data = np.array(ranking).T
rank = ['Rank {0}'.format(i) for i in range(1, 11, 1)]
pd.DataFrame(data, columns=months, index=rank)

⑶ TF-IDF analysis

Use scikit-learn's TfidfVectorizer to calculate the $tfidf$ score.

from sklearn.feature_extraction.text import TfidfVectorizer

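#Build the model and fit it to the 12 monthly documents
vectorizer = TfidfVectorizer(smooth_idf=False)
X = vectorizer.fit_transform(docs)

#Convert to a data frame (rows: month, columns: word)
values = X.toarray()
#get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()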
month_num = ['{0:02d}'.format(i) for i in range(1,13,1)]
df_score = pd.DataFrame(values, columns = feature_names, index=month_num)

print(df_score)
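
For reference, the score produced by this setup can be reproduced by hand. The sketch below is a plain-Python illustration under the assumption, taken from the scikit-learn documentation, that with smooth_idf=False the vectorizer uses $idf(t) = \ln(N / df(t)) + 1$, multiplies it by the raw term count, and scales each document vector to unit Euclidean length (the default norm='l2'). The toy documents and the function name `tfidf_rows` are made up for illustration, and the sketch does not replicate the vectorizer's tokenization.

```python
import math
from collections import Counter

# Toy documents (hypothetical); each is a space-separated word list like docs.
toy_docs = [
    "wuhan pneumonia wuhan",
    "cruise ship pneumonia",
    "emergency declaration",
]

def tfidf_rows(documents):
    """Return one {word: score} dict per document.

    tf is the raw count, idf = ln(N / df) + 1 (the unsmoothed form),
    and each row is scaled to unit Euclidean length.
    """
    counts = [Counter(d.split()) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for c in counts for word in c)
    rows = []
    for c in counts:
        raw = {w: tf * (math.log(n_docs / df[w]) + 1.0) for w, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({w: v / norm for w, v in raw.items()})
    return rows

scores = tfidf_rows(toy_docs)
for row in scores:
    # Words of each document, highest score first.
    print(sorted(row.items(), key=lambda kv: -kv[1]))
```

A word that is frequent in one document but rare across documents (here "wuhan" in the first toy document) ends up with the highest score in that document, which is the property the monthly rankings rely on.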

The output is 12 rows x 14367 columns: one row per month and one column per word extracted across the 12 months, beginning with "abenomics".
Next, get the top 10 words of each month based on the $tfidf$ score.

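for i in range(0,12,1):
    #Take the row for month i and transpose it so that the words become the index
    df_score_ = df_score[i:i+1].T
    #Sort by that month's score in descending order
    df_score_sorted = df_score_.sort_values(month_num[i], ascending=False)
    print(df_score_sorted.head(10))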

Put the results together in a data frame.

result = []
for i,j in zip(range(0,12,1), month_num):
    monthly = df_score[i:i+1].T
    #Get the top 10 words
    monthly_sorted = monthly.sort_values(j, ascending=False)
    monthly_rank = monthly_sorted.head(10)
    #Keep only the words (the index labels)
    r = monthly_rank.index
    result.append(r)

pd.DataFrame(result,columns=rank,index=months).T

⑷ Comparison of frequency of occurrence and TF-IDF analysis

Compare the top 10 words by frequency of occurrence with the top 10 words by TF-IDF. Words that appear in more than one month are grayed out so that the characteristic words of each month stand out.
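
One caveat when reading the two rankings side by side: the frequency count in section ⑵ counts every space-separated token, whereas TfidfVectorizer, with its default settings as described in the scikit-learn documentation, lowercases the text and keeps only tokens of two or more word characters. The sketch below illustrates the difference with a made-up sample in which "x" stands in for a one-character noun; whether such tokens occur in the article's data is not stated, so this is an assumption to check rather than a finding.

```python
import re
from collections import Counter

# Default token_pattern of TfidfVectorizer (per the scikit-learn documentation).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical sample text; "x" stands in for a one-character noun.
sample = "Corona x infection x x"

# Frequency count as in section ⑵: every whitespace-separated token is counted.
freq_tokens = Counter(sample.split())

# What the vectorizer would see: lowercased, one-character tokens dropped.
tfidf_tokens = Counter(TOKEN_PATTERN.findall(sample.lower()))

print(freq_tokens.most_common(1))
print(sorted(tfidf_tokens))
```

If one-character nouns matter for the analysis, passing a different token_pattern to TfidfVectorizer is one way to keep them.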

Top 10 words of appearance frequency

Top 10 words of TF-IDF

At first glance, the TF-IDF ranking has fewer duplicated words and more characteristic words, which makes the topic of each month easier to grasp.
For example, in January a Chinese man who had traveled to Wuhan was confirmed as the first infected person in Japan, and in February attention focused on developments around the cruise ship on which passengers were confirmed to be infected.
March to May covers the period from the spread of infection and the declaration of the state of emergency to its lifting, and during this period fewer words are characteristic of a single month.
From June, after the first wave had passed, characteristic words become prominent again. However, the number of articles drops sharply, so the content presumably depends more heavily on the particular data source.
Then, in November, the rapidly advancing third wave began, against the background of a renewed global spread of infection from which Japan was no exception.
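
The words that appear in more than one month, which are the ones grayed out in the comparison above, can also be found mechanically. The following is a minimal sketch using made-up top-word lists (not the article's results); the names `top_words` and `shared_words` are hypothetical.

```python
from collections import Counter

# Hypothetical top-word lists for three months (not the article's results).
top_words = {
    "01": ["wuhan", "pneumonia", "infection"],
    "02": ["cruise", "ship", "infection"],
    "03": ["olympics", "postponement", "infection"],
}

def shared_words(ranking):
    """Return the set of words that appear in more than one month's list."""
    # set() so that a word repeated within one month is only counted once.
    months_per_word = Counter(w for words in ranking.values() for w in set(words))
    return {w for w, n in months_per_word.items() if n > 1}

shared = shared_words(top_words)
# Words left after removing the shared ones are the month-specific candidates.
unique_by_month = {
    m: [w for w in ws if w not in shared] for m, ws in top_words.items()
}
print(shared)
print(unique_by_month)
```

Comparing the size of `shared` for the frequency ranking and for the TF-IDF ranking would give a simple number to back up the impression that TF-IDF has fewer duplicates.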

⑸ Transition of new words

As an application of TF-IDF, let us try introducing the concept of a **"new word"**.
We exclude the words that had appeared up to the previous month (existing words) and focus on the words that first appeared in that month (new words). These can be said to represent the **new information and aspects** encountered each month.
Below, taking October as the example, we first collect the existing words up to September into word_list.

import itertools

#Specify October
n = 10

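#Collect the words with a non-zero score in the months before month n.
#The slices df_score[i:n-1] overlap, so the same word is appended many times.
word_list = []
for i in range(0,n,1):
    df = df_score[i:n-1]
    #Keep only the columns (words) with a non-zero score in some row
    df = df.loc[:, (df != 0).any(axis=0)]
    word = list(df.columns)
    word_list.append(word)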

#Flatten to one dimension
word_list = list(itertools.chain.from_iterable(word_list))

len(word_list)

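word_list contains 75673 entries for the words that had appeared by September. Because the loop above collects overlapping ranges of months, this count includes the same word more than once.
Take the words that appear in October, exclude those that had already appeared by September, and find the top 10 of the remaining words by $tfidf$ score.
Note that the df_score used here was computed separately, from the data for January to October only.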

#Extract only the current month
df_current = df_score[n-1:n]
df_current = df_current.loc[:, (df_current != 0).any(axis=0)]

#Remove the existing words (set() avoids testing the same word repeatedly)
existing = [w for w in set(word_list) if w in df_current.columns]
df_current = df_current.drop(columns=existing)

#Extract the top 10 words by TF-IDF
df_current = df_current.T
#The row label is the zero-padded month, e.g. '10' (or '01' for January)
df_sorted = df_current.sort_values('{0:02d}'.format(n), ascending=False)
df_sorted.head(10)

The above processing was repeated for each of the 12 months, and the new words of each month are summarized in the table below.
For January there is no previous month and therefore no existing words, so the result is the same as the January top 10 in the TF-IDF ranking above.
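
The month-by-month procedure can be sketched as a single loop. The example below is a simplified illustration, not the code used for the table: it keeps a running set of words already seen instead of rebuilding word_list each time, it works on a small made-up score table (`monthly_scores`, a hypothetical stand-in for df_score), and it does not recompute TF-IDF for each range of months as the article does.

```python
# Hypothetical monthly scores: month -> {word: tfidf score}. A word with a zero
# score is simply absent from the dict, like a zero cell in df_score.
monthly_scores = {
    "01": {"wuhan": 0.8, "pneumonia": 0.5, "infection": 0.3},
    "02": {"cruise": 0.7, "infection": 0.6, "pneumonia": 0.2},
    "03": {"olympics": 0.9, "cruise": 0.4, "postponement": 0.3},
}

def new_words_by_month(scores, top_n=10):
    """For each month in order, rank only the words not seen in earlier months."""
    seen = set()  # every word that appeared in an earlier month
    result = {}
    for month in sorted(scores):
        fresh = {w: s for w, s in scores[month].items() if w not in seen}
        ranked = sorted(fresh, key=fresh.get, reverse=True)
        result[month] = ranked[:top_n]
        seen.update(scores[month])  # this month's words become existing words
    return result

new_words = new_words_by_month(monthly_scores)
print(new_words)
```

In the first month nothing has been seen yet, so all of its words are returned, which matches the remark about January above.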

As of January the name "new coronavirus" already existed, but the words **"pneumonia"** and **"new pneumonia"** stand out. As mentioned earlier, the name "new coronavirus" takes root from February to March as the infection spreads, and the abbreviation **"corona"** can be seen becoming more common in April.
In addition, "Tan", ranked first in October, refers to Taiwanese IT Minister Audrey Tan. She called on the online community and had the mask map app completed in a very short period of time, resolving the situation in which people could not buy masks.
App development started on the afternoon of February 4 and continued into the early morning of the following day, the 5th. After the government published its data format, the app was officially released at 10 a.m. on the 6th. The update cycle of the inventory data was then shortened and various mask map apps were developed; after April 30 the supply and demand of masks came into balance, and the app is said to have fulfilled its role. In Japan, incidentally, this was around the time when foreign matter and mold were being found and removed from the so-called Abenomask.
If you are interested, please see "Au Audrey Tan Genius IT Minister 7 Faces" (Bungeishunju).

3. Natural language processing with Python 3-3. A year of corona looking back at TF-IDF
