When I read sentences, I sometimes get the feeling that two of them are similar. **TF-IDF** is a convenient way to express that vague feeling of "somewhat similar" objectively. There are sites that explain TF-IDF in an easy-to-understand way, so please google it. One site I can recommend is the following: [For beginners] I briefly summarized TF-IDF
The starting point is the idea that (1) if a word appears frequently in a given document (**TF: Term Frequency**) and (2) if it is a rare word that does not appear often in ordinary documents (**IDF: Inverse Document Frequency**), then that document is probably about a topic related to that word. The basic idea of TF-IDF is to multiply TF and IDF for each word and to compare these weighted values across documents to judge how similar they are.
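As a rough numerical sketch of the idea (the exact formulas vary by implementation; here I use the word's relative frequency for TF and the smoothed IDF that scikit-learn applies by default), the weight of a single word could be computed like this:

```python
import math

# Toy numbers: the word "soccer" appears 3 times in a 100-word document,
# and occurs in 2 out of 4 documents in the corpus.
tf = 3 / 100                            # term frequency within the document
idf = math.log((1 + 4) / (1 + 2)) + 1   # smoothed inverse document frequency
print(tf * idf)                         # roughly 0.045 -- the TF-IDF weight of "soccer"
```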
Well, talking about it in the abstract won't get us anywhere, so let's actually calculate it using **scikit-learn (sklearn)**, which is often used for AI-like tasks in Python. First, the preparation.
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
```
For the TF-IDF calculation, **CountVectorizer** and **TfidfTransformer** can also be used in combination instead of the **TfidfVectorizer** above (I will try this later). In that combination, word frequencies are first vectorized with CountVectorizer, and TF-IDF is then computed with TfidfTransformer. If all you want is TF-IDF, however, it is easier to do everything at once with TfidfVectorizer, so that is what I used this time. The results differed slightly between the TfidfVectorizer case and the CountVectorizer plus TfidfTransformer case (details later). I suspect it is down to the parameters, but unfortunately I don't know the exact reason; if you do, please leave a comment. In either case, the similarity is judged from the **cosine** of the angle between the vectors.
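For reference, the cosine similarity is just the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch with made-up toy vectors:

```python
import numpy as np

a = np.array([0.2, 0.0, 0.7, 0.1])  # toy TF-IDF vector of document A
b = np.array([0.1, 0.3, 0.6, 0.0])  # toy TF-IDF vector of document B

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 1.0 = same direction (very similar), 0.0 = no overlap at all
```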
Next, handling Japanese requires morphological analysis. English text already has spaces between words, so it can be analyzed as is, but in Japanese the words run together within a sentence, so each sentence first has to be split into words. The best-known morphological analyzer for Python is **MeCab**, but **janome** is easier to get started with, so I will use **janome** this time.
```python
from janome.tokenizer import Tokenizer
```
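As a quick check of what janome gives us (the sample sentence is just an illustration), each token carries a surface form and a comma-separated part-of-speech string, which the code below relies on:

```python
t = Tokenizer()
for token in t.tokenize('彼はサッカーの試合を見た'):
    # token.surface is the word itself; token.part_of_speech is e.g. '名詞,一般,*,*'
    print(token.surface, token.part_of_speech)
```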
Now for the documents to compare; let's use some easy-to-understand news articles. With my own judgment and bias, I'll focus on my favorite sport, soccer. The first article is the news that Frontale, running away at the top of the table, beat second-place Cerezo.
text1.txt
```
Kawasaki F crush second-place C Osaka with three goals and pull clear! Eight straight wins widen the gap to 14 points
Round 20 of the Meiji Yasuda Life J1 League was held on the 3rd, with Cerezo Osaka welcoming Kawasaki Frontale to their home ground, Yanmar Stadium Nagai.
Through the 20 matches so far, Cerezo Osaka (42 points) have been chasing Kawasaki F (53 points). Home side Cerezo Osaka won for the first time in three games in the previous round, while Kawasaki F came into this top-of-the-table clash with the overwhelming momentum of seven straight wins.
```
Next, another news article covering the same match. The hypothesis is that the first two articles should be very similar; let's test it.
text2.txt
```
The gap grows to 14 points... Kawasaki F edge out C Osaka for their eighth straight win!!
Round 20 of the J1 League was held on the 3rd, with second-place Cerezo Osaka (42 points) hosting league leaders Kawasaki Frontale (53 points), 11 points ahead, at Yanmar Stadium Nagai. Kawasaki F took the lead through an own goal in the 37th minute of the first half, but Cerezo Osaka equalized through FW Hiroaki Okuno in the 17th minute of the second half. However, FW Leandro Damiao in the 38th minute and MF Kaoru Mitoma in the 39th minute found the net in quick succession, and Kawasaki F won 3-1.
```
The third is also soccer news, but about a different topic.
text3.txt
```
Gamba Osaka's Yasuhito Endo moves to Iwata! The "relationship with manager Miyamoto" behind his "decision"
Some sports newspapers have reported that Gamba Osaka's Yasuhito Endo, a legend of the Japan national team, will join J2 Iwata on loan. As soon as the news broke, not only Gamba Osaka supporters but many other soccer fans expressed their surprise online. Endo has been a mainstay of Gamba Osaka ever since transferring from Kyoto in 2001. Wearing Gamba Osaka's No. 7 and serving as the team's playmaker, he has contributed as a core player to every title the club has won.
```
Next, staying within the same genre of sports, let's compare with a baseball article.
text4.txt
```
Masahiro Tanaka, a free agent this offseason, is "a pitcher worth the price"
Masahiro Tanaka, a pitcher who becomes a free agent (FA) after the end of this season, has already heard calls from his team and the local media asking him to stay. The right-hander, who had pitched masterfully in the playoffs through last season, started the second game of the Wild Card Series against the Indians on September 30 (October 1 Japan time). In poor, rainy conditions he struggled, giving up six runs over four innings, but the team as a whole clinched its advance to the Division Series.
```
Finally, another news article, but from a completely different genre.
text5.txt
```
[New coronavirus] US President hospitalized; a White House "cluster" and the WHO
US President Trump traveled by presidential helicopter from the White House on the 2nd to the Walter Reed Medical Center near Washington to receive treatment for the novel coronavirus infection (COVID-19), prompting widespread concern about the severity of his condition.
```
The five documents above will be read in and analyzed. My expectation is that the similarity to document 1 will rank as document 2 > document 3 > document 4 > document 5. Will it really turn out that way?
First, read in the text files. Don't forget the morphological analysis.
```python
filenames = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt', 'text5.txt']
wakati_list = []
t = Tokenizer()
for filename in filenames:
    # Read the text file and assign its contents to text
    with open(filename, mode='r', encoding='utf-8-sig') as f:
        text = f.read()
    wakati = ''
    for token in t.tokenize(text):  # Morphological analysis
        hinshi = token.part_of_speech.split(',')[0]    # Main part-of-speech tag
        hinshi_2 = token.part_of_speech.split(',')[1]  # Sub part-of-speech tag
        if hinshi in ['名詞']:  # Keep only nouns
            if not hinshi_2 in ['空白', '*']:
                # Skip tokens whose sub tag is blank (空白) or '*'
                word = str(token).split()[0]  # Surface form of the token
                if not ',*,' in word:  # Skip if the extracted word contains ',*,'
                    wakati = wakati + word + ' '
                    # Append the word and a space to wakati
    wakati_list.append(wakati)  # Add the space-separated nouns for this file to the list
wakati_list_np = np.array(wakati_list)  # Convert the list to an ndarray
```
Finally, the calculation of similarity. Let's use TfidfVectorizer.
```python
vectorizer = TfidfVectorizer(token_pattern=u'\\b\\w+\\b')  # token_pattern keeps one-character words
transformer = TfidfTransformer()  # Create the transformer (TF-IDF weighting)
tf = vectorizer.fit_transform(wakati_list_np)  # Vectorization with TfidfVectorizer
tfidf = transformer.fit_transform(tf)  # Apply TfidfTransformer to that output
tfidf_array = tfidf.toarray()
cs = cosine_similarity(tfidf_array, tfidf_array)  # Cosine similarity calculation
print(cs)
```
The results are as follows. The relative magnitude of the similarity is, of course, as expected.
```
[[1.         0.48812198 0.04399067 0.02065671 0.00164636]
 [0.48812198 1.         0.02875532 0.01380959 0.00149348]
 [0.04399067 0.02875532 1.         0.02595705 0.        ]
 [0.02065671 0.01380959 0.02595705 1.         0.00350631]
 [0.00164636 0.00149348 0.         0.00350631 1.        ]]
```
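If you want to read off the ranking against text1 programmatically rather than by eye, a small sketch like this (just sorting row 0 of the matrix) does the job:

```python
order = np.argsort(cs[0])[::-1]  # document indices, most similar to text1 first
for i in order[1:]:              # skip index 0, which is text1 itself
    print(filenames[i], cs[0][i])
```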
By the way, the CountVectorizer and TfidfTransformer combination mentioned at the beginning looks like this. They have to be imported before they can be used.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Create the vectorizer; token_pattern=u'\\b\\w+\\b' keeps one-character words
vectorizer = CountVectorizer(token_pattern=u'\\b\\w+\\b')
# Create the transformer (TF-IDF weighting)
transformer = TfidfTransformer()
tf = vectorizer.fit_transform(wakati_list_np)  # Vectorization (raw word counts)
tfidf = transformer.fit_transform(tf)  # TF-IDF
tfidf_array = tfidf.toarray()
cs = cosine_similarity(tfidf_array, tfidf_array)  # Cosine similarity calculation
print(cs)
```
The results are as follows. This combination gives slightly higher similarity values.
```
[[1.         0.59097619 0.07991729 0.03932476 0.00441963]
 [0.59097619 1.         0.05323053 0.03037231 0.00418569]
 [0.07991729 0.05323053 1.         0.03980858 0.        ]
 [0.03932476 0.03037231 0.03980858 1.         0.01072682]
 [0.00441963 0.00418569 0.         0.01072682 1.        ]]
```
If you want to calculate similarity in Python, **doc2vec** is also a good option. However, loading a trained model is a hurdle there. In that sense, I think **TF-IDF** lets you calculate the similarity between documents easily.
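For reference, a minimal doc2vec sketch with gensim might look like the following; this is only an illustration that assumes gensim 4.x is installed and trains a tiny model from scratch on the same wakati_list instead of loading a pre-trained one, with placeholder parameters:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document becomes a TaggedDocument of its space-separated tokens
documents = [TaggedDocument(words=w.split(), tags=[i]) for i, w in enumerate(wakati_list)]

model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=100)  # toy parameters
print(model.dv.most_similar(0))  # documents most similar to text1
```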
For the code, I referred to the following site. I would like to take this opportunity to say thank you.
Try various things with Python