The janome library introduced at PyCon 2015 is convenient. Using the text of Wikipedia, I would like to rank how closely the retired pitcher Masahiro Yamamoto is related to each team.
Japanese nouns are extracted from Wikipedia articles with the morphological analyzer janome, and a feature vector is built for each article with TF-IDF. Taking the inner product of the feature vectors of two articles and computing cos θ gives a similarity between 0 and 1. Sort the articles by this similarity and you're done.
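Concretely, the similarity used below is the cosine of the angle between two feature vectors (the standard cosine similarity, nothing specific to these tools):

similarity = cos θ = (v1 · v2) / (|v1| × |v2|)

Because every component is a non-negative noun weight, the value always falls between 0 and 1.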
Installing MeCab is a hassle: the build depends on the Python version and a dictionary has to be installed separately. Janome, which can be installed with pip alone, is convenient, so you can take on morphological analysis easily whenever you need it.
pip install janome
from janome.tokenizer import Tokenizer
t = Tokenizer()
text = """
Two years after joining the Hiroshima era, no buds appeared and were overtaken by younger Tomonori Maeda and Akira Eto.
The batting was so weak that the coach at that time said, "Roll and use your legs."
The outfielder is also nicknamed "Mole Killer" because of the bad habit of throwing the ball toward the ground....
"""
for token in t.tokenize(text):
    print(token)
------------------
Hiroshima noun,Proper noun,area,General,*,*,Hiroshima,Hiroshima,Hiroshima
Period noun,General,*,*,*,*,Era,Jidai,Jidai
Join noun,Change connection,*,*,*,*,Join,New Dan,New Dan
After noun,suffix,Adverbs possible,*,*,*,rear,Go,Go
Particles,Attributive,*,*,*,*,of,No,No
2 nouns,number,*,*,*,*,*,*,*
Annual noun,suffix,Classifier,*,*,*,Year,Nenkan,Nenkan
Is a particle,Particle,*,*,*,*,Is,C,Wow
Bud noun,General,*,*,*,*,Bud,Me,Me
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
Verb,Independence,*,*,One step,Imperfective form,Get out,De,De
Auxiliary verb,*,*,*,Special,Continuous connection,Nu,Zu,Zu
, Symbol,Comma,*,*,*,*,、,、,、
Younger noun,General,*,*,*,*,younger,Toshishita,Toshishita
Particles,Attributive,*,*,*,*,of,No,No
Maeda noun,Proper noun,Personal name,Surname,*,*,Maeda,Maeda,Maeda
Wisdom noun,Proper noun,Personal name,Name,*,*,Wisdom,Tomonori,Tomonori
And particles,Parallel particles,*,*,*,*,Or,Ya,Ya
Eto noun,Proper noun,Personal name,Surname,*,*,Eto,Eto,Eto'o
Wisdom noun,Proper noun,Personal name,Name,*,*,Satoshi,Satoshi,Satoshi
Noun,suffix,General,*,*,*,Et al.,La,La
Particles,Case particles,General,*,*,*,To,D,D
Overtake or verb,Independence,*,*,Five-dan / Ka line,Imperfective form,Overtake,Oinuka,Oinuka
...
The idea is that by counting the nouns in a text you get a feature vector for that text. Unlike English, Japanese has no whitespace between words, so morphological analysis is needed to split the text into words, and that is where the morphological analyzer janome comes in.
TF-IDF = (number of occurrences of a specific noun in the text) / (total number of nouns in the text)
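As a minimal sketch of this formula (my own helper using janome's tokenizer; this is not the code inside simple_tfidf_japanese), count each noun and divide by the total number of nouns:

from collections import Counter
from janome.tokenizer import Tokenizer

def noun_frequencies(text):
    # Keep only tokens whose first part-of-speech field is 名詞 (noun).
    tokenizer = Tokenizer()
    nouns = [token.surface for token in tokenizer.tokenize(text)
             if token.part_of_speech.split(',')[0] == '名詞']
    counts = Counter(nouns)
    total = sum(counts.values())
    # Frequency of each noun relative to all nouns in the text.
    return {noun: count / total for noun, count in counts.items()}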
Example sentence
Meat festival NIIGATA for a night full of meat ❤︎ Steak House Azuma-san's Yukimuro Aged Niigata Prefecture Beef Steak Delicious*\(^o^)/*Perfect for salt or wasabi!
Result
from simple_tfidf_japanese.tfidf import TFIDF

text = "Meat festival NIIGATA for a night full of meat ❤︎ Steak House Azuma-san's Yukimuro Aged Niigata Prefecture Beef Steak Delicious*\(^o^)/*Perfect for salt or wasabi!"
result = TFIDF.gen(text, enable_one_char=1)
for key, value in result:
    print(key, value)
Meat 0.0952380952381
Steak 0.0952380952381
0 0.047619047619
Rice 0.047619047619
Snow 0.047619047619
Niigata 0.047619047619
Aging 0.047619047619
Salt 0.047619047619
Festival 0.047619047619
Azuma 0.047619047619
Wasabi 0.047619047619
Cow 0.047619047619
House 0.047619047619
Night 0.047619047619
Samadhi 0.047619047619
Meat is great! The standout features of this sentence are "meat" and "steak", each appearing twice (2/21 ≈ 0.095), while every other noun appears once (1/21 ≈ 0.048).
I registered the tool I created on PyPI.
pip install simple_tfidf_japanese
Let's compare how strongly Masa (Masahiro Yamamoto) is related to each team based on their Wikipedia pages. As a control, I'll also mix in a soccer article that has nothing to do with the comparison.
from simple_tfidf_japanese.tfidf import TFIDF
#Masahiro Yamamoto
_base_url = "https://ja.wikipedia.org/wiki/%E5%B1%B1%E6%9C%AC%E6%98%8C"
#Comparison
data = [
['Yakult', 'https://ja.wikipedia.org/wiki/%E6%9D%B1%E4%BA%AC%E3%83%A4%E3%82%AF%E3%83%AB%E3%83%88%E3%82%B9%E3%83%AF%E3%83%AD%E3%83%BC%E3%82%BA'],
['Giants', 'https://ja.wikipedia.org/wiki/%E8%AA%AD%E5%A3%B2%E3%82%B8%E3%83%A3%E3%82%A4%E3%82%A2%E3%83%B3%E3%83%84'],
['Hanshin', 'https://ja.wikipedia.org/wiki/%E9%98%AA%E7%A5%9E%E3%82%BF%E3%82%A4%E3%82%AC%E3%83%BC%E3%82%B9'],
['Hiroshima', 'https://ja.wikipedia.org/wiki/%E5%BA%83%E5%B3%B6%E6%9D%B1%E6%B4%8B%E3%82%AB%E3%83%BC%E3%83%97'],
['Chunichi', 'https://ja.wikipedia.org/wiki/%E4%B8%AD%E6%97%A5%E3%83%89%E3%83%A9%E3%82%B4%E3%83%B3%E3%82%BA'],
['Yokohama', 'https://ja.wikipedia.org/wiki/%E6%A8%AA%E6%B5%9CDeNA%E3%83%99%E3%82%A4%E3%82%B9%E3%82%BF%E3%83%BC%E3%82%BA'],
['Softbank', 'https://ja.wikipedia.org/wiki/%E7%A6%8F%E5%B2%A1%E3%82%BD%E3%83%95%E3%83%88%E3%83%90%E3%83%B3%E3%82%AF%E3%83%9B%E3%83%BC%E3%82%AF%E3%82%B9'],
['Nippon-Ham', 'https://ja.wikipedia.org/wiki/%E5%8C%97%E6%B5%B7%E9%81%93%E6%97%A5%E6%9C%AC%E3%83%8F%E3%83%A0%E3%83%95%E3%82%A1%E3%82%A4%E3%82%BF%E3%83%BC%E3%82%BA'],
['Lotte', 'https://ja.wikipedia.org/wiki/%E5%8D%83%E8%91%89%E3%83%AD%E3%83%83%E3%83%86%E3%83%9E%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%BA'],
['Seibu', 'https://ja.wikipedia.org/wiki/%E5%9F%BC%E7%8E%89%E8%A5%BF%E6%AD%A6%E3%83%A9%E3%82%A4%E3%82%AA%E3%83%B3%E3%82%BA'],
['Orix', 'https://ja.wikipedia.org/wiki/%E3%82%AA%E3%83%AA%E3%83%83%E3%82%AF%E3%82%B9%E3%83%BB%E3%83%90%E3%83%95%E3%82%A1%E3%83%AD%E3%83%BC%E3%82%BA'],
['Rakuten', 'https://ja.wikipedia.org/wiki/%E6%9D%B1%E5%8C%97%E6%A5%BD%E5%A4%A9%E3%82%B4%E3%83%BC%E3%83%AB%E3%83%87%E3%83%B3%E3%82%A4%E3%83%BC%E3%82%B0%E3%83%AB%E3%82%B9'],
['Japan national football team', 'https://ja.wikipedia.org/wiki/%E3%82%B5%E3%83%83%E3%82%AB%E3%83%BC%E6%97%A5%E6%9C%AC%E4%BB%A3%E8%A1%A8'],
]
#Calculation
result = TFIDF.some_similarity(_base_url, data)
#Result display
result.sort(key=lambda x: x[2], reverse=True)
for title, url, value in result:
    print(title, value)
"""
Giants 0.437053886215
Yakult 0.399745780763
Hanshin 0.383247816027
Hiroshima 0.356147904333
Lotte 0.351312791912
Chunichi 0.344772305253
Yokohama 0.334360056622
Nippon-Ham 0.326226324436
Orix 0.317250711462
Softbank 0.285703674673
Seibu 0.283181229507
Rakuten 0.275111280558
Japan national football team 0.177026402257
"""
Looking at the results as a whole, the Central League teams have the highest similarity, the Pacific League teams come out lower, and the soccer article, which has nothing to do with baseball, is the lowest of all. Yamamoto pitched for the Chunichi Dragons for more than 30 years, yet surprisingly the closest match is not Chunichi but the Giants. His Wikipedia article contains many stories about games against the Giants, which seems to have raised the similarity with the Giants.
Hiroshima also ranks above Chunichi, presumably because of the many mentions of Mr. Koji Yamamoto, the manager of the "red helmet" Carp; it can be inferred that the shared surname Yamamoto pulled Hiroshima up. Yakult, Hanshin, and Lotte seem to rank above Chunichi mainly because of how often words such as "record", "victory", "baseball", "professional", and "player" appear.
simple_tfidf_japanese is a TF-IDF calculation module for Japanese only; it strips out all alphabetic characters as noise.
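The module's actual preprocessing isn't reproduced here, but the idea is roughly the following kind of filter (a sketch with a made-up function name, not the module's API):

import re

def strip_alphabet(text):
    # Sketch: remove half-width and full-width alphabet characters before tokenizing,
    # so strings like "NIIGATA" never end up counted as nouns. Digits and symbols are kept.
    return re.sub(r'[a-zA-Zａ-ｚＡ-Ｚ]', '', text)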
# Output TF-IDF from text (Get TF-IDF from text)
from simple_tfidf_japanese.tfidf import TFIDF

text = "Meat festival NIIGATA for a night full of meat ❤︎ Steak House Azuma-san's Yukimuro Aged Niigata Prefecture Beef Steak Delicious*\(^o^)/*Perfect for salt or wasabi!"
tfidf1 = TFIDF.gen(text, enable_one_char=1)
for key, value in tfidf1:
    print(key, value)
>>>Meat 0.0952380952381
>>>Steak 0.0952380952381
>>>0 0.047619047619
>>>Rice 0.047619047619
>>>Snow 0.047619047619
>>>Niigata 0.047619047619
>>>Aging 0.047619047619
...
# Output TF-IDF from the web (Get TF-IDF from Web)
url = "https://ja.wikipedia.org/wiki/%E6%B7%A1%E8%B7%AF%E3%83%93%E3%83%BC%E3%83%95"
tfidf2 = TFIDF.gen_web(url)
for key, value in tfidf2:
print key, value
>>>Awaji 0.0453257790368
>>>Beef 0.0396600566572
>>>Tajima 0.0198300283286
>>>Awaji Island 0.0169971671388
>>>Page 0.0169971671388
>>>Display 0.014164305949
# Calculate similarity with TF-IDF cosine similarity (calc TF-IDF Cosine Similarity)
tfidf1 = [['Apple', 1], ['Orange', 2], ['Banana', 1], ['Kiwi', 0]]
tfidf2 = [['Apple', 1], ['Orange', 0], ['Banana', 2], ['Kiwi', 1]]
print(TFIDF.similarity(tfidf1, tfidf2))
>>> 0.5
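To see where the 0.5 comes from (a hand check in plain Python, independent of the library): the two lists correspond to the count vectors (1, 2, 1, 0) and (1, 0, 2, 1), so cos θ = (1·1 + 2·0 + 1·2 + 0·1) / (√6 × √6) = 3 / 6 = 0.5.

import math

# Hand check of the value above, treating the counts as plain vectors
# in the order Apple, Orange, Banana, Kiwi.
v1 = [1, 2, 1, 0]
v2 = [1, 0, 2, 1]
dot = sum(a * b for a, b in zip(v1, v2))  # 1 + 0 + 2 + 0 = 3
norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))  # sqrt(6) * sqrt(6) = 6
print(dot / norm)  # 0.5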
Also have a look at the slides the author presented at PyCon 2015: "Morphological analysis made and learned in Python".