--LDA = Latent Dirichlet Allocation
In LDA, each word in a document is assumed to belong to a hidden topic (a latent category), and the document is assumed to be generated from those topics according to a probability distribution; the model infers which topics the document belongs to.
--Paper: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
--α: Dirichlet parameter for the per-document topic distribution
--β: parameter for the per-topic word distributions
--θ: multinomial topic-mixture parameter for a document
--w: word
--z: topic
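For reference, the generative process in the paper gives the joint distribution of a topic mixture θ, a set of N topics z, and N words w for one document as

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)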
This time, we will use LDA to see whether documents can be categorized by topic.
Validation with 20 Newsgroups
--A dataset of approximately 20,000 documents across 20 categories
--The 20 categories are as follows (they can also be listed programmatically; see the sketch after the list)
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
misc.forsale
soc.religion.christian
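As a side note, the same list can also be obtained programmatically from the dataset's target_names attribute:

from sklearn.datasets import fetch_20newsgroups
# target_names holds the 20 category names listed above
print(fetch_20newsgroups(subset='train').target_names)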
--This time, we use the following 4 categories
--'rec.sport.baseball': baseball
--'rec.sport.hockey': hockey
--'comp.sys.mac.hardware': Mac computers
--'comp.windows.x': the X Window System
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import mglearn
import numpy as np
#data
categories = ['rec.sport.baseball', 'rec.sport.hockey', \
'comp.sys.mac.hardware', 'comp.windows.x']
twenty_train = fetch_20newsgroups(subset='train',categories=categories, \
shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test',categories=categories, \
shuffle=True, random_state=42)
tfidf_vec = TfidfVectorizer(lowercase=True, stop_words='english', \
max_df = 0.1, min_df = 5).fit(twenty_train.data)
X_train = tfidf_vec.transform(twenty_train.data)
X_test = tfidf_vec.transform(twenty_test.data)
feature_names = tfidf_vec.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
#print(feature_names[1000:1050])
#print()
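# Optional sanity check (an added sketch, not part of the original run):
# the TF-IDF matrices are documents x vocabulary, so the shapes show how
# many documents and terms survived the max_df/min_df filtering.
print(X_train.shape)  # (n_train_documents, n_features)
print(X_test.shape)   # (n_test_documents, n_features)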
# train
topic_num=4
lda =LatentDirichletAllocation(n_components=topic_num, max_iter=50, \
learning_method='batch', random_state=0, n_jobs=-1)
lda.fit(X_train)
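For reference, lda.components_ holds unnormalized pseudo-counts per (topic, word) pair; a minimal sketch to turn each row into a proper topic-word probability distribution:

# Normalize each row of components_ so it sums to 1
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word.shape)  # (topic_num, n_features)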
Check the learned topics below by printing the top words per topic.
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
mglearn.tools.print_topics(topics=range(topic_num),
                           feature_names=np.array(feature_names),
                           topics_per_chunk=topic_num,
                           sorting=sorting, n_words=10)
topic 0 topic 1 topic 2 topic 3
-------- -------- -------- --------
nhl window mac wpi
toronto mit apple nada
teams motif drive kth
league uk monitor hcf
player server quadra jhunix
roger windows se jhu
pittsburgh program scsi unm
cmu widget card admiral
runs ac simms liu
fan file centris carina
--topic 1: the X Window System (comp.windows.x)
--topic 2: Mac computers
--topic 0: baseball and hockey; the two sports could not be separated as expected
--topic 3: computer-related? It could not be classified as expected
Topic 1 and topic 2 appear to have been learned cleanly at the training stage.
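As an optional extra check (not part of the original run), scikit-learn's LatentDirichletAllocation also exposes a perplexity() method; lower values on held-out data generally indicate a better fit. A minimal sketch:

# Perplexity of the held-out test documents under the trained model
print(lda.perplexity(X_test))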
For the inference data, I borrowed text from the English Wikipedia article on Apple, assigning two excerpts to text11 and text12.
text11="an American multinational technology company headquartered in Cupertino, "+ \
"California, that designs, develops, and sells consumer electronics,"+ \
"computer software, and online services."
text12="The company's hardware products include the iPhone smartphone,"+ \
"the iPad tablet computer, the Mac personal computer,"+ \
"the iPod portable media player, the Apple Watch smartwatch,"+ \
"the Apple TV digital media player, and the HomePod smart speaker."
Perform inference below
# predict
test1 = [text11, text12]
X_test1 = tfidf_vec.transform(test1)
lda_test1 = lda.transform(X_test1)  # document-topic distributions
for i, topic_dist in enumerate(lda_test1):  # don't shadow the lda model
    print("### ", i)
    topicid = [j for j, x in enumerate(topic_dist) if x == max(topic_dist)]
    print(test1[i])  # the document being classified
    print(topic_dist, " >>> topic", topicid)
    print("")
### 0
an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics,computer software, and online services.
[0.06391161 0.06149079 0.81545564 0.05914196] >>> topic [2]
### 1
The company's hardware products include the iPhone smartphone,the iPad tablet computer, the Mac personal computer,the iPod portable media player, the Apple Watch smartwatch,the Apple TV digital media player, and the HomePod smart speaker.
[0.34345051 0.05899806 0.54454404 0.05300738] >>> topic [2]
Both sentences about Apple (Mac) were inferred to belong most strongly to topic 2 (Mac computers), so it can be said they were classified correctly.
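As a small follow-up, the inferred topic ids can be mapped to human-readable labels; the label strings below are my own, based on the top-words inspection above:

# Hypothetical labels chosen from the earlier top-words table
topic_labels = {0: "baseball/hockey", 1: "X Window System", 2: "Mac", 3: "(unclear)"}
for i, topic_dist in enumerate(lda_test1):
    print(i, "->", topic_labels[int(np.argmax(topic_dist))])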