from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
It is assumed that input.txt contains one document per line.
with open('input.txt') as f:
    org_sentences = f.readlines()
Each document is split into words, which are then joined with single-byte spaces (wakati-gaki).
t = Tokenizer()
sentences = []
for s in org_sentences:
    tmp = ' '.join(t.tokenize(s, wakati=True))
    sentences.append(tmp)
This time, TF-IDF is used to vectorize the documents. Other methods such as BoW, LSI, LDA, averaged Word2Vec, Doc2Vec, averaged FastText, or BERT embeddings could be used instead; a BoW variant is sketched after the TF-IDF code below.
vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u'(?u)\\b\\w+\\b')
vecs = vectorizer.fit_transform(sentences)
v = vecs.toarray()
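As one example of swapping in a different vectorizer, here is a minimal BoW sketch using scikit-learn's CountVectorizer; the downstream clustering code is unchanged. The other methods listed above (LSI, LDA, Word2Vec, etc.) require additional libraries such as gensim and are not shown here.

from sklearn.feature_extraction.text import CountVectorizer

# Plain word counts instead of TF-IDF weights (same token pattern as above, by assumption)
bow_vectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
bow_vecs = bow_vectorizer.fit_transform(sentences)
v = bow_vecs.toarray()  # reuse the same variable name so the linkage step below works as-is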
Cosine distance, which is common in natural language processing tasks, is used as the distance between the vectors. Based on that distance, documents are grouped into clusters by hierarchical clustering (single-linkage, which is scipy's default method for linkage).
z = linkage(v, metric='cosine')
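To inspect the cluster hierarchy before choosing a threshold, the linkage result can be drawn as a dendrogram. This is a minimal sketch, assuming matplotlib is installed; the explicit method='single' only spells out the default linkage method used above.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Same clustering as above, with the default single-linkage method written out explicitly
z = linkage(v, method='single', metric='cosine')

# Plot the merge tree; the y-axis shows the cosine distance at which clusters merge
dendrogram(z)
plt.xlabel('document index')
plt.ylabel('cosine distance')
plt.show()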
This is an example in which the final clusters are determined using a distance of 0.2 as the threshold. If the number of documents becomes huge, computing the distances takes a considerable amount of time, so if you want to try multiple thresholds, it is worth saving the linkage result above with pickle once and reusing it. It is also possible to use the number of clusters as the threshold instead; both variants are sketched after the code below. The cluster number of each document is stored in group.
group = fcluster(z, 0.2, criterion='distance')
print(group)
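A minimal sketch of the two points above: caching the linkage result with pickle so different thresholds can be tried without recomputing the distances, and cutting the tree by a target number of clusters instead of a distance. The file name linkage.pkl and the cluster count of 5 are arbitrary examples.

import pickle

# Cache the expensive linkage result once
with open('linkage.pkl', 'wb') as f:
    pickle.dump(z, f)

# Later: reload it and try other thresholds without recomputing distances
with open('linkage.pkl', 'rb') as f:
    z = pickle.load(f)

group_by_distance = fcluster(z, 0.5, criterion='distance')  # another distance threshold
group_by_count = fcluster(z, 5, criterion='maxclust')       # cut into at most 5 clusters
print(group_by_distance)
print(group_by_count)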