from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
It is assumed that input.txt contains one document per line.
with open('input.txt') as f:
    org_sentences = f.readlines()
Each document is split into words, which are then joined with single-byte spaces (wakati-gaki).
t = Tokenizer()
sentences = []
for s in org_sentences:
    tmp = ' '.join(t.tokenize(s, wakati=True))
    sentences.append(tmp)
This time, TF-IDF is used to vectorize the documents. Other methods such as BoW, LSI, LDA, averaged Word2Vec, Doc2Vec, averaged FastText, or BERT embeddings could be used instead; a BoW variant is sketched after the TF-IDF code below.
vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u'(?u)\\b\\w+\\b')
vecs = vectorizer.fit_transform(sentences)
v = vecs.toarray()
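As one example of swapping in a different vectorizer, here is a minimal BoW sketch using scikit-learn's CountVectorizer; the downstream clustering code is unchanged. The other methods listed above (LSI, LDA, Word2Vec, etc.) require additional libraries such as gensim and are not shown here.

from sklearn.feature_extraction.text import CountVectorizer

# Plain word counts instead of TF-IDF weights (same token pattern as above, by assumption)
bow_vectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
bow_vecs = bow_vectorizer.fit_transform(sentences)
v = bow_vecs.toarray()  # reuse the same variable name so the linkage step below works as-is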
Cosine distance, which is common in natural language processing tasks, is used as the distance between the vectors. Based on that distance, documents are grouped into clusters by hierarchical clustering (single-linkage, which is scipy's default method for linkage).
z = linkage(v, metric='cosine')
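To inspect the cluster hierarchy before choosing a threshold, the linkage result can be drawn as a dendrogram. This is a minimal sketch, assuming matplotlib is installed; the explicit method='single' only spells out the default linkage method used above.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Same clustering as above, with the default single-linkage method written out explicitly
z = linkage(v, method='single', metric='cosine')

# Plot the merge tree; the y-axis shows the cosine distance at which clusters merge
dendrogram(z)
plt.xlabel('document index')
plt.ylabel('cosine distance')
plt.show()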
This is an example in which the final clusters are determined using a distance of 0.2 as the threshold. If the number of documents becomes huge, computing the distances takes a considerable amount of time, so if you want to try multiple thresholds, it is worth saving the linkage result above with pickle once and reusing it. It is also possible to use the number of clusters as the threshold instead; both variants are sketched after the code below. The cluster number of each document is stored in group.
group = fcluster(z, 0.2, criterion='distance')
print(group)
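A minimal sketch of the two points above: caching the linkage result with pickle so different thresholds can be tried without recomputing the distances, and cutting the tree by a target number of clusters instead of a distance. The file name linkage.pkl and the cluster count of 5 are arbitrary examples.

import pickle

# Cache the expensive linkage result once
with open('linkage.pkl', 'wb') as f:
    pickle.dump(z, f)

# Later: reload it and try other thresholds without recomputing distances
with open('linkage.pkl', 'rb') as f:
    z = pickle.load(f)

group_by_distance = fcluster(z, 0.5, criterion='distance')  # another distance threshold
group_by_count = fcluster(z, 5, criterion='maxclust')       # cut into at most 5 clusters
print(group_by_distance)
print(group_by_count)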