Let's use clustering to get a nice bird's-eye view of a text dataset

What this article covers

- An introduction to a package that makes clustering easier by combining several clustering methods: https://pypi.org/project/flexible-clustering-tree/
- A use case for that kind of clustering
- An example of actually doing that kind of clustering

Why do you need clustering?

When I am involved in data-related research and work, I often come across new data.

Since I don't yet understand what is in that new data, I first have to get a grasp of it before I can decide on research and analysis policies.

Clustering is what I reach for in such cases. [^1]

So what does such a clustering use case actually look like?

I have been doing natural language processing work in the survey research business. NLP in the research industry could just as well be called text mining.

The foundation of the research industry is data aggregation: the basic idea is to write a survey report based on aggregated data plus industry knowledge.

Naturally, the text data has to be aggregated too. In many cases, the unit of aggregation is a "label suited to the purpose of the survey".

There is no problem if that "label suited to the purpose of the survey" is known from the beginning.

However, with new text data you often don't even know what kind of labels make sense in the first place, so you have to come up with the labels themselves.

In that situation, it is much easier to think of labels if you can get a quick overview of the data through clustering.

Isn't there a method other than clustering?

Of course, approaches other than clustering are also possible.

For example, the following approaches are possible.

  1. Sample the data and label it by hand. In some cases it is enough to sample just as many items as you need for a confidence interval.
  2. Use one of the various text mining tools. KH Coder is a long-established, solid piece of software; I use it from time to time myself.

Each has its own advantages and disadvantages, so the method that fits your use case is the best one.

That said, when I get tired of weighing all that, my rough default is to "just observe the data with clustering for now".

Clustering? Isn't KMeans good enough?

I think the combination of vectorized text + KMeans is the classic recipe that has been used for a long time.
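For reference, a minimal sketch of that classic baseline with scikit-learn (the documents here are placeholders):

# Classic baseline: vectorize the text with TF-IDF, then run flat KMeans.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    'smartphone app released for android',
    'national team wins the soccer match',
    'new smartphone model announced',
]  # placeholder documents (already tokenized / space-separated)

X = TfidfVectorizer().fit_transform(documents)  # (n_documents, n_terms) sparse matrix
cluster_ids = KMeans(n_clusters=2, random_state=0).fit_predict(X)  # one flat cluster id per document
print(cluster_ids)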

However, this is not the best combination.

- A huge cluster like "other" forms, but when I look at its contents, it feels like it __could still be divided further__.
- Clustering with a large number of clusters from the start makes the result hard to interpret. __I wish the first pass could split roughly and the second pass a little more finely.__
- There are features I particularly want the clustering to pay attention to. __I wish the first and second clustering passes could use different features.__
- Interpreting the result after clustering is a pain. __I wish it could be visualized nicely.__

The "I wish..." parts are the ones I emphasized.

Writing a clustering program that does all of this yourself is rather tedious. Honestly, it's a pain.

So, I made a package that does such clustering.

Introducing flexible_clustering_tree

This package does the following, for example:

- The first pass splits roughly (number of clusters = 3) and the second pass clusters a little more finely (number of clusters = 8).
- The first pass splits roughly with KMeans and the second pass clusters with DBSCAN, taking the distribution into account.
- For a text dataset, the first pass splits on title features and the second pass splits on body-text features.
- After clustering, the tree structure can be visualized with D3.js.

For example, the image below shows the result of clustering under the following conditions.

- The dataset is the 20 Newsgroups dataset.
- The first clustering used only the title of each news text as the feature (average of word embeddings); the second clustering used bag-of-words features of the news body.
- HDBSCAN was used for the first clustering and KMeans for the second.
- After clustering, the result was written to html and visualized.

Isn't bottom-up hierarchical clustering good enough?

Bottom-up hierarchical clustering is not bad either; in fact, for some use cases it is the better choice.

However, bottom-up hierarchical clustering has the drawback that it becomes computationally hard to run once the amount of data gets huge.

The idea behind this package is essentially the reverse of bottom-up hierarchical clustering: it splits the data from the top down.

I tried it with the Livedoor News Corpus

There is a dataset called the Livedoor News Corpus, published by RONDHUIT Co., Ltd.

The Livedoor News Corpus is organized by news category, but here let's pretend that the category labels don't exist.

Just "trying it" isn't very interesting, so I wanted an analysis story to go with it. I therefore made up the following scenario, which is close to a real business use case.

- You are the person who handles data analysis (or something like it) at a company that runs a web service.
- One day, the following request comes down from somewhere:
- "Articles from the news service we operate have been piling up. We would like to tag the news articles so they are easier for users to search."
- "But if we get it wrong, the tagging may only confuse users."
- "Look at the content of the news articles and come up with good tags. Thanks ☆"
- You actually have other tasks and can't afford to spend much time on this request.

~~Ah... this use case might actually have been my own situation...~~

What procedure should we use to observe the data?

I will follow the steps below.

  1. Data preprocessing, i.e., word segmentation.
  2. Turn the text into feature vectors.
  3. Run the clustering.
  4. Visualize the contents as a D3.js tree structure.
  5. Prepare data that explains each cluster.

Data preprocessing and word splitting

The Livedoor News Corpus has two types of text: the title and the body.

This time, let's treat these two as separate features.

Words are split with MeCab. This script does the preprocessing.
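As a rough sketch of what that word splitting looks like (a minimal example assuming the mecab-python3 package; the actual preprocessing script handles more details, and the variable names here are placeholders):

# Minimal sketch of word splitting (wakati-gaki) with MeCab.
import MeCab

tagger = MeCab.Tagger('-Owakati')  # output surface forms separated by spaces

def tokenize(text):
    # Split a Japanese sentence into a list of words.
    return tagger.parse(text).strip().split()

titles = ['サンプルのタイトルです']     # placeholder input
documents = ['サンプルの本文です。']    # placeholder input
title_morphs = [tokenize(t) for t in titles]
document_morphs = [tokenize(d) for d in documents]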

Text feature quantification

The Livedoor News Corpus has two types of text: the title and the body.

First, the titles. Titles are short, and at this length, averaging word embeddings should work fine.

Next, the body text. The body is fairly long, and averaging word embeddings over text of this length feels too crude.

Doc2vec takes time and effort to train a model, and preparing compute resources that can run BERT quickly is also a hassle.

There are other options for embedding documents as vectors, but __above all, I don't have much time__.

So, for the body, I use the classic approach: a word frequency matrix plus matrix compression.
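As a sketch of how the two feature matrices could be built (assuming gensim 4.x and scikit-learn; a word2vec model trained on the titles themselves is only one possible choice of embedding, and the names title_vectors / low_dim_matrix match the variables used in the clustering code below):

# Title features: average of word embeddings. Body features: word frequency matrix compressed with SVD.
# title_morphs / document_morphs are the tokenized texts from the preprocessing step.
import numpy
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

w2v_model = Word2Vec(sentences=title_morphs, vector_size=100, min_count=1)
title_vectors = [
    numpy.mean([w2v_model.wv[w] for w in words], axis=0)  # average the word vectors of one title
    for words in title_morphs
]

tfidf_matrix = TfidfVectorizer().fit_transform(' '.join(words) for words in document_morphs)
low_dim_matrix = TruncatedSVD(n_components=100).fit_transform(tfidf_matrix)  # compress to 100 dims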

The whole flow is in this script.

Performing clustering

Here I will explain while showing the code. The whole flow is in this script.

First, set up the feature matrices.

The title matrix for the first pass is title_vectors. The body matrix for the second pass is low_dim_matrix.

Both are matrices of shape (number of documents × number of feature dimensions).

When wrapping a matrix in flexible_clustering_tree.FeatureMatrixObject, specify the pass in which you want it to be used with the level argument.

import flexible_clustering_tree

# various preprocessing is omitted here
feature_1st_layer = flexible_clustering_tree.FeatureMatrixObject(level=0, matrix_object=numpy.array(title_vectors))
feature_2nd_layer = flexible_clustering_tree.FeatureMatrixObject(level=1, matrix_object=low_dim_matrix)

Next, combine these two FeatureMatrixObjects into one.

The dict_index2attributes argument can store auxiliary, explanatory information about each data point (optional). It is not used as a feature at all; it is simply displayed in the tree you visualize later, which makes interpretation easier. Here, the title, body, and category label are stored.

The text_aggregation_field argument can take a two-dimensional list of words, [[word]] (optional). Words are aggregated from this information, and the aggregated counts are displayed in the visualized tree, which also makes interpretation easier. The outer dimension is the number of documents and the inner dimension is the number of words (which may vary per document).

multi_matrix_obj = flexible_clustering_tree.MultiFeatureMatrixObject(
    matrix_objects=[feature_1st_layer, feature_2nd_layer],
    dict_index2label={i: label for i, label in enumerate(livedoor_labels)},
    dict_index2attributes={i: {
        'file_name': livedoor_file_names[i],
        'document_text': ''.join(document_text[i]),
        'title_text': ''.join(title_text[i]),
        'label': livedoor_labels[i]
    } for i, label in enumerate(livedoor_labels)},
    text_aggregation_field=document_morphs_text_aggregation
)

Next, specify the clustering methods. For the first pass, let's have HDBSCAN do the splitting while taking the overall distribution into account.[^2] Since HDBSCAN does not take a number of clusters, specify n_cluster=-1.

from hdbscan import HDBSCAN
clustering_operator_1st = flexible_clustering_tree.ClusteringOperator(level=0, n_cluster=-1, instance_clustering=HDBSCAN(min_cluster_size=3))

For the second pass, I split into 8 clusters.

from sklearn.cluster import KMeans
clustering_operator_2nd = flexible_clustering_tree.ClusteringOperator(level=1, n_cluster=8, instance_clustering=KMeans(n_clusters=8))

Next, combine these two ClusteringOperator objects into one.

multi_clustering_operator = flexible_clustering_tree.MultiClusteringOperator([clustering_operator_1st, clustering_operator_2nd])

Then, perform clustering.

If you specify max_depth=3, the tree is split down to a depth of 3. If a cluster cannot be split any further, splitting stops automatically.

For passes beyond those explicitly specified, the clustering condition given last is reused.

In other words, the deeper levels also cluster the body-text features with KMeans.

# run flexible clustering
clustering_runner = flexible_clustering_tree.FlexibleClustering(max_depth=3)
index2cluster_no = clustering_runner.fit_transform(multi_matrix_obj, multi_clustering_operator)

Visualize the tree and save it in html.

html = clustering_runner.clustering_tree.to_html()
with open(PATH_OUTPUT_HTML, 'w') as f:
    f.write(html)

I want to aggregate the data later, so let's also write the tables out as tsv.

# You can get table information for aggregation purposes
import pandas
table_information = clustering_runner.clustering_tree.to_objects()
pandas.DataFrame(table_information['cluster_information']).to_csv('cluster_relation.tsv', sep='\t')
pandas.DataFrame(table_information['leaf_information']).to_csv('leaf_information.tsv', sep='\t')
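As a quick sanity check of the exported table, you could cross-tabulate the cluster numbers against the original category labels with pandas (a sketch; the column names 'cluster_number' and 'category' are assumptions, so check the actual tsv header first):

# Sketch: compare cluster assignments with the original Livedoor category labels.
# Column names below are assumptions; inspect leaf_information.tsv before running.
import pandas
leaf_df = pandas.read_csv('leaf_information.tsv', sep='\t')
print(pandas.crosstab(leaf_df['cluster_number'], leaf_df['category']))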

Visualization of contents with D3.js tree structure

This cluster is the result of splitting on the title features only. The word aggregation information is displayed next to the node in the red frame. From words like apps, Android, and Google, it appears to be a topic about Android smartphones.

(001.png)

Next, as you can see from the word aggregation of this cluster, it is clearly content from the single-women column. A tag like "#I want to marry a rich man" might fit.

(002.png)

This tree is sorted by cluster size from top to bottom. Let's look at the topmost cluster. The cluster size is written in the data-id field; it seems 6,515 documents fall into this cluster. It looks like an "other" cluster, which is common with density-based clustering such as HDBSCAN.

(003.png)

Now let's look inside that "other" cluster, which is split by KMeans. The image shows a cluster with a lot of sports news. From the words, the content can be inferred to be something like news about Japan's national sports teams.

(004.png)

So, as a first pass, I was able to confirm the rough contents of the clustering result in this way.

Preparing data that explains the clusters

You can inspect the tree with your own eyes, but that gets a bit tedious. Since the goal is to come up with tags, we need information that is actually useful for thinking about tags.

Furthermore, given the context of this request, there is a good chance we will also be asked for "data or material that lets the service management team understand the reasoning behind each tag".

As the person in charge of data analysis, you can't answer "I don't know" when the service management team asks why a tag was assigned. You want to be able to say "it's based on this data" (pushes up glasses).

Therefore, consider the following strategy.

  1. Extract feature words for each cluster with TF-IDF and export them to an Excel file.
  2. In the Excel file, write down the tags inferred from the feature words (manual work).

First, I wrote the TF-IDF weighting results to a csv. The label column is the cluster number. Cluster 374 looks like clothing-related content.
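Here is a rough sketch of how such a per-cluster feature-word table could be produced (assuming a recent scikit-learn; the table is assumed to have one row per document with a cluster-number column and a column of space-joined words, and the column names 'cluster_number' and 'morphs' are placeholders):

# Sketch: pick feature words per cluster with TF-IDF and export them for manual tagging.
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer

df = pandas.read_csv('leaf_information.tsv', sep='\t')
# Treat each cluster as one big document so that TF-IDF highlights cluster-specific words.
cluster_texts = df.groupby('cluster_number')['morphs'].apply(' '.join)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(cluster_texts)
terms = vectorizer.get_feature_names_out()

rows = []
for label, weights in zip(cluster_texts.index, tfidf.toarray()):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:10]]  # ten highest-weighted words
    rows.append({'label': label, 'feature_words': ' '.join(top_terms), 'tag': ''})

pandas.DataFrame(rows).to_csv('cluster_feature_words.csv', index=False)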

(スクリーンショット 2019-10-31 16.03.47.png)

So I write "clothing" in the tag column.

(スクリーンショット 2019-10-31 16.06.33.png)

You just have to repeat this work.

When I actually tried it, I was able to guess the tag in 20 to 30 seconds per cluster.

Since there are 348 clusters this time, the work should finish in 348 clusters × 30 seconds = 10,440 seconds ≈ 174 minutes.

Even allowing for breaks ~~and YouTube watching time~~, the tag-guessing work should be done in 4-6 hours.

With that, it is easy to set up a work schedule, and easier to negotiate: "Hey boss, the tagging work will take about 8 hours, is that okay?" (working time + buffer time + slack time).

After the tag-guessing work, put together a nice slide deck and hand it to the service management team, and they will surely be grateful.


That whole scenario was just a story in my head.

Anyway, that was my introduction to the flexible_clustering_tree package.

The code used for this story can be found in this repository.

Where this package still falls short

There is still plenty of room for improvement in this package. PRs are welcome.

For example:

- When a large number of clusters occurs, processing slows down. The breadth-first search is written as a while loop, so the runtime tends to grow in proportion to the number of clusters.
- The D3.js-based tree visualization is not great. I'm weak on the front end, so I can't make a nicer tree.

[^1]: This is also called exploratory data analysis.
[^2]: An evolved version of DBSCAN. The algorithm is less computationally expensive than DBSCAN, and the implementation is cleverly designed and fast.
