Saito problem

Is there a "Saito-san" near you?
Is that Saito-san's name written 斎藤, 斉藤, 齋藤, or 齊藤? Which "sai" is it?
Exactly! I can never remember which kanji it was. This is commonly known as the "Saito problem".

What to do

Better yet, I thought: what if I could decide on one representative "sai" character and unify everything to it?

This article walks through **how I went about choosing a representative Saito** based on **the shape of the kanji**.

I use dimensionality reduction (UMAP) and clustering (mainly KMeans).

This was a stay-at-home Golden Week independent-study project. It is just for fun. It is just for fun. I will say it again: it is just for fun.

The "sai" of Saito recognized as kanji

There are four types (two new-style forms and two old-style forms). According to Toyo Keizai Online:

① 斎 is the original form.

② 齋 is the old-style form of the original (①).

③ 斉 is a miswriting of the new-style form (①). (Amazing fact 1)

④ 齊 is a miswriting of the old-style form (②). (Amazing fact 2)

In the table below, the figures in parentheses are the number of people in Japan with each spelling. **① 斎, the original form, is the most common.**

| | New-style form | Old-style form |
|---|---|---|
| Original | ① 斎 U+658E (542,000 people): the original | ② 齋 U+9F4B (86,800 people): old-style form of the original (①) |
| Actually a miswriting | ③ 斉 U+6589 (323,000 people): miswriting of the original (①) | ④ 齊 U+9F4A (37,300 people): miswriting of the old-style form (②) |
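The code points in the table can be turned back into characters with Python's built-in `chr`, which is handy when the glyphs do not render. The sketch below uses only the code points and head counts quoted above; the dictionary name `SAI_VARIANTS` is my own label for illustration, not something from the original analysis.

```python
# Code points and head counts as quoted in the table above.
# SAI_VARIANTS is a hypothetical name used only for this illustration.
SAI_VARIANTS = {
    "①": (0x658E, 542_000),
    "②": (0x9F4B, 86_800),
    "③": (0x6589, 323_000),
    "④": (0x9F4A, 37_300),
}

total = sum(count for _, count in SAI_VARIANTS.values())

for label, (code_point, count) in SAI_VARIANTS.items():
    share = count / total  # fraction of all four spellings combined
    print(f"{label} U+{code_point:04X} {chr(code_point)}藤: {count:,} people ({share:.1%})")
```

The shares are relative to these four spellings only, since those are the only counts given.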

How I feel about it

After all, I want **斎** (①, the original form) to be the **representative of all four "sai" characters**, sitting right in the middle.

So let's check

To confirm which character is the **representative (the middle)**, I want to draw a map of the Saito characters.

However, each image is 58x58 = 3,364 pixels (3,364 dimensions), so it cannot be plotted directly on XY coordinates (2 dimensions).

Therefore, I use a technique called **dimensionality reduction** to compress **3,364 dimensions ⇒ 2 dimensions**.
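To make the "3,364 dimensions" concrete: each 58x58 image is flattened into one long vector, and one such vector per character becomes a row of the matrix that is handed to the dimensionality-reduction step. A minimal sketch with random placeholder images (the real glyph images are not reproduced here, and `images` is a hypothetical variable name):

```python
import numpy as np

h, w = 58, 58   # image size stated in the article
n_chars = 4     # e.g. the four recognized "sai" characters

# Placeholder data: random grayscale images standing in for the real glyphs.
rng = np.random.default_rng(42)
images = rng.integers(0, 256, size=(n_chars, h, w), dtype=np.uint8)

# Flatten each image into a single row vector of h * w pixel values.
flat = images.reshape(n_chars, h * w)
print(flat.shape)  # (4, 3364)

# The flattening is lossless: reshaping back recovers the original images.
restored = flat.reshape(-1, h, w)
```

Each row of `flat` is one point in 3,364-dimensional space, which is what gets compressed down to an (x, y) pair.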

Dimensionality reduction of characters is covered in this article, so I will link to it.

This time, **UMAP** is used as the dimensionality reduction algorithm.

First, reduce the four recognized characters to two dimensions with UMAP.

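```python
from umap import UMAP

# UMAP reducer: compress to 2 dimensions
decomp = UMAP(n_components=2, random_state=42)

# `all` is the author's pixel matrix (one column per character);
# note that the name shadows Python's built-in all().
# fit_transform on the data for the four recognized "sai" characters
embedding4 = decomp.fit_transform(all.T[[1, 12, 31, 32]])
```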

Verification 1) Choose the representative of the four Saito characters

Using UMAP, map the **kanji images** onto a **2D plane** and check which one is the **"representative"**.

As the "representative", look at the **centroid of all the data** rather than the "center" (0.5, 0.5).

The centroid is drawn as an **X mark**. So, how does it look? (It sits below the centroid X.)

```python
from sklearn.cluster import KMeans

# Clustering with a single cluster, so the cluster center is the centroid of all points
clustering = KMeans(n_clusters=1, random_state=42)

# fit_predict returns the cluster label of each point
cl_y = clustering.fit_predict(embedding4)

# Visualize (showScatter is defined at the end of the article)
showScatter(
    embeddings=embedding4,
    clusterlabels=cl_y,
    centers=clustering.cluster_centers_,
    imgs=all.T[[1, 12, 31, 32]].reshape(-1, h, w),
)
```

**Hmm, it is a close call.**

Computing the Euclidean distance from the centroid to each character gives the following.

In this result, **② 齋, the old-style form of the original,** came out as the representative...

| Rank (closest to centroid first) | Character | Distance from centroid | Note |
|---|---|---|---|
| 1st | 齋 | 0.6281 | ② Original (old-style form) |
| 2nd | 斉 | 0.6889 | ③ Miswriting (new-style form) |
| 3rd | 斎 | 0.7339 | ① Original (new-style form) |
| 4th | 齊 | 0.8743 | ④ Miswriting (old-style form) |
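With a single cluster, the KMeans center is simply the mean of all points, so a ranking like this boils down to: take the mean, then sort the characters by Euclidean distance from it. A minimal NumPy sketch, using made-up 2D coordinates rather than the real UMAP output (`points` and `names` are hypothetical):

```python
import numpy as np

# Made-up 2D coordinates standing in for the UMAP embedding of four characters.
names = ["A", "B", "C", "D"]
points = np.array([
    [1.0, 1.0],
    [2.0, 0.0],
    [0.0, 3.0],
    [5.0, 4.0],
])

# Centroid = mean of all points (what a 1-cluster k-means converges to).
centroid = points.mean(axis=0)

# Euclidean distance from the centroid to every point.
distances = np.linalg.norm(points - centroid, axis=1)

# Indices sorted from closest to farthest.
order = np.argsort(distances)
for rank, idx in enumerate(order, start=1):
    print(f"{rank}. {names[idx]}  distance={distances[idx]:.4f}")
```

The first entry of `order` is the "representative" in the sense used here: the character closest to the centroid.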

Verification 2) Choose the "representative" of the 33 Saito characters

By the way, how many kinds of "sai" are there?

Only four are recognized as kanji, but to tell the truth, according to Wikipedia

there are 31 variant forms in addition to two of the "sai" characters.

On the other hand, the Ministry of Justice recognizes only four "sai" characters: 斎, 齋, 斉, and 齊.

In other words, **of all 33 patterns, only 4 are accepted as kanji**.

So, beyond the forms recognized as kanji, I would like to find the **"representative" of all 33 Saito characters**.

Now, reduce all 33 characters with UMAP.

```python
from umap import UMAP

# UMAP reducer: compress to 2 dimensions
decomp = UMAP(n_components=2, random_state=42)

# fit_transform on the data for all 33 characters
embeddings = decomp.fit_transform(all.T)
```

What is the "representative" of the 33 Saito characters?

As before, look for the kanji closest to the centroid of the UMAP embedding.

```python
from sklearn.cluster import KMeans

# Clustering (number of clusters: 1)
clustering = KMeans(n_clusters=1, random_state=42)

# fit_predict returns the cluster label of each point
cl_y = clustering.fit_predict(embeddings)

# Visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
```

Instead of the expected 斎, a different character ended up closest to the representative (middle) position...

The characters closest to the centroid (top four) are as follows. This is not going as expected, lol.

| Rank (closest to centroid first) | Character | Distance from centroid | Note |
|---|---|---|---|
| 1st | | 0.494 | |
| 2nd | | 0.787 | |
| 3rd | | 1.013 | |
| 4th | | 1.014 | |

Verification 3) Choose 4 representative "Saito" characters

The "dead center" plan did not work out, but **four types** are accepted as kanji.

So, what if the kanji on this map are split into 4 clusters: which kanji is at the centroid of each cluster?

In other words, I would like to **choose 4 representative characters** from all 33.

Using the KMeans clustering algorithm, the map is divided into 4 clusters as follows.

```python
from sklearn.cluster import KMeans

# Clustering (number of clusters: 4)
clustering = KMeans(n_clusters=4, random_state=42)

# fit_predict returns the cluster label of each point
cl_y = clustering.fit_predict(embeddings)

# Visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
```

The characters in each cluster, and the character nearest each centroid, are as follows.

Somehow the clusters seem to capture features of the kanji (such as the 月 and 示 components).

Whether the characters nearest each centroid capture the features of their cluster is debatable.

Four clusters are not enough to separate everything, and the red cluster contains several patterns.

**It seems the characters need to be split a little further.**

At a glance, **doubling to 8 clusters** looks like it would separate them cleanly.

| No | Cluster | Centroid character | Other characters in the cluster |
|---|---|---|---|
| 1 | Red | | |
| 2 | Orange | | |
| 3 | Blue | | |
| 4 | Green | | |
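Picking the "centroid character" of each cluster amounts to finding, for every cluster center, the data point closest to it. A minimal NumPy sketch with made-up points and labels (in the article these roles are played by `embeddings`, `cl_y`, and `clustering.cluster_centers_`; the helper name `nearest_to_centers` is hypothetical):

```python
import numpy as np

def nearest_to_centers(points, centers):
    """Return, for each center, the index of the closest point."""
    # diff[c, p] is the vector from center c to point p.
    diff = points[np.newaxis, :, :] - centers[:, np.newaxis, :]
    dist = np.linalg.norm(diff, axis=2)  # shape: (n_centers, n_points)
    return dist.argmin(axis=1)

# Made-up 2D points forming two obvious groups.
points = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],        # cluster 0
    [10.0, 10.0], [11.0, 10.0], [10.0, 12.0],  # cluster 1
])
labels = np.array([0, 0, 0, 1, 1, 1])

# Cluster centers are the per-cluster means, as in k-means.
centers = np.array([points[labels == k].mean(axis=0) for k in np.unique(labels)])

representatives = nearest_to_centers(points, centers)
print(representatives)
```

Each entry of `representatives` is the index of the point that would be shown as that cluster's representative character.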

Verification 4) Try clustering into 8

Earlier, I used 4 clusters in order to choose 4 representative kanji.

However, some clusters did not separate cleanly, so let's set the number of clusters to 8.

The results are as follows.

```python
from sklearn.cluster import KMeans

# Clustering (number of clusters: 8)
clustering = KMeans(n_clusters=8, random_state=42)

# fit_predict returns the cluster label of each point
cl_y = clustering.fit_predict(embeddings)

# Visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
```

They are not separated perfectly, but it does feel like they have been sorted.

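| No | Cluster | Characters in the cluster |
|---|---|---|
| 1 | Pink | |
| 2 | Red | |
| 3 | Brown | |
| 4 | Gray | |
| 5 | Orange | |
| 6 | Blue | |
| 7 | Purple | |
| 8 | Green | |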

Unfortunately, for one character in the orange cluster and one in the gray cluster (![28.jpg](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/183826/0ba6a86b-3e46-d8a4-d2ac-6443ecda4d83.jpeg)), the clustering breaks down.

However, both sit on a cluster boundary, so I can see what the algorithm was going for (lol).

Verification 5) Check how many clusters are appropriate

I started with 4 clusters to match the 4 characters registered as kanji.

Then, after looking at the 4-cluster result, I tried splitting into 8 clusters.

So, in the end, **how many clusters is appropriate?**

To choose the number of clusters, I would like to visualize and examine the clustering with the following three methods.

Elbow chart

Silhouette chart

Dendrogram

Elbow Chart

The elbow chart plots the **variation of the data within each cluster** on the vertical axis against the **number of clusters** on the horizontal axis.

Increasing the number of clusters reduces the variation, but too many clusters is a problem too.

So the chart is used to find a **number of clusters that is reasonably small** yet **still reduces the variation in the data**.
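The "variation within each cluster" on the vertical axis can be made concrete as the sum of squared distances from each point to its cluster mean. My understanding is that this is what Yellowbrick's default `distortion` metric reports, but treat that as an assumption. The sketch below computes the quantity by hand for fixed, made-up labelings, so it illustrates the measure rather than running KMeans; `within_cluster_ss` is a hypothetical helper:

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Four made-up points: two tight pairs that are far apart from each other.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = within_cluster_ss(points, np.array([0, 0, 0, 0]))
two_clusters = within_cluster_ss(points, np.array([0, 0, 1, 1]))
four_clusters = within_cluster_ss(points, np.array([0, 1, 2, 3]))
print(one_cluster, two_clusters, four_clusters)
```

Going from one to two clusters removes most of the variation, while going further gains comparatively little; that bend in the curve is the "elbow".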

Yellowbrick will be used for drawing.

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

vis = KElbowVisualizer(
    KMeans(random_state=42),
    k=(1, 34),  # Number of clusters (range on the horizontal axis)
)
vis.fit(embeddings)
vis.show()
```

My reading of the chart:

Up to 5 clusters, the **variation in the data (average) decreases**, but after that it flattens out.

So **splitting into 5 clusters** looks good, which means **5 representative kanji** looks good.

But let's also look at a zoomed-in version (4 to 18 clusters).

There does seem to be an inflection point at 5, but the curve only becomes flat from **around 10**.

In other words, **splitting into 8 clusters and choosing 8 representative kanji** does not seem wrong either.

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

vis = KElbowVisualizer(
    KMeans(random_state=42),
    k=(4, 19),  # Number of clusters (range on the horizontal axis)
)
vis.fit(embeddings)
vis.show()
```

Silhouette Chart

The silhouette chart shows the following for each cluster.

Vertical axis (thickness of each band): number of samples in the cluster

Horizontal axis (length of each band): silhouette coefficient of the cluster

Dashed line: average silhouette coefficient

When reading it, the point is to find the number of clusters that satisfies the following.

All clusters have about the same number of samples = the bands have the same thickness

The silhouette coefficient of every cluster is close to the average = the band lengths are close to the dashed line
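For reference, the silhouette coefficient of a single sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. A minimal NumPy sketch with made-up data (`silhouette_values` is a hypothetical helper that assumes at least two clusters; it is not Yellowbrick's implementation):

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-sample silhouette coefficients; assumes at least two clusters."""
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    clusters = np.unique(labels)
    values = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster gets a coefficient of 0
        # a: mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == k].mean() for k in clusters if k != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

# Two made-up, well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

values = silhouette_values(points, labels)
print(values, values.mean())
```

Values near 1 mean a sample sits comfortably inside its cluster; values near 0 or below mean it is on a boundary or possibly in the wrong cluster.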

We will also use Yellowbrick for drawing.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from yellowbrick.cluster import silhouette_visualizer

fig = plt.figure(figsize=(15, 25))

# Draw the charts for 4 to 9 clusters together
for i in range(4, 10):
    ax = fig.add_subplot(4, 2, i - 1)
    silhouette_visualizer(KMeans(i), embeddings)
```

As you can see, the pattern on the upper right (**5 clusters**) looks good.

Dendrogram

This chart expresses the **closeness** between clusters in the form of a tournament bracket.

It is a diagram for hierarchical clustering, so SciPy's hierarchical clustering is used instead of KMeans.

It is read as follows.

Leaves are data points, and branches of the same color belong to the same cluster

Height is the distance between clusters

```python
from scipy.cluster.hierarchy import linkage, dendrogram

# Hierarchical clustering on the 2D embedding
Z = linkage(
    y=embeddings,
    method='weighted',
    metric="euclidean",
)

R = dendrogram(
    Z=Z,
    color_threshold=1.2,  # Adjust the number of clusters with this threshold
    show_contracted=False,
)
```
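The `color_threshold` in the dendrogram corresponds to cutting the tree at a given height. As a sketch of how that cut maps to flat cluster labels, SciPy's `fcluster` with `criterion='distance'` can be applied to the same kind of linkage matrix. The toy points below are made up, and the threshold of 1.2 is simply reused from the call above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two made-up, well-separated groups of 2D points.
points = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
    [10.0, 10.0], [10.5, 10.0], [10.0, 10.5],
])

# Same linkage settings as in the article: WPGMA ('weighted') on Euclidean distances.
Z = linkage(points, method="weighted", metric="euclidean")

# Everything merged below a height of 1.2 ends up in the same flat cluster.
flat_labels = fcluster(Z, t=1.2, criterion="distance")
n_clusters = len(np.unique(flat_labels))
print(flat_labels, n_clusters)
```

Raising the threshold merges more branches into fewer clusters; lowering it splits them further, which is how the cluster counts in the comparison can be explored.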

It would be nice if the number of branches of each color were well balanced and the heights were about the same. So, is the number of clusters about 5 after all?

| Number of clusters | Dendrogram | Comment |
|---|---|---|
| 4 | | Red is a little tall |
| 5 | | The heights are uniform. Purple has only a few members, which bothers me a little. Overall it feels pretty good |
| 8 | | The heights and counts are even, but is it divided too finely? |

Verification 6) Try 5 clusters

Having examined the number of clusters, I would like to plot once more what 5 clusters looks like.

It looks pretty good. So it is 5 clusters after all?

```python
from sklearn.cluster import KMeans

# Clustering (number of clusters: 5)
clustering = KMeans(n_clusters=5, random_state=42)

# fit_predict returns the cluster label of each point
cl_y = clustering.fit_predict(embeddings)

# Visualize
showScatter(embeddings, cl_y, clustering.cluster_centers_)
```

| No | Cluster | Centroid character | Other characters in the cluster |
|---|---|---|---|
| 1 | Blue | | |
| 2 | Purple | | |
| 3 | Green | | |
| 4 | Red | | |
| 5 | Orange | | |

Summary

Impressions

The flow was:

Start by choosing a representative from the 4 characters registered as kanji

From all 33 characters, including those not registered as kanji, choose 1, 4, or 8 characters

After examining the appropriate number of clusters, 5 clusters looked good, so I finally chose 5 characters

The representative kanji are listed below, but beyond choosing representatives,

it was interesting that on the XY plane compressed from more than 3,000 dimensions, **kanji with similar shapes were placed close together**.

It was also interesting that distance-based clustering produced **groups by radical**.

The number of clusters was judged to be 5 based on the elbow method, the silhouette method, and the dendrogram.

It was also interesting that the **visualized result for 5 clusters turned out reasonably well**.

Verification list

| No | How to choose | Representative Saito |
|---|---|---|
| 1 | If choosing 1 character from the 4 recognized kanji | |
| 2 | If choosing 1 character from all 33 kanji | |
| 3 | If choosing 4 characters from all 33 kanji | |
| 4 | If choosing 8 characters from all 33 kanji | |
| 5 | How many clusters should all 33 kanji be divided into? | About 5 clusters looks good |
| 6 | If choosing 5 characters from all 33 kanji | |

Finally

Thank you for bearing with such a silly story.

If you liked it, I would appreciate it if you shared it.

Reference information

About UMAP

https://umap-learn.readthedocs.io/en/latest/index.html

Discussion of clustering with UMAP (apparently)

https://umap-learn.readthedocs.io/en/latest/index.html

About KMeans

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Examining the number of clusters (Yellow Brick)

https://www.scikit-yb.org/en/latest/api/cluster/elbow.html

https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html

Drawing with dendrogram

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html

Visualization function

I referred to this article. Thank you very much. I will link.

```python
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from matplotlib import offsetbox
from sklearn.preprocessing import MinMaxScaler
from PIL import Image
import matplotlib.patches as patches

rc = {
    'font.family': ['sans-serif'],
    'font.sans-serif': ['Open Sans', 'Arial Unicode MS'],
    'font.size': 12,
    'figure.figsize': (8, 6),
    'grid.linewidth': 0.5,
    'legend.fontsize': 10,
    'legend.frameon': True,
    'legend.framealpha': 0.6,
    'legend.handletextpad': 0.2,
    'lines.linewidth': 1,
    'axes.facecolor': '#fafafa',
    'axes.labelsize': 10,
    'axes.titlesize': 14,
    'axes.linewidth': 0.5,
    'xtick.labelsize': 10,
    'xtick.minor.visible': True,
    'ytick.labelsize': 10,
    'figure.titlesize': 14
}
sns.set('notebook', 'whitegrid', rc=rc)


def colorize(d, color, alpha=1.0):
    # Tint a grayscale glyph image with the cluster color and add an alpha channel
    rgb = np.dstack((d, d, d)) * color
    return np.dstack((rgb, d * alpha)).astype(np.uint8)


colors = sns.color_palette('tab10')


def showScatter(
    embeddings,
    clusterlabels,
    centers=[],
    imgs=all.T.reshape(-1, h, w),  # `all`, `h`, `w` must already be defined
):
    fig, ax = plt.subplots(figsize=(15, 15))

    # Scaling before drawing scatter plot
    scaler = MinMaxScaler()
    embeddings = scaler.fit_transform(embeddings)
    source = zip(embeddings, imgs, clusterlabels)

    # Draw kanji on a scatter plot
    cnt = 0
    for pos, d, i in source:
        cnt = cnt + 1
        img = colorize(d, colors[i], 0.5)
        ab = offsetbox.AnnotationBbox(
            offsetbox.OffsetImage(img), 0.03 + pos * 0.94, frameon=False
        )
        ax.add_artist(ab)

    # Draw concentric circles from the center of gravity
    if len(centers) != 0:
        for c in scaler.transform(centers):
            for r in np.arange(3, 0, -1) * 0.05:
                circle = patches.Circle(
                    xy=(c[0], c[1]), radius=r, fc='#FFFFFF', ec='black'
                )
                circle.set_alpha(0.3)
                ax.add_patch(circle)
            ax.scatter(c[0], c[1], s=300, marker="X")

    # Axis drawing range
    limit = [-0.1, 1.1]
    plt.xlim(limit)
    plt.ylim(limit)
    plt.show()
```

