There are many difficulties in handling Unicode. I've only recently started studying it seriously, so what follows may contain some awful beginner-level mistakes:
I already knew about the confusing differences between the Unicode normalization forms (NFC, NFD, NFKC, NFKD). On another layer, though, when you want to count Thai, Arabic, Devanagari, and similar characters the way they appear visually, it seems you need to count in a higher-level unit called the grapheme.
Reference: 7 ways to count the number of characters
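As a quick illustration (my own example, not from the referenced article, using only the standard unicodedata module): the normalization form alone can change how many code points the "same" visible string contains.

>>> import unicodedata
>>> e_nfc = unicodedata.normalize('NFC', 'é')   # precomposed form: U+00E9
>>> e_nfd = unicodedata.normalize('NFD', 'é')   # decomposed form: U+0065 + U+0301
>>> len(e_nfc), len(e_nfd)   # same visible character, different code point counts
(1, 2)
>>> e_nfc == e_nfd           # not equal until normalized to the same form
False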
Grapheme
In other words:

- If you count the number of characters normally in a programming language, you get the number of code points.
- In reality, what looks like a single character may be composed of multiple code points.
- The unit corresponding to a visually correct single character is the grapheme cluster.

Or so it seems (a small sketch of the second point follows below).
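To make that second point concrete, here is a small example of my own (standard library only): some visually single characters have no precomposed form at all, so normalization cannot collapse them into one code point, and only grapheme-cluster counting matches what the eye sees.

>>> import unicodedata
>>> tha = '\u0e01\u0e34'    # Thai ก + vowel sign ิ, rendered as one character: กิ
>>> len(tha)                # two code points
2
>>> len(unicodedata.normalize('NFC', tha))   # no precomposed form exists, so NFC cannot merge them
2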
So what tools are available in Python for counting grapheme clusters? It doesn't seem to be covered by unicodedata in Python's standard library.
There seems to be a package called uniseg.
In this article I mainly show examples in Python 3. (I won't touch on the differences in how unicode, str, and bytes are handled between Python 2 and Python 3; that would take us too far off topic.)
$ pip install uniseg
>>> import uniseg.graphemecluster
>>> grapheme_split = lambda w: tuple(uniseg.graphemecluster.grapheme_clusters(w))
>>>
>>> phrase = 'กินข้าวเย็น'  # apparently a Thai phrase meaning "to eat dinner"
>>> len(phrase.encode('UTF-8'))  # number of bytes in UTF-8
33
>>> len(phrase) # Code Points
11
>>> len(grapheme_split(phrase)) # Grapheme clusters
8
And so on.
uniseg also seems to offer word-level and sentence-level segmentation. Since the splitting appears to rely on spaces and similar boundaries, it does not seem able to segment words in Japanese, an agglutinative language written without spaces.
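For reference, here is a minimal sketch of that word/sentence segmentation, assuming the installed uniseg exposes words() in uniseg.wordbreak and sentences() in uniseg.sentencebreak as described in its documentation (the exact segments depend on the UAX #29 rules and the package version):

>>> from uniseg.wordbreak import words
>>> from uniseg.sentencebreak import sentences
>>> list(words('Hello, world.'))    # spaces and punctuation come out as separate segments
['Hello', ',', ' ', 'world', '.']
>>> list(sentences('Hello, world. Good night.'))
['Hello, world. ', 'Good night.']

Since these rules mostly rely on spaces, scripts, and punctuation, running them over spaceless Japanese text would not give meaningful words, which is the limitation mentioned above.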