There are many difficulties in handling Unicode. I've only recently started studying it seriously, so what follows may contain some awful beginner-level mistakes:
I already knew about the confusing differences between the Unicode normalization forms (NFC, NFD, NFKC, NFKD). On another layer, though, when you want to count Thai, Arabic, Devanagari, and similar characters the way they appear visually, it seems you need to count in a higher-level unit called the grapheme.
Reference: 7 ways to count the number of characters
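As a quick illustration (my own example, not from the referenced article, using only the standard unicodedata module): the normalization form alone can change how many code points the "same" visible string contains.

>>> import unicodedata
>>> e_nfc = unicodedata.normalize('NFC', 'é')   # precomposed form: U+00E9
>>> e_nfd = unicodedata.normalize('NFD', 'é')   # decomposed form: U+0065 + U+0301
>>> len(e_nfc), len(e_nfd)   # same visible character, different code point counts
(1, 2)
>>> e_nfc == e_nfd           # not equal until normalized to the same form
False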
Grapheme
In other words:

- If you count the number of characters normally in a programming language, you get the number of code points.
- In reality, what looks like a single character may be composed of multiple code points.
- The unit corresponding to a visually correct single character is the grapheme cluster.

Or so it seems (a small sketch of the second point follows below).
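To make that second point concrete, here is a small example of my own (standard library only): some visually single characters have no precomposed form at all, so normalization cannot collapse them into one code point, and only grapheme-cluster counting matches what the eye sees.

>>> import unicodedata
>>> tha = '\u0e01\u0e34'    # Thai ก + vowel sign ิ, rendered as one character: กิ
>>> len(tha)                # two code points
2
>>> len(unicodedata.normalize('NFC', tha))   # no precomposed form exists, so NFC cannot merge them
2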
So what tools are available in Python for counting grapheme clusters? It doesn't seem to be covered by unicodedata in Python's standard library.
There seems to be a package called uniseg.
In this article I mainly show examples in Python 3. (I won't touch on the differences in how unicode, str, and bytes are handled between Python 2 and Python 3; that would take us too far off topic.)
$ pip install uniseg
>>> import uniseg.graphemecluster
>>> grapheme_split = lambda w: tuple(uniseg.graphemecluster.grapheme_clusters(w))
>>>
>>> phrase = 'กินข้าวเย็น'  # apparently a Thai phrase meaning "to eat dinner"
>>> len(phrase.encode('UTF-8'))  # number of bytes in UTF-8
33
>>> len(phrase) # Code Points
11
>>> len(grapheme_split(phrase)) # Grapheme clusters
8
And so on.
uniseg also seems to offer word-level and sentence-level segmentation. Since the splitting appears to rely on spaces and similar boundaries, it does not seem able to segment words in Japanese, an agglutinative language written without spaces.
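For reference, here is a minimal sketch of that word/sentence segmentation, assuming the installed uniseg exposes words() in uniseg.wordbreak and sentences() in uniseg.sentencebreak as described in its documentation (the exact segments depend on the UAX #29 rules and the package version):

>>> from uniseg.wordbreak import words
>>> from uniseg.sentencebreak import sentences
>>> list(words('Hello, world.'))    # spaces and punctuation come out as separate segments
['Hello', ',', ' ', 'world', '.']
>>> list(sentences('Hello, world. Good night.'))
['Hello, world. ', 'Good night.']

Since these rules mostly rely on spaces, scripts, and punctuation, running them over spaceless Japanese text would not give meaningful words, which is the limitation mentioned above.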