[RUBY] How to count UTF-8 code points fast

This article introduces algorithms for counting the number of code points in a UTF-8 string. A simple implementation of code point counting is not hard to write, but a highly efficient one is surprisingly tricky.

The content is twofold:

- As a practical implementation, I introduce the one used internally by Ruby (CRuby) in string.c.
- Going beyond standard C, I also describe an implementation using SIMD instructions (AVX/AVX2). A light search did not turn up any known algorithm for this, so I improvised an ad hoc implementation, but it does not seem particularly inefficient.

As a bonus, I also ran a simple performance evaluation. Throughout, the UTF-8 string is assumed to have been validated (i.e., it is known not to contain any invalid sequence).

How it is done in Ruby's internal implementation

First, define `is_utf8_lead_byte` to determine whether a byte is the lead byte of a code point. It is a preprocessor macro.

#define is_utf8_lead_byte(c) (((c)&0xC0) != 0x80)

https://github.com/ruby/ruby/blob/v2_6_2/string.c#L1629

Masking with 0xC0 (0b11000000) leaves only the top two bits. UTF-8 continuation (non-lead) bytes always have the form 0b10xxxxxx, so a continuation byte becomes exactly 0b10000000 after masking. Therefore the comparison != 0x80 tells you whether a byte is a lead byte.
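As a quick sanity check, here is a minimal standalone demo (my own, not from the Ruby source) that classifies the bytes of a short UTF-8 string:

#include <cstdio>

#define is_utf8_lead_byte(c) (((c)&0xC0) != 0x80)

int main() {
    // "aあ" in UTF-8: 'a' = 0x61, U+3042 = 0xE3 0x81 0x82
    const unsigned char bytes[] = {0x61, 0xE3, 0x81, 0x82};
    for (unsigned char b : bytes) {
        std::printf("0x%02X -> %s\n", (unsigned)b, is_utf8_lead_byte(b) ? "lead" : "continuation");
    }
    // 0x61 and 0xE3 are lead bytes; 0x81 and 0x82 are continuation bytes.
    return 0;
}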

This macro makes it easy to implement code point counting.

#define is_utf8_lead_byte(c) (((c)&0xC0) != 0x80)
int64_t count_utf8_codepoint(const char *p, const char *e) {
    int64_t len = 0;
    while (p < e) {
        if (is_utf8_lead_byte(*p)) len++;
        p++;
    }
    return len;
}

So far, so simple, but the implementation in the Ruby interpreter is a little more clever. The simple implementation above processes the string byte by byte, but in a typical environment such as a modern PC or server, 32/64-bit integers can be handled just as cheaply, so per-byte processing is relatively inefficient. Therefore, multiple bytes are loaded at once as a `uintptr_t`, and all the lead bytes they contain are counted in one go. **The number of code points equals the number of lead bytes**, so counting lead bytes is all you need to count code points.

#define NONASCII_MASK UINT64_C(0x8080808080808080)
static inline uintptr_t
count_utf8_lead_bytes_with_word(const uintptr_t *s)
{
    uintptr_t d = *s;
    d = (d>>6) | (~d>>7);    /* LSB of each byte: bit 6 set OR bit 7 clear */
    d &= NONASCII_MASK >> 7; /* keep only the LSB of each byte */
    return __builtin_popcountll(d);
}

https://github.com/ruby/ruby/blob/v2_6_2/string.c#L1631-L1665

It's a little confusing because it's calculated all together, but please note the following points.

- `NONASCII_MASK >> 7` is a mask in which only the least significant bit of each byte is set.
- `d` is ANDed with this mask, so only the least significant bit of each byte survives.
- For each byte of `(d >> 6) | (~d >> 7)`, the least significant bit encodes "the 7th bit is set, or the 8th bit is clear".
- The 7th bit is set: recall that a UTF-8 continuation byte always has the form 0b10xxxxxx, so a byte with the 7th bit set must be a lead byte.
- The 8th bit is clear: the byte is ASCII, hence a lead byte.

After the bit operations, the least significant bit of each byte of d indicates whether that byte is a lead byte. All other bits are masked out and are always 0. Therefore, counting the lead bytes in the word is just a popcount.
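Here is a minimal standalone trace of the trick (my own illustration, not from string.c): the string "aあaaaa" is 8 bytes long and contains 6 code points.

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // "aあaaaa" in UTF-8: 61 E3 81 82 61 61 61 61 (8 bytes, 6 code points)
    const unsigned char str[8] = {0x61, 0xE3, 0x81, 0x82, 0x61, 0x61, 0x61, 0x61};
    uint64_t d;
    std::memcpy(&d, str, 8);                 // each byte stays in its own 8-bit slot
    d = (d >> 6) | (~d >> 7);                // LSB of each byte becomes: bit6 | !bit7
    d &= UINT64_C(0x8080808080808080) >> 7;  // keep only the LSB of each byte
    std::printf("%d\n", __builtin_popcountll(d));  // GCC/Clang builtin; prints 6
}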

Finally, here is the big picture of the implementation that uses `count_utf8_lead_bytes_with_word`.

int64_t ruby_count_utf8_codepoint(const char *p, const char *e)
{
    uintptr_t len = 0;
    if ((int)sizeof(uintptr_t) * 2 < e - p) {
        const uintptr_t *s, *t;
        const uintptr_t lowbits = sizeof(uintptr_t) - 1;
        s = (const uintptr_t*)(~lowbits & ((uintptr_t)p + lowbits)); /* round p up */
        t = (const uintptr_t*)(~lowbits & (uintptr_t)e);             /* round e down */
        /* head: byte by byte up to the first aligned word */
        while (p < (const char *)s) {
            if (is_utf8_lead_byte(*p)) len++;
            p++;
        }
        /* aligned middle: one word at a time */
        while (s < t) {
            len += count_utf8_lead_bytes_with_word(s);
            s++;
        }
        p = (const char *)s;
    }
    /* tail (or short input): byte by byte */
    while (p < e) {
        if (is_utf8_lead_byte(*p)) len++;
        p++;
    }
    return (long)len;
}

https://github.com/ruby/ruby/blob/v2_6_2/string.c#L1680-L1700

As for why the implementation is this complicated: for `count_utf8_lead_bytes_with_word` to work properly, it must read from a pointer aligned to a `sizeof(uintptr_t)`-byte boundary (typically 4 or 8 bytes). The concrete steps are as follows.

- By adding `lowbits` and then masking, `p` is advanced (rounded up) to the aligned position, giving `s`.
- By masking off the low bits, `e` is pulled back (rounded down) to the aligned position, giving `t`.

You can then apply `count_utf8_lead_bytes_with_word` between the preprocessed `s` and `t`. Of course, gaps may remain between `p` and `s` and between `t` and `e`, so ordinary byte-by-byte loops fill in the differences before and after.
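To make the rounding concrete, here is a small illustration with made-up addresses (my own sketch, assuming a 64-bit build, so lowbits == 7):

#include <cstdint>
#include <cstdio>

int main() {
    const uintptr_t lowbits = sizeof(uintptr_t) - 1;
    uintptr_t p = 0x1003, e = 0x1026;        // hypothetical addresses
    uintptr_t s = ~lowbits & (p + lowbits);  // round up:   0x1008
    uintptr_t t = ~lowbits & e;              // round down: 0x1020
    std::printf("s=%#zx t=%#zx\n", (size_t)s, (size_t)t);
}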

AVX algorithm / implementation

(Added on 2019/04/18) @umezawatakeshi improved the implementation here. If you are interested in implementing AVX, please refer to that article as well. https://qiita.com/umezawatakeshi/items/ed23935788756c800b86

It is not much of an algorithm, so I will explain it alongside the implementation.

The basic policy is, as in the UTF-8 validation algorithm, to use the property that you can tell whether a byte is a lead byte by looking at its upper nibble. With `vpshufb` you can set 1 at exactly the positions holding lead bytes. As for counting those 1s, horizontal addition is costly in SIMD, so the 32-byte vectors are instead accumulated as-is with vertical byte-wise additions. However, a byte can only represent 0 to 255, so performing more than 255 vector additions could overflow. The workaround is to split the loop and aggregate the accumulator with one horizontal addition every 255 iterations (naturally, the inner loop also ends early when the input runs out). Horizontal addition thus happens only about once per 255 iterations, so its amortized cost stays relatively low.
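For intuition, here is a scalar equivalent of what the `vpshufb` lookup computes per byte (my own illustrative sketch; names such as kLeadTable are mine, not from the original implementation):

#include <cstdint>

// High nibble 0x0..0x7: ASCII byte                    -> lead byte -> 1
// High nibble 0x8..0xB: continuation byte             -> not lead  -> 0
// High nibble 0xC..0xF: start of a multi-byte sequence -> lead byte -> 1
static const uint8_t kLeadTable[16] = {1,1,1,1,1,1,1,1, 0,0,0,0, 1,1,1,1};

inline uint8_t is_lead(uint8_t b) { return kLeadTable[b >> 4]; }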

Now for the implementation. First, define a helper function that horizontally sums a vector in byte units.

inline int32_t avx2_horizontal_sum_epi8(__m256i x)
{
    // Widen the 32 bytes into 16-bit lanes (interleaved with zeros) and add.
    __m256i sumhi = _mm256_unpackhi_epi8(x, _mm256_setzero_si256());
    __m256i sumlo = _mm256_unpacklo_epi8(x, _mm256_setzero_si256());
    __m256i sum16x16 = _mm256_add_epi16(sumhi, sumlo);
    // Fold the upper 128-bit lane onto the lower lane.
    __m256i sum16x8 = _mm256_add_epi16(sum16x16, _mm256_permute2x128_si256(sum16x16, sum16x16, 1));
    // Fold the upper 64 bits onto the lower 64 bits.
    __m256i sum16x4 = _mm256_add_epi16(sum16x8, _mm256_shuffle_epi32(sum16x8, _MM_SHUFFLE(0, 0, 2, 3)));
    // Sum the remaining four 16-bit values on the scalar side.
    uint64_t tmp = _mm256_extract_epi64(sum16x4, 0);
    int32_t result = 0;
    result += (tmp >> 0 ) & 0xffff;
    result += (tmp >> 16) & 0xffff;
    result += (tmp >> 32) & 0xffff;
    result += (tmp >> 48) & 0xffff;
    return result;
}

Using `avx2_horizontal_sum_epi8`, it is relatively easy to write a function that counts the code points in a length that is a multiple of 32 bytes.

int64_t avx_count_utf8_codepoint(const char *p,  const char *e)
{
    // `p` must be 32B-aligned pointer
    p = static_cast<const char *>(__builtin_assume_aligned(p, 32));
    const size_t size = e - p;
    int64_t result = 0;
    for (size_t i = 0; i + 31 < size;) {
        __m256i sum = _mm256_setzero_si256();
        size_t j = 0;
        for (; j < 255 * 32 && (i + 31) + j < size; j += 32) {
            const __m256i table = _mm256_setr_epi8(
                    1, 1, 1, 1, 1, 1, 1, 1, //     .. 0x7
                    0, 0, 0, 0,             // 0x8 .. 0xB
                    1, 1, 1, 1,             // 0xC .. 0xF
                    1, 1, 1, 1, 1, 1, 1, 1, //     .. 0x7
                    0, 0, 0, 0,             // 0x8 .. 0xB
                    1, 1, 1, 1              // 0xC .. 0xF
                    );
            __m256i s = _mm256_load_si256(reinterpret_cast<const __m256i *>(p + i + j));
            s = _mm256_and_si256(_mm256_srli_epi16(s, 4), _mm256_set1_epi8(0x0F));
            s = _mm256_shuffle_epi8(table, s);
            sum = _mm256_add_epi8(sum, s);
        }
        i += j;
        result += avx2_horizontal_sum_epi8(sum);
    }
    return result;
}

The inner loop adds vectors element-wise; the outer loop aggregates the elements of the accumulated vector and adds the total to `result`. `table` is a lookup table that, given the upper nibble of a UTF-8 byte, yields 1 for lead bytes and 0 otherwise.

The inner loop condition `j < 255 * 32 && (i + 31) + j < size` is a bit confusing. The loop variable `j` advances by 32, so the first half, `j < 255 * 32`, stops the loop after 255 iterations. The second half, `(i + 31) + j < size`, guards against reading past the largest multiple of 32 within `size`. For example, with size = 100 the inner loop runs for j = 0, 32, and 64 (three vectors covering 96 bytes), leaving the last 4 bytes to scalar code. When the inner loop ends, `j` is added to `i`, so once the input is exhausted the outer loop condition `i + 31 < size` becomes false as well and the whole loop terminates.

Combined implementation

The part extending beyond the last multiple of 32 can be handled by combining with the Ruby-style implementation. This time, I simply filled in the leftover bytes with byte-by-byte processing. Since `p` may not be aligned to a 32-byte boundary, I also inserted scalar code that advances one byte at a time up to the 32B boundary before the vectorized processing.

*(Added on 2019/04/07 00:30) The code has been fixed because there was a bug.*

int64_t count_utf8_codepoint(const char *p, const char *e)
{
    int64_t count = 0;
#if defined(__AVX2__)
    if (32 <= e - p) {
        // increment `p` to 32B boundary
        while (((uintptr_t)p % 32) != 0) {
            if (is_utf8_lead_byte(*p)) count++;
            p++;
        }
        // vectorized count
        count += avx_count_utf8_codepoint(p, e);
        p += static_cast<uintptr_t>(e - p) / 32 * 32;
    }
#endif
    while (p < e) {
        if (is_utf8_lead_byte(*p)) count++;
        p++;
    }
    return count;
}
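As a usage sketch (my own; it assumes the functions above are in scope and the file is built with -mavx2):

#include <cstdio>
#include <string>

int main() {
    // "こんにちは" is 15 bytes and 5 code points in UTF-8.
    std::string s = u8"こんにちは";
    std::printf("%lld\n", (long long)count_utf8_codepoint(s.data(), s.data() + s.size()));  // 5
}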

Performance evaluation

This time, we will look at whether each implementation achieves sufficient throughput on long strings, and quantify the differences.

I will briefly describe the conditions for performance evaluation.

- Environment: Ubuntu 18.04.1 on VirtualBox on a MBP Late 2013 (i.e., a Haswell-generation CPU)
- String size: 100 MiB
- A random string is used instead of proper text
- The measured value is the number of clock cycles obtained from the rdtsc instruction
- Measured 3 times, taking the median
- One pass over 100 MiB is too short, so the measured section repeats the same processing 100 times
- Compilers: Clang 8.0 and GCC 7.3.0
- Compiler options (common): -O3 -masm=intel --std=c++14 -mavx -mavx2 -Wall -Wextra

The measurement targets are the simple scalar version introduced first, the Ruby version, and `count_utf8_codepoint` from the combined-implementation section (hereafter, the AVX version).

Measurement result

Implementation   Compiler   Clock cycles (Mclk)
Scalar version   GCC        7323
Scalar version   Clang      8793
Ruby version     GCC        4573
Ruby version     Clang      2643
AVX version      GCC        2361
AVX version      Clang      2345

Why did the results turn out this way?

The overall result is that the scalar version is the slowest and the AVX version is the fastest.

The Ruby version shows a big difference between GCC and Clang, which seems to be because Clang's auto-vectorization works well here. GCC's auto-vectorization does not kick in for the heaviest part of the Ruby version (the `while (s < t)` loop), and that is where the big difference arises.

Interestingly, GCC seems to generate more efficient code when auto-vectorizing the plain scalar version. Did Clang perhaps take a hint from the Ruby version's algorithm? I hope it is not a measurement error.

Comparing the Ruby version and the AVX version under the same compiler (Clang), hand-writing the AVX code buys about a 10% speedup. Put differently, with the compiler's help you get about 90% of the hand-written intrinsics performance relatively easily, so whether hand-writing is worth the cost becomes a real question.

By the way, in wall-clock time the Clang AVX version took about 0.838 seconds, which works out to roughly 12 GiB/s (100 MiB × 100 iterations ≈ 9.8 GiB, and 9.8 GiB / 0.838 s ≈ 11.7 GiB/s). The string is large enough that the cache cannot help, and given that the DDR3-1600 dual-channel memory in this MBP has a theoretical bandwidth of 25.6 GiB/s, about 50% of the bandwidth is being used. That seems more than good enough for a single-threaded program.

Summary

- Introduced a practical implementation of UTF-8 code point counting
- Presented a counting algorithm that adapts the UTF-8 validation technique, and verified its practicality on huge strings
- Found that performance and its characteristics can differ greatly between compilers

Appendix

Source code

https://gist.github.com/saka1/5bdd0f0ad4e498d22258982f6b9699c5
