How many bits is wchar_t, defined in wchar.h, in C?
Did you assume it was 16 bits? (So did I.) In fact, it depends on the environment, and it is not always 16 bits.
* For simplicity, surrogate pairs are not considered here.
main.c
#include <stdio.h>
#include <wchar.h>

int main(void) {
    const wchar_t *s = L"ABCD";
    /* wcslen() and sizeof yield size_t, so cast to int to match %d */
    printf("%d %d\n", (int)wcslen(s), (int)sizeof(s[0]));
    return 0;
}
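When I compiled and ran this with gcc in two different environments, the results were
4 2
in the first and
4 4
in the second.
The L"" prefix, which denotes a wide string literal, and wcslen(), which counts the number of characters, are consistent with each other, so both environments report 4 characters. The actual size of wchar_t differs, though: 16 bits in the first environment and 32 bits in the second. (A 16-bit wchar_t is typical of Windows toolchains, and a 32-bit one of Linux.)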
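To find out which case applies on a given toolchain, the width can be computed from sizeof and CHAR_BIT. This is only a minimal sketch; wchar_bits is a name made up here, not a standard function.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// Hypothetical helper: width of wchar_t in bits for the current compiler/target.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// If some code really depends on a 16-bit wchar_t, a compile-time check
// like the following makes the build fail elsewhere instead of misbehaving:
// static_assert(wchar_bits() == 16, "this code assumes a 16-bit wchar_t");
```

Printing wchar_bits() would be expected to give 16 or 32 depending on the environment, in line with the 2 and 4 shown above.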
Suppose Unicode string data supplied from outside is a null-terminated string with 16 bits per character (UTF-16). A program that prints the string and its number of characters might look like this, for example.
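#include <stdio.h>
#include <wchar.h>

int main(void) {
    char data[] = {0x40, 0x00, 0x41, 0x00, 0x42, 0x00, 0x00, 0x00}; // Suppose this is given
    // This cast only makes sense where wchar_t is 16 bits
    wchar_t *s = (wchar_t *)data;
    // wcslen() yields size_t, so cast to int to match %d
    printf("%ls %d\n", s, (int)wcslen(s));
    return 0;
}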
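In an environment where wchar_t is 16 bits, this prints
@AB 3
as expected. Where wchar_t is 32 bits, however, it behaves unexpectedly: each pair of 16-bit units is read as a single character, and no terminator is found inside the buffer, so wcslen() reads past the end of data.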
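One way to avoid depending on the size of wchar_t is to assemble the 16-bit units from the bytes by hand instead of casting the pointer. The following is a rough sketch under two assumptions that are mine, not stated by the data itself: the bytes are little-endian (as in the sample above), and the buffer length is known. decode_utf16le is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode little-endian 16-bit code units from raw bytes,
// stopping at a 0x0000 terminator or at the end of the buffer.
// Surrogate pairs are not interpreted; units are copied as they are.
std::u16string decode_utf16le(const unsigned char* data, std::size_t nbytes) {
    std::u16string out;
    for (std::size_t i = 0; i + 1 < nbytes; i += 2) {
        // Build each unit from two bytes: no pointer cast, so neither the
        // size of wchar_t nor the alignment of the buffer matters.
        char16_t unit = static_cast<char16_t>(data[i] | (data[i + 1] << 8));
        if (unit == 0) break;   // terminator reached
        out.push_back(unit);
    }
    return out;
}
```

For the eight bytes used above, this should yield a three-character string, and its size() gives the character count without calling wcslen().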
C++11 introduced the type char16_t as a data type that represents one UTF-16 character (excluding surrogate pairs). Similarly, char32_t represents a 32-bit character.
In addition, the prefixes u"" and U"" were added for writing UTF-16 and UTF-32 string literals.
The example at the beginning could be rewritten as follows (although this is C++ rather than C).
main.cpp
#include <stdio.h>
#include <string>
using namespace std;

int main() {
    char16_t s[] = u"ABCD"; // UTF-16 string literal
    // length() and sizeof yield size_t, so cast to int to match %d
    printf("%d %d\n", (int)char_traits<char16_t>::length(s), (int)sizeof(s[0]));
    return 0;
}
Specify C++11 in the compiler options.
terminal
g++ -std=c++11 main.cpp
With this,
4 2
is printed regardless of the size of wchar_t.
Unlike wchar_t, char16_t has no conversion specifier in the printf() function.
Instead, it can be output like this. (Save the source code as UTF-8.)
#include <string>
#include <codecvt>
#include <locale>
#include <iostream>
using namespace std;

int main() {
    char16_t s[] = u"AIUEO";
    // Converter from 16-bit characters to a UTF-8 byte string
    wstring_convert<codecvt_utf8<char16_t>, char16_t> cv;
    cout << cv.to_bytes(s) << endl;
    return 0;
}
However, the following caveats apply to the use of codecvt_utf8. → codecvt_utf8 - cpprefjp - C++ Japanese Reference
- Deprecated in C++17
- Requires GCC 5.1 or later
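Given the deprecation, one alternative is to do the conversion by hand. The following is a minimal sketch, assuming (as elsewhere in this article) that the input contains no surrogate pairs; bmp_to_utf8 is a name made up for illustration, not a standard function.

```cpp
#include <string>

// Hypothetical helper: encode 16-bit characters as UTF-8.
// Assumes every unit is a complete character (no surrogate pairs).
std::string bmp_to_utf8(const std::u16string& in) {
    std::string out;
    for (char16_t c : in) {
        if (c < 0x80) {            // 1 byte:  0xxxxxxx
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {                   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

The result can then be written with cout in the same way as the output of to_bytes(). Input that does contain surrogate pairs would need extra handling that this sketch leaves out.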
The char type, which represents one character in Java, is 16 bits.
For example, in JNI (Java Native Interface), when returning string data represented as null-terminated UTF-16 as a Java String (jstring), I got tripped up by counting the number of characters with wcslen.
C++ code
// Given an argument arg of type jbyteArray (byte[])
jbyte *arg_ptr = env->GetByteArrayElements(arg, NULL);
// wcslen may give an unexpected result where wchar_t is not 16 bits
jstring ret_string = env->NewString((jchar *)arg_ptr, wcslen((wchar_t *)arg_ptr));
env->ReleaseByteArrayElements(arg, arg_ptr, 0);
Perhaps a fix like the following would do. (Remember to add the required header include and using namespace std;.)
jstring ret_string = env->NewString((jchar *)arg_ptr, char_traits<char16_t>::length((char16_t *)arg_ptr));
If all you need is the length of a null-terminated string, you could also simply count it yourself in a loop.
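Such a loop could look like the sketch below. utf16_length is a hypothetical name; it counts 16-bit units, which equals the number of characters only as long as surrogate pairs are left out.

```cpp
#include <cstddef>

// Hypothetical helper: count 16-bit units before the 0x0000 terminator.
// Does not depend on the size of wchar_t.
std::size_t utf16_length(const char16_t* s) {
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

In the JNI snippet above, the result could be passed as the length argument of NewString in place of the wcslen call, after casting to jsize.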
Python has a library called ctypes for calling C/C++ shared libraries (.dll, .so).
Here too, you can get tripped up when creating and manipulating an array of Unicode characters from a byte string, or when passing it to another function.
Python
import ctypes
wstr = ctypes.create_unicode_buffer(u"AIUEO")
print(ctypes.sizeof(wstr))
In an environment where wchar_t is 16 bits, this prints 12; where it is 32 bits, it prints 24 (five characters plus the terminating null, multiplied by the size of wchar_t).
The "environment" here appears to be that of the compiler used to build Python.
In practice, create_unicode_buffer() itself may not be useful very often.
It is probably best suited to handling arguments of type wchar_t * when working with the Windows API.
wchar_t is scary. wcslen is scary.
I hope that no one else underestimates them and gets burned the way I did.