Introducing C characters and strings for Swift programmers

Verification environment

Machine: MacBook Air
OS: macOS High Sierra
Swift: 4.0.3 (swiftlang-900.0.74.1 clang-900.0.39.2)
Clang: Apple LLVM version 9.0.0 (clang-900.0.39.2)

Overview

Most of us casually use the following initializer to convert a C string (char *) into a Swift or Objective-C string (String, NSString).

String.swift


init(cString: UnsafePointer<CChar>) //typealias CChar = Int8

This article first introduces the cString argument passed to this initializer, that is, the C string. Next, on the NSString side the method above has actually been deprecated, and a variant that also takes an encoding is provided instead, as shown below. The article then considers what encoding value should be passed to it.

String.swift


init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding)

About C characters

C provides the char type to represent characters. A char variable can hold a 1-byte value. A character is written by enclosing it in single quotes, and a value written this way is called a character literal.

C letter.c


#include <stdio.h>

int main(void) {
    char c = '*'; // Enclosing a character in single quotes yields its 1-byte ASCII value.
    printf("%c\n", c); // *
    printf("%zu\n", sizeof(c)); // 1
    return 0;
}

As the comment says, the code above is effectively shorthand for the following.

letter c.c


#include <stdio.h>

int main(void) {
    char c = 42; // 42 is the ASCII code of '*'
    printf("%c\n", c); // *
    printf("%zu\n", sizeof(c)); // 1
    return 0;
}

In other words, a character literal is really "just a number".
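Because a character literal is just a number, ordinary integer operations work on it. The following is a minimal sketch; the helper names `ascii_code` and `next_char` are hypothetical, and the values assume an ASCII-compatible character set.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: returns the numeric value stored in a char. */
int ascii_code(char c) {
    return (int)c;
}

/* Hypothetical helper: returns the character whose code is one larger.
   The caller must not pass the largest char value. */
char next_char(char c) {
    assert(c < CHAR_MAX);
    return (char)(c + 1);
}

/* Prints a character together with its code, e.g. "* = 42". */
void print_char_and_code(char c) {
    printf("%c = %d\n", c, ascii_code(c));
}
```

Under these assumptions, `ascii_code('*')` returns 42 and `next_char('a')` returns 'b'.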

Also, if you enclose a multibyte character in single quotes as shown below, the generated number (*1) exceeds the size of char (1 byte), resulting in a compile error.

(*1: For the generated value, see the section below, "[When multibyte characters are stored in the C string](https://qiita.com/ysn551/items/446074b22103233edd95#c%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AB%E3%83%9E%E3%83%AB%E3%83%81%E3%83%90%E3%82%A4%E3%83%88%E6%96%87%E5%AD%97%E3%82%92%E6%A0%BC%E7%B4%8D%E3%81%97%E3%81%9F%E5%A0%B4%E5%90%88)".)

C letter.c


// This source file is saved in UTF-8
int main(void) {
    char c = 'あ'; // error: character too large for enclosing character literal type
}

In other words, a multibyte character cannot be stored as-is in a char variable.
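One way to see why a multibyte character does not fit in a single char is to count bytes and characters separately. The sketch below assumes the input is valid UTF-8 and uses a hypothetical helper, `utf8_char_count`. The test character is written with hex escapes so the example does not depend on how the source file is saved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts characters (code points) in a UTF-8 string.
   Assumes valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx,
   so every byte that is NOT a continuation byte starts a new character. */
size_t utf8_char_count(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the bytes `"\xe3\x81\x82"` (one Japanese hiragana character in UTF-8), `strlen` reports 3 bytes while `utf8_char_count` reports 1 character.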

You can also confirm that each character in a character literal corresponds to a 1-byte number by storing a literal of several characters directly in an int (4 bytes), as shown below. The value of such a multi-character literal is implementation-defined; the output shown is what clang produces.

The substance of a character literal is a 1-byte number.c


#include <stdio.h>

int main(void) {
    int num = 'abcd'; // multi-character literal: the value is implementation-defined
    printf("%x\n", num); // 61626364
    return 0;
}

Printing num in hexadecimal gives "61626364". Read byte by byte, it decomposes into "61, 62, 63, 64", the ASCII codes of 'a', 'b', 'c', and 'd'.
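The byte-by-byte reading described above can be sketched with shifts and masks. `split_bytes` is a hypothetical helper, and the value is passed in as a plain hexadecimal number so the example does not rely on implementation-defined multi-character literals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: splits a 32-bit value into 4 bytes,
   most significant byte first. */
void split_bytes(uint32_t value, unsigned char out[4]) {
    assert(out != NULL);
    out[0] = (unsigned char)((value >> 24) & 0xFF);
    out[1] = (unsigned char)((value >> 16) & 0xFF);
    out[2] = (unsigned char)((value >> 8) & 0xFF);
    out[3] = (unsigned char)(value & 0xFF);
}
```

With the value 0x61626364, the four bytes are 0x61, 0x62, 0x63, and 0x64, which are the ASCII codes of 'a' through 'd'.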

Summary here

About the C string

A C string is an array of char, in other words an array of 1-byte elements. A string is written by enclosing characters in double quotes, and a value written this way is called a string literal.

C string.c


#include <stdio.h>

int main(void) {
    char str[] = "Hello"; // The element count can be omitted when the array is initialized at declaration.
    printf("sizeof(str)/sizeof(char) = %zu\n", sizeof(str)/sizeof(char)); // 6
    return 0;
}

The string Hello used for initialization above is 5 characters long, but the array has 6 elements. In fact, the literal is shorthand for the following.

About the C string.c


#include <stdio.h>

int main(void) {
    char str[] = {'H','e','l','l','o','\0'};
    printf("sizeof(str)/sizeof(char) = %zu\n", sizeof(str)/sizeof(char)); // 6
    return 0;
}

In other words, the string literal "Hello" produces a char array of 6 elements, with the null character at the end.
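The trailing null character is what lets C code find the end of a string, since the array itself carries no length. A minimal sketch of that idea follows; both function names are hypothetical, and `count_until_null` simply mirrors what the standard `strlen` does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical re-implementation of strlen: walks the array until it
   reaches the terminating null character and returns the number of
   characters before it. */
size_t count_until_null(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Number of array elements the literal "Hello" occupies,
   including the terminating null character. */
size_t hello_array_size(void) {
    return sizeof("Hello");
}
```

Here `count_until_null("Hello")` returns 5, while `hello_array_size()` returns 6 because sizeof also counts the null character.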

Summary here

When multibyte characters are stored in the C string

The sections above did not cover it, but what happens when a string is initialized with a multibyte character, as shown below? Let's compare the results when the source file is saved as UTF-8 and as Shift-JIS.

About the C string.c


#include <stdio.h>

int main(void) {
    char str[] = "あ"; // The element count can be omitted when the array is initialized at declaration.
    int size = sizeof(str);
    for (int i = 0; i < size; i++) {
        printf("%hhx ", str[i]); // Compare this output for each encoding
    }
    return 0;
}

Method of verification:

  1. Open the editor
  2. Change the encoding setting of the editor to Shift-JIS or UTF-8
  3. Paste the source code and save
  4. Compile with clang ($ cc file.c)
  5. Run ($ ./a.out)

Output when saved in UTF-8:

case_utf8_result.txt


e3 81 82 0 

Output when saved in Shift-JIS:

case_shift_jis_result.txt


82 a0 0

You can confirm each of the values above by looking up "あ" in a character-code table for the corresponding encoding.

In other words, the bytes produced by a C string literal containing multibyte characters match the encoding the text editor used to save the file.

This is only natural, because what we pass to the compiler is the "file" in which the source code is written, not the abstract "source code" itself.

So in UTF-8, char str[] = "あ" can be regarded as shorthand for the following.

About the C string.c


#include <stdio.h>

int main(void) {
    // e3 81 82 0
    char str[] = {0xe3, 0x81, 0x82, 0x0};
    printf("%s \n", str); // If the terminal encoding is set to UTF-8, "あ" is displayed.
    return 0;
}
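To make the bytes of a literal independent of the encoding the file is saved in, the bytes can also be written with hex escapes inside a string literal. The sketch below checks this with `hex_dump`, a hypothetical helper that formats the bytes of a C string (without the terminating null) into a caller-supplied buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s as space-separated
   two-digit lowercase hex into out. Stops early if out is too small. */
void hex_dump(const char *s, char *out, size_t out_size) {
    assert(s != NULL && out != NULL && out_size > 0);
    size_t len = strlen(s);
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        int n = snprintf(out + used, out_size - used, "%s%02x",
                         i == 0 ? "" : " ", (unsigned int)(unsigned char)s[i]);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0'; /* drop a partially written byte */
            break;
        }
        used += (size_t)n;
    }
}
```

Calling `hex_dump("\xe3\x81\x82", buf, sizeof buf)` fills buf with "e3 81 82", and `"\x82\xa0"` gives "82 a0", matching the UTF-8 and Shift-JIS outputs shown earlier whatever encoding the source file uses.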

When the code above is run with the terminal encoding set to UTF-8, "あ" is displayed. With Shift-JIS, the output is garbled. The terminal encoding can be changed under Settings → Profiles → Advanced tab.

[Screenshot: terminal encoding setting]

In the result, the top is the output with Shift-JIS set and the bottom is the output with UTF-8 set.

[Screenshot: output with each terminal encoding]

Summary here

About the encoding specified when converting to a Swift string

Suppose we use the Swift API below to convert a string passed from a C API into a Swift String. What encoding value should be specified?

String.swift


init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding) //CChar = Int8

The C program code to be verified is as follows.

libc.c


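#include <stdio.h>
#include <stdlib.h>

char* file_name() {
    return "hello.txt";
}

// Returns the first line of the file in a newly allocated buffer; the caller must free() it.
char* new_file_header_str() {
    FILE *f = fopen(file_name(), "r");
    if (f == NULL) return NULL;

    char *str = calloc(256, sizeof(char));
    if (str != NULL) fgets(str, 256, f); // Only one line
    fclose(f);
    return str;
}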
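Note that fgets keeps the trailing newline when the whole line fits in the buffer, so the string handed to Swift may end with '\n'. If that is not wanted, it can be removed on the C side. The following is a sketch; `strip_newline` is a hypothetical helper that is not part of the original libc.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: truncates s at the first '\r' or '\n', if any.
   strcspn returns the length of the leading part of s that contains
   neither character, which is the index where the string should end. */
void strip_newline(char *s) {
    assert(s != NULL);
    s[strcspn(s, "\r\n")] = '\0';
}
```

It could be called on the buffer right after fgets succeeds. A string without a newline is left unchanged.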

When these functions are called from Swift, the C char * type is passed as the UnsafeMutablePointer<Int8> type, wrapped in an Optional.

First, consider converting the C string obtained from the file_name function into a Swift string. The function returns a string literal as-is. In other words, the encoding specified when converting it to a Swift string must be the same as the encoding of the libc.c file.

Next, what about the encoding used to convert the C string obtained from the new_file_header_str function? Here the contents of the hello.txt file are returned, so the encoding you specify must be the one in which hello.txt is saved.
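A mismatch between the assumed and the actual encoding can sometimes be detected from the bytes themselves. The sketch below is a simplified structural check for UTF-8. The name `looks_like_utf8` is hypothetical, and the check deliberately ignores overlong forms and surrogate ranges, so it is a heuristic rather than a full validator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when s has the byte structure of UTF-8.
   Simplified: checks only lead-byte patterns and continuation bytes,
   not overlong encodings or surrogate code points. */
bool looks_like_utf8(const char *s) {
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    while (*p != '\0') {
        int extra; /* number of continuation bytes expected */
        if (*p < 0x80)                extra = 0; /* 0xxxxxxx: ASCII */
        else if ((*p & 0xE0) == 0xC0) extra = 1; /* 110xxxxx */
        else if ((*p & 0xF0) == 0xE0) extra = 2; /* 1110xxxx */
        else if ((*p & 0xF8) == 0xF0) extra = 3; /* 11110xxx */
        else return false; /* stray continuation byte or invalid lead byte */
        p++;
        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80) return false; /* must be 10xxxxxx */
        }
    }
    return true;
}
```

The UTF-8 bytes e3 81 82 pass this check, while the Shift-JIS bytes 82 a0 fail it because 0x82 is not a valid UTF-8 lead byte.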

Below is sample code that calls each function from Swift, with the libc.c file saved in UTF-8 and the hello.txt file saved in Shift-JIS.

get_str_from_c.swift


import Foundation

let name = file_name() //Optional<UnsafeMutablePointer<Int8>>
if let name = name,
    let converted = String(cString: name, encoding: .utf8) {
    print(converted)
}

let header = new_file_header_str() //Optional<UnsafeMutablePointer<Int8>>
if let header = header {
    defer { free(header) } // The buffer was allocated with calloc on the C side
    if let converted = String(cString: header, encoding: .shiftJIS) {
        print(converted)
    }
}

For how to call a C library from Swift, see the following: https://qiita.com/ysn551/items/83e06cf74ae628cb573c

Summary here

Python3 string literal

As shown above, C string literals store the encoded bytes directly, so they depend on the development environment. Incidentally, the Swift compiler can only compile UTF-8 files.

In Python 3, on the other hand, the value generated by a string literal is also a number, but it is a Unicode code point rather than an encoded byte value. Therefore, there is no need to think about encoding when passing string literals between files.
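The difference between an encoded byte sequence and a Unicode value can be made concrete in C as well. This sketch decodes a single 3-byte UTF-8 sequence into its code point. The helper name `decode_utf8_3byte` is hypothetical, and it handles only the 3-byte form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one 3-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) into a Unicode code point.
   Returns 0 if the bytes do not have that form. */
uint32_t decode_utf8_3byte(const char *s) {
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    if ((p[0] & 0xF0) != 0xE0) return 0;
    if ((p[1] & 0xC0) != 0x80) return 0;
    if ((p[2] & 0xC0) != 0x80) return 0;
    return ((uint32_t)(p[0] & 0x0F) << 12)
         | ((uint32_t)(p[1] & 0x3F) << 6)
         |  (uint32_t)(p[2] & 0x3F);
}
```

The UTF-8 bytes e3 81 82 decode to 0x3042 (U+3042). A Python 3 string holds that code point value regardless of the encoding of the source file, whereas a C string literal holds the encoded bytes.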

The verification with Python 3 is shown below. Incidentally, Python 2 string literals hold the encoded bytes, so the comparison fails when the source files are saved in different encodings.

Save the following shift_jis.py file in Shift-JIS encoding.

shift_jis.py


#! coding=shift-jis

word = "はじめまして"

Save the following utf8.py file in UTF-8 and execute it.

utf8.py


#! coding=utf-8

import shift_jis as sh

if sh.word == "はじめまして":
    print("true")
else:
    print("false")

When this is run with Python 3, true is displayed; with Python 2, false is displayed.

Final summary

I look forward to working with you in 2018. m (__) m
