When Shift_JIS text is converted to Unicode, the result depends on which character encoding (that is, which conversion table) is used, and problems can occur when different tables are mixed.

Shift_JIS

In Shift_JIS, "あ" is represented by 0x82A0, "い" by 0x82A2, and "う" by 0x82A4. You can check this by writing "あいう" in a Shift_JIS text file and opening the file with a binary editor.

あ 0x82A0 (Shift_JIS)
い 0x82A2 (Shift_JIS)
う 0x82A4 (Shift_JIS)

Figure. The Shift_JIS text file opened with a binary editor
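If no binary editor is at hand, the same bytes can be inspected from Java. The following is a minimal sketch, assuming the JDK in use provides the Shift_JIS charset; the class name SjisHexDump and the helper toHex are illustrative names, not part of the original article.

```java
import java.nio.charset.Charset;

class SjisHexDump {
    // Formats each byte as two uppercase hex digits, e.g. 0x82 0xA0 -> "82A0".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal holds U+3042, U+3044, U+3046 (hiragana a, i, u),
        // written as escapes so the source file encoding does not matter.
        String text = "\u3042\u3044\u3046";
        byte[] sjis = text.getBytes(Charset.forName("Shift_JIS"));
        System.out.println(toHex(sjis)); // prints 82A082A282A4
    }
}
```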

Shift_JIS -> Unicode conversion (MS932)

Because Shift_JIS and Unicode assign different codes to the same character (in the example above, "あ" is 0x82A0 in Shift_JIS), a conversion rule, that is, a character encoding, has to be chosen. With MS932, the default character encoding for Java on Japanese Windows, the Shift_JIS string "あいう" is converted as follows.

あ 0x82A0 (Shift_JIS) -> U+3042 (Unicode), converted with MS932
い 0x82A2 (Shift_JIS) -> U+3044 (Unicode), converted with MS932
う 0x82A4 (Shift_JIS) -> U+3046 (Unicode), converted with MS932

The result of conversion with MS932 can also be confirmed using the Java tool native2ascii.

native2ascii -encoding MS932 sjis_abc.txt
\u3042\u3044\u3046
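The native2ascii tool is no longer shipped with JDK 9 and later, so the same check can be approximated in code. This is a rough sketch, not a reimplementation of the tool: it escapes every UTF-16 code unit (native2ascii leaves ASCII characters as they are), and the class name DecodeToEscapes and the method toUnicodeEscapes are hypothetical names chosen for illustration. It assumes the JDK in use provides the MS932 charset.

```java
import java.nio.charset.Charset;

class DecodeToEscapes {
    // Decodes the bytes with the named charset and renders every UTF-16
    // code unit as a backslash-u escape with four lowercase hex digits.
    static String toUnicodeEscapes(byte[] bytes, String charsetName) {
        String decoded = new String(bytes, Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (char c : decoded.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Shift_JIS bytes 82A0 82A2 82A4 from the example above
        byte[] sjis = {
            (byte) 0x82, (byte) 0xA0,
            (byte) 0x82, (byte) 0xA2,
            (byte) 0x82, (byte) 0xA4
        };
        System.out.println(toUnicodeEscapes(sjis, "MS932"));
    }
}
```

Passing "Cp943C" instead of "MS932" as the second argument repeats the check with the other conversion table.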

Besides MS932, Cp943C is another character encoding that is often used for Japanese strings. Converting "あいう" to Unicode with Cp943C gives the same result.

あ 0x82A0 (Shift_JIS) -> U+3042 (Unicode), converted with Cp943C
い 0x82A2 (Shift_JIS) -> U+3044 (Unicode), converted with Cp943C
う 0x82A4 (Shift_JIS) -> U+3046 (Unicode), converted with Cp943C

native2ascii gives the same output.

native2ascii -encoding Cp943C sjis_abc.txt
\u3042\u3044\u3046

However, the following characters are converted to different Unicode characters by MS932 and by Cp943C.

－ 0x817C (Shift_JIS)
― 0x815C (Shift_JIS)
～ 0x8160 (Shift_JIS)
∥ 0x8161 (Shift_JIS)
￤ 0xFA55 (Shift_JIS)

This difference comes from the specifications of the two encodings; it is not a bug.

The table is as follows.

Character	Shift_JIS	Converted with MS932	Converted with Cp943C
－	0x817C	U+FF0D	U+2212
―	0x815C	U+2015	U+2014
～	0x8160	U+FF5E	U+301C
∥	0x8161	U+2225	U+2016
￤	0xFA55	U+FFE4	U+00A6

Below are the results with native2ascii.

native2ascii -encoding MS932 sjis.txt
\uff0d\u2015\uff5e\u2225\uffe4

native2ascii -encoding Cp943C sjis.txt
\u2212\u2014\u301c\u2016\u00a6
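The comparison can also be run directly in Java by decoding each two-byte Shift_JIS code with both charsets. This is only a sketch under some assumptions: the JDK in use must provide both MS932 and Cp943C (they belong to the extended charsets, which a reduced runtime may not include), and the class name CompareSjisDecoders and the method decode are illustrative names.

```java
import java.nio.charset.Charset;

class CompareSjisDecoders {
    // The five Shift_JIS codes listed in the table above
    static final int[] CODES = {0x817C, 0x815C, 0x8160, 0x8161, 0xFA55};

    // Decodes one two-byte Shift_JIS code with the named charset and
    // returns the first Unicode code point of the decoded string.
    static int decode(int code, String charsetName) {
        byte[] bytes = {(byte) (code >> 8), (byte) code};
        String decoded = new String(bytes, Charset.forName(charsetName));
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        for (int code : CODES) {
            System.out.printf("0x%04X  MS932: U+%04X  Cp943C: U+%04X%n",
                    code, decode(code, "MS932"), decode(code, "Cp943C"));
        }
    }
}
```

Each printed line should correspond to one row of the table above. A check like this can help when two systems exchange Shift_JIS data and it is unclear which conversion table each side applies.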

This article is reprinted from the following page.
https://sites.google.com/site/myitmemo/java-kanren/unicode/ms932-vs-cp943c
