[Ruby] How I got stuck converting character encoding from UTF-8 to Shift_JIS in Ruby


This post is about converting character encoding from UTF-8 to Shift_JIS in Ruby.

At my current job, there was a request to make Shift_JIS the default character encoding for strings when exporting CSV.

Since Ruby's default string encoding is UTF-8, I assumed it would be enough to convert to Shift_JIS at the moment the CSV is output, but I got stuck.

Other people may get stuck on the same thing, so here is the solution I ended up with.

Where I got stuck

The generate method of the CSV class in Ruby's standard library makes it easy to implement CSV output.

Implementation sketch ↓

require "csv"

text =<<-EOS
id,first name,last name,age
1,taro,tanaka,20
2,jiro,suzuki,18
3,ami,sato,19
4,yumi,adachi,21
EOS

csv = CSV.generate(text, headers: true) do |csv|
  csv.add_row(["5", "saburo", "kondo", "34"])
end

Code reference: https://docs.ruby-lang.org/ja/latest/method/CSV/s/generate.html

When you want to convert to Shift_JIS, you can specify the output encoding with the :encoding option.

CSV.generate(text, headers: true, encoding: "SJIS")

With this option, the output is automatically converted from UTF-8 to Shift_JIS.

However, when I implemented it this way, the following error occurred.

incompatible character encodings: Windows-31J and UTF-8
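The error names Windows-31J rather than Shift_JIS because, as far as I can tell, Ruby treats the name "SJIS" as an alias of Windows-31J (CP932), which is a separate encoding object from Shift_JIS. A quick check, as a minimal sketch:

```ruby
# "SJIS" resolves to Windows-31J, while "Shift_JIS" is a distinct encoding
sjis = Encoding.find("SJIS")
puts sjis.name                              # Windows-31J
puts sjis == Encoding::Windows_31J          # true
puts Encoding.find("Shift_JIS") == sjis     # false
```

So passing encoding: "SJIS" to CSV.generate really requests Windows-31J output, and the character problems described next apply to that encoding.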

Investigating the cause

I investigated why the encoding error occurs.

After checking which string triggered the error, I found that it was the following one.

"AAA−0001"

The string doesn’t look unusual, so why does it cause an error?

On closer inspection, it turns out that an exception is raised when any of the following characters is converted to Shift_JIS.

| Code point (Unicode) | Character | Remarks |
| :--- | :---: | ---: |
| U+00A2 | ¢ | Cent sign (currency) |
| U+00A3 | £ | Pound sign (currency) |
| U+00AC | ¬ | NOT sign |
| U+2016 | ‖ | Double vertical line |
| U+2212 | − | Minus sign |
| U+301C | 〜 | Wave dash |

Reference: https://osa.hatenablog.com/entry/2014/08/21/113602
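To check the table on your own Ruby, the following minimal sketch tries to encode each character to Windows-31J and reports whether the conversion succeeds. The helper name convertible_to_windows31j? is hypothetical and not part of the original code.

```ruby
# Hypothetical helper: true if str converts to Windows-31J without raising
def convertible_to_windows31j?(str)
  str.encode("Windows-31J")
  true
rescue Encoding::UndefinedConversionError
  false
end

# The six characters from the table above
chars = ["\u00A2", "\u00A3", "\u00AC", "\u2016", "\u2212", "\u301C"]
chars.each do |ch|
  puts format("U+%04X convertible: %s", ch.ord, convertible_to_windows31j?(ch))
end
```

A plain ASCII string such as "AAA-0001" (with U+002D) converts without problems; only the look-alike characters trip the converter.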

"AAA−0001"

The character in this string that looks like a hyphen is actually − (minus sign, U+2212), so I assume that is what raised the exception.
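Because U+2212 is visually almost identical to the ASCII hyphen, it helps to list the non-ASCII characters in a suspicious string together with their code points. A minimal sketch, using a hypothetical helper name:

```ruby
# Hypothetical helper: list the code points of all non-ASCII characters in str
def non_ascii_code_points(str)
  str.each_char.reject(&:ascii_only?).map { |ch| format("U+%04X", ch.ord) }
end

p non_ascii_code_points("AAA\u22120001") # ["U+2212"] (minus sign)
p non_ascii_code_points("AAA-0001")      # [] (ASCII hyphen, U+002D)
```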

Solution

Now the cause of the error is clear. So how should we solve it?

The easiest way is to extend the String class using Ruby's open classes.

Ruby places few restrictions on inheriting from or modifying classes. Even built-in classes such as String and Array can be subclassed, or reopened to add your own methods.

So I add a method to the String class that prevents the exception when converting to Windows-31J, as follows.

class String
  # Returns a UTF-8 copy of the string that can be converted to Windows-31J without raising.
  # Uses tr (not tr!) so the receiver is not modified and frozen strings also work.
  def sjisable
    # Replace characters that have no Windows-31J mapping with close equivalents
    from_chr = "\u{301C 2212 00A2 00A3 00AC 2013 2014 2016 203E 00A0 00F8 203A}"
    to_chr = "\u{FF5E FF0D FFE0 FFE1 FFE2 FF0D 2015 2225 FFE3 0020 03A6 3009}"
    str = tr(from_chr, to_chr)
    # Any unconvertible characters not covered by the table become "?"; the result is
    # converted back to UTF-8 so that no exception is raised later
    str.encode("Windows-31J", "UTF-8", invalid: :replace, undef: :replace).encode("UTF-8", "Windows-31J")
  end
end

Code reference: https://qiita.com/yugo-yamamoto/items/0c12488447cb8c2fc018

Calling this method on the string at the point where the exception was raised prevents the exception.

"AAA−0001".sjisable

If you don’t want to use open classes

Open classes are very powerful and, used well, can improve development efficiency.

On the other hand, when you add your own method to one of Ruby's standard classes, anyone other than the person who added it cannot tell from reading the code who defined the method or for what purpose, so the development efficiency of the team as a whole can actually drop.

There is also the drawback that errors may occur at unexpected times.

Some may also question whether converting to Shift_JIS is really the String class’s responsibility. Since the conversion is only needed when producing CSV, shouldn't it belong to the class that handles CSV?

So, if you would rather not use open classes, it is better to create something like a CsvUtility class, gather the CSV-handling procedures there, and have it output in either Shift_JIS or UTF-8.
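As a rough sketch of that idea, the module below keeps the replacement table from the sjisable method but leaves String untouched. The module name CsvUtility comes from the text above; the method names sjis_safe and generate_sjis, and the use of CSV.generate_line, are my own assumptions rather than the implementation actually used.

```ruby
require "csv"

# Hypothetical utility module; the method names are illustrative assumptions.
module CsvUtility
  # Same replacement table as the sjisable method
  FROM_CHARS = "\u{301C 2212 00A2 00A3 00AC 2013 2014 2016 203E 00A0 00F8 203A}"
  TO_CHARS   = "\u{FF5E FF0D FFE0 FFE1 FFE2 FF0D 2015 2225 FFE3 0020 03A6 3009}"

  module_function

  # Returns a UTF-8 string that is safe to convert to Windows-31J
  def sjis_safe(value)
    value.to_s
         .tr(FROM_CHARS, TO_CHARS)
         .encode("Windows-31J", "UTF-8", invalid: :replace, undef: :replace)
         .encode("UTF-8", "Windows-31J")
  end

  # Builds the CSV in UTF-8 from an array of rows, then converts it in one step
  def generate_sjis(rows)
    utf8 = rows.map { |row| CSV.generate_line(row.map { |field| sjis_safe(field) }) }.join
    utf8.encode("Windows-31J")
  end
end

csv = CsvUtility.generate_sjis([["id", "code"], ["1", "AAA\u22120001"]])
puts csv.encoding # Windows-31J
```

Building the CSV in UTF-8 first and converting once at the end keeps the encoding concern in a single place, which matches the responsibility argument above.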

Reference

Junichi Ito. Introduction to Ruby for Aspiring Professionals: From Language Specification to Test-Driven Development and Debugging Techniques