[Ruby] I tried to convert names to romaji and was swallowed by the darkness of romaji

I needed to convert a large number of names (furigana) to romaji, so I looked for a gem. I found ones that convert general text to romaji (for example, [Romaji](https://github.com/makimoto/romaji)), but nothing that specializes in names, so I wrote my own.

It converts about 99% of names to romaji correctly!

… By the way, some of you may have thought: "Specialized in names? Isn't romaji the same for ordinary sentences and names?" or "99% of names? Just convert to romaji with 100% accuracy."

However, when I looked into it, the world of romaji turned out to be quite dark ...

About romaji conversion of names

About the only time you need your name in romaji is when applying for a passport. The Ministry of Foreign Affairs explains the notation to use on the passport application, and it seems to adopt rules that differ slightly from ordinary romaji.

- Hepburn Romanization Table

The conversion follows these rules.

I will explain using the following code that I wrote. [^1]

[^1]: In the actual code, character-encoding conversion and hiragana-to-katakana conversion run before this.

kana.gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])オ\z/){ "o" } # trailing オ is written out
    .gsub(/(?<=[オト])オ/){ oh ? "h" : "" } # long vowel オオ
    .gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])ウ/){ oh ? "h" : "" } # long vowel オウ
    .gsub(/(?<=[ウクスツヌフムユルヴグズヅブプュ])ウ/){ "" } # long vowel ウウ
    .gsub(/[ア-ヴー\-][ァィゥェォャュョ]?/, ConversionTable) # kana to romaji via the table
    .gsub(/ッ(.)/){ ($1 == "c" ? "t" : $1) + $1 } # sokuon
    .gsub(/n(?=[bmp])/){ "m" } # ン before b, m, p

Basic conversion

Romaji conversion basically means converting ア to "a", キャ to "kya", and so on, one kana at a time (or two at a time when a small ァ, ィ, ゥ, ェ, ォ, ャ, ュ, or ョ follows).

This is the corresponding part of the code above.

    .gsub(/[ア-ヴー\-][ァィゥェォャュョ]?/, ConversionTable)

ConversionTable is a hash containing the basic conversion rules from katakana to romaji, such as {"ア" => "a", ...}. Because ッ is converted later by a separate step, it is left unconverted here, as in {..., "ッ" => "ッ"}.
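
As a minimal sketch of how this step behaves, the following uses a tiny table with only a handful of entries. `SAMPLE_TABLE`, its contents, and the helper `basic_convert` are illustrative assumptions, not the actual ConversionTable, which would have to cover every kana.

```ruby
# Illustrative subset only (assumption); the real table would cover every kana.
SAMPLE_TABLE = {
  "ア" => "a", "カ" => "ka", "キ" => "ki", "キャ" => "kya",
  "タ" => "ta", "ナ" => "na",
  "ッ" => "ッ", # sokuon is passed through for a later step
}.freeze

# Each match is one kana, or two when a small kana follows (キャ etc.).
# String#gsub with a Hash replaces each match with the value stored for it.
def basic_convert(kana)
  kana.gsub(/[ア-ヴー\-][ァィゥェォャュョ]?/, SAMPLE_TABLE)
end

basic_convert("アキナ") # => "akina"
basic_convert("キャ")   # => "kya"
basic_convert("カッタ") # => "kaッta"
```

Note that when a matched string is not a key of the hash, `gsub` replaces it with an empty string, so a table used this way needs an entry for every kana it may encounter.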

Hatsuon (ン)

Basically, ン is converted to "n", except that it becomes "m" before "b", "m", and "p". The following part handles that.

    .gsub(/n(?=[bmp])/){ "m" }
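
A small sketch of this step in isolation, assuming the input is the romaji already produced by the table lookup (the helper name `fix_n_before_bmp` is made up for this example):

```ruby
# An "n" directly followed by b, m, or p can only come from ン here,
# because the ナ-row kana always produce "n" plus a vowel.
def fix_n_before_bmp(romaji)
  romaji.gsub(/n(?=[bmp])/) { "m" }
end

fix_n_before_bmp("honma") # => "homma"
fix_n_before_bmp("nanba") # => "namba"
fix_n_before_bmp("kanda") # => "kanda" (unchanged)
```

The lookahead `(?=[bmp])` checks the next character without consuming it, so only the "n" itself is replaced.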

Sokuon (ッ)

Basically, ッ is converted to a copy of the consonant that follows it, except before "chi", "cha", "chu", and "cho", where it becomes "t". In this romaji nothing other than those four starts with "c", so it is enough to check whether the following character is "c". The following part handles that.

    .gsub(/ッ(.)/){ ($1 == "c" ? "t" : $1) + $1 }

... By the way, I am assuming that there are no names that end with ッ or that have a vowel right after ッ ...
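
To illustrate, here is the same substitution wrapped in a hypothetical helper (`expand_sokuon` is a name invented for this sketch). The inputs assume the table lookup has already run, so only ッ is still kana.

```ruby
# ッ followed by any character: double that character,
# or emit "t" when the next character is "c" (chi, cha, chu, cho).
def expand_sokuon(text)
  text.gsub(/ッ(.)/) { ($1 == "c" ? "t" : $1) + $1 }
end

expand_sokuon("haッtori") # => "hattori"
expand_sokuon("hoッchi")  # => "hotchi"
expand_sokuon("aッ")      # => "aッ" (a trailing ッ has nothing after it, so it is left alone)
```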

Long vowels (オオ, オウ, ウウ)

As a rule, the long vowels of "o" and "u" are not written. [^2] In other words, names such as オオノ, コウタ, and ヒュウガ become "ono", "kota", and "hyuga".

[^2]: Incidentally, a long vowel written with イ, as in ニイナ, is written out, but a long vowel written with ー, as in ニーナ, is not written even though it is pronounced the same.

However, there is an exception: an オ at the end of a name (as in セノオ) is written out, giving "oo". The following part handles this.

kana.gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])オ\z/){ "o" } # exception: オ at the end
    .gsub(/(?<=[オト])オ/){ oh ? "h" : "" } # オオ
    .gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])ウ/){ oh ? "h" : "" } # オウ
    .gsub(/(?<=[ウクスツヌフムユルヴグズヅブプュ])ウ/){ "" } # ウウ
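
The four substitutions can be tried on their own with a sketch like the following. The method name `strip_long_vowels`, the constants, and the keyword argument are assumptions made for this example; `oh` stands for a flag that chooses between dropping the long vowel and writing "h" in its place.

```ruby
O_ROW = "オコソトノホモヨロヲゴゾドボポョ" # kana ending in the vowel "o"
U_ROW = "ウクスツヌフムユルヴグズヅブプュ" # kana ending in the vowel "u"

# Works on katakana, before the table lookup. The output is still kana,
# apart from the ASCII "o" / "h" markers that the table step leaves alone.
def strip_long_vowels(kana, oh: false)
  kana.gsub(/(?<=[#{O_ROW}])オ\z/) { "o" }          # trailing オ is kept as "o"
      .gsub(/(?<=[オト])オ/) { oh ? "h" : "" }       # オオ
      .gsub(/(?<=[#{O_ROW}])ウ/) { oh ? "h" : "" }   # オウ
      .gsub(/(?<=[#{U_ROW}])ウ/) { "" }              # ウウ
end

strip_long_vowels("オオノ")           # => "オノ"    (later "ono")
strip_long_vowels("オオノ", oh: true) # => "オhノ"   (later "ohno")
strip_long_vowels("コウタ")           # => "コタ"    (later "kota")
strip_long_vowels("ヒュウガ")         # => "ヒュガ"  (later "hyuga")
strip_long_vowels("セノオ")           # => "セノo"   (later "senoo")
```

The trailing exception has to run first: once the final オ has been turned into an ASCII "o", the later patterns can no longer remove it.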

By the way, writing "oh" for オオ and オウ also seems to be permitted, so I made it possible to switch between the two with an option (the `oh` flag in the code).

If you thought "Isn't there something odd about the オオ rule?", you are sharp. I will explain this later.

The main topic: the dark part

That covers the rules written in the Ministry of Foreign Affairs material mentioned earlier. I'm sure some of you are thinking, "That's easy."

However, the problem is the long vowel mentioned earlier. If something is a long vowel, you just follow the rules above.

But before that, we have to decide whether it is a long vowel at all. From here on is the realm of darkness ...

オオ, オウ, and ウウ that are not long vowels

If every オオ, オウ, and ウウ were a long vowel, they could be uniformly reduced to "o" or "u", apart from the trailing exception. However, there are cases where the sequence オオ, オウ, or ウウ appears but is not a long vowel.

For example, ヒロオカ (Hirooka), コウチワ (Kouchiwa, written with the kanji for "small fan"), and マツウラ (Matsuura). [^3] These contain オオ, オウ, and ウウ, but they split as, for example, ヒロ + オカ, so the ロオ is not a long vowel and the name becomes "hirooka".

[^3]: The two examples other than ヒロオカ come from the Saitama Prefecture Passport Center.

For a human who can see the kanji this is a relatively simple matter, but how can a machine decide?
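
To make the problem concrete, here is a sketch that applies only the オウ and ウウ patterns to the kana of these names. The helper `drop_ou_uu` is a hypothetical wrapper written for this illustration.

```ruby
# Removes ウ after any o-row or u-row kana, treating every such ウ as a long vowel.
def drop_ou_uu(kana)
  kana.gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])ウ/) { "" }
      .gsub(/(?<=[ウクスツヌフムユルヴグズヅブプュ])ウ/) { "" }
end

drop_ou_uu("コウタ")   # => "コタ"   (fine: "kota")
drop_ou_uu("コウチワ") # => "コチワ" (wrong: the name should stay コウチワ, "kouchiwa")
drop_ou_uu("マツウラ") # => "マツラ" (wrong: the name should stay マツウラ, "matsuura")
```

The pattern sees only adjacent kana, so it cannot tell a long vowel from two kana that merely meet at a boundary between kanji.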

Can it be handled with kana alone?

As I wrote above, it is relatively easy for a human who can see the kanji. So if you only have the furigana, is there a way to decide reliably?

… Unfortunately, I don't think so. For example, the "small fan" surname mentioned earlier is コ + ウチワ, so it has no long vowel. But suppose there were another surname, written with different kanji, that is also read コウチワ and in which コウ is a genuine long vowel. [^4] Even though both are コウチワ, the former must be converted to "kouchiwa" and the latter to "kochiwa". In other words, kana alone cannot settle it.

[^4]: The "small fan" surname seems to be a real one, but the second surname is one I made up for this explanation, and I don't know whether it actually exists.

What if kanji data is given as well?

Then how about supplying the kanji as well as the kana? Keep the common readings of each kanji as data, work out which part of the kana corresponds to which kanji, for example "Matsuura" in kanji + マツウラ → マツ for the first kanji and ウラ for the second, and then convert kanji by kanji ... That seems to handle the examples above. (It also seems like a lot of trouble ...)
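
A rough sketch of that idea, under the assumption that a dictionary of readings per kanji is available. `READINGS` and `split_by_kanji` are hypothetical names, and the two dictionary entries are only sample data for this illustration.

```ruby
# Hypothetical readings dictionary: kanji => possible katakana readings.
READINGS = {
  "松" => ["マツ", "ショウ"],
  "浦" => ["ウラ", "ホ"],
}.freeze

# Splits kana into one segment per kanji by trying each known reading in turn.
# Returns an array of segments, or nil when no consistent split exists.
def split_by_kanji(kanji, kana)
  return (kana.empty? ? [] : nil) if kanji.empty?

  head = kanji[0]
  rest = kanji[1..-1]
  READINGS.fetch(head, []).each do |reading|
    next unless kana.start_with?(reading)

    tail = split_by_kanji(rest, kana[reading.length..-1])
    return [reading] + tail if tail
  end
  nil
end

split_by_kanji("松浦", "マツウラ") # => ["マツ", "ウラ"]
split_by_kanji("松浦", "マツ")     # => nil (no reading left for the second kanji)
```

If each segment were then converted separately, the ツ and ウ would fall in different segments and would not be merged as a long vowel. Names with irregular readings would still need entries that a dictionary of common readings does not contain.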

However, the Ministry of Foreign Affairs material includes the surname 御園生 (ミソノウ) as an example. I think this splits as 御 (ミ) + 園 (ソノ) + 生 (ウ), yet in romaji it is "misono". **What is that supposed to mean!**

Handling non-long vowels such as ヒロオカ

In the first place, I feel that the long vowel written with オ appears in only two forms: オオ (as in the kanji for "large") and トオ (as in the kanji for "far"). So, as shown below, applying the オオ rule only after オ and ト seems to increase the number of names that are handled correctly.

    .gsub(/(?<=[オト])オ/){ oh ? "h" : "" } # オオ

But if there were a surname like ヒトオカ, this would not work.
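
The effect of restricting the rule can be checked with a short sketch (the helper `drop_oo` is a name invented here, and it always drops the vowel rather than honoring the "oh" option):

```ruby
# Treats オ as a long vowel only when it directly follows オ or ト.
def drop_oo(kana)
  kana.gsub(/(?<=[オト])オ/) { "" }
end

drop_oo("オオノ")   # => "オノ"     (long vowel removed)
drop_oo("トオヤマ") # => "トヤマ"   (long vowel removed)
drop_oo("ヒロオカ") # => "ヒロオカ" (kept, because ロ is not in the list)
drop_oo("ヒトオカ") # => "ヒトカ"   (removed, although ヒト + オカ has no long vowel)
```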

In conclusion

So I gave up on names like these and decided to be content with a 99% conversion rate. If you are aiming higher, I think a reasonably realistic approach is to keep data on the surnames that are hard to convert and handle them as exceptions. I look forward to the attempts of those who are prepared to look into the darkness.
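
One possible shape for such exception handling is sketched below. Everything in it is an assumption for illustration: the `EXCEPTIONS` data, the method name `romanize_name`, and the `rule_based` argument, which stands in for whatever rule-based converter is in use.

```ruby
# Hypothetical exception data, keyed by [kanji, kana] so that two names with
# the same kana but different kanji can be told apart.
EXCEPTIONS = {
  ["小団扇", "コウチワ"] => "kouchiwa",
  ["松浦", "マツウラ"] => "matsuura",
}.freeze

# Looks the name up in the exception data first and falls back to the rules.
# `rule_based` is any callable that takes kana and returns romaji.
def romanize_name(kanji, kana, rule_based:)
  EXCEPTIONS.fetch([kanji, kana]) { rule_based.call(kana) }
end

# Usage with a stand-in converter that only marks that the rules were used:
fallback = ->(kana) { "RULES(#{kana})" }
romanize_name("松浦", "マツウラ", rule_based: fallback) # => "matsuura"
romanize_name("田中", "タナカ", rule_based: fallback)   # => "RULES(タナカ)"
```

Keying on the kanji as well as the kana is a design choice for this sketch; when only furigana is available, the key would have to be the kana alone, and the same-kana ambiguity described above would remain.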