I needed to convert a large number of names (furigana) to romaji, so I looked for a gem that does this. I found one that converts general text to romaji (for example, [Romaji](https://github.com/makimoto/romaji)), but nothing specialized for names, so I made one myself.
With it you can correctly convert 99% of names to romaji!
… By the way, some of you may have thought, "Specialized for names? Isn't romaji the same for ordinary text and for names?" or "Only 99% of names? Convert romaji with 100% accuracy!"
However, when I looked into it, the world of romaji turned out to be quite dark...
The occasion where you need your name in romaji is a passport application. The Ministry of Foreign Affairs explains the notation used in passport applications, and according to that explanation the rules adopted are slightly different from ordinary romaji.
I will do the conversion based on these rules.
I will explain using the following code I wrote.[^1]
[^1]: Actually, character encoding conversion and hiragana → katakana conversion happen before this.
kana.gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])オ\z/){ "o" }
    .gsub(/(?<=[オト])オ/){ oh ? "h" : "" }
    .gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])ウ/){ oh ? "h" : "" }
    .gsub(/(?<=[ウクスツヌフムユルヴグズヅブプュ])ウ/){ "" }
    .gsub(/[ア-ヴー\-][ァィゥェォャュョ]?/, ConversionTable)
    .gsub(/ッ(.)/){ ($1 == "c" ? "t" : $1) + $1 }
    .gsub(/n(?=[bmp])/){ "m" }
Romaji conversion basically means converting one character at a time, "ア" to "a", "キャ" to "kya", and so on (two characters at a time when the character is followed by a small ァィゥェォャュョ).
In the code above, this is the following part.
    .gsub(/[ア-ヴー\-][ァィゥェォャュョ]?/, ConversionTable)
ConversionTable is a hash containing the basic katakana-to-romaji conversion rules, such as {"ア"=>"a", ...}. Since "ッ" is converted later by a different step, it is left unconverted here, as in {..., "ッ"=>"ッ"}.
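As a rough illustration of the table lookup, here is a minimal sketch. `SAMPLE_TABLE`, `KANA_UNIT`, and `table_lookup` are names I made up for this example, and the table is heavily abbreviated; the real ConversionTable is not shown in this post. Note that when `gsub` is given a hash, a match with no key is replaced by an empty string, which is why "ッ" has to map to itself to survive this step.

```ruby
# Hypothetical, heavily abbreviated table for illustration only.
# A real table needs an entry for every katakana unit.
SAMPLE_TABLE = {
  "ア" => "a", "カ" => "ka", "キ" => "ki", "キャ" => "kya",
  "タ" => "ta", "ン" => "n",
  "ッ" => "ッ" # kept as-is so a later step can expand it
}.freeze

# One katakana, optionally followed by one small kana.
KANA_UNIT = /[ア-ヴー\-][ァィゥェォャュョ]?/

# Replace each matched unit with its table entry.
# Units missing from the table are dropped by gsub.
def table_lookup(kana)
  kana.gsub(KANA_UNIT, SAMPLE_TABLE)
end

puts table_lookup("キャタ")
puts table_lookup("カッタ")
```

Because the regex is greedy, "キャ" is looked up as one two-character unit instead of "キ" followed by "ャ".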
Basically, "ン" is converted to "n", except that it becomes "m" before "b", "m", or "p". That is handled in the following part.
    .gsub(/n(?=[bmp])/){ "m" }
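To see this step in isolation, here is a small sketch that applies it to strings already converted to romaji. The method name `n_before_bmp` and the sample inputs are mine, chosen only for illustration.

```ruby
# Turn "n" into "m" when the next letter is b, m, or p.
# The lookahead (?=...) checks the next letter without consuming it.
def n_before_bmp(romaji)
  romaji.gsub(/n(?=[bmp])/) { "m" }
end

puts n_before_bmp("shinba")
puts n_before_bmp("honma")
puts n_before_bmp("kanta")
```

An "n" followed by any other letter, such as the one in "kanta", is left alone.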
Basically, "ッ" is converted to the consonant that follows it, except before "chi", "cha", "chu", and "cho", where it becomes "t". In romaji, nothing other than those four starts with "c", so it is enough to check whether the next letter is "c". The following part does that.
    .gsub(/ッ(.)/){ ($1 == "c" ? "t" : $1) + $1 }
... By the way, this assumes there are no names that end with "ッ" or that have a vowel right after "ッ"...
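Here is a small sketch of the doubled-consonant step on its own. It assumes the input is a half-converted string in which everything except "ッ" is already romaji; the method name `expand_sokuon` and the sample inputs are hypothetical.

```ruby
# Replace "ッ" with a copy of the next letter, or with "t" when
# the next letter is "c" (chi, cha, chu, cho).
def expand_sokuon(partial)
  partial.gsub(/ッ(.)/) { ($1 == "c" ? "t" : $1) + $1 }
end

puts expand_sokuon("hoッta")
puts expand_sokuon("haッcho")
```

A trailing "ッ" has no following character for `(.)` to capture, so it would be left in the output unchanged, which is why the assumption about names matters.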
As for long vowels, the long vowels of "o" and "u" are in principle not written.[^2] In other words, names such as "オオノ", "コウタ", and "ヒュウガ" become "ono", "kota", and "hyuga".
[^2]: By the way, a long "i" spelled out as in "ニイナ" is written ("niina"), but a long vowel written with "ー" as in "ニーナ" is not ("nina"), even though the two are pronounced the same.
However, there is an exception: a trailing "オ" (as in "セノオ") is kept, giving "oo" ("senoo"). That is handled in the following part.
kana.gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])オ\z/){ "o" }        # exception at the end of the name
    .gsub(/(?<=[オト])オ/){ oh ? "h" : "" }                          # oオ
    .gsub(/(?<=[オコソトノホモヨロヲゴゾドボポョ])ウ/){ oh ? "h" : "" } # oウ
    .gsub(/(?<=[ウクスツヌフムユルヴグズヅブプュ])ウ/){ "" }            # uウ
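The four substitutions can be tried on their own with the sketch below. It works on katakana and returns a half-converted string, since the table lookup comes afterwards. `O_ROW`, `U_ROW`, and `collapse_long_vowels` are names I introduced for this example, and the `oh:` keyword stands in for the `oh` flag in the code, which switches long "o" to the "oh" spelling.

```ruby
# Katakana ending in an "o" sound or a "u" sound (illustrative constants).
O_ROW = "オコソトノホモヨロヲゴゾドボポョ"
U_ROW = "ウクスツヌフムユルヴグズヅブプュ"

# Drop (or mark with "h") the second half of a long "o"/"u".
# Returns katakana with long vowels removed; a kept trailing "オ"
# is already written as the Latin letter "o".
def collapse_long_vowels(kana, oh: false)
  kana
    .gsub(/(?<=[#{O_ROW}])オ\z/) { "o" }        # trailing オ is kept as "o"
    .gsub(/(?<=[オト])オ/) { oh ? "h" : "" }     # オ after オ or ト only
    .gsub(/(?<=[#{O_ROW}])ウ/) { oh ? "h" : "" } # ウ after an "o" sound
    .gsub(/(?<=[#{U_ROW}])ウ/) { "" }            # ウ after a "u" sound
end

puts collapse_long_vowels("オオノ")
puts collapse_long_vowels("セノオ")
puts collapse_long_vowels("オオノ", oh: true)
```

The order matters: the trailing exception runs first, so by the time the general "オ" rule runs, the final "オ" has already been turned into a Latin "o" and can no longer match.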
By the way, writing "oh" for "オオ" and "オウ" also seems to be allowed, so I made it possible to switch between the two with an option (the oh variable in the code).
If you thought "Isn't there something odd about the oオ line?", you are sharp. I will explain that later.
That covers all the rules written in the Ministry of Foreign Affairs material mentioned earlier. I'm sure some of you thought, "That's easy."
However, the problem is the "long vowel" mentioned earlier. If something is a long vowel, you just follow the rules above.
But before that, you have to decide whether it is a long vowel at all. From here on is the realm of darkness...
"オオ", "オウ", and "ウウ" should uniformly be shortened to "o" or "u", apart from the trailing exception. However, there are cases where the kana have the shape "オオ", "オウ", or "ウウ" but are not a long vowel.
For example, "ヒロオカ" (Hirooka), "コウチワ" (Kouchiwa, "small fan"), and "マツウラ" (Matsuura).[^3] These contain the "o o", "o u", and "u u" patterns, but they split like "hiro" + "oka", so the "ロオ" is not a long "ロー", and the result is "hirooka".
[^3]: The two other than "Hirooka" are taken from the examples of the Saitama Prefecture Passport Center.
For a human who can see the kanji this is a relatively simple matter, but how can a machine decide?
As I wrote above, it is relatively easy for a human who sees the kanji. So if you only have the furigana, is there a way to decide reliably?
… Unfortunately, I don't think so. For example, the "Kouchiwa" (small fan) mentioned earlier is "コ" + "ウチワ", so it is not a long vowel. But suppose there were another surname, also read "コウチワ", in which "コウ" is a single long vowel.[^4] The kana are identical, yet the former must be converted to "kouchiwa" and the latter to "kochiwa". In other words, it is impossible to handle this with kana alone.
[^4]: "Kouchiwa" (small fan) seems to be a real surname, but the long-vowel "Kochiwa" is a surname I made up for the explanation, and I don't know whether it actually exists.
Then how about supplying the kanji as well as the kana? Keep the common readings of each kanji as data, work out which part of the kana corresponds to which kanji, as in "松浦" + "マツウラ" → "松 (マツ) 浦 (ウラ)", and convert kanji by kanji... Then the examples above seem to work. (It also seems like a lot of trouble...)
However, the Ministry of Foreign Affairs material contains the surname example "ミソノウ (御園生)". I think this is "御 (ミ) 園 (ソノ) 生 (ウ)", yet in romaji it is "misono". **What is that supposed to mean!**
In the first place, I feel there are only two kinds of long vowel spelled with "オ": "オオ" (大 and so on) and "トオ" (遠). So, as shown below, it seems the number of names handled correctly can be increased by applying the "oオ" rule only after "オ" and "ト".
    .gsub(/(?<=[オト])オ/){ oh ? "h" : "" } # oオ
But if there were a surname like "ヒトオカ" (Hitooka), this wouldn't work.
That's why I gave up on such names and decided to be content with a 99% conversion rate. If you are aiming higher, I think a reasonably realistic approach is to keep data on hard-to-convert surnames and handle them as exceptions. I look forward to the challenge from those who are prepared to peer into the darkness.
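The exception-data idea could look something like the sketch below. This is only one possible shape, not something the gem does. `HARD_SURNAMES`, `romanize_name`, and the `fallback:` parameter are hypothetical names, and the two entries are just the examples from this post.

```ruby
# Hypothetical exception data: readings whose "oo" / "uu" must be kept.
HARD_SURNAMES = {
  "ヒロオカ" => "hirooka",
  "マツウラ" => "matsuura"
}.freeze

# Look the reading up in the exception data first; otherwise hand it
# to the rule-based converter passed in as `fallback` (any callable).
def romanize_name(kana, fallback:, exceptions: HARD_SURNAMES)
  exceptions.fetch(kana) { fallback.call(kana) }
end

# Stand-in for the rule-based conversion, for demonstration only.
rules = ->(kana) { "(rule-based result for #{kana})" }

puts romanize_name("ヒロオカ", fallback: rules)
puts romanize_name("オオノ", fallback: rules)
```

A table keyed on kana alone still cannot separate two surnames with the same reading, so resolving those would need the kanji as part of the key.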