The program in this article handles the example sentence below reasonably well, but I have since found that it does not add furigana correctly for some words.
The following site presents an improved version of this article's code, so please take a look: https://www.karelie.net/python3-mecab-html-furigana-1/
Many thanks to Kimera for pointing out the problems in this article.
Given a sentence such as
私は大学を辞めたい ("I want to quit college")
the goal is to generate HTML like the following automatically,
<ruby><rb>私</rb><rt>わたし</rt></ruby>は
<ruby><rb>大学</rb><rt>だいがく</rt></ruby>を
<ruby><rb>辞</rb><rt>や</rt></ruby>めたい
so that ruby (furigana) is attached to the kanji.
The environment is macOS with Python 3.5.1. The following are required:
1. MeCab (in my case it was already on the Mac by default, so I will skip its installation)
2. mecab-python3
3. pip (required to install mecab-python3)
MeCab is a morphological analysis tool; among other things, it gives you the readings of kanji.
You can try it by running the mecab command on the command line and typing the text on the next line.
$ mecab
私は大学を辞めたい
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
大学	名詞,一般,*,*,*,*,大学,ダイガク,ダイガク
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
辞め	動詞,自立,*,*,一段,連用形,辞める,ヤメ,ヤメ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS
To use MeCab from Python 3, you need a package called mecab-python3.
pip is the tool that manages such packages (similar to gem in Ruby), so if pip is not installed, install it first.
$ easy_install pip
By the way, pip seems to be included by default from Python 3.4 onward.
After installing pip, install mecab-python3.
$ pip install mecab-python3
You can list the currently installed packages with the pip list command. If mecab-python3 appears in the list, the installation succeeded.
$ pip list
mecab-python3 (0.7)
Now that the environment is ready, let's try using MeCab from Python and see what data we get back.
mecab.py
#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-
import sys
import MeCab
mecab = MeCab.Tagger("-Ochasen")  # create the MeCab tagger
text = mecab.parse('私は大学を辞めたい')  # analyze the sentence ("I want to quit college")
print(text)
$ mecab.py
私	ワタシ	私	名詞-代名詞-一般
は	ハ	は	助詞-係助詞
大学	ダイガク	大学	名詞-一般
を	ヲ	を	助詞-格助詞-一般
辞め	ヤメ	辞める	動詞-自立	一段	連用形
たい	タイ	たい	助動詞	特殊・タイ	基本形
EOS
"Parse" here means "analyze".
After creating the tagger with
mecab = MeCab.Tagger("-Ochasen")
call the parse method, passing the text you want to analyze as its argument:
text = mecab.parse('私は大学を辞めたい')
This returns the result in the format shown in the output above.
That works, but this format is awkward to handle programmatically because the fields are separated only by whitespace.
So we switch to the Node format, which is easier to work with.
mecab.py
import sys
import MeCab
mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # parse an empty string once first (workaround)
node = mecab.parseToNode('私は大学を辞めたい')
while node:
    print(node.surface + "\t" + node.feature)
    node = node.next
$ mecab.py
	BOS/EOS,*,*,*,*,*,*,*,*
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
大学	名詞,一般,*,*,*,*,大学,ダイガク,ダイガク
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
辞め	動詞,自立,*,*,一段,連用形,辞める,ヤメ,ヤメ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
	BOS/EOS,*,*,*,*,*,*,*,*
This is easier to handle because the analysis results come back separated by ",".
As before, create the tagger first. It seems that you then need to parse an empty string once before doing anything else, so do that.
After that,
node=mecab.parseToNode('私は大学を辞めたい')
pass the sentence as the argument of the parseToNode method like this.
Nodes are processed one word at a time, so use a while statement to loop until the end of the sentence (until node is empty).
Inside the loop, node.surface gives the original text of the word and node.feature gives the comma-separated analysis data.
By the way, if you forget
node=node.next
the loop never ends, so be careful.
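One way to make the node.next step hard to forget is to hide the traversal in a generator. The following is a minimal sketch, not part of the original script: FakeNode is a hypothetical stand-in with the same three attributes used above (surface, feature, next), so the example runs without MeCab installed, and iter_nodes is a name made up for illustration.

```python
from typing import Iterator, Optional


class FakeNode:
    """Hypothetical stand-in for a MeCab node: surface, feature, next."""

    def __init__(self, surface: str, feature: str) -> None:
        self.surface = surface
        self.feature = feature
        self.next: Optional["FakeNode"] = None


def iter_nodes(node) -> Iterator:
    """Yield each node of a linked list, advancing via .next until it is None."""
    while node:
        yield node
        node = node.next  # the step that must not be forgotten


# Build a two-node chain by hand to exercise the loop.
first = FakeNode("大学", "名詞,一般,*,*,*,*,大学,ダイガク,ダイガク")
first.next = FakeNode("を", "助詞,格助詞,一般,*,*,*,を,ヲ,ヲ")

surfaces = [n.surface for n in iter_nodes(first)]
print(surfaces)
```

With a helper like this, the main loop could presumably be written as a plain for statement over iter_nodes(mecab.parseToNode(...)).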
So far, we have managed to use MeCab from Python.
Up to this point there is plenty of information on the Internet, but from here on there is not much, and this is the hard part.
Basically, though, the job is to turn the node.feature data above into HTML.
In HTML, ruby is written in the form
<ruby><rb>文字</rb><rt>もじ</rt></ruby>
so let's put the original kanji in the rb element and its reading in the rt element.
mecab.py
#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-
import sys
import MeCab
import re
mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # parse an empty string once first (workaround)
node = mecab.parseToNode('私は大学を辞めたい')
while node:
    origin = node.surface  # the original word
    kana = node.feature.split(",")[7]  # the reading
    # Check with a regular expression whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern, origin)
    # If origin is empty or does not start with a kanji, no furigana is needed, so print it as is
    if origin != "" and matchOB:
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin, kana), end="")
    else:
        print(origin)
    node = node.next
$ mecab.py
<ruby><rb>私</rb><rt>ワタシ</rt></ruby>は
<ruby><rb>大学</rb><rt>ダイガク</rt></ruby>を
<ruby><rb>辞め</rb><rt>ヤメ</rt></ruby>たい
It is starting to look right.
First, the first two lines of the loop assign the word and its reading to origin and kana, respectively.
The reading is the field at index 7 (counting from zero) of the comma-separated node.feature, so we split the string with the split function and take element 7 of the resulting array.
origin=node.surface #the original word
kana=node.feature.split(",")[7] #the reading
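The index-7 lookup can be wrapped in a small helper that does not fail when the field is missing. This is only a sketch: reading_from_feature and READING_INDEX are hypothetical names, and the idea that some entries carry fewer than eight fields is an assumption about dictionary behaviour, not something shown in this article. The "*" case, on the other hand, does appear in the BOS/EOS lines of the output above.

```python
from typing import Optional

READING_INDEX = 7  # position of the reading in the feature strings shown above


def reading_from_feature(feature: str) -> Optional[str]:
    """Return the reading field, or None when it is absent or a '*' placeholder."""
    fields = feature.split(",")
    if len(fields) <= READING_INDEX:
        return None  # assumed case: entry with too few fields
    reading = fields[READING_INDEX]
    return None if reading == "*" else reading


print(reading_from_feature("名詞,一般,*,*,*,*,大学,ダイガク,ダイガク"))  # ダイガク
print(reading_from_feature("BOS/EOS,*,*,*,*,*,*,*,*"))  # None
```

A caller can then skip the ruby markup whenever the helper returns None instead of raising an IndexError.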
Next, I want to add furigana only when the original word is kanji, so we use a regular expression to check whether it is.
To use regular expressions, first import re at the top of the file.
import re
I won't go into the details of regular expressions here, but the re module imported above provides several functions.
This time I want to know whether the first character of the word origin is a kanji, so I use the match function, which only matches at the start of the string. Pass the pattern (kanji, in this case) as the first argument and the word as the second.
If there is a match, a match object holding information about the match is returned. If there is no match, None is returned.
#Check with a regular expression whether the word starts with a kanji
pattern = "[一-龥]" #Kanji pattern
matchOB = re.match(pattern , origin) #None when the word does not start with a kanji
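To make the "start of the string only" behaviour of re.match concrete, here is a small self-contained check. The helper name starts_with_kanji is made up for this sketch. The character class is the same one used in the article and covers U+4E00 through U+9FA5.

```python
import re

KANJI = re.compile("[一-龥]")  # U+4E00 .. U+9FA5


def starts_with_kanji(word: str) -> bool:
    """True when the first character of word falls in the kanji range."""
    return KANJI.match(word) is not None


print(starts_with_kanji("辞め"))  # True
print(starts_with_kanji("お茶"))  # False: match only looks at the start
print(KANJI.search("お茶") is not None)  # True: search scans the whole string
```

Words such as お茶 that begin with kana but contain kanji are therefore skipped by the script as written. Whether that is acceptable depends on the text being processed.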
Finally, an if statement makes the decision.
In plain words:
・ When origin is kanji ➡ wrap it in ruby markup and output it
・ When origin is anything else (hiragana, katakana, numbers, etc.) ➡ output it as is
#If origin is empty or does not start with a kanji, no furigana is needed, so print it as is
if matchOB and origin != "" :
    print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
else :
    print(origin)
When origin is not kanji, matchOB is None, so the condition is false and execution moves to the else branch.
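The same decision can be expressed as a function that returns the markup instead of printing it, which makes it easier to test. This is a sketch under assumptions: render_token is a hypothetical name, and the call to html.escape is an addition that the original script does not make.

```python
import re
from html import escape

KANJI = re.compile("[一-龥]")


def render_token(origin: str, kana: str) -> str:
    """Wrap origin in ruby markup when it starts with a kanji, else return it as is."""
    if origin and KANJI.match(origin):
        return "<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(
            escape(origin), escape(kana))
    return escape(origin)  # escaping is an addition, not in the original


print(render_token("大学", "だいがく"))
print(render_token("は", "は"))
```

Returning strings rather than printing also lets the caller decide where line breaks go.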
This is taking shape nicely, but there is a problem.
I want the furigana to be in hiragana, but the readings MeCab returns are in katakana, so they need to be converted to hiragana.
In Ruby a single function call can do this, but as far as I could find, Python does not seem to offer one.
So we need to work something out ourselves.
First, prepare arrays of hiragana and katakana characters like these (abbreviated here).
hiragana = ["あ", "い", "う", "ん"]  # hiragana[0] == "あ"
katakana = ["ア", "イ", "ウ", "ン"]  # katakana[0] == "ア"
Then, for example, suppose you want to change the word "アイ" to "あい".
text = list("アイ")  # text[0] == "ア", text[1] == "イ"
kana = ""  # variable that collects the hiragana
for hoge in text:  # repeat once per character (each one is assigned to hoge)
    for i in range(len(katakana)):
        if katakana[i] == hoge:
            kana += hiragana[i]
print(kana)  # あい
I want to split "アイ" into ア and イ and check them one at a time, so list splits the string character by character into an array.
Then a for loop runs once per array element (= once per character).
Inside it, another loop runs over the katakana array and compares each entry with hoge. In the first pass hoge is "ア", so it matches katakana[0] when i is 0.
Finally hiragana[i], that is hiragana[0], that is "あ", is appended to the variable kana.
Repeating this for every character leaves the hiragana version in kana.
Here is the code with this turned into a function.
mecab.py
#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-
import sys
import MeCab
import re
def henkan(text):
    hiragana = [chr(i) for i in range(12353, 12436)]
    katakana = [chr(i) for i in range(12449, 12532)]
    kana = ""
    # Convert katakana to hiragana
    for ch in list(text):
        for i in range(83):
            if ch == katakana[i]:
                kana += hiragana[i]
    return kana
mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # parse an empty string once first (workaround)
node = mecab.parseToNode('私は大学を辞めたい')
while node:
    origin = node.surface  # the original word
    kana = node.feature.split(",")[7]  # the reading
    kana = henkan(kana)  # call the conversion function to turn katakana into hiragana
    # Check with a regular expression whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern, origin)
    # If origin is empty or does not start with a kanji, no furigana is needed, so print it as is
    if origin != "" and matchOB:
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin, kana), end="")
    else:
        print(origin)
    node = node.next
I will skip the details, but
range(83)
is used because each array has 83 elements: the Japanese syllabary plus the small kana.
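As an alternative to the nested loop, the conversion can be sketched with str.translate, relying on the fact that each katakana code point in the range ァ (U+30A1) to ヶ (U+30F6) sits exactly 0x60 above its hiragana counterpart. This is a different technique from the one in the article. kata_to_hira and KATA_TO_HIRA are names chosen for this example.

```python
# Map every katakana code point ァ..ヶ to the hiragana 0x60 below it.
KATA_TO_HIRA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}


def kata_to_hira(text: str) -> str:
    """Convert katakana to hiragana; other characters pass through unchanged."""
    return text.translate(KATA_TO_HIRA)


# The ranges used by henkan: 83 characters each, ぁ..ん and ァ..ン.
print(len(range(12353, 12436)), chr(12353), chr(12435))  # 83 ぁ ん
print(len(range(12449, 12532)), chr(12449), chr(12531))  # 83 ァ ン

print(kata_to_hira("ダイガク"))  # だいがく
print(kata_to_hira("コーヒー"))  # こーひー
```

One difference in behaviour: the loop-based henkan drops any character that is not in its table, such as the long-vowel mark ー, while this version leaves such characters in place.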
Another problem is okurigana. MeCab splits text into words, so each word includes its okurigana, which means the ruby is not attached to the kanji alone.
In other words, instead of
<ruby><rb>辞め</rb><rt>やめ</rt></ruby>
I want
<ruby><rb>辞</rb><rt>や</rt></ruby>め
Let's look at how to do this.
Basically, we check whether the ending of the original word matches the ending of its reading.
origin: 美しい kana: うつくしい
origin: 走る kana: はしる
These share the same last two characters or last one character, which can therefore be treated as okurigana.
As with the katakana problem earlier, we split origin and kana into arrays of single characters and check whether the last one or two characters match.
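The ending comparison described above can be sketched as a single function that strips the shared suffix. Note that this is a variation, not the article's code: split_okurigana is a hypothetical name, and it compares any number of trailing characters rather than only one or two, while always leaving at least one character in both the word and its reading.

```python
def split_okurigana(origin: str, kana: str):
    """Return (base, base_reading, okurigana) by stripping the shared ending."""
    n = 0
    # Count matching trailing characters, but keep at least one character
    # in both the word and its reading.
    while (n < len(origin) - 1 and n < len(kana) - 1
           and origin[-1 - n] == kana[-1 - n]):
        n += 1
    if n == 0:
        return origin, kana, ""
    return origin[:-n], kana[:-n], origin[-n:]


print(split_okurigana("美しい", "うつくしい"))  # ('美', 'うつく', 'しい')
print(split_okurigana("走る", "はしる"))  # ('走', 'はし', 'る')
print(split_okurigana("大学", "だいがく"))  # ('大学', 'だいがく', '')
```

Slicing also avoids the step of blanking list elements and joining them back together.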
mecab.py
#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-
import sys
import MeCab
import re
def henkan(text):
    hiragana = [chr(i) for i in range(12353, 12436)]
    katakana = [chr(i) for i in range(12449, 12532)]
    kana = ""
    # Convert katakana to hiragana
    for ch in list(text):
        for i in range(83):
            if ch == katakana[i]:
                kana += hiragana[i]
    return kana
mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # parse an empty string once first (workaround)
node = mecab.parseToNode('私は大学を辞めたい')
while node:
    origin = node.surface  # the original word
    yomi = node.feature.split(",")[7]  # the reading
    kana = henkan(yomi)
    # Check with a regular expression whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern, origin)
    # If origin is empty or does not start with a kanji, no furigana is needed, so print it as is
    if origin != "" and matchOB:
        origin = list(origin)
        kana = list(kana)
        num1 = len(origin)
        num2 = len(kana)
        okurigana = ""
        if origin[num1-1] == kana[num2-1] and origin[num1-2] == kana[num2-2]:
            okurigana = origin[num1-2] + origin[num1-1]
            origin[num1-1] = ""
            origin[num1-2] = ""
            kana[num2-1] = ""
            kana[num2-2] = ""
            origin = "".join(origin)
            kana = "".join(kana)
        elif origin[num1-1] == kana[num2-1]:
            okurigana = origin[num1-1]
            origin[num1-1] = ""
            kana[num2-1] = ""
            origin = "".join(origin)
            kana = "".join(kana)
        else:
            origin = "".join(origin)
            kana = "".join(kana)
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin, kana), end="")
        print(okurigana)
    else:
        print(origin)
    node = node.next
When you run this,
$ mecab.py
<ruby><rb>私</rb><rt>わたし</rt></ruby>
は
<ruby><rb>大学</rb><rt>だいがく</rt></ruby>
を
<ruby><rb>辞</rb><rt>や</rt></ruby>め
たい
And it's done!
Let's take a closer look.
origin=list(origin)
kana=list(kana)
num1=len(origin)
num2=len(kana)
okurigana=""
First, as mentioned earlier, we check the words one character at a time, so the list function turns each string into an array.
To find out whether the endings match, we also get the length of each array with the len function and assign the results to num1 and num2.
if origin[num1-1] == kana[num2-1] and origin[num1-2] == kana[num2-2] :
    okurigana=origin[num1-2]+origin[num1-1]
    origin[num1-1]=""
    origin[num1-2]=""
    kana[num2-1]=""
    kana[num2-2]=""
    origin="".join(origin)
    kana="".join(kana)
This branch runs when both the last character and the second-to-last character match, that is, for words like "美しい".
In this case the last two characters, "しい", are the okurigana, so they are assigned to the okurigana variable.
okurigana=origin[num1-2]+origin[num1-1]
Once they are stored in that variable, "しい" is no longer needed in origin and kana, so blank it out. Note that origin is indexed with num1 and kana with num2.
origin[num1-1]=""
origin[num1-2]=""
kana[num2-1]=""
kana[num2-2]=""
What remains is "美" in origin and "う", "つ", "く" in kana (plus the emptied slots), so use the join function to turn each array back into a string.
origin="".join(origin)
kana="".join(kana)
The remaining branches handle a one-character okurigana, as in "走る", and the case where there is no okurigana at all, as in "大学".
elif origin[num1-1] == kana[num2-1] :
    okurigana=origin[num1-1]
    origin[num1-1]=""
    kana[num2-1]=""
    origin="".join(origin)
    kana="".join(kana)
else :
    origin="".join(origin)
    kana="".join(kana)
Because origin and kana were turned into arrays, they must be joined back into strings even in the else branch, where there is no okurigana and nothing else to do.
The last step is the output.
The variables already hold the right values after the processing so far, so nothing special changes here. Just don't forget to print the okurigana variable as well.
print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
print(okurigana)
As it stands the code is a little hard to read, so the final step is to move the logic into functions, and then it's done!
mecab.py
#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-
import sys
import MeCab
import re
def henkan(text):
    hiragana = [chr(i) for i in range(12353, 12436)]
    katakana = [chr(i) for i in range(12449, 12532)]
    kana = ""
    # Convert katakana to hiragana
    for ch in list(text):
        for i in range(83):
            if ch == katakana[i]:
                kana += hiragana[i]
    return kana
def tohensu(origin, kana):
    # Join the character arrays back into strings
    origin = "".join(origin)
    kana = "".join(kana)
    return origin, kana
def kanadelete(origin, kana):
    # Split the okurigana off the end of the word and its reading
    origin = list(origin)
    kana = list(kana)
    num1 = len(origin)
    num2 = len(kana)
    okurigana = ""
    if origin[num1-1] == kana[num2-1] and origin[num1-2] == kana[num2-2]:
        okurigana = origin[num1-2] + origin[num1-1]
        origin[num1-1] = ""
        origin[num1-2] = ""
        kana[num2-1] = ""
        kana[num2-2] = ""
        origin, kana = tohensu(origin, kana)
    elif origin[num1-1] == kana[num2-1]:
        okurigana = origin[num1-1]
        origin[num1-1] = ""
        kana[num2-1] = ""
        origin, kana = tohensu(origin, kana)
    else:
        origin, kana = tohensu(origin, kana)
    return origin, kana, okurigana
mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # parse an empty string once first (workaround)
node = mecab.parseToNode("私は大学を辞めたい")
while node:
    origin = node.surface  # the original word
    yomi = node.feature.split(",")[7]  # the reading
    kana = henkan(yomi)
    # Check with a regular expression whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern, origin)
    # If origin is empty or does not start with a kanji, no furigana is needed, so print it as is
    if origin != "" and matchOB:
        origin, kana, okurigana = kanadelete(origin, kana)
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin, kana), end="")
        print(okurigana)
    else:
        print(origin)
    node = node.next
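The script prints each piece on its own line. If a single HTML string is wanted instead, the pieces can be collected in a list and joined at the end. The sketch below assumes the per-word results are already available as (origin, kana, okurigana) triples like those kanadelete returns, with kana set to None for words that need no ruby. build_html and that None convention are made up for this example.

```python
def build_html(parts):
    """Join (origin, kana, okurigana) triples into one HTML string.

    kana is None (or empty) for words that need no ruby markup.
    """
    out = []
    for origin, kana, okurigana in parts:
        if kana:
            out.append("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin, kana))
        else:
            out.append(origin)
        out.append(okurigana)
    return "".join(out)


# Hand-written triples for the example sentence (no MeCab needed here).
parts = [
    ("私", "わたし", ""),
    ("は", None, ""),
    ("大学", "だいがく", ""),
    ("を", None, ""),
    ("辞", "や", "め"),
    ("たい", None, ""),
]
html = build_html(parts)
print(html)
```

The result is one continuous line of markup, which is closer to what would be embedded in a page.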