Create a tool that automatically adds furigana in HTML using MeCab from Python 3

2019.3.25 postscript

The program introduced in this article handles the example sentence to some extent, but I found that it does not add furigana correctly for some words.

The following site introduces a better version of the code in this article, so please take a look! https://www.karelie.net/python3-mecab-html-furigana-1/

We would like to thank Kimera for pointing out the problems in this article.

What you can do

Given a sentence such as

私は大学を辞めたい

the tool automatically adds ruby in HTML, like this:

<ruby><rb>私</rb><rt>わたし</rt></ruby>は
<ruby><rb>大学</rb><rt>だいがく</rt></ruby>を
<ruby><rb>辞</rb><rt>や</rt></ruby>めたい

What you need

The environment is macOS with Python 3.5.1.
1. MeCab (in my case it was already installed on my Mac, so I will skip its installation)
2. mecab-python3
3. pip (required to install mecab-python3)

Get what you need

MeCab is a morphological analysis tool; you can use it to get the readings of kanji.

Run the mecab command from the command line and enter text on the next line to get the analysis.

$ mecab
私は大学を辞めたい
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
大学	名詞,一般,*,*,*,*,大学,ダイガク,ダイガク
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
辞め	動詞,自立,*,*,一段,連用形,辞める,ヤメ,ヤメ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS

To use MeCab from Python 3, you need a package called mecab-python3.

pip is the tool that manages such packages (similar to gem in Ruby). If pip is not installed, install it first.

pip installation

$ easy_install pip

By the way, pip seems to be included by default from Python 3.4 onward.

Install mecab-python3

After installing pip, install mecab-python3.

$ pip install mecab-python3

You can list the installed packages with the pip list command. If mecab-python3 appears, the installation succeeded.

$ pip list
mecab-python3 (0.7)

Using MeCab from Python 3

Now that the environment is ready, let's try using MeCab from Python and see what data we get.

mecab.py


#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-
import sys
import MeCab

mecab = MeCab.Tagger("-Ochasen")  # create the MeCab tagger
text=mecab.parse('私は大学を辞めたい')  # analyze the sentence

print(text)
$ mecab.py
私	ワタシ	私	名詞-代名詞-一般
は	ハ	は	助詞-係助詞
大学	ダイガク	大学	名詞-一般
を	ヲ	を	助詞-格助詞-一般
辞め	ヤメ	辞める	動詞-自立	一段	連用形
たい	タイ	たい	助動詞	特殊・タイ	基本形
EOS

"Parse" means to analyze.

After creating the tagger with

mecab = MeCab.Tagger("-Ochasen")

call the parse method with the text you want to analyze as the argument,

text=mecab.parse('私は大学を辞めたい')

and you get the result in the format shown in the output above.

Use parseToNode

That works, but this format is awkward to handle in a program because the fields are separated by whitespace.

So switch to a format called Node, which is easier to work with.

mecab.py


import sys
import MeCab


mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # workaround: parse an empty string first
node=mecab.parseToNode('私は大学を辞めたい')

while node :
    print(node.surface+"\t"+node.feature)
    node=node.next
$ mecab.py
	BOS/EOS,*,*,*,*,*,*,*,*
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
大学	名詞,一般,*,*,*,*,大学,ダイガク,ダイガク
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
辞め	動詞,自立,*,*,一段,連用形,辞める,ヤメ,ヤメ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
	BOS/EOS,*,*,*,*,*,*,*,*

This is easier to handle because the analysis result comes back separated by ",".

As shown in the code, it seems you need to parse an empty string once right after creating the tagger, so do that first.

After that, pass the sentence as the argument of the parseToNode method:

node=mecab.parseToNode('私は大学を辞めたい')

The result is a chain of nodes, one per word, so use a while statement to loop until the end of the sentence (until node has no value).

Inside the loop, node.surface gives the original text of the word, and node.feature gives the comma-separated analysis data.

By the way, if you forget

  node=node.next

the loop never ends, so be careful.
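One way to make the forgotten node.next mistake harder to make is to hide the stepping inside a generator. This is only a sketch: FakeNode is a hypothetical stand-in I wrote so the example runs without MeCab, and iter_nodes is my own helper name, not part of mecab-python3.

```python
class FakeNode:
    """Hypothetical stand-in for a MeCab node (surface, feature, next)."""

    def __init__(self, surface, feature, next=None):
        self.surface = surface
        self.feature = feature
        self.next = next


def iter_nodes(node):
    """Yield each node of the linked list, advancing with .next."""
    while node:
        yield node
        node = node.next


# Build a two-node list by hand; a real one would come from parseToNode.
tail = FakeNode("大学", "名詞,一般,*,*,*,*,大学,ダイガク,ダイガク")
head = FakeNode("私", "名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ", tail)

surfaces = [n.surface for n in iter_nodes(head)]
print(surfaces)
```

With this, the main loop can be written as a for statement, and the stepping lives in one place.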

So far, I have managed to use MeCab from Python.

Adding ruby in HTML format

Up to this point there is plenty of information on the Internet, but from here on there is not much, and this is the difficult part.

Basically, though, the work is to take node.surface and node.feature from above and fit them into HTML.

How to write ruby in HTML

<ruby><rb>文字</rb><rt>もじ</rt></ruby>

Ruby takes the form above, so put the original kanji in the rb tag and the furigana in the rt tag.
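The ruby markup can be wrapped in a small helper so the tag layout lives in one place. This is a sketch, not part of the original program: the function name ruby is my own, and escaping with html.escape is an extra precaution I am assuming is wanted in case the text contains characters such as < or &.

```python
import html


def ruby(base, reading):
    """Return ruby markup for one word, escaping HTML special characters."""
    return "<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(
        html.escape(base), html.escape(reading))


print(ruby("文字", "もじ"))
```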

mecab.py


#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-

import sys
import MeCab
import re

mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # workaround: parse an empty string first
node=mecab.parseToNode('私は大学を辞めたい')

while node :
    origin=node.surface  # the original word
    kana=node.feature.split(",")[7]  # the reading (index 7 of the feature fields)

    # Check whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern , origin)

    # If origin is empty or does not start with a kanji, no furigana is needed, so output it as is
    if origin != "" and matchOB:
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
    else :
        print(origin)

    node=node.next

$ mecab.py
<ruby><rb>私</rb><rt>ワタシ</rt></ruby>は
<ruby><rb>大学</rb><rt>ダイガク</rt></ruby>を
<ruby><rb>辞め</rb><rt>ヤメ</rt></ruby>たい

It is starting to look right.

First, the first two lines of the loop assign the word and its reading to origin and kana.

The reading is at index 7 of the comma-separated node.feature (the eighth field, since indexes start at zero), so split the string with the split function and take element 7.

origin=node.surface  # the original word
kana=node.feature.split(",")[7]  # the reading
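Taking index 7 directly raises IndexError if a feature string has fewer than eight fields, which I believe can happen for words missing from the dictionary. A more defensive version might look like the sketch below. get_reading is a hypothetical helper name, and treating "*" as "no reading" is my assumption based on the BOS/EOS line shown earlier.

```python
def get_reading(feature):
    """Return the reading field (index 7) of a feature string, or None."""
    fields = feature.split(",")
    if len(fields) > 7 and fields[7] != "*":
        return fields[7]
    return None


print(get_reading("名詞,一般,*,*,*,*,大学,ダイガク,ダイガク"))
print(get_reading("名詞,一般,*,*,*,*,*"))
```

The caller can then output the word unchanged whenever the result is None.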

Next, I want to add furigana only when the original word is kanji, so use a regular expression to check.

To use regular expressions, first import re at the top.

import re

I won't go into the details of regular expressions here, but the re module imported above has various functions.

This time I want to know whether the first character of origin is a kanji, so use the match function. Pass the pattern (kanji this time) as the first argument and the word as the second.

If it matches, a match object holding information about the match is returned. If not, None is returned.

    # Check whether the word starts with a kanji
    pattern = "[一-龥]"  # kanji pattern
    matchOB = re.match(pattern , origin)  # None when it does not start with a kanji
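Because re.match only looks at the start of the string, a word that begins with hiragana but contains kanji later is not detected. The sketch below contrasts re.match with re.search on the same pattern; the helper names starts_with_kanji and contains_kanji are my own and do not appear in the program.

```python
import re

KANJI = re.compile("[一-龥]")  # same character range as the pattern above


def starts_with_kanji(word):
    """True when the first character of word is in the kanji range."""
    return KANJI.match(word) is not None


def contains_kanji(word):
    """True when any character of word is in the kanji range."""
    return KANJI.search(word) is not None


for word in ["大学", "は", "お茶"]:
    print(word, starts_with_kanji(word), contains_kanji(word))
```

Which check is appropriate depends on how such mixed words should be handled; the program in this article uses the first one.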

Finally, use if to decide.

To summarize:
・When origin is kanji ➡ add ruby in HTML and output
・When origin is not kanji (hiragana, katakana, numbers, etc.) ➡ output as is

    # If origin is empty or does not start with a kanji, no furigana is needed, so output it as is
    if matchOB and origin != "" :
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
    else :
        print(origin)

When origin is not kanji, matchOB is None, so the condition is false and execution moves to else.

This is starting to take shape, but there is a problem.

Katakana problem

I want the furigana in hiragana, but the reading MeCab returns is katakana, so I want to convert it to hiragana.

In Ruby you can do this in one shot with a function, but as far as I could find, Python does not seem to have an equivalent.

So I will work something out here.

First, prepare arrays of hiragana and katakana characters in the same order.

hiragana=["あ","い","う","ん"]  # hiragana[0]=="あ"
katakana=["ア","イ","ウ","ン"]  # katakana[0]=="ア"

Then, for example, suppose you want to change the word "アイ" to "あい".

text=list("アイ")  # text[0]=="ア", text[1]=="イ"
kana=""  # variable that collects the hiragana

for hoge in text:  # repeat for each character (assigned to hoge)
    for i in range(len(katakana)):
        if katakana[i]==hoge:
            kana+=hiragana[i]

print(kana)  # あい

I want to split "アイ" into ア and イ and compare them one at a time, so split the string character by character into a list.

Then loop with for over the list, once per character.

Inside it, loop over the katakana indexes and compare each entry with hoge. In the first pass hoge is "ア", so it matches katakana[0] when i is 0.

Then hiragana[i], that is hiragana[0], that is "あ", is appended to the variable kana.

Repeat this for every character and kana ends up holding the hiragana.

Here is the code with this turned into a function.

mecab.py


#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-

import sys
import MeCab
import re

def henkan(text) :
    hiragana=[chr(i) for i in range(12353, 12436)]
    katakana=[chr(i) for i in range(12449, 12532)]
    kana=""
    # Convert each katakana character to the hiragana at the same index
    for ch in text:
        for i in range(83):
            if ch == katakana[i]:
                kana+=hiragana[i]
    return kana

mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # workaround: parse an empty string first
node=mecab.parseToNode('私は大学を辞めたい')

while node :
    origin=node.surface  # the original word
    kana=node.feature.split(",")[7]  # the reading, in katakana
    kana=henkan(kana)  # convert the katakana reading to hiragana

    # Check whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern , origin)

    # If origin is empty or does not start with a kanji, no furigana is needed, so output it as is
    if origin != "" and matchOB:
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
      
    else :
        print(origin)

    node=node.next

I will omit the details, but the reason for

range(83)

is that each array has 83 elements: the kana syllabary plus the voiced, semi-voiced, and small characters.
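The two ranges cover ぁ to ん and ァ to ン, and each katakana code point sits exactly 96 above its hiragana counterpart. That makes an alternative possible with str.translate, sketched below. This is not what henkan does: kata_to_hira and KATA_TO_HIRA are names I made up, and unlike henkan this version passes characters outside the table (such as the long vowel mark ー) through unchanged instead of dropping them.

```python
# Map each katakana code point (ァ..ン) to the hiragana 96 positions below it.
KATA_TO_HIRA = {code: code - 96 for code in range(12449, 12532)}


def kata_to_hira(text):
    """Convert katakana in text to hiragana; other characters are kept."""
    return text.translate(KATA_TO_HIRA)


print(chr(12353), chr(12435), chr(12449), chr(12531))
print(kata_to_hira("ダイガク"))
print(kata_to_hira("コーヒー"))
```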

Okurigana problem

Another problem is okurigana. Since MeCab splits text into words, each word includes its okurigana, which means ruby is not attached to the kanji alone.

In other words, instead of

<ruby><rb>辞め</rb><rt>やめ</rt></ruby>

I want

<ruby><rb>辞</rb><rt>や</rt></ruby>め

Let's deal with this.

The basic idea is to check whether the end of the original word matches the end of its reading.

origin: 美しい  kana: うつくしい
origin: 走る  kana: はしる

Since the last two characters (しい) or the last character (る) are the same in both, they can be treated as okurigana.

As with the katakana problem earlier, split origin and kana into lists of characters and check whether the last one or two characters match.
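The suffix comparison described above can be sketched on its own, before wiring it into the MeCab loop. This version is a generalization I am assuming is acceptable: it strips the longest shared ending rather than only one or two characters, and it always leaves at least one character in the stem. split_okurigana is a hypothetical name, not a function from the program in this article.

```python
def split_okurigana(surface, reading):
    """Split off the longest ending shared by surface and reading.

    Returns (stem, stem_reading, okurigana).
    """
    n = 0
    # Keep at least one character in both the stem and its reading.
    while (n < len(surface) - 1 and n < len(reading) - 1
           and surface[-1 - n] == reading[-1 - n]):
        n += 1
    if n == 0:
        return surface, reading, ""
    return surface[:-n], reading[:-n], surface[-n:]


print(split_okurigana("美しい", "うつくしい"))
print(split_okurigana("走る", "はしる"))
print(split_okurigana("大学", "だいがく"))
```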

mecab.py


#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-

import sys
import MeCab
import re

def henkan(text) :
    hiragana=[chr(i) for i in range(12353, 12436)]
    katakana=[chr(i) for i in range(12449, 12532)]
    kana=""
    # Convert each katakana character to the hiragana at the same index
    for ch in text:
        for i in range(83):
            if ch == katakana[i]:
                kana+=hiragana[i]
    return kana


mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # workaround: parse an empty string first
node=mecab.parseToNode('私は大学を辞めたい')

while node :
    origin=node.surface  # the original word
    yomi=node.feature.split(",")[7]  # the reading, in katakana
    kana=henkan(yomi)

    # Check whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern , origin)

    # If origin is empty or does not start with a kanji, no furigana is needed, so output it as is
    if origin != "" and matchOB:

        origin=list(origin)
        kana=list(kana)
        num1=len(origin)
        num2=len(kana)
        okurigana=""

        if origin[num1-1] == kana[num2-1] and origin[num1-2] == kana[num2-2] :
            okurigana=origin[num1-2]+origin[num1-1]

            origin[num1-1]=""
            origin[num1-2]=""
            kana[num2-1]=""
            kana[num2-2]=""

            origin="".join(origin)
            kana="".join(kana)

        elif origin[num1-1] == kana[num2-1] :

            okurigana=origin[num1-1]

            origin[num1-1]=""
            kana[num2-1]=""

            origin="".join(origin)
            kana="".join(kana)

        else :
            origin="".join(origin)
            kana="".join(kana)

        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
        print(okurigana)

    else :
        print(origin)

    node=node.next

When you do this,

$ mecab.py
<ruby><rb>私</rb><rt>わたし</rt></ruby>
は
<ruby><rb>大学</rb><rt>だいがく</rt></ruby>
を
<ruby><rb>辞</rb><rt>や</rt></ruby>め
たい

And it's done!

Let's take a closer look.

origin=list(origin)
kana=list(kana)

num1=len(origin)
num2=len(kana)
 
okurigana=""

First, as mentioned earlier, the check is done character by character, so turn each string into a list with the list function.

Also, to find out whether the endings match, get the length of each list with the len function and assign them to num1 and num2.

        if origin[num1-1] == kana[num2-1] and origin[num1-2] == kana[num2-2] :
            okurigana=origin[num1-2]+origin[num1-1]

            origin[num1-1]=""
            origin[num1-2]=""
            kana[num2-1]=""
            kana[num2-2]=""

            origin="".join(origin)
            kana="".join(kana)

This is the processing when both the last character and the second-to-last character match, that is, for words like 美しい.

In this case the last two characters, しい, are the okurigana, so assign them to the okurigana variable.

okurigana=origin[num1-2]+origin[num1-1]

Once they are stored in the variable, しい is no longer needed in origin and kana, so blank it out.

origin[num1-1]=""
origin[num1-2]=""
kana[num2-1]=""
kana[num2-2]=""

What remains is effectively origin = ["美"] and kana = ["う", "つ", "く"], so use the join function to turn each back into a string.

origin="".join(origin)
kana="".join(kana)

The remaining branches handle a one-character okurigana, as in 走る, and the case with no okurigana, as in 大学.

        elif origin[num1-1] == kana[num2-1] :
            okurigana=origin[num1-1]

            origin[num1-1]=""
            kana[num2-1]=""

            origin="".join(origin)
            kana="".join(kana)

        else :
            origin="".join(origin)
            kana="".join(kana)

Since origin and kana were turned into lists, they have to be joined back into strings even in the branch where there is no okurigana and nothing else to do.

The last step is output.

The variables already hold the right values from the processing so far, so nothing special changes, but the okurigana is now in okurigana, so don't forget to output it as well.

print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
print(okurigana)

All that remains is that the code is a little hard to read as it is, so split it into functions and it's done!

mecab.py


#!/usr/local/src/pyenv/shims/python
# -*- coding: utf_8 -*-

import sys
import MeCab
import re

def henkan(text) :
    hiragana=[chr(i) for i in range(12353, 12436)]
    katakana=[chr(i) for i in range(12449, 12532)]
    kana=""
    # Convert each katakana character to the hiragana at the same index
    for ch in text:
        for i in range(83):
            if ch == katakana[i]:
                kana+=hiragana[i]
    return kana


def tohensu(origin,kana) :
    origin="".join(origin)
    kana="".join(kana)
    return origin,kana

def kanadelete(origin,kana) :
    origin=list(origin)
    kana=list(kana)
    num1=len(origin)
    num2=len(kana)
    okurigana=""

    if origin[num1-1] == kana[num2-1] and origin[num1-2] == kana[num2-2] :
        okurigana=origin[num1-2]+origin[num1-1]
        origin[num1-1]=""
        origin[num1-2]=""
        kana[num2-1]=""
        kana[num2-2]=""
        origin,kana=tohensu(origin,kana)

    elif origin[num1-1] == kana[num2-1] :

        okurigana=origin[num1-1]

        origin[num1-1]=""
        kana[num2-1]=""
        origin="".join(origin)
        kana="".join(kana)
    else :
        origin,kana=tohensu(origin,kana)

    return origin,kana,okurigana

mecab = MeCab.Tagger("-Ochasen")
mecab.parse('')  # workaround: parse an empty string first
node=mecab.parseToNode("私は大学を辞めたい")

while node :
    origin=node.surface  # the original word
    yomi=node.feature.split(",")[7]  # the reading, in katakana
    kana=henkan(yomi)

    # Check whether the word starts with a kanji
    pattern = "[一-龥]"
    matchOB = re.match(pattern , origin)

    # If origin is empty or does not start with a kanji, no furigana is needed, so output it as is
    if origin != "" and matchOB:
        origin,kana,okurigana=kanadelete(origin,kana)
        print("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>".format(origin,kana),end="")
        print(okurigana)
    else :
        print(origin)

    node=node.next
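The program above prints each piece on its own line. If a single HTML string is more convenient, the same steps can be collected into one function. The sketch below runs without MeCab: the tokens list is a hand-written stand-in for the (surface, reading) pairs that the node loop would produce, and furigana_html and split_okurigana are hypothetical names of my own, not the functions defined above.

```python
import re

KANJI = re.compile("[一-龥]")
# Katakana ァ..ン mapped to the hiragana 96 code points below.
KATA_TO_HIRA = {code: code - 96 for code in range(12449, 12532)}


def split_okurigana(surface, reading):
    """Split off the longest ending shared by surface and reading."""
    n = 0
    while (n < len(surface) - 1 and n < len(reading) - 1
           and surface[-1 - n] == reading[-1 - n]):
        n += 1
    if n == 0:
        return surface, reading, ""
    return surface[:-n], reading[:-n], surface[-n:]


def furigana_html(tokens):
    """Build one HTML string from (surface, katakana reading) pairs."""
    parts = []
    for surface, reading in tokens:
        if reading and KANJI.match(surface):
            hira = reading.translate(KATA_TO_HIRA)
            stem, stem_reading, okurigana = split_okurigana(surface, hira)
            parts.append("<ruby><rb>{0}</rb><rt>{1}</rt></ruby>{2}".format(
                stem, stem_reading, okurigana))
        else:
            # No kanji at the start (or no reading): output as is.
            parts.append(surface)
    return "".join(parts)


# Stand-in for the pairs MeCab would give for 私は大学を辞めたい.
tokens = [("私", "ワタシ"), ("は", "ハ"), ("大学", "ダイガク"),
          ("を", "ヲ"), ("辞め", "ヤメ"), ("たい", "タイ")]
print(furigana_html(tokens))
```

Returning a string instead of printing also makes the result easier to embed in a template or to check in a test.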
