Introduction

I studied how MeCab calculates costs and wrote up what I learned. Please let me know if anything is wrong.

Overview of MeCab morphological analysis

MeCab performs morphological analysis using a registered dictionary. When a word that is not in the dictionary (an unknown word) appears, MeCab considers the possible ways of splitting it and scores each split using the costs of its words. **The split with the lowest total cost** is output as the result.

Trying it out

This time, we will use the ipadic-neologd dictionary to check how the fictitious word "American German Village" is morphologically analyzed.


echo American German Village|mecab -d C:\neologd -N2

American noun,Proper noun,area,Country,*,*,America,America,America
German noun,Proper noun,area,Country,*,*,Germany,Germany,Germany
Village noun,suffix,area,*,*,*,village,village,village
EOS

American noun,Proper noun,area,Country,*,*,America,America,America
German Village noun,Proper noun,General,*,*,*,German village,German village,German village
EOS

The -d option specifies the dictionary, and -N lists the given number of candidates (here -N2, so the two best). As a result, two candidate splits for the unknown word were listed. Both of them look quite plausible, don't they?

MeCab dictionary

Before getting into the cost calculation, let me explain how words are registered in a MeCab dictionary. Each entry is written in the following format:

Surface form,Left context ID,Right context ID,Cost,Part of speech,Part-of-speech subcategory 1,Part-of-speech subcategory 2,Part-of-speech subcategory 3,Conjugation form,Conjugation type,Base form,Reading,Pronunciation

The entries are saved as a CSV file, and the dictionary is then built from it. The format contains a few unfamiliar terms:

・**Left context ID**
・**Right context ID**
・**Cost**

These are the values MeCab uses for morphological analysis.
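To make the field order concrete, here is a minimal Python sketch that reads one entry in this format. The English field names are my own labels, and the sample row is assembled from values that appear in the outputs in this article; it is not copied from the real ipadic-neologd CSV.

```python
import csv
from io import StringIO

# Field names are my own labels for the 13 columns of the entry format.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_form", "conj_type", "base_form", "reading", "pronunciation",
]

# Illustrative row built from values shown in this article (an assumption,
# not a line taken from the actual dictionary source).
SAMPLE = (
    "German village,1288,1288,611,noun,Proper noun,General,*,*,*,"
    "German village,German village,German village\n"
)

def parse_entry(row):
    """Map one CSV row to a dict and convert the numeric columns to int."""
    entry = dict(zip(FIELDS, row))
    for key in ("left_id", "right_id", "cost"):
        entry[key] = int(entry[key])
    return entry

entries = [parse_entry(row) for row in csv.reader(StringIO(SAMPLE))]
first = entries[0]
print(first["surface"], first["left_id"], first["right_id"], first["cost"])
```

Only the second to fourth columns (the two context IDs and the cost) matter for the cost calculation discussed in the rest of this article.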

Occurrence cost

The **occurrence cost** expresses **how unlikely the word itself is to appear**. The higher the value, the less likely the word is to appear. It is the **cost** value in the dictionary entry described above. So what are the occurrence costs for "American German Village"?


echo American German Village|mecab -F "%m,%c,\n" -d C:\neologd -N2

America, 4698,
Germany, 2543,
village, 8707,
EOS

America, 4698,
German village, 611,
EOS

%m prints the surface form and %c prints the occurrence cost. This raises a question. Summing the occurrence costs, the second candidate is clearly cheaper, yet MeCab outputs the first candidate first. The reason is another kind of cost, the **connection cost**.
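As a quick check of that claim, the occurrence costs from the output above can be summed per candidate. This is just arithmetic on the printed values, written as a small Python sketch:

```python
# Occurrence costs (%c) copied from the output above.
candidate_1 = [("America", 4698), ("Germany", 2543), ("village", 8707)]
candidate_2 = [("America", 4698), ("German village", 611)]

def occurrence_total(candidate):
    """Sum only the occurrence costs, ignoring everything else."""
    return sum(cost for _, cost in candidate)

total_1 = occurrence_total(candidate_1)
total_2 = occurrence_total(candidate_2)
print(total_1, total_2)  # 15948 5309
```

On occurrence cost alone, the second candidate would win by a wide margin, so something else must be contributing to the total.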

Connection cost

The **connection cost** expresses how unlikely two words are to appear next to each other, based on their context IDs. The smaller the value, the more easily the two words connect. The context IDs are the **left context ID and right context ID** in the dictionary entry. As far as I can tell, a word's two IDs are usually registered with the same value. For example, consider the phrase "before and after". The context ID of "before" is 1314 and the context ID of "after" is 1313. The connection cost is determined by the pair formed by the right context ID of the preceding word and the left context ID of the following word. The list of pairs can be found in matrix.def (or matrix.bin) under MeCab\dic\ipadic. Looking at it, we find:

1314 1313 -316
1313 1314 716

From "before" to "after" the connection cost is low (-316), so the two connect easily. From "after" to "before" the connection cost is high (716), so they are unlikely to connect. I think this is also quite convincing. Now let's take a look at "American German Village".
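The lookup can be sketched in Python as below. It uses only the two lines quoted above rather than a real matrix.def, and it assumes the reading used in this article: the first column is the context ID of the preceding word, the second is the context ID of the following word, and the third is the cost. The function name is my own.

```python
# Only the two lines quoted in this article, not the full matrix.def.
MATRIX_SNIPPET = """\
1314 1313 -316
1313 1314 716
"""

def load_connection_costs(text):
    """Build a {(previous_id, next_id): cost} table from matrix-style lines."""
    costs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not an "id id cost" triple
        prev_id, next_id, cost = map(int, parts)
        costs[(prev_id, next_id)] = cost
    return costs

costs = load_connection_costs(MATRIX_SNIPPET)
print(costs[(1314, 1313)])  # "before" -> "after": -316
print(costs[(1313, 1314)])  # "after" -> "before": 716
```

Because the key is an ordered pair, the two directions are stored as separate entries, which is why the same two words can have different costs depending on their order.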


echo American German Village|mecab  -F"%m,%phl,%phr,%c,%pc,%pn\n" -d C:\neologd -N2

America,1294,1294,4698,3746,3746
Germany,1294,1294,2543,-141,-3887
village,1303,1303,8707,881,1022
EOS

America,1294,1294,4698,3746,3746
German village,1288,1288,611,2614,-1132
EOS

The MeCab output format specifiers used above are summarized below.

Specifier	Description
%m	Surface form
%phl	Left context ID
%phr	Right context ID
%c (or %pw)	Occurrence cost
%pc	Connection cost + occurrence cost (cumulative from the beginning of the sentence)
%pn	Connection cost + occurrence cost (for that morpheme alone, %pw + %pC)

All format specifiers are listed at https://taku910.github.io/mecab/format.html. Since the raw output is hard to read, I will also show it as a table.

Surface form	Left context ID	Right context ID	Occurrence cost	Connection + occurrence (cumulative)	Connection + occurrence (alone)
America	1294	1294	4698	3746	3746
Germany	1294	1294	2543	-141	-3887
village	1303	1303	8707	881	1022

America	1294	1294	4698	3746	3746
German village	1288	1288	611	2614	-1132
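The relationships between these columns can be checked mechanically. The following Python sketch parses the comma-separated lines printed by the command above, recovers each connection cost as %pn minus %c, and confirms that the running sum of %pn matches %pc. The `Node` type and function names are my own, for illustration only.

```python
from typing import NamedTuple

class Node(NamedTuple):
    surface: str
    left_id: int     # %phl
    right_id: int    # %phr
    word_cost: int   # %c
    cumulative: int  # %pc
    local: int       # %pn

def parse_line(line):
    """Parse one 'surface,phl,phr,c,pc,pn' line of the output above."""
    surface, *numbers = line.split(",")
    return Node(surface, *map(int, numbers))

def connection_costs(lines):
    """Return [(surface, connection cost)] and check %pc is the running sum of %pn."""
    running = 0
    result = []
    for node in map(parse_line, lines):
        running += node.local
        if running != node.cumulative:
            raise ValueError(f"cumulative cost mismatch at {node.surface}")
        result.append((node.surface, node.local - node.word_cost))
    return result

first = connection_costs([
    "America,1294,1294,4698,3746,3746",
    "Germany,1294,1294,2543,-141,-3887",
    "village,1303,1303,8707,881,1022",
])
second = connection_costs([
    "America,1294,1294,4698,3746,3746",
    "German village,1288,1288,611,2614,-1132",
])
print(first)   # [('America', -952), ('Germany', -6430), ('village', -7685)]
print(second)  # [('America', -952), ('German village', -1743)]
```

The recovered values show that "village" and "German village" also get large negative connection costs, which is what offsets the high occurrence cost of "village".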

Please note that **BOS and EOS are also given context IDs**. The connection cost for the first word, "America", therefore comes from this line of matrix.def:

0 1294 -952

Therefore, the cumulative cost is 4698 - 952 = 3746. Next, let's look at "Germany". Its left and right context IDs are both 1294, and the connection cost for 1294 → 1294 is -6430, which is very low. (That surprised me, since I would expect country names to rarely appear in a row ...) The cumulative cost is (3746 + 2543) - 6430 = -141, and the cost of the morpheme alone is 2543 - 6430 = -3887, both consistent with the output.

Also, although it is not printed, EOS has a context ID as well, so one last connection cost is added at the end. The connection cost for context ID 1303 → 0 is 5, and the connection cost for context ID 1288 → 0 is -919. The final totals are therefore 881 + 5 = 886 for the first candidate and 2614 - 919 = 1695 for the second. The first candidate has the lower total, which solves the mystery mentioned earlier.
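Putting it all together, the whole calculation can be reproduced with a short Python sketch. The connection costs -952, -6430, 5 and -919 are the ones quoted in the text; -7685 and -1743 are not quoted anywhere but follow from the output above as %pn minus %c (1022 - 8707 and -1132 - 611). The data layout and function name are my own.

```python
# Each node: (surface, occurrence cost, connection cost from the previous node).
# -7685 and -1743 are derived from the output above as %pn - %c.
CANDIDATES = {
    "America / Germany / village": {
        "nodes": [("America", 4698, -952), ("Germany", 2543, -6430), ("village", 8707, -7685)],
        "eos_connection": 5,     # context ID 1303 -> 0
    },
    "America / German village": {
        "nodes": [("America", 4698, -952), ("German village", 611, -1743)],
        "eos_connection": -919,  # context ID 1288 -> 0
    },
}

def total_cost(candidate):
    """Occurrence costs + connection costs between nodes + connection to EOS."""
    words = sum(word for _, word, _ in candidate["nodes"])
    connections = sum(conn for _, _, conn in candidate["nodes"])
    return words + connections + candidate["eos_connection"]

totals = {name: total_cost(c) for name, c in CANDIDATES.items()}
best = min(totals, key=totals.get)
print(totals)  # {'America / Germany / village': 886, 'America / German village': 1695}
print(best)    # America / Germany / village
```

This only compares the two candidates shown here. MeCab itself searches over every possible split, but the quantity it minimizes is the same kind of sum.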

References

・Understand MeCab cost calculation
・Cost calculation of MeCab learned at NTV Tokyo
・Peek behind the scenes of Japanese morphological analysis! How MeCab Parses Morphological Analysis
