I studied how MeCab calculates costs, so I have summarized it here. Please point out anything that is wrong.
MeCab performs morphological analysis using a registered dictionary. When a word that is not registered in the dictionary (an unknown word) appears, the text is split based on the cost of each word. Among the candidate segmentations, **the one with the lowest total cost** is output as the result.
This time, we will use the ipadic-neologd dictionary to check how the fictitious word "American German Village" is morphologically analyzed.
echo American German Village|mecab -d C:\neologd -N2
American noun,Proper noun,area,Country,*,*,America,America,America
German noun,Proper noun,area,Country,*,*,Germany,Germany,Germany
Village noun,suffix,area,*,*,*,village,village,village
EOS
American noun,Proper noun,area,Country,*,*,America,America,America
German Village Noun,Proper noun,General,*,*,*,German village,German village,German village
EOS
The -d option specifies the dictionary, and the -N option lists as many candidates as the number given (here, 2). In this way, two segmentations of the unknown word were listed as candidates. Aren't these segmentations quite convincing?
Before we get into the cost calculation, let me explain how words are registered in a MeCab dictionary. Each entry is written in the following format:
Surface form,Left context ID,Right context ID,Cost,Part of speech,Part-of-speech subcategory 1,Part-of-speech subcategory 2,Part-of-speech subcategory 3,Conjugation form,Conjugation type,Base form,Reading,Pronunciation
Save the entries as a CSV file and then build the dictionary. This format contains three unfamiliar fields: **left context ID**, **right context ID**, and **cost**. These are the information MeCab uses for morphological analysis.
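As a rough illustration of the entry format above, here is a minimal Python sketch that splits one CSV line into named fields. The field names (`surface`, `left_id`, and so on) are my own labels, not MeCab's, and the sample line is reconstructed from the "German village" output shown in this article rather than copied from the real dictionary file.

```python
import csv

# Hypothetical field names, following the order of the format line above.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_form", "conj_type", "base_form", "reading", "pronunciation",
]


def parse_entry(line):
    """Parse one dictionary CSV line into a dict with integer IDs and cost."""
    row = next(csv.reader([line]))
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    entry = dict(zip(FIELDS, row))
    for key in ("left_id", "right_id", "cost"):
        entry[key] = int(entry[key])
    return entry


# Illustrative entry rebuilt from the mecab output in this article (assumption).
sample = ("German village,1288,1288,611,Noun,Proper noun,General,"
          "*,*,*,German village,German village,German village")
entry = parse_entry(sample)
print(entry["surface"], entry["left_id"], entry["right_id"], entry["cost"])
```

The three integer fields are the ones that matter for the cost calculation discussed in the rest of this article.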
The **occurrence cost** expresses **how unlikely the word itself is to appear**. The higher the value, the less likely the word is to appear. The occurrence cost is the value of the **cost** field of the dictionary entry described earlier. So what are the occurrence costs for "American German Village"?
echo American German Village|mecab -F "%m,%c,\n" -d C:\neologd -N2
America, 4698,
Germany, 2543,
village, 8707,
EOS
America, 4698,
German village, 611,
EOS
%m displays the surface form, and %c displays the occurrence cost. One question arises here. The total occurrence cost is obviously lower for the second candidate, yet the first candidate is the one output first. The reason is a second kind of cost, the **connection cost**.
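To make the question concrete, the occurrence costs from the output above can simply be added up. This is only a small arithmetic sketch in Python; the variable and function names are mine.

```python
# Occurrence costs (%c) copied from the mecab output above.
candidate_1 = [("America", 4698), ("Germany", 2543), ("village", 8707)]
candidate_2 = [("America", 4698), ("German village", 611)]


def total_occurrence_cost(morphemes):
    """Sum the occurrence costs only, ignoring any other kind of cost."""
    return sum(cost for _, cost in morphemes)


total_1 = total_occurrence_cost(candidate_1)
total_2 = total_occurrence_cost(candidate_2)
print(total_1, total_2)  # 15948 5309
```

Judged by occurrence cost alone, the second candidate would win by a wide margin, so something else must be contributing to the total.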
The **connection cost** expresses how unlikely two words are to appear next to each other, based on their context IDs. The smaller the value, the more readily the two words connect. The context IDs are the **left context ID and right context ID** of the dictionary entry. As far as I can tell, a word's left and right IDs are basically registered with the same value. For example, consider the word "before and after". The context ID of "before" is 1314 and the context ID of "after" is 1313. The connection cost is determined by the combination of the right context ID of the preceding word and the left context ID of the following word. The list of combinations can be found in matrix.def (or matrix.bin) under MeCab\dic\ipadic. Looking at it, we find:
1314 1313 -316
1313 1314 716
From "before" to "after" the connection cost is low (-316), so they connect easily; from "after" to "before" the connection cost is high (716), so they rarely connect. I think this is also quite convincing. Now let's take a look at "American German Village".
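As a side note on the matrix.def lookup just described, here is a minimal Python sketch of reading such lines into a lookup table. It assumes each data line has the form "right ID of the preceding word, left ID of the following word, cost" and skips any line that does not have exactly three fields (such as a size header); the function names are hypothetical.

```python
# The two matrix.def lines quoted above.
MATRIX_LINES = """\
1314 1313 -316
1313 1314 716
"""


def load_connection_costs(text):
    """Build {(previous right ID, next left ID): cost} from matrix.def-style text."""
    costs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # assumed to be a header or blank line
        prev_right_id, next_left_id, cost = map(int, parts)
        costs[(prev_right_id, next_left_id)] = cost
    return costs


costs = load_connection_costs(MATRIX_LINES)
before_id, after_id = 1314, 1313
print(costs[(before_id, after_id)])  # "before" then "after": -316
print(costs[(after_id, before_id)])  # "after" then "before": 716
```

The key is an ordered pair, which is why the two directions can have very different costs.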
echo American German Village|mecab -F"%m,%phl,%phr,%c,%pc,%pn\n" -d C:\neologd -N2
America,1294,1294,4698,3746,3746
Germany,1294,1294,2543,-141,-3887
village,1303,1303,8707,881,1022
EOS
America,1294,1294,4698,3746,3746
German village,1288,1288,611,2614,-1132
EOS
The MeCab output format specifiers used above can be summarized as follows.
Specifier | Description |
---|---|
%m | Surface form |
%phl | Left context ID |
%phr | Right context ID |
%c (or %pw) | Occurrence cost |
%pc | Connection cost + word occurrence cost (cumulative from the beginning of the sentence) |
%pn | Connection cost + word occurrence cost (for that morpheme alone, %pw + %pC) |
All format specifiers are listed at https://taku910.github.io/mecab/format.html. Since the raw output is hard to read, I will also put it in a table.
Surface form | Left context ID | Right context ID | Occurrence cost | Connection + occurrence (cumulative) | Connection + occurrence (alone) |
---|---|---|---|---|---|
America | 1294 | 1294 | 4698 | 3746 | 3746 |
Germany | 1294 | 1294 | 2543 | -141 | -3887 |
village | 1303 | 1303 | 8707 | 881 | 1022 |
America | 1294 | 1294 | 4698 | 3746 | 3746 |
German village | 1288 | 1288 | 611 | 2614 | -1132 |
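The relationship between the last two columns can be checked mechanically. The following Python sketch parses the raw output shown earlier, confirms that the cumulative value (%pc) is the running sum of the per-morpheme value (%pn), and derives each connection cost as %pn minus %c. The parsing code and its names are my own; only the numbers come from the output.

```python
from itertools import accumulate

# Output of the %m,%phl,%phr,%c,%pc,%pn format, copied from this article.
OUTPUT = """\
America,1294,1294,4698,3746,3746
Germany,1294,1294,2543,-141,-3887
village,1303,1303,8707,881,1022
EOS
America,1294,1294,4698,3746,3746
German village,1288,1288,611,2614,-1132
EOS
"""


def parse_candidates(text):
    """Split the N-best output into candidates, one list of morphemes each."""
    candidates, current = [], []
    for line in text.splitlines():
        if line == "EOS":
            candidates.append(current)
            current = []
            continue
        # The surface form comes first; the last five fields are integers.
        surface, *numbers = line.rsplit(",", 5)
        left_id, right_id, word_cost, cumulative, alone = map(int, numbers)
        current.append({
            "surface": surface, "left_id": left_id, "right_id": right_id,
            "word_cost": word_cost, "cumulative": cumulative, "alone": alone,
        })
    return candidates


candidates = parse_candidates(OUTPUT)
for number, morphemes in enumerate(candidates, start=1):
    running = list(accumulate(m["alone"] for m in morphemes))
    reported = [m["cumulative"] for m in morphemes]
    print(f"candidate {number}: running sum of %pn = {running}, %pc = {reported}")
    for m in morphemes:
        # %pn = %pw + %pC, so the connection cost alone is %pn - %c.
        print(f"  {m['surface']}: connection cost = {m['alone'] - m['word_cost']}")
```

The derived connection costs are not printed by MeCab in this format; they are just what the subtraction gives for the figures above.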
Please note that **BOS and EOS are also given context IDs** (both are 0). So the connection cost of the first "America" comes from this line of matrix.def:
0 1294 -952
Therefore, its cumulative cost is 4698 - 952 = 3746. Next, let's look at "Germany". Its left and right context IDs are 1294, and the connection cost for 1294 → 1294 is -6430, which is quite small. (Even though country names rarely appear in a row ...) The cumulative cost is (3746 + 2543) - 6430 = -141, and the cost of the morpheme alone is 2543 - 6430 = -3887, both consistent with the output.
Also, although it does not appear in the output, one final connection is added because EOS also has a context ID. The connection cost for context ID 1303 → 0 is 5, and the connection cost for context ID 1288 → 0 is -919. Adding these to the last cumulative cost of each candidate gives 881 + 5 = 886 for the first and 2614 - 919 = 1695 for the second. The first candidate has the lower total cost, so the mystery mentioned earlier is solved.
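The whole calculation, including BOS and EOS, can be sketched in a few lines of Python. The connection costs for 0 → 1294, 1294 → 1294, 1303 → 0 and 1288 → 0 are the ones quoted in this article; the costs for 1294 → 1303 and 1294 → 1288 are not quoted anywhere, so I derived them as %pn minus %c from the output (an assumption of this sketch). The function name `path_cost` is my own.

```python
BOS_EOS_ID = 0  # context ID shared by BOS and EOS in the figures above

# {(right ID of previous morpheme, left ID of next morpheme): connection cost}
CONNECTION = {
    (0, 1294): -952,
    (1294, 1294): -6430,
    (1294, 1303): 1022 - 8707,   # derived from %pn - %c for "village"
    (1294, 1288): -1132 - 611,   # derived from %pn - %c for "German village"
    (1303, 0): 5,
    (1288, 0): -919,
}

# (surface, left ID, right ID, occurrence cost)
candidate_1 = [("America", 1294, 1294, 4698),
               ("Germany", 1294, 1294, 2543),
               ("village", 1303, 1303, 8707)]
candidate_2 = [("America", 1294, 1294, 4698),
               ("German village", 1288, 1288, 611)]


def path_cost(morphemes, connection):
    """Total cost of BOS -> morphemes -> EOS: connection plus occurrence costs."""
    total = 0
    prev_right_id = BOS_EOS_ID
    for _surface, left_id, right_id, word_cost in morphemes:
        total += connection[(prev_right_id, left_id)] + word_cost
        prev_right_id = right_id
    # Final connection from the last morpheme to EOS.
    total += connection[(prev_right_id, BOS_EOS_ID)]
    return total


cost_1 = path_cost(candidate_1, CONNECTION)
cost_2 = path_cost(candidate_2, CONNECTION)
print(cost_1, cost_2)  # 886 1695
```

This only scores the two segmentations that MeCab already listed; it does not search for the best path the way MeCab itself does.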