In my natural language processing work I was often asked to speed something up, and the technique I kept reaching for is what I call the dictionary lookup algorithm. Perhaps because deep learning is so dominant in NLP these days, many people seem to have no idea what I mean when I say "dictionary lookup algorithm."
It is an algorithm that checks, among all possible substrings of a text, which ones are registered in a dictionary. It underpins the back end of processing such as morphological analysis; it is said that analyzers like MeCab and JUMAN can be built by mastering this path. For example, given the string abc, the substrings are: a, ab, abc, b, bc, c.
Check whether each of these substrings is in the dictionary. In short, it is an algorithm that pulls dictionary words out of a character string.
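The naive approach can be sketched as follows; the dictionary contents here are a toy assumption for illustration:

```python
def naive_lookup(text, dictionary):
    """Return every (start, word) pair whose substring is in the dictionary."""
    hits = []
    n = len(text)
    for i in range(n):                  # substring start position
        for j in range(i + 1, n + 1):   # substring end position (exclusive)
            sub = text[i:j]
            if sub in dictionary:
                hits.append((i, sub))
    return hits

dictionary = {"a", "ab", "bc"}          # toy dictionary (assumption)
print(naive_lookup("abc", dictionary))  # → [(0, 'a'), (0, 'ab'), (1, 'bc')]
```

The double loop over start and end positions is exactly the $O(N^2)$ substring enumeration discussed next.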
The computational complexity of enumerating these substrings is $O(N^2)$.
By the way, the dictionary is often implemented as a hash table, but since computing the hash value of a substring costs $O(N)$ in its length, the total cost ends up being $O(N^3)$.
Therefore, we have to speed this up somehow. One standard technique is the **common prefix search**.
This is the process that, given a sentence such as "obstruction of public duty execution" (the Japanese 公務執行妨害), pulls out every dictionary word that shares a prefix with the string, starting from its beginning:

- 公 (public)
- 公務 (public duties)
- 公務執行 (execution of public duties)
- 公務執行妨害 (obstruction of public duty execution)
This is accelerated using a data structure called a trie.
```
公 (public)
|---end
務 (duties)
|---end
執行 (execution)
|---end
妨害 (interference)
|---end
```
When the first character ("public") is read, the search traces the string one character at a time through the trie like this. Whenever an end marker appears, the characters traced so far are registered as one word.
The computational complexity of one trie search is $O(K)$, where K is the average word length.
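A minimal trie sketch with a common prefix search might look like this. Note this is a plain nested-dict trie, not the double array MeCab actually uses; the word list is a toy assumption:

```python
END = object()  # sentinel key marking "a registered word ends at this node"

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # descend, creating nodes as needed
        node[END] = True
    return root

def common_prefix_search(trie, text):
    """Return every dictionary word that is a prefix of `text` (an O(K) walk)."""
    results = []
    node = trie
    for i, ch in enumerate(text):
        if ch not in node:
            break               # no registered word continues with this character
        node = node[ch]
        if END in node:
            results.append(text[:i + 1])
    return results

trie = build_trie(["a", "ab", "abc", "b"])
print(common_prefix_search(trie, "abcd"))  # → ['a', 'ab', 'abc']
```

The walk visits at most one node per character, which is where the $O(K)$ bound comes from.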
By using the common prefix search, the sentence "I want to quit college" (大学をやめたい) can be looked up against the dictionary like this:

- common prefix search starting from 大 ("dai")
- common prefix search starting from 学 ("gaku")
- (omitted)

This costs $O(NK)$ overall.
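Putting it together, the full lookup runs a common prefix search from every position in the text, giving $O(NK)$ overall. This sketch uses a plain nested-dict trie with a `"$end"` marker key (the dictionary contents are again a toy assumption, not MeCab's):

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True            # marker: a word terminates at this node
    return root

def lookup_all(text, trie):
    """Return (start, word) for every dictionary word found anywhere in text."""
    hits = []
    for start in range(len(text)):     # common prefix search from each position
        node = trie
        for i in range(start, len(text)):
            ch = text[i]
            if ch not in node:
                break
            node = node[ch]
            if "$end" in node:
                hits.append((start, text[start:i + 1]))
    return hits

trie = build_trie(["ab", "bc", "abc"])
print(lookup_all("abcd", trie))  # → [(0, 'ab'), (0, 'abc'), (1, 'bc')]
```

Each of the N starting positions does at most a K-step walk, hence $O(NK)$ instead of the naive $O(N^2)$ enumeration.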
Tries are reportedly used in MeCab, implemented with what is said to be the fastest trie representation, the double array. A C++ library called Darts that implements double arrays is publicly available, making them easy to use.
Next, I would like to focus on the processing after looking up the dictionary.