Introduction to dictionary lookup algorithm

Introduction

There were many occasions when I was doing natural language processing in my work and I was asked to speed up something, so I came up with the term dictionary lookup algorithm. Perhaps because deep learning is too widespread in the natural language processing area, it seems that there are many people who do not come to the point even if they hear it as a dictionary lookup algorithm.

What is a dictionary lookup algorithm?

An algorithm that looks up words in a dictionary from all possible substrings in a text. It supports the back side of processing such as morphological analysis. It seems that MeCab and Human can be made by mastering this path. For example, given the string abc, the substrings are:

Check if these substrings are in the dictionary. In short, it's like an algorithm that pulls words in a dictionary from a character string.

The computational complexity of enumerating this substring is $ O (N \ ^ 2) , but at this point it is far from practical application. ( O (N) $ is preferred)

By the way, the dictionary is sometimes called a hash, but since it costs $ O (N) $ to calculate the hash value of a character string, it costs $ O (N ^ 3) $ in the end.

Example of speeding up

Therefore, we have to speed up somehow. An example is the $ common prefix search $.

This is against the sentence "obstruction of public affairs execution"

--Public --Public affairs --Public affairs execution --Interference with the execution of public affairs

It is a process that pulls the word registered in the dictionary from the sentences with the same prefix.

This is accelerated using a data structure called a tri-tree. (See here for trie trees)

public
 |---end 
Duties
 |---end
Execution
 |---end
Interference
 |
end

When the word "public" appears, make it possible to search by tracing the substring like this. (When end appears, it is registered as one word at that time)

The computational complexity of this tri-tree search is $ O (K) $. (K: average word length)

By using the common prefix search, for the sentence "I want to quit college"

--Common prefix search for the word "large" --Common prefix search for the word "gaku" -(Omitted)

You can look up the dictionary like this. This is $ O (NK) $.

Trivia

It seems that tri-trees are used in MeCab, and it seems that the fastest tri-tree implementation method called double array is used. (The link destination is an easy-to-understand slide with a double array) A C ++ library called Darts is open to the public, making it easy to use double arrays.

Next, I would like to focus on the processing after looking up the dictionary.

Recommended Posts

Introduction to dictionary lookup algorithm
Introduction to MQTT (Introduction)
Introduction to Scrapy (3)
Introduction to Supervisor
Introduction to Tkinter 1: Introduction
Introduction to PyQt
Introduction to Scrapy (2)
Dictionary learning algorithm
[Linux] Introduction to Linux
[Introduction to Udemy Python3 + Application] 24. Dictionary type
Introduction to Scrapy (4)
Introduction to discord.py (2)
[Introduction to Udemy Python3 + Application] 61. Dictionary comprehension
Introduction to discord.py
[What is an algorithm? Introduction to Search Algorithm] ~ Python ~
[Introduction to Udemy Python 3 + Application] 26. Copy of dictionary
[Introduction to Algorithm] Find the shortest path [Python3]
Introduction to Lightning pytorch
Introduction to Web Scraping
Introduction to Nonparametric Bayes
Introduction to EV3 / MicroPython
Introduction to Python language
Introduction to TensorFlow-Image Recognition
Introduction to OpenCV (python)-(2)
Introduction to PyQt4 Part 1
How to use dictionary {}
Introduction to Dependency Injection
Introduction to Private Chainer
Access to dictionary fields
Introduction to machine learning
[Introduction to Reinforcement Learning] part.1-Epsilon-Greedy Algorithm in Bandit Game
[Introduction to Udemy Python3 + Application] 53. Dictionary of keyword arguments
[Introduction to Udemy Python3 + Application] 27. How to use the dictionary
AOJ Introduction to Programming Topic # 1, Topic # 2, Topic # 3, Topic # 4
Introduction to electronic paper modules
A quick introduction to pytest-mock
Introduction to Monte Carlo Method
[Learning memorandum] Introduction to vim
Introduction to PyTorch (1) Automatic differentiation
opencv-python Introduction to image processing
Introduction to Python Django (2) Win
Introduction to Cython Writing [Notes]
An introduction to private TensorFlow
Kubernetes Scheduler Introduction to Homebrew
An introduction to machine learning
[Introduction to cx_Oracle] Overview of cx_Oracle
A super introduction to Linux
AOJ Introduction to Programming Topic # 7, Topic # 8
[Introduction to pytorch-lightning] First Lit ♬
Add a dictionary to MeCab
Introduction to Anomaly Detection 1 Basics
Introduction to RDB with sqlalchemy Ⅰ
[Introduction to Systre] Fibonacci Retracement ♬
Introduction to Nonlinear Optimization (I)
Introduction to serial communication [Python]
AOJ Introduction to Programming Topic # 5, Topic # 6
Introduction to Deep Learning ~ Learning Rules ~
[Introduction to Python] <list> [edit: 2020/02/22]
Introduction to Python (Python version APG4b)
An introduction to Python Programming
Introduction to Python scikit-learn, matplotlib, single-layer algorithm (~ towards B3 ~ part3)