Implemented TRIE tree in Python LOUDS is used for the data structure to create the TRIE tree
--A TRIE tree is a set of character strings represented by an ordered tree structure. --Composed of nodes and edges, and the position and key (search string) of each node correspond. --The empty string corresponds to the root node
--Fast dictionary search: You can search in a constant time regardless of the size of the tree. --Memory saving: Since the key is not stored in the node, the node can be shared by multiple keys, and memory can be saved. --Efficient prefix search: Since a node is shared by multiple keys, it is possible to efficiently search for child nodes associated with the parent node (prefix).
Due to its characteristics, the TRIE tree is used for kana-kanji conversion and auto-completion.
LOUDS --LOUDS (level order unary degree sequence) is one of the order tree expressions, which can express a tree structure with an extremely small size. --Specifically, when the number of nodes is N, the tree structure is represented by a label array of length N and a bit array of 2N + 1. ――However, in order to search efficiently from the created Tree, auxiliary data (complete dictionary) is required separately.
--A complete dictionary is the most basic data structure in a succinct data structure.
--By using the complete dictionary as auxiliary data, it is possible to search in a constant time.
--Specifically, search the TRIE tree using the operations rank and select.
--rank (): How many bits of 1 are from the beginning of the bit string to the position k
--select (): Where is the position next to the nth 1-bit when viewed from the beginning of the bit string?
The succinct data structure is explained in detail on this page
Implemented the following in Python (GitHub)
--trie.py: TRIE tree --constructor.py: bit array --words.py: Create / read dictionary data --measure.py: Measure memory and search time --search_word.py: Word search --test.py: Measure search time
When executed as follows, dictionary data in which the node numbers and words of the TRIE tree are written separated by commas is created. Data used nltk's wordnet corpus Later, test data will be created from this dictionary data.
from words import CreateWords
CreateWords("./data/origin/wordnet_words.csv")
python search_word.py dictionary data PATH
You can do a single word search by running the above file When you enter an arbitrary word, the node number, word definition, and prefix obtained from the search are output as shown below.
You can create test data for any number of words from dictionary data by executing the following
python words.py Dictionary data PATH Number of samples 1, Number of samples 2, Number of samples 3,…
If you want to create multiple test data, specify the data sample size separated by commas. If "Test data is created." Is output, it is ok. Test data is created in ./data/test
When executed, the test can be executed any number of times with the test data as shown below.
python test.py dictionary data PATH test data PATH test count
If "Test is done." Is output, it is ok. Output results are created in ./results
In the measurement, the exact match search time and the prefix search time are measured.
--Exact match search: Total execution time of search function of trie class --Prefix search: Total execution time of get_blow_nodes function of trie class
Internally, the search function of the trie class is executed for the input word, and the node number of the TRIE tree is output. The output node number is collated with the node number of the dictionary data, and if they match, the number of searches is incremented by 1, and it is confirmed whether the search is accurate. The same applies to prefix search
--Exact match search: Total time taken to search for a word --Search result: Number of items that match the node number of the dictionary data --Prefix search: Total time taken to search all prefixes associated with the search word --Prefix search (per item): Search time per prefix associated with the search word
-Explanation of succinct data structure LOUDS (12 times in total, with exercises) -Explanation of complete dictionary (concise bit vector) -I wrote a tri-tree that can be used for high-speed dictionary implementation in D language and Python
Recommended Posts