In this post, I use curses, a library for building text user interfaces (TUIs), to display the learning progress of BPE in a readable way.
The full code is available as a gist: bpe_curses.py
BPE
(If you are only interested in curses, skip this section.)
Byte Pair Encoding (BPE) is a technique also used in SentencePiece, a tokenizer for neural machine translation. Its use for subword segmentation first appeared in Neural Machine Translation of Rare Words with Subword Units (ACL 2016), and the paper also includes an implementation.
For example, given words like `lower`, `newer`, and `wider`, you can reduce the vocabulary size by treating the frequent pair `e r` as a single symbol `er`.
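To make the example concrete, here is a minimal sketch (not the paper's code) that counts adjacent character pairs in these three words and merges the most frequent one. `merge_pair` is a hypothetical helper name, and each word is counted once, without the frequency weights that real BPE training uses.

```python
from collections import Counter

words = ['lower', 'newer', 'wider']
sequences = [list(w) for w in words]  # start from single characters

# Count every adjacent pair of symbols across all words
pair_counts = Counter()
for symbols in sequences:
    pair_counts.update(zip(symbols, symbols[1:]))

best_pair, best_count = pair_counts.most_common(1)[0]


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


merged = [merge_pair(s, best_pair) for s in sequences]
print(best_pair, best_count)
print(merged)
# Total number of symbols needed to write the three words, before and after
print(sum(map(len, sequences)), '->', sum(map(len, merged)))
```

After the merge, every word ends in the single symbol `er`, so the same three words are written with fewer symbols.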
Although BPE is well known in NLP as a subword segmentation algorithm, it was originally a data compression method; the principle is also explained in [Byte pair encoding (Wikipedia)](https://ja.wikipedia.org/wiki/Byte pair encoding).
Here, building on the code from the paper, I display the progress of the compression.
For the BPE implementation, I used the code from the paper as is and only added type hints.
The implementation consists of two main functions, `get_stats()` and `merge_vocab()`.
- `get_stats` takes a vocabulary dictionary and counts the frequency of adjacent symbol pairs.
  - `defaultdict` is used to handle pairs that are not yet among the keys.
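import collections
from typing import DefaultDict, Dict

def get_stats(vocab: Dict) -> DefaultDict:
    # Count each adjacent symbol pair, weighted by the frequency of the word
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs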
- `merge_vocab` takes the most frequent pair found by `get_stats` and merges it so that it is treated as a single symbol.
import re
from typing import Dict, List

def merge_vocab(pair: List, v_in: Dict) -> Dict:
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only where both symbols are whole, space-delimited tokens
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out
Here, we display how the words change after each call to `merge_vocab`.
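As a rough sketch of how the two functions fit together, the loop below repeatedly picks the most frequent pair and merges it. The `learn_bpe` generator is a hypothetical name introduced for illustration, and the sample vocabulary is the one used in the paper, not necessarily the one used in bpe_curses.py.

```python
import collections
import re


def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    """Join every occurrence of `pair` into a single symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[p.sub(''.join(pair), word)] = v_in[word]
    return v_out


def learn_bpe(vocab, num_merges):
    """Yield (merged pair, vocabulary after the merge) for each step."""
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:  # nothing left to merge
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        yield best, vocab


# Sample vocabulary from the paper: space-separated symbols -> frequency
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for best, vocab_after in learn_bpe(vocab, 10):
    print(best, list(vocab_after))
```

Yielding each intermediate vocabulary, rather than only the final one, is what makes it possible to display the transition step by step.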
Curses
The curses library provides terminal-independent screen drawing and keyboard handling for text-based terminals such as VT100s, the Linux console, and the simulated terminals provided by various programs. (From Curses Programming in Python)
curses is part of the Python standard library. (It does not seem to be included in the Windows version, though.) With curses, you can easily create something like a CUI application.
For example, life.py in the Python demos is an implementation of the [Game of Life](https://ja.wikipedia.org/wiki/Lifegame).
We will implement the display of state transitions with curses.
As described in Curses Programming in Python (https://docs.python.org/en/3/howto/curses.html), use the `curses.wrapper()` function to avoid the complexity of initialization and error handling.
import curses

def main(stdscr):
    # Do the curses processing with stdscr here
    pass

if __name__ == '__main__':
    curses.wrapper(main)
The basic processing flow is as follows.
- `stdscr.addstr(str)`: Add the text `str` at the current position
- `stdscr.refresh()`: Refresh the display
  - Displays the text that has been added with `addstr`
- `stdscr.getkey()`: Accept a keystroke
  - Serves as a wait (otherwise the program would end immediately)
for i in range(10):
    stdscr.addstr('{}\n'.format(i))
    stdscr.refresh()
    stdscr.getkey()
With the code above, a number is displayed and the program then waits for a key press, ten times in a row.
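Because `getkey()` returns the pressed key as a string, the same wait can also be used to let the user stop early. The following is a small sketch under that assumption; `show_until_quit` and the choice of `q` as the quit key are hypothetical, not taken from bpe_curses.py.

```python
def show_until_quit(stdscr, items, quit_key='q'):
    """Show one item per key press; stop early when quit_key is pressed.

    Returns the number of items that were displayed.
    """
    shown = 0
    for item in items:
        stdscr.addstr('{}\n'.format(item))
        stdscr.refresh()
        shown += 1
        # getkey() blocks until a key is pressed and returns it as a string
        if stdscr.getkey() == quit_key:
            break
    return shown
```

It can be started with `curses.wrapper(lambda stdscr: show_until_quit(stdscr, range(10)))`.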
Attempting to write beyond the height of the screen results in an error.
To prevent this, first obtain the current screen size and then make sure you never write outside that range. You can get the size with `getmaxyx()`.
stdscr_y, stdscr_x = stdscr.getmaxyx()
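One way to stay inside the screen is to clip the text before writing it. The sketch below assumes the output is a list of lines; `fit_to_screen` and `draw_lines` are hypothetical helper names. It leaves the last column unused, because writing into the bottom-right cell also makes `addstr` raise an error.

```python
def fit_to_screen(lines, max_y, max_x):
    """Return only the lines that fit, each cut to at most max_x - 1 columns."""
    width = max(max_x - 1, 0)  # keep the last column free
    return [line[:width] for line in lines[:max_y]]


def draw_lines(stdscr, lines):
    """Clear the screen and draw as many lines as the current size allows."""
    max_y, max_x = stdscr.getmaxyx()
    stdscr.clear()
    for y, line in enumerate(fit_to_screen(lines, max_y, max_x)):
        stdscr.addstr(y, 0, line)  # explicit (row, column) position
    stdscr.refresh()
```

Inside a function passed to `curses.wrapper`, calling `draw_lines(stdscr, lines)` then silently drops whatever does not fit instead of raising an error.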
Left as is, the output is rather plain, so let's dress up the display a little.
Here, words in which characters have been merged are shown in bold.
Specifically, pass the attribute `curses.A_BOLD` to `addstr`.
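As an illustration of that idea, the sketch below picks the attribute per word depending on whether it contains the newly merged symbol. `word_attr` and `show_vocab` are hypothetical names, and the `frequency word` line layout is an assumption rather than the layout used in bpe_curses.py.

```python
import curses


def word_attr(word, merged_symbol):
    """Return A_BOLD if the space-separated word contains merged_symbol."""
    if merged_symbol in word.split():
        return curses.A_BOLD
    return curses.A_NORMAL


def show_vocab(stdscr, vocab, merged_symbol):
    """Draw one 'frequency word' line per entry, bold where the merge applied."""
    max_y, max_x = stdscr.getmaxyx()
    width = max(max_x - 1, 0)  # stay inside the screen
    stdscr.clear()
    for y, (word, freq) in enumerate(list(vocab.items())[:max_y]):
        line = '{:>3} {}'.format(freq, word)
        stdscr.addstr(y, 0, line[:width], word_attr(word, merged_symbol))
    stdscr.refresh()
```

Comparing against `word.split()` rather than the raw string means `er` only counts when it is a symbol of its own, not when `e` and `r` merely sit next to each other.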
You can also color the text or make it blink. Some of the attributes described in Attributes and Colors, and the result of using them, are as follows.
stdscr.addstr('This is A_BOLD\n', curses.A_BOLD)
stdscr.addstr('This is A_BLINK\n', curses.A_BLINK)
stdscr.addstr('This is A_DIM\n', curses.A_DIM)
stdscr.addstr('This is A_STANDOUT\n', curses.A_STANDOUT)
stdscr.addstr('This is A_REVERSE\n', curses.A_REVERSE)
stdscr.addstr('This is A_UNDERLINE\n\n', curses.A_UNDERLINE)
# Specify the text color and the background color
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
stdscr.addstr("This is curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)\n", curses.color_pair(1))
[Screenshot: execution result on a dark background]
When you run bpe_curses.py, which was written with the above in mind, each key press performs one merge and displays the updated result.
The text is small and a little hard to read, but the words shown in **bold** are the ones that contain the merged pair. At first, `newer` and `wider` are in bold because `e` and `r` form the most frequent pair (2nd line).
Also, after repeating the merge 10 times, you can see that the vocabulary size displayed on the far left has decreased (14 → 6).
There was not much need for interactivity this time, but since curses can receive keystrokes, I feel it can be put to many uses with a little ingenuity. It is easier than implementing a GUI, so it may be handy when you want to present a bit of output nicely.