This article uses curses, a library for building text user interfaces (TUIs), to display the learning progress of BPE in a readable way.

The full code is available as a gist: bpe_curses.py

Environment

macOS Catalina
Python 3.7.3
iTerm2 (3.3.7)

BPE

(If you are only interested in curses, skip this section.)

What is BPE

Byte Pair Encoding (BPE) is a technique that is also used in SentencePiece, a tokenizer for neural machine translation. Its use for subword segmentation first appeared in Neural Machine Translation of Rare Words with Subword Units (ACL 2016), and the paper also includes an implementation.

For example, given words such as lower, newer, and wider, you can reduce the vocabulary size by treating the frequent sequence `e r` as a single symbol `er`.
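As a rough illustration of why `e r` stands out, the sketch below counts adjacent character pairs in those three words. It assumes each word occurs exactly once, which is an assumption made for this example rather than data from the article.

```python
from collections import Counter

# Hypothetical toy corpus: each word is assumed to occur once.
words = ["lower", "newer", "wider"]

pair_counts = Counter()
for word in words:
    symbols = list(word)
    # Count every pair of neighbouring characters in the word.
    pair_counts.update(zip(symbols, symbols[1:]))

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```

Under this assumption, the pair `('e', 'r')` appears in all three words, so it is the first candidate for merging.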

Although it is best known in NLP as a subword segmentation algorithm, BPE was originally a data compression method. The principle is also explained in [Byte pair encoding (Wikipedia)](https://ja.wikipedia.org/wiki/Byte pair encoding).

In this article, the progress of the compression is displayed using the code from the paper.

BPE implementation overview

For the BPE implementation, I used the code from the paper as it is and only added type hints.

The implementation consists of two main functions, get_stats() and merge_vocab().

- get_stats takes a vocabulary dictionary and counts the frequency of each adjacent symbol pair.
  - A defaultdict is used so that pairs not yet present as keys can be counted.

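import collections
from typing import DefaultDict, Dict

def get_stats(vocab: Dict) -> DefaultDict:
    # Count adjacent symbol pairs, weighted by the frequency of each word
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs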

- merge_vocab merges the most frequent pair found by get_stats so that it is treated as a single symbol.

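import re
from typing import Dict, List

def merge_vocab(pair: List, v_in: Dict) -> Dict:
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only when it is delimited by whitespace (whole symbols)
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        # Replace "a b" with "ab" and keep the word's frequency
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out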

In this article, we display how the words change after each call to merge_vocab.
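To make the interplay of the two functions concrete, here is a minimal sketch of the merge loop. The functions are repeated so the block runs on its own; the `learn_bpe` helper is a hypothetical name introduced for this example, and the toy vocabulary follows the example in the paper, which is not necessarily the data used in bpe_curses.py.

```python
import collections
import re
from typing import Dict, List, Tuple


def get_stats(vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair: Tuple[str, str], v_in: Dict[str, int]) -> Dict[str, int]:
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {p.sub(''.join(pair), word): freq for word, freq in v_in.items()}


def learn_bpe(vocab: Dict[str, int], num_merges: int):
    """Hypothetical helper: apply up to num_merges merges, recording each pair."""
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return vocab, merges


# Toy vocabulary: space-separated symbols with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
final_vocab, merges = learn_bpe(vocab, 3)
print(merges)
print(final_vocab)
```

Each entry in `merges` is the pair chosen at that step, and `final_vocab` holds the words after those merges; this intermediate state is what the curses display visualizes.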

Curses

What is Curses?

The curses library provides terminal-independent screen drawing and keyboard handling for text-based terminals such as VT100s, the Linux console, and the terminal emulators provided by various programs. (From Curses Programming with Python)

curses is part of the Python standard library. (It does not seem to be included in the Windows version, though.) With curses, you can easily build CUI-style applications.

For example, life.py in the Python demos is an implementation of the [Game of Life](https://ja.wikipedia.org/wiki/Lifegame).

How to use Curses

We will implement the display of state transitions with curses.

Let's use wrapper

As described in Curses Programming with Python (https://docs.python.org/en/3/howto/curses.html), use the curses.wrapper() function to avoid the complexity of initialization and error handling.

import curses

def main(stdscr):
    # Do all curses processing through stdscr
    pass

if __name__ == '__main__':
    curses.wrapper(main)

Basic operation

The basic processing flow is as follows.

- stdscr.addstr(str): Add the text str at the current position
- stdscr.refresh(): Refresh the display
  - Shows the text that was added with addstr
- stdscr.getkey(): Accept a keystroke
  - Acts as a pause (otherwise the program would end immediately)

for i in range(10):
    stdscr.addstr('{}\n'.format(i))
    stdscr.refresh()
    stdscr.getkey()

The code above repeatedly displays a number and then waits for a key press.

Preventing off-screen errors

Attempting to write beyond the height of the screen raises an error.

To prevent this, first obtain the current screen size and then make sure nothing is written outside the displayable range. You can get the size with getmaxyx().

stdscr_y, stdscr_x = stdscr.getmaxyx()
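One way to use that size is to clip the output before drawing it. The sketch below is only an illustration under that assumption, not the approach taken in bpe_curses.py; `clip_to_screen` and `draw_lines` are hypothetical helper names.

```python
from typing import List


def clip_to_screen(lines: List[str], max_y: int, max_x: int) -> List[str]:
    """Return only the lines (and the part of each line) that fit on screen."""
    # Leave the last column free: writing into the bottom-right cell
    # is a common cause of curses errors.
    width = max(max_x - 1, 0)
    return [line[:width] for line in lines[:max_y]]


def draw_lines(stdscr, lines: List[str]) -> None:
    """Draw lines from the top-left corner without leaving the window."""
    max_y, max_x = stdscr.getmaxyx()
    for y, line in enumerate(clip_to_screen(lines, max_y, max_x)):
        stdscr.addstr(y, 0, line)
    stdscr.refresh()
```

Inside a function passed to `curses.wrapper`, `draw_lines(stdscr, lines)` would then silently drop whatever does not fit instead of raising an error. Scrolling or paging would be an alternative design if the hidden lines matter.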

Improving the display

Plain output is rather bland, so let's improve the display a little.

In this article, a word is shown in bold when symbols in it have been merged. Specifically, pass the attribute curses.A_BOLD to `addstr`.
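A minimal sketch of this idea is shown below. It assumes words are stored as space-separated symbols, as in the BPE code above; `attr_for_word` and `show_vocab` are hypothetical helpers and may differ from what bpe_curses.py actually does.

```python
import curses
from typing import Iterable, Tuple


def attr_for_word(word: str, merged_pair: Tuple[str, str]) -> int:
    """Return A_BOLD if the word contains the merged symbol, else A_NORMAL."""
    merged_symbol = ''.join(merged_pair)
    if merged_symbol in word.split():
        return curses.A_BOLD
    return curses.A_NORMAL


def show_vocab(stdscr, vocab: Iterable[str], merged_pair: Tuple[str, str]) -> None:
    """Write each word on its own line, bolding the ones that were just merged."""
    for word in vocab:
        stdscr.addstr(word + '\n', attr_for_word(word, merged_pair))
    stdscr.refresh()
```

Keeping the attribute decision in a separate function makes it possible to check the logic without a terminal; only `show_vocab` needs a real curses window.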

You can also color the text or make it blink. The attributes from the Attributes and Color section of the HOWTO are used as follows.

stdscr.addstr('This is A_BOLD\n', curses.A_BOLD)
stdscr.addstr('This is A_BLINK\n', curses.A_BLINK)
stdscr.addstr('This is A_DIM\n', curses.A_DIM)
stdscr.addstr('This is A_STANDOUT\n', curses.A_STANDOUT)
stdscr.addstr('This is A_REVERSE\n', curses.A_REVERSE)
stdscr.addstr('This is A_UNDERLINE\n\n', curses.A_UNDERLINE)
# Specify the text (foreground) color and the background color
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
stdscr.addstr("This is curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)\n", curses.color_pair(1))

To make the output easier to see, the iTerm2 color setting used for the results is Dark Background.

Output result

Running bpe_curses.py, which was written with the above in mind, shows the updated result of one merge each time you press a key.

The text is small and a little hard to read, but the words shown in **bold** are the ones that contain the merged pair. At first, newer and wider are bold because `e` and `r` form the most frequent pair (second line).

Also, after repeating the merge 10 times, you can see that the vocabulary size displayed on the far left has decreased (14 → 6).

There was not much need for interactivity this time, but since curses can receive keystrokes, I feel it could be used in many ways with a little ingenuity. It is easier than implementing a GUI, so it may be a good choice when you want to show some output in a slightly nicer way.

References

- Documentation
  - curses https://docs.python.org/ja/3/library/curses.html
  - Curses Programming with Python https://docs.python.org/ja/3/howto/curses.html
- Using the curses module in Python https://torina.top/detail/437/

Interactively output BPE using python curses