Introduction

In high school, the curriculum guidelines have been revised, and the subject "Information" has also changed significantly. The biggest change is to learn programming from high school. Regarding this, the Ministry of Education, Culture, Sports, Science and Technology has released teaching materials for teacher training free of charge. In this article, I would like to write supplementary information related to programming that will be learned in high school in the future.

Various articles have already been written by our ancestor engineers, so please check the links below. [Unraveling the 2020 Course of Study Subject "Information"](https://qiita.com/woadachi/items/75cc334b10e22e4c5cad "Unraveling the 2020 Course of Study" Information ") Python of the Ministry of Education is not Python [Full-time faculty member of the subject "Information" and future dreams "Current status of IT engineer / programmer"](https://qiita.com/woadachi/items/fdcb67bed4d496d5823e "Full-time faculty member of the subject" Information "and future dream" IT engineer " ・ Current status of "programmers" ")

Teaching materials

[High School Information Department "Information I" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/1416756.htm "High School Information Department "Information I" teaching materials for teacher training (main part): Ministry of Education, Culture, Sports, Science and Technology ") Chapter 2 Communication and Information Design (PDF: 6892KB) PDF

environment

In the new course of study, "Information I" mainly adds programming, and "Information II" mainly adds data science fields. Considering that, there are many explanations by code in python, so this article also uses python. In addition, it is assumed that Google Colab, which makes it easy to build an environment, is used.

ipython
Colaboratory - Google Colab

Overview

This time, we will use "Information I" for teacher training materials, "Chapter 2 Communication and Information Design". In accordance with the implicit policy of Information I, when programming with python, we will confirm and supplement the elements that are not written in the teaching materials. Specifically, on pages 61-63 of the teaching material "Chapter 2 Communication and Information Design"

(9) File compression-Huffman method

If you check the place, you can see the explanation of the Huffman algorithm, but the ** essential programming code ** is not included. So I would like to implement python code.

Other references, etc.

Huffman coding with Python3 I implemented Huffman coding in Python [Huffman Code-Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%95%E3%83%9E%E3%83%B3%E7%AC%A6% E5% 8F% B7 "Huffman code-Wikipedia")

Huffman method (character string: about intelligence)

Step 1

Commentary

Arrange the characters in order of appearance count

Implementation

Let's use the python dictionary type and arrange them in order of the number of occurrences.


from collections import Counter
target_string = 'intelligence'

print('{0}Check the number of occurrences of the character' .format(target_string))
list_target_string = list(target_string)
counter_target_string = Counter(target_string)
dict_target_string = dict(counter_target_string)
print(dict_target_string)

Output result

Check the number of occurrences of the intelligence character
{'i': 2, 'n': 2, 't': 1, 'e': 3, 'l': 2, 'g': 1, 'c': 1}

Step 2

Commentary

Connect two characters with the least number of appearances and write the total number of appearances on it. Since the selected characters have already been added to the total, select the two characters with the least number of appearances from the total of the other 5 characters and the total number of characters in ②, enter the total, and all the characters are connected by the same procedure. Continue until the total is intelligence (12).

Implementation

class Node:
    def __init__(self, key_char=None, count=None, left=None, right=None):
        self.key_char = key_char
        self.left = left
        self.right = right
        self.count = count

nodes = []

for k, v in dict_target_string.items():
    node_obj = Node(key_char = k, count = v)
    nodes.append(node_obj)

trial_count = 0
while len(nodes) >= 2:
    #Sort nodes in descending order
    nodes = sorted(nodes, key=lambda x: x.count, reverse=True)

    #Number of branch creations
    trial_count += 1

    #Extract the nodes with the smallest number of occurrences
    left = nodes.pop()
    #Extract the node with the second smallest number of occurrences
    right = nodes.pop()

    #Add the number of occurrences of left and right to nodes
    count = left.count + right.count
    append_node = Node(key_char = left.key_char + right.key_char, count = count, left = left, right = right)
    nodes.append(append_node)

    print("Number of enforcement:", trial_count)
    print(left.key_char, right.key_char,  count)
    print("---------------")

Output result

Number of enforcement: 1
c g 2
---------------
Number of enforcement: 2
t cg 3
---------------
Number of enforcement: 3
l n 4
---------------
Number of enforcement: 4
i tcg 5
---------------
Number of enforcement: 5
e ln 7
---------------
Number of enforcement: 6
itcg eln 12
---------------

Step 3

Commentary

Put 0 on the left and 1 on the right of all the connected parts. Pick up the 0s and 1s in the spliced part from the top and assign the compressed bit string to each character.

Implementation

#Get the compression result
def recursive_code(node, s, temp_encode_dict):
    if len(node.key_char) == 1:
        temp_encode_dict[node.key_char] = s
        return
    recursive_code(node.left, s + "0", temp_encode_dict)
    recursive_code(node.right, s + "1", temp_encode_dict)

encode_dict = {}
tree = nodes[0]
recursive_code(tree, "", encode_dict)
print(encode_dict)

Output result

{'i': '00', 't': '010', 'c': '0110', 'g': '0111', 'e': '10', 'l': '110', 'n': '111'}

Compression rate

Commentary

If "intelligence" is expressed as a compressed bit string using the created Huffman code, it will be 33 bits. If one character of "intelligence" was represented by 3 bits, it was 36 bits, which means that it was compressed to about 92%.

Implementation

print(target_string,'Compression rate comparison')
bitlength = len(str(bin(len(dict_target_string) - 1))) - 2
before_size = target_len * bitlength
print('Size before compression')
print(target_len,'*',bitlength ,'=', before_size, 'bits')

encode_bits_string = ''
for v in list_target_string:
    encode_bits_string += encode_dict[v]
print('Bit string after compression')
print(encode_bits_string)
print('Compressed size')
print(len(encode_bits_string),'bits')

Output result

Intelligence compression ratio comparison
Size before compression
12 * 3 = 36 bits
Bit string after compression
001110101011011000011110111011010
Compressed size
33 bits

Comparison / correction of implementation results with teaching materials

The explanation of the teaching materials is as follows.

Let's compare it with the output result implemented here.

Check the number of occurrences of the intelligence character
{'i': 2, 'n': 2, 't': 1, 'e': 3, 'l': 2, 'g': 1, 'c': 1}

{'i': '00', 't': '010', 'c': '0110', 'g': '0111', 'e': '10', 'l': '110', 'n': '111'}

**···that? ** ** ** The output result is different. ** ** Since the method of generating branches and the method of distributing branches to the left and right are the contents that explain only the logic ignoring the implementation, this does not result as the teaching material. The following points should be noted.

--When arranging characters by the number of occurrences, it is necessary to arrange them in a stable sort. --Select two items that appear less frequently, but select items that have the same number of alphabets as much as possible. (Priority is given to the same number of connected characters in the first alphabet as the second one) --As for how to specify the left and right of the branch, the one with the smaller number of connected alphabets is assigned to the left, the one that is not assigned to the right is assigned to the right, and if not, the one acquired first is assigned to the right and the second one. The acquired one is on the left.

Therefore, implement according to this condition.

Step 2'

Implementation


class Node:
    def __init__(self, key_char=None, count=None, left=None, right=None):
        self.key_char = key_char
        self.left = left
        self.right = right
        self.count = count

nodes = []
encode_dict = {}

for k, v in dict_target_string.items():
    node_obj = Node(key_char = k, count = v)
    nodes.append(node_obj)

trial_count = 0
while len(nodes) >= 2:
    #Sort nodes in descending order
    nodes = sorted(nodes, key=lambda x: x.count, reverse=True)

    #Number of branch creations
    trial_count += 1

    #Extract the nodes with the smallest number of occurrences
    first = nodes.pop()
    #Extract the node with the second smallest number of occurrences
    second = nodes.pop()

    #temp_nodes
    temp_nodes = []

    #Determine if the first and second alphabetic combinations are the same
    if (len(first.key_char) != len(second.key_char)):
        print('The number of alphabet combinations of first and second is different')

        #When there is still one or more nodes, get the third and subsequent nodes
        while len(nodes) >= 1:
            overthird = nodes.pop()
            if (second.count == overthird.count):
                print('The number of occurrences of second and over third match')
                if (len(first.key_char) == len(overthird.key_char)):
                    print('The number of alphabetic combinations of first and overthird match')
                    nodes.append(second)
                    second = overthird
                    break
                else:
                  temp_nodes.append(overthird)
            else:
                nodes.append(overthird)
                break
    
    #Return what was popped
    nodes += temp_nodes

    #Add the number of occurrences of first and second to nodes
    count = first.count + second.count

    print("Number of enforcement:", trial_count)

    if len(first.key_char) < len(second.key_char): 
        append_node = Node(key_char = first.key_char + second.key_char, count = count, left = first, right = second)
    else:
        append_node = Node(key_char = second.key_char + first.key_char, count = count, left = second, right = first) 

    print(append_node.left.key_char, append_node.right.key_char,  count)

    nodes.insert(0, append_node)
    print("---------------")

~~ It became a uselessly redundant process. To fucking code at once ~~

Output result

Number of enforcement: 1
g c 2
---------------
Number of enforcement: 2
l t 3
---------------
Number of enforcement: 3
i n 4
---------------
The number of alphabet combinations of first and second is different
The number of occurrences of second and over third match
The number of alphabetic combinations of first and overthird match
Number of enforcement: 4
lt gc 5
---------------
The number of alphabet combinations of first and second is different
Number of enforcement: 5
e in 7
---------------
The number of alphabet combinations of first and second is different
Number of enforcement: 6
ein ltgc 12
---------------

Correction result

Result of executing step 3 again

{'e': '00', 'i': '010', 'n': '011', 'l': '100', 't': '101', 'g': '110', 'c': '111'}

It was in agreement with the explanation of the teaching materials. By the way, if you compare the compression ratio,

Intelligence compression ratio comparison
Size before compression
12 * 3 = 36 bits
Bit string after compression
010011101001001000101100001111100
Compressed size
33 bits

Source code

https://gist.github.com/ereyester/68ef4653890f0823e81e99dc00d73a07

[High School Curriculum Guidelines Information I] Teaching materials for teacher training: Implementation of Huffman method in python

Introduction

Teaching materials

environment

Overview

Other references, etc.

Huffman method (character string: about intelligence)

Step 1

Commentary

Implementation

Output result

Step 2

Commentary

Implementation

Output result

Step 3

Commentary

Implementation

Output result

Compression rate

Commentary

Implementation

Output result

Comparison / correction of implementation results with teaching materials

Step 2'

Implementation

Output result

Correction result

Source code