In high school, the curriculum guidelines have been revised, and the subject "Information" has also changed significantly. The biggest change is to learn programming from high school. Regarding this, the Ministry of Education, Culture, Sports, Science and Technology has released teaching materials for teacher training free of charge. In this article, I would like to write supplementary information related to programming that will be learned in high school in the future.
Various articles have already been written by our ancestor engineers, so please check the links below. [Unraveling the 2020 Course of Study Subject "Information"](https://qiita.com/woadachi/items/75cc334b10e22e4c5cad "Unraveling the 2020 Course of Study" Information ") Python of the Ministry of Education is not Python [Full-time faculty member of the subject "Information" and future dreams "Current status of IT engineer / programmer"](https://qiita.com/woadachi/items/fdcb67bed4d496d5823e "Full-time faculty member of the subject" Information "and future dream" IT engineer " ・ Current status of "programmers" ")
[High School Information Department "Information I" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/1416756.htm "High School Information Department "Information I" teaching materials for teacher training (main part): Ministry of Education, Culture, Sports, Science and Technology ") Chapter 2 Communication and Information Design (PDF: 6892KB) PDF
In the new course of study, "Information I" mainly adds programming, and "Information II" mainly adds data science fields. Considering that, there are many explanations by code in python, so this article also uses python. In addition, it is assumed that Google Colab, which makes it easy to build an environment, is used.
This time, we will use "Information I" for teacher training materials, "Chapter 2 Communication and Information Design". In accordance with the implicit policy of Information I, when programming with python, we will confirm and supplement the elements that are not written in the teaching materials. Specifically, on pages 61-63 of the teaching material "Chapter 2 Communication and Information Design"
(9) File compression-Huffman method
If you check the place, you can see the explanation of the Huffman algorithm, but the ** essential programming code ** is not included. So I would like to implement python code.
Huffman coding with Python3 I implemented Huffman coding in Python [Huffman Code-Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%95%E3%83%9E%E3%83%B3%E7%AC%A6% E5% 8F% B7 "Huffman code-Wikipedia")
Arrange the characters in order of appearance count
Let's use the python dictionary type and arrange them in order of the number of occurrences.
from collections import Counter
target_string = 'intelligence'
print('{0}Check the number of occurrences of the character' .format(target_string))
list_target_string = list(target_string)
counter_target_string = Counter(target_string)
dict_target_string = dict(counter_target_string)
print(dict_target_string)
Check the number of occurrences of the intelligence character
{'i': 2, 'n': 2, 't': 1, 'e': 3, 'l': 2, 'g': 1, 'c': 1}
Connect two characters with the least number of appearances and write the total number of appearances on it. Since the selected characters have already been added to the total, select the two characters with the least number of appearances from the total of the other 5 characters and the total number of characters in ②, enter the total, and all the characters are connected by the same procedure. Continue until the total is intelligence (12).
class Node:
def __init__(self, key_char=None, count=None, left=None, right=None):
self.key_char = key_char
self.left = left
self.right = right
self.count = count
nodes = []
for k, v in dict_target_string.items():
node_obj = Node(key_char = k, count = v)
nodes.append(node_obj)
trial_count = 0
while len(nodes) >= 2:
#Sort nodes in descending order
nodes = sorted(nodes, key=lambda x: x.count, reverse=True)
#Number of branch creations
trial_count += 1
#Extract the nodes with the smallest number of occurrences
left = nodes.pop()
#Extract the node with the second smallest number of occurrences
right = nodes.pop()
#Add the number of occurrences of left and right to nodes
count = left.count + right.count
append_node = Node(key_char = left.key_char + right.key_char, count = count, left = left, right = right)
nodes.append(append_node)
print("Number of enforcement:", trial_count)
print(left.key_char, right.key_char, count)
print("---------------")
Number of enforcement: 1
c g 2
---------------
Number of enforcement: 2
t cg 3
---------------
Number of enforcement: 3
l n 4
---------------
Number of enforcement: 4
i tcg 5
---------------
Number of enforcement: 5
e ln 7
---------------
Number of enforcement: 6
itcg eln 12
---------------
Put 0 on the left and 1 on the right of all the connected parts. Pick up the 0s and 1s in the spliced part from the top and assign the compressed bit string to each character.
#Get the compression result
def recursive_code(node, s, temp_encode_dict):
if len(node.key_char) == 1:
temp_encode_dict[node.key_char] = s
return
recursive_code(node.left, s + "0", temp_encode_dict)
recursive_code(node.right, s + "1", temp_encode_dict)
encode_dict = {}
tree = nodes[0]
recursive_code(tree, "", encode_dict)
print(encode_dict)
{'i': '00', 't': '010', 'c': '0110', 'g': '0111', 'e': '10', 'l': '110', 'n': '111'}
If "intelligence" is expressed as a compressed bit string using the created Huffman code, it will be 33 bits. If one character of "intelligence" was represented by 3 bits, it was 36 bits, which means that it was compressed to about 92%.
print(target_string,'Compression rate comparison')
bitlength = len(str(bin(len(dict_target_string) - 1))) - 2
before_size = target_len * bitlength
print('Size before compression')
print(target_len,'*',bitlength ,'=', before_size, 'bits')
encode_bits_string = ''
for v in list_target_string:
encode_bits_string += encode_dict[v]
print('Bit string after compression')
print(encode_bits_string)
print('Compressed size')
print(len(encode_bits_string),'bits')
Intelligence compression ratio comparison
Size before compression
12 * 3 = 36 bits
Bit string after compression
001110101011011000011110111011010
Compressed size
33 bits
The explanation of the teaching materials is as follows.
Let's compare it with the output result implemented here.
Check the number of occurrences of the intelligence character
{'i': 2, 'n': 2, 't': 1, 'e': 3, 'l': 2, 'g': 1, 'c': 1}
{'i': '00', 't': '010', 'c': '0110', 'g': '0111', 'e': '10', 'l': '110', 'n': '111'}
**···that? ** ** ** The output result is different. ** ** Since the method of generating branches and the method of distributing branches to the left and right are the contents that explain only the logic ignoring the implementation, this does not result as the teaching material. The following points should be noted.
--When arranging characters by the number of occurrences, it is necessary to arrange them in a stable sort. --Select two items that appear less frequently, but select items that have the same number of alphabets as much as possible. (Priority is given to the same number of connected characters in the first alphabet as the second one) --As for how to specify the left and right of the branch, the one with the smaller number of connected alphabets is assigned to the left, the one that is not assigned to the right is assigned to the right, and if not, the one acquired first is assigned to the right and the second one. The acquired one is on the left.
Therefore, implement according to this condition.
class Node:
def __init__(self, key_char=None, count=None, left=None, right=None):
self.key_char = key_char
self.left = left
self.right = right
self.count = count
nodes = []
encode_dict = {}
for k, v in dict_target_string.items():
node_obj = Node(key_char = k, count = v)
nodes.append(node_obj)
trial_count = 0
while len(nodes) >= 2:
#Sort nodes in descending order
nodes = sorted(nodes, key=lambda x: x.count, reverse=True)
#Number of branch creations
trial_count += 1
#Extract the nodes with the smallest number of occurrences
first = nodes.pop()
#Extract the node with the second smallest number of occurrences
second = nodes.pop()
#temp_nodes
temp_nodes = []
#Determine if the first and second alphabetic combinations are the same
if (len(first.key_char) != len(second.key_char)):
print('The number of alphabet combinations of first and second is different')
#When there is still one or more nodes, get the third and subsequent nodes
while len(nodes) >= 1:
overthird = nodes.pop()
if (second.count == overthird.count):
print('The number of occurrences of second and over third match')
if (len(first.key_char) == len(overthird.key_char)):
print('The number of alphabetic combinations of first and overthird match')
nodes.append(second)
second = overthird
break
else:
temp_nodes.append(overthird)
else:
nodes.append(overthird)
break
#Return what was popped
nodes += temp_nodes
#Add the number of occurrences of first and second to nodes
count = first.count + second.count
print("Number of enforcement:", trial_count)
if len(first.key_char) < len(second.key_char):
append_node = Node(key_char = first.key_char + second.key_char, count = count, left = first, right = second)
else:
append_node = Node(key_char = second.key_char + first.key_char, count = count, left = second, right = first)
print(append_node.left.key_char, append_node.right.key_char, count)
nodes.insert(0, append_node)
print("---------------")
~~ It became a uselessly redundant process. To fucking code at once ~~
Number of enforcement: 1
g c 2
---------------
Number of enforcement: 2
l t 3
---------------
Number of enforcement: 3
i n 4
---------------
The number of alphabet combinations of first and second is different
The number of occurrences of second and over third match
The number of alphabetic combinations of first and overthird match
Number of enforcement: 4
lt gc 5
---------------
The number of alphabet combinations of first and second is different
Number of enforcement: 5
e in 7
---------------
The number of alphabet combinations of first and second is different
Number of enforcement: 6
ein ltgc 12
---------------
Result of executing step 3 again
{'e': '00', 'i': '010', 'n': '011', 'l': '100', 't': '101', 'g': '110', 'c': '111'}
It was in agreement with the explanation of the teaching materials. By the way, if you compare the compression ratio,
Intelligence compression ratio comparison
Size before compression
12 * 3 = 36 bits
Bit string after compression
010011101001001000101100001111100
Compressed size
33 bits
https://gist.github.com/ereyester/68ef4653890f0823e81e99dc00d73a07
Recommended Posts