I want to exchange binary data between Python and C/C++ for machine learning and ray tracing, using only Python's standard library on the Python side. For text there are JSON and NumPy's text format (CSV), but binary formats are not easy to handle on the C++ side.
Consider Pickle serialization.
https://docs.python.org/ja/3/library/pickle.html
It seems that endianness is also taken into account.
I could not find a site, even in English, that briefly explains the Pickle serialization format itself. (Once you understand it, the format is not that complicated, so maybe nobody felt it needed explaining ...)
Thankfully, however, PyTorch JIT has its own C++ Pickle loader to support serialization for TorchScript (a Python-like scripting language), and its code is a helpful reference.
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/docs/serialization.md
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/serialization/pickler.h
You can also analyze the data with Python's pickletools module.
https://docs.python.org/ja/3.6/library/pickletools.html
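As a quick illustration, pickletools can also be called from Python on bytes held in memory, without going through a file. The following is a minimal sketch; the variable names `payload` and `listing` are just for this example.

```python
import io
import pickle
import pickletools

# Serialize in memory; no file is needed for a quick look.
payload = pickle.dumps(1, protocol=2)

# dis() prints to stdout by default; here the listing is collected in a string buffer.
listing = io.StringIO()
pickletools.dis(payload, out=listing)
print(listing.getvalue())
```

The listing has one line per opcode (byte offset, opcode character, opcode name, argument), which is the same format as the dumps shown later in this post.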
Pickle has several protocol versions. Python 3.0 through 3.7 default to protocol 3 (Python 3.8 and later default to protocol 4), but data serialized with protocol 3 cannot be read by Python 2.
If you mainly handle numerical data and nothing too exotic, protocol 2 is probably a good choice. (TorchScript only supports protocol 2.)
The header is 2 bytes: 0x80 (the PROTO opcode, 1 byte) followed by the version number (1 byte).
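The header can be checked directly from Python. This is only a sketch: `read_header` is a hypothetical helper written for this post, not part of the pickle module.

```python
import pickle

def read_header(payload: bytes) -> int:
    """Return the protocol version stored in the 2-byte pickle header."""
    # Protocols 0 and 1 do not start with a PROTO opcode, so reject them here.
    if len(payload) < 2 or payload[0] != 0x80:
        raise ValueError("payload does not start with the PROTO opcode (0x80)")
    return payload[1]

# Dump the same object with two protocols and read the version back.
for proto in (2, 3):
    print(proto, read_header(pickle.dumps([1, 2], protocol=proto)))
```

A C++ loader would do the same check first and refuse anything other than the protocol it supports.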
Let's try serializing 1.
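import pickle
import io

a = 1
f = io.BytesIO()
# dump() returns None; the pickled bytes go into f.
# No protocol is given, so the default is used (3 in the output below).
pickle.dump(a, f)
# Write the in-memory buffer to a file and close it when done.
with open("bora.p", "wb") as w:
    w.write(f.getbuffer())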
$ od -tx1c bora.p
0000000 80 03 4b 01 2e
200 003 K 001 .
0000005
'K' is BININT1 and '.' (0x2e) is STOP, which marks the end of the data.
Looking at unpickler.cpp in PyTorch JIT,
case PickleOpCode::BININT1: {
uint8_t value = read<uint8_t>();
stack_.emplace_back(int64_t(value));
} break;
You can see that BININT1 holds an integer value that fits in 1 byte: it is read as a uint8_t and pushed onto the stack as an int64_t.
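To make the operand sizes concrete, here is a rough Python sketch that decodes a protocol 2 pickle containing a single small integer. `decode_small_int` is a hypothetical helper, and it covers only BININT1 plus its two siblings BININT2 ('M') and BININT ('J'); larger integers use other opcodes that are not handled here. The multi-byte operands are little-endian regardless of the host machine.

```python
import pickle
import struct

def decode_small_int(payload: bytes) -> int:
    """Decode a protocol-2 pickle holding one integer (BININT1/BININT2/BININT only)."""
    if payload[0:1] != b"\x80":            # PROTO opcode, followed by the version byte
        raise ValueError("missing PROTO header")
    op = payload[2:3]
    if op == b"K":                         # BININT1: 1-byte unsigned
        value, end = payload[3], 4
    elif op == b"M":                       # BININT2: 2-byte unsigned, little-endian
        value, end = struct.unpack_from("<H", payload, 3)[0], 5
    elif op == b"J":                       # BININT: 4-byte signed, little-endian
        value, end = struct.unpack_from("<i", payload, 3)[0], 7
    else:
        raise ValueError(f"unsupported opcode {op!r}")
    if payload[end:end + 1] != b".":       # STOP must follow the operand
        raise ValueError("missing STOP")
    return value

for n in (1, 300, -1, 70000):
    print(n, decode_small_int(pickle.dumps(n, protocol=2)))
```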
Try array data.
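import pickle
import io

a = [1, 2]
f = io.BytesIO()
# dump() returns None; the pickled bytes go into f.
pickle.dump(a, f, protocol=2)
# Write the in-memory buffer to a file and close it when done.
with open("bora.p", "wb") as w:
    w.write(f.getbuffer())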
Now let's dump it with pickletools.
$ python -m pickletools bora.p
0: \x80 PROTO 2
2: ] EMPTY_LIST
3: q BINPUT 0
5: ( MARK
6: K BININT1 1
8: K BININT1 2
10: e APPENDS (MARK at 5)
11: . STOP
highest protocol among opcodes = 2
Basically, the format is a sequence of prefix (opcode) + actual data, so from here you can try various things and analyze the format yourself, referring to pickler.cpp and unpickler.cpp in PyTorch JIT and to Python's pickletools.py!
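The dump above can be replayed by a very small stack machine. The following is a minimal sketch, assuming only the opcodes that appear in that dump plus APPEND ('a'), which pickle emits for single-element lists; `load_int_list` is a hypothetical name, and integers above 255 or any other type are rejected.

```python
import pickle

def load_int_list(payload: bytes):
    """Replay a protocol-2 pickle made of (nested) lists of integers 0..255."""
    stack, memo, marks = [], {}, []
    pos = 0
    while True:
        op = payload[pos:pos + 1]
        pos += 1
        if op == b"\x80":                  # PROTO: skip the version byte
            pos += 1
        elif op == b"]":                   # EMPTY_LIST: push a new list
            stack.append([])
        elif op == b"q":                   # BINPUT: remember stack top under a 1-byte index
            memo[payload[pos]] = stack[-1]
            pos += 1
        elif op == b"(":                   # MARK: remember the current stack depth
            marks.append(len(stack))
        elif op == b"K":                   # BININT1: push a 1-byte unsigned integer
            stack.append(payload[pos])
            pos += 1
        elif op == b"a":                   # APPEND: pop one item, append it to the list below
            item = stack.pop()
            stack[-1].append(item)
        elif op == b"e":                   # APPENDS: extend the list with everything above the mark
            start = marks.pop()
            items = stack[start:]
            del stack[start:]
            stack[-1].extend(items)
        elif op == b".":                   # STOP: the result is on top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode {op!r} at offset {pos - 1}")

print(load_int_list(pickle.dumps([1, 2], protocol=2)))
```

The memo filled by BINPUT is never read back in this sketch, but a real loader needs it for BINGET, which shows up as soon as an object is referenced twice.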
numpy array
Let's serialize the numpy array (ndarray).
import io
import pickle
import numpy

a = numpy.array([1.0, 2.2, 3.3, 4, 5, 6, 7, 8, 9, 10], dtype=numpy.float32)
f = io.BytesIO()
pickle.dump(a, f, protocol=2)  # dump() returns None
with open("bora.p", "wb") as w:
    w.write(f.getbuffer())
0: \x80 PROTO 2
2: c GLOBAL 'numpy.core.multiarray _reconstruct'
38: q BINPUT 0
40: c GLOBAL 'numpy ndarray'
55: q BINPUT 1
57: K BININT1 0
59: \x85 TUPLE1
60: q BINPUT 2
62: c GLOBAL '_codecs encode'
78: q BINPUT 3
80: X BINUNICODE 'b'
86: q BINPUT 4
88: X BINUNICODE 'latin1'
99: q BINPUT 5
101: \x86 TUPLE2
102: q BINPUT 6
104: R REDUCE
105: q BINPUT 7
107: \x87 TUPLE3
108: q BINPUT 8
110: R REDUCE
111: q BINPUT 9
113: ( MARK
114: K BININT1 1
116: K BININT1 10
118: \x85 TUPLE1
119: q BINPUT 10
121: c GLOBAL 'numpy dtype'
134: q BINPUT 11
136: X BINUNICODE 'f4'
143: q BINPUT 12
145: K BININT1 0
147: K BININT1 1
149: \x87 TUPLE3
150: q BINPUT 13
152: R REDUCE
153: q BINPUT 14
155: ( MARK
156: K BININT1 3
158: X BINUNICODE '<'
164: q BINPUT 15
166: N NONE
167: N NONE
168: N NONE
169: J BININT -1
174: J BININT -1
179: K BININT1 0
181: t TUPLE (MARK at 155)
182: q BINPUT 16
184: b BUILD
185: \x89 NEWFALSE
186: h BINGET 3
188: X BINUNICODE '\x00\x00\x80?ÍÌ\x0c@33S@\x00\x00\x80@\x00\x00\xa0@\x00\x00À@\x00\x00à@\x00\x00\x00A\x00\x00\x10A\x00\x00 A'
240: q BINPUT 17
242: h BINGET 5
244: \x86 TUPLE2
245: q BINPUT 18
247: R REDUCE
248: q BINPUT 19
250: t TUPLE (MARK at 113)
251: q BINPUT 20
253: b BUILD
254: . STOP
highest protocol among opcodes = 2
You can see that the array data is stored as a byte string in the BINUNICODE entry, which is then passed to _codecs.encode with 'latin1'. By reading NumPy's source code, it should be possible to load the pickled form of a NumPy array, and of a PyTorch tensor (which presumably has a similar structure), with your own C++ loader! (NumPy's native NPY/NPZ format is somewhat simpler; for example, cnpy can read and write it: https://github.com/rogersce/cnpy)
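The round trip through BINUNICODE and `_codecs encode` with 'latin1' can be reproduced with the standard library alone. This is a sketch of the idea rather than NumPy's actual code path: the three float32 values stand in for the array buffer, and the variable names are made up for this example.

```python
import struct

# 12 bytes of little-endian float32, standing in for the ndarray buffer.
raw = struct.pack("<3f", 1.0, 2.2, 3.3)

# What ends up in the pickle: the bytes reinterpreted as Latin-1 text,
# which BINUNICODE then writes out as UTF-8.
stored = raw.decode("latin1").encode("utf-8")

# What a loader has to do: decode the UTF-8 text, then encode it as Latin-1
# (this is the _codecs.encode(text, 'latin1') call seen in the dump).
recovered = stored.decode("utf-8").encode("latin1")

values = struct.unpack("<3f", recovered)
print(len(raw), len(stored), values)
```

Note that bytes of 0x80 and above take 2 bytes in UTF-8, so the BINUNICODE payload is longer than the raw buffer and a C++ loader cannot simply memcpy it; it has to undo the UTF-8 encoding first.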
TODO