background

I want to exchange binary data between Python and C/C++ for machine learning and ray tracing, and I want to get by with only Python's standard library on the Python side. For text there are JSON and numpy's text format (CSV), but binary formats are not easy to handle on the C++ side.

So let's consider pickle serialization.

https://docs.python.org/ja/3/library/pickle.html

It seems that endianness is also taken into account.

information

I could not find a site that briefly explains the pickle serialization format itself, not even in English. (Once you understand it, the format is not that complicated, so perhaps nobody felt it needed explaining ...)

Thankfully, however, the PyTorch JIT has its own C++ pickle loader (serialization support for TorchScript, a Python-like scripting language), and its code is a helpful reference.

https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/docs/serialization.md

https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/serialization/pickler.h

You can also analyze pickled data with Python's pickletools module.

https://docs.python.org/ja/3.6/library/pickletools.html
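Besides the command-line form, pickletools can also be driven from Python code. The following is a minimal sketch that lists the opcodes of a small pickle with `pickletools.genops`; the sample value `[1, 2]` and the variable names are only chosen for illustration.

```python
import pickle
import pickletools

# Pickle a small list with protocol 2 (the protocol discussed in this note).
data = pickle.dumps([1, 2], protocol=2)

# genops yields (opcode, argument, byte offset) for each opcode in the stream.
for opcode, arg, pos in pickletools.genops(data):
    print(pos, opcode.name, arg)

# Collect just the opcode names, e.g. to compare two pickles structurally.
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
```

`pickletools.dis(data)` prints the same information in the formatted layout that `python -m pickletools` produces.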

format

Protocol version

Pickle has several protocol versions. In Python 3 the default was 3 at the time of writing (Python 3.8 and later default to 4), but data serialized with protocol 3 or higher cannot be read by Python 2.

If you mainly handle numerical data and nothing exotic, protocol 2 is probably the one to use. (TorchScript only supports protocol 2.)

The header is 2 bytes: 0x80 (the PROTO opcode, 1 byte) followed by the version number (1 byte).
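The header layout can be checked directly from Python. This is a small sketch, assuming only the standard `pickle` module; the variable names `header_v2` and `first_byte_v0` are made up for this example.

```python
import pickle

# For every protocol that has a header, print the first two bytes.
for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(1, protocol=proto)
    # Byte 0 is the PROTO opcode (0x80), byte 1 is the protocol number.
    print(proto, data[:2].hex())

header_v2 = pickle.dumps(1, protocol=2)[:2]

# Protocols 0 and 1 predate the PROTO opcode, so their output has no header.
first_byte_v0 = pickle.dumps(1, protocol=0)[:1]
```

A C++ loader can therefore read the first byte, and if it is 0x80, treat the second byte as the protocol number before entering the opcode loop.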

Let's try serializing 1.

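import io
import pickle

a = 1

# Serialize into an in-memory buffer.
# protocol=3 reproduces the dump below (newer Pythons default to a higher protocol).
f = io.BytesIO()
pickle.dump(a, f, protocol=3)

# Write the pickled bytes to a file; the with-block closes it.
with open("bora.p", "wb") as w:
    w.write(f.getbuffer())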

$ od -tx1c bora.p
0000000  80  03  4b  01  2e
        200 003   K 001   .
0000005

'K' (0x4b) is BININT1, followed by its 1-byte value 0x01. '.' (0x2e) is STOP, which marks the end of the data.

Looking at unpickler.cpp in the PyTorch JIT:

    case PickleOpCode::BININT1: {
      uint8_t value = read<uint8_t>();
      stack_.emplace_back(int64_t(value));
    } break;

You can see that BININT1 carries an integer value that fits in 1 unsigned byte (0 to 255).

Next, try array (list) data.
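To see where BININT1 stops being sufficient, here is a small sketch that pickles a few integers with protocol 2 and strips the 2-byte header and the trailing STOP. The helper `body` is a hypothetical name for this example. It also shows that multi-byte integers are written in little-endian order regardless of the host.

```python
import pickle

def body(value):
    """Return the pickle bytes between the 2-byte header and the STOP opcode."""
    return pickle.dumps(value, protocol=2)[2:-1]

print(body(1))      # BININT1: 'K' + 1 unsigned byte
print(body(255))    # still BININT1
print(body(256))    # BININT2: 'M' + 2 bytes, little-endian
print(body(65536))  # BININT:  'J' + 4 bytes, signed little-endian
print(body(-1))     # negative values also use BININT
```

A loader that only targets small numeric data still needs at least these three integer opcodes, since the pickler picks the shortest encoding per value.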

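import io
import pickle

a = [1, 2]

# Serialize into an in-memory buffer using protocol 2.
f = io.BytesIO()
pickle.dump(a, f, protocol=2)

# Write the pickled bytes to a file; the with-block closes it.
with open("bora.p", "wb") as w:
    w.write(f.getbuffer())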

Now let's dump it with pickletools.

$ python -m pickletools bora.p 
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: e        APPENDS    (MARK at 5)
   11: .    STOP
highest protocol among opcodes = 2

Basically, the format is a sequence of opcode prefix + actual data, so from here you can experiment with various inputs and analyze the results yourself, referring to pickler.cpp and unpickler.cpp in the PyTorch JIT and to Python's pickletools.py.
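The "prefix + actual data" structure maps naturally onto a small stack machine. The following is a minimal sketch, not a real unpickler: `tiny_loads` is a hypothetical function that handles only the opcodes appearing in the two dumps above (PROTO, EMPTY_LIST, BINPUT, MARK, BININT1, APPENDS, STOP) and rejects everything else.

```python
import io
import pickle

def tiny_loads(data: bytes):
    """Decode a pickle that uses only the handful of opcodes seen above."""
    f = io.BytesIO(data)
    stack = []       # working stack of decoded objects
    memo = {}        # objects stored by BINPUT, indexed by a 1-byte key
    MARK = object()  # sentinel pushed by the MARK opcode
    while True:
        op = f.read(1)
        if op == b"\x80":    # PROTO: skip the 1-byte protocol version
            f.read(1)
        elif op == b"]":     # EMPTY_LIST: push a new list
            stack.append([])
        elif op == b"q":     # BINPUT: remember the top of the stack
            memo[f.read(1)[0]] = stack[-1]
        elif op == b"(":     # MARK: push the sentinel
            stack.append(MARK)
        elif op == b"K":     # BININT1: push a 1-byte unsigned integer
            stack.append(f.read(1)[0])
        elif op == b"e":     # APPENDS: extend the list below MARK
            items = []
            while stack[-1] is not MARK:
                items.append(stack.pop())
            stack.pop()      # drop the MARK sentinel
            stack[-1].extend(reversed(items))
        elif op == b".":     # STOP: the top of the stack is the result
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode: {op!r}")

print(tiny_loads(pickle.dumps([1, 2], protocol=2)))
```

The same loop shape (read one opcode byte, read its fixed or length-prefixed argument, update the stack) carries over directly to a C++ loader; the memo is unused here but is needed once BINGET appears, as in the numpy dump.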

numpy array

Let's serialize the numpy array (ndarray).

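import io
import pickle

import numpy

a = numpy.array([1.0, 2.2, 3.3, 4, 5, 6, 7, 8, 9, 10], dtype=numpy.float32)

# Serialize the ndarray into an in-memory buffer using protocol 2.
f = io.BytesIO()
pickle.dump(a, f, protocol=2)

with open("bora.p", "wb") as w:
    w.write(f.getbuffer())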

    0: \x80 PROTO      2
    2: c    GLOBAL     'numpy.core.multiarray _reconstruct'
   38: q    BINPUT     0
   40: c    GLOBAL     'numpy ndarray'
   55: q    BINPUT     1
   57: K    BININT1    0
   59: \x85 TUPLE1
   60: q    BINPUT     2
   62: c    GLOBAL     '_codecs encode'
   78: q    BINPUT     3
   80: X    BINUNICODE 'b'
   86: q    BINPUT     4
   88: X    BINUNICODE 'latin1'
   99: q    BINPUT     5
  101: \x86 TUPLE2
  102: q    BINPUT     6
  104: R    REDUCE
  105: q    BINPUT     7
  107: \x87 TUPLE3
  108: q    BINPUT     8
  110: R    REDUCE
  111: q    BINPUT     9
  113: (    MARK
  114: K        BININT1    1
  116: K        BININT1    10
  118: \x85     TUPLE1
  119: q        BINPUT     10
  121: c        GLOBAL     'numpy dtype'
  134: q        BINPUT     11
  136: X        BINUNICODE 'f4'
  143: q        BINPUT     12
  145: K        BININT1    0
  147: K        BININT1    1
  149: \x87     TUPLE3
  150: q        BINPUT     13
  152: R        REDUCE
  153: q        BINPUT     14
  155: (        MARK
  156: K            BININT1    3
  158: X            BINUNICODE '<'
  164: q            BINPUT     15
  166: N            NONE
  167: N            NONE
  168: N            NONE
  169: J            BININT     -1
  174: J            BININT     -1
  179: K            BININT1    0
  181: t            TUPLE      (MARK at 155)
  182: q        BINPUT     16
  184: b        BUILD
  185: \x89     NEWFALSE
  186: h        BINGET     3
  188: X        BINUNICODE '\x00\x00\x80?ÍÌ\x0c@33S@\x00\x00\x80@\x00\x00\xa0@\x00\x00À@\x00\x00à@\x00\x00\x00A\x00\x00\x10A\x00\x00 A'
  240: q        BINPUT     17
  242: h        BINGET     5
  244: \x86     TUPLE2
  245: q        BINPUT     18
  247: R        REDUCE
  248: q        BINPUT     19
  250: t        TUPLE      (MARK at 113)
  251: q    BINPUT     20
  253: b    BUILD
  254: .    STOP
highest protocol among opcodes = 2

You can see that the array data is stored as a byte string in the large BINUNICODE entry: under protocol 2 the raw bytes are written as a latin1 string and turned back into bytes by the `_codecs encode` call. By reading the numpy source code, it looks feasible to load pickled numpy arrays and PyTorch tensors (which presumably have a structure similar to numpy's) with your own C++ loader! (numpy's native NPY/NPZ format is fairly concise; for example, cnpy can read and write it: https://github.com/rogersce/cnpy)
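The decoding step for the array payload can be sketched with the standard library alone. This example copies only the first and last 4-character groups from the BINUNICODE payload in the dump above and interprets them as little-endian float32 values, matching the 'f4' and '<' entries in the dtype state. One point a C++ loader needs to watch, stated here as my reading of the format: BINUNICODE payloads are UTF-8 in the file, so characters at or above 0x80 occupy 2 bytes on disk and must be mapped back to single bytes.

```python
import struct

# First and last 4-character groups copied from the BINUNICODE payload above.
text = "\x00\x00\x80?" + "\x00\x00 A"

# latin1 maps each character U+0000..U+00FF to exactly one byte,
# which is what the pickled call to _codecs.encode(text, 'latin1') does.
raw = text.encode("latin1")

# '<' = little-endian, 'f' = 4-byte float (numpy dtype 'f4').
values = struct.unpack("<2f", raw)
print(values)

# In the pickle file the string is UTF-8 encoded, so it is longer than raw.
on_disk = text.encode("utf-8")
print(len(raw), len(on_disk))
```

This matches the offsets in the dump: the payload occupies 47 bytes in the file for 40 bytes of float32 data, because 7 of the characters are at or above 0x80.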

TODO

[] By the way, I would like to implement a Python-like interpreter on my own.
[] I would like to embark on a journey to establish a scheme in which excellent young Python people, by mastering the pickle format, can be sublimated into young pickle data loader people wielding the fastest and most excellent C++ in human history.
