pickle is Python's own data serialization format and a very powerful mechanism, but for historical reasons the behavior behind it is neither flexible nor simple. This article summarizes how classes other than the main builtin ones (minor builtin classes, standard and non-standard library classes, user-defined classes, and so on) are pickled and unpickled, and how to write user-defined classes that pickle efficiently.
The discussion here is based on Python 3.3.5 and pickle protocol version 3. Protocol version 4 was introduced in Python 3.4, but its internal processing is more complicated, so I think it is more efficient to understand the Python 3.3 code first.
You can understand most of it by following the methods below.
Lib/pickle.py
class _Pickler:
    def save(self, obj, save_persistent_id=True):
        ...
    def save_reduce(self, func, args, state=None,
                    listitems=None, dictitems=None, obj=None):
        ...
    def save_global(self, obj, name=None, pack=struct.pack):
        ...
Objects/typeobject.c
static PyObject *
reduce_2(PyObject *obj)
{
    ...
}
static PyObject *
object_reduce_ex(PyObject *self, PyObject *args)
{
    ...
}
When pickle.dump, pickle.dumps, etc. are called, everything is pickled by the following processing.
sample1.py
pickler = pickle.Pickler(fileobj, protocol)
pickler.dump(obj)
The Pickler class is either _pickle.Pickler or pickle._Pickler. Their definitions are in the following places.
static PyTypeObject Pickler_Type; defined in Modules/_pickle.c
class _Pickler defined in Lib/pickle.py
Normally the C implementation is preferred, but if importing it fails, the Python implementation is used.
Since the main purpose here is to understand the mechanism, we focus on the Python implementation.
Individual objects are pickled recursively by pickler.save(obj).
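As a rough illustration of the two implementations, the sketch below pickles the same object once through the default entry point and once through the pure-Python class. It assumes that pickle._Pickler is importable under that name (it is a private name, so this is for study only), and the sample dictionary is made up for the example.

```python
import io
import pickle

obj = {'spam': [1, 2, 3], 'egg': ('a', 'b')}

# pickle.dumps uses pickle.Pickler, which is the C implementation when available.
default_data = pickle.dumps(obj, protocol=3)

# pickle._Pickler is the pure-Python class; the leading underscore marks it private.
with io.BytesIO() as fp:
    pickle._Pickler(fp, protocol=3).dump(obj)
    python_data = fp.getvalue()

# Both streams restore an equal object.
print(pickle.loads(default_data) == pickle.loads(python_data) == obj)
```

Using the pure-Python class this way is handy when you want to step through save() with a debugger.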
First, in the first half of this function, objects that have already been seen (circular references, or objects referenced from several places) are pickled as references back to the earlier occurrence instead of being pickled again.
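The effect of this step can be observed from the outside. The following is a small sketch with made-up sample data: a list referenced twice, and a list that contains itself.

```python
import pickle

shared = ['spam']
container = [shared, shared]      # the same list referenced twice
cycle = []
cycle.append(cycle)               # a list that contains itself

container2 = pickle.loads(pickle.dumps(container, protocol=3))
cycle2 = pickle.loads(pickle.dumps(cycle, protocol=3))

print(container2[0] is container2[1])  # the shared reference is still shared
print(cycle2[0] is cycle2)             # the cycle is still a cycle
```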
Since the builtin classes and constants below are used so often, pickle implements its own efficient processing for them.
For that reason they do not follow the explanation in this article, and they are omitted here.
int, float, str, bytes, list, tuple, dict, bool, None
Other classes are pickled by the procedure shown below.
When the object to pickle is a class object (that is, isinstance(obj, type) == True) or a function, obj.__module__ and obj.__name__ are recorded as strings.
On unpickling, the required module is imported and the value that this name refers to is returned.
That is, only classes and functions defined in a module's global namespace can be pickled.
Of course, the logic of the function or class is not stored; Python is not LISP.
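A minimal sketch of this behavior, using fractions.Fraction only as a convenient standard-library class: the stream contains just the names, unpickling returns the very same class object, and a lambda (which has no importable name) cannot be pickled.

```python
import pickle
from fractions import Fraction

data = pickle.dumps(Fraction, protocol=3)
print(data)                            # module and class names appear as text
print(pickle.loads(data) is Fraction)  # the existing class object is returned

lambda_failed = False
try:
    pickle.dumps(lambda x: x + 1, protocol=3)
except (pickle.PicklingError, AttributeError) as exc:
    # The exact exception type and message depend on the Python version.
    lambda_failed = True
    print('cannot pickle a lambda:', exc)
```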
Next, the global dictionary defined in the copyreg module is checked for an entry copyreg.dispatch_table[type(obj)].
sample02.py
import copyreg
if type(obj) in copyreg.dispatch_table:
    reduce = copyreg.dispatch_table[type(obj)]
    rv = reduce(obj)
The contents of the return value rv are described later.
In this way, a function registered in copyreg.dispatch_table has the highest priority and is used for pickling.
Therefore, the pickle / unpickle behavior can be changed even for classes whose definition you cannot change. As an extreme case, you can make a datetime object come back as a regular expression object when it is pickled and unpickled.
sample03.py
import pickle
import copyreg
import datetime
import re

def reduce_datetime_to_regexp(x):
    return re.compile, (r'[spam]+',)

copyreg.pickle(datetime.datetime, reduce_datetime_to_regexp)
a = datetime.datetime.now()
b = pickle.loads(pickle.dumps(a))
print(a, b)  # output like: 2014-10-05 10:24:12.177959 re.compile('[spam]+')
Entries are added to the dispatch_table dictionary via copyreg.pickle(type, func).
If the pickler has its own pickler.dispatch_table dictionary, it is used instead of copyreg.dispatch_table.
This is safer when you want to change the behavior only when pickling for a specific purpose.
sample03a.py
import pickle
import copyreg
import datetime
import re
import io

def reduce_datetime_to_regexp(x):
    return re.compile, (r'[spam]+',)

a = datetime.datetime.now()
with io.BytesIO() as fp:
    pickler = pickle.Pickler(fp)
    # Start from a copy so that the global table is left untouched.
    pickler.dispatch_table = copyreg.dispatch_table.copy()
    pickler.dispatch_table[datetime.datetime] = reduce_datetime_to_regexp
    pickler.dump(a)
    # Read the buffer before the with block closes it.
    b = pickle.loads(fp.getvalue())
print(a, b)  # output like: 2014-10-05 10:24:12.177959 re.compile('[spam]+')
If the method obj.__reduce_ex__ is defined,
sample03.py
rv = obj.__reduce_ex__(protocol_version)
is called.
The contents of the return value rv are described later.
If the method obj.__reduce__ is defined,
sample03.py
rv = obj.__reduce__()
is called.
The contents of the return value rv are described later.
There seems to be no need for __reduce__ at present. You should always use __reduce_ex__.
It is looked up first, so it is slightly faster.
If you do not need the protocol argument, you can simply ignore it.
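The lookup order can be checked with a small experiment. The sketch below uses a hypothetical Point class that defines both methods and records which one the pickler actually calls; the class and the calls list exist only for this illustration.

```python
import pickle

calls = []

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __reduce_ex__(self, protocol):
        calls.append(('__reduce_ex__', protocol))
        return Point, (self.x, self.y)

    def __reduce__(self):
        calls.append(('__reduce__', None))
        return Point, (self.x, self.y)

p = pickle.loads(pickle.dumps(Point(1, 2), protocol=3))
print(calls)      # only __reduce_ex__ is recorded, with the protocol number
print(p.x, p.y)
```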
If no special method is written for pickle / unpickle, the standard reduce processing of object is performed as a last resort. This is, so to speak, "the most general, greatest-common-divisor implementation of __reduce_ex__ that can be used as is for most objects". It is very helpful, but unfortunately it is implemented in C and hard to follow.
Leaving out error handling and similar details, the general flow reimplemented in Python looks like this.
object_reduce_ex.py
class object:
    def __reduce_ex__(self, proto):
        from copyreg import __newobj__
        # Extra arguments for cls.__new__ (the class itself is added below).
        if hasattr(self, '__getnewargs__'):
            args = self.__getnewargs__()
        else:
            args = ()
        # The state: custom value, (dict, slots) pair, or plain __dict__.
        if hasattr(self, '__getstate__'):
            state = self.__getstate__()
        elif hasattr(type(self), '__slots__'):
            state = (getattr(self, '__dict__', None),
                     {k: getattr(self, k) for k in type(self).__slots__
                      if hasattr(self, k)})
        else:
            state = self.__dict__
        # Subclasses of list and dict also have their items saved.
        if isinstance(self, list):
            listitems = self
        else:
            listitems = None
        if isinstance(self, dict):
            dictitems = self.items()
        else:
            dictitems = None
        return __newobj__, (type(self),) + args, state, listitems, dictitems
As you can see above, even if you rely on object.__reduce_ex__, you can adjust the behavior in detail by defining the methods __getnewargs__ and __getstate__. If you define __reduce_ex__ or __reduce__ yourself, these methods are not used unless you call them explicitly.
__getnewargs__
A method that returns a tuple of picklable values.
Once it is defined, the arguments passed to __new__ (not __init__) on unpickling can be customized.
The tuple does not include the first argument (the class object).
__getstate__
If this is defined, you can customize the argument passed to __setstate__ on unpickling or, when __setstate__ does not exist, the initial values of __dict__ and of the slots.
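To see how these two methods feed into the default processing, you can call __reduce_ex__ directly and look at the tuple. The classes Temperature and Pair below are hypothetical examples written for this sketch, not part of the original code.

```python
import copyreg

class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius
        self.cache = {'fahrenheit': celsius * 9 / 5 + 32}

    def __getstate__(self):
        # Only the source value is saved; the cache is left out.
        return {'celsius': self.celsius}

class Pair(tuple):
    def __new__(cls, a, b):
        return super().__new__(cls, (a, b))

    def __getnewargs__(self):
        # Arguments for __new__ on unpickling, without the class itself.
        return self[0], self[1]

rv_state = Temperature(21.5).__reduce_ex__(3)
rv_args = Pair(1, 2).__reduce_ex__(3)
print(rv_state)   # third element is the dictionary from __getstate__
print(rv_args)    # second element is (Pair, 1, 2)
```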
Values that __reduce_ex__, __reduce__, and functions registered with copyreg should return
In the process above, the value rv that each function should return is one of the following.
type(rv) is str
type(obj).__module__ and rv are recorded as strings when pickling. On unpickling, the module is imported as needed and the object referred to by this name is returned.
This mechanism is useful when pickling a singleton object and in similar cases.
type(rv) is tuple
The tuple has between 2 and 5 elements, as follows.
func
-- A picklable, callable object (typically a class object) that creates the object when unpickling. The case func.__name__ == "__newobj__" is an exception and is described below.
args
-- A tuple of picklable elements. Used as the arguments when calling func.
state
-- An object for restoring the state of the object on unpickling. Optional. It may be None.
listitems
-- An iterable object that returns the elements of a list-like object. Optional. It may be None.
dictitems
-- An iterable object that returns the keys and values of a dict-like object. The values returned by the iterator must be key / value pairs. Typically dict_object.items(). Optional. It may be None.
When func.__name__ == "__newobj__"
In this case, args[0] is interpreted as a class object, and the object is created by calling its __new__ with args as the arguments. At this time, __init__ is not called.
If you need a func object that meets this condition, one is already defined in the copyreg module.
Lib/copyreg.py
def __newobj__(cls, *args):
    return cls.__new__(cls, *args)
This copyreg.__newobj__ is implemented so that it behaves the same way even when it is treated as an ordinary function, but it is not actually executed.
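That __init__ is skipped on this path is easy to confirm. The following sketch uses a hypothetical Greeter class that records every call to __init__; it relies on the default object.__reduce_ex__ described above.

```python
import pickle

init_calls = []

class Greeter:
    def __init__(self, name):
        init_calls.append(name)   # record every call to __init__
        self.name = name

original = Greeter('spam')
restored = pickle.loads(pickle.dumps(original, protocol=3))

print(init_calls)      # __init__ ran once, for `original` only
print(restored.name)   # the attribute came back from the saved __dict__
```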
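state
It is interpreted as follows.
If obj has a __setstate__ method, state is passed to it.
Otherwise, if state is a two-element tuple, state[0] is a dictionary giving the contents of obj.__dict__, and state[1] is a dictionary giving the values of the attributes listed in type(obj).__slots__. Both may be None.
Otherwise, state is a dictionary giving the contents of obj.__dict__.
You can understand most of the unpickling process by following the methods below.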
Lib/pickle.py
class _Unpickler:
    def load_newobj(self):
        ...
    def load_reduce(self):
        ...
    def load_build(self):
        ...
    def load_global(self):
        ...
When pickle.load, pickle.loads, etc. are called, everything is unpickled by the following processing.
sample1.py
unpickler = pickle.Unpickler(fileobj)
unpickler.load()
The Unpickler class is either _pickle.Unpickler or pickle._Unpickler. Their definitions are in the following places.
static PyTypeObject Unpickler_Type; defined in Modules/_pickle.c
class _Unpickler defined in Lib/pickle.py
The object is restored by calling unpickler.load_xxx() methods one after another, selected by the ID (called an opcode) of each element in the pickle data.
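The opcodes in a pickle can be listed with the standard pickletools module, which makes it easier to match the data against the load_xxx methods. This is a small sketch with a hypothetical Point class; the exact opcode sequence depends on the protocol, so protocol 3 is requested explicitly.

```python
import pickle
import pickletools

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

data = pickle.dumps(Point(1, 2), protocol=3)

# genops yields (opcode, argument, position) for each element of the stream.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(opcode_names)
```

In the printed list, GLOBAL, NEWOBJ, and BUILD correspond to load_global, load_newobj, and load_build. pickletools.dis(data) prints a more detailed, annotated listing.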
When a class or function is pickled, or when __reduce_ex__ returns a string, the module name and the variable name (in effect "modulename.varname") are recorded as is.
On unpickling, the module is imported if necessary and the corresponding value is returned.
No new object is created by the unpickler.
When an object was pickled using a tuple of up to 5 elements returned by __reduce_ex__ etc., it is unpickled by the following processing.
The methods load_newobj, load_reduce, and load_build correspond to this processing; rewritten as a simple flow, their outline is as follows.
sample09.py
def unpickle_something():
    func, args, state, listitems, dictitems = load_from_pickle_stream()
    # Create the object: __newobj__ means "call cls.__new__, skip __init__".
    if getattr(func, '__name__', None) == '__newobj__':
        obj = args[0].__new__(*args)
    else:
        obj = func(*args)
    if listitems is not None:
        for x in listitems:
            obj.append(x)
    if dictitems is not None:
        for k, v in dictitems:
            obj[k] = v
    # Restore the state, if there is one.
    if state is None:
        pass
    elif hasattr(obj, '__setstate__'):
        obj.__setstate__(state)
    elif type(state) is tuple and len(state) == 2:
        dict_state, slots_state = state
        for k, v in (dict_state or {}).items():
            obj.__dict__[k] = v
        for k, v in (slots_state or {}).items():
            setattr(obj, k, v)
    else:
        for k, v in state.items():
            obj.__dict__[k] = v
    return obj
Classes that satisfy the following conditions are handled appropriately without writing any pickle or unpickle processing.
- Everything in __dict__ can be pickled, and there is no problem if it is restored as is.
- Everything in __slots__ can be pickled, and there is no problem if it is restored as is.
- __new__ needs no arguments other than the class.
- Even though __init__ is not called, the object is consistent as long as its attributes are restored correctly.
sphere0.py
import pickle

class Sphere:
    def __init__(self, radius):
        self._radius = radius

    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume

def _main():
    sp1 = Sphere(3)
    print(sp1.volume)
    print(sp1.__reduce_ex__(3))
    sp2 = pickle.loads(pickle.dumps(sp1))
    print(sp2.volume)

if __name__ == '__main__':
    _main()
When the volume property of a Sphere object, which represents a sphere, is accessed, the result of the calculation is cached internally. If the object is pickled as is, the cached volume is saved along with it and the data size increases. I want to leave it out.
sphere1.py
class Sphere:
    def __init__(self, radius):
        self._radius = radius

    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume

    def __getstate__(self):
        return {'_radius': self._radius}
You can prevent the cache from being pickled by defining a __getstate__ method that returns the value that __dict__ should have after unpickling.
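A quick way to confirm the effect is to compare the attributes before and after a round trip. The sketch below repeats the Sphere class with __getstate__ so that it runs on its own; the variable restored_keys is introduced only for this check.

```python
import pickle
from math import pi

class Sphere:
    def __init__(self, radius):
        self._radius = radius

    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            self._volume = 4 / 3 * pi * self._radius ** 3
        return self._volume

    def __getstate__(self):
        return {'_radius': self._radius}

sp1 = Sphere(3)
sp1.volume                                    # fills the cache on sp1
sp2 = pickle.loads(pickle.dumps(sp1, protocol=3))
restored_keys = sorted(sp2.__dict__)          # inspect before touching volume

print(sorted(sp1.__dict__))   # ['_radius', '_volume']
print(restored_keys)          # ['_radius']
print(sp2.volume == sp1.volume)
```

The restored object recomputes the volume on first access, so nothing is lost by leaving the cache out.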
sphere2.py
class Sphere:
    __slots__ = ['_radius', '_volume']

    def __init__(self, radius):
        self._radius = radius

    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume

    def __getstate__(self):
        return None, {'_radius': self._radius}
If you define __slots__ to improve memory efficiency, __dict__ no longer exists, so the value returned by __getstate__ must change.
In this case it is a two-element tuple whose second element is a dictionary that initializes the attributes named in __slots__.
The first element (the initial value of __dict__) can be None.
sphere3.py
class Sphere:
    __slots__ = ['_radius', '_volume']

    def __init__(self, radius):
        self._radius = radius

    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume

    def __getstate__(self):
        return self._radius

    def __setstate__(self, state):
        self._radius = state
If the only value to pickle is the radius, __getstate__ can return the self._radius value itself instead of a dictionary.
In that case, also define the matching __setstate__.
When __new__ requires arguments
intliterals.py
import pickle

class IntLiterals(tuple):
    def __new__(cls, n):
        a = '0b{n:b} 0o{n:o} {n:d} 0x{n:X}'.format(n=n).split()
        return super().__new__(cls, a)

    def __getnewargs__(self):
        # Recover n from the first literal so that __new__ can rebuild the tuple.
        return int(self[0], 0),

def _main():
    a = IntLiterals(10)
    print(a)  # ('0b1010', '0o12', '10', '0xA')
    print(a.__reduce_ex__(3))
    b = pickle.loads(pickle.dumps(a))
    print(b)

if __name__ == '__main__':
    _main()
When __init__ must be called
closureholder.py
import pickle

class ClosureHolder:
    def __init__(self, value):
        def _get():
            return value
        self._get = _get

    def get(self):
        return self._get()

    def __reduce_ex__(self, proto):
        return type(self), (self.get(),)

def _main():
    a = ClosureHolder('spam')
    print(a.get())
    print(a.__reduce_ex__(3))
    b = pickle.loads(pickle.dumps(a))
    print(b.get())

if __name__ == '__main__':
    _main()
The value returned by get is held by the closure created in __init__, so the object cannot be created without calling __init__.
In such a case, object.__reduce_ex__ cannot be used, so implement __reduce_ex__ yourself.
singleton.py
class MySingleton(object):
    def __new__(cls, *args, **kwds):
        assert mysingleton is None, \
            'A singleton of MySingleton has already been created.'
        return super().__new__(cls, *args, **kwds)

    def __reduce_ex__(self, proto):
        # Pickle by name: the global variable that holds the instance.
        return 'mysingleton'

mysingleton = None
mysingleton = MySingleton()

def _main():
    import pickle
    a = pickle.dumps(mysingleton)
    b = pickle.loads(a)
    print(b)

if __name__ == '__main__':
    _main()
Suppose the MySingleton class has exactly one instance, held in the global variable mysingleton.
To unpickle it correctly, use the form in which __reduce_ex__ returns a string.