Since I started writing a lot of analysis scripts in Python, I have often wanted to save the state of a learning process at any point so that it can be restored later. Once the code grows past a certain size, you end up writing a class that bundles the dataset, the parameter set, and the operations needed for learning, and running everything through instances of that class. At that point I always wonder whether I can just dump the whole class instance. Object serialization packages such as pickle fulfill this request: each object can be dumped and loaded.
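As a reminder of what that looks like, here is a minimal pickle round trip (the `Trainer` class is a hypothetical stand-in for such an analysis class):

```python
import pickle

class Trainer:
    """Hypothetical class bundling data and parameters for learning."""
    def __init__(self):
        self.params = {"lr": 0.01}
        self.history = [0.5, 0.3]

trainer = Trainer()
trainer.history.append(0.2)

# Dump the whole instance to a file ...
with open("trainer.pkl", "wb") as f:
    pickle.dump(trainer, f)

# ... and restore it later, state included.
with open("trainer.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.history)  # the appended value survives the round trip
```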
However, this gradually becomes unsatisfactory. What you want to do keeps growing more ambitious. At some point, standard serializers such as pickle are no longer enough, and you start wanting to manage the objects in a database. Finding a mechanism that makes this easy, however, is not straightforward.
Dump and restore alone you could build yourself, but once searching and sorting enter the picture, the dump-handling code grows larger than the original analysis code, which defeats the purpose.
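To see why, here is a sketch of the hand-rolled approach, assuming pickled blobs in a sqlite table. Dump and restore fit in a few lines, but searching or sorting by instance attributes would mean promoting every attribute into its own column, and that is where the bookkeeping balloons:

```python
import pickle
import sqlite3
from datetime import datetime

# Hand-rolled dump/restore layer: pickle each object into a BLOB column.
# Only "created" is queryable; any attribute you want to search or sort
# by would need its own column and its own serialization logic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE archive (id INTEGER PRIMARY KEY, created TEXT, blob BLOB)"
)

def dump(obj):
    conn.execute(
        "INSERT INTO archive (created, blob) VALUES (?, ?)",
        (datetime.now().isoformat(), pickle.dumps(obj)),
    )

def restore(row_id):
    row = conn.execute(
        "SELECT blob FROM archive WHERE id = ?", (row_id,)
    ).fetchone()
    return pickle.loads(row[0])

dump({"base": "hoge", "maxval": 10})
print(restore(1))  # {'base': 'hoge', 'maxval': 10}
```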
There was a time when I wanted to build a utility for this problem, but I hesitated when I realized it would require a fairly large framework, with a number of implementation issues to work through.
All things considered, it went far beyond the level of a single burst of effort (implemented in a little over a day and released in about a week), so although I wanted to do it, I couldn't afford to.
Recently, however, while finally studying MongoDB properly, I noticed that the problem above could be solved by combining it with an existing Python package.
When you go looking for the function you need, it usually turns out to already be provided. It seemed that all that was left was to write a lightweight wrapper, so I gave it a try (once things reach that point I have no excuse not to; I am just slow to get started ...).
The implemented package is named dbarchive and is already on github, so you can take a look.
A setup.py is also provided, so you can install it.
What kind of design would someone using an analysis class want? I want to push concerns like dump/restore out of the user's mind as much as possible. If possible, couldn't everything be solved just by inheriting a single superclass? For query searches, rather than inventing something original, I want to follow an existing mechanism (then there is no need to write documentation ...), so I settled on the following specifications.
Making it satisfy the above resulted in a package that can be used as follows. Below is sample code using dbarchive.
import numpy
import logging
from datetime import datetime

from dbarchive import Base


class Sample(Base):
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.bin = numpy.arange(maxval)
        self.created = datetime.now()


print('create sample instance')
sample01 = Sample(10)
sample01.save()
sample02 = Sample(3)
sample02.save()

for sample in Sample.objects.all():
    print('sample: ', type(sample))
    print('\tbase: ', sample.base)
    print('\tbin: ', sample.bin)
    print('\tcreated: ', sample.created)

sample01.bin = numpy.arange(20)
sample01.save()

for sample in Sample.objects.all():
    print('sample: ', type(sample))
    print('\tbase: ', sample.base)
    print('\tbin: ', sample.bin)
    print('\tcreated: ', sample.created)

print("all task completed")
Let's follow the source in detail. First, have the class you want to manage in the database inherit from the dbarchive.Base class.
class Sample(Base):
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.bin = numpy.arange(maxval)
        self.created = datetime.now()
Inheriting the dbarchive.Base class gives the class the utilities required for saving to the database. All you have to do is save the instance with the save method.
Note that the __init__ method of a class inheriting Base must be designed so that it can be called without arguments. This is because dbarchive needs to instantiate the class without arguments when automatically reconstructing an instance from the database, so this restriction is unavoidable.
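The mechanism this restriction enables can be sketched as follows: create a bare instance with no arguments, then overwrite its attributes with the stored values. This is a simplified illustration of the idea, not dbarchive's actual code:

```python
class Sample:
    # The default argument makes the constructor callable with no arguments.
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.maxval = maxval

# Pretend this record came back from the database.
stored = {"base": "fuga", "maxval": 3}

# Reconstruction: instantiate without arguments, then restore attributes.
sample = Sample()
for name, value in stored.items():
    setattr(sample, name, value)

print(sample.base, sample.maxval)  # fuga 3
```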
print('create sample instance')
sample01 = Sample(10)
sample01.save()
sample02 = Sample(3)
sample02.save()
When the save method is called, the class creates a corresponding table (collection) in the database and saves the instance into it.
Searches go through a handler called objects that the class provides. Issuing queries through objects basically follows django conventions, so if you are used to them you can use it without any friction.
for sample in Sample.objects.all():
    print('sample: ', type(sample))
    print('\tbase: ', sample.base)
    print('\tbin: ', sample.bin)
    print('\tcreated: ', sample.created)
The code above fetches and displays all the instances saved so far. For details on creating a query set with the objects handler, refer to the following documentation.
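To give a feel for the django-style interface, here is a toy in-memory manager with the same shape of API. This is purely illustrative; dbarchive's real objects handler is backed by MongoDB:

```python
class Manager:
    """Toy stand-in for a django-style 'objects' handler (in-memory)."""
    def __init__(self):
        self._rows = []

    def all(self):
        return list(self._rows)

    def filter(self, **kwargs):
        # django-style keyword filtering on exact attribute values
        return [r for r in self._rows
                if all(getattr(r, k) == v for k, v in kwargs.items())]

class Sample:
    objects = Manager()

    def __init__(self, base="hoge"):
        self.base = base

    def save(self):
        Sample.objects._rows.append(self)

Sample("hoge").save()
Sample("fuga").save()

print(len(Sample.objects.all()))                   # 2
print(Sample.objects.filter(base="fuga")[0].base)  # fuga
```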
To confirm that it can handle huge binaries, I also prepared sample code applying it to deep learning with chainer.
See the dbarchive README for more details on other uses.
The following tools are useful for inspecting the values saved in mongodb.
mongohub is perfect for personal use; it works much like existing database client tools. If you want to check the data with multiple people through a web interface, use mongo-express.