Since I started writing a lot of analysis scripts in Python, I have often wanted to save the state of a learning process at any point so that it can be restored later. Once the code grows past a certain size, you end up writing a class that bundles the dataset, the parameter set, and the operations needed for learning, and running everything through instances of that class. At that point I always wonder whether I can just dump the whole class instance. Object serialization packages such as pickle fulfill this request: each object can be dumped and loaded.
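As a reminder of what that looks like, here is a minimal pickle round trip (the `Trainer` class is a hypothetical stand-in for such an analysis class):

```python
import pickle

class Trainer:
    """Hypothetical class bundling data and parameters for learning."""
    def __init__(self):
        self.params = {"lr": 0.01}
        self.history = [0.5, 0.3]

trainer = Trainer()
trainer.history.append(0.2)

# Dump the whole instance to a file ...
with open("trainer.pkl", "wb") as f:
    pickle.dump(trainer, f)

# ... and restore it later, state included.
with open("trainer.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.history)  # the appended value survives the round trip
```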
However, this gradually becomes unsatisfactory. What you want to do keeps growing more ambitious. At some point, standard serializers such as pickle are no longer enough, and you start wanting to manage the objects in a database. Finding a mechanism that makes this easy, however, is not straightforward.
Dump and restore alone you could build yourself, but once searching and sorting enter the picture, the dump-handling code grows larger than the original analysis code, which defeats the purpose.
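To see why, here is a sketch of the hand-rolled approach, assuming pickled blobs in a sqlite table. Dump and restore fit in a few lines, but searching or sorting by instance attributes would mean promoting every attribute into its own column, and that is where the bookkeeping balloons:

```python
import pickle
import sqlite3
from datetime import datetime

# Hand-rolled dump/restore layer: pickle each object into a BLOB column.
# Only "created" is queryable; any attribute you want to search or sort
# by would need its own column and its own serialization logic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE archive (id INTEGER PRIMARY KEY, created TEXT, blob BLOB)"
)

def dump(obj):
    conn.execute(
        "INSERT INTO archive (created, blob) VALUES (?, ?)",
        (datetime.now().isoformat(), pickle.dumps(obj)),
    )

def restore(row_id):
    row = conn.execute(
        "SELECT blob FROM archive WHERE id = ?", (row_id,)
    ).fetchone()
    return pickle.loads(row[0])

dump({"base": "hoge", "maxval": 10})
print(restore(1))  # {'base': 'hoge', 'maxval': 10}
```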
There was a time when I wanted to build a utility for this problem, but I hesitated when I realized it would require a fairly large framework, with a number of implementation issues to work through.
All things considered, it went far beyond the level of a single burst of effort (implemented in a little over a day and released in about a week), so although I wanted to do it, I couldn't afford to.
Recently, however, while finally studying MongoDB properly, I noticed that the problem above could be solved by combining it with an existing Python package.
When you go looking for the function you need, it usually turns out to already be provided. It seemed that all that was left was to write a lightweight wrapper, so I gave it a try (once things reach that point I have no excuse not to; I am just slow to get started ...).
The implemented package is named dbarchive and is already on github, so you can take a look.
A setup.py is also provided, so you can install it.
What kind of design would someone using an analysis class want? I want to push concerns like dump/restore out of the user's mind as much as possible. If possible, couldn't everything be solved just by inheriting a single superclass? For query searches, rather than inventing something original, I want to follow an existing mechanism (then there is no need to write documentation ...), so I settled on the following specifications.
Making it satisfy the above resulted in a package that can be used as follows. Below is sample code using dbarchive.
import numpy
import logging
from datetime import datetime

from dbarchive import Base


class Sample(Base):
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.bin = numpy.arange(maxval)
        self.created = datetime.now()


print('create sample instance')
sample01 = Sample(10)
sample01.save()
sample02 = Sample(3)
sample02.save()

for sample in Sample.objects.all():
    print('sample: ', type(sample))
    print('\tbase: ', sample.base)
    print('\tbin: ', sample.bin)
    print('\tcreated: ', sample.created)

sample01.bin = numpy.arange(20)
sample01.save()

for sample in Sample.objects.all():
    print('sample: ', type(sample))
    print('\tbase: ', sample.base)
    print('\tbin: ', sample.bin)
    print('\tcreated: ', sample.created)

print("all task completed")
Let's follow the source in detail. First, have the class you want to manage in the database inherit from the dbarchive.Base class.
class Sample(Base):
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.bin = numpy.arange(maxval)
        self.created = datetime.now()
Inheriting the dbarchive.Base class gives the class the utilities required for saving to the database. All you have to do is save the instance with the save method.
Note that the __init__ method of a class inheriting Base must be designed so that it can be called without arguments. This is because dbarchive needs to instantiate the class without arguments when automatically reconstructing an instance from the database, so this restriction is unavoidable.
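The mechanism this restriction enables can be sketched as follows: create a bare instance with no arguments, then overwrite its attributes with the stored values. This is a simplified illustration of the idea, not dbarchive's actual code:

```python
class Sample:
    # The default argument makes the constructor callable with no arguments.
    def __init__(self, maxval=10):
        self.base = "hoge"
        self.maxval = maxval

# Pretend this record came back from the database.
stored = {"base": "fuga", "maxval": 3}

# Reconstruction: instantiate without arguments, then restore attributes.
sample = Sample()
for name, value in stored.items():
    setattr(sample, name, value)

print(sample.base, sample.maxval)  # fuga 3
```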
print('create sample instance')
sample01 = Sample(10)
sample01.save()
sample02 = Sample(3)
sample02.save()
When the save method is called, the class creates a corresponding table (collection) in the database and saves the instance into it.
Searches go through a handler called objects that the class provides. Issuing queries through objects basically follows django conventions, so if you are used to them you can use it without any friction.
for sample in Sample.objects.all():
    print('sample: ', type(sample))
    print('\tbase: ', sample.base)
    print('\tbin: ', sample.bin)
    print('\tcreated: ', sample.created)
The code above fetches and displays all the instances saved so far. For details on creating a query set with the objects handler, refer to the following documentation.
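To give a feel for the django-style interface, here is a toy in-memory manager with the same shape of API. This is purely illustrative; dbarchive's real objects handler is backed by MongoDB:

```python
class Manager:
    """Toy stand-in for a django-style 'objects' handler (in-memory)."""
    def __init__(self):
        self._rows = []

    def all(self):
        return list(self._rows)

    def filter(self, **kwargs):
        # django-style keyword filtering on exact attribute values
        return [r for r in self._rows
                if all(getattr(r, k) == v for k, v in kwargs.items())]

class Sample:
    objects = Manager()

    def __init__(self, base="hoge"):
        self.base = base

    def save(self):
        Sample.objects._rows.append(self)

Sample("hoge").save()
Sample("fuga").save()

print(len(Sample.objects.all()))                   # 2
print(Sample.objects.filter(base="fuga")[0].base)  # fuga
```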
To confirm that it can handle huge binaries, I also prepared sample code applying it to deep learning with chainer.
See the dbarchive README for more details on other uses.
The following tools are useful for inspecting the values saved in mongodb.
mongohub is perfect for personal use; it works much like existing database client tools. If you want to check the data with multiple people through a web interface, use mongo-express.