MessagePack is a data serialization format proposed and implemented by Sadayuki Furuhashi of Treasure Data, and many of you probably know it as the format fluentd instances use to communicate with each other. Personally, I often use it to store data when doing data analysis, and this article introduces that approach.
I have tried to write this article as accurately as possible, but I am a relative amateur at data analysis, so suggestions and comments are welcome.
All I want to do here is **save data collected for analysis to a file in MessagePack format, and read that data back at analysis time**.
Normally, the standard approach to data analysis is to load the data into a database or similar first and then analyze it, but in the cases below I think saving and reading in MessagePack format has real merit.
For example, the data a system emits may be JSON in dictionary form where the keys differ from record to record, the types of the values vary, and you cannot even tell in advance whether a dictionary or an array will arrive. Ideally such a schema or specification would be properly defined even as keys and values change, but sometimes there is no specification at all, and you have no choice but to infer the structure from the actual data.
In such cases, MessagePack can serialize hierarchical structures such as dictionaries and arrays almost as-is, so it is easy to dump the data into a file without thinking about it for the time being. When pulling data for analysis, data I/O is often the bottleneck, so saving it locally as a file makes later trial and error much easier.
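As a minimal sketch of this with the msgpack-python package (the record below is hypothetical data, not from any particular system), a nested dictionary round-trips through pack and unpack as-is:

```python
import msgpack

# A hypothetical record mixing a dictionary, a list, and scalars
record = {"name": "Alice", "age": 27, "hist": [5, 3, 1]}

packed = msgpack.packb(record)      # compact bytes, ready to append to a file
restored = msgpack.unpackb(packed)  # the same nested structure comes back
```

No schema declaration is involved anywhere; the structure travels with the bytes.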
Data analysis covers many activities, but at the stage where you cannot yet form a solid hypothesis, you often need "exploratory data analysis": kneading the data to grasp the whole picture and formulate hypotheses. In that situation you may convert the data into an easier-to-process shape, or extract and save only the parts you need, and the data format itself goes through trial and error: "I want to add this attribute for a moment", or "maybe a dictionary is better than a list after all".
When you change the data format this flexibly, defining the schema somewhere other than the code that processes the data (for example, creating a table in SQL) means every code change can introduce an inconsistency between the two, which is very annoying. With MessagePack you still have to keep the write side and the read side in agreement about the format, but the work involved is minimal.
(However, if you don't leave at least some comments, you may find that when you reread your own code a few months later you have no idea what it means...)
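To sketch that flexibility (a hypothetical example with msgpack-python; the records and the "tags" attribute are made up for illustration): later records can grow an extra attribute with no schema migration anywhere, because each message carries its own structure:

```python
import msgpack

# Hypothetical records: the second one grew an extra attribute at some point
records = [
    {"name": "Alice", "age": 27},
    {"name": "Bob", "age": 33, "tags": ["vip"]},  # no schema change needed anywhere
]

# Write side: just concatenate the packed messages
buf = b"".join(msgpack.packb(r) for r in records)

# Read side: stream them back; code simply probes for optional keys
unpacker = msgpack.Unpacker()
unpacker.feed(buf)
restored = list(unpacker)
```

The reading code tolerates both shapes; only the logic that actually uses "tags" needs to know it may be absent.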
A DB as middleware does not simply store data; it also provides features such as indexes and reliability guarantees, so it carries more overhead than simply writing to disk. Techniques like load balancing can mitigate this, but if all you want is to "save a bit of data", converting it to MessagePack and writing it straight to a file gives you essentially raw file read/write performance for little effort.
However, the reliability is only that of writing to a file, and reading assumes you scan all of the data.
On the other hand, there are of course cases where saving and analyzing data in MessagePack format is not a good fit.
The following technologies are possible alternatives; choose according to your situation.
If your data is in a nearly fixed-length, columnar format with no hierarchy, CSV (or TSV) is the better choice. Once variable-length elements or hierarchical structures appear, though, MessagePack is easier to work with.
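As a rough illustration (the row below is hypothetical data): a nested, variable-length field has no natural CSV cell and must be flattened or re-encoded into a string, while MessagePack packs it directly:

```python
import csv
import io
import json

import msgpack

row = {"name": "Alice", "hist": [5, 3, 1]}  # "hist" is variable length and nested

# CSV: the list must be squeezed into a single cell somehow, e.g. as JSON text
out = io.StringIO()
csv.writer(out).writerow([row["name"], json.dumps(row["hist"])])

# MessagePack: the hierarchy survives as-is, no encoding convention needed
packed = msgpack.packb(row)
```

With CSV, every reader now has to know about the ad-hoc "JSON inside a cell" convention; with MessagePack the structure is self-describing.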
A document-oriented DB lets you insert without defining a schema, so from the standpoint of "just save the data for now" it can do the same job. However, insert performance reportedly tops out at around 3,500 inserts/sec, whereas writing MessagePack is just a direct write to disk, so saving with MessagePack, which keeps raw disk I/O performance, is overwhelmingly faster. If you later want to build indexes on keys and so on, though, MongoDB is the better fit.
JSON is at a disadvantage compared to MessagePack in both processing speed and data size (see the reference). In addition, when you put multiple objects into one data segment (for example, one file), JSON needs a module such as ijson that parses incrementally; without it you must parse everything at once, which gets painful on a modest machine when there are many records. And code written around ijson gets complicated, so personally I find it easier to just append records to one data segment with MessagePack and read them back straightforwardly.
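To illustrate the framing problem (a hypothetical sketch; ijson itself is not used here): two JSON objects concatenated into one buffer break a plain json.loads, while msgpack.Unpacker consumes concatenated messages naturally because each message is self-delimiting:

```python
import json

import msgpack

a, b = {"id": 1}, {"id": 2}

# Two JSON texts back to back are not a single valid JSON document
blob = json.dumps(a) + json.dumps(b)
try:
    json.loads(blob)
    ok = True
except ValueError:
    ok = False  # fails with "Extra data" after the first object

# MessagePack messages are self-delimiting, so streaming them back is trivial
unpacker = msgpack.Unpacker()
unpacker.feed(msgpack.packb(a) + msgpack.packb(b))
restored = list(unpacker)
```

This is exactly the pattern the file-based samples below rely on: keep appending packed objects, then iterate an Unpacker over the whole file.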
Protocol Buffers is famous as a serialization technology, but it requires defining the schema on the processing-code side, so data with an unclear schema is tedious to handle. As serialization for an interface between pieces of software, the enforced schema is a convenience; in cases where you do not know what shape of data will arrive, it becomes a burden.
The official page explains the format well, so I won't say much about it here; instead I'll show sample code focused only on writing to and reading from a file. The code is also available on GitHub.
In all cases, the following data is written to and read from the file data.msg.
{
"name": "Alice",
"age": 27,
"hist": [5, 3, 1]
}
{
"name": "Bob",
"age": 33,
"hist": [4, 5]
}
The msgpack-python package is required.
Installation
$ pip install msgpack-python
Write sample code
# coding: UTF-8
import msgpack
obj1 = {
    "name": "Alice",
    "age": 27,
    "hist": [5, 3, 1]
}
obj2 = {
    "name": "Bob",
    "age": 33,
    "hist": [4, 5]
}
with open('data.msg', 'wb') as fd:  # binary mode: packb returns bytes
    fd.write(msgpack.packb(obj1))
    fd.write(msgpack.packb(obj2))
Read sample code
# coding: UTF-8
import msgpack
with open('data.msg', 'rb') as fd:
    for msg in msgpack.Unpacker(fd):
        print(msg)
The msgpack gem is required.
$ gem install msgpack
Write sample code
# -*- coding: utf-8 -*-
require "msgpack"
obj1 = {
  "name" => "Alice",
  "age" => 27,
  "hist" => [5, 3, 1]
}
obj2 = {
  "name" => "Bob",
  "age" => 33,
  "hist" => [4, 5]
}
File.open("data.msg", "wb") do |file|
  file.write(obj1.to_msgpack)
  file.write(obj2.to_msgpack)
end
Read sample code
# -*- coding: utf-8 -*-
require "msgpack"
File.open("data.msg", "rb") do |file|
  MessagePack::Unpacker.new(file).each do |obj|
    puts obj
  end
end
There are several major MessagePack libraries for Node.js; this time I'll use msgpack-lite.
$ npm install msgpack-lite
Write sample code
const fs = require('fs');
const msgpack = require('msgpack-lite');

const obj1 = {
  name: "Alice",
  age: 27,
  hist: [5, 3, 1]
};
const obj2 = {
  name: "Bob",
  age: 33,
  hist: [4, 5]
};

fs.open('data.msg', 'w', (err, fd) => {
  if (err) throw err;
  fs.writeSync(fd, msgpack.encode(obj1));
  fs.writeSync(fd, msgpack.encode(obj2));
  fs.closeSync(fd);
});
Read sample code
const fs = require('fs');
const msgpack = require('msgpack-lite');

const rs = fs.createReadStream('data.msg');
const ds = msgpack.createDecodeStream();
rs.pipe(ds).on('data', (msg) => {
  console.log(msg);
});
The msgpack-c library is required. On macOS you can install it with Homebrew.
$ brew install msgpack
Write sample code
#include <msgpack.hpp>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    // O_TRUNC so a rerun does not leave stale bytes from a previous, longer file
    int fd = open("data.msg", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    msgpack::sbuffer buf1, buf2;
    msgpack::packer<msgpack::sbuffer> pk1(&buf1), pk2(&buf2);

    pk1.pack_map(3);
    pk1.pack("name"); pk1.pack("Alice");
    pk1.pack("age");  pk1.pack(27);
    pk1.pack("hist");
    pk1.pack_array(3);
    pk1.pack(5); pk1.pack(3); pk1.pack(1);
    write(fd, buf1.data(), buf1.size());

    pk2.pack_map(3);
    pk2.pack("name"); pk2.pack("Bob");
    pk2.pack("age");  pk2.pack(33);
    pk2.pack("hist");
    pk2.pack_array(2);
    pk2.pack(4); pk2.pack(5);
    write(fd, buf2.data(), buf2.size());

    close(fd);
    return 0;
}
Read sample code
#include <msgpack.hpp>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <cstdio>
#include <cstring>
#include <cstdint>
#include <string>

int main(int argc, char *argv[]) {
    static const size_t BUFSIZE = 4; // deliberately small, to exercise streaming
    int rc;
    char buf[BUFSIZE];
    int fd = open("data.msg", O_RDONLY);
    msgpack::unpacker unpkr;

    while (0 < (rc = read(fd, buf, sizeof(buf)))) {
        // Feed the raw chunk into the unpacker's internal buffer
        unpkr.reserve_buffer(rc);
        memcpy(unpkr.buffer(), buf, rc);
        unpkr.buffer_consumed(rc);

        // Drain every complete object accumulated so far
        msgpack::object_handle result;
        while (unpkr.next(result)) {
            const msgpack::object &obj = result.get();
            if (obj.type == msgpack::type::MAP) {
                printf("{\n");
                msgpack::object_kv *p = obj.via.map.ptr;
                msgpack::object_kv *const pend = obj.via.map.ptr + obj.via.map.size;
                for (; p < pend; ++p) {
                    std::string key;
                    p->key.convert(key);
                    if (key == "name") {
                        std::string value;
                        p->val.convert(value);
                        printf("  %s: %s,\n", key.c_str(), value.c_str());
                    }
                    if (key == "age") {
                        int value;
                        p->val.convert(value);
                        printf("  %s: %d,\n", key.c_str(), value);
                    }
                    if (key == "hist") {
                        msgpack::object arr = p->val;
                        printf("  %s: [", key.c_str());
                        for (uint32_t i = 0; i < arr.via.array.size; i++) {
                            int value;
                            arr.via.array.ptr[i].convert(value);
                            printf("%d, ", value);
                        }
                        printf("],\n");
                    }
                }
                printf("}\n");
            }
            result.zone().reset();
        }
    }
    close(fd);
    return 0;
}
By the way, if you stream a `msgpack::object` into an `ostream` (`std::cout` and the like), it gets formatted and printed for you automatically; retrieving the values programmatically, however, is as tedious as shown above, which is why the sample spells out the procedure.