[RUBY] Recommendation of data analysis using MessagePack

MessagePack is a data serialization format proposed and implemented by Mr. Furuhashi of Treasure Data, and many of you probably know it as the format fluentd instances use to talk to each other. Personally, I often use it to store data when doing data analysis, and this article introduces that approach.

Although I have tried to write this article as accurately as possible, I am a relative amateur at data analysis, so suggestions and comments are welcome.

What to do with MessagePack

All I want to do is **save the data obtained for analysis to a file in MessagePack format, and read that data back at analysis time**.

Normally, the standard approach for data analysis is to load the data into a database or similar first and then run the analysis, but in the following cases I think there is real merit in saving and reading the data in MessagePack format instead.

When MessagePack is a good fit

When handling data with an unclear schema

For example, say the data coming out of some system is JSON in dictionary form, but the keys contained in each record differ, the format of the values varies, and you cannot even tell whether a dictionary or an array will arrive next. It would be fine if such changes in keys or values were covered by a properly defined schema or specification, but sometimes there is no specification at all, and you have no choice but to infer the structure from the actual data.

In such cases, MessagePack can serialize hierarchical structures such as dictionaries and arrays almost as-is, so you can dump the data into a file without thinking too hard about it. When pulling data for analysis, data I/O is often the bottleneck, so keeping the data locally as a file makes later trial and error much easier.
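As a concrete illustration, here is a minimal Python sketch (using the same msgpack-python package as in the samples below; the record contents are invented for illustration) that appends records with differing keys and nesting to a single file without defining any schema first:

# coding: UTF-8

import msgpack

# Hypothetical records whose keys and value types differ from one another
records = [
    {"name": "Alice", "age": 27},
    {"name": "Bob", "tags": ["admin", "staff"]},
    {"event": "login", "meta": {"ip": "192.0.2.1", "ok": True}},
]

# Append each record as-is; nested dictionaries and arrays are handled directly
with open('mixed.msg', 'ab') as fd:
    for record in records:
        fd.write(msgpack.packb(record))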

When you want to save data temporarily, or save and load it repeatedly

Data analysis means many things, but before you can form a solid hypothesis you often need to do exploratory analysis: working the data over to grasp the whole picture and come up with hypotheses. At that stage you may want to convert the data into a format that is easier to process, extract and save only the parts you need, and so on, and there will be trial and error around the data format itself, such as "I just want to add this one attribute" or "maybe this is better saved as a dictionary than as a list".

When the data format is changing this flexibly, defining the schema somewhere other than the code that processes the data (for example, creating a table in SQL) means that every change on the code side can introduce inconsistencies, which is very annoying. With MessagePack, on the other hand, you still need the writing side and the reading side to agree on the data format, but that is essentially the only work involved.

(However, if you do not leave at least some comments, you may find that when you revisit your own code a few months later, you no longer understand it at all...)
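As one example of that kind of trial and error, the following Python sketch (the file names and the added hist_total attribute are hypothetical) reads existing records, adds an attribute, and saves them to a new file; the read side and the write side live in the same short script, so keeping them consistent is easy. Note that raw=False, available in recent msgpack-python versions, keeps keys as str rather than bytes:

# coding: UTF-8

import msgpack

# Read records from the old file, add one attribute, and write them to a new file
with open('data.msg', 'rb') as src, open('data_v2.msg', 'wb') as dst:
    for record in msgpack.Unpacker(src, raw=False):
        record["hist_total"] = sum(record.get("hist", []))  # newly added attribute
        dst.write(msgpack.packb(record))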

When you need to save and load data quickly (as long as full reads are fine)

A database, as middleware, does much more than simply store data: it provides keys, reliability guarantees, and various other features, and those come with more overhead than simply writing to disk. Some of that can be addressed with techniques such as load balancing, but if all you want is to "save a bit of data", converting it to MessagePack format and writing it straight to a file takes little effort, and you get roughly the raw read/write performance of the file system.

However, reliability is only what writing to a plain file gives you, and the assumption is that all of the data is read back in full when loading.
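To get a rough feel for how close this is to plain file I/O, a sketch like the one below simply times packing and writing records in a loop (the record shape and the count of 100,000 are arbitrary, and this is not a rigorous benchmark):

# coding: UTF-8

import time
import msgpack

N = 100000  # arbitrary record count for this rough measurement
record = {"name": "Alice", "age": 27, "hist": [5, 3, 1]}

start = time.time()
with open('bench.msg', 'wb') as fd:
    for _ in range(N):
        fd.write(msgpack.packb(record))
elapsed = time.time() - start

print("%d records in %.2f sec (%.0f records/sec)" % (N, elapsed, N / elapsed))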

When MessagePack is not a good fit

On the other hand, there are of course cases where saving data in MessagePack format is not a good way to analyze it.

  1. When the data has a clear schema, or is already organized in a DB or similar -- just use the features the DB provides.
  2. For analyses such as search and aggregation that benefit greatly from keys and indexes -- reading the data is basically a full scan, so processing time grows linearly with the amount of data. As a rough guide, the practical upper limit for data saved with MessagePack is about one million records; beyond that, seriously consider putting it in a DB or similar.
  3. When multiple people handle the same data -- this approach assumes a single file being read and written, so locking is not considered.
  4. When you need to ensure the confidentiality, integrity, and availability of the data -- with everything exported to one file, access cannot be controlled per record, and it is easy to delete the file by accident, so weigh this carefully.

Alternatives

The following technologies can be considered as alternatives; choose whichever fits your situation.

CSV

If your data is essentially fixed-length, columnar, and not hierarchical, CSV (or TSV) is a better fit. When variable-length elements or hierarchical structures are involved, however, MessagePack is easier to use.

MongoDB

With a document-oriented DB you can insert without defining a schema, so in terms of "just save the data for now" it can do the same thing. However, insert performance seems to top out at around 3,500 inserts/sec (reference), whereas writing with MessagePack is just a direct write to disk, so saving with MessagePack, which keeps close to raw disk I/O performance, is overwhelmingly faster. That said, if you later want to create keys and the like, MongoDB is the better fit.

JSON, BSON

They are at a disadvantage compared to MessagePack in terms of processing speed and data size (reference). In addition, when you put multiple JSON objects into one data segment (for example, one file), you need a module such as ijson that parses incrementally; otherwise you have to parse everything at once, which is painful on a modest machine when there are many records. On the other hand, code written around ijson gets complicated, so personally I find it easier to simply append objects to one data segment with MessagePack and read them back in order.
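If you want to see the data-size difference for yourself, a quick Python sketch like the following compares the serialized size of one of the sample records in JSON and in MessagePack (the exact numbers depend on the data, so treat them only as an illustration):

# coding: UTF-8

import json
import msgpack

obj = {"name": "Alice", "age": 27, "hist": [5, 3, 1]}

as_json = json.dumps(obj).encode('utf-8')
as_msgpack = msgpack.packb(obj)

print("JSON:        %d bytes" % len(as_json))
print("MessagePack: %d bytes" % len(as_msgpack))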

Protocol Buffers

Protocol Buffers is well known as a serialization technology, but because the schema must be defined on the processing-code side, it is cumbersome for data whose schema is unclear. As serialization for an interface between pieces of software it is convenient precisely because it enforces a schema, but it becomes hard to use when you do not know what kind of schema the incoming data will have.

Sample code

The official page has plenty of explanation, so there is not much to add here; the samples below focus solely on writing to and reading from a file. The code is also available on GitHub.

In every case, the following two records are written to and read from the file data.msg.

{
  "name": "Alice", 
  "age": 27,
  "hist": [5, 3, 1]
}
{
  "name": "Bob", 
  "age": 33,
  "hist": [4, 5]
}

Python

Package msgpack-python is required.

Installation

$ pip install msgpack-python

Write sample code

# coding: UTF-8

import msgpack

obj1 = {
    "name": "Alice",
    "age": 27,
    "hist": [5, 3, 1]
}
obj2 = {
    "name": "Bob",
    "age": 33,
    "hist": [4, 5]
}

with open('data.msg', 'wb') as fd:
    fd.write(msgpack.packb(obj1))
    fd.write(msgpack.packb(obj2))

Read sample code

# coding: UTF-8

import msgpack

with open('data.msg', 'rb') as fd:
    for msg in msgpack.Unpacker(fd):
        print(msg)

Ruby

Package msgpack is required.

$ gem install msgpack

Write sample code

# -*- coding: utf-8 -*-

require "msgpack"

obj1 = {
  "name" => "Alice",
  "age"  => 27,
  "hist" => [5, 3, 1]
}
obj2 = {
  "name" => "Bob",
  "age"  => 33,
  "hist" => [4, 5]
}

File.open("data.msg", "wb") do |file|
  file.write(obj1.to_msgpack)
  file.write(obj2.to_msgpack)
end

Read sample code

# -*- coding: utf-8 -*-

require "msgpack"

File.open("data.msg") do |file|
  MessagePack::Unpacker.new(file).each do |obj|
    puts obj
  end
end

Node

There are several major MessagePack libraries for Node; here I use msgpack-lite.

$ npm install msgpack-lite

Write sample code

const fs = require('fs');
const msgpack = require('msgpack-lite');

const obj1 = {
  name: "Alice",
  age: 27,
  hist: [5, 3, 1]
};
const obj2 = {
  name: "Bob",
  age: 33,
  hist: [4, 5]
};

const fd = fs.openSync('data.msg', 'w');
fs.writeSync(fd, msgpack.encode(obj1));
fs.writeSync(fd, msgpack.encode(obj2));
fs.closeSync(fd);

Read sample code

const fs = require('fs');
const msgpack = require('msgpack-lite');

const rs = fs.createReadStream('data.msg');
const ds = msgpack.createDecodeStream();

rs.pipe(ds).on('data', (msg) => {
  console.log(msg);
});

C++

The msgpack-c library is required. On macOS, you can install it with Homebrew.

$ brew install msgpack

Write sample code

#include <msgpack.hpp>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  int fd = open("data.msg", O_WRONLY | O_CREAT, 0600);

  msgpack::sbuffer buf1, buf2;
  msgpack::packer<msgpack::sbuffer> pk1(&buf1), pk2(&buf2);

  pk1.pack_map(3);
  pk1.pack("name"); pk1.pack("Alice");
  pk1.pack("age");  pk1.pack(27);
  pk1.pack("hist");
  pk1.pack_array(3);
  pk1.pack(5); pk1.pack(3); pk1.pack(1);

  write(fd, buf1.data(), buf1.size());


  pk2.pack_map(3);
  pk2.pack("name"); pk2.pack("Bob");
  pk2.pack("age");  pk2.pack(33);
  pk2.pack("hist");
  pk2.pack_array(2);
  pk2.pack(4); pk2.pack(5);

  write(fd, buf2.data(), buf2.size());

  close(fd);
  return 0;
}

Read sample code

#include <msgpack.hpp>
#include <cstdio>
#include <cstring>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
  static const size_t BUFSIZE = 4; // deliberately small buffer, to show that records spanning multiple reads are still decoded
  int rc;
  char buf[BUFSIZE];

  int fd = open("data.msg", O_RDONLY);

  msgpack::unpacker unpkr;
  while (0 < (rc = read(fd, buf, sizeof(buf)))) {
    unpkr.reserve_buffer(rc);
    memcpy(unpkr.buffer(), buf, rc);
    unpkr.buffer_consumed(rc);

    msgpack::object_handle result;
    while (unpkr.next(result)) {
      const msgpack::object &obj = result.get();

      if (obj.type == msgpack::type::MAP) {
        printf("{\n");
        msgpack::object_kv* p(obj.via.map.ptr);

        for(msgpack::object_kv* const pend(obj.via.map.ptr + obj.via.map.size);
            p < pend; ++p) {

          std::string key;
          p->key.convert(key);

          if (key == "name") {
            std::string value;
            p->val.convert(value);
            printf("  %s: %s,\n", key.c_str(), value.c_str());
          }

          if (key == "age") {
            int value;
            p->val.convert(value);
            printf("  %s: %d,\n", key.c_str(), value);
          }

          if (key == "hist") {
            msgpack::object arr = p->val;
            printf ("  %s, [", key.c_str());
            for (int i = 0; i < arr.via.array.size; i++) {
              int value;
              arr.via.array.ptr[i].convert(value);

              printf("%d, ", value);
            }
            printf ("],\n");
          }
        }

        printf("}\n");
      }

      result.zone().reset();
    }
  }

  close(fd);
  return 0;
}

Incidentally, if you feed a msgpack::object into an ostream (std::cout, etc.), it is pretty-printed automatically, but retrieving the values programmatically takes the somewhat tedious procedure shown in the sample above.
