This post is a formatted version of my blog post for Qiita. Any additions will be written on the blog: ["Introduction to Machine Learning Library SHOGUN"](http://rest-term.com/archives/3090/)
As the official description puts it, "The machine learning toolbox's focus is on large scale kernel methods and especially on Support Vector Machines (SVM)."
Contents:

- Installation
  - NOTE: Dependency library installation (SWIG, BLAS (ATLAS) / LAPACK / GLPK / Eigen3, NumPy)
  - Precautions when compiling
- hello, world (libshogun)
  - Notes on memory management
- Python Modular
The official website describes the setup procedure assuming a Debian-based OS, but it installs on a Red Hat system without any trouble. On Debian an older version is distributed as a deb package, but the machine here runs CentOS, so we compile and install from source. Since machine-learning jobs often run for a long time, I recommend building software like SHOGUN (and not just SHOGUN) in the environment where it will actually run, so that it operates in its optimal state.
The project was apparently built with Autotools (./configure && make) before, but the latest packages have switched to CMake. Projects adopting CMake, such as OpenCV and MySQL, have been increasing over the last few years.
$ git clone git://github.com/shogun-toolbox/shogun.git
$ cd shogun
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=/usr/local/shogun-2.1.0 \
-DCMAKE_BUILD_TYPE=Release \
-DBUNDLE_EIGEN=ON \
-DBUNDLE_JSON=ON \
-DCmdLineStatic=ON \
-DPythonModular=ON ..
##Dependent libraries are checked and the build configuration is displayed.
-- Summary of Configuration Variables
--
-- The following OPTIONAL packages have been found:
* GDB
* OpenMP
* BLAS
* Threads
* LAPACK
* Atlas
* GLPK
* Doxygen
* LibXml2
* CURL
* ZLIB
* BZip2
* Spinlock
-- The following REQUIRED packages have been found:
* SWIG (required version >= 2.0.4)
* PythonLibs
* PythonInterp
* NumPy
-- The following OPTIONAL packages have not been found:
* CCache
* Mosek
* CPLEX
* ARPACK
* NLopt
* LpSolve
* ColPack
* ARPREC
* HDF5
* LibLZMA
* SNAPPY
* LZO
-- ==============================================================================================================
-- Enabled Interfaces
-- libshogun is ON
-- python modular is ON
-- octave modular is OFF - enable with -DOctaveModular=ON
-- java modular is OFF - enable with -DJavaModular=ON
-- perl modular is OFF - enable with -DPerlModular=ON
-- ruby modular is OFF - enable with -DRubyModular=ON
-- csharp modular is OFF - enable with -DCSharpModular=ON
-- R modular is OFF - enable with -DRModular=ON
-- lua modular is OFF - enable with -DLuaModular=ON
--
-- Enabled legacy interfaces
-- cmdline static is ON
-- python static is OFF - enable with -DPythonStatic=ON
-- octave static is OFF - enable with -DOctaveStatic=ON
-- matlab static is OFF - enable with -DMatlabStatic=ON
-- R static is OFF - enable with -DRStatic=ON
-- ==============================================================================================================
##If there seems to be no problem, compile and install
$ make -j32
$ sudo make install
With a compiler that supports C++11, some C++11 features (std::atomic, etc.) are apparently used. Note that the system GCC on Red Hat 6.x (v4.4.x) supports very little of C++11.
In my environment, I installed the command-line interface and the Python interface in addition to libshogun itself. If you want interfaces for other scripting languages, you need to install SWIG separately.
SWIG
SWIG is a tool that generates bindings so that modules (shared libraries) written in C/C++ can be used from high-level languages such as scripting languages. I also occasionally use SWIG at work to build PHP bindings for web interfaces. As of November 2013, the package installable with yum does not meet SHOGUN's version requirement, so compile and install SWIG from source as well. On Debian, `$ apt-get install swig2.0` is enough.
##Install the dependency PCRE (Perl Compatible Regular Expressions) if it is missing
$ sudo yum install pcre-devel.x86_64
##Erase old rpm packages if they are included
$ sudo yum remove swig
$ wget http://prdownloads.sourceforge.net/swig/swig-2.0.11.tar.gz
$ tar zxf swig-2.0.11.tar.gz
$ cd swig-2.0.11
$ ./configure --prefix=/usr/local/swig-2.0.11
$ make -j2
$ sudo make install
##Put a binary symbolic link in your PATH
$ sudo ln -s /usr/local/swig-2.0.11/bin/swig /usr/local/bin
Once the runtimes of the languages you want SWIG bindings for are in place, build SHOGUN again.
BLAS(ATLAS)/LAPACK/GLPK/Eigen3
A set of libraries related to linear algebra. The packages installable with yum are fine here. ATLAS is one of the optimized BLAS implementations; it is easy to install and worth including (plain BLAS is the reference implementation). SHOGUN also supports CPLEX in addition to GLPK, so for business use introducing CPLEX should improve performance further (it appears to be free for academic purposes, so let's do our best writing the approval paperwork). As for Eigen3, as of November 2013 the yum package does not meet SHOGUN's version requirement, but if Eigen3 is not found you can instruct CMake to download the source itself (it is a template library, so only header files are needed). Just add -DBUNDLE_EIGEN=ON
to the CMake options.
##Install all linear algebra related libraries
##If you install atlas-devel, the lapack devel package is not required
$ sudo yum install blas-devel.x86_64 lapack-devel.x86_64 atlas-devel.x86_64 glpk-devel.x86_64
NumPy
**NumPy is required for the Python interface.** I have introduced NumPy on my blog before, so see that for reference. Incidentally, the Python interface of OpenCV (a computer vision library) also uses NumPy.
Installation is easy with pip (a tool for installing and managing Python packages).
$ sudo pip install numpy
The overall picture of the SHOGUN library is shown in the figure below; it is quite overwhelming. In addition to the interfaces in the figure, Java, Ruby, Lua, and others are supported as scripting-language interfaces. As mentioned above, install SWIG and build the binding for the language you want to use.
If you build SHOGUN (a Release build) in a cheap VPS environment with little physical/virtual memory, there is a high chance that the cc1plus process will be killed by the OOM killer. I tried it in a virtual environment with 1GB RAM / 2GB swap, and it was killed with a terrible score:
kernel: Out of memory: Kill process 30340 (cc1plus) score 723 or sacrifice child
kernel: Killed process 30340, UID 500, (cc1plus) total-vm:2468236kB, anon-rss:779716kB, file-rss:2516kB
kernel: cc1plus invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
In that case, consider increasing the swap space. With only 1GB of physical memory, about twice that is probably still not enough, so temporarily securing about four times the physical memory should be safe. In an OpenVZ virtual environment (where you usually cannot add swap yourself) it may be better to give up.
Also, a virtual environment like a VPS tends to have few inodes, so you probably cannot store a large amount of training data there either. It is easier to simply build on a physical server. I built it here in an environment with 32 cores / 96GB RAM, and with ample resources the build went smoothly.
By the way, the GCC optimization options in my environment are as follows.
-march=core2 -mcx16 -msahf -maes -mpclmul -mavx --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=generic
hello, world (libshogun)
Let's start with a simple task using libshogun: a sample that classifies data with an [SVM (Support Vector Machine)](http://ja.wikipedia.org/wiki/%E3%82%B5%E3%83%9D%E3%83%BC%E3%83%88%E3%83%99%E3%82%AF%E3%82%BF%E3%83%BC%E3%83%9E%E3%82%B7%E3%83%B3).
/* hello_shogun.cpp */
#include <shogun/labels/BinaryLabels.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/kernel/GaussianKernel.h>
#include <shogun/classifier/svm/LibSVM.h>
#include <shogun/base/init.h>
#include <shogun/lib/common.h>
#include <shogun/io/SGIO.h>
using namespace shogun;
int main(int argc, char** argv) {
// initialize
init_shogun_with_defaults();
// create some data
SGMatrix<float64_t> matrix(2,3);
for(int i=0; i<6; i++) {
matrix.matrix[i] = i;
}
matrix.display_matrix();
// create three 2-dimensional vectors
CDenseFeatures<float64_t>* features = new CDenseFeatures<float64_t>();
features->set_feature_matrix(matrix);
// create three labels
CBinaryLabels* labels = new CBinaryLabels(3);
labels->set_label(0, -1);
labels->set_label(1, +1);
labels->set_label(2, -1);
// create gaussian kernel(RBF) with cache 10MB, width 0.5
CGaussianKernel* kernel = new CGaussianKernel(10, 0.5);
kernel->init(features, features);
// create libsvm with C=10 and train
CLibSVM* svm = new CLibSVM(10, kernel, labels);
svm->train();
SG_SPRINT("total sv:%d, bias:%f\n", svm->get_num_support_vectors(), svm->get_bias());
// classify on training examples
for(int i=0; i<3; i++) {
SG_SPRINT("output[%d]=%f\n", i, svm->apply_one(i));
}
// free up memory
SG_UNREF(svm);
exit_shogun();
return 0;
}
##With a relatively new compiler, it is good to compile with C++11 enabled
$ g++ -g -Wall -std=c++0x -L/usr/local/lib64 -lshogun hello_shogun.cpp -o hello_shogun
$ ./hello_shogun
matrix=[
[ 0, 2, 4],
[ 1, 3, 5]
]
total sv:3, bias:-0.333333
output[0]=-0.999997
output[1]=1.000003
output[2]=-1.000005
Extract the feature vectors (CDenseFeatures) from the matrix, set the correct labels (CBinaryLabels), and train an SVM (CLibSVM) with a Gaussian kernel (CGaussianKernel). A few characteristics of the code:
Feature vectors are read from the matrix in column-major order, as in OpenGL or cuBLAS (one column becomes one feature vector). For SVM training, external libraries such as LibSVM and SVMLight are used internally behind the SHOGUN interface (see under shogun/classifier/svm/).
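To make the column-major layout concrete, here is a tiny sketch (my illustration, not code from the original post; it reuses the flat `matrix` member shown in the sample above):

```cpp
// My illustration of SGMatrix's column-major storage
#include <shogun/base/init.h>
#include <shogun/lib/SGMatrix.h>
#include <shogun/io/SGIO.h>
using namespace shogun;

int main() {
    init_shogun_with_defaults();
    SGMatrix<float64_t> m(2, 3);   // 2 dimensions x 3 feature vectors
    for(int i=0; i<6; i++) {
        m.matrix[i] = i;           // flat index = column * num_rows + row
    }
    // column j is the j-th feature vector: (0,1), (2,3), (4,5)
    for(int j=0; j<3; j++) {
        SG_SPRINT("vector %d = (%g, %g)\n", j, m.matrix[2*j], m.matrix[2*j+1]);
    }
    exit_shogun();
    return 0;
}
```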
The code above looks like an obvious memory leak, but checking with valgrind shows that memory is managed properly internally. SHOGUN manages objects by reference counting, and macros (SG_REF / SG_UNREF) that increment/decrement the count are provided, though you do not have to manipulate the count manually for every instance. Reading SHOGUN's memory-management code, when a referencing instance is released it decrements the reference count of the instances it refers to (that is, in its destructor). An instance is released when its reference count drops to zero or below, so in the code above, releasing the SVM instance releases the other instances in a chain.
Note that **nothing happens when a variable goes out of scope**; the decision to release an instance is made only when its reference count changes. It feels like well-meant meddling, but it cannot be helped, so we need to work with this mechanism. Here are a couple of policies.
Fully manual reference-count management
If you delete an object yourself while it is under the control of SHOGUN's reference counting, you risk a double free. So, as soon as you create an instance, manually increment its reference count (SG_REF), and manually decrement it (SG_UNREF) when you no longer need it. It is the Objective-C style with ARC turned off. However, it can be difficult to write exception-safe code this way.
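A minimal sketch of this policy (my illustration, not code from the original post):

```cpp
#include <shogun/base/init.h>
#include <shogun/features/DenseFeatures.h>
using namespace shogun;

int main() {
    init_shogun_with_defaults();
    CDenseFeatures<float64_t>* features = new CDenseFeatures<float64_t>();
    SG_REF(features);   // claim our own reference right after creation
    // ... pass features to kernels/machines, which take their own references ...
    SG_UNREF(features); // drop our reference when done; the object is deleted
                        // only once no other SHOGUN object still refers to it
    exit_shogun();
    return 0;
}
```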
Smart pointer + semi-manual reference-count management
With a relatively new C++ compiler you can use std::unique_ptr / std::shared_ptr, so this is the approach to take. As mentioned above, under SHOGUN's reference counting an instance is deleted internally once its count reaches zero or below, so simply wrapping it in a smart pointer also risks a double free. Therefore, after creating an instance, manually increment its reference count once and never decrement it, so that delete is never executed inside SHOGUN. In other words, always keep the reference count at 1 or above, effectively opting out of SHOGUN's object management. This makes it easier to write exception-safe code without the fear of forgetting SG_UNREF.
// Create the instance with a standard C++ smart pointer, then manually
// increment the SHOGUN reference count
std::unique_ptr<CDenseFeatures<float64_t> > features(new CDenseFeatures<float64_t>());
SG_REF(features);
// never decrement the reference count, so SHOGUN never deletes the object
Next, let's feed it unknown data. We use training/test data generated in Python and written to files, as shown below.
import numpy as np
def genexamples(n):
class1 = 0.6*np.random.randn(n, 2)
class2 = 1.2*np.random.randn(n, 2) + np.array([5, 1])
labels = np.hstack((np.ones(n), -np.ones(n)))
return (class1, class2, labels)
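# To produce the .dat files read by the C++ code below, something like the
# following should work (my assumption, not shown in the original post:
# space-separated values, one sample per column so that SHOGUN reads each
# column as one 2-dimensional feature vector; adjust the transpose if the
# reported vector count comes out wrong):
#   class1, class2, labels = genexamples(200)
#   np.savetxt('traindata.dat', np.vstack((class1, class2)).T)
#   np.savetxt('labeldata.dat', labels)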
**Note: the following code was confirmed to work with libshogun.so.14 and earlier.**
#include <shogun/labels/BinaryLabels.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/kernel/GaussianKernel.h>
#include <shogun/classifier/svm/LibSVM.h>
#include <shogun/io/SGIO.h>
#include <shogun/io/CSVFile.h>
#include <shogun/evaluation/ContingencyTableEvaluation.h>
#include <shogun/base/init.h>
#include <shogun/lib/common.h>
using namespace std;
using namespace shogun;
int main(int argc, char** argv) {
try {
init_shogun_with_defaults();
// training examples
CCSVFile train_data_file("traindata.dat");
// labels of the training examples
CCSVFile train_labels_file("labeldata.dat");
// test examples
CCSVFile test_data_file("testdata.dat");
SG_SPRINT("training ...\n");
SGMatrix<float64_t> train_data;
train_data.load(&train_data_file);
CDenseFeatures<float64_t>* train_features = new CDenseFeatures<float64_t>(train_data);
SG_REF(train_features);
SG_SPRINT("num train vectors: %d\n", train_features->get_num_vectors());
CBinaryLabels* train_labels = new CBinaryLabels();
SG_REF(train_labels);
train_labels->load(&train_labels_file);
SG_SPRINT("num train labels: %d\n", train_labels->get_num_labels());
float64_t width = 2.1;
CGaussianKernel* kernel = new CGaussianKernel(10, width);
SG_REF(kernel);
kernel->init(train_features, train_features);
float64_t C = 1.0;
CLibSVM* svm = new CLibSVM(C, kernel, train_labels);
SG_REF(svm);
svm->train();
SG_SPRINT("total sv:%d, bias:%f\n", svm->get_num_support_vectors(), svm->get_bias());
SG_UNREF(train_features);
SG_UNREF(train_labels);
SG_UNREF(kernel);
CBinaryLabels* predict_labels = svm->apply_binary(train_features);
SG_REF(predict_labels);
CErrorRateMeasure* measure = new CErrorRateMeasure();
SG_REF(measure);
measure->evaluate(predict_labels, train_labels);
float64_t accuracy = measure->get_accuracy()*100;
SG_SPRINT("accuracy: %f\%\n", accuracy);
SG_UNREF(predict_labels);
SG_UNREF(measure);
SG_SPRINT("testing ...\n");
SGMatrix<float64_t> test_data;
test_data.load(&test_data_file);
CDenseFeatures<float64_t>* test_features = new CDenseFeatures<float64_t>(test_data);
SG_REF(test_features);
SG_SPRINT("num test vectors: %d\n", test_features->get_num_vectors());
CBinaryLabels* test_labels = svm->apply_binary(test_features);
SG_REF(test_labels);
SG_SPRINT("num test labels: %d\n", test_labels->get_num_labels());
SG_SPRINT("test labels: ");
test_labels->get_labels().display_vector();
CCSVFile test_labels_file("test_labels_file.dat", 'w');
test_labels->save(&test_labels_file);
SG_UNREF(svm);
SG_UNREF(test_features);
SG_UNREF(test_labels);
exit_shogun();
} catch(ShogunException& e) {
SG_SPRINT("%s", e.get_exception_string());
return -1;
}
return 0;
}
training ...
num train vectors: 400
num train labels: 400
total sv:37, bias:-0.428868
accuracy: 99.750000%
testing ...
num test vectors: 400
num test labels: 400
test labels: vector=[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1]
SHOGUN's class design has just the right granularity, and the API is well abstracted, but the code does get messy once reference-counting operations are mixed in. Also, the training data is read with the CCSVFile class; despite the name, it can also read files in which two-dimensional data is separated by spaces rather than commas.
For SHOGUN and OpenCV to get along, it may be useful to write an adapter that converts between shogun::SGMatrix and cv::Mat. Also, OpenCV keeps advancing its CUDA support, and I would like SHOGUN to support CUDA as well.
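A rough sketch of such an adapter (my own assumption, not from the original post; it relies on SGMatrix's (row, col) accessor and copies element by element, transposing because cv::Mat conventionally holds one sample per row while SGMatrix holds one sample per column):

```cpp
#include <opencv2/core/core.hpp>
#include <shogun/lib/SGMatrix.h>

// cv::Mat (one sample per row) -> SGMatrix (one sample per column)
shogun::SGMatrix<float64_t> matToSGMatrix(const cv::Mat& src) {
    cv::Mat m;
    src.convertTo(m, CV_64F);  // ensure double precision
    shogun::SGMatrix<float64_t> dst(m.cols, m.rows);  // dim x num_vectors
    for(int i=0; i<m.rows; i++)
        for(int j=0; j<m.cols; j++)
            dst(j, i) = m.at<double>(i, j);
    return dst;
}

// SGMatrix (one sample per column) -> cv::Mat (one sample per row)
cv::Mat sgMatrixToMat(const shogun::SGMatrix<float64_t>& src) {
    cv::Mat dst(src.num_cols, src.num_rows, CV_64F);
    for(int i=0; i<src.num_rows; i++)
        for(int j=0; j<src.num_cols; j++)
            dst.at<double>(j, i) = src(i, j);
    return dst;
}
```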
Python Modular
Next, let's use SHOGUN's Python bindings. Two flavors are provided, python_static and python_modular; I will use python_modular since its interface is smarter.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import modshogun as sg
import numpy as np
import matplotlib.pyplot as plt
def classifier():
train_datafile = sg.CSVFile('traindata.dat')
train_labelsfile = sg.CSVFile('labeldata.dat')
test_datafile = sg.CSVFile('testdata.dat')
train_features = sg.RealFeatures(train_datafile)
train_labels = sg.BinaryLabels(train_labelsfile)
test_features = sg.RealFeatures(test_datafile)
print('training ...')
width = 2.1
kernel = sg.GaussianKernel(train_features, train_features, width)
C = 1.0
svm = sg.LibSVM(C, kernel, train_labels)
svm.train()
sv = svm.get_support_vectors()
bias = svm.get_bias()
print('total sv:%s, bias:%s' % (len(sv), bias))
predict_labels = svm.apply(train_features)
measure = sg.ErrorRateMeasure()
measure.evaluate(predict_labels, train_labels)
print('accuracy: %s%%' % (measure.get_accuracy()*100))
print('testing ...')
test_labels = svm.apply(test_features)
print(test_labels.get_labels())
if __name__=='__main__':
classifier()
training ...
total sv:37, bias:-0.428868128708
accuracy: 99.75%
testing ...
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
... (output abridged)
I got the same result as with the C++ (libshogun) version. Let's visualize the classification result using matplotlib.
import modshogun as sg
import numpy as np
import matplotlib.pyplot as plt
##Get classification boundaries
def getboundary(plotrange, classifier):
x = np.arange(plotrange[0], plotrange[1], .1)
y = np.arange(plotrange[2], plotrange[3], .1)
xx, yy = np.meshgrid(x, y)
gridmatrix = np.vstack((xx.flatten(), yy.flatten()))
gridfeatures = sg.RealFeatures(gridmatrix)
gridlabels = classifier.apply(gridfeatures)
zz = gridlabels.get_labels().reshape(xx.shape)
return (xx, yy, zz)
##Get the classification boundary by specifying the drawing range and the trained classifier (svm from the script above)
xx, yy, zz = getboundary([-4,8,-4,5], svm)
##Draw the classification boundary (contour levels in increasing order)
plt.contour(xx, yy, zz, [-1,1])
plt.show()
For now, I have walked through basic usage from C++ and Python. SHOGUN implements not only SVMs but a variety of machine learning algorithms, so I would like to continue verifying it with more practical tasks.