Benchmarking with matrix products: NumPy, Numba, Cython, Swig, OpenCL, Intel MKL

I'm interested in putting deep learning into an embedded-class environment, so I'm looking for an approach that is easy to implement and fast to execute. As a small benchmark, I computed the product of two N×N matrices with NumPy, Numba, Cython, Swig, OpenCL, and Intel MKL.

The Swig version implicitly casts float to double internally, and a few leftover notes from checking the implementation still linger, but please bear with that. Pull requests adding more comparison targets or results from other environments are welcome.

NumPyCuPy_test/Test01.ipynb at master · Chachay/NumPyCuPy_test

The matrices being multiplied this time

Two single-precision N×N matrices, ma and mb.

Matrix definition


import numpy as np

N = 1000  # Matches N in the table below
# Generate two NxN matrices with random entries
ma = np.random.rand(N, N).astype(np.float32)
mb = np.random.rand(N, N).astype(np.float32)
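As a rough sketch of how numbers like those below can be collected: the linked notebook presumably uses IPython's %timeit, but a plain wall-clock helper works the same way. This helper is my own illustration, not the author's harness.

import time

def bench(f, *args, repeat=10):
    # Call once first to exclude one-time costs such as JIT compilation
    f(*args)
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        f(*args)
        best = min(best, time.perf_counter() - t0)
    return best  # best-of-repeat seconds per call

# e.g. bench(np.dot, ma, mb)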

Result Summary

Calling NumPy's dot from inside a Numba-jitted function is the overwhelming winner. But could the execution result be getting cached?

                                      N=3               N=100              N=1000
                                time[us]   speed   time[us]    speed   time[ms]   speed
1 Numpy Native                      1.34    1.00       94.2     1.00       87.9    1.00
2 Numba Simple Mult                 1.24    1.08    1,040.0     0.09    6,870.0    0.01
3 Numba Numpy                       2.03    0.66       74.6     1.26       23.5    3.74
4 Cython Simple Mult               19.20    0.07  660,000.0     0.00         NA       -
5 Cython Numpy                      1.52    0.88       94.1     1.00       87.8    1.00
6 Swig Simple Mult                  1.97    0.68      837.0     0.11    7,450.0    0.01
7 Swig MKL cblas_dgemm              2.57    0.52      123.0     0.77       95.8    0.92
8 OpenCL Simple Mult (Parallel) 2,560.00    0.00    3,160.0     0.03      877.0    0.10

speed is Numpy Native's time divided by each method's time, so higher is faster. Note that the N=1000 columns are in ms, not us.

[Memo] I should have taken notes while running this; the relative ordering looks odd. I will try again.

Numpy

See the references.

np.dot(ma, mb)
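One detail worth checking (my own note, not from the original benchmark): with float32 inputs, np.dot stays in single precision and, when NumPy is linked against a BLAS, dispatches to the sgemm routine. This contrasts with the Swig version below, which silently promotes to double.

c = np.dot(ma, mb)
print(c.dtype)  # float32 -- no promotion to float64 happens here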

Numba

I prepare Num_NpDot, which is just np.dot wrapped with @jit, and Num_Dot, which performs matrix multiplication by the textbook definition.

import numpy as np
from numba import jit

@jit
def Num_NpDot(a, b):
    # Delegate to NumPy's dot (BLAS) from inside a jitted function
    return np.dot(a, b)

@jit
def Num_Dot(a, b):
    # Textbook triple-loop matrix multiplication
    c = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                c[i, j] += a[i, k] * b[k, j]
    return c
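Regarding the caching question above: Numba compiles a function on its first call, so if that first (compiling) call is included in the measurement, it dominates the timing; the results themselves are not cached between calls. A warm-up call before measuring, as sketched below, separates the two effects.

Num_NpDot(ma, mb)  # first call triggers JIT compilation (slow)
Num_NpDot(ma, mb)  # subsequent calls run the already-compiled code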

Cython

I expected to beat Numba here, since I was taking care to declare types and let Cython translate everything to C, but the result came out as shown above, and I was quite disappointed.

The functions mirror the Numba pair. The build is VC2015 with O2-equivalent optimization (or it should be).

cimport numpy as np
cimport cython
import numpy as np

cpdef np.ndarray Cy_NpDot(np.ndarray a, np.ndarray b):
    # Delegate to NumPy's dot (BLAS)
    return np.dot(a, b)

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
cpdef np.ndarray Cy_Dot(np.ndarray a, np.ndarray b):
    cdef np.ndarray c
    c = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                c[i, j] += a[i, k] * b[k, j]
    return c

Cy_Dot turned out to be unusably slow even though it went through Cython, which left me at a loss. (The reason is that with arguments typed only as generic np.ndarray, every element access still goes through the Python C-API; typed buffer declarations or memoryviews would be needed for the loops to compile down to plain C.)

Swig

I've been relying on Swig a lot recently: work in Python, but write the performance-critical parts in C++, and use Python only for data handling.

As with Numba, I prepared the naive multiplication Swig_Dot, plus Swig_Dot_MKL, which calls Intel MKL as the entry meant to be on par with NumPy. The build is VC2015 with O2-equivalent optimization (or it should be).

Although the inputs are single precision, I wrote the C++ side with double. Oops. (This is especially questionable for the MKL entry, since dgemm does double-width work compared to sgemm, but I'll close my eyes to it for now.)

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include "mkl.h"

// Naive triple-loop multiplication: (mm1 x mn1) * (mm2 x mn2)
void Swig_Dot(int mm1, int mn1, double* mat1,
              int mm2, int mn2, double* mat2,
              double** outmat, int* mm, int* mn)
{
    double* arr = (double*)calloc(mm1 * mn2, sizeof(double));

    // Multiply mat1 and mat2, accumulating into arr
    for(int i = 0; i < mm1; ++i)
        for(int j = 0; j < mn2; ++j)
            for(int k = 0; k < mn1; ++k)
                arr[i*mn2+j] += mat1[i*mn1+k] * mat2[k*mn2+j];
    *mm = mm1;
    *mn = mn2;
    *outmat = arr;
}

// Same product via Intel MKL's double-precision GEMM
void Swig_Dot_MKL(int mm1, int mn1, double* mat1,
              int mm2, int mn2, double* mat2,
              double** outmat, int* mm, int* mn)
{
    double alpha = 1.0;
    double beta = 0.0;

    double *C = (double *)mkl_malloc( mm1*mn2*sizeof( double ), 64 );
    if (C == NULL) {
        printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
        return;
    }

    // Not strictly necessary: beta = 0.0 makes dgemm overwrite C anyway
    for (int i = 0; i < (mm1*mn2); i++) {
        C[i] = 0.0;
    }
    // C = alpha * mat1 * mat2 + beta * C (row-major, no transposes)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 
                mm1, mn2, mn1, alpha, mat1, mn1, mat2, mn2, beta, C, mn2);

    // Copy into plain malloc'ed memory so the caller can free() it
    double *res = (double *)malloc( mm1*mn2*sizeof( double ));
    memcpy(res, C, mm1*mn2*sizeof( double ));
    *mm = mm1;
    *mn = mn2;
    *outmat = res;
    mkl_free(C);
}
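For reference, the (int, int, double*) input triples and the (double**, int*, int*) output triple match the numpy.i typemap patterns (DIM1, DIM2, IN_ARRAY2) and (ARGOUTVIEWM_ARRAY2, DIM1, DIM2), so from Python the wrapped calls would take and return NumPy arrays directly. A sketch of the Python side, assuming the generated module is named swig_mat (my placeholder, not the name in the original repository):

import numpy as np
import swig_mat  # hypothetical name of the SWIG-generated module

# numpy.i-style typemaps convert the inputs to contiguous float64 arrays,
# which is the implicit float -> double cast mentioned at the top
c_naive = swig_mat.Swig_Dot(ma, mb)
c_mkl = swig_mat.Swig_Dot_MKL(ma, mb)
print(np.allclose(c_naive, c_mkl))  # the two paths should agree numerically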

OpenCL

A bonus entry. Think of it as a stub for comparison with the GPU. I wanted to compare against CuPy as well, but that would mostly turn into a contest of raw hardware muscle.

(Code omitted.)
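Since the code is omitted, here is a minimal sketch of what a naive parallel multiply can look like with PyOpenCL, one work-item per output element. This is my own illustration under those assumptions, not the author's omitted kernel.

import numpy as np
import pyopencl as cl

# Same inputs as in the matrix definition section
N = 1000
ma = np.random.rand(N, N).astype(np.float32)
mb = np.random.rand(N, N).astype(np.float32)

kernel = """
__kernel void matmul(__global const float *a, __global const float *b,
                     __global float *c, const int n)
{
    int i = get_global_id(0);   // one work-item per output element
    int j = get_global_id(1);
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += a[i * n + k] * b[k * n + j];
    c[i * n + j] = acc;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, kernel).build()

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=ma)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=mb)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, ma.nbytes)

prg.matmul(queue, (N, N), None, a_buf, b_buf, c_buf, np.int32(N))
mc = np.empty_like(ma)
cl.enqueue_copy(queue, mc, c_buf)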

References
