This article is the 22nd day of Furukawa Lab Advent Calendar. I am an alumnus of Furukawa Lab, but I was invited by a student to participate. Thank you.
This time, I would like to explain the sparse matrix and its expression format while also describing the implementation by scipy.sparse
.
This article is for people who have become accustomed to numpy
to some extent, but haven't touched on sparse matrix-related things yet ... I just want to know what it is!"
In this article, we will focus on what values each representation format uses to represent a sparse matrix, and finally compare the memory size and calculation speed as a simple experiment.
If you want to know more about scipy.sparse
than this article, please click here.
https://docs.scipy.org/doc/scipy/reference/sparse.html
This article uses Python's NumPy and SciPy for explanation, but the concept and expression format of sparse matrices are not limited to these languages and libraries, but are widely used.
A matrix with many zero elements is called a ** sparse matrix **. Sparse matrices often appear in real data. For example, if you try to express the data of "which person bought which product and how many" on the EC site in a straightforward matrix, the number of purchases will be stored in a dodeca matrix of total number of users x total number of products. Of course, most of the elements of this matrix will be 0.
The following is the image (originally, the number of users and products is larger, and the ratio of zero elements is much larger). This data will be used as sample data for the following explanation.
The above data can be created with a Numpy array as follows.
import numpy as np
data_np = np.array([[0, 0, 0, 2, 5],
[9, 0, 1, 8, 0],
[0, 0, 6, 0, 0],
[0, 4, 7, 0, 3]])
However, if the zero element is included in this way, there will be a lot of waste in terms of memory and calculation.
It is efficient to express information by focusing only on non-zero elements (non-zero elements / non-zero elements) in a sparse matrix. There are various forms of expression for sparse matrices, and it is necessary to use them properly according to the purpose. In this article, we will focus on the COO format, CSR format, and CSC format, and introduce their information expression methods and advantages. (By the way, I think the most used format will be the CSR format.)
If you want to know the initialization method and advantages / disadvantages that are not explained in this article, please refer to the following. https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html
The most intuitive format is the coordinate format (COOrdinate format: COO format). A sparse matrix is represented by three one-dimensional arrays.
One is simply an array of nonzero element values. (The COO format itself can be arranged in any order, but for the sake of the behavior of scipy.sparse
and later explanation, it is arranged in the order of the green line in the figure.)
The other two show which index the value of each nonzero element is in. These two arrays represent the "coordinates" of the nonzero element.
In other words 0th user is buying 2 3rd products 0th user is buying 5 4th products No. 1 user is buying 9 No. 0 products … It is an information expression like this.
You can easily convert to COO format by using scipy.sparse.coo_matrix ()
.
from scipy import sparse
data_coo = sparse.coo_matrix(data_np)
print(data_coo)
output
(0, 3) 2
(0, 4) 5
(1, 0) 9
(1, 2) 1
(1, 3) 8
(2, 2) 6
(3, 1) 4
(3, 2) 7
(3, 4) 3
I will also display the information shown in the figure.
print(f"row: {data_coo.row}")
print(f"col: {data_coo.col}")
print(f"data: {data_coo.data}")
output
row: [0 0 1 1 1 2 3 3 3]
col: [3 4 0 2 3 2 1 2 4]
data: [2 5 9 1 8 6 4 7 3]
The data as shown in the figure is stored.
You can return to the normal matrix representation with scipy.sparse.coo_matrix.todense ()
.
print(data_coo.todense())
output
[[0 0 0 2 5]
[9 0 1 8 0]
[0 0 6 0 0]
[0 4 7 0 3]]
It is a natural expression in relational databases, and many of the data provided as datasets are in this format. For example, the movie evaluation data MovieLens 100K Dataset can be easily read in COO format as follows.
import pandas as pd
df = pd.read_table('ml-100k/u.data', names=['user_id', 'movie_id', 'rating', 'timestamp'])
ml100k_coo = sparse.coo_matrix((df.rating, (df.user_id-1, df.movie_id-1)))
print(f'number of nonzero: {ml100k_coo.nnz}')
print(f'shape: {ml100k_coo.shape}')
output
number of nonzero: 100000
shape: (943, 1682)
MovieLens 100K Dataset is the data evaluated by 943 users for 1682 movies, and since there are 100000 non-zero elements, it seems that it can be read properly. (However, please note that it will be "** a matrix containing 0 evaluation values other than non-zero elements **". "In the MovieLens data, the 0 part of this matrix is regarded as a missing value. It is necessary for the analyst to handle the data analysis convenience such as "properly handle it properly".)
MovieLens 100K is data in which user_id and movie_id happen to be serial numbers, and can be easily converted to index. (Specifically, id is a serial number starting from 1, so if you subtract 1 and start from 0, it will be treated like an index). If you want to create an index properly without relying on such a lucky case, the code will be as follows.
import pandas as pd
df = pd.read_table('ml-100k/u.data', names=['user_id', 'movie_id', 'rating', 'timestamp'])
user_id_categorical = pd.api.types.CategoricalDtype(categories=sorted(df.user_id.unique()), ordered=True)
user_index = df.user_id.astype(user_id_categorical).cat.codes
movie_id_categorical = pd.api.types.CategoricalDtype(categories=sorted(df.movie_id.unique()), ordered=True)
movie_index = df.movie_id.astype(movie_id_categorical).cat.codes
ml100k_coo = sparse.coo_matrix((df.rating, (user_index, movie_index)))
Reference: [Convert DataFrame to memory saving with Python](https://developers.microad.co.jp/entry/2019/05/10/180000#pandasCategorical%E3%81%AE%E6%B4%BB % E7% 94% A8)
In addition, the COO format can be converted to the CSR format and CSC format at high speed.
I will repost it because it is easier to understand if I proceed from the COO format.
If you look at row
, you can see that the same numbers are lined up in succession.
The format in which this information is further compressed is called the Compressed Sparse Row format: CSR format.
(Compressed Row Storage format: It seems that it is also called CRS format. Translated literally, it is a compressed row storage format.)
I will explain the image of how to compress.
Let's draw a dividing line at the turn of row
.
The information of this dividing line is a compressed representation of row
.
The part enclosed by the delimiter represents the information on the 0th line, 1st line, 2nd line, and 3rd line, respectively.
Use this instead of row
.
To summarize the above, the CSR format is represented by the following three one-dimensional arrays. The separator line information is called a row index pointer because it is a collection of pointers where the row indices change. In scipy.sparse.csr_matrix, the variable name is ʻindptr` for short.
You can easily create it with scipy.sparse.csr_matrix ()
in the same way as scipy.sparse.coo_matrix ()
.
data_csr = sparse.csr_matrix(data_np)
print(f"index pointer: {data_csr.indptr}")
print(f"indices: {data_csr.indices}")
print(f"data: {data_csr.data}")
output
index pointer: [0 2 5 6 9]
indices: [3 4 0 2 3 2 1 2 4]
data: [2 5 9 1 8 6 4 7 3]
Alternatively, you can create it with data_csr = data_coo.tocsr ()
.
In this way, scipy.sparse
s can be converted to each other by the method.toxxx ()
.
I am good at processing that is performed line by line, such as slicing lines. Matrix multiplication (to be clear, the product of CSR-style sparse matrix and vector) has the greatest advantage. Let's check the contents to see what the implementation is like. https://github.com/scipy/scipy/blob/41800a2fc6f86446c7fe0248748bfb371e37cd04/scipy/sparse/sparsetools/csr.h#L1100-L1137
csr.h
template <class I, class T>
void csr_matvec(const I n_row,
const I n_col,
const I Ap[],
const I Aj[],
const T Ax[],
const T Xx[],
T Yx[])
{
for(I i = 0; i < n_row; i++){
T sum = Yx[i];
for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
sum += Ax[jj] * Xx[Aj[jj]];
}
Yx[i] = sum;
}
}
ʻAp is the row index pointer, ʻAj
is the column indices, ʻAxis the non-zero data,
Xx is the vector to be multiplied by the sparse matrix, and
Yx` is the vector of the calculation result.
As you can see by writing it slowly on paper, each data in CSR format is handled efficiently.
The row index pointer expresses the range of information in each row well, and the elements of the vector multiplied by the column indices can be pinpointed.
The expressions are such that the roles of rows and columns in the CSR format are interchanged. Compressed Sparse Column format: It's CSC format.
Almost again, let's check what kind of value is included. Read the matrix in the order of the green lines and make it into COO format.
Consider the dividing line at the transition of col
.
Use this as a col index pointer and use it instead of col
.
Therefore, the CSC format is expressed as follows.
Check with scipy.sparse.csc_matrix ()
.
data_csc = sparse.csc_matrix(data_np)
print(f"index pointer: {data_csc.indptr}")
print(f"indices: {data_csc.indices}")
print(f"data: {data_csc.data}")
output
index pointer: [0 1 2 5 7 9]
indices: [1 3 1 2 3 0 1 0 3]
data: [9 4 1 6 7 2 8 5 3]
Certainly, the information as shown is stored.
I am good at column slicing. Also, it seems that the matrix vector product is faster, though not as much as the CSR format.
Let's compare the expression format for sparse matrices introduced so far with the original simple matrix expression. First, using the MovieLens 100K Dataset used in the explanation of the advantages of the COO format earlier, Let's check the memory size of each format, especially the total number of bytes in the array.
ml100k_csr = ml100k_coo.tocsr()
ml100k_csc = ml100k_coo.tocsc()
ml100k_np = ml100k_coo.todense()
print(f'np: {ml100k_np.nbytes}')
print(f'coo: {ml100k_coo.data.nbytes + ml100k_coo.col.nbytes + ml100k_coo.row.nbytes}')
print(f'csr: {ml100k_csr.data.nbytes + ml100k_csr.indices.nbytes + ml100k_csr.indptr.nbytes}')
print(f'csc: {ml100k_csc.data.nbytes + ml100k_csc.indices.nbytes + ml100k_csc.indptr.nbytes}')
output
np: 12689008
coo: 1600000
csr: 1203776
csc: 1206732
If the expression format is suitable for a sparse matrix, the memory size will be much smaller. The smallest memory required for this data is the CSR format (although by a small margin).
Next, multiply the vector containing random numbers from the right and measure the calculation time of the matrix vector product.
x = np.random.randint(1, 10, 1682)
%timeit ml100k_np @ x
%timeit ml100k_coo @ x
%timeit ml100k_csr @ x
%timeit ml100k_csc @ x
output
3.2 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
159 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
92.6 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
140 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Certainly the CSR format seems to be the fastest!
We explained the representation format of sparse matrices and made a simple comparison. I'm glad if you can use it as a reference. Have a good sparse life!
The environment used in this article is as follows.
python = '3.7.0'
scipy = '1.3.3'
numpy = '1.17.3'
pandas = '0.25.3'
Recommended Posts