Memory-saving matrix conversion of log data

What we want to do

Suppose we have log data keyed hierarchically by values such as user and item, as shown below. (Substitute whatever plays the role of user and item in your own data.)

>>> df = pd.DataFrame([['user1','item1',5],['user1','item2',4],['user2','item2',5],['user2','item3',6],['user3','item4',3]], columns=['username','itemname','rate'])
>>> df
  username itemname  rate
0    user1    item1     5
1    user1    item2     4
2    user2    item2     5
3    user2    item3     6
4    user3    item4     3

To convert this into matrix form, you can write the following in pandas.

>>> df.groupby(['itemname', 'username']).mean().unstack(fill_value=0).values
array([[5, 0, 0],
       [4, 5, 0],
       [0, 6, 0],
       [0, 0, 3]])


Doing this with pandas requires holding all of the target data in a single DataFrame, so with large data the conversion itself can fail with a memory error. And if the data does not fit in memory to begin with, it has to be read in chunks, but the groupby/unstack approach cannot be applied chunk by chunk.

The motivation here is therefore to perform the same conversion without relying on pandas for the pivot.


In short, the approach is to convert each key to an integer index and keep the values and their indices as lists.

Index conversion

Convert each key value to an integer index with label encoding, as follows.

>>> from sklearn.preprocessing import LabelEncoder
>>> le_username = LabelEncoder()
>>> le_itemname = LabelEncoder()
>>> df['username'] = le_username.fit_transform(df['username'])
>>> df['itemname'] = le_itemname.fit_transform(df['itemname'])
>>> df
   username  itemname  rate
0         0         0     5
1         0         1     4
2         1         1     5
3         1         2     6
4         2         3     3
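
Once the keys are replaced by indices, the original labels are only recoverable through the fitted encoders, which is why each key needs its own encoder. The following is a minimal sketch, using a small hand-written item list rather than the DataFrame above, of how the labels can be looked up again through `classes_` and `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

le_itemname = LabelEncoder()
codes = le_itemname.fit_transform(['item1', 'item2', 'item2', 'item3', 'item4'])

# classes_ holds the original labels; position i is the label encoded as i.
labels = le_itemname.classes_.tolist()

# inverse_transform maps indices back to labels, e.g. to name matrix rows.
row_names = le_itemname.inverse_transform([0, 3]).tolist()

print(codes.tolist())
print(labels)
print(row_names)
```

Keeping the fitted encoders around makes it possible to tell which user and item a given matrix cell refers to.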


To record which item index of which user index has which value, extract each column as a list, as follows.

>>> row = df['itemname'].values.tolist()
>>> col = df['username'].values.tolist()
>>> value = df['rate'].values.tolist()
>>> row
[0, 1, 1, 2, 3]
>>> col
[0, 0, 1, 1, 2]
>>> value
[5, 4, 5, 6, 3]

If you read the data in chunks, extend these lists with each chunk's values at this point. The index assigned to each key must stay the same across chunks.
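
Chunked reading could look roughly like the sketch below. It is not the method used above: instead of `LabelEncoder`, it assumes plain dictionaries that hand out the next free index the first time a key appears, so that the mapping stays consistent across chunks. The inline CSV text, the `chunksize` of 2, and the names `user_index` and `item_index` are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a log file too large to load at once.
csv_text = """username,itemname,rate
user1,item1,5
user1,item2,4
user2,item2,5
user2,item3,6
user3,item4,3
"""

user_index = {}  # username -> column index, grown as new users appear
item_index = {}  # itemname -> row index, grown as new items appear
row, col, value = [], [], []

# chunksize=2 is only for illustration; pick a size that fits in memory.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for user, item, rate in zip(chunk['username'], chunk['itemname'], chunk['rate']):
        # setdefault assigns the next free index the first time a key is seen.
        col.append(user_index.setdefault(user, len(user_index)))
        row.append(item_index.setdefault(item, len(item_index)))
        value.append(int(rate))

print(row, col, value)
```

Note that indices assigned this way follow order of first appearance, whereas `LabelEncoder` assigns them in sorted label order, so the two only coincide when keys happen to arrive sorted, as they do in this sample.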

Convert to matrix

Convert the lists above into a sparse matrix.

>>> from scipy.sparse import coo_matrix
>>> matrix = coo_matrix((value, (row, col)))
>>> matrix
<4x3 sparse matrix of type '<class 'numpy.int64'>' with 5 stored elements in COOrdinate format>

Converting it to a dense array gives the following, which is the desired matrix.

>>> matrix.toarray()
array([[5, 0, 0],
       [4, 5, 0],
       [0, 6, 0],
       [0, 0, 3]])
