I want to write a numpy array np.random.rand(100000, 99999).astype(dtype='float32'),
which doesn't fit in memory, to a single file and read it back.
I want to do something like this:
import numpy as np
arr = np.random.rand(100000, 99999).astype(dtype='float32')
np.save('np_save.npy', arr)
But `arr` is about 40GB, so it doesn't fit in memory and the process is killed:
>>> import numpy as np
>>> arr = np.random.rand(100000, 99999).astype(dtype='float32')
Killed: 9
Instead of generating all 100000 rows at once, let's generate the array in chunks and append them: 1000 iterations of 100 rows each.
Appending each chunk to a list, stacking with np.vstack, and then calling np.save works fine when everything fits in memory, but here it is also killed, because the full 40GB still has to be materialized in memory:
>>> arr_list = []
>>> for i in range(1000):
...     arr = np.random.rand(100, 99999).astype(dtype='float32')
...     arr_list.append(arr)
...
>>> np.save('np_save.npy', np.vstack(arr_list))
Killed: 9
With np.memmap you can write to a file without holding the whole array in memory. First create the memory-mapped file by specifying the full shape, then write each chunk into its slice inside the loop.
import numpy as np

# Create the file on disk with the full shape, then release the handle.
fp = np.memmap('np_save.dat', dtype='float32', mode='w+', shape=(100000, 99999))
del fp

for i in range(1000):
    arr = np.random.rand(100, 99999).astype(dtype='float32')
    # Reopen the file, write this 100-row chunk into its slice, then release it.
    fp = np.memmap('np_save.dat', dtype='float32', mode='r+', shape=(100000, 99999))
    fp[i*100:(i+1)*100] = arr
    del fp
This keeps memory usage low, since only one 100-row chunk is held in memory at a time.
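Depending on how your OS handles the page cache, it may also be enough to keep a single memmap open and flush after each chunk instead of reopening the file every iteration. This is just a sketch of that variant, not something I have benchmarked against the version above:

import numpy as np

fp = np.memmap('np_save.dat', dtype='float32', mode='w+', shape=(100000, 99999))
for i in range(1000):
    arr = np.random.rand(100, 99999).astype(dtype='float32')
    fp[i*100:(i+1)*100] = arr
    fp.flush()  # push this chunk to disk so the OS can evict the pages
del fp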
The nice thing about np.memmap is that opening the file is instantaneous and you can read just a slice of the data on disk; only the part you actually use is loaded into memory.
>>> import numpy as np
>>> newfp = np.memmap('np_save.dat', dtype='float32', mode='r', shape=(100000, 99999))
>>> newfp
memmap([[0.9187665 , 0.7200832 , 0.36402512, ..., 0.88297397, 0.90521574,
0.80509114],
[0.56751174, 0.06882019, 0.7239285 , ..., 0.04442023, 0.26091048,
0.07676199],
[0.94537544, 0.06935687, 0.13554753, ..., 0.3666087 , 0.6137967 ,
0.1677396 ],
...,
[0.97691774, 0.08142337, 0.18367094, ..., 0.30060798, 0.5744949 ,
0.3676454 ],
[0.13025525, 0.36571756, 0.17128325, ..., 0.8568927 , 0.9460005 ,
0.61096454],
[0.22953852, 0.00882731, 0.15313177, ..., 0.90694803, 0.17832297,
0.45886058]], dtype=float32)
>>> newfp[30:40]
memmap([[0.23530068, 0.64027864, 0.7193347 , ..., 0.0991324 , 0.71703744,
0.0429478 ],
[0.7835149 , 0.20407963, 0.23541291, ..., 0.12721954, 0.6010988 ,
0.8794886 ],
[0.62712234, 0.76977116, 0.4803686 , ..., 0.24469836, 0.7741827 ,
0.14326976],
...,
[0.4855294 , 0.15554492, 0.6792018 , ..., 0.23985237, 0.59824246,
0.88751584],
[0.80865365, 0.73577726, 0.9209202 , ..., 0.9908406 , 0.66778165,
0.16570805],
[0.63171065, 0.48431855, 0.57776374, ..., 0.76027304, 0.930301 ,
0.20145524]], dtype=float32)
>>> arr = np.array(newfp[30:40])
>>> arr
array([[0.23530068, 0.64027864, 0.7193347 , ..., 0.0991324 , 0.71703744,
0.0429478 ],
[0.7835149 , 0.20407963, 0.23541291, ..., 0.12721954, 0.6010988 ,
0.8794886 ],
[0.62712234, 0.76977116, 0.4803686 , ..., 0.24469836, 0.7741827 ,
0.14326976],
...,
[0.4855294 , 0.15554492, 0.6792018 , ..., 0.23985237, 0.59824246,
0.88751584],
[0.80865365, 0.73577726, 0.9209202 , ..., 0.9908406 , 0.66778165,
0.16570805],
[0.63171065, 0.48431855, 0.57776374, ..., 0.76027304, 0.930301 ,
0.20145524]], dtype=float32)
I think this comes in handy when dealing with datasets that are too large for memory in machine learning.
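For example, here is a minimal sketch of feeding such a memmap to a training loop batch by batch (the batch size and the train_step call are just placeholders):

import numpy as np

newfp = np.memmap('np_save.dat', dtype='float32', mode='r', shape=(100000, 99999))
batch_size = 100  # hypothetical batch size
for start in range(0, newfp.shape[0], batch_size):
    batch = np.array(newfp[start:start + batch_size])  # copy only this batch into memory
    # train_step(batch)  # placeholder for your training code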
There is also np.savetxt as a way to append directly to a file, but I don't recommend it: it is very slow and produces a text file far larger than the original binary data:
with open('np_save.dat', 'ab') as f:
    for i in range(1000):
        arr = np.random.rand(100, 99999).astype(dtype='float32')
        np.savetxt(f, arr)
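Incidentally, if you want to end up with a regular .npy file rather than a raw binary dump, np.lib.format.open_memmap should let you do the same chunked writing; I'm treating this as an untested variant of the code above. It writes the .npy header for you, returns a memmap, and the result can later be opened lazily with np.load(..., mmap_mode='r'):

import numpy as np

# Create a .npy file on disk and memory-map it (sketch; not benchmarked here).
fp = np.lib.format.open_memmap('np_save.npy', mode='w+', dtype='float32',
                               shape=(100000, 99999))
for i in range(1000):
    fp[i*100:(i+1)*100] = np.random.rand(100, 99999).astype(dtype='float32')
del fp

# Later, read it back without loading everything into memory.
newfp = np.load('np_save.npy', mmap_mode='r')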