I want to write a numpy array np.random.rand(100000, 99999).astype(dtype='float32'),
which doesn't fit in memory, to a single file and read it back.
I want to do something like this:
import numpy as np
arr = np.random.rand(100000, 99999).astype(dtype='float32')
np.save('np_save.npy', arr)
But `arr` is about 40GB, so it doesn't fit in memory and the process is killed:
>>> import numpy as np
>>> arr = np.random.rand(100000, 99999).astype(dtype='float32')
Killed: 9
Instead of generating all 100000 rows at once, let's generate the array in chunks and append them: 1000 iterations of 100 rows each.
Appending each chunk to a list, stacking with np.vstack, and then calling np.save works fine when everything fits in memory, but here it is also killed, because the full 40GB still has to be materialized in memory:
>>> arr_list = []
>>> for i in range(1000):
...     arr = np.random.rand(100, 99999).astype(dtype='float32')
...     arr_list.append(arr)
...
>>> np.save('np_save.npy', np.vstack(arr_list))
Killed: 9
With np.memmap you can write to a file without holding the whole array in memory. First create the memory-mapped file by specifying the full shape, then write each chunk into its slice inside the loop.
import numpy as np

# Create the file on disk with the full shape, then release the handle.
fp = np.memmap('np_save.dat', dtype='float32', mode='w+', shape=(100000, 99999))
del fp

for i in range(1000):
    arr = np.random.rand(100, 99999).astype(dtype='float32')
    # Reopen the file, write this 100-row chunk into its slice, then release it.
    fp = np.memmap('np_save.dat', dtype='float32', mode='r+', shape=(100000, 99999))
    fp[i*100:(i+1)*100] = arr
    del fp
This keeps memory usage low, since only one 100-row chunk is held in memory at a time.
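Depending on how your OS handles the page cache, it may also be enough to keep a single memmap open and flush after each chunk instead of reopening the file every iteration. This is just a sketch of that variant, not something I have benchmarked against the version above:

import numpy as np

fp = np.memmap('np_save.dat', dtype='float32', mode='w+', shape=(100000, 99999))
for i in range(1000):
    arr = np.random.rand(100, 99999).astype(dtype='float32')
    fp[i*100:(i+1)*100] = arr
    fp.flush()  # push this chunk to disk so the OS can evict the pages
del fp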
The nice thing about np.memmap is that opening the file is instantaneous and you can read just a slice of the data on disk; only the part you actually use is loaded into memory.
>>> import numpy as np
>>> newfp = np.memmap('np_save.dat', dtype='float32', mode='r', shape=(100000, 99999))
>>> newfp
memmap([[0.9187665 , 0.7200832 , 0.36402512, ..., 0.88297397, 0.90521574,
0.80509114],
[0.56751174, 0.06882019, 0.7239285 , ..., 0.04442023, 0.26091048,
0.07676199],
[0.94537544, 0.06935687, 0.13554753, ..., 0.3666087 , 0.6137967 ,
0.1677396 ],
...,
[0.97691774, 0.08142337, 0.18367094, ..., 0.30060798, 0.5744949 ,
0.3676454 ],
[0.13025525, 0.36571756, 0.17128325, ..., 0.8568927 , 0.9460005 ,
0.61096454],
[0.22953852, 0.00882731, 0.15313177, ..., 0.90694803, 0.17832297,
0.45886058]], dtype=float32)
>>> newfp[30:40]
memmap([[0.23530068, 0.64027864, 0.7193347 , ..., 0.0991324 , 0.71703744,
0.0429478 ],
[0.7835149 , 0.20407963, 0.23541291, ..., 0.12721954, 0.6010988 ,
0.8794886 ],
[0.62712234, 0.76977116, 0.4803686 , ..., 0.24469836, 0.7741827 ,
0.14326976],
...,
[0.4855294 , 0.15554492, 0.6792018 , ..., 0.23985237, 0.59824246,
0.88751584],
[0.80865365, 0.73577726, 0.9209202 , ..., 0.9908406 , 0.66778165,
0.16570805],
[0.63171065, 0.48431855, 0.57776374, ..., 0.76027304, 0.930301 ,
0.20145524]], dtype=float32)
>>> arr = np.array(newfp[30:40])
>>> arr
array([[0.23530068, 0.64027864, 0.7193347 , ..., 0.0991324 , 0.71703744,
0.0429478 ],
[0.7835149 , 0.20407963, 0.23541291, ..., 0.12721954, 0.6010988 ,
0.8794886 ],
[0.62712234, 0.76977116, 0.4803686 , ..., 0.24469836, 0.7741827 ,
0.14326976],
...,
[0.4855294 , 0.15554492, 0.6792018 , ..., 0.23985237, 0.59824246,
0.88751584],
[0.80865365, 0.73577726, 0.9209202 , ..., 0.9908406 , 0.66778165,
0.16570805],
[0.63171065, 0.48431855, 0.57776374, ..., 0.76027304, 0.930301 ,
0.20145524]], dtype=float32)
I think this comes in handy when dealing with datasets that are too large for memory in machine learning.
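For example, here is a minimal sketch of feeding such a memmap to a training loop batch by batch (the batch size and the train_step call are just placeholders):

import numpy as np

newfp = np.memmap('np_save.dat', dtype='float32', mode='r', shape=(100000, 99999))
batch_size = 100  # hypothetical batch size
for start in range(0, newfp.shape[0], batch_size):
    batch = np.array(newfp[start:start + batch_size])  # copy only this batch into memory
    # train_step(batch)  # placeholder for your training code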
There is also np.savetxt as a way to append directly to a file, but I don't recommend it: it is very slow and produces a text file far larger than the original binary data:
with open('np_save.dat', 'ab') as f:
    for i in range(1000):
        arr = np.random.rand(100, 99999).astype(dtype='float32')
        np.savetxt(f, arr)
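Incidentally, if you want to end up with a regular .npy file rather than a raw binary dump, np.lib.format.open_memmap should let you do the same chunked writing; I'm treating this as an untested variant of the code above. It writes the .npy header for you, returns a memmap, and the result can later be opened lazily with np.load(..., mmap_mode='r'):

import numpy as np

# Create a .npy file on disk and memory-map it (sketch; not benchmarked here).
fp = np.lib.format.open_memmap('np_save.npy', mode='w+', dtype='float32',
                               shape=(100000, 99999))
for i in range(1000):
    fp[i*100:(i+1)*100] = np.random.rand(100, 99999).astype(dtype='float32')
del fp

# Later, read it back without loading everything into memory.
newfp = np.load('np_save.npy', mmap_mode='r')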