I had the opportunity to work on improving the calculation efficiency of numpy. In that, I learned the importance of dtype.
Of course, I hope it will be useful for someone as a personal memorandum. What I wrote is this article.
[Addition]
numpy is a numpy.array created by applying the .values
method to the dataframed version of pandas read_csv.
When I write it,
I had to do the process, but just by changing the dtype of numpy I was able to improve it as follows. (How terrible it was at first ...)
item | before | after |
---|---|---|
processing time | 70 minutes | 10 minutes |
Memory used | Over 100GB | Over 30GB |
Below, I will write what kind of changes I made.
** Set dtype of numpy.array to float32 or float64
**
It seems that BLAS is used for matrix calculation of numpy. So, it seems that this guy will do a good job if it is the above data type. (See the article at the top of the reference link)
In my case, it was int at first, but with .astype (np.float32)
The processing time was reduced from 70 minutes to 10 minutes just by changing the type to float32. !!
[Addition] After sleeping overnight and rereading, I thought that the explanation of the situation was insufficient, so I will supplement it a little.
Originally one hot encoded with pd.get_dummies
and then with .values
I took out numpy.ndarray and calculated it.
With this method, the data type will be uint8, but this will be changed to float32. I was able to achieve high speed by changing it.
The code looks like this.
pd.get_dummies([Pandas Series]).astype(np.float32).values
** Set the matrix value to a bool value` if possible **
The above-mentioned floatization made the process explosive, but the memory was consumed.
I was resisting by often del
ing unnecessary objects, but it did not lead to a big improvement.
However, by re-holding the calculation result in bool, I was able to save a lot of memory. (It seems that the memory reserved in advance is different between bool type and int, float type)
[Supplement]
In python, the following relationship holds between 1/0 and True / False
.
So, a matrix that can be represented by 1/0 (for example, a one hot encoding matrix) is It can also be expressed as a bool value.
As mentioned above, we were able to greatly improve efficiency just by combining the two points. In the end, the process flow was like this.
result = arrayA | arrayB | arrayC
LikeThe float seems to be usable in various cases, but is the bool limited? I don't think it can be applied unless the result can be represented by 0/1.
--I also saw an article saying that if you write by specifying the numpy type in Cpython, it will be super fast. However, I didn't actually try it this time. If I have the opportunity, I would like to give it a try. -Intel MKL seems to have better performance, but the execution environment is Intel CPU Not always, so I ran it with OpenBLAS.
Recommended Posts