How I improved performance just by changing the dtype of a NumPy array

Introduction

I had the opportunity to work on making some NumPy calculations more efficient. In the process, I learned how much the dtype matters.

I wrote this article as a personal memorandum, and I hope it will also be useful to someone else.

[Addition] The NumPy arrays discussed here are numpy.ndarray objects obtained through the .values attribute of a pandas DataFrame loaded with read_csv.

Performance before and after

Roughly speaking, I had to do the following:

  1. Perform several matrix multiplications of size (10,000,400) @ (4,550,000)
  2. Take the intersection of them

Just by changing the NumPy dtype, I was able to improve things as follows. (It was pretty terrible at first ...)

| Item | Before | After |
|---|---|---|
| Processing time | 70 minutes | 10 minutes |
| Memory used | Over 100 GB | Over 30 GB |

Below, I describe the changes I made.

What I did

About speeding up

**Set the dtype of the numpy.array to float32 or float64**

NumPy appears to use BLAS for matrix multiplication, and BLAS routines are provided for floating-point types, so the calculation takes the fast path when the arrays have one of the dtypes above. (See the first article in the reference links.)

In my case the arrays were an integer type at first. Just converting them to float32 with .astype(np.float32) reduced the processing time from 70 minutes to 10 minutes!
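As a rough illustration of the point above, here is a minimal sketch that times the same matrix product once with an integer dtype and once with float32. The matrix sizes, the random 0/1 data, and the helper name `time_matmul` are assumptions made for the example, not the original workload, and the actual speed-up will depend on the machine and on the BLAS library NumPy is linked against.

```python
import time

import numpy as np


def time_matmul(a, b):
    """Return (result, elapsed seconds) for a single a @ b."""
    start = time.perf_counter()
    out = a @ b
    return out, time.perf_counter() - start


rng = np.random.default_rng(0)

# Small hypothetical 0/1 matrices; the real ones were far larger.
a_int = rng.integers(0, 2, size=(300, 200))  # default integer dtype
b_int = rng.integers(0, 2, size=(200, 400))

# Same data, integer dtype versus float32.
int_result, int_seconds = time_matmul(a_int, b_int)
f32_result, f32_seconds = time_matmul(
    a_int.astype(np.float32), b_int.astype(np.float32)
)

print(f"int:     {int_seconds:.4f} s")
print(f"float32: {f32_seconds:.4f} s")
```

With 0/1 inputs of this size, every entry of the product is a small whole number, so the float32 result matches the integer result exactly. For much larger values float32 can lose precision, which is worth checking before switching.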

[Addition] After sleeping on it and rereading the article, I felt the explanation of the situation was insufficient, so here is a little more detail.

Originally, I one-hot encoded the data with pd.get_dummies, took out the numpy.ndarray with .values, and ran the calculation on that.

With this approach the dtype ends up as uint8. Changing it to float32 is what made the calculation fast.

The code looks like this.

pd.get_dummies([Pandas Series]).astype(np.float32).values
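The one-liner above can be tried end to end with a small runnable sketch. The Series `colors` is a made-up stand-in for the real data. Note also that the dtype returned by pd.get_dummies depends on the pandas version (uint8 in older releases, bool by default from pandas 2.0), so the sketch prints it rather than assuming it.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column standing in for the real Series.
colors = pd.Series(["red", "green", "blue", "green"])

dummies = pd.get_dummies(colors)
print(dummies.dtypes.unique())  # uint8 or bool, depending on the pandas version

# Convert to float32 before extracting the ndarray used for the matrix product.
one_hot = dummies.astype(np.float32).values
print(one_hot.dtype, one_hot.shape)
```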

About memory saving

**Store the matrix values as bool if possible**

Converting to float made the processing dramatically faster, but it still consumed a lot of memory. I tried to fight this by frequently del-ing objects I no longer needed, but that did not improve things much.

However, by re-storing the calculation results as bool, I was able to save a lot of memory. (A bool element takes 1 byte, whereas a float32 element takes 4 bytes and an int64 or float64 element takes 8.)
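The difference in memory can be checked directly through the `itemsize` and `nbytes` attributes of an array. The following is a minimal sketch with an arbitrary 1000 x 1000 array standing in for a calculation result.

```python
import numpy as np

# Stand-in for a float32 matrix product (the size is arbitrary).
counts = np.ones((1000, 1000), dtype=np.float32)

# Re-store the result as bool: same shape, one byte per element.
flags = counts >= 1

print(counts.dtype, counts.itemsize, counts.nbytes)
print(flags.dtype, flags.itemsize, flags.nbytes)
```

The bool array takes a quarter of the memory of the float32 one, and the saving only materializes once the float32 array is no longer referenced.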

[Supplement] In Python, 1/0 and True/False are related as follows: True == 1 and False == 0 both hold.

(Figure: true_false.png)

So a matrix that can be represented with 1/0 (for example, a one-hot encoded matrix) can also be represented with bool values.
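A short sketch of this equivalence, using a tiny hand-written one-hot matrix (the values are made up for illustration): converting to bool and back loses no information.

```python
import numpy as np

# In plain Python, bool values compare equal to 1 and 0.
print(True == 1, False == 0)

# A tiny hypothetical one-hot matrix stored as uint8.
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0]], dtype=np.uint8)

as_bool = one_hot.astype(bool)           # 1 -> True, 0 -> False
round_trip = as_bool.astype(np.uint8)    # True -> 1, False -> 0

print(as_bool)
print(np.array_equal(round_trip, one_hot))
```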

Combining the two techniques

As described above, I was able to improve efficiency greatly just by combining these two points. In the end, the processing flow looked like this.

  1. Perform the matrix multiplication in float32: `arrayA = float32 matrix @ float32 matrix`
  2. Convert the result to bool: just do `arrayA = (arrayA >= 1)`
  3. Combine the bool arrays element-wise, like `result = arrayA | arrayB | arrayC`. (`|` is element-wise OR; for a strict intersection, element-wise AND with `&` is the operator to use.)
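The flow above can be sketched as follows. The helper `bool_product`, the matrix sizes, and the random 0/1 inputs are assumptions made for the example. Because the text speaks of an intersection while the snippet uses `|`, both the element-wise OR and the element-wise AND are computed here.

```python
import numpy as np


def bool_product(left, right):
    """Multiply in float32, then keep only whether each entry is >= 1."""
    prod = left.astype(np.float32) @ right.astype(np.float32)
    return prod >= 1  # bool array: 1 byte per element


rng = np.random.default_rng(0)

# Three hypothetical pairs of small 0/1 matrices.
left_mats = [rng.integers(0, 2, size=(5, 4)) for _ in range(3)]
right_mats = [rng.integers(0, 2, size=(4, 6)) for _ in range(3)]

array_a, array_b, array_c = (
    bool_product(left, right) for left, right in zip(left_mats, right_mats)
)

result_or = array_a | array_b | array_c    # True where any result is True
result_and = array_a & array_b & array_c   # True where all results are True

print(result_or.dtype, result_or.shape)
```

Converting each product to bool right away means the float32 intermediate can be released before the next product is computed, which is where the memory saving comes from.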

Impressions

Converting to float seems usable in many situations, but bool is more limited: I don't think it can be applied unless the result can be represented as 0/1.

Other

- I also saw an article saying that code becomes extremely fast if you write it in Cython with the NumPy types specified. I didn't actually try that this time, but I would like to when I get the chance.
- Intel MKL reportedly performs better, but the execution environment is not always an Intel CPU, so I ran everything with OpenBLAS.

Reference link
