I have modified the previous implementation. The changes are as follows:
- Reduced memory usage
- Added some error handling
Speeding it up will come later.
First, the previous implementation was very memory intensive. The cause is that P(z|x,y), computed in the E step of the EM algorithm, was stored naively as a 3D array; if we avoid creating that array, memory usage drops considerably.
Specifically, let's look at the update formula for P(x|z) in the M step. The denominator is just a normalization, so we only consider the numerator:
\text{numerator of } P\left( x | z \right) = \sum_{y} N_{x, y} P \left( z | x, y \right)
Substituting the E-step update formula for P(z|x,y) into this gives:
\begin{equation}
\text{numerator of } P\left( x | z \right) = \sum_{y} N_{x, y} \frac{P\left( z \right)P\left( x | z \right)P\left( y | z \right)}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} \tag{1}
\end{equation}
I want to implement this expression without writing explicit for loops, so I use numpy's einsum.
numpy.einsum()
The einsum function performs Einstein summation. It can be hard to grasp at first, so here is one example: the computation
P(x,y) = \sum_{z}P(z)P(x|z)P(y|z)
can be written as
Pxy = numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
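To make the index notation concrete, here is a minimal sketch (with made-up toy dimensions; the variable names are just for illustration) comparing the einsum call to the equivalent explicit loops:

```python
import numpy as np

Z, X, Y = 3, 4, 5                       # small, made-up dimensions
Pz = np.random.rand(Z)                  # P(z)
Px_z = np.random.rand(Z, X)             # P(x|z)
Py_z = np.random.rand(Z, Y)             # P(y|z)

# einsum: multiply along the shared index k (= z) and sum it out,
# keeping i (= x) and j (= y)
Pxy = np.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)

# the same computation with explicit loops
Pxy_loop = np.zeros((X, Y))
for k in range(Z):
    for i in range(X):
        for j in range(Y):
            Pxy_loop[i, j] += Pz[k] * Px_z[k, i] * Py_z[k, j]

print(np.allclose(Pxy, Pxy_loop))       # True
```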
Equation (1) is implemented with this einsum function, but it is awkward to implement as is, so the equation is rearranged as follows:
\text{numerator of } P\left( x | z \right) = \sum_{y} \frac{N_{x, y}}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} P\left( z \right)P\left( x | z \right)P\left( y | z \right)
Implementing this gives:
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
Px_z = numpy.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)
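As a sanity check on this rearrangement, here is a minimal sketch (toy dimensions, illustrative variable names) comparing it to the naive version that materializes P(z|x,y) as a 3D array:

```python
import numpy as np

Z, X, Y = 3, 4, 5
N = np.random.rand(X, Y)
Pz = np.random.rand(Z)
Px_z = np.random.rand(Z, X)
Py_z = np.random.rand(Z, Y)

# naive: build P(z|x,y) explicitly as a (Z, X, Y) array
joint = Pz[:, None, None] * Px_z[:, :, None] * Py_z[:, None, :]
Pz_xy = joint / joint.sum(axis=0)
num_naive = np.einsum('ij,kij->ki', N, Pz_xy)        # numerator of P(x|z)

# memory-friendly: never build the 3D array
tmp = N / np.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
num_einsum = np.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)

print(np.allclose(num_naive, num_einsum))            # True
```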
How much this reduces memory usage is shown later.
Having implemented this with the einsum function, we now consider error handling for division by zero in
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
The denominator being 0 means that, for some x and y,
\sum_{z}P(z)P(x|z)P(y|z) = 0
Since none of these terms can be negative, it follows that for those x and y and for all z,
P(z)P(x|z)P(y|z) = 0
holds. In other words, in the E step of the EM algorithm,
\begin{eqnarray}
P(z|x,y) & = & \frac{P\left( z \right)P\left( x | z \right)P\left( y | z \right)}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} \\
& = & 0
\end{eqnarray}
Therefore, the elements where division by 0 occurs are simply set to 0.
In numpy,
1 / 0 = inf
0 / 0 = nan
so, using numpy.isinf and numpy.isnan respectively, these elements can be zeroed out:
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
tmp[numpy.isinf(tmp)] = 0
tmp[numpy.isnan(tmp)] = 0
Px_z = numpy.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)
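As a side note, numpy also emits RuntimeWarnings for these divisions. Here is a small sketch of how they could be silenced with np.errstate (optional; not part of the implementation below):

```python
import numpy as np

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 0.0, 4.0])

# 1/0 -> inf, 0/0 -> nan; errstate suppresses the RuntimeWarnings
with np.errstate(divide='ignore', invalid='ignore'):
    tmp = a / b          # [inf, nan, 0.5]

tmp[np.isinf(tmp)] = 0
tmp[np.isnan(tmp)] = 0
print(tmp)               # [0.  0.  0.5]
```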
Putting it all together, the overall implementation is as follows.
plsa.py
import numpy as np


class PLSA(object):
    def __init__(self, N, Z):
        self.N = N
        self.X = N.shape[0]
        self.Y = N.shape[1]
        self.Z = Z

        # P(z)
        self.Pz = np.random.rand(self.Z)
        # P(x|z)
        self.Px_z = np.random.rand(self.Z, self.X)
        # P(y|z)
        self.Py_z = np.random.rand(self.Z, self.Y)

        # Normalization
        self.Pz /= np.sum(self.Pz)
        self.Px_z /= np.sum(self.Px_z, axis=1)[:, None]
        self.Py_z /= np.sum(self.Py_z, axis=1)[:, None]

    def train(self, k=200, t=1.0e-7):
        '''
        Repeat the E and M steps until the log-likelihood converges
        '''
        prev_llh = 100000
        for i in xrange(k):
            self.em_algorithm()
            llh = self.llh()
            if abs((llh - prev_llh) / prev_llh) < t:
                break
            prev_llh = llh

    def em_algorithm(self):
        '''
        One iteration of the EM algorithm:
        updates P(z), P(x|z), P(y|z)
        '''
        tmp = self.N / np.einsum('k,ki,kj->ij', self.Pz, self.Px_z, self.Py_z)
        tmp[np.isnan(tmp)] = 0
        tmp[np.isinf(tmp)] = 0
        Pz = np.einsum('ij,k,ki,kj->k', tmp, self.Pz, self.Px_z, self.Py_z)
        Px_z = np.einsum('ij,k,ki,kj->ki', tmp, self.Pz, self.Px_z, self.Py_z)
        Py_z = np.einsum('ij,k,ki,kj->kj', tmp, self.Pz, self.Px_z, self.Py_z)
        self.Pz = Pz / np.sum(Pz)
        self.Px_z = Px_z / np.sum(Px_z, axis=1)[:, None]
        self.Py_z = Py_z / np.sum(Py_z, axis=1)[:, None]

    def llh(self):
        '''
        Log-likelihood
        '''
        Pxy = np.einsum('k,ki,kj->ij', self.Pz, self.Px_z, self.Py_z)
        Pxy /= np.sum(Pxy)
        lPxy = np.log(Pxy)
        lPxy[np.isinf(lPxy)] = -1000
        return np.sum(self.N * lPxy)
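A minimal usage sketch with hypothetical data (the count matrix and sizes are made up for illustration):

```python
import numpy as np
from plsa import PLSA

# hypothetical co-occurrence count matrix for 100 x's and 80 y's
N = np.random.randint(0, 10, size=(100, 80)).astype(float)

plsa = PLSA(N, 5)            # 5 latent topics
plsa.em_algorithm()          # one EM update
print(plsa.llh())            # log-likelihood after the update
print(plsa.Px_z.shape)       # (5, 100): P(x|z)
```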
Underflow can occur when computing the log-likelihood, giving log(0) = -inf. The smallest positive double-precision value is about 4.94e-324, and log(4.94e-324) ≈ -744.4, so any finite log-probability is above that; -1000 is used as a rough substitute.
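A quick check of those numbers (the exact printed values may vary slightly by platform):

```python
import numpy as np

tiny = 5e-324                 # smallest positive (subnormal) double
print(np.log(tiny))           # about -744.44
print(np.log(0.0))            # -inf (with a RuntimeWarning)
```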
To measure how much memory usage was reduced, I used memory_profiler with the following script.
memory_profile.py
import numpy as np
from memory_profiler import profile

X = 1000
Y = 1000
Z = 10


@profile
def main():
    from plsa import PLSA
    plsa = PLSA(np.random.rand(X, Y), Z)
    plsa.em_algorithm()
    llh = plsa.llh()


if __name__ == '__main__':
    main()
For X = 1000, Y = 1000, Z = 10, the previous implementation gives:
$ python profile_memory_element_wise_product.py
Filename: profile_memory_element_wise_product.py
Line # Mem usage Increment Line Contents
================================================
10 15.9 MiB 0.0 MiB @profile
11 def main():
12 15.9 MiB 0.0 MiB from plsa_element_wise_product import PLSA
13 23.9 MiB 8.0 MiB plsa = PLSA(np.random.rand(X, Y), Z)
14 108.0 MiB 84.1 MiB plsa.e_step()
15 184.5 MiB 76.5 MiB plsa.m_step()
16 199.8 MiB 15.3 MiB llh = plsa.llh()
while this implementation gives:
$ python profile_memory_einsum.py
Filename: profile_memory_einsum.py
Line # Mem usage Increment Line Contents
================================================
10 15.8 MiB 0.0 MiB @profile
11 def main():
12 15.8 MiB 0.0 MiB from plsa_einsum import PLSA
13 23.7 MiB 7.9 MiB plsa = PLSA(np.random.rand(X, Y), Z)
14 40.7 MiB 16.9 MiB plsa.em_algorithm()
15 48.4 MiB 7.8 MiB llh = plsa.llh()
For X = 5000, Y = 5000, Z = 10, the previous implementation gives:
$ python profile_memory_element_wise_product.py
Filename: profile_memory_element_wise_product.py
Line # Mem usage Increment Line Contents
================================================
10 15.9 MiB 0.0 MiB @profile
11 def main():
12 15.9 MiB 0.0 MiB from plsa_element_wise_product import PLSA
13 207.6 MiB 191.7 MiB plsa = PLSA(np.random.rand(X, Y), Z)
14 2115.4 MiB 1907.8 MiB plsa.e_step()
15 2115.5 MiB 0.1 MiB plsa.m_step()
16 2115.5 MiB 0.0 MiB llh = plsa.llh()
while this implementation gives:
$ python profile_memory_einsum.py
Filename: profile_memory_einsum.py
Line # Mem usage Increment Line Contents
================================================
10 15.7 MiB 0.0 MiB @profile
11 def main():
12 15.7 MiB 0.0 MiB from plsa_einsum import PLSA
13 207.5 MiB 191.7 MiB plsa = PLSA(np.random.rand(X, Y), Z)
14 233.0 MiB 25.6 MiB plsa.em_algorithm()
15 233.1 MiB 0.0 MiB llh = plsa.llh()
In summary, the total memory usage is
Implementation | X=1000,Y=1000,Z=10 | X=5000,Y=5000,Z=10 |
---|---|---|
Previous implementation | 199.8 MiB | 2115.5 MiB |
This implementation | 48.4 MiB | 233.1 MiB |
However, looking only at the memory consumed by the EM algorithm and the log-likelihood computation:
Implementation | X=1000,Y=1000,Z=10 | X=5000,Y=5000,Z=10 |
---|---|---|
Previous implementation | 175.9 MiB | 1907.9 MiB |
This implementation | 24.7 MiB | 25.6 MiB |
You can see that the memory used by this implementation barely grows. With this, memory is no longer a concern even as the amount of data increases.
The einsum function is convenient, but contracting three or four arrays at once like this is slow. On my MacBook, one EM iteration plus the log-likelihood calculation took about 8.7 seconds at X = 5000, Y = 5000, Z = 10. This isn't practical and needs improvement, but that will come later.
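For reference, a rough sketch of how such a timing could be reproduced (the absolute number will of course depend on the machine):

```python
import time
import numpy as np
from plsa import PLSA

plsa = PLSA(np.random.rand(5000, 5000), 10)

start = time.time()
plsa.em_algorithm()
plsa.llh()
print(time.time() - start)   # seconds for one EM update plus the log-likelihood
```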
- It turned out longer than I expected.
- The einsum function is quite useful.