I have modified the previous implementation. The changes are as follows.
--Reduced memory usage
--Introduced some error handling
Speeding up will come later.
First, the previous implementation uses a lot of memory. The cause is the E step of the EM algorithm, where P(z|x,y) was computed and stored naively as a 3D array. If we avoid creating that array, memory usage drops considerably.
Specifically, let's look at the M-step update formula for P(x|z). Since the denominator is just a normalization, we only consider the numerator.
\text{numerator of } P\left( x | z \right) = \sum_{y} N_{x, y} P \left( z | x, y \right)
Substituting the E-step update formula for P(z|x,y) into this gives:
\begin{equation}
\text{numerator of } P\left( x | z \right) = \sum_{y} N_{x, y} \frac{P\left( z \right)P\left( x | z \right)P\left( y | z \right)}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} \tag{1}
\end{equation}
I want to implement this expression without writing explicit for loops, so I use numpy's einsum.
numpy.einsum()
The einsum function computes tensor contractions written in Einstein summation notation. That description alone is hard to grasp, so here is an example.
P(x,y) = \sum_{z}P(z)P(x|z)P(y|z)
This can be implemented in one line:
Pxy = numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
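To get a feel for the notation, here is a small standalone sketch (array sizes are arbitrary) that checks the einsum result against the same sum written with broadcasting; the broadcasting version materializes the full 3D array, which is exactly what we want to avoid for large problems.
import numpy as np

Z, X, Y = 10, 100, 100
Pz = np.random.rand(Z)
Px_z = np.random.rand(Z, X)
Py_z = np.random.rand(Z, Y)

# einsum: Pxy[i, j] = sum_k Pz[k] * Px_z[k, i] * Py_z[k, j]
Pxy = np.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)

# the same contraction via broadcasting (builds a Z x X x Y array)
Pxy_ref = np.sum(Pz[:, None, None] * Px_z[:, :, None] * Py_z[:, None, :], axis=0)

print(np.allclose(Pxy, Pxy_ref))  # True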
Equation (1) is implemented with this einsum function, but it is awkward to implement as written, so the equation is first rearranged as follows.
\text{numerator of } P\left( x | z \right) = \sum_{y} \frac{N_{x, y}}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} P\left( z \right)P\left( x | z \right)P\left( y | z \right)
Implementing this gives the following.
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
Px_z = numpy.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)
How much this reduces memory usage is shown later.
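As a sanity check, here is a standalone sketch (with made-up sizes, not part of the implementation below) comparing the two-line einsum update against the straightforward version that materializes P(z|x,y) as a Z x X x Y array; both should produce the same numerator. With random positive data the denominator never hits zero here; zero handling is discussed next.
import numpy as np

Z, X, Y = 10, 100, 100
N = np.random.rand(X, Y)
Pz = np.random.rand(Z)
Px_z = np.random.rand(Z, X)
Py_z = np.random.rand(Z, Y)

# naive version: build P(z|x,y) explicitly (memory heavy)
Pz_xy = Pz[:, None, None] * Px_z[:, :, None] * Py_z[:, None, :]
Pz_xy /= np.sum(Pz_xy, axis=0)
numer_naive = np.sum(N[None, :, :] * Pz_xy, axis=2)

# einsum version: no explicit 3D array in our code
tmp = N / np.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
numer_einsum = np.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)

print(np.allclose(numer_naive, numer_einsum))  # True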
Now consider error handling for division by zero in the line
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
The denominator being 0 means that, for some x and y,
\sum_{z}P(z)P(x|z)P(y|z) = 0
Since none of the factors are negative, it follows that for those x and y and for every z,
P(z)P(x|z)P(y|z) = 0
In other words, for such x and y the E step of the EM algorithm gives
\begin{eqnarray}
P(z|x,y) & = & \frac{P\left( z \right)P\left( x | z \right)P\left( y | z \right)}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} \\
& = & 0
\end{eqnarray}
Therefore, the element divided by 0 is set to 0.
In numpy,
1 / 0 = inf
0 / 0 = nan
so the two cases are masked with numpy.isinf and numpy.isnan respectively:
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
tmp[numpy.isinf(tmp)] = 0
tmp[numpy.isnan(tmp)] = 0
Px_z = numpy.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)
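For reference, numpy's behavior for division by zero and the masking can be seen in this small standalone snippet (np.errstate is used only to silence the warnings):
import numpy as np

N = np.array([1.0, 0.0])
denom = np.array([0.0, 0.0])

with np.errstate(divide='ignore', invalid='ignore'):
    tmp = N / denom
print(tmp)  # [inf nan]

tmp[np.isinf(tmp)] = 0
tmp[np.isnan(tmp)] = 0
print(tmp)  # [0. 0.]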
In summary, the overall implementation is as follows.
plsa.py
import numpy as np
class PLSA(object):
    def __init__(self, N, Z):
        self.N = N
        self.X = N.shape[0]
        self.Y = N.shape[1]
        self.Z = Z
        # P(z)
        self.Pz = np.random.rand(self.Z)
        # P(x|z)
        self.Px_z = np.random.rand(self.Z, self.X)
        # P(y|z)
        self.Py_z = np.random.rand(self.Z, self.Y)
        #Normalization
        self.Pz /= np.sum(self.Pz)
        self.Px_z /= np.sum(self.Px_z, axis=1)[:, None]
        self.Py_z /= np.sum(self.Py_z, axis=1)[:, None]
    def train(self, k=200, t=1.0e-7):
        '''
        Repeat steps E and M until the log-likelihood converges
        '''
        prev_llh = 100000
        for i in range(k):
            self.em_algorithm()
            llh = self.llh()
            if abs((llh - prev_llh) / prev_llh) < t:
                break
            prev_llh = llh
    def em_algorithm(self):
        '''
        EM algorithm
        Update P(z), P(x|z), P(y|z)
        '''
        # N_xy / sum_z P(z)P(x|z)P(y|z); combines the E step with the M-step numerators
        tmp = self.N / np.einsum('k,ki,kj->ij', self.Pz, self.Px_z, self.Py_z)
        # elements that were divided by zero are set to 0
        tmp[np.isnan(tmp)] = 0
        tmp[np.isinf(tmp)] = 0
        # numerators of the M-step updates
        Pz = np.einsum('ij,k,ki,kj->k', tmp, self.Pz, self.Px_z, self.Py_z)
        Px_z = np.einsum('ij,k,ki,kj->ki', tmp, self.Pz, self.Px_z, self.Py_z)
        Py_z = np.einsum('ij,k,ki,kj->kj', tmp, self.Pz, self.Px_z, self.Py_z)
        # normalization
        self.Pz = Pz / np.sum(Pz)
        self.Px_z = Px_z / np.sum(Px_z, axis=1)[:, None]
        self.Py_z = Py_z / np.sum(Py_z, axis=1)[:, None]
    def llh(self):
        '''
        Log-likelihood
        '''
        Pxy = np.einsum('k,ki,kj->ij', self.Pz, self.Px_z, self.Py_z)
        Pxy /= np.sum(Pxy)
        lPxy = np.log(Pxy)
        # replace log(0) = -inf with a large negative constant (see the note below)
        lPxy[np.isinf(lPxy)] = -1000
        return np.sum(self.N * lPxy)
Underflow can occur when computing the log-likelihood, producing log(0) = -inf. The smallest positive double-precision value is about 4.94e-324 and log(4.94e-324) ≈ -744.4, so every legitimate log value is above roughly -744.4; -1000 is therefore used as a stand-in that is safely below any real value.
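A quick sketch of where the -744.4 comes from (np.nextafter(0, 1) gives the smallest positive double):
import numpy as np

smallest = np.nextafter(0, 1)
print(smallest)          # about 4.94e-324
print(np.log(smallest))  # about -744.44
with np.errstate(divide='ignore'):
    print(np.log(0.0))   # -inf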
To measure how much memory usage has been reduced, I used memory_profiler with the following script (switching the import between the previous implementation, plsa_element_wise_product, and the new one, plsa_einsum).
memory_profile.py
import numpy as np
from memory_profiler import profile
X = 1000
Y = 1000
Z = 10
@profile
def main():
    from plsa import PLSA
    plsa = PLSA(np.random.rand(X, Y), Z)
    plsa.em_algorithm()
    llh = plsa.llh()
if __name__ == '__main__':
    main()
With X = 1000, Y = 1000, Z = 10, the previous implementation gives:
$ python profile_memory_element_wise_product.py 
Filename: profile_memory_element_wise_product.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.9 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.9 MiB      0.0 MiB       from plsa_element_wise_product import PLSA
    13     23.9 MiB      8.0 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14    108.0 MiB     84.1 MiB       plsa.e_step()
    15    184.5 MiB     76.5 MiB       plsa.m_step()
    16    199.8 MiB     15.3 MiB       llh = plsa.llh()
The new implementation gives:
$ python profile_memory_einsum.py 
Filename: profile_memory_einsum.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.8 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.8 MiB      0.0 MiB       from plsa_einsum import PLSA
    13     23.7 MiB      7.9 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14     40.7 MiB     16.9 MiB       plsa.em_algorithm()
    15     48.4 MiB      7.8 MiB       llh = plsa.llh()
With X = 5000, Y = 5000, Z = 10, the previous implementation gives:
$ python profile_memory_element_wise_product.py 
Filename: profile_memory_element_wise_product.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.9 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.9 MiB      0.0 MiB       from plsa_element_wise_product import PLSA
    13    207.6 MiB    191.7 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14   2115.4 MiB   1907.8 MiB       plsa.e_step()
    15   2115.5 MiB      0.1 MiB       plsa.m_step()
    16   2115.5 MiB      0.0 MiB       llh = plsa.llh()
The new implementation gives:
$ python profile_memory_einsum.py 
Filename: profile_memory_einsum.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.7 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.7 MiB      0.0 MiB       from plsa_einsum import PLSA
    13    207.5 MiB    191.7 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14    233.0 MiB     25.6 MiB       plsa.em_algorithm()
    15    233.1 MiB      0.0 MiB       llh = plsa.llh()
In summary, the total memory usage is
| Implementation | X=1000,Y=1000,Z=10 | X=5000,Y=5000,Z=10 | 
|---|---|---|
| Previous implementation | 199.8 MiB | 2115.5 MiB | 
| This implementation | 48.4 MiB | 233.1 MiB | 
However, restricting attention to just the memory added by the EM algorithm and the log-likelihood calculation:
| Implementation | X=1000,Y=1000,Z=10 | X=5000,Y=5000,Z=10 | 
|---|---|---|
| Previous implementation | 175.9 MiB | 1907.9 MiB | 
| This implementation | 24.7 MiB | 25.6 MiB | 
You can see that the memory used by the new implementation barely grows. With this, I no longer have to worry about memory even as the amount of data increases.
The einsum function is convenient, but contracting three or four arrays at once, as done here, takes a long time. On my MacBook it took about 8.7 seconds to run the EM algorithm and the log-likelihood calculation once with X = 5000, Y = 5000, Z = 10. That is not practical and needs improvement, but I will come back to it later.
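For reference, a rough timing sketch for that kind of measurement (assuming plsa.py above is importable; the exact number will of course depend on the machine):
import time
import numpy as np
from plsa import PLSA

X, Y, Z = 5000, 5000, 10
plsa = PLSA(np.random.rand(X, Y), Z)

start = time.time()
plsa.em_algorithm()
llh = plsa.llh()
print('one EM iteration + log-likelihood: %.1f sec' % (time.time() - start))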
--This post ended up longer than I expected.
--The einsum function really is handy.