I have modified the previous implementation. The changes are as follows.
--Reduced memory usage
--Introduced some error handling
Speeding up will come later.
First, the previous implementation uses a lot of memory. The cause is the E step of the EM algorithm, where P(z|x,y) was computed and stored naively as a 3D array. If we avoid creating that array, memory usage drops considerably.
Specifically, let's look at the M-step update formula for P(x|z). Since the denominator is just a normalization, we only consider the numerator.
\text{numerator of } P\left( x | z \right) = \sum_{y} N_{x, y} P \left( z | x, y \right)
Substituting the E-step update formula for P(z|x,y) into this gives:
\begin{equation}
\text{numerator of } P\left( x | z \right) = \sum_{y} N_{x, y} \frac{P\left( z \right)P\left( x | z \right)P\left( y | z \right)}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} \tag{1}
\end{equation}
I want to implement this expression without writing explicit for loops, so I use numpy's einsum.
numpy.einsum()
The einsum function computes tensor contractions written in Einstein summation notation. That description alone is hard to grasp, so here is an example.
P(x,y) = \sum_{z}P(z)P(x|z)P(y|z)
This can be implemented in one line:
Pxy = numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
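To get a feel for the notation, here is a small standalone sketch (array sizes are arbitrary) that checks the einsum result against the same sum written with broadcasting; the broadcasting version materializes the full 3D array, which is exactly what we want to avoid for large problems.
import numpy as np

Z, X, Y = 10, 100, 100
Pz = np.random.rand(Z)
Px_z = np.random.rand(Z, X)
Py_z = np.random.rand(Z, Y)

# einsum: Pxy[i, j] = sum_k Pz[k] * Px_z[k, i] * Py_z[k, j]
Pxy = np.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)

# the same contraction via broadcasting (builds a Z x X x Y array)
Pxy_ref = np.sum(Pz[:, None, None] * Px_z[:, :, None] * Py_z[:, None, :], axis=0)

print(np.allclose(Pxy, Pxy_ref))  # True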
Equation (1) is implemented with this einsum function, but it is awkward to implement as written, so the equation is first rearranged as follows.
\text{numerator of } P\left( x | z \right) = \sum_{y} \frac{N_{x, y}}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} P\left( z \right)P\left( x | z \right)P\left( y | z \right)
Implementing this gives the following.
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
Px_z = numpy.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)
How much this reduces memory usage is shown later.
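As a sanity check, here is a standalone sketch (with made-up sizes, not part of the implementation below) comparing the two-line einsum update against the straightforward version that materializes P(z|x,y) as a Z x X x Y array; both should produce the same numerator. With random positive data the denominator never hits zero here; zero handling is discussed next.
import numpy as np

Z, X, Y = 10, 100, 100
N = np.random.rand(X, Y)
Pz = np.random.rand(Z)
Px_z = np.random.rand(Z, X)
Py_z = np.random.rand(Z, Y)

# naive version: build P(z|x,y) explicitly (memory heavy)
Pz_xy = Pz[:, None, None] * Px_z[:, :, None] * Py_z[:, None, :]
Pz_xy /= np.sum(Pz_xy, axis=0)
numer_naive = np.sum(N[None, :, :] * Pz_xy, axis=2)

# einsum version: no explicit 3D array in our code
tmp = N / np.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
numer_einsum = np.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)

print(np.allclose(numer_naive, numer_einsum))  # True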
Now consider error handling for division by zero in the line
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
The denominator being 0 means that, for some x and y,
\sum_{z}P(z)P(x|z)P(y|z) = 0
Since none of the factors are negative, it follows that for those x and y and for every z,
P(z)P(x|z)P(y|z) = 0
In other words, for such x and y the E step of the EM algorithm gives
\begin{eqnarray}
P(z|x,y) & = & \frac{P\left( z \right)P\left( x | z \right)P\left( y | z \right)}{\sum_{z} P\left( z \right)P\left( x | z \right)P\left( y | z \right)} \\
& = & 0
\end{eqnarray}
Therefore, the element divided by 0 is set to 0.
In numpy,
1 / 0 = inf
0 / 0 = nan
so the two cases are masked with numpy.isinf and numpy.isnan respectively:
tmp = N / numpy.einsum('k,ki,kj->ij', Pz, Px_z, Py_z)
tmp[numpy.isinf(tmp)] = 0
tmp[numpy.isnan(tmp)] = 0
Px_z = numpy.einsum('ij,k,ki,kj->ki', tmp, Pz, Px_z, Py_z)
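For reference, numpy's behavior for division by zero and the masking can be seen in this small standalone snippet (np.errstate is used only to silence the warnings):
import numpy as np

N = np.array([1.0, 0.0])
denom = np.array([0.0, 0.0])

with np.errstate(divide='ignore', invalid='ignore'):
    tmp = N / denom
print(tmp)  # [inf nan]

tmp[np.isinf(tmp)] = 0
tmp[np.isnan(tmp)] = 0
print(tmp)  # [0. 0.]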
In summary, the overall implementation is as follows.
plsa.py
import numpy as np
class PLSA(object):
    def __init__(self, N, Z):
        self.N = N
        self.X = N.shape[0]
        self.Y = N.shape[1]
        self.Z = Z
        # P(z)
        self.Pz = np.random.rand(self.Z)
        # P(x|z)
        self.Px_z = np.random.rand(self.Z, self.X)
        # P(y|z)
        self.Py_z = np.random.rand(self.Z, self.Y)
        #Normalization
        self.Pz /= np.sum(self.Pz)
        self.Px_z /= np.sum(self.Px_z, axis=1)[:, None]
        self.Py_z /= np.sum(self.Py_z, axis=1)[:, None]
    def train(self, k=200, t=1.0e-7):
        '''
        Repeat steps E and M until the log-likelihood converges
        '''
        prev_llh = 100000
        for i in range(k):
            self.em_algorithm()
            llh = self.llh()
            if abs((llh - prev_llh) / prev_llh) < t:
                break
            prev_llh = llh
    def em_algorithm(self):
        '''
        EM algorithm
        Update P(z), P(x|z), P(y|z)
        '''
        # N_xy / sum_z P(z)P(x|z)P(y|z); combines the E step with the M-step numerators
        tmp = self.N / np.einsum('k,ki,kj->ij', self.Pz, self.Px_z, self.Py_z)
        # elements that were divided by zero are set to 0
        tmp[np.isnan(tmp)] = 0
        tmp[np.isinf(tmp)] = 0
        # numerators of the M-step updates
        Pz = np.einsum('ij,k,ki,kj->k', tmp, self.Pz, self.Px_z, self.Py_z)
        Px_z = np.einsum('ij,k,ki,kj->ki', tmp, self.Pz, self.Px_z, self.Py_z)
        Py_z = np.einsum('ij,k,ki,kj->kj', tmp, self.Pz, self.Px_z, self.Py_z)
        # normalization
        self.Pz = Pz / np.sum(Pz)
        self.Px_z = Px_z / np.sum(Px_z, axis=1)[:, None]
        self.Py_z = Py_z / np.sum(Py_z, axis=1)[:, None]
    def llh(self):
        '''
        Log-likelihood
        '''
        Pxy = np.einsum('k,ki,kj->ij', self.Pz, self.Px_z, self.Py_z)
        Pxy /= np.sum(Pxy)
        lPxy = np.log(Pxy)
        # replace log(0) = -inf with a large negative constant (see the note below)
        lPxy[np.isinf(lPxy)] = -1000
        return np.sum(self.N * lPxy)
Underflow can occur when computing the log-likelihood, producing log(0) = -inf. The smallest positive double-precision value is about 4.94e-324 and log(4.94e-324) ≈ -744.4, so every legitimate log value is above roughly -744.4; -1000 is therefore used as a stand-in that is safely below any real value.
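A quick sketch of where the -744.4 comes from (np.nextafter(0, 1) gives the smallest positive double):
import numpy as np

smallest = np.nextafter(0, 1)
print(smallest)          # about 4.94e-324
print(np.log(smallest))  # about -744.44
with np.errstate(divide='ignore'):
    print(np.log(0.0))   # -inf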
To measure how much memory usage has been reduced, I used memory_profiler with the following script (switching the import between the previous implementation, plsa_element_wise_product, and the new one, plsa_einsum).
memory_profile.py
import numpy as np
from memory_profiler import profile
X = 1000
Y = 1000
Z = 10
@profile
def main():
    from plsa import PLSA
    plsa = PLSA(np.random.rand(X, Y), Z)
    plsa.em_algorithm()
    llh = plsa.llh()
if __name__ == '__main__':
    main()
With X = 1000, Y = 1000, Z = 10, the previous implementation gives:
$ python profile_memory_element_wise_product.py 
Filename: profile_memory_element_wise_product.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.9 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.9 MiB      0.0 MiB       from plsa_element_wise_product import PLSA
    13     23.9 MiB      8.0 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14    108.0 MiB     84.1 MiB       plsa.e_step()
    15    184.5 MiB     76.5 MiB       plsa.m_step()
    16    199.8 MiB     15.3 MiB       llh = plsa.llh()
The new implementation gives:
$ python profile_memory_einsum.py 
Filename: profile_memory_einsum.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.8 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.8 MiB      0.0 MiB       from plsa_einsum import PLSA
    13     23.7 MiB      7.9 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14     40.7 MiB     16.9 MiB       plsa.em_algorithm()
    15     48.4 MiB      7.8 MiB       llh = plsa.llh()
With X = 5000, Y = 5000, Z = 10, the previous implementation gives:
$ python profile_memory_element_wise_product.py 
Filename: profile_memory_element_wise_product.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.9 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.9 MiB      0.0 MiB       from plsa_element_wise_product import PLSA
    13    207.6 MiB    191.7 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14   2115.4 MiB   1907.8 MiB       plsa.e_step()
    15   2115.5 MiB      0.1 MiB       plsa.m_step()
    16   2115.5 MiB      0.0 MiB       llh = plsa.llh()
The new implementation gives:
$ python profile_memory_einsum.py 
Filename: profile_memory_einsum.py
Line #    Mem usage    Increment   Line Contents
================================================
    10     15.7 MiB      0.0 MiB   @profile
    11                             def main():
    12     15.7 MiB      0.0 MiB       from plsa_einsum import PLSA
    13    207.5 MiB    191.7 MiB       plsa = PLSA(np.random.rand(X, Y), Z)
    14    233.0 MiB     25.6 MiB       plsa.em_algorithm()
    15    233.1 MiB      0.0 MiB       llh = plsa.llh()
In summary, the total memory usage is
| Implementation | X=1000,Y=1000,Z=10 | X=5000,Y=5000,Z=10 | 
|---|---|---|
| Previous implementation | 199.8 MiB | 2115.5 MiB | 
| This implementation | 48.4 MiB | 233.1 MiB | 
However, restricting attention to just the memory added by the EM algorithm and the log-likelihood calculation:
| Implementation | X=1000,Y=1000,Z=10 | X=5000,Y=5000,Z=10 | 
|---|---|---|
| Previous implementation | 175.9 MiB | 1907.9 MiB | 
| This implementation | 24.7 MiB | 25.6 MiB | 
You can see that the memory used by the new implementation barely grows. With this, I no longer have to worry about memory even as the amount of data increases.
The einsum function is convenient, but contracting three or four arrays at once, as done here, takes a long time. On my MacBook it took about 8.7 seconds to run the EM algorithm and the log-likelihood calculation once with X = 5000, Y = 5000, Z = 10. That is not practical and needs improvement, but I will come back to it later.
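For reference, a rough timing sketch for that kind of measurement (assuming plsa.py above is importable; the exact number will of course depend on the machine):
import time
import numpy as np
from plsa import PLSA

X, Y, Z = 5000, 5000, 10
plsa = PLSA(np.random.rand(X, Y), Z)

start = time.time()
plsa.em_algorithm()
llh = plsa.llh()
print('one EM iteration + log-likelihood: %.1f sec' % (time.time() - start))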
--This post ended up longer than I expected.
--The einsum function really is handy.