- How to make Python faster
- Find CPU-bound processing bottlenecks by profiling
- Improve the bottleneck you find with Cython
The other day, I implemented a recommendation algorithm called Bayesian Personalized Ranking (BPR). I wrote the code following the formulas in the paper, but when I ran it, it was too slow to be practical, so I worked on speeding it up. This post is a memorandum of what I tried.
BPR gives a matrix factorization of the user × item matrix: it decomposes the user × item matrix $ X $ into a user × factor matrix $ U $ and an item × factor matrix $ V $.

$$
X = U \cdot V^T
$$

See BPR's original paper for how this problem is solved.

I implemented the technique as follows: $ X $ is held as a scipy.sparse.lil_matrix, and the decomposed $ U $ and $ V $ are numpy.arrays.
The flow is to bootstrap-sample the training examples and update $ U $ and $ V $ one sample at a time.
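As a small sketch of what this factorization means (toy sizes and NumPy only; this is illustrative code, not part of the repository): the predicted score for a user-item pair is the dot product of the corresponding rows of $ U $ and $ V $.

```python
import numpy as np

rng = np.random.default_rng(0)
userCount, itemCount, factors = 4, 5, 2

# user x factor and item x factor matrices, initialized like the article's
# settings (mean 0.0, stdev 0.1)
U = rng.normal(0.0, 0.1, (userCount, factors))
V = rng.normal(0.0, 0.1, (itemCount, factors))

X_hat = U @ V.T         # reconstructed user x item matrix
score = U[1] @ V[3]     # predicted preference of user 1 for item 3
assert np.isclose(X_hat[1, 3], score)
```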
For this code, I want to improve `buildModel()`, the learning part of the model.
MFbpr.py

```python
class MFbpr(Recommender):
    '''
    Constructor and other processing
    '''

    def buildModel(self):
        loss_pre = sys.float_info.max
        nonzeros = self.trainMatrix.nnz
        hr_prev = 0.0
        sys.stderr.write("Run for BPR. \n")
        for itr in xrange(self.maxIter):
            start = time.time()

            # Each training epoch
            for s in xrange(nonzeros):
                # sample a user
                u = np.random.randint(self.userCount)
                itemList = self.trainMatrix.getrowview(u).rows[0]
                if len(itemList) == 0:
                    continue

                # sample a positive item
                i = random.choice(itemList)

                # One SGD step update
                self.update_ui(u, i)

    def update_ui(self, u, i):
        # sample a negative item (uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)

        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)

        for f in xrange(self.factors):
            grad_u = self.V[i, f] - self.V[j, f]
            self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])

            grad = self.U[u, f]
            self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
            self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])
```
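The excerpt uses `self.predict` and `self.partial_loss` without showing them. Under BPR's log-sigmoid objective they would plausibly look like the following; this is my guess at the omitted helpers (using the fact that the derivative of $ \ln \sigma(x) $ is $ \sigma(-x) $), not the author's actual code.

```python
import numpy as np

def predict(U, V, u, i):
    # score of user u for item i: dot product of the factor rows
    return U[u] @ V[i]

def partial_loss(x):
    # derivative of ln(sigmoid(x)) with respect to x, i.e. sigma(-x)
    return 1.0 / (1.0 + np.exp(x))
```

With definitions like these, `mult = -partial_loss(y_pos - y_neg)` is the gradient multiplier used in the update loop.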
exec_bpr.py

```python
from MFbpr import MFbpr

def load_data(ratingFile, testRatio=0.1):
    '''
    The process of loading data
    '''
    return trainMatrix, testRatings

if __name__ == "__main__":
    # data
    trainMatrix, testRatings = load_data('yelp.rating')

    # settings
    topK = 100
    factors = 64
    maxIter = 10
    maxIterOnline = 1
    lr = 0.01
    adaptive = False
    init_mean = 0.0
    init_stdev = 0.1
    reg = 0.01
    showProgress = False
    showLoss = True

    bpr = MFbpr(trainMatrix, testRatings, topK, factors, maxIter, lr, adaptive,
                reg, init_mean, init_stdev, showProgress, showLoss)
    bpr.buildModel()
```
The full code is available on GitHub.

Let's run it as-is first.
- Number of users: 25,677
- Number of items: 25,815
- Number of ratings (samples with a rating): 627,775
- Number of factors: 64
The environment is Ubuntu with 2 GB of memory and 2 processors, built on VirtualBox.

In the output below, the first bracket is the time [s] one iteration took, and the last bracket is the time [s] spent computing the loss.
```
> python exex_bpr.py

Data yelp.rating
#Users 25677, #newUser: 588
#Items 25815
#Ratings 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR.
Iter=0 [128.6936481] [-]loss: 624484.357765 [8.16665792465]
Iter=1 [137.202406883] [-]loss: 616970.769244 [7.11149096489]
Iter=2 [131.134891987] [-]loss: 609361.307524 [7.12593793869]
Iter=3 [134.665620804] [-]loss: 601240.628869 [8.45840716362]
Iter=4 [134.722868919] [-]loss: 592053.155587 [7.6300880909]
```
Even though there are only about 600,000 samples per epoch, 130 seconds per iteration feels too slow. Let's start by identifying the parts that take the most time.
Use the Python profiler cProfile to measure processing speed. Running

```
python -m cProfile -s cumulative ***.py
```

displays the time-consuming processing in descending order.
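cProfile can also be driven from Python together with pstats, which is handy for sorting and truncating the report programmatically. A minimal sketch (Python 3 here; the slow function is made up for illustration):

```python
import cProfile
import io
import pstats

def slow_square_sum(n):
    # deliberately slow pure-Python loop, standing in for a bottleneck
    total = 0
    for k in range(n):
        total += k * k
    return total

pr = cProfile.Profile()
pr.enable()
slow_square_sum(100000)
pr.disable()

buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())  # top 5 entries by cumulative time
```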
```
> python -m cProfile -s cumulative exex_bpr.py

Data yelp.rating
#Users 25677, #newUser: 588
#Items 25815
#Ratings 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR.
Iter=0 [244.87996006] [-]loss: 624394.802988 [53.4806399345]
Iter=1 [248.624686956] [-]loss: 616876.50976 [48.6073460579]
Iter=2 [253.822627068] [-]loss: 609269.820414 [52.5446169376]
Iter=3 [261.039247036] [-]loss: 601207.904104 [53.8221797943]
Iter=4 [260.285779953] [-]loss: 592212.148141 [54.9556028843]

         369374621 function calls (368041918 primitive calls) in 1808.492 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.171    0.171 1808.493 1808.493 exex_bpr.py:3(<module>)
        1   34.307   34.307 1532.124 1532.124 MFbpr.py:40(buildModel)
  3067088  535.383    0.000  830.046    0.000 MFbpr.py:92(update_ui)
  6209078   48.829    0.000  433.564    0.000 lil.py:228(__getitem__)
  3292937   26.140    0.000  376.631    0.000 lil.py:191(getrowview)
  3292939   66.488    0.000  346.337    0.000 lil.py:85(__init__)
        5    0.000    0.000  263.411   52.682 MFbpr.py:117(_showLoss)
        5   22.098    4.420  263.410   52.682 MFbpr.py:128(loss)
        1    9.443    9.443  242.550  242.550 exex_bpr.py:9(load_data)
```
Below the `Ordered by: cumulative time` line, function names are listed with the time they took. Looking at this, you can see that `update_ui` is called 3,067,088 times and takes a total of 535.383 seconds.

(Well, that is only natural, since `buildModel` calls almost nothing but `update_ui`...)

The per-iteration execution time being longer than before is the overhead of cProfile itself.
With line_profiler, you can measure a function of interest line by line. It can be installed with pip.

```
> pip install line_profiler
```
To profile with line_profiler, add a `@profile` decorator to the function you want to examine.
MFbpr.py

```python
    @profile
    def update_ui(self, u, i):
        # sample a negative item (uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)

        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)

        for f in xrange(self.factors):
            grad_u = self.V[i, f] - self.V[j, f]
            self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])

            grad = self.U[u, f]
            self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
            self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])
```
Then all you have to do is run it with the kernprof command.
```
> kernprof -l -v exex_bpr.py

Data yelp.rating
#Users 25677, #newUser: 588
#Items 25815
#Ratings 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR.
Iter=0 [953.993126154] [-]loss: 624406.940531 [7.50253987312]
Iter=1 [962.82383585] [-]loss: 616855.373858 [7.96375918388]
Wrote profile results to exex_bpr.py.lprof
Timer unit: 1e-06 s

Total time: 1082.55 s
File: MFbpr.py
Function: update_ui at line 92

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    92                                           @profile
    93                                           def update_ui(self, u, i):
    94                                               # sample a negative item(uniformly random)
    95   1226961      8241361      6.7      0.8      j = np.random.randint(self.itemCount)
    96   1228124     39557350     32.2      3.7      while self.trainMatrix[u, j] != 0:
    97      1163         6123      5.3      0.0          j = np.random.randint(self.itemCount)
    98
    99                                               # BPR update rules
   100   1226961      9495381      7.7      0.9      y_pos = self.predict(u, i) # target value of positive instance
   101   1226961      4331378      3.5      0.4      y_neg = self.predict(u, j) # target value of negative instance
   102   1226961     10002355      8.2      0.9      mult = -self.partial_loss(y_pos - y_neg)
   103
   104  79752465    126523856      1.6     11.7      for f in xrange(self.factors):
   105  78525504    161882071      2.1     15.0          grad_u = self.V[i, f] - self.V[j, f]
   106  78525504    191293505      2.4     17.7          self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])
   107
   108  78525504    137871145      1.8     12.7          grad = self.U[u, f]
   109  78525504    194033291      2.5     17.9          self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
   110  78525504    199315454      2.5     18.4          self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])
```
Looking at the `% Time` column, I found that **more than 90% of the processing time of `update_ui` is spent under the for statement**. One iteration now takes just under 1,000 seconds, which shows how heavy line_profiler's overhead is.
Let's rewrite the for loop in Cython. One reason Python is slow is that it is dynamically typed: since the type is checked every time a variable is referenced, a program with many variable references cannot ignore the time this takes. Writing in Cython lets you declare variable types as in C. Because types no longer have to be checked one by one, a considerable speedup can be expected.
Variables are declared with `cdef`. Define the types of all variables used in the calculation.
fastupdfn.pyx

```cython
import numpy as np
cimport numpy as np
cimport cython

DOUBLE = np.float64
ctypedef np.float64_t DOUBLE_t

cpdef c_upd(np.ndarray[DOUBLE_t, ndim=2] U,
            np.ndarray[DOUBLE_t, ndim=2] V,
            double mult,
            double lr,
            double reg,
            unsigned int u,
            unsigned int i,
            unsigned int j,
            unsigned int factors):
    cdef unsigned int f
    cdef double grad_u, grad
    for f in xrange(factors):
        grad_u = V[i, f] - V[j, f]
        U[u, f] -= lr * (mult * grad_u + reg * U[u, f])

        grad = U[u, f]
        V[i, f] -= lr * (mult * grad + reg * V[i, f])
        V[j, f] -= lr * (-mult * grad + reg * V[j, f])
    return U, V
```
The caller is rewritten as follows.
MFbpr.py

```python
import fastupd  # added

class MFbpr(Recommender):
    """
    Omission
    """

    def update_ui(self, u, i):
        # sample a negative item (uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)

        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)

        # Call the Cython implementation
        self.U, self.V = fastupd.c_upd(self.U, self.V, mult, self.lr,
                                       self.reg, u, i, j, self.factors)
```
You also need a setup.py to compile the Cython implementation.
setup.py

```python
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("fastupd", ["fastupdfn.pyx"])]
)
```
Once ready, compile with the following command.

```
> python setup.py build_ext --inplace
```

Now let's run it.
```
> python exec_bpr.py

Data yelp.rating
#Users 25677, #newUser: 588
#Items 25815
#Ratings 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR.
Iter=0 [38.7308020592] [-]loss: 624282.459815 [8.64863014221]
Iter=1 [36.7822458744] [-]loss: 616740.942017 [8.62252616882]
Iter=2 [35.8996829987] [-]loss: 609119.520253 [7.9108710289]
Iter=3 [35.1236720085] [-]loss: 600992.740207 [8.14179396629]
Iter=4 [34.9632918835] [-]loss: 591877.909123 [8.81325411797]
```
One iteration now takes less than 40 seconds, **3-4 times faster than the original**.
The line_profiler result is also shown below.
```
> kernprof -l -v exex_bpr.py

Data yelp.rating
#Users 25677, #newUser: 588
#Items 25815
#Ratings 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR.
Iter=0 [66.7851500511] [-]loss: 624400.680806 [7.92221903801]
Iter=1 [62.5339269638] [-]loss: 616866.244211 [7.85720801353]
Iter=2 [65.6408250332] [-]loss: 609235.226048 [8.48338794708]
Iter=3 [66.0613160133] [-]loss: 601140.035318 [8.52119803429]
Iter=4 [66.5882000923] [-]loss: 592026.927719 [8.32318806648]
Wrote profile results to exex_bpr.py.lprof
Timer unit: 1e-06 s

Total time: 164.139 s
File: MFbpr.py
Function: update_ui at line 93

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    93                                           @profile
    94                                           def update_ui(self, u, i):
    95                                               # sample a negative item(uniformly random)
    96   3066856     17642519      5.8     10.7      j = np.random.randint(self.itemCount)
    97   3069840     79530375     25.9     48.5      while self.trainMatrix[u, j] != 0:
    98      2984        15027      5.0      0.0          j = np.random.randint(self.itemCount)
    99
   100                                               # BPR update rules
   101   3066856     17651846      5.8     10.8      y_pos = self.predict(u, i) # target value of positive instance
   102   3066856     10985781      3.6      6.7      y_neg = self.predict(u, j) # target value of negative instance
   103   3066856     18993291      6.2     11.6      mult = -self.partial_loss(y_pos - y_neg)
   104
   105   3066856     19320147      6.3     11.8      self.U, self.V = fastupd.c_upd(self.U, self.V, mult, self.lr, self.reg, u, i, j, self.factors)
```
The matrix update, which originally took more than 90% of the time, has been reduced to 11.8%. The effect of Cython is clear.
Profiling with cProfile and line_profiler found the processing bottleneck, which was then improved with Cython. Although I rewrote only one spot, the result is a little under four times faster.
By the way, I put the Cython version of the code in the w_cython branch on GitHub. It may be merged into master soon.
The for statement updates the matrix elements independently for each factor, so it is completely parallelizable. Cython has a function called `prange()` that runs the body of a for loop over multiple threads, so applying it might squeeze out a bit more speed.
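Separately from prange, the per-factor loop could also be expressed as NumPy row operations. A sketch of that idea (my own code mirroring the arithmetic of `update_ui`, not part of the repository):

```python
import numpy as np

def update_rows(U, V, mult, lr, reg, u, i, j):
    """Vectorized equivalent of the per-factor update loop.

    Mutates rows U[u], V[i], V[j] in place. As in the scalar loop,
    U[u] is updated first and the V updates then use the new U[u].
    """
    grad_u = V[i] - V[j]                       # elementwise over factors
    U[u] -= lr * (mult * grad_u + reg * U[u])  # user row update
    grad = U[u].copy()                         # updated user factors
    V[i] -= lr * (mult * grad + reg * V[i])    # positive item row
    V[j] -= lr * (-mult * grad + reg * V[j])   # negative item row
    return U, V
```

Whether this beats the Cython version would depend on the number of factors; with only 64 factors, NumPy's per-call overhead may still dominate.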