This is a record of my attempt at Language Processing 100 Knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
In this chapter, we work on the task of classifying sentences as positive or negative (polarity analysis) using the sentence polarity dataset v1.0 of Movie Review Data published by Bo Pang and Lillian Lee.
> Learn the logistic regression model using the features extracted in Problem 72.
main.py
```python
# coding: utf-8
import codecs
import snowballstemmer
import numpy as np

fname_sentiment = 'sentiment.txt'
fname_features = 'features.txt'
fname_theta = 'theta.npy'
fencoding = 'cp1252'  # Windows-1252

learn_alpha = 6.0   # learning rate
learn_count = 1000  # number of learning iterations

stemmer = snowballstemmer.stemmer('english')

# List of stop words, taken in CSV form from http://xpo6.com/list-of-english-stop-words/
stop_words = (
    'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,'
    'as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,'
    'either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,'
    'him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,'
    'likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,'
    'on,only,or,other,our,own,rather,said,say,says,she,should,since,so,'
    'some,than,that,the,their,them,then,there,these,they,this,tis,to,too,'
    'twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,'
    'will,with,would,yet,you,your').lower().split(',')


def is_stopword(str):
    '''Determines whether the given word is a stop word.
    Case-insensitive.

    Return value:
    True for a stop word, False otherwise
    '''
    return str.lower() in stop_words


def hypothesis(data_x, theta):
    '''Hypothesis function
    Predicts data_y from data_x using theta.

    Return value:
    Matrix of predicted values
    '''
    return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))


def cost(data_x, theta, data_y):
    '''Objective function
    Calculates the difference between the predictions for data_x
    and the correct answers data_y.

    Return value:
    Difference between the predictions and the correct answers
    '''
    m = data_y.size                # number of data points
    h = hypothesis(data_x, theta)  # matrix of predicted values for data_y
    j = 1 / m * np.sum(-data_y * np.log(h) -
                       (np.ones(m) - data_y) * np.log(np.ones(m) - h))
    return j


def gradient(data_x, theta, data_y):
    '''Gradient computation for steepest descent

    Return value:
    Gradient matrix for theta
    '''
    m = data_y.size                # number of data points
    h = hypothesis(data_x, theta)  # matrix of predicted values for data_y
    grad = 1 / m * (h - data_y).dot(data_x)
    return grad


def extract_features(data, dict_features):
    '''Extracts features from a sentence.
    Extracts the features contained in dict_features from the text and
    returns a matrix with 1 at the position dict_features['(feature)'].
    The first element is fixed at 1, for the weight that does not
    correspond to any feature.

    Return value:
    Matrix with 1 in the first element and at (position of each matching feature + 1)
    '''
    data_one_x = np.zeros(len(dict_features) + 1, dtype=np.float64)
    data_one_x[0] = 1  # first element fixed at 1, for the weight not tied to any feature

    for word in data.split(' '):

        # remove surrounding whitespace
        word = word.strip()

        # remove stop words
        if is_stopword(word):
            continue

        # stemming
        word = stemmer.stemWord(word)

        # get the index of the feature and set the corresponding element to 1
        try:
            data_one_x[dict_features[word]] = 1
        except KeyError:
            pass  # ignore features not present in dict_features

    return data_one_x


def load_dict_features():
    '''Reads features.txt and creates a dictionary that converts features to indexes.
    Index values are 1-based and match the line numbers in features.txt.

    Return value:
    Dictionary that converts features to indexes
    '''
    with codecs.open(fname_features, 'r', fencoding) as file_in:
        return {line.strip(): i for i, line in enumerate(file_in, start=1)}


def create_training_set(sentiments, dict_features):
    '''Creates the matrix to learn from and the matrix of polarity labels
    from the correct data sentiments.
    The size of the matrix to learn from is (number of reviews) x (number of features + 1).
    Each column is 1 if the review has the corresponding feature, 0 otherwise.
    The column index of a feature is determined by dict_features['(feature)'].
    The first column is always 1, for learning the weight that does not
    correspond to any feature.
    Features that do not exist in dict_features are ignored.
    The size of the polarity label matrix is (number of reviews) x 1,
    with 1 for positive content and 0 for negative content.

    Return value:
    Matrix to learn from, polarity label matrix
    '''

    # initialize matrices with 0
    data_x = np.zeros([len(sentiments), len(dict_features) + 1], dtype=np.float64)
    data_y = np.zeros(len(sentiments), dtype=np.float64)

    for i, line in enumerate(sentiments):

        # feature extraction
        data_x[i] = extract_features(line[3:], dict_features)

        # set the polarity label
        if line[0:2] == '+1':
            data_y[i] = 1

    return data_x, data_y


def learn(data_x, data_y, alpha, count):
    '''Learns the logistic regression model.

    Return value:
    Trained theta
    '''
    theta = np.zeros(data_x.shape[1])
    c = cost(data_x, theta, data_y)
    print('\tStart learning\tcost:{}'.format(c))

    for i in range(1, count + 1):
        grad = gradient(data_x, theta, data_y)
        theta -= alpha * grad

        # calculate the cost and the maximum adjustment of theta
        # and show progress (once every 100 iterations)
        if i % 100 == 0:
            c = cost(data_x, theta, data_y)
            e = np.max(np.absolute(alpha * grad))
            print('\tLearning(#{})\tcost:{}\tE:{}'.format(i, c, e))

    c = cost(data_x, theta, data_y)
    e = np.max(np.absolute(alpha * grad))
    print('\tLearning completed(#{})\tcost:{}\tE:{}'.format(i, c, e))

    return theta


# read the feature dictionary
dict_features = load_dict_features()

# create the matrix to learn from and the polarity label matrix
with codecs.open(fname_sentiment, 'r', fencoding) as file_in:
    data_x, data_y = create_training_set(list(file_in), dict_features)

# learning
print('Learning rate:{}\tNumber of learning iterations:{}'.format(learn_alpha, learn_count))
theta = learn(data_x, data_y, alpha=learn_alpha, count=learn_count)

# save the result
np.save(fname_theta, theta)
```
Execution result
```
Learning rate: 6.0	Number of learning iterations: 1000
	Start learning	cost:0.6931471805599453
	Learning(#100)	cost:0.4809284917412944	E:0.006248170735186127
	Learning(#200)	cost:0.43188679850114775	E:0.003599155234198481
	Learning(#300)	cost:0.4043113376254009	E:0.002616675715766214
	Learning(#400)	cost:0.38547454091328076	E:0.0020805226234380772
	Learning(#500)	cost:0.37135664408713015	E:0.0017952496476012821
	Learning(#600)	cost:0.36017505743644285	E:0.0015873326040173347
	Learning(#700)	cost:0.35098616931062043	E:0.0014288227472357999
	Learning(#800)	cost:0.343231725184532	E:0.0013037591670736948
	Learning(#900)	cost:0.33655507220582787	E:0.0012023865948793643
	Learning(#1000)	cost:0.33071511988186225	E:0.0011184180264631118
	Learning completed(#1000)	cost:0.33071511988186225	E:0.0011184180264631118
```
The trained $\theta$ is output to "theta.npy". The file has been uploaded to GitHub.
As preparation, you need to install NumPy, a library for fast matrix operations. Fortunately, it comes bundled with Anaconda, so I could use it as is without doing anything. The official site is here.
The reason fast matrix operations with NumPy are needed is that machine learning involves a huge amount of computation; unless you rewrite the logic as matrix operations using a technique called vectorization, training takes an extremely long time.
For example, in the following formula used for learning, which appeared in Problem 72, $n$ becomes 3,227 because 3,227 features were extracted:

$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_{3227} x_{3227}$$

In this case, calculating $y$ once takes 3,227 multiplications and 3,227 additions. The straightforward approach is to put $\theta_0$–$\theta_{3227}$ and $x_1$–$x_{3227}$ into lists and loop 3,227 times with a for statement, multiplying each pair and adding the result to $y$ — but since this formula is evaluated many times during training, that would take a considerable amount of time.
Therefore, instead of keeping $\theta_0$–$\theta_{3227}$ and $x_1$–$x_{3227}$ in lists, we turn them into matrices and take their inner product. The inner product multiplies the corresponding elements of the two matrices and sums the results, so the answer is the same. Since computing an inner product with NumPy is faster than repeating multiplications and additions over lists with a for statement, this speeds up training. A matrix with one row or one column is called a vector, so this process of turning things into matrices is called "vectorization".
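As a minimal illustration (a standalone sketch with made-up numbers, not part of main.py), a Python loop and a NumPy inner product give the same answer, but `np.dot` runs in optimized native code:

```python
import numpy as np

# hypothetical small example: 5 "features" instead of 3,227
theta = np.array([0.5, -1.0, 2.0, 0.25, 3.0])
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])

# loop version: multiply element by element and accumulate
y_loop = 0.0
for t, xi in zip(theta, x):
    y_loop += t * xi

# vectorized version: a single inner product
y_dot = theta.dot(x)

print(y_loop, y_dot)  # both give 2.75
```

On arrays with thousands of elements, the `dot` version is dramatically faster than the loop.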
There is one caveat to vectorization: $\theta$ has 3,228 elements in total, from $\theta_0$ to $\theta_{3227}$, while $x$ has one fewer, from $x_1$ to $x_{3227}$. The inner product cannot be computed unless the sizes match. So we introduce $x_0$, rewrite the formula as follows, and implement it so that the value of $x_0$ is always 1. Note that $x$ now has one more element than the number of features.

$$y = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_{3227} x_{3227} \quad (x_0 = 1)$$
The matrixed $X$ and $\theta$ look like this (the values are just placeholders). Now $y$ can be computed as $X\theta$ with a single matrix operation.
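Here is a tiny sketch of that idea (made-up values, 3 features instead of 3,227): prepend a column of ones for $x_0$, and a single product $X\theta$ yields the prediction for every row at once:

```python
import numpy as np

# made-up example: 2 rows of 3 features each
features = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0]])

# prepend x0 = 1 to every row so that theta_0 participates in the product
X = np.hstack([np.ones((features.shape[0], 1)), features])

theta = np.array([0.5, 1.0, -2.0, 3.0])  # theta_0 .. theta_3

y = X.dot(theta)  # one matrix product -> predictions for all rows
print(y)  # [4.5, 1.5]
```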
The documentation on manipulating matrices with NumPy is in English, but you can get a rough idea by looking through the Quickstart tutorial. There are also many explanatory articles online, so please refer to them.
For more on vectorization, junichiro's "Making Machine Learning Practical Level in One Month #6 (Octave Practical Edition)" and ken5scal's "[For beginners] Introduction to vectorization in machine learning" were easy to understand.
You might wonder: even after vectorizing and letting NumPy do the matrix computation, the arithmetic performed internally should be the same, so why is it faster? NumPy lays out the matrix elements in memory so that they can be processed at C/C++ speed, which makes it considerably faster than computing in a Python loop.
In addition, many parts of a matrix operation can be computed independently of each other, so they can be computed in parallel rather than sequentially. This is why, depending on the library used, you can also expect a speedup from a GPU.
Originally, the GPU was a chip dedicated to graphics. Unlike CPU cores, which are designed to execute sophisticated tasks sequentially at high speed, GPU cores specialize in executing a huge number of pixel-calculation tasks for screen display in parallel. Today's GPUs are no longer graphics-only and can also be used for computational tasks such as matrix operations. Modern GPUs have thousands of cores, so for tasks a GPU core can handle, such as matrix operations, they can process far more in parallel than a CPU with its limited number of cores, and processing time can be expected to drop significantly.
In machine-learning implementations, vectorization seems to be a very important technique for speeding up processing.
The prediction function $h_\theta(x)$ used in the examples so far, which computes the predicted value of $y$, is called the hypothesis function.
This time, I modify this formula a little before using it. As I mentioned briefly in Problem 72 (http://qiita.com/segavvy/items/6695f94c28126607227b), if you use it as is, the predicted value of $y$ becomes hard to handle when it grows larger than 1 (which indicates positive) or smaller than 0 (which indicates negative).
In Problem 70, I wrote that machine learning praises the model when it predicts correctly and scolds it when it is wrong, adjusting $\theta$ accordingly. In reality, the difference from the correct answer is computed by the objective function described later, and $\theta$ is adjusted so that the difference becomes smaller. If the content is positive, the correct label for $y$ is 1. But with the formula as is, the predicted value might come out as 1, 10, or 100; with 10 or 100 the prediction is correct in spirit, yet the difference from the correct label 1 becomes large. So we use a convenient function called the sigmoid function.
$$g(z) = \frac{1}{1 + e^{-z}}$$

This function converts any given value into a value between 0 and 1, and the order of the given values is preserved after conversion. If you put the hypothesis function $h_\theta(x)$ into the $z$ part of this function, the result always falls between 0 and 1.
Plotted as a graph, it looks like this (I drew it with Grapher, which comes standard on the Mac — quite handy, though I couldn't figure out how to label the axes, so I touched up only that part).
When $z$ moves around 0, the output of the sigmoid function swings widely around 0.5, whereas once $z$ becomes larger or smaller than a certain level, the output stays near 1 or near 0 and barely changes.
Thanks to this property, even if the hypothesis function yields 1, 10, or 100, putting it into $z$ gives a value close to 1, so the difference from the correct label stays small.
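A quick standalone check of this behavior (not part of main.py):

```python
import numpy as np

def sigmoid(z):
    # converts any value into the range (0, 1), preserving order
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10, -1, 0, 1, 10, 100]:
    print(z, sigmoid(z))

# around z = 0 the output moves quickly near 0.5;
# for z = 10 or 100 it is already very close to 1
```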
So this time I used a hypothesis function that incorporates this sigmoid function; hypothesis() is its implementation. It takes the matrices $X$ and $\theta$ and computes the matrix of predicted values.
Next, I explain the objective function, which measures how accurate the predictions are. Based on the result of this objective function, $\theta$ is adjusted so that the predictions become more accurate.
Whether a prediction is correct can be judged by comparing the predicted value computed by the hypothesis function with the correct label: a small difference means high prediction accuracy, and a large difference means low accuracy.
Normally, instead of simply summing the differences, you sum their squares. Squaring frees you from worrying about the sign of the difference and amplifies larger differences, which makes $\theta$ easier to adjust.
However, in a prediction whose result is either 1 (positive) or 0 (negative), as here, the difference is at most 1, and with squared error the objective function is no longer a convex function, so it becomes difficult to adjust $\theta$ toward the minimum of the objective function.
So, exploiting the property of $-\log()$, we use the following objective function, which becomes 0 when the prediction matches the answer and approaches $\infty$ when it does not. Here $y$ is the correct label (1 for positive, 0 for negative) and $h$ is the predicted value from the hypothesis function.

$$-y \log(h) - (1 - y) \log(1 - h)$$
The first term handles the case where the correct answer is positive ($y = 1$): it becomes 0 when the prediction $h$ is 1 and grows toward $\infty$ as $h$ approaches 0. The second term handles the negative case ($y = 0$) in the same way.
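To see this numerically, here is a standalone sketch evaluating the per-example cost for a few hypothetical predicted values $h$:

```python
import numpy as np

def per_example_cost(y, h):
    # -y*log(h) - (1-y)*log(1-h): near 0 for a confident correct prediction,
    # growing toward infinity as the prediction approaches the wrong label
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

for h in [0.99, 0.5, 0.01]:
    print('y=1, h={}: cost={:.3f}'.format(h, per_example_cost(1, h)))
    print('y=0, h={}: cost={:.3f}'.format(h, per_example_cost(0, h)))
```

A correct, confident prediction (y=1, h=0.99) costs about 0.01, while a confidently wrong one (y=1, h=0.01) costs about 4.6 — exactly the asymmetry the $-\log()$ shape provides.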
The implementation of the objective function is also sped up with matrix operations. Implemented naively, you would compute the predicted value $h$ for each review, evaluate $-y \log(h) - (1 - y) \log(1 - h)$, and repeat this over all reviews while summing; written as a loop, this is slow.
First, create a matrix $X$ holding the features extracted from all reviews. Its height is the number of reviews and its width is the number of features + 1. The +1 is because, as written in the vectorization section, the hypothesis function always needs the element fixed at 1 that corresponds to $x_0$ in order to use matrix operations.
This time there are 10,662 reviews and 3,227 features, so $X$ looks like the following. This is just an image; the values are placeholders.
$\theta$ has the form shown in the vectorization section.
Now a single computation of $X\theta$ yields the predicted values $H$ for all reviews.
In the same way, turn the correct labels into a matrix.
Now $-y \log(h) - (1 - y) \log(1 - h)$ can be computed as a matrix operation, handling all reviews at once. cost() implements it in this form.
Note that the matrices $X$ and $Y$ are created by create_training_set().
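As a toy illustration (the data here is made up and far smaller than the real 10,662 × 3,228 matrix), the same hypothesis() and cost() as in main.py compute the cost of all rows with one matrix product:

```python
import numpy as np

def hypothesis(data_x, theta):
    return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))

def cost(data_x, theta, data_y):
    m = data_y.size
    h = hypothesis(data_x, theta)  # predictions for all reviews at once
    return 1 / m * np.sum(-data_y * np.log(h)
                          - (np.ones(m) - data_y) * np.log(np.ones(m) - h))

# toy data: 4 "reviews", 3 "features" plus the bias column of ones
data_x = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [1, 1, 1, 0]], dtype=np.float64)
data_y = np.array([1, 0, 0, 1], dtype=np.float64)

theta = np.zeros(4)
print(cost(data_x, theta, data_y))  # log(2) = 0.6931..., since every prediction is 0.5
```

With all-zero $\theta$ every prediction is 0.5, giving cost $\log 2 \approx 0.6931$ — the same value as the "Start learning" line in the execution result above.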
Next comes the part that adjusts $\theta$: predict with the hypothesis function, compute the difference from the correct answers with the objective function, and repeatedly adjust $\theta$ so that the value becomes smaller. For this adjustment I used the steepest descent method, which examines the gradient of the objective function at the current $\theta$ and adjusts each $\theta$ in the direction that decreases the value of the objective function.
Below is an image of the case where there is only one $\theta$.
First, check the gradient at position ① for the current $\theta$. The examined gradient is shown as a straight line in the figure (the length of this line corresponds to the learning rate described later). Then, assuming the objective function decreases along that line, change $\theta$ to the position at the far end of the line. This completes the first adjustment: $\theta$ takes a new value and the current position moves to ②. The second time, check the gradient at position ②. Repeating this process advances through ③ and ④, and the objective function is minimized by settling $\theta$ where the gradient becomes flat.
In the explanation of the objective function above, I wrote that a non-convex function would be a problem. If the function is not convex, the curve in this graph has dents that are not the true minimum; you can get stuck in one of them and fail to bring $\theta$ to the value that minimizes the objective function.
The gradient is obtained by partial differentiation, but I will omit the derivation (or rather, I can't do it myself because I haven't relearned partial differentiation yet ^^;). In the end, $\theta$ can be adjusted with the following formula:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Here $m$ is the number of reviews, $h_\theta(x^{(i)})$ is the predicted value for the $i$-th review computed by the hypothesis function from the features $x^{(i)}$ extracted from that review, $y^{(i)}$ is the correct label of the $i$-th review, $x_j^{(i)}$ is the $j$-th feature value extracted from the $i$-th review, and $\alpha$ is a parameter called the learning rate, described later.
The computation of the new $\theta$ by this steepest descent method is also sped up with matrix operations instead of looping over the 3,228 values of $\theta$. The gradient computation is implemented in gradient(), and the repeated adjustment of $\theta$ in learn().
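The whole procedure can be sketched on made-up toy data (a standalone sketch using the same functions as main.py; the learning rate of 1.0 is chosen just for this example). Starting from all-zero $\theta$, the cost begins at $\log 2 \approx 0.693$ and decreases as $\theta$ is adjusted:

```python
import numpy as np

def hypothesis(data_x, theta):
    return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))

def cost(data_x, theta, data_y):
    m = data_y.size
    h = hypothesis(data_x, theta)
    return 1 / m * np.sum(-data_y * np.log(h) - (1 - data_y) * np.log(1 - h))

def gradient(data_x, theta, data_y):
    m = data_y.size
    # vectorized gradient: one matrix product instead of a loop over theta
    return 1 / m * (hypothesis(data_x, theta) - data_y).dot(data_x)

# made-up toy data: 4 examples, bias column of ones + 3 features
data_x = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [1, 1, 1, 0]], dtype=np.float64)
data_y = np.array([1, 0, 0, 1], dtype=np.float64)

theta = np.zeros(4)
costs = []
for i in range(100):
    theta -= 1.0 * gradient(data_x, theta, data_y)  # steepest descent step, alpha = 1.0
    costs.append(cost(data_x, theta, data_y))

print(costs[0], costs[-1])  # the cost decreases as theta is adjusted
```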
The learning rate $\alpha$ used when adjusting $\theta$ corresponds to the length of the gradient line in the steepest-descent figure above. Increasing it increases the adjustment to $\theta$, and the objective function also decreases by larger steps. But if it is too large, it overshoots the minimum of the objective function; the objective function then grows and the process never converges. If it is too small, the process heads steadily toward the minimum but takes a very long time. This takes trial and error.
How many times should this adjustment of $\theta$ be repeated before training is finished? This also takes trial and error. The objective function shrinks rapidly at first, but the pace slows more and more. As a general guideline, I was taught to stop when the adjustment to $\theta$ becomes as small as about $10^{-3}$. In this implementation, that adjustment amount is computed in learn() and displayed as E.
After several rounds of trial and error, I settled on a learning rate of 6.0 and 1,000 iterations. E at the 1,000th adjustment was about 0.001 and the value of the objective function barely changed anymore, so it seemed fine to stop around there. By the way, training took about one minute on my computer.
Writing this article made me realize how hard it is to explain such complicated content plainly without fully understanding it myself ^^; As I wrote in the recommendations for studying machine learning in Problem 70 (http://qiita.com/segavvy/items/0e91fe02088b875a386a), please start from the excellent teaching materials and articles out there.
That's all for the 74th knock. If you find any mistakes, I would appreciate it if you could point them out.