This is a record of my attempt at Language Processing 100 Knock 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
In this chapter, we work on the task of classifying sentences as positive or negative (polarity analysis) using the sentence polarity dataset v1.0 of Movie Review Data published by Bo Pang and Lillian Lee.
> Learn the logistic regression model using the features extracted in Problem 72.
main.py
```python
# coding: utf-8
import codecs
import snowballstemmer
import numpy as np

fname_sentiment = 'sentiment.txt'
fname_features = 'features.txt'
fname_theta = 'theta.npy'
fencoding = 'cp1252'  # Windows-1252

learn_alpha = 6.0   # learning rate
learn_count = 1000  # number of learning iterations

stemmer = snowballstemmer.stemmer('english')

# List of stop words, taken in CSV form from http://xpo6.com/list-of-english-stop-words/
stop_words = (
    'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,'
    'as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,'
    'either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,'
    'him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,'
    'likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,'
    'on,only,or,other,our,own,rather,said,say,says,she,should,since,so,'
    'some,than,that,the,their,them,then,there,these,they,this,tis,to,too,'
    'twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,'
    'will,with,would,yet,you,your').lower().split(',')


def is_stopword(str):
    '''Determines whether the given word is a stop word.
    Case-insensitive.

    Return value:
    True for a stop word, False otherwise
    '''
    return str.lower() in stop_words


def hypothesis(data_x, theta):
    '''Hypothesis function
    Predicts data_y from data_x using theta.

    Return value:
    Matrix of predicted values
    '''
    return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))


def cost(data_x, theta, data_y):
    '''Objective function
    Calculates the difference between the predictions for data_x
    and the correct answers data_y.

    Return value:
    Difference between the predictions and the correct answers
    '''
    m = data_y.size                # number of data points
    h = hypothesis(data_x, theta)  # matrix of predicted values for data_y
    j = 1 / m * np.sum(-data_y * np.log(h) -
                       (np.ones(m) - data_y) * np.log(np.ones(m) - h))
    return j


def gradient(data_x, theta, data_y):
    '''Gradient computation for steepest descent

    Return value:
    Gradient matrix for theta
    '''
    m = data_y.size                # number of data points
    h = hypothesis(data_x, theta)  # matrix of predicted values for data_y
    grad = 1 / m * (h - data_y).dot(data_x)
    return grad


def extract_features(data, dict_features):
    '''Extracts features from a sentence.
    Extracts the features contained in dict_features from the text and
    returns a matrix with 1 at the position dict_features['(feature)'].
    The first element is fixed at 1, for the weight that does not
    correspond to any feature.

    Return value:
    Matrix with 1 in the first element and at (position of each matching feature + 1)
    '''
    data_one_x = np.zeros(len(dict_features) + 1, dtype=np.float64)
    data_one_x[0] = 1  # first element fixed at 1, for the weight not tied to any feature

    for word in data.split(' '):

        # remove surrounding whitespace
        word = word.strip()

        # remove stop words
        if is_stopword(word):
            continue

        # stemming
        word = stemmer.stemWord(word)

        # get the index of the feature and set the corresponding element to 1
        try:
            data_one_x[dict_features[word]] = 1
        except KeyError:
            pass  # ignore features not present in dict_features

    return data_one_x


def load_dict_features():
    '''Reads features.txt and creates a dictionary that converts features to indexes.
    Index values are 1-based and match the line numbers in features.txt.

    Return value:
    Dictionary that converts features to indexes
    '''
    with codecs.open(fname_features, 'r', fencoding) as file_in:
        return {line.strip(): i for i, line in enumerate(file_in, start=1)}


def create_training_set(sentiments, dict_features):
    '''Creates the matrix to learn from and the matrix of polarity labels
    from the correct data sentiments.
    The size of the matrix to learn from is (number of reviews) x (number of features + 1).
    Each column is 1 if the review has the corresponding feature, 0 otherwise.
    The column index of a feature is determined by dict_features['(feature)'].
    The first column is always 1, for learning the weight that does not
    correspond to any feature.
    Features that do not exist in dict_features are ignored.
    The size of the polarity label matrix is (number of reviews) x 1,
    with 1 for positive content and 0 for negative content.

    Return value:
    Matrix to learn from, polarity label matrix
    '''

    # initialize matrices with 0
    data_x = np.zeros([len(sentiments), len(dict_features) + 1], dtype=np.float64)
    data_y = np.zeros(len(sentiments), dtype=np.float64)

    for i, line in enumerate(sentiments):

        # feature extraction
        data_x[i] = extract_features(line[3:], dict_features)

        # set the polarity label
        if line[0:2] == '+1':
            data_y[i] = 1

    return data_x, data_y


def learn(data_x, data_y, alpha, count):
    '''Learns the logistic regression model.

    Return value:
    Trained theta
    '''
    theta = np.zeros(data_x.shape[1])
    c = cost(data_x, theta, data_y)
    print('\tStart learning\tcost:{}'.format(c))

    for i in range(1, count + 1):
        grad = gradient(data_x, theta, data_y)
        theta -= alpha * grad

        # calculate the cost and the maximum adjustment of theta
        # and show progress (once every 100 iterations)
        if i % 100 == 0:
            c = cost(data_x, theta, data_y)
            e = np.max(np.absolute(alpha * grad))
            print('\tLearning(#{})\tcost:{}\tE:{}'.format(i, c, e))

    c = cost(data_x, theta, data_y)
    e = np.max(np.absolute(alpha * grad))
    print('\tLearning completed(#{})\tcost:{}\tE:{}'.format(i, c, e))

    return theta


# read the feature dictionary
dict_features = load_dict_features()

# create the matrix to learn from and the polarity label matrix
with codecs.open(fname_sentiment, 'r', fencoding) as file_in:
    data_x, data_y = create_training_set(list(file_in), dict_features)

# learning
print('Learning rate:{}\tNumber of learning iterations:{}'.format(learn_alpha, learn_count))
theta = learn(data_x, data_y, alpha=learn_alpha, count=learn_count)

# save the result
np.save(fname_theta, theta)
```
Execution result
```
Learning rate: 6.0	Number of learning iterations: 1000
	Start learning	cost:0.6931471805599453
	Learning(#100)	cost:0.4809284917412944	E:0.006248170735186127
	Learning(#200)	cost:0.43188679850114775	E:0.003599155234198481
	Learning(#300)	cost:0.4043113376254009	E:0.002616675715766214
	Learning(#400)	cost:0.38547454091328076	E:0.0020805226234380772
	Learning(#500)	cost:0.37135664408713015	E:0.0017952496476012821
	Learning(#600)	cost:0.36017505743644285	E:0.0015873326040173347
	Learning(#700)	cost:0.35098616931062043	E:0.0014288227472357999
	Learning(#800)	cost:0.343231725184532	E:0.0013037591670736948
	Learning(#900)	cost:0.33655507220582787	E:0.0012023865948793643
	Learning(#1000)	cost:0.33071511988186225	E:0.0011184180264631118
	Learning completed(#1000)	cost:0.33071511988186225	E:0.0011184180264631118
```
The trained $\theta$ is output to "theta.npy". The file has been uploaded to GitHub.
As preparation, you need to install NumPy, a library for fast matrix operations. Fortunately, it comes bundled with Anaconda, so I could use it as is without doing anything. The official site is here.
The reason fast matrix operations with NumPy are needed is that machine learning involves a huge amount of computation; unless you rewrite the logic as matrix operations using a technique called vectorization, training takes an extremely long time.
For example, in the following formula used for learning, which appeared in Problem 72, $n$ becomes 3,227 because 3,227 features were extracted:

$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_{3227} x_{3227}$$

In this case, calculating $y$ once takes 3,227 multiplications and 3,227 additions. The straightforward approach is to put $\theta_0$–$\theta_{3227}$ and $x_1$–$x_{3227}$ into lists and loop 3,227 times with a for statement, multiplying each pair and adding the result to $y$ — but since this formula is evaluated many times during training, that would take a considerable amount of time.
Therefore, instead of keeping $\theta_0$–$\theta_{3227}$ and $x_1$–$x_{3227}$ in lists, we turn them into matrices and take their inner product. The inner product multiplies the corresponding elements of the two matrices and sums the results, so the answer is the same. Since computing an inner product with NumPy is faster than repeating multiplications and additions over lists with a for statement, this speeds up training. A matrix with one row or one column is called a vector, so this process of turning things into matrices is called "vectorization".
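As a minimal illustration (a standalone sketch with made-up numbers, not part of main.py), a Python loop and a NumPy inner product give the same answer, but `np.dot` runs in optimized native code:

```python
import numpy as np

# hypothetical small example: 5 "features" instead of 3,227
theta = np.array([0.5, -1.0, 2.0, 0.25, 3.0])
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])

# loop version: multiply element by element and accumulate
y_loop = 0.0
for t, xi in zip(theta, x):
    y_loop += t * xi

# vectorized version: a single inner product
y_dot = theta.dot(x)

print(y_loop, y_dot)  # both give 2.75
```

On arrays with thousands of elements, the `dot` version is dramatically faster than the loop.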
There is one caveat to vectorization: $\theta$ has 3,228 elements in total, from $\theta_0$ to $\theta_{3227}$, while $x$ has one fewer, from $x_1$ to $x_{3227}$. The inner product cannot be computed unless the sizes match. So we introduce $x_0$, rewrite the formula as follows, and implement it so that the value of $x_0$ is always 1. Note that $x$ now has one more element than the number of features.

$$y = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_{3227} x_{3227} \quad (x_0 = 1)$$
The matrixed $X$ and $\theta$ look like this (the values are just placeholders). Now $y$ can be computed as $X\theta$ with a single matrix operation.
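Here is a tiny sketch of that idea (made-up values, 3 features instead of 3,227): prepend a column of ones for $x_0$, and a single product $X\theta$ yields the prediction for every row at once:

```python
import numpy as np

# made-up example: 2 rows of 3 features each
features = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0]])

# prepend x0 = 1 to every row so that theta_0 participates in the product
X = np.hstack([np.ones((features.shape[0], 1)), features])

theta = np.array([0.5, 1.0, -2.0, 3.0])  # theta_0 .. theta_3

y = X.dot(theta)  # one matrix product -> predictions for all rows
print(y)  # [4.5, 1.5]
```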
The documentation on manipulating matrices with NumPy is in English, but you can get a rough idea by looking through the Quickstart tutorial. There are also many explanatory articles online, so please refer to them.
For more on vectorization, junichiro's "Making Machine Learning Practical Level in One Month #6 (Octave Practical Edition)" and ken5scal's "[For beginners] Introduction to vectorization in machine learning" were easy to understand.
You might wonder: even after vectorizing and letting NumPy do the matrix computation, the arithmetic performed internally should be the same, so why is it faster? NumPy lays out the matrix elements in memory so that they can be processed at C/C++ speed, which makes it considerably faster than computing in a Python loop.
In addition, many parts of a matrix operation can be computed independently of each other, so they can be computed in parallel rather than sequentially. This is why, depending on the library used, you can also expect a speedup from a GPU.
Originally, the GPU was a chip dedicated to graphics. Unlike CPU cores, which are designed to execute sophisticated tasks sequentially at high speed, GPU cores specialize in executing a huge number of pixel-calculation tasks for screen display in parallel. Today's GPUs are no longer graphics-only and can also be used for computational tasks such as matrix operations. Modern GPUs have thousands of cores, so for tasks a GPU core can handle, such as matrix operations, they can process far more in parallel than a CPU with its limited number of cores, and processing time can be expected to drop significantly.
In machine-learning implementations, vectorization seems to be a very important technique for speeding up processing.
The prediction function $h_\theta(x)$ used in the examples so far, which computes the predicted value of $y$, is called the hypothesis function.
This time, I modify this formula a little before using it. As I mentioned briefly in Problem 72 (http://qiita.com/segavvy/items/6695f94c28126607227b), if you use it as is, the predicted value of $y$ becomes hard to handle when it grows larger than 1 (which indicates positive) or smaller than 0 (which indicates negative).
In Problem 70, I wrote that machine learning praises the model when it predicts correctly and scolds it when it is wrong, adjusting $\theta$ accordingly. In reality, the difference from the correct answer is computed by the objective function described later, and $\theta$ is adjusted so that the difference becomes smaller. If the content is positive, the correct label for $y$ is 1. But with the formula as is, the predicted value might come out as 1, 10, or 100; with 10 or 100 the prediction is correct in spirit, yet the difference from the correct label 1 becomes large. So we use a convenient function called the sigmoid function.
$$g(z) = \frac{1}{1 + e^{-z}}$$

This function converts any given value into a value between 0 and 1, and the order of the given values is preserved after conversion. If you put the hypothesis function $h_\theta(x)$ into the $z$ part of this function, the result always falls between 0 and 1.
Plotted as a graph, it looks like this (I drew it with Grapher, which comes standard on the Mac — quite handy, though I couldn't figure out how to label the axes, so I touched up only that part).
When $z$ moves around 0, the output of the sigmoid function swings widely around 0.5, whereas once $z$ becomes larger or smaller than a certain level, the output stays near 1 or near 0 and barely changes.
Thanks to this property, even if the hypothesis function yields 1, 10, or 100, putting it into $z$ gives a value close to 1, so the difference from the correct label stays small.
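A quick standalone check of this behavior (not part of main.py):

```python
import numpy as np

def sigmoid(z):
    # converts any value into the range (0, 1), preserving order
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10, -1, 0, 1, 10, 100]:
    print(z, sigmoid(z))

# around z = 0 the output moves quickly near 0.5;
# for z = 10 or 100 it is already very close to 1
```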
So this time I used a hypothesis function that incorporates this sigmoid function; hypothesis() is its implementation. It takes the matrices $X$ and $\theta$ and computes the matrix of predicted values.
Next, I explain the objective function, which measures how accurate the predictions are. Based on the result of this objective function, $\theta$ is adjusted so that the predictions become more accurate.
Whether a prediction is correct can be judged by comparing the predicted value computed by the hypothesis function with the correct label: a small difference means high prediction accuracy, and a large difference means low accuracy.
Normally, instead of simply summing the differences, you sum their squares. Squaring frees you from worrying about the sign of the difference and amplifies larger differences, which makes $\theta$ easier to adjust.
However, in a prediction whose result is either 1 (positive) or 0 (negative), as here, the difference is at most 1, and with squared error the objective function is no longer a convex function, so it becomes difficult to adjust $\theta$ toward the minimum of the objective function.
So, exploiting the property of $-\log()$, we use the following objective function, which becomes 0 when the prediction matches the answer and approaches $\infty$ when it does not. Here $y$ is the correct label (1 for positive, 0 for negative) and $h$ is the predicted value from the hypothesis function.

$$-y \log(h) - (1 - y) \log(1 - h)$$
The first term handles the case where the correct answer is positive ($y = 1$): it becomes 0 when the prediction $h$ is 1 and grows toward $\infty$ as $h$ approaches 0. The second term handles the negative case ($y = 0$) in the same way.
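To see this numerically, here is a standalone sketch evaluating the per-example cost for a few hypothetical predicted values $h$:

```python
import numpy as np

def per_example_cost(y, h):
    # -y*log(h) - (1-y)*log(1-h): near 0 for a confident correct prediction,
    # growing toward infinity as the prediction approaches the wrong label
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

for h in [0.99, 0.5, 0.01]:
    print('y=1, h={}: cost={:.3f}'.format(h, per_example_cost(1, h)))
    print('y=0, h={}: cost={:.3f}'.format(h, per_example_cost(0, h)))
```

A correct, confident prediction (y=1, h=0.99) costs about 0.01, while a confidently wrong one (y=1, h=0.01) costs about 4.6 — exactly the asymmetry the $-\log()$ shape provides.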
The implementation of the objective function is also sped up with matrix operations. Implemented naively, you would compute the predicted value $h$ for each review, evaluate $-y \log(h) - (1 - y) \log(1 - h)$, and repeat this over all reviews while summing; written as a loop, this is slow.
First, create a matrix $X$ holding the features extracted from all reviews. Its height is the number of reviews and its width is the number of features + 1. The +1 is because, as written in the vectorization section, the hypothesis function always needs the element fixed at 1 that corresponds to $x_0$ in order to use matrix operations.
This time there are 10,662 reviews and 3,227 features, so $X$ looks like the following. This is just an image; the values are placeholders.
$\theta$ has the form shown in the vectorization section.
Now a single computation of $X\theta$ yields the predicted values $H$ for all reviews.
In the same way, turn the correct labels into a matrix.
Now $-y \log(h) - (1 - y) \log(1 - h)$ can be computed as a matrix operation, handling all reviews at once. cost() implements it in this form.
Note that the matrices $X$ and $Y$ are created by create_training_set().
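As a toy illustration (the data here is made up and far smaller than the real 10,662 × 3,228 matrix), the same hypothesis() and cost() as in main.py compute the cost of all rows with one matrix product:

```python
import numpy as np

def hypothesis(data_x, theta):
    return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))

def cost(data_x, theta, data_y):
    m = data_y.size
    h = hypothesis(data_x, theta)  # predictions for all reviews at once
    return 1 / m * np.sum(-data_y * np.log(h)
                          - (np.ones(m) - data_y) * np.log(np.ones(m) - h))

# toy data: 4 "reviews", 3 "features" plus the bias column of ones
data_x = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [1, 1, 1, 0]], dtype=np.float64)
data_y = np.array([1, 0, 0, 1], dtype=np.float64)

theta = np.zeros(4)
print(cost(data_x, theta, data_y))  # log(2) = 0.6931..., since every prediction is 0.5
```

With all-zero $\theta$ every prediction is 0.5, giving cost $\log 2 \approx 0.6931$ — the same value as the "Start learning" line in the execution result above.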
Next comes the part that adjusts $\theta$: predict with the hypothesis function, compute the difference from the correct answers with the objective function, and repeatedly adjust $\theta$ so that the value becomes smaller. For this adjustment I used the steepest descent method, which examines the gradient of the objective function at the current $\theta$ and adjusts each $\theta$ in the direction that decreases the value of the objective function.
Below is an image of the case where there is only one $\theta$.
First, check the gradient at position ① for the current $\theta$. The examined gradient is shown as a straight line in the figure (the length of this line corresponds to the learning rate described later). Then, assuming the objective function decreases along that line, change $\theta$ to the position at the far end of the line. This completes the first adjustment: $\theta$ takes a new value and the current position moves to ②. The second time, check the gradient at position ②. Repeating this process advances through ③ and ④, and the objective function is minimized by settling $\theta$ where the gradient becomes flat.
In the explanation of the objective function above, I wrote that a non-convex function would be a problem. If the function is not convex, the curve in this graph has dents that are not the true minimum; you can get stuck in one of them and fail to bring $\theta$ to the value that minimizes the objective function.
The gradient is obtained by partial differentiation, but I will omit the derivation (or rather, I can't do it myself because I haven't relearned partial differentiation yet ^^;). In the end, $\theta$ can be adjusted with the following formula:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Here $m$ is the number of reviews, $h_\theta(x^{(i)})$ is the predicted value for the $i$-th review computed by the hypothesis function from the features $x^{(i)}$ extracted from that review, $y^{(i)}$ is the correct label of the $i$-th review, $x_j^{(i)}$ is the $j$-th feature value extracted from the $i$-th review, and $\alpha$ is a parameter called the learning rate, described later.
The computation of the new $\theta$ by this steepest descent method is also sped up with matrix operations instead of looping over the 3,228 values of $\theta$. The gradient computation is implemented in gradient(), and the repeated adjustment of $\theta$ in learn().
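The whole procedure can be sketched on made-up toy data (a standalone sketch using the same functions as main.py; the learning rate of 1.0 is chosen just for this example). Starting from all-zero $\theta$, the cost begins at $\log 2 \approx 0.693$ and decreases as $\theta$ is adjusted:

```python
import numpy as np

def hypothesis(data_x, theta):
    return 1.0 / (1.0 + np.exp(-data_x.dot(theta)))

def cost(data_x, theta, data_y):
    m = data_y.size
    h = hypothesis(data_x, theta)
    return 1 / m * np.sum(-data_y * np.log(h) - (1 - data_y) * np.log(1 - h))

def gradient(data_x, theta, data_y):
    m = data_y.size
    # vectorized gradient: one matrix product instead of a loop over theta
    return 1 / m * (hypothesis(data_x, theta) - data_y).dot(data_x)

# made-up toy data: 4 examples, bias column of ones + 3 features
data_x = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [1, 1, 1, 0]], dtype=np.float64)
data_y = np.array([1, 0, 0, 1], dtype=np.float64)

theta = np.zeros(4)
costs = []
for i in range(100):
    theta -= 1.0 * gradient(data_x, theta, data_y)  # steepest descent step, alpha = 1.0
    costs.append(cost(data_x, theta, data_y))

print(costs[0], costs[-1])  # the cost decreases as theta is adjusted
```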
The learning rate $\alpha$ used when adjusting $\theta$ corresponds to the length of the gradient line in the steepest-descent figure above. Increasing it increases the adjustment to $\theta$, and the objective function also decreases by larger steps. But if it is too large, it overshoots the minimum of the objective function; the objective function then grows and the process never converges. If it is too small, the process heads steadily toward the minimum but takes a very long time. This takes trial and error.
How many times should this adjustment of $\theta$ be repeated before training is finished? This also takes trial and error. The objective function shrinks rapidly at first, but the pace slows more and more. As a general guideline, I was taught to stop when the adjustment to $\theta$ becomes as small as about $10^{-3}$. In this implementation, that adjustment amount is computed in learn() and displayed as E.
After several rounds of trial and error, I settled on a learning rate of 6.0 and 1,000 iterations. E at the 1,000th adjustment was about 0.001 and the value of the objective function barely changed anymore, so it seemed fine to stop around there. By the way, training took about one minute on my computer.
Writing this article made me realize how hard it is to explain such complicated content plainly without fully understanding it myself ^^; As I wrote in the recommendations for studying machine learning in Problem 70 (http://qiita.com/segavvy/items/0e91fe02088b875a386a), please start from the excellent teaching materials and articles out there.
That's all for the 74th knock. If you find any mistakes, I would appreciate it if you could point them out.