This is the article for day 15 of the Nextremer Advent Calendar 2016.
2016 is almost over, but this year had no shortage of news. Wasn't Donald Trump's victory in the presidential election one of the most impactful events?
I see him on TV all the time, but I have never read one of his speeches properly, so I decided to read one. However, my English is not great, so I will use a summarization algorithm to reduce it to about three lines.
There are two main approaches to summarization: abstractive (generative) summarization and extractive summarization.
Abstractive summarization is very difficult, and most current research focuses on extractive methods. LexRank, implemented here, is also an extractive method: it extracts the important sentences from a document.
LexRank is a summarization algorithm inspired by PageRank. PageRank is an algorithm devised by Google founders L. Page and S. Brin to determine the importance of web pages. It determines a page's importance by modeling the link structure of the web as the following graph.
Node: Web page
Edge: Link
Node importance:
1. Pages linked from many pages are important pages
2. Pages linked from important pages are important pages
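For reference, this recursive notion of importance is usually written with a damping factor $d$ (around 0.85). The damping factor is not discussed in this article, but it is part of the standard PageRank formulation:

```math
PR(u) = \frac{1-d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```

Here $B_u$ is the set of pages linking to $u$, $L(v)$ is the number of outgoing links from $v$, and $N$ is the total number of pages.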
Like PageRank, LexRank also captures sentences as a graph.
Node: sentence
Edge: similarity between two sentences (1 or 0)
Node importance:
1. Sentences that are similar to many sentences are important sentences
2. Sentences similar to important sentences are important sentences
The similarity here is simply a measure of how many words the two sentences have in common.
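As a rough picture of the graph construction, here is a toy sketch (the sentences are made up) that connects two sentences whenever they share a word. The actual LexRank implementation below uses the idf-modified cosine similarity with a threshold instead of raw word overlap.

```python
# Toy illustration (hypothetical data): connect two sentences with an edge
# when they share at least one word.
sentences = [
    ["we", "will", "win"],
    ["we", "will", "work", "hard"],
    ["thank", "you", "all"],
]

adjacency = [
    [1 if set(s1) & set(s2) else 0 for s2 in sentences]
    for s1 in sentences
]
print(adjacency)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```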
LexRank evaluates importance using only this graph structure, yet when it was proposed in 2004 it achieved the best accuracy on the DUC 2004 benchmark. It is very interesting that text can be summarized with high accuracy without any semantic analysis.
The LexRank algorithm is as follows.
Input: Array S of n sentences, cosine similarity threshold t
Output: Array L of LexRank scores for each sentence

Array CosineMatrix[n][n];
Array Degree[n];
Array L[n];

for i <- 1 to n do
    for j <- 1 to n do
        CosineMatrix[i][j] = idf-modified-cosine(S[i], S[j]);
        if CosineMatrix[i][j] > t then
            CosineMatrix[i][j] = 1;
            Degree[i]++;
        end
        else
            CosineMatrix[i][j] = 0;
        end
    end
end

for i <- 1 to n do
    for j <- 1 to n do
        CosineMatrix[i][j] = CosineMatrix[i][j] / Degree[i];
    end
end

L = PowerMethod(CosineMatrix, n, ε);
return L
(Quote: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Algorithm 3)
`idf-modified-cosine` is the similarity calculation between two sentences, and `PowerMethod` is the PageRank-style computation.
Now, let's actually implement this algorithm in Python.
import math
import numpy


def lex_rank(sentences, n, t):
    """
    Summarize a document with LexRank.
    @param sentences: list
        Document (list of sentences like [[w1, w2, w3], [w1, w3, w4, w5], ..])
    @param n: int
        Number of sentences in the document
    @param t: float
        Cosine similarity threshold (e.g. 0.1)
    @return : list
        Pairs of (sentence, LexRank score)
    """
    cosine_matrix = numpy.zeros((n, n))
    degrees = numpy.zeros((n,))

    # 1. Build the adjacency matrix
    for i in range(n):
        for j in range(n):
            cosine_matrix[i][j] = idf_modified_cosine(sentences, sentences[i], sentences[j])
            if cosine_matrix[i][j] > t:
                cosine_matrix[i][j] = 1
                degrees[i] += 1
            else:
                cosine_matrix[i][j] = 0

    # 2. Compute LexRank
    for i in range(n):
        for j in range(n):
            cosine_matrix[i][j] = cosine_matrix[i][j] / degrees[i]

    ratings = power_method(cosine_matrix, n, 1e-6)  # tolerance ε
    return zip(sentences, ratings)
This program is organized by function; each block is explained below.
The adjacency matrix is a matrix representation of a graph: each element is 1 if there is an edge between the two nodes and 0 otherwise. In LexRank, two sentences are connected by an edge when their similarity exceeds the threshold t.
The implementation of similarity is as follows.
def idf_modified_cosine(sentences, sentence1, sentence2):
    """
    Calculate the cosine similarity between two sentences.
    @param sentence1: list
        Sentence 1 (word list like [w1, w2, w3])
    @param sentence2: list
        Sentence 2 (word list like [w1, w2, w3])
    @param sentences: list
        Document (list of sentences like [[w1, w2, w3], [w1, w3, w4, w5], ..])
    @return : float
        Cosine similarity
    """
    tf1 = compute_tf(sentence1)
    tf2 = compute_tf(sentence2)
    idf_metrics = compute_idf(sentences)
    return cosine_similarity(sentence1, sentence2, tf1, tf2, idf_metrics)
TF is the frequency of a word within a sentence (the more frequent, the more important). IDF is the inverse document frequency: the fewer sentences a word appears in across the document, the more important it is. Cosine similarity is the cosine of the angle between the two sentence vectors.
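Written out as formulas matching the code below (which normalizes TF by the most frequent word in the sentence):

```math
\mathrm{tf}_{w,s} = \frac{\mathrm{count}(w, s)}{\max_{w'} \mathrm{count}(w', s)},\qquad
\mathrm{idf}_w = \log\frac{N}{1 + n_w}
```

where $N$ is the number of sentences and $n_w$ the number of sentences containing $w$. The idf-modified cosine between sentences $x$ and $y$ is then

```math
\text{idf-modified-cosine}(x, y) =
\frac{\sum_{w \in x \cap y} \mathrm{tf}_{w,x}\,\mathrm{tf}_{w,y}\,(\mathrm{idf}_w)^2}
{\sqrt{\sum_{w \in x} (\mathrm{tf}_{w,x}\,\mathrm{idf}_w)^2}\;\sqrt{\sum_{w \in y} (\mathrm{tf}_{w,y}\,\mathrm{idf}_w)^2}}
```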
Many people are probably familiar with these, so I will just post the implementations briefly.
from collections import Counter


def compute_tf(sentence):
    """
    Calculate TF.
    @param sentence: list
        Sentence (word list like [w1, w2, w3])
    @return : dict
        TF value per word
    """
    tf_values = Counter(sentence)
    tf_metrics = {}
    max_tf = find_tf_max(tf_values)

    for term, tf in tf_values.items():
        tf_metrics[term] = tf / max_tf

    return tf_metrics
def find_tf_max(terms):
    """
    Find the maximum word frequency in a sentence.
    @param terms: Counter
        Word frequencies
    @return : int
        Maximum word frequency
    """
    return max(terms.values()) if terms else 1
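For example, with a made-up sentence, the TF values come out normalized by the most frequent word:

```python
# Hypothetical example: TF is normalized by the most frequent word.
print(compute_tf(["great", "great", "people"]))
# {'great': 1.0, 'people': 0.5}
```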
def compute_idf(sentences):
    """
    Calculate the IDF value of each word in the document.
    @param sentences: list
        Document (list of sentences like [[w1, w2, w3], [w1, w3, w4, w5], ..])
    @return: dict
        IDF value per word
    """
    idf_metrics = {}
    sentences_count = len(sentences)

    for sentence in sentences:
        for term in sentence:
            if term not in idf_metrics:
                n_j = sum(1 for s in sentences if term in s)
                idf_metrics[term] = math.log(sentences_count / (1 + n_j))

    return idf_metrics
def cosine_similarity(sentence1, sentence2, tf1, tf2, idf_metrics):
    """
    Calculate cosine similarity.
    @param sentence1: list
        Sentence 1 (word list like [w1, w2, w3])
    @param sentence2: list
        Sentence 2 (word list like [w1, w2, w3])
    @param tf1: dict
        TF values of sentence 1
    @param tf2: dict
        TF values of sentence 2
    @param idf_metrics: dict
        IDF values of the document
    @return : float
        Cosine similarity
    """
    unique_words1 = set(sentence1)
    unique_words2 = set(sentence2)
    common_words = unique_words1 & unique_words2

    numerator = sum((tf1[t] * tf2[t] * idf_metrics[t] ** 2) for t in common_words)
    denominator1 = sum((tf1[t] * idf_metrics[t]) ** 2 for t in unique_words1)
    denominator2 = sum((tf2[t] * idf_metrics[t]) ** 2 for t in unique_words2)

    if denominator1 > 0 and denominator2 > 0:
        return numerator / (math.sqrt(denominator1) * math.sqrt(denominator2))
    else:
        return 0.0
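As a quick check of these helpers, here is a hypothetical usage example with made-up sentences (the exact value depends on the toy data; it comes out at roughly 0.26 here):

```python
# Hypothetical usage of the helpers above with made-up sentences.
document = [
    ["we", "will", "win"],
    ["we", "will", "work"],
    ["thank", "you"],
    ["good", "night"],
]

similarity = idf_modified_cosine(document, document[0], document[1])
print(similarity)  # roughly 0.26: "we" and "will" are shared, "win"/"work" are not
```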
Finally, elements whose cosine similarity is greater than t are set to 1, which completes the adjacency matrix.
Next, we compute the LexRank from the matrix created here.
From here, the LexRank is calculated from the adjacency matrix built in step 1. First, each element of the adjacency matrix is divided by the node's degree, turning it into a stochastic (transition probability) matrix.
Then, based on this stochastic matrix, we find the stationary distribution of the Markov chain. Because the chain is assumed to be aperiodic and irreducible, the power method is guaranteed to converge: the transposed matrix is repeatedly multiplied by the probability vector until the change falls below ε. The eigenvector obtained at the end is the LexRank score vector.
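In equation form, with the row-normalized matrix $M$ (each row of the adjacency matrix divided by the node's degree) and a uniform initial vector $\mathbf{p}^{(0)}$, the iteration is:

```math
\mathbf{p}^{(k+1)} = M^{\mathsf T}\,\mathbf{p}^{(k)},\qquad
\text{stop when } \lVert \mathbf{p}^{(k+1)} - \mathbf{p}^{(k)} \rVert < \varepsilon
```

The limit satisfies $\mathbf{p} = M^{\mathsf T}\mathbf{p}$, i.e. it is the stationary distribution (dominant eigenvector), and its entries are the LexRank scores.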
The implementation is as follows.
def power_method(cosine_matrix, n, e):
    """
    Run the power method.
    @param cosine_matrix: list
        Stochastic matrix
    @param n: int
        Number of sentences in the document
    @param e: float
        Tolerance ε
    @return: list
        LexRank scores
    """
    transposed_matrix = cosine_matrix.T
    sentences_count = n
    p_vector = numpy.array([1.0 / sentences_count] * sentences_count)
    lambda_val = 1.0

    while lambda_val > e:
        next_p = numpy.dot(transposed_matrix, p_vector)
        lambda_val = numpy.linalg.norm(numpy.subtract(next_p, p_vector))
        p_vector = next_p

    return p_vector
This completes the importance calculation by LexRank. The larger the corresponding ratings value, the more important the sentence.
Now, let's actually try it. We will summarize [Mr. Trump's victory speech from the presidential election](http://www.vox.com/policy-and-politics/2016/11/9/13569124/donald-trump-wins-2016-presidential-election-victory-speech-transcript).
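Before looking at the result, here is a minimal sketch of how the functions above might be wired together. The file name `speech.txt` and the use of nltk for tokenization are my own assumptions, not from the article (nltk's punkt data must be downloaded beforehand):

```python
# Minimal end-to-end sketch (assumptions: speech text saved in speech.txt,
# nltk available for sentence/word tokenization).
from nltk.tokenize import sent_tokenize, word_tokenize

with open("speech.txt") as f:
    raw_sentences = sent_tokenize(f.read())

document = [word_tokenize(s.lower()) for s in raw_sentences]

# Score every sentence, then keep the three with the highest LexRank.
scored = lex_rank(document, len(document), 0.1)
top3 = sorted(scored, key=lambda x: x[1], reverse=True)[:3]
for words, score in top3:
    print(" ".join(words))
```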
# lex_rank(document)
Great people.
Great brothers, sisters, great, unbelievable parents.
We're going to get to work immediately for the American people, and we're going to be doing a job that hopefully you will be so proud of your president.
#Great people.
#Great, incredible parents, great brothers and sisters.
#We will work soon for the American people. We will work in the hope that you will be very proud of the President.
You can really feel the enthusiasm: "I'll do my best as president!"
In this article, we implemented an automatic summarization algorithm and used it to summarize Mr. Trump's speech. LexRank seems useful for text analysis because it lets you treat sentences as a graph. Tomorrow I will continue with something on summarization + translation.
References:

- LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
- miso-belica/sumy
- Document Summarization PART-2: LEXRank (Modified Pagerank) based Document Summarization
- Random walk on graph and Google page rank
- Automatic summarization (Wikipedia)
- Automatic text summarization method using natural language processing