I tried summarizing sentences with summpy

I tried summpy, a text summarization tool published by Recruit Technologies.

summpy https://github.com/recruit-tech/summpy

The environment is Ubuntu 16.04. summpy requires Python 2.7, which is not installed by default here, so I prepare a 2.7 environment with Anaconda.

$ conda create -n 2.7 python=2.7 anaconda

Check that the environment was created properly

$ source activate 2.7
(2.7)$
(2.7)$ conda info -e
# conda environments:
#
base                     /home/croso/anaconda3
2.7                   *  /home/croso/anaconda3/envs/2.7
3.5                      /home/croso/anaconda3/envs/3.5
3.6                      /home/croso/anaconda3/envs/3.6
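
Besides looking at the conda output, the interpreter itself can confirm that the right environment is active. The following is only a small sketch; the helper name is_python27 is made up for this article and is not part of conda or summpy.

```python
# -*- coding: utf-8 -*-
# Hypothetical check: confirm the active interpreter is Python 2.7.
import sys


def is_python27(version_info=sys.version_info):
    # Compare only the major and minor version components.
    return tuple(version_info[:2]) == (2, 7)


print('Python %d.%d, 2.7 environment active: %s'
      % (sys.version_info[0], sys.version_info[1], is_python27()))
```

Run inside the activated environment, the last field should read True; anywhere else it reports False.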

Install mecab-python, because summpy needs either MeCab or janome for morphological analysis

(2.7)$ pip install mecab-python

Then install summpy with pip

(2.7)$ pip install summpy

Create a sample script


# -*- coding: utf-8 -*-
from summpy.lexrank import summarize

text=u'''
The unemployment rate (seasonally adjusted) for September announced by the Ministry of Internal Affairs and Communications was 2.4%, 0.2 points worse than the previous month.

According to a Reuters survey, 2.3% was expected.

The unemployment rate has been below 2.5% since January 2018.

An official at the Ministry of Internal Affairs and Communications summed up: "Although the unemployment rate has risen, it remains at its lowest level in about 26 years, and the employment situation is steadily improving."

The number of employees (seasonally adjusted) was 67.3 million, a decrease of 50,000 from the previous month.

The number of unemployed (same as above) was 1.67 million, an increase of 130,000 from the previous month.

The increase in the number of unemployed people is the first in six months.

Looking at the breakdown, the number of involuntary turnovers was the same as the previous month, but the number of voluntary turnovers (self-convenience) increased by 10,000, and the number of new job seekers increased by 90,000. "The number of people who want to work anew is increasing," he said.

According to the original figures, the number of employees increased by 530,000 from the same month of the previous year to 67.68 million.

It has increased for 81 consecutive months and is the highest since 1953, the earliest year with comparable data.

The employment rate for ages 15-64 was 77.9%, tying the record high.

The active job openings-to-applicants ratio (seasonally adjusted) for September announced by the Ministry of Health, Labor and Welfare was 1.57 times, down from the previous month.

According to a Reuters survey, it was expected to be 1.59 times.
'''

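# Run LexRank; sent_limit=2 asks for a two-sentence result.
sentences, debug_info = summarize(
    text, sent_limit=2
)

# Python 2 print statement; the unicode sentences are encoded as UTF-8 bytes.
for sent in sentences:
    print sent.strip().encode('utf-8')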

Let's analyze an article taken from a news site. sent_limit appears to control how many sentences the result is condensed into. Up to this point I have simply followed summpy's README.md as is.

Running the script produced an error.

"error": "add_edge() takes exactly 3 arguments (4 given)"

When I looked it up, the cause turned out to be a version mismatch with networkx (https://teratail.com/questions/114565), so I pin the versions to match.

(2.7)$ pip install multiqc==1.2
(2.7)$ pip install networkx==1.11

Install multiqc first. Installing multiqc automatically pulls in networkx 2.2, so the environment is not reproduced correctly unless you then overwrite it with networkx 1.11.
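
Because the install order matters, it can help to check which networkx is actually active before running the script. The sketch below is an assumption of mine, not something from summpy: the helper names is_networkx_1x and check_networkx are hypothetical, and the only library feature used is the networkx.__version__ attribute.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: check the installed networkx before running summpy.


def is_networkx_1x(version_string):
    # True when the major version is 1, e.g. "1.11".
    major = version_string.split('.')[0]
    return major.isdigit() and int(major) == 1


def check_networkx():
    # Return a short status message instead of raising.
    try:
        import networkx
    except ImportError:
        return 'networkx is not installed'
    version = networkx.__version__
    if is_networkx_1x(version):
        return 'networkx %s: OK' % version
    return 'networkx %s: run pip install networkx==1.11' % version


print(check_networkx())
```

If the message asks for the downgrade, another package has probably replaced networkx 1.11 after it was installed.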

The resulting summary is below

The number of employees (seasonally adjusted) was 67.3 million, a decrease of 50,000 from the previous month.
The number of unemployed (same as above) was 1.67 million, an increase of 130,000 from the previous month.

What do you think? The summary does convey that the Japanese economy has cooled, since the number of employees fell and the number of unemployed rose. However, I feel it missed what is arguably the main point, the sentence saying the rate was "0.2 points worse than the previous month."

I had misunderstood this too, but summpy does not really "summarize the text". It is more accurate to call it a tool that extracts only the important sentences from a text.
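
To make the "extraction, not summarization" point concrete, here is a minimal sketch of the LexRank idea that the summpy.lexrank module is named after: score each sentence by how similar it is to the others, then return the top-scoring sentences verbatim. This is not summpy's code. The function names, the whitespace tokenization, the raw word counts (instead of tf-idf), the damping factor and the iteration count are all assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative LexRank-style extractor; NOT summpy's implementation.
from __future__ import division
import math
from collections import Counter


def bag_of_words(sentence):
    # Assumption: whitespace tokenization (summpy uses MeCab or janome).
    return Counter(sentence.lower().split())


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def extract_top_sentences(sentences, sent_limit=2, damping=0.85,
                          iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    vectors = [bag_of_words(s) for s in sentences]
    # Similarity graph: edge weight = cosine similarity of two sentences.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # PageRank by power iteration over the weighted graph.
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = sum(scores[i] * sim[i][j] / row_sums[i]
                           for i in range(n) if row_sums[i] > 0)
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    # Pick the best-scoring sentences and return them in original order.
    best = sorted(range(n), key=lambda i: -scores[i])[:sent_limit]
    return [sentences[i] for i in sorted(best)]
```

Every returned item is one of the input sentences unchanged, which is why a key sentence can be dropped entirely if it shares few words with the rest of the article.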

By the way, I suspect few people use MeCab exactly as installed, because without an additional dictionary it is not much use. So I install the following.

https://github.com/neologd/mecab-ipadic-neologd

It is a MeCab dictionary that covers recent words. After installation, the dictionary files should be found in the following directory.

(2.7)$ ls /usr/local/lib/mecab/dic/mecab-ipadic-neologd
char.bin  dicrc  left-id.def  matrix.bin  pos-id.def  rewrite.def  right-id.def  sys.dic  unk.dic

I wanted the dictionary to be loaded automatically, but summpy has no such option, so I edited part of its source to make it work. (It might be more proper to modify mecab-python instead ...)

(2.7)$ vi ~/anaconda3/envs/2.7/lib/python2.7/site-packages/summpy/misc/mecab_segmenter.py

On line 8, replace

_mecab = MeCab.Tagger()

with

_mecab = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

This should make the results a little more accurate ...
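
A hard-coded path breaks on machines where the dictionary is missing, so one option is to pass -d only when the directory exists. This is just a sketch: tagger_args and NEOLOGD_DIR are names I made up, and the commented-out line shows how it might be wired into mecab_segmenter.py rather than something summpy provides.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: only pass -d when the dictionary directory exists.
import os

NEOLOGD_DIR = '/usr/local/lib/mecab/dic/mecab-ipadic-neologd'


def tagger_args(dic_dir=NEOLOGD_DIR):
    # Build the argument string for MeCab.Tagger.
    if os.path.isdir(dic_dir):
        return '-d ' + dic_dir
    return ''  # empty string: let MeCab fall back to its default dictionary


# In mecab_segmenter.py this could be used as:
#     _mecab = MeCab.Tagger(tagger_args())
print(repr(tagger_args()))
```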

This is off topic, but machine learning needs a huge amount of training data, and it struck me that text summarization is a task for which such data is relatively easy to collect. If you take article titles from many news sites as the correct-answer data and the article bodies as the input data, you can gather as many samples as you like from the web, so I thought it might make a good subject to study.
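
As a rough illustration of that idea, collected articles could be turned into (body, title) pairs like this. The record layout (dicts with 'title' and 'body' keys) and the function name build_pairs are assumptions, and the sample records are placeholders rather than real articles.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch: turn collected articles into (input, target) pairs.


def build_pairs(articles):
    # Each article is assumed to be a dict with 'title' and 'body' keys.
    pairs = []
    for article in articles:
        title = article.get('title', '').strip()
        body = article.get('body', '').strip()
        if title and body:  # skip articles missing either part
            pairs.append((body, title))  # body = input, title = target
    return pairs


# Placeholder records, not real articles.
sample = [
    {'title': 'Title A', 'body': 'Body text of article A.'},
    {'title': '', 'body': 'Body without a title.'},
]
print(build_pairs(sample))
```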