- Built a summarization system (extractive type)
- Set up an operational configuration on AWS
- Kept it as cheap as possible with a serverless configuration
- Covers how the summarization algorithm works
At work I was asked, "Could you build a summarization system?", so I tried building one. My workplace runs a news site, and the motivation was to attach a summary to each published news article.
Indeed, the news sites of major newspapers also offer three-line summaries, which are handy for readers who don't have time to read the full article.
It would be nice if the editors could write the summaries themselves, but writing a good summary is apparently a specialized skill (or so I heard from the editors).
I think this is exactly where demand for automatic summarization comes in.
There are two main types of automatic summarization.
The first is the __sentence generation (abstractive) type__.
With this type, the algorithm interprets the meaning of the input document and generates new summary text that reads naturally (that is, it follows tendencies similar to its training data).
Since deep learning became popular, it has finally reached a practical level.
The other is the __extractive type__.
This type selects, from the input text, the N sentences (one sentence is generally the unit) that best express the meaning of the whole document.
Practical algorithms of this type existed even before deep learning[^1].
If you find automatic summarization interesting, please also have a look at this article.
This time, I use LexRank, an extractive algorithm from 2004.
To put it plainly, it is built on the idea that "a sentence in which the same words appear frequently is surely important."
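To give a feel for that idea, here is a minimal LexRank-style sketch. It is not the code used in this project: the example sentences are made up, and scikit-learn's TF-IDF plus a simple damped power iteration stand in for the full algorithm. Sentences that are similar to many other sentences get high scores, and the top ones become the summary.

```python
# Minimal LexRank-style sketch (illustration only, not the project code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The company announced a new product today.",
    "The new product will ship next month.",
    "Analysts expect the new product to sell well.",
    "Unrelated aside: the weather was sunny.",
]

# Pairwise cosine similarity between TF-IDF vectors of the sentences.
tfidf = TfidfVectorizer().fit_transform(sentences)
sim = (tfidf @ tfidf.T).toarray()
np.fill_diagonal(sim, 0.0)

# Row-normalize into a transition matrix, then run a damped
# PageRank-style power iteration to get a centrality score per sentence.
trans = sim / (sim.sum(axis=1, keepdims=True) + 1e-12)
n = len(sentences)
scores = np.full(n, 1.0 / n)
for _ in range(50):
    scores = 0.15 / n + 0.85 * (trans.T @ scores)

# Take the two highest-scoring sentences, in their original order.
top = sorted(np.argsort(scores)[-2:])
print([sentences[i] for i in top])
```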
Interesting articles along these lines have also been posted on Qiita:
- Summarizing Aozora Bunko with sumy
- Summarizing Donald Trump's speeches in three lines with the automatic summarization algorithm LexRank
This time, the target data is news text. News text has the characteristics that "the vocabulary is kept as consistent as possible" and "the logical structure of the sentences is clear."
Because of that, mistakes in morphological analysis are less likely, and LexRank's graph weighting should work well. And my ghost whispered, "Well, that's okay."
That's why I skipped a rigorous investigation this time and just implemented it first.[^2]
I made it like this.
The points are as follows.
- The news site runs on WordPress.
- I want to keep operating costs as low as possible, so the summarization runs on ECS Fargate (serverless).
- Maintenance may be outsourced, so the DB is RDS, which is highly general-purpose.
The division of roles is as follows.
The smallest RDS instance size is sufficient.
Communication with RDS happens only a few times a day, and there are no concurrent connections.
So I chose db.t3.micro.
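As an illustration of how small the DB workload is, the batch job only needs to write one row per summarized article. A sketch, assuming a MySQL-flavored RDS instance and a hypothetical `summaries` table; the host, credentials, and IDs are placeholders:

```python
# Sketch only: upserting a summary into a hypothetical `summaries` table
# on a MySQL-flavored RDS instance.
import pymysql

conn = pymysql.connect(
    host="my-db.xxxxxxxx.ap-northeast-1.rds.amazonaws.com",  # placeholder
    user="summarizer",
    password="********",
    database="news",
)
try:
    with conn.cursor() as cur:
        # One row per WordPress post ID.
        cur.execute(
            "INSERT INTO summaries (post_id, summary) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE summary = VALUES(summary)",
            (1234, "Three-sentence summary goes here."),
        )
    conn.commit()
finally:
    conn.close()
```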
With this configuration, the operating cost is about 2,000 yen a month, and most of that is the RDS running cost.
I mentioned it only in passing, but Fargate is a container service that runs Docker containers, and compute charges are incurred only while a task is running. Setting up batch jobs is also very easy.
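This is not the setup described in this article, just one hedged illustration: a one-off Fargate task can be launched from Python with boto3, and in practice a scheduled CloudWatch Events / EventBridge rule can start the same task with no code at all. All names and IDs below are placeholders.

```python
# Illustration only: launching a Fargate task with boto3 (all IDs are placeholders).
import boto3

ecs = boto3.client("ecs", region_name="ap-northeast-1")
ecs.run_task(
    cluster="news-summarizer",           # placeholder cluster name
    launchType="FARGATE",
    taskDefinition="summarize-batch:1",  # placeholder task definition
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-xxxxxxxx"],
            "securityGroups": ["sg-xxxxxxxx"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```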
I can't share the actual commands and settings I used for various reasons, but the work proceeds roughly in this order.
There is a WordPress API package for Python, so I used it (https://python-wordpress-xmlrpc.readthedocs.io/en/latest/).
There are other packages of this kind too, so one of those might have worked just as well.
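Fetching posts with python-wordpress-xmlrpc looks roughly like this; the site URL, credentials, and filter values are placeholders:

```python
# Sketch: pulling recent published posts via the WordPress XML-RPC API.
from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import GetPosts

client = Client("https://example.com/xmlrpc.php", "username", "password")
posts = client.call(GetPosts({"number": 10, "post_status": "publish"}))
for post in posts:
    print(post.id, post.title)
    # post.content holds the HTML body; strip the tags before summarizing.
```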
An implementation of LexRank for Japanese is introduced on Recruit Technologies' blog.
Since that one is written for Python 2, I used a Python 3 version of the implementation.
There is also a lexrank package, so that might be fine as well. However, it assumes the input is already tokenized, so the text has to be split into morphemes beforehand with MeCab or the like and joined with spaces.
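A sketch of that workflow, assuming the lexrank package's `LexRank` / `get_summary` interface; the mini corpus and the article below are made up, and each sentence is converted to space-separated morphemes with MeCab first:

```python
# Sketch: MeCab pre-tokenization + the lexrank package (illustration only).
import MeCab
from lexrank import LexRank

tagger = MeCab.Tagger("-Owakati")  # output space-separated morphemes

def wakati(sentence: str) -> str:
    """Return the sentence as space-separated morphemes."""
    return tagger.parse(sentence).strip()

# Corpus used for IDF statistics: a list of documents, each a list of
# pre-tokenized sentences. In practice this would be past articles.
corpus = [
    [wakati("昨日、新しい製品が発表された。"), wakati("発売は来月の予定だ。")],
    [wakati("株価は大きく上昇した。"), wakati("市場は好意的に反応した。")],
]

# The article to summarize, split into sentences and tokenized the same way.
article = ["新しい製品が本日発表された。", "発売は来月を予定している。", "天気は晴れだった。"]
tokenized = [wakati(s) for s in article]

lxr = LexRank(corpus, stopwords=set())  # a stopword set can be supplied here
summary = lxr.get_summary(tokenized, summary_size=2)
print(summary)  # the selected (tokenized) sentences
```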
So, I introduced an extractive summarization system using LexRank.
If the text at hand is similar to news text, I think it will work reasonably well.
Building a Docker + serverless configuration on AWS has also become easier, so this setup seems useful for building batch-job systems in general.
[^1]: The question "Since when has deep learning been used?" will probably come up; for natural language processing, I'd say around 2015. I feel that generative (abstractive) automatic summarization has been getting good since around 2017. The encoder-decoder architecture using RNNs felt like the breakthrough.
[^2]: For the time being, I only did a simple evaluation. ROUGE-N and BLEU are commonly used to evaluate summarization algorithms, but it is quite hard to get people on the business side to understand them. This time I evaluated roughly with accuracy, which can even be computed in Excel: `Accuracy = N(documents whose summary I judged OK) / N(input documents)`