- Built a summarization system (extractive type)
- Set up an operational configuration on AWS
- Kept it as cheap as possible with a serverless configuration
- Covers how the summarization algorithm works
At work I was asked, "Could you build a summarization system?", so I tried building one. My workplace runs a news site, and the motivation was to attach a summary to each published news article.
Indeed, the news sites of major newspapers also offer three-line summaries, which are handy for readers who don't have time to read the full article.
It would be nice if the editors could write the summaries themselves, but writing a good summary is apparently a specialized skill (or so I heard from the editors).
I think this is exactly where demand for automatic summarization comes in.
There are two main types of automatic summarization.
The first is the __sentence generation (abstractive) type__.
With this type, the algorithm interprets the meaning of the input document and generates new summary text that reads naturally (that is, it follows tendencies similar to its training data).
Since deep learning became popular, it has finally reached a practical level.
The other is the __extractive type__.
This type selects, from the input text, the N sentences (one sentence is generally the unit) that best express the meaning of the whole document.
Practical algorithms of this type existed even before deep learning[^1].
If you find automatic summarization interesting, please also have a look at this article.
This time, I use LexRank, an extractive algorithm from 2004.
To put it plainly, it is built on the idea that "a sentence in which the same words appear frequently is surely important."
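To give a feel for that idea, here is a minimal LexRank-style sketch. It is not the code used in this project: the example sentences are made up, and scikit-learn's TF-IDF plus a simple damped power iteration stand in for the full algorithm. Sentences that are similar to many other sentences get high scores, and the top ones become the summary.

```python
# Minimal LexRank-style sketch (illustration only, not the project code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The company announced a new product today.",
    "The new product will ship next month.",
    "Analysts expect the new product to sell well.",
    "Unrelated aside: the weather was sunny.",
]

# Pairwise cosine similarity between TF-IDF vectors of the sentences.
tfidf = TfidfVectorizer().fit_transform(sentences)
sim = (tfidf @ tfidf.T).toarray()
np.fill_diagonal(sim, 0.0)

# Row-normalize into a transition matrix, then run a damped
# PageRank-style power iteration to get a centrality score per sentence.
trans = sim / (sim.sum(axis=1, keepdims=True) + 1e-12)
n = len(sentences)
scores = np.full(n, 1.0 / n)
for _ in range(50):
    scores = 0.15 / n + 0.85 * (trans.T @ scores)

# Take the two highest-scoring sentences, in their original order.
top = sorted(np.argsort(scores)[-2:])
print([sentences[i] for i in top])
```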
Interesting articles along these lines have also been posted on Qiita:
- Summarizing Aozora Bunko with sumy
- Summarizing Donald Trump's speeches in three lines with the automatic summarization algorithm LexRank
This time, the target data is news text. News text has the characteristics that "the vocabulary is kept as consistent as possible" and "the logical structure of the sentences is clear."
Because of that, mistakes in morphological analysis are less likely, and LexRank's graph weighting should work well. And my ghost whispered, "Well, that's okay."
That's why I skipped a rigorous investigation this time and just implemented it first.[^2]
I made it like this.
The points are as follows.
- The news site runs on WordPress.
- I want to keep operating costs as low as possible, so the summarization runs on ECS Fargate (serverless).
- Maintenance may be outsourced, so the DB is RDS, which is highly general-purpose.
The division of roles is as follows.
The smallest RDS instance size is sufficient.
Communication with RDS happens only a few times a day, and there are no concurrent connections.
So I chose db.t3.micro.
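As an illustration of how small the DB workload is, the batch job only needs to write one row per summarized article. A sketch, assuming a MySQL-flavored RDS instance and a hypothetical `summaries` table; the host, credentials, and IDs are placeholders:

```python
# Sketch only: upserting a summary into a hypothetical `summaries` table
# on a MySQL-flavored RDS instance.
import pymysql

conn = pymysql.connect(
    host="my-db.xxxxxxxx.ap-northeast-1.rds.amazonaws.com",  # placeholder
    user="summarizer",
    password="********",
    database="news",
)
try:
    with conn.cursor() as cur:
        # One row per WordPress post ID.
        cur.execute(
            "INSERT INTO summaries (post_id, summary) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE summary = VALUES(summary)",
            (1234, "Three-sentence summary goes here."),
        )
    conn.commit()
finally:
    conn.close()
```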
With this configuration, the operating cost is about 2,000 yen a month, and most of that is the RDS running cost.
I mentioned it only in passing, but Fargate is a container service that runs Docker containers, and compute charges are incurred only while a task is running. Setting up batch jobs is also very easy.
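This is not the setup described in this article, just one hedged illustration: a one-off Fargate task can be launched from Python with boto3, and in practice a scheduled CloudWatch Events / EventBridge rule can start the same task with no code at all. All names and IDs below are placeholders.

```python
# Illustration only: launching a Fargate task with boto3 (all IDs are placeholders).
import boto3

ecs = boto3.client("ecs", region_name="ap-northeast-1")
ecs.run_task(
    cluster="news-summarizer",           # placeholder cluster name
    launchType="FARGATE",
    taskDefinition="summarize-batch:1",  # placeholder task definition
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-xxxxxxxx"],
            "securityGroups": ["sg-xxxxxxxx"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```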
I can't share the actual commands and settings I used for various reasons, but the work proceeds roughly in this order.
There is a WordPress API package for Python, so I used it (https://python-wordpress-xmlrpc.readthedocs.io/en/latest/).
There are other packages of this kind too, so one of those might have worked just as well.
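Fetching posts with python-wordpress-xmlrpc looks roughly like this; the site URL, credentials, and filter values are placeholders:

```python
# Sketch: pulling recent published posts via the WordPress XML-RPC API.
from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import GetPosts

client = Client("https://example.com/xmlrpc.php", "username", "password")
posts = client.call(GetPosts({"number": 10, "post_status": "publish"}))
for post in posts:
    print(post.id, post.title)
    # post.content holds the HTML body; strip the tags before summarizing.
```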
An implementation of LexRank for Japanese is introduced on Recruit Technologies' blog.
Since that one is written for Python 2, I used a Python 3 version of the implementation.
There is also a lexrank package, so that might be fine as well. However, it assumes the input is already tokenized, so the text has to be split into morphemes beforehand with MeCab or the like and joined with spaces.
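A sketch of that workflow, assuming the lexrank package's `LexRank` / `get_summary` interface; the mini corpus and the article below are made up, and each sentence is converted to space-separated morphemes with MeCab first:

```python
# Sketch: MeCab pre-tokenization + the lexrank package (illustration only).
import MeCab
from lexrank import LexRank

tagger = MeCab.Tagger("-Owakati")  # output space-separated morphemes

def wakati(sentence: str) -> str:
    """Return the sentence as space-separated morphemes."""
    return tagger.parse(sentence).strip()

# Corpus used for IDF statistics: a list of documents, each a list of
# pre-tokenized sentences. In practice this would be past articles.
corpus = [
    [wakati("昨日、新しい製品が発表された。"), wakati("発売は来月の予定だ。")],
    [wakati("株価は大きく上昇した。"), wakati("市場は好意的に反応した。")],
]

# The article to summarize, split into sentences and tokenized the same way.
article = ["新しい製品が本日発表された。", "発売は来月を予定している。", "天気は晴れだった。"]
tokenized = [wakati(s) for s in article]

lxr = LexRank(corpus, stopwords=set())  # a stopword set can be supplied here
summary = lxr.get_summary(tokenized, summary_size=2)
print(summary)  # the selected (tokenized) sentences
```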
So, I introduced an extractive summarization system using LexRank.
If the text at hand is similar to news text, I think it will work reasonably well.
Building a Docker + serverless configuration on AWS has also become easier, so this setup seems useful for building batch-job systems in general.
[^1]: The question "Since when has deep learning been used?" will probably come up; for natural language processing, I'd say around 2015. I feel that generative (abstractive) automatic summarization has been getting good since around 2017. The encoder-decoder architecture using RNNs felt like the breakthrough.
[^2]: For the time being, I only did a simple evaluation. ROUGE-N and BLEU are commonly used to evaluate summarization algorithms, but it is quite hard to get people on the business side to understand them. This time I evaluated roughly with accuracy, which can even be computed in Excel: `Accuracy = N(documents whose summary I judged OK) / N(input documents)`