Build a cheap summarization system with AWS components

Content of this article

- Built a summarization system (extractive type)
- Built an operational configuration on AWS
- Used a serverless configuration to make it as cheap as possible

Not covered in this article

- Details of the summarization algorithm


At work, I was asked "Can you build a summarization system?", so I tried building one. My workplace runs a news site, and the motivation is to attach a summary to each published news article.

Indeed, the news sites of major newspaper companies offer three-line summaries, which are good content for readers who do not have time to read the full article.

It would be nice if editors could write the summaries themselves, but writing summaries appears to be a specialized skill (so I heard from the editors).

I think that is where the demand for automatic summarization comes from.

So what is automatic summarization?

There are two main types of automatic summarization.

The first is the __generative__ type.

In this type, the algorithm interprets the meaning of the input document and generates a summary that __reads about right__ (that is, one that follows the tendencies of the training data).

Since deep learning became popular, it has finally reached a practical level.

The other is the __extractive__ type.

This type selects, from the input text, the N units (generally one sentence is one unit) that best express the meaning of the entire document.

For this type, practical algorithms existed even before deep learning. [^1]

If you think "automatic summarization is interesting", please also see this article.

This time, I use LexRank, an extractive algorithm from 2004.

To put it plainly, the idea behind the algorithm is that "a sentence in which the same words appear frequently is surely important."
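To make the idea concrete, here is a minimal sketch of a LexRank-style scorer. It is not the implementation used in this system: it assumes sentences are already split into space-separated tokens, it uses plain word counts instead of tf-idf, and the function names and the `threshold`, `damping`, and `iterations` defaults are illustrative choices.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score each sentence by its centrality in a sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(s.split()) for s in sentences]
    # Link two sentences when their similarity exceeds the threshold.
    graph = [[1.0 if cosine(bags[i], bags[j]) > threshold else 0.0
              for j in range(n)] for i in range(n)]
    # Normalize each row into transition probabilities (uniform if a row is empty).
    for row in graph:
        total = sum(row)
        row[:] = [v / total for v in row] if total else [1.0 / n] * n
    # Power iteration with damping, in the style of PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(graph[j][i] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores


def summarize(sentences, k=3):
    """Return the k highest-scoring sentences in their original order."""
    scores = lexrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares words with many other sentences gets many links in the graph, so it ends up with a higher score than sentences that overlap with only a few.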

Qiita already has articles with interesting ideas on this topic.

- Summary of Aozora Bunko using sumy
- Summarizing Donald Trump's speech in three lines with the automatic summarization algorithm LexRank

This time, the target data is news text. News text is characterized by "vocabulary that is kept as consistent as possible" and "a clear logical sentence structure."

For that reason, errors in morphological segmentation are less likely to occur, and LexRank's graph weighting should work well. And my ghost whispered, "Well, that should be fine."

That's why I skipped a rigorous investigation this time and went straight to implementation. [^2]

System configuration with AWS components

I made it like this.

[Figure: architecture diagram of the summarization system (要約システム.png)]

The points are as follows.

- The news site runs on WordPress.
- I want to keep operating costs as low as possible, so the summarization runs on ECS Fargate (serverless).
- Maintenance may be handed to an outside contractor, so the DB is RDS, a widely used general-purpose choice.

The division of roles is as follows.

The smallest RDS size is sufficient. Communication with RDS happens only a few times a day, and there are no simultaneous connections, so I chose db.t3.micro.

With this configuration, the operating cost is about 2,000 yen a month, and most of that is the RDS cost.

What is ECS Fargate?

I mentioned it in passing, but Fargate is a container service that runs Docker containers. Compute costs are incurred only while a task is actually running. Setting up a batch job is also very easy.

Work procedure

I can't show the actual commands for various reasons, but the work roughly proceeds in this order.

  1. Implement the code on the local machine.
  2. Put the implemented code into a Docker image and build the image on the local machine.
  3. Create an ECS cluster in the AWS console.
  4. Create an ECR repository (Docker repository) in the AWS console and push the local Docker image to ECR.
  5. Define the task in ECS.
  6. Set up task scheduling in ECS.
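
As a rough illustration of steps 5 and 6, the sketch below does the same thing programmatically with boto3 instead of the console. These are not the commands used for this system. Every name (`news-summarizer`, `summarizer`, `news-summarizer-daily`), the CPU and memory sizes, the public IP setting, and the cron expression (evaluated in UTC) are assumptions for illustration, and the ARNs and subnet IDs must come from your own account.

```python
def build_task_definition(image_uri, execution_role_arn):
    """Keyword arguments for ecs.register_task_definition (Fargate task)."""
    return {
        "family": "news-summarizer",  # hypothetical task family name
        "networkMode": "awsvpc",      # required for Fargate
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "256",                 # assumed smallest size
        "memory": "512",
        "executionRoleArn": execution_role_arn,
        "containerDefinitions": [
            {"name": "summarizer", "image": image_uri, "essential": True}
        ],
    }


def build_schedule_target(cluster_arn, task_definition_arn, events_role_arn, subnets):
    """One target entry for events.put_targets that runs the task on Fargate."""
    return {
        "Id": "news-summarizer",
        "Arn": cluster_arn,
        "RoleArn": events_role_arn,
        "EcsParameters": {
            "TaskDefinitionArn": task_definition_arn,
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": subnets,
                    "AssignPublicIp": "ENABLED",  # assumes public subnets
                }
            },
        },
    }


def deploy(image_uri, execution_role_arn, cluster_arn, events_role_arn, subnets,
           schedule="cron(0 21 * * ? *)"):
    """Register the task definition and schedule it (needs AWS credentials)."""
    import boto3  # third-party; imported lazily so the builders work without it

    ecs = boto3.client("ecs")
    events = boto3.client("events")
    response = ecs.register_task_definition(
        **build_task_definition(image_uri, execution_role_arn))
    task_arn = response["taskDefinition"]["taskDefinitionArn"]
    events.put_rule(Name="news-summarizer-daily",
                    ScheduleExpression=schedule, State="ENABLED")
    events.put_targets(
        Rule="news-summarizer-daily",
        Targets=[build_schedule_target(cluster_arn, task_arn,
                                       events_role_arn, subnets)])
```

Keeping the two builder functions free of AWS calls makes it possible to review the configuration before anything is created in the account.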

Integration with WordPress

There is a WordPress API package for Python, so I used it (https://python-wordpress-xmlrpc.readthedocs.io/en/latest/).

Other packages of this kind exist, and one of them might have worked just as well.
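
For reference, a minimal sketch of how python-wordpress-xmlrpc can be used to read recent posts and write a summary back. This is not the code of this system: storing the summary in the post excerpt is an assumption, and `format_summary`, `attach_summaries`, and the `summarize` callback are hypothetical names.

```python
def format_summary(sentences):
    """Join the extracted sentences into one excerpt, one sentence per line."""
    return "\n".join(s.strip() for s in sentences if s.strip())


def attach_summaries(url, username, password, summarize, number=10):
    """Summarize the latest published posts and save each summary as the excerpt.

    `url` is the site's xmlrpc.php endpoint; `summarize` is any function that
    takes the post body and returns a list of sentences.
    """
    # Imported lazily so format_summary works without the package installed.
    from wordpress_xmlrpc import Client
    from wordpress_xmlrpc.methods.posts import EditPost, GetPosts

    client = Client(url, username, password)
    posts = client.call(GetPosts({"number": number, "post_status": "publish"}))
    for post in posts:
        post.excerpt = format_summary(summarize(post.content))
        client.call(EditPost(post.id, post))
```

Note that `post.content` is the raw post body, so HTML tags would likely need to be stripped before summarizing.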

LexRank implementation

An implementation of LexRank for Japanese is introduced on the Recruit Technologies blog.

Since that one targets Python 2, I used a Python 3 version of the implementation.

There is also a LexRank package, so that may work as well. However, it assumes the text is already split into morphemes, so split the text in advance with MeCab or a similar tool and separate the morphemes with spaces.
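
The pre-tokenization step can be sketched as follows, assuming the mecab-python3 package and a dictionary are installed. The helper names `make_tagger` and `to_space_separated` are hypothetical.

```python
def make_tagger():
    """Create a MeCab tagger that outputs space-separated (wakati) tokens."""
    import MeCab  # provided by the mecab-python3 package; imported lazily

    return MeCab.Tagger("-Owakati")


def to_space_separated(text, tagger):
    """Return the text with its morphemes separated by single spaces."""
    # parse() returns the tokens followed by a newline; split() and join
    # normalize the whitespace and drop that trailing newline.
    return " ".join(tagger.parse(text).split())
```

Usage would look like `to_space_separated(sentence, make_tagger())`, with the result passed to a package that expects space-separated input.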


So, I introduced an extractive summarization system using LexRank.

If the text at hand resembles news text, I think it will work reasonably well.

Building a Docker + serverless configuration on AWS has become easy, so this approach also looks useful for building batch job systems.

[^1]: The question "When did deep learning start?" tends to come up, but for natural language processing it was around 2015, I think. I feel that generative automatic summarization has been improving since around 2017. My impression is that the RNN-based encoder-decoder was the breakthrough.
[^2]: For the time being, I did only a simple evaluation. ROUGE-N and BLEU are commonly used to evaluate summarization algorithms, but it is quite hard to get people on the business side to understand them. This time I evaluated roughly with accuracy, which can be computed in Excel: `Accuracy = N(documents whose summary I judged OK) / N(input documents)`
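
As a rough illustration of the evaluation in the second footnote, the sketch below computes that accuracy and, for comparison, a ROUGE-N recall score. The function names are hypothetical, and the ROUGE-N part is a simplified version (clipped n-gram overlap divided by the number of reference n-grams) rather than a full evaluation toolkit.

```python
from collections import Counter


def accuracy(judgements):
    """Share of input documents whose summary was judged OK (True)."""
    return sum(judgements) / len(judgements) if judgements else 0.0


def rouge_n_recall(candidate_tokens, reference_tokens, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    candidate = ngrams(candidate_tokens)
    reference = ngrams(reference_tokens)
    total = sum(reference.values())
    # Clip each match at the count available in the candidate.
    overlap = sum(min(count, candidate[gram]) for gram, count in reference.items())
    return overlap / total if total else 0.0
```

Accuracy only needs one OK/NG judgement per document, which is why it is easy to explain and to reproduce in a spreadsheet.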
