[DOCKER] Best practices for log infrastructure design in container operations

Task

Compared with a few years ago, I have the impression that the number of services running applications in container execution environments such as GKE and ECS has increased considerably. When operating containers, one of the main issues is how to handle logs for tracking application events. The traditional method of periodically rotating old logs and transferring them to separate storage is not well suited to cloud-native architectures. As an application development methodology, the Twelve-Factor App provides the guideline of treating logs as event streams, and in recent web applications the idea of loosely coupled microservices has become mainstream. Application logs are formatted per service and then delivered to a log collection service, and requirements such as real-time analysis, notification of anomalous data, and data visualization are added as needed. Here, I will introduce an example of a best-practice architecture based on several years of experience designing container operations and log infrastructure in production environments at Metaps.

Traditional design

When I first joined Metaps in 2015, Rails was the most common development framework, Unicorn was used as the application server, and MySQL and Redis were used as data stores. Application logs were collected by Fluentd; anomalous data was reported to Slack, and all logs were stored in S3. I think most services in the Metaps Group adopted roughly this kind of mechanism.

(Figure: traditional log infrastructure)

As time went by, microservices became a common approach to software development. Development environments moved from Vagrant to Docker, and cloud services for container execution such as GKE, ECS, and EKS appeared. In the Rails world, the Webpack(er) + API mode development style emerged, and separating the front end from the back end became the fashionable way to structure services. The problem here is application log management. With a monolithic application, it is sufficient to send related logs per user request, but with microservices, multiple services communicate asynchronously, so it becomes difficult to track time-series data per user just by looking at the logs in S3.

(Figure: log management in a microservices architecture)

Therefore, we decided to redesign the infrastructure configuration to suit microservices.

Designed for microservices

Metaps decided to adopt Amazon ECS for a new service soon after its release. Our knowledge about ECS is published as "ECS operation know-how", so please refer to it if you are interested. Regarding log delivery design, we also previously published "How to handle logs in ECS": in the early stage of operation we used CloudWatch Logs, and in the latter half of that article we migrate to Fluentd and Elasticsearch for a more advanced log platform. Even now, the basic concept of log delivery has not changed; the basic configuration is that each microservice that makes up the application delivers its data to a back-end log service.

(Figure: each microservice delivers logs to the back-end log service)

When building a typical web application, the logs generated by the system can be classified into the following types.

Log type | Explanation | Typical log contents
--- | --- | ---
Application log | Logs from the application framework and logs intentionally embedded by developers | Access logs, logger output (DEBUG, WARN, ERROR, etc.)
System log | Logs of the cluster and of the infrastructure underlying it | Amazon ECS event logs, Amazon RDS error logs, Amazon VPC flow logs
Audit log | A trail of who did what to the system. Generally required to be retained for 1 to 3 years | Access logs, auditd, AWS Config, AWS CloudTrail

Since we focus on container operation this time, we will introduce log delivery patterns related to application logs [^ log_type].

Log delivery using a log driver

Docker has a feature called a log driver, which provides a mechanism for delivering a container's standard output / standard error to destinations such as local storage, Amazon CloudWatch Logs, and Fluentd.

Outputting logs with the docker command

First, create a program that outputs a character string.

hello.rb


puts 'hello'

Then pull the Ruby image and run the command. Since docker has an option called --log-driver [^ log_driver] for specifying the log delivery destination, specify json-file here.

% docker run -v ${PWD}:/app --log-driver=json-file -w /app ruby:alpine ruby hello.rb
hello

I was able to confirm that the command executed and that hello was printed to the console. If --log-driver is not specified, json-file is used as the default log driver.
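
As a quick check, the Docker daemon's default logging driver can be confirmed with docker info (on a default installation this should print json-file):

% docker info --format '{{.LoggingDriver}}'
json-file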

Next, check the log file. Running docker ps -a shows that the container you just ran still remains, so let's use docker inspect to find where the log was written.

% docker ps -a
CONTAINER ID   IMAGE                    COMMAND                  CREATED              STATUS                           PORTS     NAMES
01b164f54437   ruby:alpine              "ruby hello.rb"          About a minute ago   Exited (0) About a minute ago              sweet_babbage

% docker inspect 01b164f54437 | grep LogPath
        "LogPath": "/var/lib/docker/containers/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7-json.log",

The path shown in LogPath is where the log is written. On Linux you can open it directly with cat, but on Mac (Docker for Mac) you will get an error such as "No such file or directory". This is because Docker for Mac runs inside a VM (HyperKit), and you need to enter the VM, for example by running a privileged container that joins the VM's namespaces.

% docker run -it --rm --privileged --pid=host justincormack/nsenter1
/ # cat /var/lib/docker/containers/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7/01b164f54437e80b8e82
767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7-json.log
{"log":"hello\n","stream":"stdout","time":"2020-12-25T00:27:10.82632Z"}

Cat-ing the path written in LogPath confirms that the log was written there.

I have shown how to specify the log driver via the docker command; if you specify the driver from docker-compose, use the logging parameter in docker-compose.yml instead.

docker-compose.yml example


services:
  web:
    logging:
      driver: json-file
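
The json-file driver also accepts options that correspond to --log-opt on the CLI. As a runnable sketch (the image and the rotation values are illustrative), here is the same example with log rotation configured:

services:
  web:
    image: ruby:alpine
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate the log file after 10 MB
        max-file: "3"     # keep at most 3 rotated files

This is equivalent to passing --log-opt max-size=10m --log-opt max-file=3 to docker run.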

GKE

In GKE, Fluentd running as a DaemonSet collects, by default, the logs that Pods write to standard output and delivers them to Stackdriver Logging.
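
As a minimal sketch (the Pod name and image are arbitrary), the application only needs to write to standard output; the logging agent running on each node picks the output up without any per-Pod log configuration:

apiVersion: v1
kind: Pod
metadata:
  name: hello-logger
spec:
  restartPolicy: Never
  containers:
    - name: hello
      image: ruby:alpine
      # Anything written to stdout/stderr is collected by the node's logging agent
      command: ["ruby", "-e", "puts 'hello'"]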

Amazon ECS

By specifying logConfiguration [^ log_configuration] in the task definition, you get the equivalent of --log-driver and --log-opt.
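
For example, a container definition that sends logs to CloudWatch Logs would contain roughly the following snippet (the log group name, region, and stream prefix are placeholders):

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/my-app",
    "awslogs-region": "ap-northeast-1",
    "awslogs-stream-prefix": "web"
  }
}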

Log collector

The following tools are available for collecting logs from each container [^ log_collector].

Tool name | Feature | GitHub Watch | GitHub Star | Concern
--- | --- | --- | --- | ---
Fluentd | Many production track records and active development. Wide range of plugins and supported output destinations | 348 | 9.8k | It is necessary to take measures against loss due to log buffer overflow, etc.
Fluent Bit | Written in C and operates at high speed | 90 | 2.4k | Fewer plugins
Logstash | Developed by Elastic. Well documented | 831 | 11.9k | Written in JRuby (high memory usage)
Beats | Specialized in data collection (simple functionality). Written in Go and runs fast | 635 | 9.5k | Fewer plugins. Limited support for output destinations
Flume NG | Designed with redundancy in mind; strong for large-scale log collection | 221 | 2k | Fewer plugins. Complex configuration

In addition, SaaS offerings such as Datadog Logs and Stackdriver Logging have recently become available, and in most cases Metaps builds its base configuration with Fluentd alone or with Fluentd + Datadog Logs.

Storage selection

Next, consider where to store the logs. It depends on the cloud platform that runs the containers, but I think Amazon S3 is often chosen on AWS and Google BigQuery on GCP. Here I have summarized the selection criteria and considerations for each storage option I have actually used.

Storage | Selection criteria | Data visualization | Log monitoring (anomaly detection) | Consideration
--- | --- | --- | --- | ---
Amazon S3 | Low data storage cost. Old logs can be moved to Amazon S3 Glacier to save even more | Integration with BI tools | – | Data analysis requires schema design with AWS Glue. SQL-like search is possible by using Amazon Athena (the price depends on the amount of data scanned, so keeping costs down requires a partitioning design and a table update mechanism)
Amazon Redshift | Supports standard SQL queries. Hourly billing | Integration with BI tools | – | Expensive
Amazon Elasticsearch Service | Intuitive search and data visualization using Kibana | Kibana | Alerting | AWS offers a fully managed Elasticsearch service. To increase availability, the master nodes and data nodes must be separated, which raises operating costs (it can run without dedicated master nodes, but stability suffers)
Amazon CloudWatch Logs | Low data storage cost. High affinity with other AWS services (easy to stream logs to AWS Lambda and Amazon Elasticsearch Service) | CloudWatch Logs Insights | CloudWatch Alarm | The console is not very user-friendly
Datadog Logs | High affinity with other Datadog products. Log monitoring rules can be managed as code with Terraform | Datadog Logs | Log monitor | Billed by log ingestion volume and retention period (operating costs can become high depending on usage)
Google BigQuery | High data load performance. Supports standard SQL queries. Billed per query | Integration with BI tools | – | Data visualization requires integration with BI tools such as Google Data Studio

Amazon Elasticsearch Service (Kibana) and Datadog Logs clearly outperform the other services in terms of log searchability and visualization, but on the other hand their maintenance costs tend to be high. Since Fluentd can define multiple output destinations, one option is to build a two-tier configuration that keeps logs in Elasticsearch only for a short period for visualization purposes and stores logs for the long term in separate persistent storage such as Amazon S3, as sketched below.
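
A minimal Fluentd configuration sketch for this two-tier pattern, assuming the fluent-plugin-elasticsearch and fluent-plugin-s3 plugins are installed (the tag, host, and bucket names are placeholders):

<match app.**>
  @type copy
  # Short-term storage for search and visualization in Kibana
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    port 9200
    logstash_format true    # write to date-based indices so old ones can be dropped
  </store>
  # Long-term, low-cost storage
  <store>
    @type s3
    s3_bucket my-app-logs
    s3_region ap-northeast-1
    path logs/
    <buffer time>
      timekey 3600          # flush to S3 once per hour
      timekey_wait 10m
    </buffer>
  </store>
</match>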

Price comparison

Amazon Elasticsearch Service and Datadog Logs are often chosen as log storage, so I compared their usage fees [^ pricing].

For a service that generates 90 million log records per month, Datadog Logs (ingest) came out cheaper, but Elasticsearch can be more advantageous depending on the log retention period and the number of scans, so it is hard to say in general which one is better. You need to decide which service to choose based on the characteristics and requirements of your service.

Log monitoring

Personally I like Elasticsearch for data visualization, but Elasticsearch does not have what Datadog Logs calls a log monitor (a mechanism that uses search queries to send alerts when specific conditions are matched). To be exact, it is included in the Elasticsearch distribution provided by Elastic, but it was not provided in Amazon Elasticsearch Service. Before I knew it, however, the AWS version had also gained an Alerting function: with the release of Open Distro for Elasticsearch [^ distro_es], it is apparently supported from version 6.5. When I tried the alert function, I was able to add a monitor very easily. The procedure is described below.

  1. Open Kibana and open Alerting from the left sidebar.
  2. First, register the alert notification destination: open the Destinations tab and press Add destination.
    • (Screen: Destination)
  3. Open the Monitors tab and press Create monitor.
    • (Screen: Configure monitor)
  4. Create trigger opens. Here, set the conditions for sending alerts, based on the monitoring conditions configured on the monitor.
    • (Screen: Define trigger)

This completes the settings. I confirmed that a notification like the one below is sent when requests exceed the threshold.

(Screenshot: alert notification)

Open Distro for Elasticsearch also publishes an Alerting API, so it should be easy to manage monitor rules as code.
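
As a rough sketch of what such a monitor defined as code could look like (the index pattern, field names, and threshold are illustrative; check the Open Distro Alerting API documentation for the exact schema), a monitor can be created by sending a JSON definition to the Alerting API:

POST _opendistro/_alerting/monitors
{
  "type": "monitor",
  "name": "error-log-spike",
  "enabled": true,
  "schedule": { "period": { "interval": 1, "unit": "MINUTES" } },
  "inputs": [{
    "search": {
      "indices": ["app-logs-*"],
      "query": { "size": 0, "query": { "term": { "level": "ERROR" } } }
    }
  }],
  "triggers": [{
    "name": "too-many-errors",
    "severity": "1",
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 100",
        "lang": "painless"
      }
    },
    "actions": []
  }]
}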

[^ log_type]: Exceptions raised when the application errors are excluded from logs here. The de facto standard is to manage them with an error tracking tool such as Sentry or Rollbar.
