[DOCKER] Best practices for log infrastructure design in container operations

Task

Compared with a few years ago, I have the impression that the number of services running applications in container execution environments such as GKE and ECS has increased considerably. When operating containers, one of the main issues is how to handle logs for tracking application events. The traditional method of periodically rotating old logs and transferring them to separate storage is not well suited to cloud-native architectures. As an application development methodology, the Twelve-Factor App provides the guideline of treating logs as event streams, and in recent web applications the idea of loosely coupled microservices has become mainstream. Application logs are formatted per service and then delivered to a log collection service, and requirements such as real-time analysis, notification of anomalous data, and data visualization are added as needed. Here, I will introduce an example of a best-practice architecture based on several years of experience designing container operations and log infrastructure in production environments at Metaps.

Traditional design

When I first joined Metaps in 2015, Rails was the most common development framework, Unicorn was used as the application server, and MySQL and Redis were used as data stores. Application logs were collected by Fluentd; anomalous data was reported to Slack, and all logs were stored in S3. I think most services in the Metaps Group adopted roughly this kind of mechanism.

(Figure: traditional log infrastructure)

As time went by, microservices became a common approach to software development. Development environments moved from Vagrant to Docker, and cloud services for container execution such as GKE, ECS, and EKS appeared. In the Rails world, the Webpack(er) + API mode development style emerged, and separating the front end from the back end became the fashionable way to structure services. The problem here is application log management. With a monolithic application, it is sufficient to send related logs per user request, but with microservices, multiple services communicate asynchronously, so it becomes difficult to track time-series data per user just by looking at the logs in S3.

(Figure: log management in a microservices architecture)

Therefore, we decided to redesign the infrastructure configuration to suit microservices.

Designed for microservices

Metaps decided to adopt Amazon ECS for a new service soon after its release. Our knowledge about ECS is published as "ECS operation know-how", so please refer to it if you are interested. Regarding log delivery design, we also previously published "How to handle logs in ECS": in the early stage of operation we used CloudWatch Logs, and in the latter half of that article we migrate to Fluentd and Elasticsearch for a more advanced log platform. Even now, the basic concept of log delivery has not changed; the basic configuration is that each microservice that makes up the application delivers its data to a back-end log service.

(Figure: each microservice delivers logs to the back-end log service)

When building a typical web application, the logs generated by the system can be classified into the following types.

Log type | Explanation | Typical log contents
--- | --- | ---
Application log | Logs from the application framework and logs intentionally embedded by developers | Access logs, logger output (DEBUG, WARN, ERROR, etc.)
System log | Logs of the cluster and of the infrastructure underlying it | Amazon ECS event logs, Amazon RDS error logs, Amazon VPC flow logs
Audit log | A trail of who did what to the system. Generally required to be retained for 1 to 3 years | Access logs, auditd, AWS Config, AWS CloudTrail

Since we focus on container operation this time, we will introduce log delivery patterns related to application logs [^ log_type].

Log delivery using a log driver

Docker has a feature called a log driver, which provides a mechanism for delivering a container's standard output / standard error to destinations such as local storage, Amazon CloudWatch Logs, and Fluentd.

Outputting logs with the docker command

First, create a program that outputs a character string.

hello.rb


puts 'hello'

Then pull the Ruby image and run the command. Since docker has an option called --log-driver [^ log_driver] for specifying the log delivery destination, specify json-file here.

% docker run -v ${PWD}:/app --log-driver=json-file -w /app ruby:alpine ruby hello.rb
hello

I was able to confirm that the command executed and that hello was printed to the console. If --log-driver is not specified, json-file is used as the default log driver.
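
As a quick check, the Docker daemon's default logging driver can be confirmed with docker info (on a default installation this should print json-file):

% docker info --format '{{.LoggingDriver}}'
json-file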

Next, check the log file. Running docker ps -a shows that the container you just ran still remains, so let's use docker inspect to find where the log was written.

% docker ps -a
CONTAINER ID   IMAGE                    COMMAND                  CREATED              STATUS                           PORTS     NAMES
01b164f54437   ruby:alpine              "ruby hello.rb"          About a minute ago   Exited (0) About a minute ago              sweet_babbage

% docker inspect 01b164f54437 | grep LogPath
        "LogPath": "/var/lib/docker/containers/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7-json.log",

The path shown in LogPath is where the log is written. On Linux you can open it directly with cat, but on Mac (Docker for Mac) you will get an error such as "No such file or directory". This is because Docker for Mac runs inside a VM (HyperKit), and you need to enter the VM, for example by running a privileged container that joins the VM's namespaces.

% docker run -it --rm --privileged --pid=host justincormack/nsenter1
/ # cat /var/lib/docker/containers/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7/01b164f54437e80b8e82
767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7-json.log
{"log":"hello\n","stream":"stdout","time":"2020-12-25T00:27:10.82632Z"}

Cat-ing the path written in LogPath confirms that the log was written there.

I have shown how to specify the log driver via the docker command; if you specify the driver from docker-compose, use the logging parameter in docker-compose.yml instead.

docker-compose.yml example


services:
  web:
    logging:
      driver: json-file
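
The json-file driver also accepts options that correspond to --log-opt on the CLI. As a runnable sketch (the image and the rotation values are illustrative), here is the same example with log rotation configured:

services:
  web:
    image: ruby:alpine
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate the log file after 10 MB
        max-file: "3"     # keep at most 3 rotated files

This is equivalent to passing --log-opt max-size=10m --log-opt max-file=3 to docker run.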

GKE

In GKE, Fluentd running as a DaemonSet collects, by default, the logs that Pods write to standard output and delivers them to Stackdriver Logging.
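
As a minimal sketch (the Pod name and image are arbitrary), the application only needs to write to standard output; the logging agent running on each node picks the output up without any per-Pod log configuration:

apiVersion: v1
kind: Pod
metadata:
  name: hello-logger
spec:
  restartPolicy: Never
  containers:
    - name: hello
      image: ruby:alpine
      # Anything written to stdout/stderr is collected by the node's logging agent
      command: ["ruby", "-e", "puts 'hello'"]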

Amazon ECS

By specifying logConfiguration [^ log_configuration] in the task definition, you get the equivalent of --log-driver and --log-opt.
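
For example, a container definition that sends logs to CloudWatch Logs would contain roughly the following snippet (the log group name, region, and stream prefix are placeholders):

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/my-app",
    "awslogs-region": "ap-northeast-1",
    "awslogs-stream-prefix": "web"
  }
}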

Log collector

The following tools are available for collecting logs from each container [^ log_collector].

Tool name | Feature | GitHub Watch | GitHub Star | Concern
--- | --- | --- | --- | ---
Fluentd | Many production track records and active development. Wide range of plugins and supported output destinations | 348 | 9.8k | It is necessary to take measures against loss due to log buffer overflow, etc.
Fluent Bit | Written in C and operates at high speed | 90 | 2.4k | Fewer plugins
Logstash | Developed by Elastic. Well documented | 831 | 11.9k | Written in JRuby (high memory usage)
Beats | Specialized in data collection (simple functionality). Written in Go and runs fast | 635 | 9.5k | Fewer plugins. Limited support for output destinations
Flume NG | Designed with redundancy in mind; strong for large-scale log collection | 221 | 2k | Fewer plugins. Complex configuration

In addition, SaaS offerings such as Datadog Logs and Stackdriver Logging have recently become available, and in most cases Metaps builds its base configuration with Fluentd alone or with Fluentd + Datadog Logs.

Storage selection

Next, consider where to store the logs. It depends on the cloud platform that runs the containers, but I think Amazon S3 is often chosen on AWS and Google BigQuery on GCP. Here I have summarized the selection criteria and considerations for each storage option I have actually used.

Storage | Selection criteria | Data visualization | Log monitoring (anomaly detection) | Consideration
--- | --- | --- | --- | ---
Amazon S3 | Low data storage cost. Old logs can be moved to Amazon S3 Glacier to save even more | Integration with BI tools | – | Data analysis requires schema design with AWS Glue. SQL-like search is possible by using Amazon Athena (the price depends on the amount of data scanned, so keeping costs down requires a partitioning design and a table update mechanism)
Amazon Redshift | Supports standard SQL queries. Hourly billing | Integration with BI tools | – | Expensive
Amazon Elasticsearch Service | Intuitive search and data visualization using Kibana | Kibana | Alerting | AWS offers a fully managed Elasticsearch service. To increase availability, the master nodes and data nodes must be separated, which raises operating costs (it can run without dedicated master nodes, but stability suffers)
Amazon CloudWatch Logs | Low data storage cost. High affinity with other AWS services (easy to stream logs to AWS Lambda and Amazon Elasticsearch Service) | CloudWatch Logs Insights | CloudWatch Alarm | The console is not very user-friendly
Datadog Logs | High affinity with other Datadog products. Log monitoring rules can be managed as code with Terraform | Datadog Logs | Log monitor | Billed by log ingestion volume and retention period (operating costs can become high depending on usage)
Google BigQuery | High data load performance. Supports standard SQL queries. Billed per query | Integration with BI tools | – | Data visualization requires integration with BI tools such as Google Data Studio

Amazon Elasticsearch Service (Kibana) and Datadog Logs clearly outperform the other services in terms of log searchability and visualization, but on the other hand their maintenance costs tend to be high. Since Fluentd can define multiple output destinations, one option is to build a two-tier configuration that keeps logs in Elasticsearch only for a short period for visualization purposes and stores logs for the long term in separate persistent storage such as Amazon S3, as sketched below.
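
A minimal Fluentd configuration sketch for this two-tier pattern, assuming the fluent-plugin-elasticsearch and fluent-plugin-s3 plugins are installed (the tag, host, and bucket names are placeholders):

<match app.**>
  @type copy
  # Short-term storage for search and visualization in Kibana
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    port 9200
    logstash_format true    # write to date-based indices so old ones can be dropped
  </store>
  # Long-term, low-cost storage
  <store>
    @type s3
    s3_bucket my-app-logs
    s3_region ap-northeast-1
    path logs/
    <buffer time>
      timekey 3600          # flush to S3 once per hour
      timekey_wait 10m
    </buffer>
  </store>
</match>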

Price comparison

Amazon Elasticsearch Service and Datadog Logs are often chosen as log storage, so I compared their usage fees [^ pricing].

For a service that generates 90 million log records per month, Datadog Logs (ingest) came out cheaper, but Elasticsearch can be more advantageous depending on the log retention period and the number of scans, so it is hard to say in general which one is better. You need to decide which service to choose based on the characteristics and requirements of your service.

Log monitoring

Personally I like Elasticsearch for data visualization, but Elasticsearch does not have what Datadog Logs calls a log monitor (a mechanism that uses search queries to send alerts when specific conditions are matched). To be exact, it is included in the Elasticsearch distribution provided by Elastic, but it was not provided in Amazon Elasticsearch Service. Before I knew it, however, the AWS version had also gained an Alerting function: with the release of Open Distro for Elasticsearch [^ distro_es], it is apparently supported from version 6.5. When I tried the alert function, I was able to add a monitor very easily. The procedure is described below.

  1. Open Kibana and open Alerting from the left sidebar.
  2. First, register the alert notification destination: open the Destinations tab and press Add destination.
    • (Screen: Destination)
  3. Open the Monitors tab and press Create monitor.
    • (Screen: Configure monitor)
  4. Create trigger opens. Here, set the conditions for sending alerts, based on the monitoring conditions configured on the monitor.
    • (Screen: Define trigger)

This completes the settings. I confirmed that a notification like the one below is sent when requests exceed the threshold.

(Screenshot: alert notification)

Open Distro for Elasticsearch also publishes an Alerting API, so it should be easy to manage monitor rules as code.
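
As a rough sketch of what such a monitor defined as code could look like (the index pattern, field names, and threshold are illustrative; check the Open Distro Alerting API documentation for the exact schema), a monitor can be created by sending a JSON definition to the Alerting API:

POST _opendistro/_alerting/monitors
{
  "type": "monitor",
  "name": "error-log-spike",
  "enabled": true,
  "schedule": { "period": { "interval": 1, "unit": "MINUTES" } },
  "inputs": [{
    "search": {
      "indices": ["app-logs-*"],
      "query": { "size": 0, "query": { "term": { "level": "ERROR" } } }
    }
  }],
  "triggers": [{
    "name": "too-many-errors",
    "severity": "1",
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 100",
        "lang": "painless"
      }
    },
    "actions": []
  }]
}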

[^ log_type]: Exceptions raised when the application errors are excluded from logs here. The de facto standard is to manage them with an error tracking tool such as Sentry or Rollbar.
