Compared to a few years ago, I have the impression that the number of services running applications in container execution environments such as GKE and ECS has increased considerably. When operating containers, the question is how to handle logs for tracking application events. The traditional approach of periodically rotating old logs and transferring them to other storage is not a good fit for cloud-native architectures. The Twelve-Factor App methodology provides the guideline of treating logs as event streams, and in recent web applications the microservices approach of loosely coupling systems has become mainstream. Application logs are formatted per service and then delivered to a log collection service, with requirements such as real-time analysis, notification of anomalous data, and data visualization added as needed. Here, I will introduce an example of a best-practice architecture based on the experience of designing container operations and log infrastructure in production environments at Metaps over the past few years.
When I first joined Metaps in 2015, Rails was the common choice for development, with Unicorn as the application server and MySQL and Redis as the data stores. Application logs were collected by Fluentd; anomalous data was notified to Slack, and all logs were stored in S3. I think most services in the Metaps Group adopted roughly this kind of mechanism.
As time went by, microservices became a common approach to software development. Development environments moved from Vagrant to Docker, and cloud services for container execution such as GKE, ECS, and EKS appeared. In the Rails world, the Webpack(er) + API mode development style emerged, and separating the front end and back end into distinct services became the fashionable way to build. The problem here is application log management. With a monolithic application it is enough to send related logs per user request, but with microservices, multiple services communicate asynchronously, so it becomes difficult to track time-series data per user just by looking at the logs in S3.
Therefore, we decided to redesign the infrastructure configuration to suit microservices.
Metaps adopted Amazon ECS for a new service soon after its release. Our knowledge about ECS is published as ECS operation know-how, so please refer to it if you are interested. Regarding log delivery design, we previously published How to handle logs in ECS; in the early stage of operation we used CloudWatch Logs, and the latter half of that article describes migrating to a more advanced log platform based on Fluentd and Elasticsearch. Even now, the basic concept of log delivery has not changed, and the basic configuration is that each microservice that makes up the application delivers data to a back-end log service.
When building a general Web application, the logs generated from the system can be classified into the following types.
Log type | Explanation | Typical log contents |
---|---|---|
Application log | Access logs, logs output by the application framework, and logs intentionally embedded by developers | Logger output (DEBUG, WARN, ERROR, etc.) |
System log | Logs of the cluster and the infrastructure underlying it | Amazon ECS event logs, Amazon RDS error logs, Amazon VPC flow logs |
Audit log | A trail of who did what to the system and when. Generally required to be retained for one to three years | Access logs, auditd, AWS Config, AWS CloudTrail |
Since the focus this time is on container operation, I will introduce log delivery patterns related to application logs[^log_type].
Docker has a function called Log Driver, which provides a mechanism for delivering the container's standard output and standard error to destinations such as local storage, Amazon CloudWatch Logs, and Fluentd.
First, create a program that outputs a string.

hello.rb

```ruby
puts 'hello'
```
Then pull the Ruby image and run the command. Docker has a `--log-driver` option to specify the log delivery destination[^log_driver], so specify `json-file` here.
```
% docker run -v ${PWD}:/app --log-driver=json-file -w /app ruby:alpine ruby hello.rb
hello
```
I was able to confirm that the command ran and `hello` was output to the console. If `--log-driver` is not specified, `json-file` is treated as the default driver.
Next, check the log file. Running `docker ps -a` shows that the container we just ran still remains, so let's use `docker inspect` to check the log output destination.
```
% docker ps -a
CONTAINER ID   IMAGE         COMMAND           CREATED              STATUS                          PORTS   NAMES
01b164f54437   ruby:alpine   "ruby hello.rb"   About a minute ago   Exited (0) About a minute ago           sweet_babbage

% docker inspect 01b164f54437 | grep LogPath
    "LogPath": "/var/lib/docker/containers/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7-json.log",
```
The path described in `LogPath` is the output destination of the log. In a Linux environment you can open it with cat, but in a Mac (Docker for Mac) environment you will get an error such as "No such file or directory". This is because Docker for Mac runs inside a VM (HyperKit), and you need to enter the VM by running a privileged container in order to reach the file.
```
% docker run -it --rm --privileged --pid=host justincormack/nsenter1
/ # cat /var/lib/docker/containers/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7/01b164f54437e80b8e82767775e697876ea1fbe6b5d95e118defea1b8c6bf4a7-json.log
{"log":"hello\n","stream":"stdout","time":"2020-12-25T00:27:10.82632Z"}
```
By cat-ing the path written in `LogPath`, I was able to confirm that the log had been output.
I have shown how to specify the log driver via the docker command, but when specifying the driver from docker-compose, you set the `logging` parameter in docker-compose.yml.
docker-compose.yml example

```yaml
services:
  web:
    logging:
      driver: json-file
```
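The `json-file` driver also accepts options under `logging.options`; for example, `max-size` and `max-file` can be used to rotate local log files so they do not fill up the disk. A minimal sketch, with illustrative values:

```yaml
services:
  web:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate each log file when it reaches 10 MB
        max-file: "3"     # keep at most 3 rotated files per container
```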
GKE
In GKE, Fluentd runs as a DaemonSet and, by default, collects the logs that Pods write to standard output and delivers them to Stackdriver Logging.
Amazon ECS
By specifying `logConfiguration`[^log_configuration] in the task definition, you get behavior equivalent to `--log-driver` and `--log-opt`.
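For reference, here is a minimal sketch of a container definition fragment that sends container logs to CloudWatch Logs via the awslogs driver; the log group name, region, and stream prefix below are placeholders for illustration:

```json
{
  "containerDefinitions": [
    {
      "name": "web",
      "image": "ruby:alpine",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/sample-app",
          "awslogs-region": "ap-northeast-1",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}
```

On the EC2 launch type, other drivers such as `fluentd` can also be specified as the `logDriver`, which is one way to route container logs into the Fluentd-based pipelines described below.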
The following tools are available for collecting logs from each container[^log_collector].
Tool name | Features | GitHub Watch | GitHub Star | Concerns |
---|---|---|---|---|
Fluentd | *Widely used in production and actively developed *A wide range of plugins and supported log output destinations | 348 | 9.8k | Measures are needed against log loss due to buffer overflow, etc. |
Fluent Bit | Written in C and runs fast | 90 | 2.4k | Fewer plugins |
Logstash | *Developed by Elastic *Well-documented | 831 | 11.9k | Written in JRuby (high memory usage) |
Beats | *Specializes in data collection (simple functionality) *Written in Go and runs fast | 635 | 9.5k | *Fewer plugins *Limited support for output destinations |
Flume NG | Designed with redundancy in mind and strong for large-scale log collection | 221 | 2k | *Fewer plugins *Complex configuration |
In addition, SaaS offerings such as Datadog Logs and Stackdriver Logging have recently become available, and in most cases at Metaps the base configuration is built with Fluentd or Fluentd + Datadog Logs.
Next, consider where to store the logs. It depends on the cloud platform that runs the containers, but I think S3 is often chosen on AWS and BigQuery on GCP. Here I have summarized the selection criteria and considerations for each storage option I have actually used.
Storage | Selection criteria | Data visualization | Log monitoring (anomaly detection) | Considerations |
---|---|---|---|---|
Amazon S3 | *Low data storage cost *Moving old logs to Amazon S3 Glacier saves even more | Integration with BI tools | ☓ | *Data analysis requires schema design with AWS Glue *SQL-like search is possible by using Amazon Athena (pricing depends on the amount of data scanned, so partitioning design and table update mechanisms are needed to keep costs down) |
Amazon Redshift | *Supports standard SQL *Hourly billing | Integration with BI tools | ☓ | Expensive |
Amazon Elasticsearch Service | *Intuitive search and data visualization with Kibana | Kibana | Alerting | *AWS offers it as a fully managed service *For higher availability, master nodes and data nodes must be separated, which raises operating costs (it can run without dedicated master nodes, but stability suffers) |
Amazon CloudWatch Logs | *Low data storage cost *High affinity with other AWS services (easy to stream logs to AWS Lambda and Amazon Elasticsearch Service) | CloudWatch Logs Insights | CloudWatch Alarm | The console is hard to use |
Datadog Logs | *High affinity with other Datadog products *Log monitoring rules can be managed as code with Terraform | Datadog Logs | Log monitor | Billed by log ingestion volume and retention period (operating costs can be high depending on usage) |
Google BigQuery | *High data load performance *Supports standard SQL *Per-query billing | Integration with BI tools | ☓ | Data visualization requires integration with BI tools such as Google Data Studio |
Amazon Elasticsearch Service (Kibana) and Datadog Logs clearly outperform the other options in log searchability and visualization, but their maintenance costs tend to be high. Since Fluentd can define multiple output destinations, one option is a two-tier setup in which logs are kept in Elasticsearch only for a short period for visualization purposes, while logs to be retained are stored separately in durable storage such as Amazon S3.
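As a minimal sketch of such a two-tier setup, a Fluentd `copy` output can fan the same events out to Elasticsearch and S3. The tag, host, index prefix, bucket name, and buffer settings below are placeholders, and it assumes the fluent-plugin-elasticsearch and fluent-plugin-s3 plugins are installed:

```
<match app.**>
  @type copy
  # Short-term storage for search and visualization in Kibana
  <store>
    @type elasticsearch
    host es.example.com
    port 9200
    logstash_format true
    logstash_prefix app-log
  </store>
  # Long-term archive in S3
  <store>
    @type s3
    s3_bucket example-app-logs
    s3_region ap-northeast-1
    path logs/
    <buffer time>
      timekey 3600      # flush to S3 hourly
      timekey_wait 10m
    </buffer>
  </store>
</match>
```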
Amazon Elasticsearch Service and Datadog Logs are the options most often selected as log storage, so I compared their usage fees[^pricing] (the Elasticsearch side of the comparison assumes an m5.large.elasticsearch instance).
For a service that generates 90 million records per month, Datadog Logs (Ingest) was advantageous in terms of price, but Elasticsearch can come out ahead depending on the log retention period and the number of scans, so I can't say which is better in general. You need to decide which service to choose based on the characteristics and requirements of your own service.
I personally like Elasticsearch for data visualization, but Elasticsearch lacks what Datadog Logs calls a log monitor (a mechanism that sends alerts when a search query matches specific conditions). To be exact, it is included in the Elasticsearch distribution provided by Elastic, but it was not available in Amazon Elasticsearch Service. Before I knew it, though, the AWS version had also gained an Alerting feature: apparently, with the release of Open Distro for Elasticsearch[^distro_es], it has been supported since version 6.5. When I tried the alert function, I was able to add a monitor very easily. The procedure is described below.
1. Open Kibana and select `Alerting` from the left sidebar.
2. First, register an alert notification destination: open the `Destinations` tab and press `Add destination`. `Amazon SNS`, `Amazon Chime`, `Slack`, and `Custom webhook` are available as destination types; press `Create`.
3. Open the `Monitors` tab and press `Create monitor`. Choosing `Define using visual graph` lets you specify a query from the GUI. Here I created a monitor on the `access-log-*` index that watches whether `/users/B-hU3nDsTPx3YTEPS_ylMQ/` was accessed, then pressed `Create`.
4. `Create trigger` opens next. Here, set the conditions for sending alerts based on the monitoring conditions defined on the monitor.
5. Press `Add action` to set the action to take when the threshold is exceeded, then press `Create`.
This completes the settings. I confirmed that a notification is sent when requests exceed the threshold. Open Distro for Elasticsearch also publishes an Alerting API, so it seems easy to manage monitor rules as code.
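As a rough sketch based on my understanding of the Open Distro Alerting API (the index name, field name, schedule, and threshold below are illustrative placeholders, and the request can be issued from the Kibana Dev Tools console or via curl), a monitor similar to the one above could be registered like this:

```json
POST _opendistro/_alerting/monitors
{
  "type": "monitor",
  "name": "access-log-monitor",
  "enabled": true,
  "schedule": { "period": { "interval": 1, "unit": "MINUTES" } },
  "inputs": [
    {
      "search": {
        "indices": ["access-log-*"],
        "query": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                { "match_phrase": { "path": "/users/B-hU3nDsTPx3YTEPS_ylMQ/" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  ],
  "triggers": [
    {
      "name": "too-many-requests",
      "severity": "1",
      "condition": {
        "script": {
          "source": "ctx.results[0].hits.total.value > 100",
          "lang": "painless"
        }
      },
      "actions": []
    }
  ]
}
```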
[^log_type]: Exceptions raised when the application errors are excluded from logs here; the de facto standard is to manage them with an error tracking tool such as Sentry or Rollbar.