[blackbird-elb] Monitoring AWS ElasticLoadBalancing

blackbird ElasticLoadBalancing plugin

In this plugin, the CloudWatch API is called from the blackbird side to retrieve various Metrics of Elastic Load Balancing (hereinafter ELB).

You can get the following CloudWatch Metrics

The columns below are, from left to right: the CloudWatch Metric name (see the official AWS documentation for the meaning of each Metric), the statistic that is taken (such as the sum or the average per unit time, or strictly speaking per acquisition interval), and a brief explanation.

CloudWatch Metric Name	Statistics	Detail
HTTPCode_Backend_2XX	Sum	Total number of 2XX responses from backend servers within the acquisition interval
HTTPCode_Backend_3XX	Sum	Total number of 3XX responses from backend servers within the acquisition interval
HTTPCode_Backend_4XX	Sum	Total number of 4XX responses from backend servers within the acquisition interval
HTTPCode_ELB_4XX	Sum	Total number of 4XX responses generated by the ELB itself within the acquisition interval
HTTPCode_ELB_5XX	Sum	Total number of 5XX responses generated by the ELB itself within the acquisition interval
Latency (Average)	Average	Average response time within the acquisition interval
Latency (Maximum)	Maximum	Maximum response time within the acquisition interval
Latency (Minimum)	Minimum	Minimum response time within the acquisition interval
RequestCount	Sum	Number of requests within the acquisition interval
SpilloverCount	Sum	Number of requests that overflowed from the ELB's internal request queue within the acquisition interval
BackendConnectionErrors	Sum	Number of connections to backend servers that could not be established within the acquisition interval
HealthyHostCount	Maximum	Number of backend servers in the acquisition interval that have successfully performed Health Check
UnhealthyHostCount	Maximum	Number of backend servers in the acquisition interval that have failed Health Check
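
As a rough illustration of how one row of this table maps to a CloudWatch request, the sketch below builds the parameters for a GetMetricStatistics call. It is not the plugin's actual code: it assumes boto3 as the client library, and the helper names build_request and fetch_latest are hypothetical. The AWS/ELB namespace and the LoadBalancerName and AvailabilityZone dimensions are the ones CloudWatch defines for ELB.

```python
from datetime import datetime, timedelta, timezone

# (CloudWatch metric name, statistic) pairs taken from the table above.
# CloudWatch itself spells the last metric "UnHealthyHostCount".
ELB_METRICS = [
    ("HTTPCode_Backend_2XX", "Sum"),
    ("HTTPCode_Backend_3XX", "Sum"),
    ("HTTPCode_Backend_4XX", "Sum"),
    ("HTTPCode_ELB_4XX", "Sum"),
    ("HTTPCode_ELB_5XX", "Sum"),
    ("Latency", "Average"),
    ("Latency", "Maximum"),
    ("Latency", "Minimum"),
    ("RequestCount", "Sum"),
    ("SpilloverCount", "Sum"),
    ("BackendConnectionErrors", "Sum"),
    ("HealthyHostCount", "Maximum"),
    ("UnHealthyHostCount", "Maximum"),
]


def build_request(metric_name, statistic, load_balancer_name,
                  availability_zone, interval=300, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    The query window is the last `interval` seconds, returned as one period.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ELB",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": load_balancer_name},
            {"Name": "AvailabilityZone", "Value": availability_zone},
        ],
        "StartTime": end - timedelta(seconds=interval),
        "EndTime": end,
        "Period": interval,
        "Statistics": [statistic],
    }


def fetch_latest(metric_name, statistic, load_balancer_name,
                 availability_zone, region_name="us-east-1", interval=300):
    """Hypothetical helper: return the newest datapoint value, or None."""
    import boto3  # third-party; assumed here, imported lazily

    client = boto3.client("cloudwatch", region_name=region_name)
    response = client.get_metric_statistics(
        **build_request(metric_name, statistic, load_balancer_name,
                        availability_zone, interval)
    )
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1][statistic] if datapoints else None
```

Calling fetch_latest("RequestCount", "Sum", "YOUR_ELB_NAME", "ap-northeast-1a", region_name="ap-northeast-1") would need valid AWS credentials; build_request alone can be inspected without any network access.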

About SpilloverCount

SpilloverCount, as the name implies, is the number of requests that overflowed. The ELB stores requests internally in a queue (called the Surge Queue): if the backend is too busy to handle a request, the request is held temporarily in that queue. When the queue overflows, SpilloverCount becomes 1 or more. (While writing this I noticed that the plugin does not yet collect SurgeQueueLength, which is a shortcoming. I will add it soon.)

So when SurgeQueueLength increases, the backend servers are already failing to keep up with the traffic, which is bad enough, but in that state requests may still be answered before the queue overflows. If SpilloverCount itself increases, the service is probably barely functioning at all.

About BackendConnectionErrors

BackendConnectionErrors is similar in that the service is not functioning properly when it increases. However, if the backend server handles I/O and network processing asynchronously (with epoll or similar, as nginx does), new connections can still be established even though processing cannot keep up. In such a case BackendConnectionErrors does not appear, but SpilloverCount seems to increase.
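
The distinction between the two Metrics can be summarized as a small decision function. This is only a reading aid based on the explanation above, not logic taken from the plugin or the Zabbix template; the function name diagnose and the returned labels are hypothetical.

```python
def diagnose(spillover_count, backend_connection_errors):
    """Classify the ELB state from the two Metrics described above."""
    if backend_connection_errors > 0 and spillover_count > 0:
        # Connections fail and queued requests are being dropped.
        return "connect-failing-and-overflowing"
    if backend_connection_errors > 0:
        # The ELB cannot even establish connections to the backend.
        return "connect-failing"
    if spillover_count > 0:
        # Connections succeed (e.g. an asynchronous backend), but
        # processing cannot keep up and the Surge Queue overflows.
        return "accepting-but-overloaded"
    return "ok"
```

For example, diagnose(12, 0) corresponds to the asynchronous-backend case: no connection errors, but requests are spilling over.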

Zabbix Template

From here on, this section covers the Zabbix template.

Triggers

Roughly speaking, the triggers check the following:

Does Average Latency exceed a constant number of ms?
Does Maximum Latency exceed a constant number of ms?
Is the number of 4XX HTTP Status Codes on the Backend side or the ELB side more than a constant?
Is the number of 5XX HTTP Status Codes on the Backend side or the ELB side more than a constant?
Is SpilloverCount more than a constant?
Is BackendConnectionErrors more than a constant?

An alert is raised when any of these conditions is met. Each constant is defined as a Macro of the Template, so please edit it to suit the traffic characteristics of your service. Also, each trigger has three levels of Severity, so you can, for example, notify by chat for Info, by email for Average, and by mobile phone for High.
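
To make the three-level idea concrete, here is a minimal sketch of how a value could be mapped to a Severity and then to a notification channel. Zabbix evaluates the real triggers itself, so this is purely illustrative: the threshold values, the dictionary names, and the functions classify and route are all assumptions and do not come from the template.

```python
# Hypothetical thresholds in milliseconds; in the template the real
# values would be defined as Macros.
LATENCY_THRESHOLDS_MS = {"Info": 500, "Average": 1000, "High": 3000}

# Notification channel per Severity, following the example in the text.
CHANNELS = {"Info": "chat", "Average": "email", "High": "mobile phone"}


def classify(value, thresholds):
    """Return the highest Severity whose threshold the value exceeds."""
    matched = None
    # Walk thresholds from lowest to highest, keeping the last one exceeded.
    for severity, limit in sorted(thresholds.items(), key=lambda item: item[1]):
        if value > limit:
            matched = severity
    return matched


def route(value, thresholds=LATENCY_THRESHOLDS_MS):
    """Return the channel to notify, or None when no trigger fires."""
    return CHANNELS.get(classify(value, thresholds))
```

With these assumed thresholds, a latency of 1200 ms would be classified as Average and routed to email.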

Graphs

Average response time

スクリーンショット_2014-12-03_21_37_51.png

Backend status code integration graph

スクリーンショット_2014-12-03_21_39_00.png

RequestCount

スクリーンショット_2014-12-03_21_44_41.png

Maximum response time

Omitted because it is an ordinary bar graph

ELB status code (4XX, 5XX only) integration graph

Omitted because it is an ordinary bar graph

How to Install

Using pip

pip install blackbird-elb

It is registered on PyPI, so you can install it directly with pip.

Using yum

First, create a repo file. (Eventually I would like to offer something like http://blackbird.example.com/blackbird_install.sh | sh, but please wait a little longer.)

[blackbird]
name=blackbird package repository
baseurl=https://vagrants.github.io/blackbird/repo/yum/6/x86_64
enabled=0
gpgcheck=0

yum install blackbird --enablerepo=blackbird

How to Use

Configure your blackbird

#Any section name in the ini file format is okay
#However, a thread is created internally with this name, so it is safer not to reuse the same name elsewhere.
[ANYTHING_OK]
#The acquisition interval in seconds. You could set 1 second, but CloudWatch does not support it, so the minimum value is 60 seconds.
interval = 300

#The region name. The default is us-east-1.
region_name = ap-northeast-1

#AWS credential
#Authentication via STS with the IAM Role of the EC2 Instance is planned, so please wait for a while. I would rather not write credentials here either.
aws_access_key_id = ACCESS_KEY_ID
aws_secret_access_key = SECRET_ACCESS_KEY

#ELB name
load_balancer_name = YOUR_ELB_NAME

#ELB's CloudWatch Metrics can be retrieved for each AZ, so specify the availability zone (this is also handy because you can see which AZ a problem depends on).
availability_zone = ap-northeast-1a

# Specify the Host as registered in Zabbix Web (the Zabbix screen that you usually use). Without it the destination is unknown, so don't forget to create the Host on the Zabbix side first.
hostname = YOUR_ELB_NAME_ON_ZABBIX

#module specifies which plugin to use; here it is always elb
module = elb

Place this configuration file under the include_dir set in blackbird's own defaults.cfg (like Nginx, blackbird can include something like /etc/blackbird/conf.d/*.conf), or append it to defaults.cfg directly.
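
For readers who want to check such a section before handing it to blackbird, the sketch below reads the same keys with Python's standard configparser. This is a stand-in and not how blackbird itself loads its configuration; the function name load_elb_sections is hypothetical, and clamping interval up to 60 is only this sketch's assumption about how the 60-second minimum might be enforced. Credentials are left out of the sample on purpose.

```python
import configparser

MIN_INTERVAL = 60  # the text states 60 seconds as the minimum for CloudWatch

SAMPLE = """\
[ANYTHING_OK]
interval = 300
region_name = ap-northeast-1
load_balancer_name = YOUR_ELB_NAME
availability_zone = ap-northeast-1a
hostname = YOUR_ELB_NAME_ON_ZABBIX
module = elb
"""


def load_elb_sections(text):
    """Return {section name: settings} for every section with module = elb."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    result = {}
    for name in parser.sections():
        section = parser[name]
        if section.get("module") != "elb":
            continue  # other plugins' sections are ignored
        interval = section.getint("interval", fallback=MIN_INTERVAL)
        result[name] = {
            # Assumption: values below the minimum are raised to it.
            "interval": max(interval, MIN_INTERVAL),
            "region_name": section.get("region_name", fallback="us-east-1"),
            "load_balancer_name": section["load_balancer_name"],
            "availability_zone": section["availability_zone"],
            "hostname": section["hostname"],
        }
    return result
```

load_elb_sections(SAMPLE) returns one entry keyed by ANYTHING_OK; a missing load_balancer_name, availability_zone, or hostname raises KeyError, which makes an incomplete section easy to spot.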

Let's run your blackbird!

When installed with pip

blackbird --config YOUR_DEFAULT_CONFIG

When installed with yum

sudo service blackbird start

With this, values should start arriving once the time specified for interval has elapsed!