[blackbird-kinesis-stream] Monitoring AWS Kinesis Stream

This Plugin (https://github.com/Vagrants/blackbird-Kinesis-Stream) gets CloudWatch Metrics for Kinesis Stream.

What Metrics does this plugin get?

The Metric Name column is written as MetricName.Statistics: the CloudWatch metric name, followed by how the value is taken (Sum or Average). Unit is the unit of the value (bytes, milliseconds, and so on).

Metric Name	Unit	Detail
PutRecord.Bytes	Bytes	Number of bytes put into the Stream
PutRecord.Latency	Milliseconds	Latency of putting into the Stream (PutRecord API response time)
PutRecord.Success	Count	Number of successful PutRecord API calls
GetRecords.Bytes	Bytes	Number of bytes retrieved from the Stream
GetRecords.IteratorAgeMilliseconds	Milliseconds	-
GetRecords.Latency	Milliseconds	Latency of reading from the Stream (GetRecords API response time)
GetRecords.Success	Count	Number of successful GetRecords API calls
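
As a rough illustration of where these values come from, the sketch below builds the parameters of a CloudWatch GetMetricStatistics request for one metric in the table. It assumes boto3 as the client (the plugin itself may use something else), and the stream name `my-stream`, the 300-second period, and the helper name `build_request` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_request(stream_name, metric_name, statistic, period=300, now=None):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    Hypothetical helper for illustration; not taken from the plugin.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": metric_name,      # e.g. "PutRecord.Bytes"
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": end - timedelta(seconds=period),
        "EndTime": end,
        "Period": period,               # seconds, a multiple of 60
        "Statistics": [statistic],      # "Sum" or "Average"
    }


params = build_request("my-stream", "PutRecord.Bytes", "Sum")
# With boto3 installed and AWS credentials configured, the request would be:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   datapoints = cw.get_metric_statistics(**params)["Datapoints"]
```

Building the parameters separately from the call keeps the metric name and the statistic, the two halves of the Metric Name column, visible side by side.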

Zabbix Template

Items

These items are the CloudWatch metrics above converted to per-second values.

PutRecord Bytes per Second
PutRecord Success per Second
GetRecords Bytes per Second
GetRecords Success per Second
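
One way to read "per second" here, assuming each item is a CloudWatch Sum for one period divided by the length of that period (the template may compute it differently, and the sample numbers are made up):

```python
def per_second(period_sum, period_seconds):
    """Convert a Sum collected over one CloudWatch period into a per-second rate."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return period_sum / period_seconds


# Hypothetical sample: 1,500,000 bytes put during a 300-second period.
rate = per_second(1_500_000, 300)
print(rate)  # 5000.0
```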

Triggers

// Just between us, creating the triggers was a real struggle.

Because of how Kinesis Stream works (unlike a queue, data does not disappear when it is read), the size of the entire Stream does not mean much. So at first I tried to raise a trigger when the difference between PutRecord and GetRecords was much smaller (or much larger) than at the previous check. But because the consumer side fetches records in bulk with GetRecords, the value flaps quite a bit. As a result, the monitoring was honestly very noisy when I first introduced it (at the experimental stage the alerts only went to my own mobile phone, and it drove me crazy).

So, instead of monitoring a simple difference, I changed the trigger to subtract the average of the last 3 values (the short span) from the average of the last 10 values (the long span), and to throw an alert when that difference exceeds 25%. The number of values averaged is just an example.

This evens out the flapping values and lets you watch the averaged difference instead. Only when the flow rate into the Stream (or the flow rate out of it) rises or falls sharply do you notice that, oh, something happened.

Of course, depending on the characteristics of the application, you will probably need to lengthen or shorten the long span and the short span, so the spans and the threshold are each set in a MACRO.
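
The logic above can be sketched in a few lines of Python. This is not the actual Zabbix trigger expression. It assumes that "25%" is measured relative to the long-span average and that both increases and decreases should fire, and the function names, parameter names, and sample values are hypothetical.

```python
def average_diff_ratio(values, long_span=10, short_span=3):
    """Return (long average - short average) / long average.

    `values` is ordered oldest to newest. Dividing by the long average
    is an assumption about what "25%" is measured against.
    """
    if not 0 < short_span <= long_span:
        raise ValueError("need 0 < short_span <= long_span")
    if len(values) < long_span:
        raise ValueError("not enough values for the long span")
    long_avg = sum(values[-long_span:]) / long_span
    short_avg = sum(values[-short_span:]) / short_span
    if long_avg == 0:
        return 0.0  # no baseline traffic, so no relative change to report
    return (long_avg - short_avg) / long_avg


def should_alert(values, long_span=10, short_span=3, threshold=0.25):
    """True when the averaged difference exceeds the threshold either way."""
    ratio = average_diff_ratio(values, long_span, short_span)
    return abs(ratio) > threshold


# Jagged but steady traffic: the averages stay close, so no alert.
jagged = [100, 300, 100, 300, 100, 300, 100, 300, 100, 300]
print(should_alert(jagged))   # False

# Traffic that really dropped: the short average falls well below the long one.
dropped = [200, 200, 200, 200, 200, 200, 200, 50, 50, 50]
print(should_alert(dropped))  # True
```

The three keyword arguments play the role of the span and threshold MACROs, so tuning for a different application means changing only those values.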

The triggers are as follows.

Average difference monitoring using the logic above:
- PutRecord.Bytes
- PutRecord.Success
- GetRecords.Bytes
- GetRecords.Success
Latency monitoring:
- PutRecord.Latency
- GetRecords.Latency

You can set info, average, and high thresholds for each of them, so use the severity to switch between chat, email, and other notification integrations.
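
A minimal sketch of how three thresholds might map to a severity and a notification route. Only the 25% value comes from the text above. The 50% and 75% thresholds, the `ROUTES` table, and the function name are hypothetical, and in practice this mapping lives in the Zabbix trigger and action settings rather than in code.

```python
# Hypothetical routing table: which channel handles which severity.
ROUTES = {"info": "chat", "average": "email", "high": "phone"}


def severity(ratio, info=0.25, average=0.50, high=0.75):
    """Return the highest severity whose threshold the deviation exceeds, or None."""
    deviation = abs(ratio)
    if deviation > high:
        return "high"
    if deviation > average:
        return "average"
    if deviation > info:
        return "info"
    return None


level = severity(-0.6)
print(level, ROUTES[level])  # average email
```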

Graphs

The above average difference monitoring is difficult to understand in words, but it should be very easy to understand in graphs.

Graph of raw values of GetRecords.Bytes

[Image: screenshot of the raw GetRecords.Bytes graph]

It's quite jagged. If you monitor this with a simple comparison against the previous value, you will get a lot of alerts.

GetRecords.Bytes average diff graph

[Image: screenshot of the GetRecords.Bytes average diff graph]

This is the smoothed graph, and I think it looks much calmer. If you look at it over a slightly longer span and compare it with the application's peak times, the flow rate does seem to follow this shape.

The template also includes the following graphs.

Get/Put Latency

Latency of GetRecords and PutRecord (response time of each API)

Get/Put Success

Number of successful GetRecords and PutRecord calls

Write/Read per second

Bytes written and read per second

How to Install

I have uploaded the RPM to the usual place, so create a repo file by following this procedure (//qiita.com/JumpeiArashi/items/849281083b6c7888f25d#case-of-using-yum), and then run:

yum install blackbird-kinesis-stream --enablerepo=blackbird

The pip package will be updated soon, so please wait a little ＞＜

Is there no monitoring for each Shard?

CloudWatch can provide metrics for each Shard, so I really would like to implement this (it would show where the Keys are concentrated). Still, unless the application logic itself is wrong, I think the flow-rate difference monitoring is enough to notice that something is happening.

That said, I plan to add monitoring for each Shard in the future.

Finally

I think Put throughput has improved a lot since the PutRecords API was released the other day. The plugin does not support it yet, so I would like to collect its metrics as soon as possible and add them to the Zabbix Template.