This Plugin (https://github.com/Vagrants/blackbird-Kinesis-Stream) gets CloudWatch Metrics for Kinesis Stream.
The Metric Name column has the form MetricName.Statistics: the CloudWatch metric name, followed by the statistic used to aggregate it (Sum or Average). Unit is the unit of the value (bytes, milliseconds, and so on).
Metric Name | Unit | Detail |
---|---|---|
PutRecord.Bytes | Bytes | Number of bytes put to the Stream |
PutRecord.Latency | milliseconds | Latency of putting to the Stream (PutRecord API response time) |
PutRecord.Success | Count | Number of successful PutRecord API calls |
GetRecords.Bytes | Bytes | Number of bytes retrieved from the Stream |
GetRecords.IteratorAgeMilliseconds | milliseconds | - |
GetRecords.Latency | milliseconds | Latency of retrieving from the Stream (GetRecords API response time) |
GetRecords.Success | Count | Number of successful GetRecords API calls |
Items
Each item is one of the CloudWatch metrics above, converted to a per-second value.
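As a rough illustration of what "per second" means here, the sketch below divides a CloudWatch Sum by the length of the period it covers. The function name and the 60-second period are my own assumptions for the example, not taken from the plugin's code.

```python
def per_second(sum_value, period_seconds):
    """Convert a Sum collected over one period into a per-second rate.

    Hypothetical helper for illustration; the plugin may compute this differently.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return sum_value / period_seconds


# Example: 120,000 bytes were put during an assumed 60-second period.
bytes_per_second = per_second(120_000, 60)
```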
Triggers
(Just between us, creating these triggers was a real struggle.)
Because of how Kinesis Stream works (unlike a queue, data does not disappear when it is read), the size of the whole Stream is not very meaningful. So I first tried to fire a trigger when the difference between PutRecord and GetRecords became much smaller (or much larger) than at the previous check. However, because the consumer side fetches records in bulk with GetRecords, that value flaps quite a bit. When I first introduced the plugin I was monitoring exactly this naive difference (at the experimental stage the alerts went only to my mobile phone, and it drove me crazy).
So instead of monitoring the simple difference, the trigger now subtracts the average of the last 3 values (the short span) from the average of the last 10 values (the long span); the sample counts here are just examples. An alert is raised when the difference exceeds 25%.
This evens out the flapping and lets you look at the averaged difference. Only when the flow into the Stream (or the flow read out of it) rises or falls sharply do you get the signal that "oh, something happened".
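The long-span versus short-span comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Zabbix trigger: the function names are hypothetical, and I am assuming that "25%" means the gap relative to the long-span average.

```python
from statistics import mean

LONG_SPAN = 10    # number of samples in the long average (example value)
SHORT_SPAN = 3    # number of samples in the short average (example value)
THRESHOLD = 0.25  # assumed to mean 25% of the long-span average


def average_gap(values, long_span=LONG_SPAN, short_span=SHORT_SPAN):
    """Return (long_avg - short_avg) / long_avg for the newest samples.

    Returns None when there are too few samples or the long average is zero.
    """
    if len(values) < long_span:
        return None
    long_avg = mean(values[-long_span:])
    short_avg = mean(values[-short_span:])
    if long_avg == 0:
        return None
    return (long_avg - short_avg) / long_avg


def should_alert(values, threshold=THRESHOLD):
    """True when the averaged gap exceeds the threshold in either direction."""
    gap = average_gap(values)
    return gap is not None and abs(gap) > threshold


# A series that flaps around 100 stays quiet...
flapping = [100, 120, 80, 110, 90, 100, 120, 80, 110, 90]
# ...while a sustained drop in flow does raise an alert.
dropped = [100] * 7 + [10, 10, 10]
```

With these inputs, `should_alert(flapping)` is false while `should_alert(dropped)` is true, which is the behaviour the averaging is meant to produce.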
Of course, depending on the characteristics of the application, the long and short spans may need to be lengthened or shortened, so the spans and the threshold are each set in a MACRO.
That is how the trigger works. Thresholds can be set separately for the info, average, and high severities, so you can route each one to a different notification channel such as chat or email.
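To make the three severities concrete, here is a minimal sketch that maps a gap ratio to a severity name. Only the 25% figure comes from the text above; the 50% and 75% limits and the `classify` function are hypothetical values chosen for the example.

```python
# (severity, limit) pairs, checked from most to least severe.
# 0.25 follows the article; 0.50 and 0.75 are made-up example values.
SEVERITY_THRESHOLDS = [
    ("high", 0.75),
    ("average", 0.50),
    ("info", 0.25),
]


def classify(gap_ratio):
    """Return the highest severity whose limit the gap exceeds, or None."""
    magnitude = abs(gap_ratio)
    for name, limit in SEVERITY_THRESHOLDS:
        if magnitude > limit:
            return name
    return None
```

Each severity could then be tied to its own action, for example chat for info and email for high.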
Graphs
The averaged difference monitoring described above is hard to explain in words, but it should be very easy to understand from the graphs.
The raw graph is quite jagged. If you monitored this with a simple difference from the previous value, you would get a lot of alerts.
The smoothed graph looks much calmer. Viewed over a somewhat longer span and compared with the application's peak hours, the flow rate does seem to follow this shape.
There are other graphs as well.
The RPM is uploaded to the usual place, so create a repo file by following this procedure (//qiita.com/JumpeiArashi/items/849281083b6c7888f25d#case-of-using-yum), then run:
yum install blackbird-kinesis-stream --enablerepo=blackbird
The pip package will be updated soon, so please wait a little while.
CloudWatch also provides metrics for each Shard, and I would really like to implement that (it shows which Shards the keys are skewed toward). Still, as long as the application logic is not wrong, I think the flow-rate difference alone is enough to notice that something is happening.
Even so, I plan to add per-Shard monitoring in the future.
Put throughput should have improved a lot since the PutRecords API was released the other day, but the plugin does not collect its metrics yet. I would like to add them, and include them in the Zabbix Template, as soon as possible.