Yet another blog about software development

My thoughts about {Golang,Java,Cassandra,Docker,any-buzzword-here}

Application monitoring is your friend, so let me tell you a story.

by Adam Jędro. Categories: programming Tags: monitoring / alerting / opentsdb / grafana / scollector / newrelic

The story: When and why I understood that monitoring matters.

You can’t improve if you don’t measure

I have heard this sentence many times. After some time I understood that it is not just a random statement… but let me start from the beginning.

After six months at my current company I was asked to be the technical owner of two services - not critical, but quite important ones. One of them was in the middle of a development cycle. That meant every production outage, customer problem, deployment or build issue, or simply a question would land on my desk or my Jira board. In short, I was dealing with lots of things at once.

People often came to my desk and asked:

Hey Adam, is XYZ service working on SJC DC?

I am doing capacity planning. Could you provide metrics and thresholds?

Alerts fired. What the hell is going on?

Without proper monitoring and alerting I wasn't able to answer these questions. I was lucky that one of the services already had monitoring implemented through Graphite. Understanding what your system does is crucial, and without proper monitoring it's simply not possible.

Problem

Alerts fired. What the hell is going on?

A few weeks ago I was able to track down an issue in the event processing pipeline just by looking at a dashboard, without even having access to the servers.
When I sat down at my desk on Monday morning and opened my computer, I saw a few emails from Nagios telling me that the number of unprocessed events was above the limit. When I opened the dashboard, I saw the following:
* The error count is steady and very low
* The application is consuming messages (RabbitMQ as the queue in this case)
* The application is processing 4 (!!!) events per second, while the norm is over 1.5k per second
* The queue of unprocessed events is growing rapidly. This means we could eventually kill our RabbitMQ instance simply by not pulling messages off the queue (let me ignore the backpressure topic here)

I immediately realized that I had seen a similar problem a few months earlier. Back then I spent several hours trying to figure out what the hell was going on. Finally, I was able to hunt down the issue with help from the OPS team. But I did not have a full view of the system - it is impossible to see what is going on inside an application without monitoring. Nowadays, by combining my knowledge with the charts rendered by Grafana, I am able to quickly rule out potential points of failure.

A mix of Nagios/Grafana alerts and Graphite/Grafana dashboards is really useful.

Question

Hey Adam, is XYZ service working on SJC DC?

This kind of question often comes from teams that depend on a given service. "Here is the dashboard, let's check it," I often say while pointing at the TV that hangs on the wall.
Nothing more to say.

Capacity Planning

I am doing capacity planning. Could you provide metrics and thresholds?

Same as above. Here are the statistics, and here are the thresholds we can hit without worrying about the system. We can even use Grafana to send notifications when traffic increases.

Bonus

Sometime in the past I opened Graphite and saw that traffic in one of our datacenters had decreased by a few percent the day before. I started wondering what the reason could be. I told my team about my observation, and then a smart teammate of mine started googling. He found that a country where one of our biggest customers resides had had a public holiday that day. The information we can get from this kind of data is awesome, not to mention the usefulness of statistics and trends.

What exactly is 'application monitoring' and why is it so important?

There are lots of tools that help keep our services healthy. New Relic is great when it comes to monitoring core server components such as CPU utilization, disk usage or bandwidth. It can also collect stats about the Go runtime, as well as over a hundred other third-party components. This SaaS tool has a wide range of features, and monitoring is only one of them. I really recommend taking a look.

Infrastructure monitoring is at least as important as application monitoring, but the latter is a slightly different thing. It provides a view of how your application behaves, how clients use your product, when you should expect higher traffic, and so on. As I mentioned earlier, a combination of good metrics can be very useful, and not only for failure detection.
Imagine you have a service that processes events. What does 'processes events' mean? Well, very often it's a combination of different steps, often called a pipeline. Let's take a look at what a simple pipeline can look like:
* read an incoming event from {RabbitMQ,Kafka,File,etc.}
* apply business logic: compute something, read event details from a database, etc.
* send the result to a client over WebSocket or HTTP, or just persist it in MySQL
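The three steps above can be sketched as a tiny instrumented pipeline. This is only an illustrative sketch in Python (the function and metric names are hypothetical, and the queue/database steps are stubbed out); the point is where per-stage counters would hook in, so a dashboard can later show each stage separately:

```python
# Hypothetical sketch: an event pipeline with per-stage counters that
# could later be exported to Graphite/OpenTSDB. Names are made up.
import time
from collections import Counter

metrics = Counter()  # stage name -> count (or accumulated milliseconds)

def handle(raw_event):
    # Stage 1: event received from the queue (stubbed - no real RabbitMQ here).
    metrics["events.received"] += 1

    # Stage 2: business logic (stubbed as a trivial transform), timed.
    start = time.time()
    result = {"id": raw_event["id"], "value": raw_event["value"] * 2}
    metrics["events.processing_ms"] += int((time.time() - start) * 1000)

    # Stage 3: hand the result over (stubbed - would be MySQL/WebSocket).
    metrics["events.processed"] += 1
    return result

# Usage: feed a few fake events through the pipeline.
results = [handle({"id": i, "value": i}) for i in range(3)]
print(metrics["events.received"], metrics["events.processed"])
```

With counters like these, a sudden gap between "received" and "processed" points straight at the slow stage, which is exactly the kind of signal the dashboards below rely on.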

Now imagine that your customer is complaining that the service is processing events very slowly, and thus they are unable to get what they have already paid for. It happens sometimes :)

What would you do? Add more servers, who cares! Well, that's not the best way to start :)

I would check the usage stats. Where do we spend most of our time? Were we receiving incoming events at all? Was it just a peak or a real failure?

Well, there are three ways I usually use to tell why our service is so slow:
* check application logs - grep or Kibana
* check application usage statistics
* check {transaction, spent} time in New Relic

The first way is a bit annoying but just works. A mix of the last two is what you need. From the application you can export the value of each metric to a data store such as OpenTSDB or InfluxDB, and then, with the help of Grafana, you will be able to see beautiful dashboards. New Relic Insights works in a similar way, especially if you use their fancy plugins.

How do I…

implement application monitoring on my own!

In the first paragraph I mentioned that one of the services I take care of had already implemented monitoring, but the second one did not. Recently I had a chance to implement monitoring for the second service with the OpenTSDB and Grafana stack. Interestingly, I was able to do it without touching the application source code, so I want to share my thoughts with you.

Backend

OpenTSDB is designed to store time series data. I have noticed that InfluxDB is getting more popular recently, but I haven't tried it yet. Back to OpenTSDB - data can be stored in Cassandra, HBase or Google Bigtable. Bigtable support can be a huge benefit if you do not want to maintain your own HBase or Cassandra cluster, as that can be a significant cost for your operations teams, both in money and human resources. Google Cloud isn't free, but it can save a lot of pain with scaling, debugging and maintaining HBase on your own machines.

Frontend

I mainly use Grafana at work as well as in my side projects.
As per its landing page, Grafana is described as:

The leading tool for querying and visualizing time series and metrics

It just works; I did not feel the need to check other tools.

Collector

We already know where events will be stored and which tool will render graphs for us. The last tool we need is the collector - it will send our metrics to OpenTSDB.

I use scollector, which allows me to collect metrics without even touching my application's source code - great if you want production-ready graphs as soon as possible. It uses a very simple approach: you implement your own script that writes data in the OpenTSDB format to stdout.

The application I was playing with has log statements. Lots of log statements. It means that in various places in the application you can see something like this:

logger.info("New event received. Queue: {}, Event id: {}", queueName, eventId)

Every message sent to the logger lands in a file that is rolled every hour. One file = one hour of history of what the application was doing.
A log statement in its plain form often looks like this:
INFO:Mon Dec 26 21:48:47 UTC 2016 Thread-1 com.jakon.app.MainListener New event received. Queue: DummyQueue, Event id: 1

Back to scollector: I wrote a script that runs every 60 seconds and has to:
* use since to extract the new log file content
* use grep to extract the interesting values. In the log statement shown above, the interesting values are the time and the queue name, as well as the count of received events.
* add tags like host and dc
* prepare and write the data in the format known by OpenTSDB to standard output

The final result of running script will be something like:

jakon.me.events host=vm01,dc=rbx,app=events-pipeline 1356998400 1
jakon.me.events host=vm01,dc=rbx,app=events-pipeline 1356998402 1
jakon.me.events host=vm01,dc=rbx,app=events-pipeline 1356998407 1

These entries will be sent to OpenTSDB by scollector, and then you will be able to prepare dashboards in Grafana.
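The parsing step of such a collector script could be sketched as follows. This is a hedged illustration in Python, not the script I actually run: the metric name, tags and output layout are copied from the example entries above (check the scollector documentation for the exact format your version expects), and the regular expression assumes log lines shaped like the sample shown earlier:

```python
# Hypothetical sketch of the log-to-OpenTSDB step of an external collector.
# Input: log lines like
#   INFO:Mon Dec 26 21:48:47 UTC 2016 Thread-1 com.jakon.app.MainListener \
#   New event received. Queue: DummyQueue, Event id: 1
# Output: one entry per matching line, mirroring the format shown in this post.
import calendar
import re
import time

LINE_RE = re.compile(
    r"^INFO:(?P<date>\w{3} \w{3} \d{1,2} [\d:]{8} UTC \d{4}) .* "
    r"New event received\. Queue: (?P<queue>\w+), Event id: \d+"
)

def to_opentsdb(line, host="vm01", dc="rbx", app="events-pipeline"):
    """Turn one log line into one OpenTSDB-style entry, or None if it
    is not an event-received line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # The timestamp in the log is UTC, so timegm() gives the right epoch.
    ts = calendar.timegm(time.strptime(m.group("date"), "%a %b %d %H:%M:%S %Z %Y"))
    return f"jakon.me.events host={host},dc={dc},app={app} {ts} 1"

log = ("INFO:Mon Dec 26 21:48:47 UTC 2016 Thread-1 com.jakon.app.MainListener "
       "New event received. Queue: DummyQueue, Event id: 1")
print(to_opentsdb(log))
```

In the real setup, the script would read only the new portion of the log (via since), run every line through something like this, and write the matches to stdout for scollector to pick up.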

This approach has worked well for a production service for a few months now, but another approach I am currently implementing is to prepare the OpenTSDB entry inside my app, so that the script scollector runs only has to search the log files for text inside [opentsdb] tags. The content extracted from the [opentsdb] tags is written directly to stdout by the script, without any grepping.
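The [opentsdb]-tag idea could look roughly like this. Both halves are hypothetical sketches in Python (in reality the first half would live in the Java application's logging code, and the exact tag syntax is my own invention for illustration): the application writes a ready-made entry between markers, and the collector side just cuts it out.

```python
# Hypothetical sketch of the "[opentsdb] tag" approach: the application
# logs pre-formatted entries, the collector only extracts them.
import re
import time

TAG_RE = re.compile(r"\[opentsdb\](?P<entry>.*?)\[/opentsdb\]")

def log_metric(metric, value, tags):
    """Application side: format a ready-to-ship entry inside markers."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"INFO [opentsdb]{metric} {tag_str} {int(time.time())} {value}[/opentsdb]"

def extract(lines):
    """Collector side: no grepping of free-form text - just cut out the
    pre-formatted entries and pass them through."""
    for line in lines:
        m = TAG_RE.search(line)
        if m:
            yield m.group("entry")

log_line = log_metric("jakon.me.events", 1, {"host": "vm01", "dc": "rbx"})
for entry in extract([log_line, "INFO plain log line"]):
    print(entry)
```

The appeal of this variant is that the metric format is owned by the application, so the collector script no longer breaks when the wording of a log statement changes.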

This way of collecting data is not perfect, as the script has to be maintained along with the log statements inside the application, but overall I am very happy with the results we managed to achieve in a very short time.

Summary

Monitoring is crucial for understanding how your application behaves, as well as for usage prediction and failure investigation. I have shown one approach that works for me; however, no solution is perfect. Dashboards are beautiful and useful, but after some time you can get bored watching them, so please do not forget about alerting - Grafana implemented an alerting system recently, and I really recommend taking a look at it. I was wondering whether I should add instructions on how to set up OpenTSDB, Grafana and scollector inside Docker containers, but it is really straightforward and you should not have any problems setting up a simple cluster.

