Yet another blog about software development

My thoughts about {Golang,Java,Cassandra,Docker,any-buzzword-here}

Service discovery and deployment orchestration with Consul in the background.

by Adam Jędro. Categories: distributed-systems / programming Tags: consul / deployment / distributed / systems

Nowadays, most enterprise applications bigger than a simple CRUD have challenges to solve on both the engineering and the business side. From the engineering perspective, apart from business-specific tasks, the most common and challenging are: resilience, high availability, scalability, fault tolerance, and the ability to deploy frequently. Lots of tools have been created to help us achieve these goals, but new software and patterns are still emerging, as in the software world solutions are neither perfect nor universal.

Zookeeper as a precursor

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Zookeeper is a very popular and mature coordination service for distributed applications. Created in 2008, it has been battle tested by many teams across the world. The idea behind this tool is simple: a set of Zookeeper servers keeps track of a shared hierarchical namespace (a tree-like structure), allowing external systems to coordinate actions based on that data. Namespaces together with nodes work much like a standard file system, although the whole dataset must fit into memory. The Zookeeper API is very simple and contains mostly CRUD operations, but together with high availability, consistency and Watches (triggered when a node changes) it provides a way to react to changes in a distributed environment. Kafka, HBase, Solr, Neo4j - all of these projects use Zookeeper as a core component, not to mention companies like Yahoo, Rackspace, Reddit and many others.
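As an illustration, that CRUD-plus-Watches API can be exercised from the `zkCli.sh` shell that ships with ZooKeeper. This is a hedged sketch: the paths are made up, a local server on the default port 2181 is assumed, and the `-w` watch flag is the syntax of recent ZooKeeper releases:

```shell
# connect to a local ZooKeeper server (default port 2181)
zkCli.sh -server localhost:2181

# CRUD on znodes - the namespace looks like a file system
create /services ""                  # create a parent znode
create /services/api "10.0.0.5:80"   # store a small payload in a child
get /services/api                    # read it back
set /services/api "10.0.0.6:80"      # update
delete /services/api                 # delete

# a Watch: -w registers a one-shot trigger that fires
# the next time the znode's data changes
get -w /services
```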

The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper’s name space is identified by a path.


Consul comes into play

Consul was introduced in 2014 as a “solution for service discovery and configuration”. It's much more powerful than Zookeeper, but under the hood the assumptions are similar - a consistent, highly available store for data critical from the infrastructure point of view. A Consul cluster, unlike Zookeeper, contains two node types: server and agent. Server responsibilities include, but are not limited to, maintaining the state of registered nodes and services, storing KV data, maintaining cluster state, and serving RPC calls from other cluster members. The agent is just a lightweight application that sits on the same machine as our application. Agent responsibilities are: responding to user requests, executing health checks, and taking part in LAN gossip, which provides failure detection inside the cluster. It's worth mentioning that the agent doesn't store any data locally; it forwards all queries to the servers.
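As a rough sketch of that topology, a three-server cluster plus a client agent on an application VM could be started like this (hostnames, IP addresses and the data directory are made-up values for illustration only):

```shell
# on each of the three server machines: run the agent in server mode;
# -bootstrap-expect tells Consul how many servers to wait for
# before electing a leader
consul agent -server -bootstrap-expect=3 \
  -data-dir=/var/consul -node=server-01 \
  -bind=10.0.0.1 -retry-join=10.0.0.2 -retry-join=10.0.0.3

# on every application VM: run a lightweight client agent that
# joins the cluster and forwards all queries to the servers
consul agent -data-dir=/var/consul -node=app-01 \
  -bind=10.0.1.1 -retry-join=10.0.0.1
```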

Only one Consul server handles writes at any given time; this server mode is called the leader. As Consul is DC aware, there is exactly one leader in each DC.

Aside from the KV store, a very useful feature is health checking, on both the service and the node level. All nodes and services are registered in the Catalog, which is a strongly consistent store served by the Consul server in leader mode.

Consul supports two levels of checks: service (application) and node (machine). One node can host more than one service. If a check for a given node/service is not passing, it is marked as unhealthy inside the cluster.
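A service-level check can be declared right in the service definition. A minimal sketch, assuming a hypothetical `web` service exposing a `/health` endpoint on port 8080 and a recent Consul version that provides the `consul services register` command:

```shell
# write a service definition with an HTTP health check
cat > web.json <<'EOF'
{
  "service": {
    "name": "web",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}
EOF

# register it with the local Consul agent
consul services register web.json

# list health checks known to the agent; the service stays
# unhealthy ("critical") until the check starts passing
curl -s http://localhost:8500/v1/agent/checks
```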

Where Consul shines

Recently I had a chance to take part in an internal DevOps hackathon where the high-level idea was simple: orchestrate automatic deployment of a stateless web application without interrupting traffic from users. There were no strict requirements set, but it's worth taking into account that the application is currently running in production, hence it would be good to reuse the current infrastructure. Without going very deeply into that infrastructure, it looks fairly typical:
* A load balancer sits in front of the VMs
* The application sits on a VM
* A VM is pretty static: we never shut it down without a reason; it's configured once while provisioning and then remains as it is.

My first idea was obvious: let's put the application into a Docker image, run it on Kubernetes and we are done. Well, in an ideal world it might work. Kubernetes could work, but it is not a silver bullet and it could introduce new problems we currently don't have. It would also require lots of changes to the infrastructure and the build pipeline, as well as introducing a learning curve for the ops team.

The second idea was born in another way. I started thinking about what the actual requirements behind “orchestrate deployment of…” are, because the phrase alone didn't ring a bell for me.

For the purpose of POC, the requirements are:
* Deployment is done without manual work. After the start button is pressed, it does its magic.
* The application is always operational, even while deploying.
* The load balancer routes requests only to healthy servers.
* Rollback should happen automatically, as a deployment might fail.
* The current infrastructure should be reused.
* A new application/VM is able to register itself with the load balancer without any manual work.

Right after I saw these goals in front of me, it was clear that I had a few topics to cover: service discovery, deployment scheduling and failure handling. I sometimes work as a freelancer, and I realized that I had solved a similar problem a few months earlier! One side project I worked on had a REST API and a few types of batch and realtime worker applications. The whole project sits on 10 VMs, so it's definitely not big scale, but it gave me some interesting issues to solve. Back then, I decided to use Consul as it has perfect features for my use case:
* Service discovery - Consul maintains a Catalog of nodes (machines) and services (applications) and executes health checks on them. With the help of Consul Template, my load balancer configuration was always up to date, getting info about new/dead nodes in the cluster nearly in real time.
* KV store - Consul's consistent KV store was the source of truth for configuration values for tens of background jobs. Standalone applications also used Consul for discovering components like MySQL or RabbitMQ.
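To sketch how these two features fit together - with made-up keys and an nginx upstream as the load balancer, not the exact production setup:

```shell
# KV store as the source of truth for job configuration
consul kv put config/jobs/import/batch-size 500
consul kv get config/jobs/import/batch-size

# Consul Template keeps the load balancer config in sync with
# the healthy "web" instances from the Catalog
cat > web-upstream.ctmpl <<'EOF'
upstream web {
{{- range service "web" }}
  server {{ .Address }}:{{ .Port }};
{{- end }}
}
EOF

# re-render the file and reload nginx whenever membership changes;
# the -template format is "source:destination:command"
consul-template \
  -template "web-upstream.ctmpl:/etc/nginx/conf.d/web.conf:nginx -s reload"
```

Because `service "web"` only returns instances whose health checks pass, unhealthy or draining nodes drop out of the upstream automatically.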

Back to my hackathon idea: I implemented a very simple Deployment Scheduler that was responsible for telling the VMs to run a deployment script. This script was very simple, as the real deployment was done by another application running on the VM. It simply contains a few lines of bash: put the service into maintenance mode in Consul, run the real deployment, update the status so the Scheduler knows what is going on, update the current version of the application in the Consul KV store, and put the application back into the ready state in Consul.
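Those steps can be sketched roughly like this (a hedged illustration; the service name, KV keys, version string and the deploy command are hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

NODE=$(hostname)

# 1. maintenance mode: take the service out of the load balancer
consul maint -enable -service=web -reason="deployment in progress"

# 2. run the real deployment (done by another application on the VM)
/opt/deploy/run-deployment.sh

# 3. report status and the new version back to the KV store
#    so the Scheduler knows what is going on
consul kv put "nodes/${NODE}/deployment/status" "done"
consul kv put "nodes/${NODE}/app/version" "1.4.2"

# 4. put the service back into the ready state
consul maint -disable -service=web
```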

Please note that we didn't set the application to the healthy state - that state is set once all health checks pass. Consul takes care of it.

Custom event delivery can be implemented in two different ways in Consul. First, each node might watch for new values in the KV store: for example, the VM named app-01 might watch for new values on the key /kv/nodes/app-01/deployment/new, and when a new value appears for this key, the Consul agent can run a defined script. Second, an event can be sent using the consul event command or the REST API; however, Consul does not guarantee delivery, and the payload size for an event is limited. Finally, the Deployment Scheduler was aware of the current deployment status on every machine, as each machine sets values on the defined key, so in case of a deployment failure the VM can be asked to roll back the deployment. Consul was used here as a kind of event broker, which is not a first-class feature, but for the purposes of a POC it is fine.
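Both variants can be sketched as follows. The key and handler script are hypothetical, and note that the key passed to `consul watch` has no `/kv` prefix - that prefix belongs to the HTTP API path, not to the key itself:

```shell
# variant 1: KV watch - the agent runs the handler whenever
# the key changes, piping the watch result to its stdin
consul watch -type=key -key "nodes/app-01/deployment/new" \
  /usr/local/bin/run-deployment.sh

# the Scheduler triggers a deployment simply by writing that key
consul kv put nodes/app-01/deployment/new "1.4.2"

# variant 2: fire-and-forget user event - no delivery guarantee
# and a small payload size limit
consul event -name=deploy "1.4.2"
```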

High-level overview of the proposed solution:

[diagram: Consul-based deployment orchestration]

The missing parts are failure handling and forwarding deployment logs to the Scheduler, but I didn't have time to think about them. For sure there are a couple of ways to do this, but within a few hours I really had to focus on the core problems. Consul with Consul Template solved, out of the box, many requirements around service discovery, health checking, load balancer config updates and so on. It's also worth mentioning that the Consul KV feature was used to store the application version running on each VM, so we never have to worry about which versions are running where.

Consul is great software and I always enjoy working with it. During the hackathon I had a lot of fun, especially when Consul solved a few challenges out of the box and I could focus on other problems. Naturally, it's not a production-ready solution but merely a very first POC, but I hope that some of the presented approaches could be useful. Consul offers many more features than the battle-tested Apache Zookeeper, so service orchestration and discovery is much easier now, allowing us to focus on other architectural challenges.

Feel free to share this post if you like it.