Is there such a thing as a DevOps Hierarchy of Needs?

In 1943 the psychologist Abraham Maslow proposed the concept of a ‘hierarchy of needs’ to describe human motivation. Most often portrayed as a pyramid, with the more fundamental needs occupying the largest space in the bottom layers, his theory states that only in the fulfilment of the lower-level needs can one hope to progress to the next stratum of the pyramid. The bottom-most need is of course physiological (e.g. food, shelter, beer); once this is achieved we can start to think about safety (i.e. let’s make sure nobody takes our beer), then we start looking for Love and Self-Esteem before ending up cross-legged in an ashram searching for Self-Actualization and Transcendence.

[Image: Maslow's hierarchy of needs]

Is this a DevOps blog or what? Yes, yes it is. The suggestion is not that we should all be striving for DevOps Transcendence or anything, but that perhaps the general gist of Maslow’s theory could be applied to coin a DevOps Hierarchy of Needs (DHoN), and we could use the brief history of our own DevOps team at Spaceape to bolster this idea.

In the beginning there was one man. This man was tasked with building the infrastructure to run our first game, Samurai Siege; not just the game-serving tier but also a Graphite installation, an ELK stack, a large Redis farm, an OpenVPN server, a Jenkins server, et cetera et cetera. At this juncture we could not even be certain that Samurai Siege would be a success. The remit was to get something that worked, to run our game to the standards expected by our players.

Some sound technological choices were made at this point, chief of which was to build our game within AWS.

With very few exceptions, we run everything in AWS. We’re exceedingly happy with AWS, and it suits our purposes. You may choose a different cloud provider; you may forego a cloud provider altogether and run your infrastructure on-premise. Whichever it is, this is the service that provides the first layer of our DHoN. You need some sort of request-driven IaaS to equate to Maslow’s Physiological layer. Ideally this would include not only VMs but also your virtual network and storage. Without this service, whatever it might be (and it might be as simple as, say, a set of scripts to build KVM instances), you can’t hope to build toward the upper reaches of the pyramid.
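To make ‘request-driven’ concrete, the sketch below shows the sort of API we mean, using boto3 (the AWS SDK for Python). It is purely illustrative: the AMI ID, CIDR blocks and instance type are placeholders, and it is not a description of how our own stack is built, but it captures the idea that a network and a VM should each be a single API call away.

```python
# Illustrative only: asking an IaaS API for a network and a VM.
# Assumes boto3 is installed and AWS credentials are configured; all IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# A virtual network to put things in...
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
subnet = ec2.create_subnet(VpcId=vpc["Vpc"]["VpcId"], CidrBlock="10.0.1.0/24")

# ...and a VM inside it, each a single request to the provider.
reservation = ec2.run_instances(
    ImageId="ami-00000000",   # placeholder AMI
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId=subnet["Subnet"]["SubnetId"],
)
print(reservation["Instances"][0]["InstanceId"])
```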

Samurai Siege was launched. It was a runaway success. Even under load the game remained up, functional and performant. The one-man DevOps machine left the company and Phase 2 in our short history commenced. We now had an in-house team of two and one remote contractor, and we set about improving our lot, striving unawares for that next level of needs. It quickly became apparent, however, that we might face some difficulty…

If AWS provided the rock on which we built our proverbial church, we found that the church itself needed some repairs; someone had stolen the lead from its roof.

Another sound technology choice made early on was to use Chef as the configuration management tool. Unfortunately – and unsurprisingly given the mitigating circumstances – the implementation was less than perfect. It appeared that Chef had only been used in the initial building of the infrastructure; attempts to run it on any sort of interval led inevitably to what we had started to call ‘facepalm moments’. We had a number of worrying third-party dependencies, and if Chef was problematic, CloudFormation was outright dangerous. We had accrued what is commonly known as technical debt.

Clearly we had a lot of work to do. We set about wresting back control of our infrastructure. Chef was the first victim: we took a knife to our community cookbooks, we introduced unit tests and cookbook versioning, we separated configuration from code, we even co-opted Consul to help us. Once we had Chef back on-side we had little choice but to rebuild our infrastructure in its entirety, underneath a running game. With the backing of our CTO we undertook a policy of outsourcing components that we considered non-core (this was particularly efficacious with Graphite, more on this one day soon). This enabled us to concentrate our efforts and to deliver a comprehensive game-serving platform, a new iteration of which we were able to stamp out for our second game, Rival Kingdoms, by then well under development.

It would be easy at this point to draw parallels with Maslow’s second tier, Safety. Our systems were resilient and monitored, we could safely scale them up and down or rebuild them. But actually what we had reached at this point was Repeatability. Our entire estate – from the network, load-balancers, security policies and autoscaling groups through to the configuration of Redis and Elasticsearch or the specifics of our deployment process – was represented as code. In the event of a disaster we could repeat our entire infrastructure.

Now, you might think this is a lazy observation. Of course you should build things in a repeatable fashion, especially in this age of transient hosts, build-for-failure, chaos monkeys, and all the rest of it. The fact is, though, that whilst this should be a foremost concern of a DevOps team, quite often it is not. Furthermore, there may be genuine reasons (normally business-related) why this is so. The argument here is that you can’t successfully attain the higher layers of our hypothetical DHoN until you’ve reached this stage. You might even believe that you can, but be assured that as your business grows, the cracks will appear.

At Spaceape we were entering Phase 3 of our DevOps journey. The team had by now gained and lost some staff, and gained a manager. The company itself was blossoming, the release date of Rival Kingdoms had been set, and we were rapidly hiring the best game developers and QA engineers in London.

With our now sturdy IaaS and Repeatability layers in place, we were able to start construction of the next layer of our hierarchy – Tooling. Of course we had built some tools in our journey thus far (they could perhaps be thought of as tiny little ladders resting on the side of our pyramid), but it’s only once things are standardised and repeatable that you can really start building effective tooling for the consumption of others. Any software that tries to encompass a non-standard, cavalier infrastructure will result in a patchwork of ugly if..then..else clauses and eventually a re-write when your estate grows to a point where this is unsustainable. At Spaceape we developed ApeEye (a hilarious play on the acronym API), a RESTful Rails application that just happens to have a nice UI in front of it. Perennially under development, it will eventually provide control over all aspects of our estate, but for now it facilitates the deployment of game code to our multifarious environments (we have a lot of environments – thanks to the benefits of standardisation we are able to very quickly spin up entirely new environments contained on a single virtual host).

And so the launch of Rival Kingdoms came and went. It was an unmitigated success; the infrastructure behaved – and continues to behave – impeccably. We have a third game under development, for which building the infrastructure is now down to a fine art. Perched as we are atop our IaaS, Repeatability and Tooling layers, we can start to think about the next layer.

But what is the next layer? It probably has something to do with having the time to reflect, write blog posts, contribute to OSS and speak at DevOps events, perhaps in some way analogous to Maslow’s Esteem level. But in all honesty we don’t know; we’ve but scarcely laid the foundations for our Tooling level. More likely is that there is no next level, just a continuous re-hashing of the foundations beneath us as new technologies and challenges enter the fray.

The real point here is a simple truth – only once you have a solid, stable, repeatable and predictable base can you start to build on it to become as creative and as, well, awesome as you’d like to be. Try to avoid the temptation to take shortcuts in the beginning and you’ll reap the benefits in the long term. Incorporate the practices and behaviours that you know you should, as soon as you can. Be kind to your future self.

Chef and Consul

Here at Spaceape, our configuration management tool of choice is Chef. We are also big fans of Consul, the distributed key-value-store-cum-service-discovery-tool from the good folks at HashiCorp. It might not be immediately clear why the two technologies should be mentioned in the same paragraph, but here is the story of how they became strange bedfellows.

In a previous blog post, we told the story of our experiences with Chef. That post goes into far greater detail, but suffice it to say that our infrastructure code base was not always as reliable, configurable or even predictable as it is now. We found ourselves in a dark place where Chef was run on an ad-hoc basis, often with fingers well and truly crossed. To wrest back control and gain confidence we needed to be able to run it on a 15-minute interval.

Simple, you say, write a cron-job. Well yes, that is true. But it’s only a very small part of the story. We would find occasions where initiating a seemingly harmless Chef run could obliterate a server, and yet the same run on an ostensibly similar server would reach the end without incident. In short we had little confidence in our code, certainly not enough to start running it on Production. Furthermore – and this applies still today – often we really don’t want to allow Chef to run simultaneously across a given environment. For example, we may push a change that restarts our game-serving process. I don’t need to expand on what would happen if that change ran across Production at 12:15 one day…

Wouldn’t it be nice, we asked ourselves, if we had some sort of global locking mechanism? To prevent us propagating potential catastrophes? Not only would this allow us to push infrastructure changes through our estate, it might just have other benefits…

Enter Consul!

Like all the HashiCorp products we’ve tried, Consul is solid. Written in Go, it employs the Raft consensus algorithm atop the Serf gossip protocol to provide a highly available distributed system that even passes the Jepsen test. It is something of a Swiss Army knife that aims to replace or augment your existing service discovery, configuration management and monitoring tools.

The utility we decided to employ for the locking mechanism was the key-value store. We built our own processes and tooling around this, as we’ll see, but it should be noted that more recent versions of Consul than we had available at the time actually have a semaphore offering.

In Consul we keep a number of per-tier, per-environment key-spaces. As an example:

chef/logstash/es-indexer

Here, logstash is the environment and es-indexer the service. Within this keyspace, the only prerequisite is a value for the maximum number of concurrent Chef runs we wish to allow, which we call max_concurrent:

chef/logstash/es-indexer/max_concurrent

Generally this value is set to 1, but on some larger environments we set it higher.
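By way of illustration, this is roughly what reading and writing that key looks like against Consul’s HTTP KV API. It is a sketch in Python with the requests library rather than our actual tooling; the helper names are invented, but the endpoints and the base64-encoded values are how Consul’s KV store behaves.

```python
# Sketch: reading and writing max_concurrent via Consul's HTTP KV API.
# Assumes a Consul agent on localhost:8500; helper names are purely illustrative.
import base64
import requests

CONSUL = "http://127.0.0.1:8500/v1/kv"

def get_value(key):
    """Return the decoded value of a key, or None if it does not exist."""
    resp = requests.get(f"{CONSUL}/{key}")
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return base64.b64decode(resp.json()[0]["Value"]).decode()

def set_value(key, value):
    """Write a value; Consul answers true or false in the response body."""
    return requests.put(f"{CONSUL}/{key}", data=str(value)).json()

# Allow a single concurrent Chef run on this tier...
set_value("chef/logstash/es-indexer/max_concurrent", 1)
# ...and read it back.
print(get_value("chef/logstash/es-indexer/max_concurrent"))
```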

When a server wishes to run Chef, the first thing it does is retrieve this max_concurrent value. Assuming the value is a positive integer (a value of -1 will simply allow all Chef runs), it then attempts to acquire a ‘slot’. A slot is obtained by checking this key:

chef/logstash/es-indexer/current

This is a running count of the number of hosts in the tier currently running Chef; its absence denotes zero. If the current value is less than the `max_concurrent` value, the server increments the counter and registers itself as ‘running’ by creating a key like this, the value of which is a timestamp:

chef/logstash/es-indexer/running/hostname.of.the.box

The sharper amongst you will have noticed a problem here. What if two hosts try to grab the slot at the same time? To avoid this happening we use Consul’s Check-and-Set feature. The way this works is that, upon the initial read of the current value, a ModifyIndex is retrieved along with the actual Value. If the server decides that current < max_concurrent it attempts to update current by passing a `?cas=ModifyIndex` parameter. If the ModifyIndex does not match that which is stored on Consul, it indicates that something else has updated it in the meantime, and the write fails.
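Pulled together, the acquisition logic looks something like the sketch below. Again this is Python against the raw HTTP API rather than our real implementation, with the hostname handling and function names invented for the example, and all error handling omitted.

```python
# Sketch: acquiring a Chef-run 'slot' using Consul's Check-and-Set.
# Key layout follows the example above; max_concurrent is assumed to exist for the tier.
import base64
import socket
import time
import requests

CONSUL = "http://127.0.0.1:8500/v1/kv"

def read_key(key):
    """Return (decoded_value, modify_index), or (None, 0) if the key is absent."""
    resp = requests.get(f"{CONSUL}/{key}")
    if resp.status_code == 404:
        return None, 0
    item = resp.json()[0]
    return base64.b64decode(item["Value"]).decode(), item["ModifyIndex"]

def try_acquire_slot(keyspace):
    """Attempt to take a slot for this host; returns True on success."""
    raw, _ = read_key(f"{keyspace}/max_concurrent")
    max_concurrent = int(raw)
    if max_concurrent == -1:                       # -1 simply allows all Chef runs
        return True

    raw, index = read_key(f"{keyspace}/current")
    current = int(raw) if raw is not None else 0   # absence denotes zero
    if current >= max_concurrent:
        return False                               # no free slot on this tier

    # Check-and-Set: the write only succeeds if the ModifyIndex is unchanged
    # (an index of 0 means "only create the key if it does not yet exist").
    won = requests.put(
        f"{CONSUL}/{keyspace}/current?cas={index}", data=str(current + 1)
    ).json()
    if not won:
        return False                               # another host beat us to it

    # Register ourselves as running; the value is a timestamp.
    requests.put(
        f"{CONSUL}/{keyspace}/running/{socket.gethostname()}",
        data=str(int(time.time())),
    )
    return True
```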

With the slot obtained, the Chef run is allowed to commence. Upon success, the running key is removed and the counter decremented. If, however, the run fails, the lock (or the ‘slot’) is held: no further hosts on the tier are able to acquire a slot, and Chef runs are thereby suspended.
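The wrapper around the Chef run itself might then look something like this: a simplified sketch rather than our production code, building on the helpers in the previous snippet, with the chef-client invocation and the release logic pared right down.

```python
# Sketch: only release the slot if chef-client succeeds; hold it on failure.
# Builds on read_key() and try_acquire_slot() from the previous sketch.
import socket
import subprocess
import requests

CONSUL = "http://127.0.0.1:8500/v1/kv"

def release_slot(keyspace):
    """Remove our 'running' marker and decrement the counter."""
    requests.delete(f"{CONSUL}/{keyspace}/running/{socket.gethostname()}")
    raw, index = read_key(f"{keyspace}/current")
    current = int(raw) if raw is not None else 0
    # Decrement with Check-and-Set so simultaneous releases cannot clobber each other.
    requests.put(
        f"{CONSUL}/{keyspace}/current?cas={index}", data=str(max(current - 1, 0))
    )

def guarded_chef_run(keyspace):
    """Run chef-client only when a slot is free; keep the slot if the run fails."""
    if not try_acquire_slot(keyspace):
        print("no free slot, skipping this interval")
        return
    result = subprocess.run(["chef-client"])
    if result.returncode == 0:
        release_slot(keyspace)    # success: free the slot for the next host
    else:
        # Deliberately hold the slot: further Chef runs on the tier are suspended
        # until somebody has investigated the failure.
        print("chef-client failed; holding the lock")

guarded_chef_run("chef/logstash/es-indexer")
```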

Our monitoring tools, by checking the timestamp of the running key, are able to warn us when locks have been held for a certain period of time (i.e. longer than the average Chef run takes), and failures are contained to one host (or rather, max_concurrent hosts).
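The monitoring side is equally simple in outline. Here is a sketch of the idea; the thirty-minute threshold and the plain print output are invented for the example, and in reality this check hangs off our monitoring tools.

```python
# Sketch: warn when a 'running' key has been held longer than a threshold.
import base64
import time
import requests

CONSUL = "http://127.0.0.1:8500/v1/kv"
THRESHOLD = 30 * 60   # seconds; illustrative, tune to your average Chef run time

def stale_locks(keyspace):
    """Return (hostname, age_in_seconds) for every lock held longer than THRESHOLD."""
    resp = requests.get(f"{CONSUL}/{keyspace}/running", params={"recurse": ""})
    if resp.status_code == 404:
        return []                       # nobody is running Chef right now
    stale = []
    for item in resp.json():
        started = int(base64.b64decode(item["Value"]).decode())
        age = int(time.time()) - started
        if age > THRESHOLD:
            stale.append((item["Key"].rsplit("/", 1)[-1], age))
    return stale

for host, age in stale_locks("chef/logstash/es-indexer"):
    print(f"WARNING: {host} has held its Chef lock for {age} seconds")
```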

And so… this all works rather well. Many is the time we’ve looked at one another, puffed our cheeks, and said, “Thank goodness for the locking system!” Over time it has allowed us to unpick our infrastructure code and get to the smug position in which we now find ourselves. Almost never do we see locks being held for anything but trivial problems (Chef server timeouts, for instance); nothing that a little judicious sleep-and-retry doesn’t fix. It also gives us great control over when and where Chef runs, as we can easily disable Chef runs for a given tier by setting max_concurrent to 0.

But the purists amongst you will no doubt be screaming something about a CI system, or better unit tests, or something. And you’d be right. The truth is that we were unable to shoehorn a CI system into infrastructure code that was underpinning a live game and in which we did not have complete confidence. Having the backup parachute of the mechanism described above, though, has enabled us to address this. But that doesn’t mean we’ll be discarding it. On the contrary, it will form the backbone of our CI system, facilitating the automatic propagation of infrastructure code throughout our estate. More on that to follow.