De-comming EC2 Instances With Serverless and Go

Nothing is certain but Death and Taxes, goes the old idiom, and EC2 instances are not exempt. You will have to pay for them, and at some point they will die. Truly progressive outfits embrace this fact and pick off unsuspecting instances Chaos-Monkey-style; others wait for that obituary email from Amazon. Either way, we all have to make allowances for those dearly departed instances, and tidy them up once they are gone.

This article describes one way of doing so automatically, using Lambda, the Serverless Framework, and Go.

Why?

Why Lambda? The obvious benefit is that there is no need to run and maintain a host to watch for dying instances. It also integrates nicely with Cloudwatch Events, which are the best way to be notified of them.

Why Serverless? The Serverless Framework is an open-source effort to provide a unified way of building entire serverless architectures. Originally designed specifically for Lambda, it is gaining increasing support for other providers too. Beyond just deploying Lambda functions, it allows you to manage all of the supporting infrastructural components (e.g. IAM, ELBs, S3 buckets) in one place, by supplementing your Lambda code with Cloudformation templates.

Why Go? Aside from it being one of our operational languages (along with Ruby), this is perhaps the hardest one to answer, as AWS don’t actually support it natively (yet). However, some recent developments have made it more attractive: in Go 1.8, support was added for plugins. These are Go programs that are compiled as shared modules, to be consumed by other programs. The guys at eawsy, with their awesome aws-lambda-go-shim, immediately saw the potential this had for running Go code from a Python Lambda function. No more spawning a process to run a binary; instead have Python link the shared module and call it directly. Their Github page suggests that this is the second fastest way of executing a Lambda function, faster even than NodeJS, the serverless poster-boy!
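
To make the plugin idea concrete, here is a tiny, self-contained sketch of the mechanism (nothing Lambda-specific, and nothing to do with how the shim's Python loader actually works): one file is compiled as a shared module, and another program loads it and calls an exported function. The file names are ours, purely for illustration.

// greeter.go -- build with: go build -buildmode=plugin -o greeter.so greeter.go
package main

import "fmt"

// Greet is the exported symbol the consuming program will look up.
func Greet(name string) {
	fmt.Printf("hello, %s\n", name)
}

// main.go -- the consuming program, linking the shared module at runtime
package main

import "plugin"

func main() {
	p, err := plugin.Open("greeter.so")
	if err != nil {
		panic(err)
	}
	sym, err := p.Lookup("Greet") // find the exported symbol
	if err != nil {
		panic(err)
	}
	greet := sym.(func(string)) // assert its type, then call it
	greet("lambda")
}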

It is this shim that we have used to build our EC2 Decommissioner, and we have also borrowed heavily from this idea (we found that we just needed a bit more flexibility, notably in pulling build-time secrets from Vault, which is outside the scope of this article).

How?

Cloudwatch Events are a relatively recent addition to the AWS ecosystem. They allow us to be notified of various events through one or more targets (e.g. Lambda functions, Kinesis streams).

Pertinently for this application, we can be told when an EC2 instance enters the terminated state, and the docs tell us the event JSON received by the target (in our case a Lambda function) will look like this:

{
   "id":"7bf73129-1428-4cd3-a780-95db273d1602",
   "detail-type":"EC2 Instance State-change Notification",
   "source":"aws.ec2",
   "account":"123456789012",
   "time":"2015-11-11T21:29:54Z",
   "region":"us-east-1",
   "resources":[
      "arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"
   ],
   "detail":{
      "instance-id":"i-abcd1111",
      "state":"terminated"
   }
}

The detail is in the…detail, as they say. The rest is just preamble common to all Cloudwatch Events. But here we can see that we are told the instance-id, and the state to which it has transitioned.

So we just need to hook up a Lambda function to a specific type of Cloudwatch Event. This is exactly what the Serverless Framework makes easy for us.

Note, the easiest way to play along is to follow the excellent instructions detailed here; below, we configure the setup in a semi-manual fashion to illustrate what is going on. Either way you’ll need to install the Serverless CLI.

Create a directory to house the project (let’s say serverless-ec2). Then create a serverless.yml file with contents something like this:

service: serverless-ec2
package:
  artifact: handler.zip
provider:
  name: aws
  stage: production
  region: us-east-1
  runtime: python2.7
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "ec2:DescribeTags"
      Resource: "*"
functions:
  terminate:
    handler: handler.HandleTerminate
    events:
      - cloudwatchEvent:
          event:
            source:
              - "aws.ec2"
            detail-type:
              - "EC2 Instance State-change Notification"
            detail:
              state:
               - terminated

This config describes a service (analogous to a project) called serverless-ec2.

The package section specifies that the handler.zip file is the artifact containing the Lambda function code that is uploaded to AWS. Ordinarily the framework takes care of the zipping for us, but we will be building our own artifact (more on that in a moment).

The provider section specifies some AWS information, along with an IAM Role that will be created, that allows our function to describe EC2 tags.

Finally, the functions section specifies a function, terminate, that is triggered by Cloudwatch Events with the source ‘aws.ec2’, with an additional filter applied to match only those events that have a ‘state’ of ‘terminated’ in the detail section of the event (see above). The function is to be handled by handler.HandleTerminate, which is the name of the Go function we will write.

So let’s go ahead and write it. First, run the following to grab the runtime dependency:

go get -u -d github.com/eawsy/aws-lambda-go-core/...

Then we are good to compose our function. Create a handler.go with the following content:

package main

import (
	"log"

	"github.com/eawsy/aws-lambda-go-core/service/lambda/runtime"
)

// CloudwatchEvent represents an AWS Cloudwatch Event
type CloudwatchEvent struct {
	ID     string `json:"id"`
	Region string `json:"region"`
	Detail map[string]string
}

// HandleTerminate decommissions the terminated instance
func HandleTerminate(evt *CloudwatchEvent, ctx *runtime.Context) (interface{}, error) {
	log.Printf("instance %s has entered the '%s' state\n", evt.Detail["instance-id"], evt.Detail["state"])
	return nil, nil
}

Some points to note:

  • Your Handle* functions must reside in the main package, but you are free to organise the rest of your code as you wish. Here we have declared HandleTerminate, which is the function referenced in serverless.yml.
  • The github.com/eawsy/aws-lambda-go-core/service/lambda/runtime package provides access to a runtime.Context object that allows you the same access to the runtime context as the official Lambda runtimes (to access, for example, the AWS request ID or remaining execution time).
  • The return value will be JSON marshalled and sent back to the client, unless the error is non-nil, in which case the function is treated as having failed.

Perhaps the most important piece of information here is how the event data is passed into the function. In our case this is the Cloudwatch EC2 Event JSON as shown above, but it may take the form of any number of JSON events. All we need to know is that the event is automatically JSON unmarshalled into the first argument.

This is why we have defined a CloudwatchEvent struct, which will be populated neatly by the raw JSON being unmarshalled. It should be noted that there are already a number of predefined type definitions available here; we are just showing this for explanatory purposes.

The rest of the function is extremely simple: it just uses the standard library’s log package to record that the instance has been terminated (you should use this over fmt as it plays more nicely with Cloudwatch Logs).

With our code in place we can build the handler.zip that will be uploaded by the Serverless Framework. This is where things get a little complicated. Thankfully, the chaps at eawsy have provided us with a Docker image (with Go 1.8, and some tools used in the build process, installed). They also provide a Makefile (with an alternative one here) that you should definitely use; again, what follows is just to demystify the process:

Run:

docker pull eawsy/aws-lambda-go-shim:latest

docker run --rm -it -v $GOPATH:/go -v $(pwd):/build -w /build eawsy/aws-lambda-go-shim go build -buildmode=plugin -ldflags='-w -s' -o handler.so

This builds our code as a Go plugin (handler.so) from within the provided Docker container. Next, run:

docker run --rm -it -v $GOPATH:/go -v $(pwd):/build -w /build eawsy/aws-lambda-go-shim pack handler handler.so handler.zip

This runs a custom ‘pack’ script that creates a zip archive (handler.zip) that includes our recently compiled handler.so along with the Python shim required for it to work on AWS. The very same handler.zip referenced in the serverless.yml above!

The final step then is to actually deploy the function, which is as simple as:

sls deploy

Once the Serverless tool has finished doing its thing, you should have a function that logs whenever an EC2 instance is terminated!

Clearly, you want to do more than just log the terminated instance. But the actual decommissioning is subjective. For instance, amongst other things, we remove the instance’s Route53 record, delete its Chef node/client, and remove any locks it might be holding in our Consul cluster. The point is that this is now just Go code – you can do with it whatever you wish.
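
As a flavour of what that might look like, here is a minimal, hypothetical sketch of the handler extended with the official aws-sdk-go. It only looks up the dying instance’s tags (ec2:DescribeTags being the one action our example IAM role above permits) and leaves the real clean-up as comments; the structure and naming are ours, not lifted from any published Spaceape code.

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/eawsy/aws-lambda-go-core/service/lambda/runtime"
)

// CloudwatchEvent is the same struct we defined earlier.
type CloudwatchEvent struct {
	ID     string `json:"id"`
	Region string `json:"region"`
	Detail map[string]string
}

// HandleTerminate inspects the terminated instance's tags before tidying up.
func HandleTerminate(evt *CloudwatchEvent, ctx *runtime.Context) (interface{}, error) {
	instanceID := evt.Detail["instance-id"]

	sess, err := session.NewSession(&aws.Config{Region: aws.String(evt.Region)})
	if err != nil {
		return nil, err
	}

	// Recover the instance's tags (its Name, say) so we know which Route53
	// record or Chef node belongs to the dearly departed.
	out, err := ec2.New(sess).DescribeTags(&ec2.DescribeTagsInput{
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("resource-id"),
				Values: []*string{aws.String(instanceID)},
			},
		},
	})
	if err != nil {
		return nil, err
	}
	for _, tag := range out.Tags {
		log.Printf("%s: %s=%s", instanceID, *tag.Key, *tag.Value)
	}

	// From here it really is just Go code: remove the Route53 record, delete
	// the Chef node/client, release any Consul locks, and so on.
	return nil, nil
}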

Note that if you require access to anything inside your VPC as part of the tidying-up process, you need to explicitly state the VPC and subnets/security groups in which Lambda functions will run. But don’t worry, the Serverless tool has you covered.

Custom Inspec Resources

When developing Chef cookbooks, a good test suite is an invaluable ally. It confers the power of confidence, confidence to refactor code or add new functionality and be…confident that you haven’t broken anything.

But when deciding how to test cookbooks, there is a certain amount of choice. Test Kitchen is a given, there really is no competition. But then do you run unit tests, integration tests, or both? Do you use Chefspec, Serverspec or Inspec? At Spaceape we have settled on writing unit tests only where they make sense, and concentrating on integration tests: we want to test the final state of servers running our cookbooks rather than necessarily how they get there. Serverspec has traditionally been our framework of choice but, following the lead of the good folks at Chef, we’ve recently started using Inspec.

Inspec is the natural successor to Serverspec. We already use it to test for security compliance against the CIS rulebook, so it makes sense for us to try and converge onto one framework. As such we’ve been writing our own custom Inspec resources and, with it being a relatively new field, wanted to share our progress.

The particular resource we’ll describe here is used to test our in-house Redis cookbook, sag_redis. It is a rather complex cookbook that actually uses information stored in Consul to build out Redis farms that register themselves with a Sentinel cluster. We’ll forego all that complexity here and just concentrate on how we go about testing the end state.

In the following example, we’ll be using Test Kitchen with the kitchen-vagrant plugin.

Directory Structure:

Within our sag_redis cookbook, we’ll create an inspec profile. This is a set of files that describe what should be tested, and how. The directory structure of an inspec profile is hugely important: if you deviate even slightly, the tests will fail to run. The best way to ensure compliance is to use the Inspec CLI, which is bundled with later versions of the Chef DK.

Create a directory test/integration then run:

inspec init profile default

This will create an Inspec profile called ‘default’ consisting of a bunch of files, some of which can be unsentimentally culled (the example control.rb for instance). As a bare minimum, we need a structure that looks like this:

test
└── integration
    └── default
        ├── controls
        ├── inspec.yml
        └── libraries

The default inspec.yml will need to be changed; that should be self-evident. The controls directory will house our test specs, and the libraries directory is a good place to stick the custom resource we are about to write.

The Resource:

First, let’s take a look at what an ‘ordinary’ Inspec matcher looks like:

describe user('redis') do
  it { should exist }
  its('uid') { should eq 1234 }
  its('gid') { should eq 1234 }
end

Fairly self-explanatory and readable (which incidentally was one of the original goals of the Inspec project). The purpose of writing a custom resource is to bury a certain amount of complexity in a library, and expose it in the DSL as something akin to the above.

The resource we’ll write will be used to confirm that on-disk Redis configuration is as we expect. It will parse the config file and provide methods to check each of the options contained therein. In DSL it should look something like this:

describe redis_config('my_redis_service') do
  its('port') { should eq('6382') }
  its('az') { should eq('us-east-1b') }
end

So, in the default/libraries directory, we’ll create a file called redis_config.rb with the following contents:

class RedisConfig < Inspec.resource(1)
  name 'redis_config'

  desc '
    Check Redis on-disk configuration.
  '

  example "
    describe redis_config('dummy_service_6') do
      its('port') { should eq('6382') }
      its('slave-priority') { should eq('69') }
    end
  "

  def initialize(service)
    @service = service
    @path = "/etc/redis/#{service}"
    @file = inspec.file(@path)

    begin
      @params = Hash[*@file.content.split("\n")
                           .reject { |l| l =~ /^#/ or l =~ /^save/ }
                           .collect { |v| v.chomp.split }
                           .flatten]
    rescue StandardError
      return skip_resource "#{@file}: #{$!}"
    end
  end

  def exists?
    @file.file?
  end

  def method_missing(name)
    @params[name.to_s]
  end

end

There’s a fair bit going on here.

The resource is initialised with a single parameter – the name of the Redis service under test. From this we derive the @path of its on-disk configuration. We then use this @path to initialise another Inspec resource: @file.

Why do this, why not just use a common-or-garden ::File object and be done with it? There is a good reason, and this is important: the test is run on the host machine, not the guest. If we were to use ::File then Inspec would check the machine running Test Kitchen, not the VM being tested. By using the Inspec file resource, we ensure we are checking the file at the given path on the Vagrant VM.

The remainder of the initialize function is dedicated to parsing the on-disk Redis config into a hash (@params) of attribute:value pairs. The ‘save’ lines that configure RDB snapshotting are unique in that they have more than one value after the parameter name, so we ignore them. If we wanted to test these options we’d need to write a separate function.

The exists? function acts on our Inspec file resource, returning a boolean. Through some Inspec DSL sleight-of-hand this allows us to use the matcher it { should exist } (or indeed it { should_not exist } ).

The final function delegates all missing methods to the @params hash, so we are able to reference the config options directly as ‘port’ or ‘slave-priority’, for instance.

The Controls:

In Inspec parlance, the controls are where we describe the tests we wish to run.

In the interests of keeping it simple, we’ll write a single test case in default/controls/redis_configure_spec.rb that looks like this:

describe redis_config('leaderboard_service') do
  it { should exist }
  its('slave-priority') { should eq('50') }
  its('rdbcompression') { should eq('yes') }
  its('dbfilename') { should eq('leaderboard_service.rdb') }
end

The Test:

Now we just need to instruct Test Kitchen to actually run the test.

The .kitchen.yml file in the base of our sag_redis cookbook looks like this:

driver:
  name: vagrant
  require_chef_omnibus: 12.3.0
  provision: true
  vagrantfiles:
    - vagrant.rb

provisioner:
  name: chef_zero

verifier:
  name: inspec

platforms:
  - name: ubuntu-14.04
    driver:
      box: ubuntu64-ami
      customize:
        memory: 1024

suites:
  - name: default
    provisioner:
      client_rb:
        environment: test
    run_list:
      - role[sag_redis_default]

Obviously this is quite subjective, but the important points to note are that we set the verifier to inspec, and that the suite we wish to test is named default (recall that our Inspec profile is called ‘default’).

And that’s it! Now we can just run kitchen test and our Inspec custom resource will check that our Redis services are configured as we expect.

Go Wavefront!

Long ago we took the decision to outsource our metrics platform. We generate a lot of metrics, and we came to realise that our solution at the time, Graphite, was not up to the task. Instead of spending in-house resource building a new platform, we decided to find an external partner, so we could focus on our core competency – running mobile games.

We eventually settled on Wavefront, in private beta at the time. Even in these early stages of their development, we were wildly impressed with the product. The responsiveness of the graphs in-browser, and the stability of the metric ingestion platform particularly impressed us.

This was over 2 years ago. Since then we have grown alongside Wavefront and watched as they came out of stealth mode, and continuously added to their bevy of features to offer the world-beating product they have today.

We contributed heavily to their Ruby client, which has been open-sourced and continues to improve. But now we’re happy to announce another OSS project, go-wavefront.

Go-wavefront is a set of Golang libraries and a bundled CLI for interacting with the Wavefront API. It also includes a simple Writer library for sending metrics. It was borne out of an itch we needed to scratch to integrate the smattering of Go applications we have with our metric provider. We hope that in opening it up to the wider Wavefront and Golang community we can improve what we have, and be better able to keep up with the new features Wavefront throw at us.
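
To give a feel for what “sending metrics” amounts to, a Wavefront proxy accepts plain-text points over TCP in the form "<metric> <value> [<timestamp>] source=<host> [pointTags]". The snippet below is a hand-rolled illustration of that wire format, not go-wavefront’s actual API; the proxy address and metric name are made up.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Wavefront proxies listen for plain-text metrics on port 2878 by default;
	// the hostname here is an assumption for the example.
	conn, err := net.Dial("tcp", "wavefront-proxy.example.com:2878")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// One data point: metric name, value, epoch-second timestamp, source and a
	// point tag.
	fmt.Fprintf(conn, "game.requests.latency 30.5 %d source=app-01 environment=production\n",
		time.Now().Unix())
}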

As a cute-but-probably-useless gimmick, the CLI can plot a Wavefront graph live in the terminal window! Check it out; all feedback and pull requests are welcome.


Is there such a thing as a DevOps Hierarchy of Needs?

In 1943 the psychologist Abraham Maslow proposed the concept of a ‘hierarchy of needs’ to describe human motivation. Most often portrayed as a pyramid, with the more fundamental needs occupying the largest space in the bottom layers, his theory states that only in the fulfilment of the lower-level needs can one hope to progress to the next stratum of the pyramid. The bottom-most need is of course physiological (i.e. food, shelter, beer, etc.); once this is achieved we can start to think about safety (i.e. let’s make sure nobody takes our beer), then we start looking for Love and Self Esteem before ending up cross-legged in an ashram searching for Self-Actualization and Transcendence.


Is this a Devops blog or what? Yes, yes it is. The suggestion is not that we should all be striving for Devops Transcendence or anything, but that perhaps the general gist of Maslow’s theory could be applied to coin a DevOps Hierarchy of Needs, and we could use the brief history of our own Devops team at Spaceape to bolster this idea.

In the beginning there was one man. This man was tasked with building the infrastructure to run our first game, Samurai Siege; not just the game-serving tier but also a Graphite installation, an ELK stack, a large Redis farm, an OpenVPN server, a Jenkins server, et cetera et cetera. At this juncture we could not even be certain that Samurai Siege would be a success. The remit was to get something that worked, to run our game to the standards expected by our players.

Some sound technological choices were made at this point, chief of which was to build our game within AWS.

With very few exceptions, we run everything in AWS. We’re exceedingly happy with AWS, and it suits our purposes. You may choose a different cloud provider; you may forgo a cloud provider altogether and run your infrastructure on-premise. Whichever it is, this is the service that provides the first layer of our DHoN. You need some sort of request-driven IaaS to equate to Maslow’s Physiological layer. Ideally this would include not only VMs but also your virtual network and storage. Without this service, whatever it might be (and it might be as simple as, say, a set of scripts to build KVM instances), you can’t hope to build toward the upper reaches of the pyramid.

Samurai Siege was launched. It was a runaway success. Even under load the game remained up, functional and performant. The one-man Devops machine left the company and Phase 2 in our short history commenced. We now had an in-house team of two and one remote contractor and we set about improving our lot, striving unawares for that next level of needs. It quickly became apparent, however, that we might face some difficulty…

If AWS provided the rock on which we built our proverbial church, we found that the church itself needed some repairs, someone had stolen the lead from its roof.

Another sound technology choice that was made early was to use Chef as the configuration management tool. Unfortunately – and unsurprisingly given the mitigating circumstances – the implementation was less than perfect. It appeared that Chef had only been used in the initial building of the infrastructure, attempts to run it on any sort of interval led inevitably to what we had started to call ‘facepalm moments’. We had a number of worrying 3rd party dependencies and if Chef was problematic, Cloudformation was outright dangerous. We had accrued what is commonly known as technical debt.

Clearly we had a lot of work to do. We set about wresting back control of our infrastructure. Chef was the first victim: we took a knife to our community cookbooks, we introduced unit tests and cookbook versioning, we separated configuration from code, we even co-opted Consul to help us. Once we had Chef back on-side we had little choice but to rebuild our infrastructure in its entirety, underneath a running game. With the backing of our CTO we undertook a policy of outsourcing components that we considered non-core (this was particularly efficacious with Graphite; more on this one day soon). This enabled us to concentrate our efforts and to deliver a comprehensive game-serving platform, of which we were able to stamp out a new iteration for our now well-under-development second game, Rival Kingdoms.

It would be easy at this point to draw parallels with Maslow’s second tier, Safety. Our systems were resilient and monitored, we could safely scale them up and down or rebuild them. But actually what we had reached at this point was Repeatability. Our entire estate – from the network, load-balancers, security policies and autoscaling groups through to the configuration of Redis and Elasticsearch or the specifics of our deployment process – was represented as code. In the event of a disaster we could repeat our entire infrastructure.

Now, you might think this is a lazy observation. Of course you should build things in a repeatable fashion, especially in this age of transient hosts, build for failure, chaos monkeys, and all the rest of it. The fact is though, that whilst this should be a foremost concern of a Devops team, quite often it is not. Furthermore, there may be genuine reasons (normally business related) why this is so. The argument here is that you can’t successfully attain the higher layers of our hypothetical DHoN until you’ve reached this stage. You might even believe that you can but be assured that as your business grows, the cracks will appear.

At Spaceape we were entering Phase 3 of our Devops journey, the team had by now gained some and lost some staff, and gained a manager. The company itself was blossoming, the release date of Rival Kingdoms had been set, and we were rapidly employing the best game developers and QA engineers in London.

With our now sturdy IaaS and Repeatability layers in place, we were able to start construction of the next layer of our hierarchy – Tooling. Of course we had built some tools in our journey thus far (they could perhaps be thought of as tiny little ladders resting on the side of our pyramid) but it’s only once things are standardised and repeatable that you can really start building effective tooling for the consumption of others. Any software that tries to encompass a non-standard, cavalier infrastructure will result in a patchwork of ugly if..then..else clauses and eventually a re-write when your estate grows to a point where this is unsustainable. At Spaceape, we developed ApeEye (a hilarious play on the acronym API) which is a RESTful Rails application that just happens to have a nice UI in front of it. Perennially under development, eventually it will provide control over all aspects of our estate but for now it facilitates the deployment of game code to our multifarious environments (we have a lot of environments – thanks to the benefits of standardisation we are able to very quickly spin up entirely new environments contained on a single virtual host).

And so the launch of Rival Kingdoms came and went. It was an unmitigated success, the infrastructure behaved – and continues to behave – impeccably. We have a third game under development, for which building the infrastructure is now down to a fine art. Perched as we are atop our IaaS, Repeatabilty and Tooling layers, we can start to think about the next layer.

But what is the next layer? It probably has something to do with having the time to reflect, write blog posts, contribute to OSS and speak at Devops events, perhaps in some way analogous to Maslow’s Esteem level. But in all honesty we don’t know, we’ve but scarcely laid the foundations for our Tooling level. More likely is that there is no next level, just a continuous re-hashing of the foundations beneath us as new technologies and challenges enter the fray.

The real point here is a simple truth – only once you have a solid, stable, repeatable and predictable base can you start to build on it to become as creative and as, well, awesome as you’d like to be. Try to avoid the temptation to take shortcuts in the beginning and you’ll reap the benefits in the long term. Incorporate the practices and behaviours that you know you should, as soon as you can. Be kind to your future self.

Chef and Consul

Here at Spaceape, our configuration management tool of choice is Chef. We are also big fans of Consul, the distributed key-value-store-cum-service-discovery-tool from the good folks at Hashicorp. It might not be immediately clear why the two technologies should be mentioned in the same paragraph, but here is the story of how they became strange bedfellows.

In a previous blog post, we told the story of our experiences with Chef. That post goes into far greater detail, but suffice it to say that our infrastructure code base was not always as reliable, configurable or even predictable as it is now. We found ourselves in a dark place where Chef was run on an ad-hoc basis, often with fingers well and truly crossed. To wrest back control and gain confidence we needed to be able to run it on a 15-minute interval.

Simple, you say, write a cron-job. Well yes, that is true. But it’s only a very small part of the story. We would find occasions where initiating a seemingly harmless Chef run could obliterate a server, and yet the same run on an ostensibly similar server would reach the end without incident. In short we had little confidence in our code, certainly not enough to start running it on Production. Furthermore – and this applies still today – often we really don’t want to allow Chef to run simultaneously across a given environment. For example, we may push a change that restarts our game-serving process. I don’t need to expand on what would happen if that change ran across Production at 12:15 one day…

Wouldn’t it be nice, we asked ourselves, if we had some sort of global locking mechanism? To prevent us propagating potential catastrophes? Not only would this allow us to push infrastructure changes through our estate, it might just have other benefits…

Enter Consul!

Like all the Hashicorp products we’ve tried, Consul is solid. Written in Go, it employs the Raft consensus algorithm atop the Serf gossip protocol to provide a highly available distributed system that even passes the Jepsen test. It is something of a Swiss Army knife that aims to replace or augment your existing service discovery, configuration management and monitoring tools.

The utility we decided to employ for the locking mechanism was the key-value store. We built our own processes and tooling around this, as we’ll see, but it should be noted that more recent versions of Consul than we had available at the time actually have a semaphore offering.

Stored in Consul, we have a number of per-tier, per-environment key-spaces. As an example:

chef/logstash/es-indexer

Logstash is the environment, es-indexer the service. Within this keyspace, the only pre-requisite is a value for the maximum number of concurrent Chef runs we wish to allow, which we call max_concurrent:

chef/logstash/es-indexer/max_concurrent

Generally this value is set to 1, but on some larger environments we set it higher.

When a server wishes to run Chef the first thing it does is to retrieve this max_concurrent value. Assuming the value is a positive integer (a value of -1 will simply allow all Chef runs) it then attempts to acquire a ‘slot’. A slot is obtained by checking this key:

chef/logstash/es-indexer/current

This is a running count of the number of hosts in the tier currently running Chef; its absence denotes zero. If the current value is less than the max_concurrent value, the server increments the counter and registers itself as ‘running’ by creating a key like this, the value of which is a timestamp:

chef/logstash/es-indexer/running/hostname.of.the.box

The sharper amongst you will have noticed a problem here. What if two hosts try to grab the slot at the same time? To avoid this happening we use Consul’s Check-and-Set feature. The way this works is that, upon the initial read of the current value, a ModifyIndex is retrieved along with the actual Value. If the server decides that current < max_concurrent it attempts to update current by passing a `?cas=ModifyIndex` parameter. If the ModifyIndex does not match that which is stored on Consul, it indicates that something else has updated it in the meantime, and the write fails.

With the slot obtained, the Chef run is allowed to commence. Upon success, the running key is removed and the counter decremented. If, however, the run fails, the lock (or ‘slot’) is held: no further hosts on the tier are able to acquire a slot, and Chef runs are thereby suspended.
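
For illustration, the acquisition half of that mechanism might look roughly like this in Go, using the official Consul client (github.com/hashicorp/consul/api). The key names follow the chef/<environment>/<service> scheme above; the function name, host name and error handling are invented for the example and do not describe our actual tooling.

package main

import (
	"fmt"
	"strconv"
	"time"

	"github.com/hashicorp/consul/api"
)

// acquireSlot tries to grab a Chef-run slot under the given key prefix. It
// returns true if the slot was obtained, false if the tier is full or the
// check-and-set lost a race.
func acquireSlot(kv *api.KV, prefix, hostname string) (bool, error) {
	// Read max_concurrent for the tier; -1 means "allow everything".
	maxPair, _, err := kv.Get(prefix+"/max_concurrent", nil)
	if err != nil || maxPair == nil {
		return false, fmt.Errorf("no max_concurrent for %s: %v", prefix, err)
	}
	max, _ := strconv.Atoi(string(maxPair.Value))

	// Read the current counter; absence denotes zero. Keep the ModifyIndex
	// so the write below is a check-and-set.
	current := 0
	var index uint64
	curPair, _, err := kv.Get(prefix+"/current", nil)
	if err != nil {
		return false, err
	}
	if curPair != nil {
		current, _ = strconv.Atoi(string(curPair.Value))
		index = curPair.ModifyIndex
	}
	if max >= 0 && current >= max {
		return false, nil // tier is full, or runs are disabled (max_concurrent = 0)
	}

	// Increment the counter with CAS; if another host beat us to it, the
	// write fails and we simply don't get the slot this time around.
	ok, _, err := kv.CAS(&api.KVPair{
		Key:         prefix + "/current",
		Value:       []byte(strconv.Itoa(current + 1)),
		ModifyIndex: index,
	}, nil)
	if err != nil || !ok {
		return false, err
	}

	// Register ourselves as running, with a timestamp for the monitoring
	// tools to alert on if the lock is held too long.
	_, err = kv.Put(&api.KVPair{
		Key:   prefix + "/running/" + hostname,
		Value: []byte(time.Now().Format(time.RFC3339)),
	}, nil)
	return true, err
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	got, err := acquireSlot(client.KV(), "chef/logstash/es-indexer", "hostname.of.the.box")
	fmt.Println(got, err)
}

Releasing the slot on a successful run is just the inverse: delete the running key and CAS the counter back down.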

Our monitoring tools, by checking the timestamp of the running key, are able to warn us when locks have been held for a certain period of time (i.e. longer than the average Chef run takes), and failures are contained to a single host (or rather, to max_concurrent hosts).

And so… this all works rather well. Many’s the time we’ve looked at one another, puffed our cheeks, and said, “Thank goodness for the locking system!” Over time it has allowed us to unpick our infrastructure code and get to the smug position in which we now find ourselves. Almost never do we see locks being held for anything but trivial problems (Chef server timeouts, for instance), nothing that a little judicious sleep-and-retry doesn’t fix. It also gives us great control over when and what runs Chef, as we can easily disable Chef runs for a given tier by setting max_concurrent to 0.

But the purists amongst you will no doubt be screaming something about a CI system, or better unit tests, or something. And you’d be right. The truth is that we were unable to shoehorn a CI system into infrastructure code in which we did not have complete confidence, and which was underpinning a live game. Having the backup-parachute of the mechanism described above, though, has enabled us to address this. But that doesn’t mean we’ll be discarding it. On the contrary, it will form the backbone of our CI system, facilitating the automatic propagation of infrastructure code throughout our estate. More on that to follow.