Vault Configuration as Code

Here at Space Ape we use Vault extensively. All of our instances authenticate with Vault using the EC2 auth backend which allows us to restrict the scope of secrets any instance has access to.

Behind Vault, we use Consul as a backend to persist our secrets with a good level of durability, and we make use of Consul’s snapshot feature to create backups. This means we can restore both Consul and Vault from a backup should the worst happen.

Where we’ve struggled with Vault is in managing the configuration: which policies, roles and auth backends do we have? Which of our AWS accounts are set up for the EC2 auth, and how do we update or replicate any of these configurations? If we had to set up a new instance of Vault, or recover an existing one, how long would it take us to get everything set up? Probably a lot longer than it should.

This isn’t something we accept elsewhere in our estate: we use CloudFormation to manage precisely how our AWS infrastructure looks; we use Chef to manage exactly how our instances are set up and applications are configured. All of this configuration is stored in Git. In short, we treat our configuration as code.

For those looking to manage configuration in Vault, help is at hand. In November 2016 Hashicorp’s Seth Vargo penned a blog post that caught our interest – Codifying vault policies and configuration – in which he describes how to use the Vault API to apply configuration from files. There are a few things we can learn from Seth’s post:

  • The API calls are idempotent
  • The script ignores the response as you’ll often get non-200 responses (for instance if a mount already exists)
  • He maps the directory structure to the API, which makes it easy to rewrite the code in any language without having to change your directory structure.
  • API calls need to be applied in the correct order (e.g. an auth backend must exist before you can apply configuration to it).
  • You can integrate this into your CI lifecycle.

A couple of things that are missing:

  • Code testing
  • Verifying that the result of each API call was successful.

Taking Seth’s blog post as our starting point, we set out to implement configuration-managed Vault clusters using the API.

We use a lot of Ruby here, so it made sense to create a gem to apply our configuration for us, taking the opportunity to add unit tests along the way. We can then use Jenkins to test applying our actual configuration.

Requirements

  • Code should be tested
  • We should verify that our config has been applied correctly
  • We want a CI pipeline for our configuration.

We quickly realised that a lot of the process is repeated for each API endpoint:

  1. Locate files containing the configuration
  2. Parse the files containing the configuration
  3. Apply the configuration
  4. Verify the configuration

We have a Setup class that handles creating an instance of the Vault Client and locating the relevant files for each configuration type.

We created a Base class that our implementation classes (policies, auth backends etc.) can inherit, which will parse, apply and verify configuration.

Setup class

To create a Vault client, it’s as simple as using the Vault gem and providing the usual configuration details such as the address and a token.

We also have methods to locate the relevant files for any configuration item, such as policies. We simply need to supply the path to the directory in which the configuration files reside.
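As an illustration, a minimal sketch of such a Setup class might look like the following. The method names here are invented for this post; only the constructor arguments mirror those used in the tests shown later.

require "vault"

module Spaceape
  module VaultSetup
    class Setup
      attr_reader :client, :config_dir

      def initialize(vault_address:, vault_token:, config_dir:, ssl_verify: true)
        @config_dir = config_dir
        # A standard client from the vault gem, pointed at our Vault server.
        @client = Vault::Client.new(
          address: vault_address,
          token: vault_token,
          ssl_verify: ssl_verify
        )
      end

      # Locate configuration files for a given item, e.g. files_for("sys/policy")
      # returns every HCL/JSON/YAML file under that directory.
      def files_for(path)
        Dir.glob(File.join(config_dir, path, "*.{hcl,json,yml,yaml}"))
      end
    end
  end
end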

Base class

In the Base class we start by parsing the files that the Setup class located for us. We accept HCL, YAML or JSON files and parse them into a hash.

We then call apply and verify methods which are implemented in classes specific to the configuration item such as Policies or Auths.
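A rough sketch of that Base class (illustrative only; the HCL parsing below assumes the hcl-checker gem, which may not be what the real code uses):

require "json"
require "yaml"
require "hcl/checker" # assumption: one possible HCL parser

module Spaceape
  module VaultSetup
    class Base
      def initialize(setup)
        @setup = setup
      end

      # Convenience accessor for the shared Vault client created by Setup.
      def client
        @setup.client
      end

      # Parse a single config file into a hash, based on its extension.
      def parse(file)
        case File.extname(file)
        when ".hcl"  then HCL::Checker.parse(File.read(file))
        when ".json" then JSON.parse(File.read(file))
        else              YAML.load_file(file)
        end
      end

      # Parse, apply and verify each file; apply and verify are implemented
      # in the subclasses (Policies, Auths and so on).
      def apply_items(files)
        files.each do |file|
          name = File.basename(file, ".*")
          hash = parse(file)
          apply(name, hash)
          verify(name, hash)
        end
      end
    end
  end
end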

Policies

Applying policies is a good starting point as they represent a lot of our configuration and are referenced by other sections of configuration.

We save time by inheriting from the Base class we discussed above, and we have an instance of the Setup class so that we can locate the files we need and have a Vault client to use.

We then implement the apply and verify methods. For a Policy the apply method is very simple: it uses the name of the file as the name of the policy, and the contents of the file, which we translate into JSON, as the body of the policy.

client.sys.put_policy(name, hash.to_json)

Next we verify that our Policy was correctly applied. The first step is to request the policy from Vault, which we can ask for by name (also the filename).

client.sys.policy(name)

Then we can:

  1. Check that we received a Policy and not an error or an empty blob of JSON.
  2. Check that the JSON we receive matches the JSON we sent. We use the JsonCompare gem to verify each key-value pair that is returned.
└── sys
    └── policy
        └── admins.hcl

This is the directory structure in which we store our policies. /sys/policy/admins would be the API path to post a policy if you wanted to use the API directly.

path "*" {
 policy = "sudo"
}

A really bad example of a Vault policy that admins.hcl might contain. We parse this as HCL and post it to /sys/policy/admins.
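Putting those pieces together, a simplified Policy class might look something like this. It is a sketch rather than our exact implementation, and assumes the JsonCompare gem’s get_diff helper for the comparison.

require "json"
require "json-compare"

module Spaceape
  module VaultSetup
    ItemMismatchError = Class.new(StandardError)

    class Policy < Base
      # Use the filename as the policy name and the parsed file, as JSON,
      # as the policy body.
      def apply(name, hash)
        client.sys.put_policy(name, hash.to_json)
      end

      # Read the policy back by name and compare it with what we sent.
      def verify(name, hash)
        policy = client.sys.policy(name)
        raise ItemMismatchError, "policy #{name} not found" if policy.nil?

        applied = JSON.parse(policy.rules)
        diff = JsonCompare.get_diff(JSON.parse(hash.to_json), applied)
        raise ItemMismatchError, "policy #{name} differs: #{diff}" unless diff.empty?
      end
    end
  end
end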

Testing

One of our requirements was to write tests. Below are our tests for policies.

require "spec_helper"

describe Spaceape::VaultSetup::Policy do
  subject do
    Spaceape::VaultSetup::Policy.new(
      Spaceape::VaultSetup::Setup.new(
        vault_address: "http://vault:8200",
        ssl_verify: false,
        config_dir: "spec/fixtures/main",
        vault_token: vault_token
      ),
      false
    )
  end

  let(:test_policy) do
    {
      "path": {
        "auth/app-id/map/user-id/*": {
          "policy": "write"
        }
      }
    }
  end

  it "applies and verifies a policy" do
    subject.apply("test-policy", test_policy)
    expect { subject.verify("test-policy", test_policy) }
      .to_not raise_error
  end

  it "identifies invalid policy" do
    subject.apply("test-policy", test_policy)
    wrong_role = test_policy.dup
    wrong_role[:path] = "/auth/app-id/map/uuuuuu/*"
    expect { subject.verify("test-policy", wrong_role) }
      .to raise_error(Spaceape::VaultSetup::ItemMismatchError)
  end

  it "applies all policies in config_dir" do
    subject.apply_items(subject.policy_files)
    expect(subject.client.sys.policies)
      .to include("test-policy2", "test-policy")
  end
end

From the test above you can see that we test against a Vault server at vault:8200. We run these tests in Docker and make use of Docker Compose, so we can create a Vault server in dev mode and then a Ruby container, with our code mounted in a volume, to run our tests.

Auths and Mounts

Policies were easy – we parse the file, make a single API call to apply the policy and another to verify it. Auths and Mounts are a bit more complicated. There are essentially three parts to each:

  1. Enable the Auth/Mount
  2. Tune the Auth/Mount
  3. Configure the Auth/Mount

Enabling is pretty simple: you pass the name (this is what you want to call it), the type (such as secret, github or pki) and an optional description.

We store this information in sys/auth/<name>.ext. The API endpoint is sys/auth/<name>.

└── sys
    └── auth
        └── github-spaceape.json

The contents of this file may look like:

{
  "type": "github",
  "description": "spaceape github",
  "config": {
    "max_lease_ttl": "87600h",
    "default_lease_ttl": "3h"
  }
}

Notice it contains the type and description which we covered above. It also includes a config key; this is actually the tuning we can apply to the Auth/Mount. The tuning is applied to the API endpoint sys/auth/<name>/tune, so it seems sensible to store it in this file.
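To sketch how such a file might be applied with the vault gem (the tune step is shown as a raw logical write to the sys path, which may differ from how our real code or the gem’s own helpers do it):

require "json"

name   = "github-spaceape"
config = JSON.parse(File.read("sys/auth/#{name}.json"))

# Enable the auth backend: name, type and optional description.
client.sys.enable_auth(name, config["type"], config["description"])

# Apply the tuning (the "config" key) to sys/auth/<name>/tune.
client.logical.write("sys/auth/#{name}/tune", config["config"]) if config["config"]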

So far so good, but now we come to configuring the Auth or Mount. There’s no standard pattern here, and they sometimes require secrets. We decided to exclude any secrets from the config; these can be applied as manual steps later. We can, however, apply some configuration.

For example, we can set the organisation for the Github auth, but we wouldn’t want to set the AWS credentials for the EC2 auth backend.

The API endpoint for applying configuration to Auths is auth/<name>/config, and for Mounts it is <name>/config/<config_item>. We decided to group our mounts under a mounts directory, veering slightly from the file structure matching the API path.

Our directory structure now looks a little like this:

└── sys
|   └── auth
|   |   └── github-spaceape.json
|   └── mounts
|       └── spaceape-pki
└── auth
|   └── github-spaceape
|       └── config.json
└── mounts
    └── spaceape-pki
        └── config
        |   └── urls.json
        └── roles
            └── example-role.json

This is where mapping the file path to the API comes into its own: we can handle any of the Auths or Mounts without having to explicitly write code for the exact type; we just have to get the structure correct.
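To illustrate the idea, deriving the API path from the file path can be as simple as stripping the extension and the mounts prefix, then writing the parsed file to that path. This is a simplified sketch reusing the parse and client helpers sketched earlier, not our exact code.

# "auth/github-spaceape/config.json"            -> "auth/github-spaceape/config"
# "mounts/spaceape-pki/roles/example-role.json" -> "spaceape-pki/roles/example-role"
def api_path(file)
  file.sub(/\.(hcl|json|ya?ml)\z/, "").sub(%r{\Amounts/}, "")
end

def apply_config(file)
  client.logical.write(api_path(file), parse(file))
end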

Gotchas

There are a few things to look out for.

  1. When verifying that our changes were applied, Vault sometimes gives back more than you expect. We only verify the fields we passed in.
  2. Time-based fields (like the various ttl fields) are not always returned in the same format: you may get the time in seconds, or days and hours, etc. We found the chronic_duration gem useful for parsing the times for easy comparison (see the sketch after this list).
  3. Some configuration on an Auth or Mount may have to be applied in a specific order; this is where we would have to write custom code to handle that particular type of Auth or Mount. Perhaps a configuration file could define the order in which to apply certain configuration.
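For those time-based fields, a small helper along these lines does the job (a sketch assuming the chronic_duration gem):

require "chronic_duration"

# Vault may return a TTL as an integer number of seconds, or as a string
# such as "3h" or "365d"; normalise both sides to seconds before comparing.
def to_seconds(value)
  value.is_a?(Numeric) ? value : ChronicDuration.parse(value)
end

def same_duration?(expected, actual)
  to_seconds(expected) == to_seconds(actual)
end

same_duration?("87600h", 315_360_000) # => true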

Continuous Integration

When we check in code, a Jenkins job is triggered which runs our tests. As mentioned earlier, we run our tests inside Docker containers, which means we don’t have to worry about clashing gem versions from the other Ruby-based builds we have on Jenkins.

More interesting to us is that we can now test our actual Vault configuration. So when we add a new policy we know it applies correctly. Again we use Jenkins to do this. Each time we commit a change to our Vault configuration git repository we trigger a build which attempts to apply the configuration to an instance of Vault running in Dev mode. If any of the configuration fails we can be prompted through Slack to see what caused it.

It’s still up to us to apply the changes to the production instance of Vault after the Jenkins tests have run successfully. This is mainly because we don’t want to give privileged Vault tokens out to Jenkins.

Final Words

The process we’ve described above for managing Vault configuration is just one way you could go about solving this problem. From our experience, it works – we can test our configuration and apply it in a repeatable and programmatic way.

It is, however, still a work in progress and there will doubtless be a few problems to overcome as we continue our development. We hope to open source the code in the future, but right now we feel there are still some improvements to make: for instance, at the moment we test against Vault 0.6.5 (the latest release is 0.7.3) and we’ve only tested against a handful of Mounts and Authentication backends.

 

De-comming EC2 Instances With Serverless and Go

Nothing is certain but Death and Taxes, goes the old adage, and EC2 instances are not exempt. You will have to pay for them, and at some point they will die. Truly progressive outfits embrace this fact and pick off unsuspecting instances Chaos-Monkey-style; others wait for that obituary email from Amazon. Either way, we all have to make allowances for those dearly departed instances, and tidy them up once they are gone.

This article describes one way of doing so automatically; using Lambda, the Serverless Framework, and Go.

Why?

Why Lambda? The obvious benefit is that there is no need to run and maintain a host to watch for dying instances. Also it integrates nicely with Cloudwatch Events, which is the best way to get notified of them.

Why Serverless? The Serverless Framework is an open-source effort to provide a unified way of building entire serverless architectures. Originally designed specifically for Lambda it is gaining increasing support for other providers too. Beyond just deploying Lambda functions, it allows you to manage all of the supporting infrastructural components (e.g. IAM, ELBs, S3 buckets) in one place, by supplementing your Lambda code with Cloudformation templates.

Why Go? Aside from it being one of our operational languages (along with Ruby), this is perhaps the hardest one to answer, as AWS don’t actually support it natively (yet). However, some recent developments have made it more attractive: in Go 1.8, support was added for plugins. These are Go programs that are compiled as shared modules, to be consumed by other programs. The guys at eawsy, with their awesome aws-lambda-go-shim, immediately saw the potential this had for running Go code from a Python Lambda function. No more spawning a process to run a binary; instead, have Python link the shared module and call it directly. Their Github page suggests that this is the second fastest way of executing a Lambda function, faster even than NodeJS, the serverless poster-boy!

It is this shim that we have used to build our EC2 Decommissioner, and we have also borrowed heavily from this idea (we found that we just needed a bit more flexibility, notably in pulling build-time secrets from Vault, outside the scope of this article).

How?

Cloudwatch Events are a relatively recent addition to the AWS ecosystem. They allow us to be notified of various events through one or more targets (e.g. Lambda functions, Kinesis streams).

Pertinently for this application, we can be told when an EC2 instance enters the terminated state, and the docs tell us the event JSON received by the target (in our case a Lambda function) will look like this:

{
   "id":"7bf73129-1428-4cd3-a780-95db273d1602",
   "detail-type":"EC2 Instance State-change Notification",
   "source":"aws.ec2",
   "account":"123456789012",
   "time":"2015-11-11T21:29:54Z",
   "region":"us-east-1",
   "resources":[
      "arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"
   ],
   "detail":{
      "instance-id":"i-abcd1111",
      "state":"terminated"
   }
}

The detail is in the…detail, as they say. The rest is just preamble common to all Cloudwatch Events. But here we can see that we are told the instance-id, and the state to which it has transitioned.

So we just need to hook up a Lambda function to a specific type of Cloudwatch Event. This is exactly what the Serverless Framework makes easy for us.

Note: the easiest way to play along is to follow the excellent instructions detailed here; below we configure the setup in a semi-manual fashion, to illustrate what is going on. Either way you’ll need to install the Serverless CLI.

Create a directory to house the project (let’s say serverless-ec2). Then create a serverless.yml file with contents something like this:

service: serverless-ec2
package:
  artifact: handler.zip
provider:
  name: aws
  stage: production
  region: us-east-1
  runtime: python2.7
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "ec2:DescribeTags"
      Resource: "*"
functions:
  terminate:
    handler: handler.HandleTerminate
    events:
      - cloudwatchEvent:
          event:
            source:
              - "aws.ec2"
            detail-type:
              - "EC2 Instance State-change Notification"
            detail:
              state:
               - terminated

This config describes a service (analogous to a project) called serverless-ec2.

The package section specifies that the handler.zip file is the artifact containing the Lambda function code that is uploaded to AWS. Ordinarily the framework takes care of the zipping for us, but we will be building our own artifact (more on that in a moment).

The provider section specifies some AWS information, along with an IAM Role that will be created, that allows our function to describe EC2 tags.

Finally, the functions section specifies a function, terminate, that is triggered by a Cloudwatch Event of type ‘aws.ec2’, with an additional filter applied to match only those events that have a ‘state’ of ‘terminated’ in the detail section of the event (see above). The function is to be handled by handler.HandleTerminate, which will be the name of the Go function we write.

So let’s go ahead and write it. First, run the following to grab the runtime dependency:

go get -u -d github.com/eawsy/aws-lambda-go-core/...

Then we are good to compose our function. Create a handler.go with the following content:

package main

import (
	"log"

	"github.com/eawsy/aws-lambda-go-core/service/lambda/runtime"
)

// CloudwatchEvent represents an AWS Cloudwatch Event
type CloudwatchEvent struct {
	ID     string `json:"id"`
	Region string `json:"region"`
	Detail map[string]string
}

// HandleTerminate decomissions the terminated instance
func HandleTerminate(evt *CloudwatchEvent, ctx *runtime.Context) (interface{}, error) {
	log.Printf("instance %s has entered the '%s' state\n", evt.Detail["instance-id"], evt.Detail["state"])
	return nil, nil
}

Some points to note:

  • Your Handle* functions must reside in the main package, but you are free to organise the rest of your code as you wish. Here we have declared HandleTerminate, which is the function referenced in serverless.yml.
  • The github.com/eawsy/aws-lambda-go-core/service/lambda/runtime package provides access to a runtime.Context object that allows you the same access to the runtime context as the official Lambda runtimes (to access, for example, the AWS request ID or remaining execution time).
  • The return value will be JSON marshalled and sent back to the client, unless the error is non-nil, in which case the function is treated as having failed.

Perhaps the most important piece of information here is how the event data is passed into the function. In our case this is the Cloudwatch EC2 Event JSON as shown above, but it may take the form of any number of JSON events. All we need to know is that the event is automatically JSON unmarshalled into the first argument.

This is why we have defined a CloudwatchEvent struct, which will be populated neatly by the raw JSON being unmarshalled. It should be noted that there are already a number of predefined type definitions available here; we are just showing this for explanatory purposes.

The rest of the function is extremely simple: it just uses the standard library’s log package to log that the instance has been terminated (you should use this over fmt as it plays more nicely with Cloudwatch Logs).

With our code in place we can build the handler.zip that will be uploaded by the Serverless Framework. This is where things get a little complicated. Thankfully, the chaps at eawsy have provided us with a Docker image (with Go 1.8, and some tools used in the build process, installed). They also provide a Makefile (with an alternative one here) that you should definitely use; again, what follows is just to demystify the process:

Run:

docker pull eawsy/aws-lambda-go-shim:latest

docker run --rm -it -v $GOPATH:/go -v $(pwd):/build -w /build eawsy/aws-lambda-go-shim go build -buildmode=plugin -ldflags='-w -s' -o handler.so

This builds our code as a Go plugin (handler.so) from within the provided Docker container. Next, run:

docker run --rm -it -v $GOPATH:/go -v $(pwd):/build -w /build eawsy/aws-lambda-go-shim pack handler handler.so handler.zip

This runs a custom ‘pack’ script that creates a zip archive (handler.zip) that includes our recently compiled handler.so along with the Python shim required for it to work on AWS. The very same handler.zip referenced in the serverless.yml above!

The final step then is to actually deploy the function, which is as simple as:

sls deploy

Once the Serverless tool has finished doing its thing, you should have a function that logs whenever an EC2 instance is terminated!

Clearly, you want to do more than just log the terminated instance. But the actual decommissioning is subjective. For instance, amongst other things, we remove the instance’s Route53 record, delete its Chef node/client, and remove any locks it might be holding in our Consul cluster. The point is that this is now just Go code – you can do with it whatever you wish.

Note that if you require access to anything inside your VPC as part of the tidying-up process, you need to explicitly state the VPC and subnets/security groups in which Lambda functions will run. But don’t worry, the Serverless tool has you covered.

Go Wavefront!

Long ago we took the decision to outsource our metrics platform. We generate a lot of metrics, and we came to realise that our solution at the time, Graphite, was not up to the task. Instead of spending in-house resource building a new platform, we decided to find an external partner, so we could focus on our core competency – running mobile games.

We eventually settled on Wavefront, in private beta at the time. Even in these early stages of their development, we were wildly impressed with the product. The responsiveness of the graphs in-browser, and the stability of the metric ingestion platform particularly impressed us.

This was over 2 years ago. Since then we have grown alongside Wavefront and watched as they came out of stealth mode, and continuously added to their bevy of features to offer the world-beating product they have today.

We contributed largely to their Ruby client, which has been open-sourced and continues to improve. But now we’re happy to announce another OSS project, go-wavefront.

Go-wavefront is a set of Golang libraries and a bundled CLI for interacting with the Wavefront API. It also includes a simple Writer library for sending metrics. It was borne out of an itch we needed to scratch to integrate the smattering of Go applications we have with our metric provider. We hope that in opening it up to the wider Wavefront and Golang community we can improve what we have, and be better able to keep up with the new features Wavefront throw at us.

As a cute-but-probably-useless gimmick, the CLI can plot a Wavefront graph live in the terminal window! Check it out, all feedback and pull requests are welcome.

live-graph

Introducing ComposeECS

The DevOps team here at Space Ape have just open-sourced a small Ruby gem that provides a mechanism to convert Docker Compose specifications to AWS EC2 Container Service task definitions – we’ve called it ComposeECS.

We run the majority of our infrastructure on AWS, using source-controlled CloudFormation templates to manage each of our stacks. Over time we’ve built up a toolchain to help us manage changes to our CloudFormation stacks known internally as ApeStack which incorporates CFNDSL and our internal conventions and processes.

Toward the end of 2015 we started to build out a new platform that would support deployments of containerised applications where it made sense. As heavy users of AWS, EC2 Container Service (ECS) was the obvious choice for running containers in the cloud. The potential advantages of deep integration with AWS services like ELB and IAM have significant implications when it comes to integrating the new platform with our existing stack.

As a team, we are huge fans of specifying configuration in YAML. It’s then perhaps no surprise that we much prefer the syntax of Docker Compose definitions over the JSON-based ECS task definitions. We wanted to be able to specify our task definitions in YAML alongside our CloudFormation templates. Furthermore, we wanted to construct ECS services and their supporting infrastructure with a single command. To this end, we wrote ComposeECS.

ComposeECS reads any Docker Compose file and translates supported attributes (including volume definitions) into an equivalent ECS task definition JSON, sanity-checking your attributes and ensuring compatibility with the ECS task definition specification as it does so. The advantages ComposeECS provides include:

  • Container definitions are more readable and therefore easier to maintain.
  • Docker Compose definitions written to run in our local Docker environment now run on ECS with little modification.
  • Unlike services translated with the ECS-CLI, which supports Docker Compose deployment, our services are CloudFormation-managed whilst still taking advantage of the Docker Compose syntax.

We’re very pleased with ComposeECS. However, it clears but one of many hurdles on the path to a production-ready Space Ape container platform. For us, many questions remain around service discovery; efficient and reliable deployment; injection of configuration; handling scale and how to effectively monitor our cluster. We look forward to sharing more from our journey.


If you’re interested in integrating ComposeECS into your toolchain, or contributing, head over to the project Github page.

CoreOS London January Meetup 2016

 

On Wednesday 13th January some members of our Devops team attended the CoreOS London Meetup hosted at BlackRock.

At Space Ape we are taking steps towards running some of our services in containers and are very interested in CoreOS and the ecosystem around it.

The first talk of the evening by Ric Harvey from ngineered took us through running CoreOS on AWS, specifically talking about their approach and findings running Kubernetes.

We started off with an overview of Kubernetes installation on AWS and covered some of the problems you may run into. One to note was pulling Kubernetes from GitHub: if you decide to pull from GitHub over a lot of nodes in a VPC you’ll probably run into API limits. To get around this you can host Kubernetes in an S3 bucket (or another endpoint you control) and pull from there. This approach is also preferable as it allows you more granular control over the version of Kubernetes you are deploying.

Ric talked us through their CoreOS setup and how they’ve worked to implement AWS best practises in their cluster. To start with they run everything in a VPC and ensure that they deploy the cluster over multiple availability zones (AZ). Within each AZ they deploy at least two subnets, one which will contain their Elastic Load Balancers (ELB) and a second which contains the CoreOS nodes and shared storage nodes.

The CoreOS nodes are set up in autoscaling groups, which has allowed them to scale the fleet up and down automatically. On top of this they’ve got the Kubernetes replication controller deploying the containers around the cluster, ensuring they’ve always got the desired number. The autoscaling groups can either be scaled manually or they can make use of CloudWatch alarms. One improvement they are working on is making use of custom metrics for scaling (e.g. container count), as right now they depend on the metrics from the hypervisors.

Outside of the autoscaling groups they will deploy the Kubernetes master node. This allows them to control the cluster and to ensure that the node will not get destroyed during a scaling event.

The approach to load balancers was very interesting. While Kubernetes can control ELB creation, ngineered opted not to use this and instead use ELBs for public access which forward the traffic to haproxy instances. These haproxy instances run on all of the CoreOS instances and forward the traffic to the right container via IP addresses defined for that service by Kubernetes. It was a good example of controlling costs while still ensuring that the setup was flexible.

It was really good to see a production Kubernetes cluster being deployed on AWS and to get an insight into the challenges that you could face doing it. It was helpful to get an insight into what works and what doesn’t and how careful design will help with cost control.

The slides from this talk are available on Google Docs.

 
The second talk of the evening was by Joseph Schorr from CoreOS who was telling us about the security work CoreOS have been doing at all levels of the stack.

Joseph started off talking about system compromise and asking at what level can you now trust the system? How do you know if the hardware/bootloaders have been compromised?

To combat this uncertainty CoreOS have been working on signing all levels of the system boot process using keys that are stored in a trusted platform module (TPM). At a high level this allows for a set of keys to be embedded in hardware on the system. First the TPM is verified to ensure it’s authentic. Once that has taken place each component in the system can be loaded, with a signature being checked at each stage. If the signature is validated the component will be loaded. If there are any failures the boot will halt.

This system ensures that by the time the OS has been loaded you’ve got a verified trail of each step of the process which can later be audited.

Within your trusted OS environment you can now deploy your containers. It would be expected that you would build your containers from within a trusted environment and have them signed. Using these signed containers you are able to verify and deploy them into your trusted OS environment.

This is a great step forward for security and integrity of the OS and should give administrators a lot more confidence in the systems they are deploying. CoreOS have blogged on this system providing far more technical detail on how it works.

Joseph then went on to talk about Quay.io and the steps they’ve been implementing to provide security insights into the containers they are hosting. Quay.io are using their new tool Clair to scan all of the containers they host against the CVE database (and common vendor databases) and report when your container contains an insecure package. Each layer of the container is scanned so you will get alerted on insecure packages that you might not know are present on the system. Alerts are triggered via webhooks which allow you to receive notifications in a number of different places.

If you’re using Quay.io to host your container images, the Clair scanning seems like something worth using right away.

The slides from this talk are also available on Google Docs.

This talk really showed that in today’s world, with exploits getting more and more sophisticated, security needs to be thought of at all levels of the system, from the hardware up. It also showed that through the use of clever tooling it’s possible to be alerted to potential problems in your infrastructure very soon after they are made public, and how containers can help you to resolve those problems and quickly prove they are resolved.

Thanks to BlackRock for hosting, to Ric and Joseph for speaking, and to the organisers for organising an enjoyable evening. We’re looking forward to the next one.

The Evolution of a CI System

The process of building and maintaining repeatable infrastructure, a process we now know as configuration management, has evolved over the years. It has had to, to keep up with the seismic shifts within the industry. 

In the beginning there were shell scripts and Kickstart manifests, accompanied by – if you were lucky – lengthy procedural documents. Inevitably some clever folk encapsulated these into tools and frameworks such as CFEngine, Puppet and Chef. With these tools at our disposal we now found we could represent our infrastructure as code and, since it was just code, why not apply some of the principles that our developer cousins had been preaching? Namely, unit and integration tests, code reviews, continuous integration and deployment, and so on.

chef_logo

In keeping with the trend, eventually these configuration management tools were themselves further abstracted. Companies built their own bespoke CI systems to solve their own specific problems. 

This is the story of how Space Ape’s Chef-based CI system evolved. Hopefully it may resonate with others, and even provide inspiration to those facing similar problems.

We started with community cookbooks. A lot of community cookbooks. We had cookbooks wrapping those community cookbooks, we even had cookbooks wrapping those wrapper cookbooks. We had no version constraints; if you pushed some code to the Chef server you pushed it to all environments, instantly.

Versioning cookbooks against environments seemed an obvious place to start, so we did. We used the knife spork tool. Knife spork is a handy knife plugin that will ‘bump’ cookbook versions, and ‘promote’ those new versions through environments. Crucially it leaves your production code running a previous version of a cookbook until such time you decide it is safe to promote.

Now, the community cookbook paradigm is great for getting things up and running quickly. But the long tail of dependencies soon becomes unwieldy: do you really need code to install Java on Windows; or yum repository management, when you’re running Ubuntu? Why do we have a runit cookbook, we’ve never even used runit? The problem is that community cookbooks need to support all manner of operating systems and frameworks, not just the specific ones you happen to use. So we took a policy of re-writing all of our infrastructure code, removing unwanted cruft and distilling only that which we absolutely needed.

Eventually, as the quality of our cookbook code improved, we found that often we would want to promote cookbooks through all environments. What better way to achieve this than a for loop?

for env in $(knife environment list); do knife spork promote ${env} sag_logstash; done

Any time you find yourself using the same for-loop each day, it’s probably time to write a script, or a shell-helper at least. Additionally, the only safeguard we had with the for-loop, in the event of a problem, was to frantically hit Ctrl-C before it hit production.

Enter Space Ape’s first, really, er, rubbish CI system:

Our First CI

Essentially our first tool was that same for loop, with some ASCII art thrown in, and some very rudimentary testing between environments (i.e. did the Chef run finish?). It was still a long way from perfect, but a slight improvement. Our main gripe with this approach (apart from the obvious fact that it was indeed a rubbish CI system) was that it still provided very little in the way of safety, and completely ignored our integration tests.

In time we decided we should make some proper use of those tests. A shell script would no longer cut it, ASCII art or not. No, we needed a system we could trust to continuously deploy our cookbook code, dependent on tests, with a proper queueing mechanism and relevant notifications upon failure.

Being decidedly not a ‘not invented here’ Devops team, we investigated some open-source and COTS offerings, but ultimately found them to be not quite suitable or malleable enough for our needs. We decided to build our own.

And so SeaEye was born. OK, it’s a silly name but an amazing pun, and we already had another amazing pun, ApeEye, a system we use for deploying code, so it made sense.

SeaEye is a Rails app that runs on Docker, uses Sidekiq as a background job processor and an AWS RDS database as a backend. It is first and foremost an HTTP API, which just happens to have a nice(-ish) web frontend. This allows us to build command line tools that poke and poll the API for various means.


Beneath the nice(-ish) facade is a hierarchy of stateful workflows, each corresponding to a Sidekiq job and represented as a finite-state machine using the workflow gem. The basic unit of work is the CookbookPush, which is made up of a number of sub-tasks, one for each environment to be pushed through. The CookbookPush is responsible for monitoring the progress of each sub-task, and only when one has successfully completed does it allow the next to run. It makes use of the Consul-based locks we described in this post to add an element of safety to the whole process.
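For a flavour of what the workflow gem gives us, here is a purely illustrative state machine; the states and events are invented for this example and are not SeaEye’s actual ones.

require "workflow"

class CookbookPush
  include Workflow

  workflow do
    state :pending do
      event :start, transitions_to: :pushing
    end
    state :pushing do
      event :complete, transitions_to: :succeeded
      event :abort,    transitions_to: :failed
    end
    state :succeeded
    state :failed
  end
end

push = CookbookPush.new
push.start!   # fire the :start event
push.pushing? # => true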

A CookbookPush can be initiated manually, but that is only half of the story. We wanted SeaEye to integrate with our development workflow. Like most Chef shops, we use Test Kitchen to test our cookbooks. Locally we test using Vagrant, and remotely using Travis-CI with the kitchen-ec2 plugin. We perform work on a branch and, once happy, merge the branch into master. What we’d traditionally do is then watch for the tests to pass before manually kicking off the CookbookPush.


We knew we could do better. So we added another stateful workflow, called the CI. The premise here is that SeaEye itself polls Github for commits against the master branch. If it finds one, and there is a specific tag against it, it will kick off a Travis build itself. Travis is then polled periodically as to the success (or otherwise) of the build, and CookbookPush-es are created for each cookbook concerned. The DevOps team are kept informed of the progress through Slack messages sent by SeaEye.

There are many ways to skin this particular CI cat, and many off-the-shelf products to help facilitate the skinning. Rolling our own happens to have worked well for us, but every team and business is different. We’ve since built a suite of command-line tools, and even integrated SeaEye with ChatOps. Hopefully our experiences will help inspire others facing similar problems.

Is there such thing as a DevOps Hierarchy of Needs?

In 1943 the psychologist Abraham Maslow proposed the concept of a ‘hierarchy of needs’ to describe human motivation. Most often portrayed as a pyramid, with the more fundamental needs occupying the largest space in the bottom layers, his theory states that only in the fulfilment of the lower-level needs can one hope to progress to the next strata of the pyramid. The bottom-most need is of course physiological (i.e. food, shelter, beer etc.); once this is achieved we can start to think about safety (i.e. let’s make sure nobody takes our beer), then we start looking for Love and Self-Esteem before ending up cross-legged in an ashram searching for Self-Actualization and Transcendence.

maslows_hierarchy

Is this a Devops blog or what? Yes, yes it is. The suggestion is not that we should all be striving for Devops Transcendence or anything, but that perhaps the general gist of Maslow’s theory could be applied to coin a DevOps Hierarchy of Needs, and we could use the brief history of our own Devops team at Spaceape to bolster this idea.

In the beginning there was one man. This man was tasked with building the infrastructure to run our first game, Samurai Siege; not just the game-serving tier but also a Graphite installation, an ELK stack, a large Redis farm, an OpenVPN server, a Jenkins server, et cetera et cetera. At this juncture we could not even be certain that Samurai Siege would be a success. The remit was to get something that worked, to run our game to the standards expected by our players.

Some sound technological choices were made at this point, chief of which was to build our game within AWS.

With very few exceptions, we run everything in AWS. We’re exceedingly happy with AWS, and it suits our purposes. You may choose a different cloud provider; you may forego a cloud provider altogether and run your infrastructure on-premise. Whichever it is, this is the service that provides the first layer of our DHoN. You need some sort of request-driven IaaS to equate to Maslow’s Physiological layer. Ideally this would include not only VMs but also your virtual network and storage. Without this service, whatever it might be (and it might be as simple as, say, a set of scripts to build KVM instances), you can’t hope to build toward the upper reaches of the pyramid.

Samurai Siege was launched. It was a runaway success. Even under load the game remained up, functional and performant. The one-man Devops machine left the company and Phase 2 in our short history commenced. We now had an in-house team of two and one remote contractor and we set about improving our lot, striving unawares for that next level of needs. It quickly became apparent, however, that we might face some difficulty…

If AWS provided the rock on which we built our proverbial church, we found that the church itself needed some repairs, someone had stolen the lead from its roof.

Another sound technology choice that was made early was to use Chef as the configuration management tool. Unfortunately – and unsurprisingly given the mitigating circumstances – the implementation was less than perfect. It appeared that Chef had only been used in the initial building of the infrastructure, attempts to run it on any sort of interval led inevitably to what we had started to call ‘facepalm moments’. We had a number of worrying 3rd party dependencies and if Chef was problematic, Cloudformation was outright dangerous. We had accrued what is commonly known as technical debt.

Clearly we had a lot of work to do. We set about wresting back control of our infrastructure. Chef was the first victim: we took a knife to our community cookbooks, we introduced unit tests and cookbook versioning, we separated configuration from code, we even co-opted Consul to help us. Once we had Chef back on-side we had little choice but to rebuild our infrastructure in its entirety, underneath a running game. With the backing of our CTO we undertook a policy of outsourcing components that we considered non-core (this was particularly efficacious with Graphite, more on this one day soon). This enabled us to concentrate  our efforts and to deliver a comprehensive game-serving platform, of which we were able to stamp out a new iteration for our now well-under-development second game, Rival Kingdoms.

It would be easy at this point to draw parallels with Maslow’s second tier, Safety. Our systems were resilient and monitored, we could safely scale them up and down or rebuild them. But actually what we had reached at this point was Repeatability. Our entire estate – from the network, load-balancers, security policies and autoscaling groups through to the configuration of Redis and Elasticsearch or the specifics of our deployment process – was represented as code. In the event of a disaster we could repeat our entire infrastructure.

Now, you might think this is a lazy observation. Of course you should build things in a repeatable fashion, especially in this age of transient hosts, build for failure, chaos monkeys, and all the rest of it. The fact is though, that whilst this should be a foremost concern of a Devops team, quite often it is not. Furthermore, there may be genuine reasons (normally business related) why this is so. The argument here is that you can’t successfully attain the higher layers of our hypothetical DHoN until you’ve reached this stage. You might even believe that you can but be assured that as your business grows, the cracks will appear.

At Spaceape we were entering Phase 3 of our Devops journey, the team had by now gained some and lost some staff, and gained a manager. The company itself was blossoming, the release date of Rival Kingdoms had been set, and we were rapidly employing the best game developers and QA engineers in London.

With our now sturdy IaaS and Repeatability layers in place, we were able to start construction of the next layer of our hierarchy – Tooling. Of course we had built some tools in our journey thus far (they could perhaps be thought of as tiny little ladders resting on the side of our pyramid) but it’s only once things are standardised and repeatable that you can really start building effective tooling for the consumption of others. Any software that tries to encompass a non-standard, cavalier infrastructure will result in a patchwork of ugly if..then..else clauses and eventually a re-write when your estate grows to a point where this is unsustainable. At Spaceape, we developed ApeEye (a hilarious play on the acronym API) which is a RESTful Rails application that just happens to have a nice UI in front of it. Perennially under development, it will eventually provide control over all aspects of our estate, but for now it facilitates the deployment of game code to our multifarious environments (we have a lot of environments – thanks to the benefits of standardisation we are able to very quickly spin up entirely new environments contained on a single virtual host).

And so the launch of Rival Kingdoms came and went. It was an unmitigated success, the infrastructure behaved – and continues to behave – impeccably. We have a third game under development, for which building the infrastructure is now down to a fine art. Perched as we are atop our IaaS, Repeatabilty and Tooling layers, we can start to think about the next layer.

But what is the next layer? It probably has something to do with having the time to reflect, write blog posts, contribute to OSS and speak at Devops events, perhaps in some way analogous to Maslow’s Esteem level. But in all honesty we don’t know, we’ve but scarcely laid the foundations for our Tooling level. More likely is that there is no next level, just a continuous re-hashing of the foundations beneath us as new technologies and challenges enter the fray.

The real point here is a simple truth – only once you have a solid, stable, repeatable and predictable base can you start to build on it to become as creative and as, well, awesome as you’d like to be. Try to avoid the temptation to take shortcuts in the beginning and you’ll reap the benefits in the long term. Incorporate the practices and behaviours that you know you should, as soon as you can. Be kind to your future self.

Chef and Consul

Here at Spaceape, our configuration management tool of choice is Chef. We are also big fans of Consul, the distributed key-value-store-cum-service-discovery-tool from the good folks at Hashicorp. It might not be immediately clear why the two technologies should be mentioned in the same paragraph, but here is the story of how they became strange bedfellows.

In a previous blog post, we told the story of our experiences with Chef. That post goes into far greater detail, but suffice it to say that our infrastructure code base was not always as reliable, configurable or even predictable as it is now. We found ourselves in a dark place where Chef was run on an ad-hoc basis, often with fingers well and truly crossed. To wrest back control and gain confidence we needed to be able to run it on a 15-minute interval.

Simple, you say, write a cron-job. Well yes, that is true. But it’s only a very small part of the story. We would find occasions where initiating a seemingly harmless Chef run could obliterate a server, and yet the same run on an ostensibly similar server would reach the end without incident. In short we had little confidence in our code, certainly not enough to start running it on Production. Furthermore – and this applies still today – often we really don’t want to allow Chef to run simultaneously across a given environment. For example, we may push a change that restarts our game-serving process. I don’t need to expand on what would happen if that change ran across Production at 12:15 one day…

Wouldn’t it be nice, we asked ourselves, if we had some sort of global locking mechanism? To prevent us propagating potential catastrophes? Not only would this allow us to push infrastructure changes through our estate, it might just have other benefits…

Enter Consul!

Like all the Hashicorp products we’ve tried, Consul is solid. Written in Go, it employs the Raft consensus algorithm atop the Serf gossip protocol to provide a highly available distributed system that even passes the Jepsen test. It is somewhat of a swiss army knife that aims to replace or augment your existing service delivery, configuration management and monitoring tools.

The utility we decided to employ for the locking mechanism was the key-value store. We built our own processes and tooling around this, as we’ll see, but it should be noted that more recent versions of Consul than we had available at the time actually have a semaphore offering.

Stored in Consul, we have a number of per-tier, per-environment key-spaces. As an example:

chef/logstash/es-indexer

Logstash is the environment, es-indexer the service. Within this keyspace, the only pre-requisite is a value for the maximum number of concurrent Chef runs we wish to allow, which we call max_concurrent:

chef/logstash/es-indexer/max_concurrent

Generally this value is set to 1, but on some larger environments we set it higher.

When a server wishes to run Chef the first thing it does is to retrieve this max_concurrent value. Assuming the value is a positive integer (a value of -1 will simply allow all Chef runs) it then attempts to acquire a ‘slot’. A slot is obtained by checking this key:

chef/logstash/es-indexer/current

Which is a running count of the number of hosts in the tier currently running Chef. Its absence denotes zero. If the current value is less than the `max_concurrent` value, the server increments the counter and registers itself as ‘running’ by creating a key like this, the value of which is a timestamp:

chef/logstash/es-indexer/running/hostname.of.the.box

The sharper amongst you will have noticed a problem here. What if two hosts try to grab the slot at the same time? To avoid this happening we use Consul’s Check-and-Set feature. The way this works is that, upon the initial read of the current value, a ModifyIndex is retrieved along with the actual Value. If the server decides that current < max_concurrent it attempts to update current by passing a `?cas=ModifyIndex` parameter. If the ModifyIndex does not match that which is stored on Consul, it indicates that something else has updated it in the meantime, and the write fails.

With the slot obtained, the Chef run is allowed to commence. Upon success, the running key is removed and the counter decremented. If, however, the run fails, the lock (or the ‘slot’) is held, no further hosts on the tier are able to acquire a slot, and Chef runs are thereby suspended.
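A stripped-down illustration of the slot-acquisition logic against Consul’s KV HTTP API (error handling, retries and the release step are omitted; the real tooling is rather more involved):

require "net/http"
require "json"
require "base64"

CONSUL = "http://localhost:8500"

# Read a key, returning [value, modify_index] ([nil, 0] if the key is absent).
def kv_get(key)
  res = Net::HTTP.get_response(URI("#{CONSUL}/v1/kv/#{key}"))
  return [nil, 0] unless res.code == "200"
  entry = JSON.parse(res.body).first
  [Base64.decode64(entry["Value"].to_s), entry["ModifyIndex"]]
end

# Check-and-Set write: only succeeds if the key's ModifyIndex is unchanged.
def kv_cas(key, value, index)
  uri = URI("#{CONSUL}/v1/kv/#{key}?cas=#{index}")
  req = Net::HTTP::Put.new(uri)
  req.body = value.to_s
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
  res.body.strip == "true"
end

# keyspace is e.g. "chef/logstash/es-indexer"
def acquire_slot(keyspace)
  max, _         = kv_get("#{keyspace}/max_concurrent")
  current, index = kv_get("#{keyspace}/current")
  return false if max.to_i >= 0 && current.to_i >= max.to_i # -1 allows all runs
  kv_cas("#{keyspace}/current", current.to_i + 1, index)    # false if we lost the race
end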

Our monitoring tools, by checking the timestamp of the running key, are able to warn us when locks have been held for a certain period of time (i.e. longer than the average Chef run takes) and failures are contained to one (or rather max_concurrent) hosts.

And so… this all works rather well. Many has been the time when we’d look at one another, puff our cheeks, and say, “Thank goodness for the locking system!” Over time it has allowed us to unpick our infrastructure code and get to the smug position we find ourselves. Almost never do we see locks being held for anything but trivial problems (Chef server timeouts for instance), nothing that a little judicious sleep-and-retry doesn’t fix. It also gives us great control over when and what runs Chef as we can easily disable Chef runs for a given tier by setting max_concurrent to 0.

But the purists amongst you will no doubt be screaming something about a CI system, or better unit tests, or something, or something. And you’d be right. The truth is that we were unable to shoehorn a CI system into infrastructure code which was underpinning a live game, in which we did not have complete confidence. Having the backup-parachute of the mechanism described above, though, has enabled us to address this. But that doesn’t mean we’ll be discarding it. On the contrary, it will form the backbone of our CI system, facilitating the automatic propagation of infrastructure code throughout our estate. More on that to follow.