Vault Configuration as Code

Here at Space Ape we use Vault extensively. All of our instances authenticate with Vault using the EC2 auth backend which allows us to restrict the scope of secrets any instance has access to.

Behind Vault, we use Consul as a backend to persist our secrets with a good level of durability, and we make use of Consul’s snapshot feature to create backups, which means we can restore both Consul and Vault from a backup if the worst should happen.
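For reference, with a Consul version recent enough to ship the snapshot subcommand, the backup and restore steps are roughly:

# capture a point-in-time snapshot of Consul's state (which includes Vault's encrypted data)
consul snapshot save vault-backup-$(date +%F).snap

# restore it into a cluster if the worst does happen
consul snapshot restore vault-backup-2017-06-01.snap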

Where we’ve struggled with Vault is in managing the configuration: which policies, roles and auth backends do we have? Which of our AWS accounts are set up for the EC2 auth, and how do we update or replicate any of these configurations? If we had to set up a new instance of Vault, or recover an existing one, how long would it take us to get everything set up? Probably a lot longer than it should.

This isn’t something we accept elsewhere in our estate: we use CloudFormation to manage precisely how our AWS infrastructure looks; we use Chef to manage exactly how our instances are set up and applications are configured. All of this configuration is stored in Git. In short, we treat our configuration as code.

For those looking to manage configuration in Vault, help is at hand. In November 2016 Hashicorp’s Seth Vargo penned a blog post that caught our interest – Codifying vault policies and configuration – in which he describes how to use the Vault API to apply configuration from files. There are a few things we can learn from Seth’s post:

  • The API calls are idempotent
  • The script ignores the response as you’ll often get non-200 responses (for instance if a mount already exists)
  • He maps the directory structure to the API, which makes it easy to rewrite the code in any language without having to change your directory structure.
  • API calls need to be applied in the correct order (e.g. an Auth backend must exist before you can apply configuration to it).
  • You can integrate this into your CI lifecycle.

A couple of things that are missing:

  • Code testing
  • Verification that our API calls were actually successful.

Taking Seth’s blog post as our starting point, we set out to implement configuration-managed Vault clusters using the API.

We use a lot of Ruby here, so it made sense to create a gem to apply our configuration for us, which also gave us the opportunity to add unit tests. We can then use Jenkins to test applying our actual configuration.

Requirements

  • Code should be tested
  • We should verify that our config has been applied correctly
  • We want a CI pipeline for our configuration.

We quickly realised that a lot of the process is repeated for each API endpoint:

  1. Locate files containing the configuration
  2. Parse the files containing the configuration
  3. Apply the configuration
  4. Verify the configuration

We have a Setup class that handles creating an instance of the Vault Client and locating the relevant files for each configuration type.

We created a Base class that our implementation classes (policies, auth backends, etc.) can inherit, which will parse, apply and verify configuration.

Setup class

Creating a Vault client is as simple as using the Vault gem and providing the usual configuration details, such as the address and a token.

We also have methods to locate the relevant files for any configuration item, such as policies. We simply need to supply the path to the directory in which the configuration files reside.
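A condensed sketch of the Setup class (the class layout and helper names are illustrative rather than the gem’s exact internals, but the keyword arguments match those used in the specs later on):

require "vault"

module Spaceape
  module VaultSetup
    class Setup
      attr_reader :client, :config_dir

      def initialize(vault_address:, vault_token:, config_dir:, ssl_verify: true)
        @config_dir = config_dir
        @client = Vault::Client.new(
          address:    vault_address,
          token:      vault_token,
          ssl_verify: ssl_verify
        )
      end

      # Locate the configuration files for a given item, e.g. policy_files
      # returns everything under <config_dir>/sys/policy.
      def policy_files
        files_in("sys/policy")
      end

      def files_in(relative_path)
        Dir.glob(File.join(config_dir, relative_path, "*.*"))
      end
    end
  end
end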

Base class

In the Base class we start by parsing the files that the Setup class located for us. We accept HCL, YAML or JSON files and parse them into a hash.

We then call apply and verify methods which are implemented in classes specific to the configuration item such as Policies or Auths.
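A condensed sketch of the Base class; the HCL branch assumes an HCL parser gem such as rhcl, the rest is the standard library:

require "json"
require "yaml"
# require "rhcl" # or whichever HCL parser you prefer

module Spaceape
  module VaultSetup
    class Base
      attr_reader :setup, :client

      def initialize(setup, *_extra) # extra flags seen in the specs are omitted from this sketch
        @setup  = setup
        @client = setup.client
      end

      # Parse a single config file into a hash, based on its extension.
      def parse_file(path)
        case File.extname(path)
        when ".json"         then JSON.parse(File.read(path), symbolize_names: true)
        when ".yaml", ".yml" then YAML.load_file(path)
        when ".hcl"          then Rhcl.parse(File.read(path)) # assumes the rhcl gem
        else raise ArgumentError, "unsupported config format: #{path}"
        end
      end

      # Apply then verify every file we were handed; the filename (minus the
      # extension) becomes the item's name.
      def apply_items(files)
        files.each do |file|
          name = File.basename(file, ".*")
          hash = parse_file(file)
          apply(name, hash)
          verify(name, hash)
        end
      end

      # Subclasses (Policy, Auth, Mount, ...) implement these two.
      def apply(_name, _hash)
        raise NotImplementedError
      end

      def verify(_name, _hash)
        raise NotImplementedError
      end
    end
  end
end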

Policies

Applying policies is a good starting point as they represent a lot of our configuration and are referenced by other sections of configuration.

We save time by inheriting the Base class we discussed above, and we have an instance of the Setup class so that we can locate the files we need and have a Vault client to use.

We then implement the apply and verify methods. For a Policy the apply method is very simple: it uses the name of the file as the name of the policy, and the contents of the file, translated into JSON, as the body of the policy:

client.sys.put_policy(name, hash.to_json)

Next we verify that our Policy was correctly applied. The first step is to request the policy from Vault, which we can simply ask for by name (which is also the filename):

client.sys.policy(name)

Then we can:

  1. Check that we received a policy, and not an error or an empty blob of JSON.
  2. Check that the JSON we receive matches the JSON we sent. We use the JsonCompare gem to verify each key-value pair that is returned.
└── sys
    └── policy
        └── admins.hcl

The directory structure in which we store our policies. /sys/policy/admins would be the API path to which you would post a policy if you wanted to use the API directly.

path "*" {
  policy = "sudo"
}

A (really bad) example of a Vault policy that admins.hcl might contain. We parse this as HCL and post it to /sys/policy/admins.
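Putting the pieces together, our Policy class boils down to something like the following condensed sketch. The JsonCompare require name and the accessor for the stored rules are assumptions on our part; the error class is the one referenced in the tests below.

require "json"
require "json-compare" # assumption: the JsonCompare gem's require name

module Spaceape
  module VaultSetup
    class Policy < Base
      # Apply: the filename is the policy name, the parsed file is the body.
      def apply(name, hash)
        client.sys.put_policy(name, hash.to_json)
      end

      # Verify: read the policy back and diff it against what we sent.
      def verify(name, hash)
        remote = client.sys.policy(name)
        raise ItemMismatchError, "policy #{name} not found" if remote.nil?

        # We assume the vault gem hands the stored rules back as a string.
        stored = JSON.parse(remote.rules, symbolize_names: true)
        diff = JsonCompare.get_diff(hash, stored)
        raise ItemMismatchError, "policy #{name} differs: #{diff}" unless diff.empty?
      end
    end
  end
end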

Testing

One of our requirements was to write tests. Below are our tests for policies.

require "spec_helper"

describe Spaceape::VaultSetup::Policy do
  subject do
    Spaceape::VaultSetup::Policy.new(
      Spaceape::VaultSetup::Setup.new(
        vault_address: "http://vault:8200",
        ssl_verify: false,
        config_dir: "spec/fixtures/main",
        vault_token: vault_token
      ),
      false
    )
  end

  let(:test_policy) do
    {
      "path": {
        "auth/app-id/map/user-id/*": {
          "policy": "write"
        }
      }
    }
  end

  it "applies and verifies a policy" do
    subject.apply("test-policy", test_policy)
    expect { subject.verify("test-policy", test_policy) }
      .to_not raise_error
  end

  it "identifies invalid policy" do
    subject.apply("test-policy", test_policy)
    wrong_role = test_policy.dup
    wrong_role[:path] = "/auth/app-id/map/uuuuuu/*"
    expect { subject.verify("test-policy", wrong_role) }
      .to raise_error(Spaceape::VaultSetup::ItemMismatchError)
  end

  it "applies all policies in config_dir" do
    subject.apply_items(subject.policy_files)
    expect(subject.client.sys.policies)
      .to include("test-policy2", "test-policy")
  end
end

From the test above you can see that we test against a Vault server at vault:8200. We run these tests in Docker and make use of Docker Compose, so we can create a Vault server in dev mode and a Ruby container, with our code mounted in a volume, to run our tests.
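A minimal sketch of what that compose file can look like (the image tags, dev root token and test command are illustrative):

version: "2"
services:
  vault:
    image: vault:0.6.5
    command: server -dev -dev-root-token-id=test-root-token
  tests:
    image: ruby:2.3
    volumes:
      - .:/src
    working_dir: /src
    environment:
      VAULT_TOKEN: test-root-token
    depends_on:
      - vault
    command: bash -c "bundle install && bundle exec rspec"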

Auths and Mounts

Policies were easy – we parse the file, make a single API call to apply the policy and another to verify it. Auths and Mounts are a bit more complicated. There are essentially three parts to each:

  1. Enable the Auth/Mount
  2. Tune the Auth/Mount
  3. Configure the Auth/Mount

Enabling is pretty simple: you pass the name (this is what you want to call it), the type (such as secret, github or pki) and an optional description.

We store this information in sys/auth/<name>.ext. The API endpoint is sys/auth/<name>.

└── sys
    └── auth
        └── github-spaceape.json

The contents of this file may look like:

{
  "type": "github",
  "description": "spaceape github",
  "config": {
    "max_lease_ttl": "87600h",
    "default_lease_ttl": "3h"
  }
}

Notice it contains the type and description which we covered above. It also includes a config key; this is actually the tuning we can apply to the Auth/Mount. The tuning is applied to the API endpoint sys/auth/<name>/tune, so it seems sensible to store it in the same file.

So far so good, but now we come to configuring the Auth or Mount. There’s no standard pattern here, and the configuration sometimes requires secrets. We decided to exclude any secrets from the config; these can be applied as manual steps later. We can, however, apply some configuration.

For example, we can set the organisation for the Github auth, but we wouldn’t want to set the AWS credentials for the EC2 auth backend.

The API endpoint for applying configuration to Auths is auth/<name>/config, and for Mounts it is <name>/config/<config_item>. We decided to group our mounts under a mounts directory, veering slightly from the file structure matching the API path.
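As a rough sketch of how the three parts map onto calls with the vault gem (the tune and config writes here go through logical.write against the raw API paths, which is an assumption rather than a dedicated gem helper; the organisation value is illustrative):

require "vault"

client = Vault::Client.new(address: "http://vault:8200", token: ENV["VAULT_TOKEN"])

# 1. Enable the auth backend: sys/auth/github-spaceape
client.sys.enable_auth("github-spaceape", "github", "spaceape github")

# 2. Tune it: sys/auth/github-spaceape/tune (the "config" key from the file above)
client.logical.write("sys/auth/github-spaceape/tune",
                     max_lease_ttl: "87600h", default_lease_ttl: "3h")

# 3. Configure it: auth/github-spaceape/config (no secrets, just the organisation)
client.logical.write("auth/github-spaceape/config", organization: "our-github-org")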

Our directory structure now looks a little like this:

├── sys
│   ├── auth
│   │   └── github-spaceape.json
│   └── mounts
│       └── spaceape-pki
├── auth
│   └── github-spaceape
│       └── config.json
└── mounts
    └── spaceape-pki
        ├── config
        │   └── urls.json
        └── roles
            └── example-role.json

This is where mapping the file path to the API path comes into its own: we can handle any of the Auths or Mounts without having to write code for each specific type; we just have to get the directory structure correct.

Gotchas

There are a few things to look out for.

  1. When verifying that our changes were applied, Vault sometimes gives you back more than you expect. We just verify the fields we pass in.
  2. Time-based fields (like the various ttl fields) are not always returned in the same format; you may get the time in seconds, or days and hours, etc. We found the chronic_duration gem useful for parsing the times for easy comparison (see the sketch after this list).
  3. Some configuration on an Auth or Mount may have to be applied in a specific order; this is where we would have to write custom code to handle that particular type of Auth or Mount. Perhaps a configuration file could define the order in which to apply certain configuration.
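As an example of the TTL normalisation in point 2, a small helper sketch using the chronic_duration gem:

require "chronic_duration"

# Normalise a TTL to seconds before comparing, since Vault may hand back
# "3h", "10800" or 10800 for the same value.
def normalise_ttl(value)
  return value if value.is_a?(Numeric)
  return value.to_i if value =~ /\A\d+\z/  # already plain seconds
  ChronicDuration.parse(value)             # "3h" => 10800, "87600h" => 315360000
end

normalise_ttl("3h") == normalise_ttl(10_800) # => true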

Continuous Integration

When we check in code, a Jenkins job is triggered which runs our tests. As mentioned earlier, we run our tests inside Docker containers, which means we don’t have to worry about clashing gem versions from the other Ruby-based builds we have on Jenkins.

More interesting to us is that we can now test our actual Vault configuration, so when we add a new policy we know it applies correctly. Again we use Jenkins to do this. Each time we commit a change to our Vault configuration git repository, we trigger a build which attempts to apply the configuration to an instance of Vault running in dev mode. If any of the configuration fails to apply, we’re notified through Slack so we can see what caused it.

It’s still up to us to apply the changes to the production instance of Vault after the Jenkins tests have run successfully. This is mainly because we don’t want to give privileged Vault tokens out to Jenkins.

Final Words

The process we’ve described above for managing Vault configuration is just one way you could go about solving this problem. From our experience, it works – we can test our configuration and apply it in a repeatable and programmatic way.

It is, however, still a work in progress and there will doubtless be a few problems to overcome as we continue our development. We hope to open source the code in the future, but right now we feel there are still some improvements to make; for instance, at the moment we test against Vault 0.6.5 (the latest release is 0.7.3) and we’ve only tested against a handful of Mounts and Authentication backends.


Introducing ComposeECS

The DevOps team here at Space Ape have just open-sourced a small Ruby gem that provides a mechanism to convert Docker Compose specifications to AWS EC2 Container Service task definitions – we’ve called it ComposeECS.

We run the majority of our infrastructure on AWS, using source-controlled CloudFormation templates to manage each of our stacks. Over time we’ve built up a toolchain to help us manage changes to our CloudFormation stacks known internally as ApeStack which incorporates CFNDSL and our internal conventions and processes.

Toward the end of 2015 we started to build out a new platform that would support deployments of containerised applications where it made sense. As heavy users of AWS, EC2 Container Service (ECS) was the obvious choice for running containers in the cloud. The potential advantages of deep integration with AWS services like ELB and IAM have significant implications when it comes to integrating the new platform with our existing stack.

As a team, we are huge fans of specifying configuration in YAML. It’s then perhaps no surprise that we much prefer the syntax of Docker Compose definitions over the JSON-based ECS task definitions. We wanted to be able to specify our task definitions in YAML alongside our CloudFormation templates. Furthermore, we wanted to construct ECS services and their supporting infrastructure with a single command. To this end, we wrote ComposeECS.

ComposeECS reads any Docker Compose file and translates supported attributes (including volume definitions) into an equivalent ECS task definition JSON – sanity-checking your attributes and ensuring compatibility with the ECS task definition specification as it does so. The advantages ComposeECS provides include:

  • Container definitions are more readable and therefore easier to maintain.
  • Docker Compose definitions written to run in our local Docker environment now run on ECS with little modification.
  • Unlike services translated with the ECS-CLI, which supports Docker Compose deployment, our services are CloudFormation-managed whilst still taking advantage of the Docker Compose syntax.
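To give a feel for the translation involved (this is a hand-written illustration of the two formats, not ComposeECS’s literal output), a Docker Compose service such as:

web:
  image: nginx:1.9
  ports:
    - "80:80"
  mem_limit: 256m

maps onto an ECS task definition along these lines:

{
  "family": "web",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "nginx:1.9",
      "memory": 256,
      "portMappings": [
        { "containerPort": 80, "hostPort": 80 }
      ]
    }
  ]
}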

We’re very pleased with ComposeECS. However, it is but one of many hurdles on the path to a production-ready Space Ape container platform. For us, many questions remain around service discovery; efficient and reliable deployment; injection of configuration; handling scale and how to effectively monitor our cluster. We look forward to sharing more from our journey.


If you’re interested in integrating ComposeECS into your toolchain, or contributing, head over to the project Github page.

Is there such thing as a DevOps Hierarchy of Needs?

In 1943 the psychologist Abraham Maslow proposed the concept of a ‘hierarchy of needs’ to describe human motivation. Most often portrayed as a pyramid, with the more fundamental needs occupying the largest space in the bottom layers, his theory states that only in the fulfilment of the lower-level needs can one hope to progress to the next stratum of the pyramid. The bottom-most need is of course physiological (i.e. food, shelter, beer etc); once this is achieved we can start to think about safety (i.e. let’s make sure nobody takes our beer); then we start looking for Love and Self Esteem before ending up cross-legged in an ashram searching for Self-Actualization and Transcendence.

[Image: Maslow’s hierarchy of needs]

Is this a Devops blog or what? Yes, yes it is. The suggestion is not that we should all be striving for Devops Transcendence or anything, but that perhaps the general gist of Maslow’s theory could be applied to coin a DevOps Hierarchy of Needs, and we could use the brief history of our own Devops team at Spaceape to bolster this idea.

In the beginning there was one man. This man was tasked with building the infrastructure to run our first game, Samurai Siege; not just the game-serving tier but also a Graphite installation, an ELK stack, a large Redis farm, an OpenVPN server, a Jenkins server, et cetera et cetera. At this juncture we could not even be certain that Samurai Siege would be a success. The remit was to get something that worked, to run our game to the standards expected by our players.

Some sound technological choices were made at this point, chief of which was to build our game within AWS.

With very few exceptions, we run everything in AWS. We’re exceedingly happy with AWS, and it suits our purposes. You may choose a different cloud provider; you may forego a cloud provider altogether and run your infrastructure on-premise. Whichever it is, this is the service that provides the first layer of our DHoN. You need some sort of request-driven IaaS to equate to Maslow’s Physiological layer. Ideally this would include not only VMs but also your virtual network and storage. Without this service, whatever it might be (and it might be as simple as, say, a set of scripts to build KVM instances), you can’t hope to build toward the upper reaches of the pyramid.

Samurai Siege was launched. It was a runaway success. Even under load the game remained up, functional and performant. The one-man Devops machine left the company and Phase 2 in our short history commenced. We now had an in-house team of two and one remote contractor and we set about improving our lot, striving unawares for that next level of needs. It quickly became apparent, however, that we might face some difficulty…

If AWS provided the rock on which we built our proverbial church, we found that the church itself needed some repairs, someone had stolen the lead from its roof.

Another sound technology choice that was made early was to use Chef as the configuration management tool. Unfortunately – and unsurprisingly given the mitigating circumstances – the implementation was less than perfect. It appeared that Chef had only been used in the initial building of the infrastructure, attempts to run it on any sort of interval led inevitably to what we had started to call ‘facepalm moments’. We had a number of worrying 3rd party dependencies and if Chef was problematic, Cloudformation was outright dangerous. We had accrued what is commonly known as technical debt.

Clearly we had a lot of work to do. We set about wresting back control of our infrastructure. Chef was the first victim: we took a knife to our community cookbooks, we introduced unit tests and cookbook versioning, we separated configuration from code, we even co-opted Consul to help us. Once we had Chef back on-side we had little choice but to rebuild our infrastructure in its entirety, underneath a running game. With the backing of our CTO we undertook a policy of outsourcing components that we considered non-core (this was particularly efficacious with Graphite, more on this one day soon). This enabled us to concentrate  our efforts and to deliver a comprehensive game-serving platform, of which we were able to stamp out a new iteration for our now well-under-development second game, Rival Kingdoms.

It would be easy at this point to draw parallels with Maslow’s second tier, Safety. Our systems were resilient and monitored, we could safely scale them up and down or rebuild them. But actually what we had reached at this point was Repeatability. Our entire estate – from the network, load-balancers, security policies and autoscaling groups through to the configuration of Redis and Elasticsearch or the specifics of our deployment process – was represented as code. In the event of a disaster we could repeat our entire infrastructure.

Now, you might think this is a lazy observation. Of course you should build things in a repeatable fashion, especially in this age of transient hosts, build for failure, chaos monkeys, and all the rest of it. The fact is though, that whilst this should be a foremost concern of a Devops team, quite often it is not. Furthermore, there may be genuine reasons (normally business related) why this is so. The argument here is that you can’t successfully attain the higher layers of our hypothetical DHoN until you’ve reached this stage. You might even believe that you can but be assured that as your business grows, the cracks will appear.

At Spaceape we were entering Phase 3 of our Devops journey, the team had by now gained some and lost some staff, and gained a manager. The company itself was blossoming, the release date of Rival Kingdoms had been set, and we were rapidly employing the best game developers and QA engineers in London.

With our now sturdy IaaS and Repeatability layers in place, we were able to start construction of the next layer of our hierarchy – Tooling. Of course we had built some tools in our journey thus far (they could perhaps be thought of as tiny little ladders resting on the side of our pyramid) but it’s only once things are standardised and repeatable that you can really start building effective tooling for the consumption of others. Any software that tries to encompass a non-standard, cavalier infrastructure will result in a patchwork of ugly if..then..else clauses and eventually a re-write when your estate grows to a point where this is unsustainable. At Spaceape, we developed ApeEye (a hilarious play on the acronym API) which is a RESTful Rails application that just happens to have a nice UI in front of it. Perennially under development, eventually it will provide control over all aspects of our estate but for now it facilitates the deployment of game code to our multifarious environments (we have a lot of environments – thanks to the benefits of standardisation we are able to very quickly spin up entirely new environments contained on a single virtual host).

And so the launch of Rival Kingdoms came and went. It was an unmitigated success; the infrastructure behaved – and continues to behave – impeccably. We have a third game under development, for which building the infrastructure is now down to a fine art. Perched as we are atop our IaaS, Repeatability and Tooling layers, we can start to think about the next layer.

But what is the next layer? It probably has something to do with having the time to reflect, write blog posts, contribute to OSS and speak at Devops events, perhaps in some way analogous to Maslow’s Esteem level. But in all honesty we don’t know, we’ve but scarcely laid the foundations for our Tooling level. More likely is that there is no next level, just a continuous re-hashing of the foundations beneath us as new technologies and challenges enter the fray.

The real point here is a simple truth – only once you have a solid, stable, repeatable and predictable base can you start to build on it to become as creative and as, well, awesome as you’d like to be. Try to avoid the temptation to take shortcuts in the beginning and you’ll reap the benefits in the long term. Incorporate the practices and behaviours that you know you should, as soon as you can. Be kind to your future self.

Chef and Consul

Here at Spaceape, our configuration management tool of choice is Chef. We are also big fans of Consul, the distributed key-value-store-cum-service-discovery-tool from the good folks at Hashicorp. It might not be immediately clear why the two technologies should be mentioned in the same paragraph, but here is the story of how they became strange bedfellows.

In a previous blog post, we told the story of our experiences with Chef. That post goes into far greater detail, but suffice it to say that our infrastructure code base was not always as reliable, configurable or even predictable as it is now. We found ourselves in a dark place where Chef was run on an ad-hoc basis, often with fingers well and truly crossed. To wrest back control and gain confidence we needed to be able to run it on a 15-minute interval.

Simple, you say, write a cron-job. Well yes, that is true. But it’s only a very small part of the story. We would find occasions where initiating a seemingly harmless Chef run could obliterate a server, and yet the same run on an ostensibly similar server would reach the end without incident. In short we had little confidence in our code, certainly not enough to start running it on Production. Furthermore – and this applies still today – often we really don’t want to allow Chef to run simultaneously across a given environment. For example, we may push a change that restarts our game-serving process. I don’t need to expand on what would happen if that change ran across Production at 12:15 one day…

Wouldn’t it be nice, we asked ourselves, if we had some sort of global locking mechanism? To prevent us propagating potential catastrophes? Not only would this allow us to push infrastructure changes through our estate, it might just have other benefits…

Enter Consul!

Like all the Hashicorp products we’ve tried, Consul is solid. Written in Go, it employs the Raft consensus algorithm atop the Serf gossip protocol to provide a highly available distributed system that even passes the Jepsen test. It is somewhat of a swiss army knife that aims to replace or augment your existing service delivery, configuration management and monitoring tools.

The utility we decided to employ for the locking mechanism was the key-value store. We built our own processes and tooling around this, as we’ll see, but it should be noted that more recent versions of Consul than we had available at the time actually have a semaphore offering.

Stored in Consul, we have a number of per-tier, per-environment key-spaces. As an example:

chef/logstash/es-indexer

Logstash is the environment, es-indexer the service. Within this keyspace, the only pre-requisite is a value for the maximum number of concurrent Chef runs we wish to allow, which we call max_concurrent:

chef/logstash/es-indexer/max_concurrent

Generally this value is set to 1, but on some larger environments we set it higher.
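Setting or changing that value is a single write to Consul’s KV HTTP API, for example:

# allow one concurrent Chef run across the logstash/es-indexer tier
curl -X PUT -d '1' http://127.0.0.1:8500/v1/kv/chef/logstash/es-indexer/max_concurrent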

When a server wishes to run Chef the first thing it does is to retrieve this max_concurrent value. Assuming the value is a positive integer (a value of -1 will simply allow all Chef runs) it then attempts to acquire a ‘slot’. A slot is obtained by checking this key:

chef/logstash/es-indexer/current

Which is a running count of the number of hosts in the tier currently running Chef. Its absence denotes zero. If the current value is less than the `max_concurrent` value, the server increments the counter and registers itself as ‘running’ by creating a key like this, the value of which is a timestamp:

chef/logstash/es-indexer/running/hostname.of.the.box

The sharper amongst you will have noticed a problem here. What if two hosts try to grab the slot at the same time? To avoid this happening we use Consul’s Check-and-Set feature. The way this works is that, upon the initial read of the current value, a ModifyIndex is retrieved along with the actual Value. If the server decides that current < max_concurrent it attempts to update current by passing a `?cas=ModifyIndex` parameter. If the ModifyIndex does not match that which is stored on Consul, it indicates that something else has updated it in the meantime, and the write fails.
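A minimal Ruby sketch of that check-and-set flow against the KV HTTP API (the real tooling also writes the running/<hostname> key and handles retries):

require "json"
require "base64"
require "net/http"

CONSUL = URI("http://127.0.0.1:8500")
KEY    = "chef/logstash/es-indexer/current"

# Read the current counter and its ModifyIndex; an absent key denotes zero.
def read_current
  res = Net::HTTP.get_response(URI("#{CONSUL}/v1/kv/#{KEY}"))
  return [0, 0] if res.code == "404"
  entry = JSON.parse(res.body).first
  [Base64.decode64(entry["Value"]).to_i, entry["ModifyIndex"]]
end

# Try to grab a slot: the PUT only succeeds if the ModifyIndex hasn't changed
# underneath us (?cas=0 means "create only if the key doesn't exist yet").
def try_acquire_slot(max_concurrent)
  current, index = read_current
  return false unless current < max_concurrent
  res = Net::HTTP.new(CONSUL.host, CONSUL.port)
                 .send_request("PUT", "/v1/kv/#{KEY}?cas=#{index}", (current + 1).to_s)
  res.body.strip == "true" # "false" means someone beat us to it
end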

With the slot obtained, the Chef run is allowed to commence. Upon success, the running key is removed and the counter decremented. If however the run fails the lock (or the ‘slot’) is held, no further hosts on the tier are able to acquire a slot, and Chef runs are thereby suspended.

Our monitoring tools, by checking the timestamp of the running key, are able to warn us when locks have been held for a certain period of time (i.e. longer than the average Chef run takes) and failures are contained to one (or rather max_concurrent) hosts.

And so… this all works rather well. Many has been the time when we’d look at one another, puff our cheeks, and say, “Thank goodness for the locking system!” Over time it has allowed us to unpick our infrastructure code and get to the smug position we find ourselves. Almost never do we see locks being held for anything but trivial problems (Chef server timeouts for instance), nothing that a little judicious sleep-and-retry doesn’t fix. It also gives us great control over when and what runs Chef as we can easily disable Chef runs for a given tier by setting max_concurrent to 0.

But the purists amongst you will no doubt be screaming something about a CI system, or better unit tests, or something, or something. And you’d be right. The truth is that we were unable to shoehorn a CI system into infrastructure code which was underpinning a live game, in which we did not have complete confidence. Having the backup-parachute of the mechanism described above, though, has enabled us to address this. But that doesn’t mean we’ll be discarding it. On the contrary, it will form the backbone of our CI system, facilitating the automatic propagation of infrastructure code throughout our estate. More on that to follow.