Vault Configuration as Code

Here at Space Ape we use Vault extensively. All of our instances authenticate with Vault using the EC2 auth backend which allows us to restrict the scope of secrets any instance has access to.

Behind Vault, we use Consul as a backend to persist our secrets with a good level of durability and make use of Consul’s snapshot feature to create backups, which means we can restore both Consul and Vault from the backup if the worst case occurred.

Where we’ve struggled with Vault is in managing the configuration: which policies, roles, auth backends do we have? Which of our AWS accounts are setup for the EC2 auth and how do we update or replicate any of these configurations? If we had to set up a new instance of Vault, or recover an existing one, how long would it take us to get everything setup? Probably a lot longer than it should.

This isn’t something we accept elsewhere in our estate: We use CloudFormation to manage precisely how out AWS infrastructure looks; we use Chef to manage exactly how our instances are setup and applications are configured. All of this is configuration is stored in Git. In short we treat our configuration as code.

For those looking to manage configuration in Vault, help is at hand. In November 2016 Hashicorp’s Seth Vargo penned a blog post that caught our interest – Codifying vault policies and configuration – in which he describes how to use the Vault API to apply configuration from files. There a few things we can learn from Seth’s post:

  • The API calls are idempotent
  • The script ignores the response as you’ll often get non-200 responses (for instance if a mount already exists)
  • He maps the directory structure to the API, this makes it easy to rewrite the code in any language without having to change your directory structure.
  • API calls need to be applied in the correct order (e.g. An Auth backend must exist before you can apply configuration to it.
  • You can integrate this into your CI lifecycle.

A couple of things that are missing:

  • Code testing
  • Verifying the result of our API call was successful.

Taking Seth’s blog post as our starting point, we set out to implement configuration-managed Vault clusters using the API.

We use a lot of Ruby here so it makes sense to create a gem to apply our configuration for us and we can take the opportunity to apply unit tests. We can use Jenkins to test applying our actual configuration.

Requirements

  • Code should be tested
  • We should verify that our config has been applied correctly
  • We want a CI pipeline for our configuration.

We quickly realised that a lot of the process is repeated for each API endpoint:

  1. Locate files containing the configuration
  2. Parse the files containing the configuration
  3. Apply the configuration
  4. Verify the configuration

We have a Setup class that handles creating an instance of the Vault Client and locating the relevant files for each configuration type.

We created a Base class that our implementation (policies, auths backends etc) classes can inherit that will parse, apply and verify configuration.

Setup class

To create a Vault client it’s as simple as using the  Vault gem and providing the usual configuration details such as the address and a token.

We also have methods to locate the relevant files for any configuration item, such as policies. We simply need to supply the path to the directory in which the configuration files reside.

Base class

In the Base class we start by parsing files that the Setup class located for us. We accept hcl, yaml or json files and parse them into a hash.

We then call apply and verify methods which are implemented in classes specific to the configuration item such as Policies or Auths.

Policies

Applying policies is a good starting point as they represent a lot of our configuration and are referenced by other sections of configuration.

We save time by inheriting the the Base class we discussed above and we have an instance of the Setup class so that we can locate the files we need and have a Vault client to use.

We then implement the apply and verify methods. For a Policy the apply method is very simple, it simply uses the name of the file as the name of the policy and the contents of the file, which we translate into Json, as the body of the policy

client.sys.put_policy(name, hash.to_json)

Next we verify that our Policy was correctly applied. The first step of this is to request the policy from Vault which we can simply ask for by name (also filename).

client.sys.policy(name)

Then we can:

  1. Check that we received a Policy and not an error or an empty blob of Json.
  2. That the Json we receive matches the Json that we sent. We use the JsonCompare gem to verify each key value pair that is returned.
└── sys
    └── policy
        └── admins.hcl

The directory structure in which we store our policies. /sys/policy/admins would be the api path to post a policy if you wanted to use the API directly.

path "*" {
 policy = "sudo"
}

A really bad example of a Vault role that admin.hcl might contain. We parse this as HCL to post to /sys/policy/admins.

Testing

One of our requirements was to write tests. Below are our tests for policies.

require "spec_helper"

describe Spaceape::VaultSetup::Policy do
  subject do
    Spaceape::VaultSetup::Policy.new(
      Spaceape::VaultSetup::Setup.new(
        vault_address: "http://vault:8200",
        ssl_verify: false,
        config_dir: "spec/fixtures/main",
        vault_token: vault_token
      ),
      false
    )
  end

  let(:test_policy) do
    {
      "path": {
         "auth/app-id/map/user-id/*":
           {
             "policy": "write"
           }
        }
     }
   end

  it "applies and verifies a policy" do
    subject.apply("test-policy", test_policy)
      expect { subject.verify("test-policy", test_policy) }
        .to_not raise_error
  end

  it "identifies invalid policy" do
    subject.apply("test-policy", test_policy)
    wrong_role = test_policy.dup
    wrong_role[:path] = "/auth/app-id/map/uuuuuu/*"
    expect { subject.verify("test-policy", wrong_role) }
      .to raise_error(Spaceape::VaultSetup::ItemMismatchError)
  end

  it "applies all policies in config_dir" do
    subject.apply_items(subject.policy_files)
    expect(subject.client.sys.policies)
      .to include("test-policy2", "test-policy")
  end
end

From the test above you can see that can see that we test against a vault server at vault:8200. We run these tests in in Docker and make use of Docker compose so we can create a Vault server in dev mode and then a ruby container, with our code mounted in a volume, to run our tests.

Auths and Mounts

Policies were easy – we parse the file, make a single API call to apply the policy and another to verify it. Auths and Mounts are a bit more complicated. There are essentially three parts to each:

  1. Enable the Auth/Mount
  2. Tune the Auth Mount
  3. Configure the Auth Mount

Enabling is pretty simple you pass the name (this is what you want to call it), the type (such as secret, github or pki) and an optional description.

We store this information in sys/auth/<name>.ext. The API endpoint is sys/auth/<name>.

└── sys
    └── auth
        └── github-spaceape.json

This contents of this file may look like:

{
  "type": "github",
  "description": "spaceape github",
  "config": {
    "max_lease_ttl": "87600h",
    "default_lease_ttl": "3h"
  }
}

Notice it contains the type and description which we covered above. It also includes a config key, this is actually the tuning we can apply to the Auth/Mount. This is applied to the API endpoint sys/auth/<name>/tune so it seems to make sense to store it in this file.

So far so good, but now we come onto configuring the Auth or Mount. There’s no standard pattern here and they sometimes require secrets. We decided to exclude any secrets from the config. These can be applied as manual steps later. We can however apply some configuration.

For example we can set the organisation for the Github auth, but we don’t wouldn’t want to set the AWS credentials for the EC2 auth backend.

The API endpoint for applying configuration to Auths is auth/<name>/config and Mounts is <name>/config/<config_item>. We decided to group our mounts under a mounts directory, veering slightly from the file structure matching the API path.

Our directory structure now looks a little like this:

└── sys
|   └── auth
|   |   └── github-spaceape.json
|   └── mounts
|       └── spaceape-pki
└── auth
|   └── github-spaceape
|       └── config.json
└── mounts
    └── spaceape-pki
        └── config
        |   └── urls.json
        └── roles
            └── example-role.json

This is where mapping the file path as the API comes into it’s own: we can handle any of the Auths or Mounts without having explicitly write code for the exact type, we just have to get the structure correct.

Gotchas

There are a few things to look out for.

  1. When verifying our changes were applied Vault sometimes gives you more back than you expect. We just verify the fields we pass in.
  2. Time based fields (like the various ttl fields) are not always returned in the same format, you may get the time in seconds, or days and hour, etc. We found the chronic_duration gem useful for parsing the times for easy comparison.
  3. You may find some configuration on an Auth or Mount may have to be applied in a specific order, this is where we would have to write custom code to handle that particular type of Auth or Mount. Perhaps a configuration file could define the order in which to apply certain configuration.

Continuous Integration

When we check in code, a Jenkins job is triggered which will run our tests. As mentioned earlier we run our tests inside of Docker containers, this means that we don’t have to worry about having clashing versions of gems from other Ruby based builds we have on Jenkins.

More interesting to us is that we can now test our actual Vault configuration. So when we add a new policy we know it applies correctly. Again we use Jenkins to do this. Each time we commit a change to our Vault configuration git repository we trigger a build which attempts to apply the configuration to an instance of Vault running in Dev mode. If any of the configuration fails we can be prompted through Slack to see what caused it.

It’s still up to us to apply the changes to the production instance of Vault after the Jenkins tests have run successfully. This is mainly because we don’t want to give privileged Vault tokens out to Jenkins.

Final Words

The process we’ve described above for managing Vault configuration is just one way you could go about solving this problem. From our experience, it works – we can test our configuration and apply it in a repeatable and programmatic way.

It is however still a work in progress and the will doubtless be a few problems to overcome as we continue our development. We hope to Open Source the code in the future, but right now we feel there are still some improvements, for instance at the moment we test against Vault 0.6.5 (the latest release is 0.7.3) and we’ve only tested against a handful of Mounts and Authentication backends.

 

Chef and Consul

Here at Spaceape, our configuration management tool of choice is Chef. We are also big fans of Consul, the distributed key-value-store-cum-service-discovery-tool from the good folks at Hashicorp. It might not be immediately clear why the two technologies should be mentioned in the same paragraph, but here is the story of how they became strange bedfellows.

consul - logo-gradient-94098a4aIn a previous blog post, we told the story of our experiences with Chef. That post goes into far greater detail, but suffice it to say that our infrastructure code base was not always as reliable, configurable or even predictable as it is now. We found ourselves in a dark place where Chef was run on an ad-hoc basis, often with fingers well and truly crossed. To wrest back control and gain confidence we needed to be able to run it on a 15-minute interval.

Simple, you say, write a cron-job. Well yes, that is true. But it’s only a very small part of the story. We would find occasions where initiating a seemingly harmless Chef run could obliterate a server, and yet the same run on an ostensibly similar server would reach the end without incident. In short we had little confidence in our code, certainly not enough to start running it on Production. Furthermore – and this applies still today – often we really don’t want to allow Chef to run simultaneously across a given environment. For example, we may push a change that restarts our game-serving process. I don’t need to expand on what would happen if that change ran across Production at 12:15 one day…

Wouldn’t it be nice, we asked ourselves, if we had some sort of global locking mechanism? To prevent us propagating potential catastrophes? Not only would this allow us to push infrastructure changes through our estate, it might just have other benefits…

Enter Consul!

Like all the Hashicorp products we’ve tried, Consul is solid. Written in Go, it employs the Raft consensus algorithm atop the Serf gossip protocol to provide a highly available distributed system that even passes the Jepsen test. It is somewhat of a swiss army knife that aims to replace or augment your existing service delivery, configuration management and monitoring tools.

The utility we decided to employ for the locking mechanism was the key-value store. We built our own processes and tooling around this, as we’ll see, but it should be noted that more recent versions of Consul than we had available at the time actually have a semaphore offering.

Stored in Consul, we have a number of per-tier, per-environment key-spaces. As an example:

chef/logstash/es-indexer

Logstash is the environment, es-indexer the service. Within this keyspace, the only pre-requisite is a value for the maximum number of concurrent Chef runs we wish to allow, which we call max_concurrent:

chef/logstash/es-indexer/max_concurrent

Generally this value is set to 1, but on some larger environments we set it higher.

When a server wishes to run Chef the first thing it does is to retrieve this max_concurrent value. Assuming the value is a positive integer (a value of -1 will simply allow all Chef runs) it then attempts to acquire a ‘slot’. A slot is obtained by checking this key:

chef/logstash/es-indexer/current

Which is a running count of the number of hosts in the tier currently running Chef. It’s absence denotes zero. If the current value is less than the `max_concurrent` value the server increments the counter and registers itself as ‘running’ by creating a key like this, the value of which is a timestamp:

chef/logstash/es-indexer/running/hostname.of.the.box

The sharper amongst you will have noticed a problem here. What if two hosts try to grab the slot at the same time? To avoid this happening we use Consul’s Check-and-Set feature. The way this works is that, upon the initial read of the current value, a ModifyIndex is retrieved along with the actual Value. If the server decides that current < max_concurrent it attempts to update current by passing a `?cas=ModifyIndex` parameter. If the ModifyIndex does not match that which is stored on Consul, it indicates that something else has updated it in the meantime, and the write fails.

With the slot obtained, the Chef run is allowed to commence. Upon success, the running key is removed and the counter decremented. If however the run fails the lock (or the ‘slot’) is held, no further hosts on the tier are able to acquire a slot, and Chef runs are thereby suspended.

Our monitoring tools, by checking the timestamp of the running key, are able to warn us when locks have been held for a certain period of time (i.e. longer than the average Chef run takes) and failures are contained to one (or rather max_concurrent) hosts.

And so… this all works rather well. Many has been the time when we’d look at one another, puff our cheeks, and say, “Thank goodness for the locking system!” Over time it has allowed us to unpick our infrastructure code and get to the smug position we find ourselves. Almost never do we see locks being held for anything but trivial problems (Chef server timeouts for instance), nothing that a little judicious sleep-and-retry doesn’t fix. It also gives us great control over when and what runs Chef as we can easily disable Chef runs for a given tier by setting max_concurrent to 0.

But the purists amongst you will no doubt be screaming something about a CI system, or better unit tests, or something, or something. And you’d be right. The truth is that we were unable to shoehorn a CI system into infrastructure code which was underpinning a live game, in which we did not have complete confidence. Having the backup-parachute of the mechanism described above, though, has enabled us to address this. But that doesn’t mean we’ll be discarding it. On the contrary, it will form the backbone of our CI system, facilitating the automatic propagation of infrastructure code throughout our estate. More on that to follow.