Deep Reinforcement Learning for Small Teams

On Thursday, October 12th, we hosted a tech event at our HQ to share some of the shiny new toys we’ve been building.

The office was jam-packed, so we’ve written up our talks for those that couldn’t make it. We’ve got more events in the pipeline, so be sure to follow us on Twitter (@SpaceApeGames) so you can get a heads-up before the next event fills up.

This is what we talked about this time:

  • Scalability & Big Data Challenges In Real Time Multiplayer Games, by Yan Cui and Tony Yang, Space Ape Games
  • Advanced Machine Learning For Small Teams, by Atiyo Ghosh and Dennis Waldron, Space Ape Games
  • Serverless: The Next Evolution of Cloud Computing, by Dr. Steve Turner, Amazon Web Services

Check out Tony and Yan’s post on creating a real-time multiplayer stack!

Dennis and I talked about our recent adventures with reinforcement learning (see video at the bottom of this post). We had an ambitious agenda:

  • How reinforcement learning can help our customers get what they want, when they want it.
  • An overview of Deep Mind’s deep-Q learning algorithm, and how we adapted it to our use case.
  • How we used a serverless stack to minimise friction in building, maintaining and training the model. We are a small team, busy building new things: low-maintenance stacks are our friends.
  • How our choice of stack determined our choice of deep learning framework.

It’s a lot of material to cover in a short talk, but we managed to answer some questions at the pub afterwards. For those of you with questions who couldn’t make it there, leave a note in the comments 🙂

Tackling scalability challenges in realtime multiplayer games with Akka and AWS

We hosted a tech event at our HQ last week and welcomed over 200 attendees to join us for an evening of talks and networking. It was an absolute blast to meet so many talented people all at once! We plan to host a series of similar events in the future so keep coming back here or follow us on Twitter (@SpaceApeGames) to listen for announcements.

We had three talks on the night, covering a range of interesting topics:

  • Scalability & Big Data Challenges In Real Time Multiplayer Games, by Yan Cui and Tony Yang, Space Ape Games
  • Advanced Machine Learning For Small Teams, by Atiyo Ghosh and Dennis Waldron, Space Ape Games
  • Serverless: The Next Evolution of Cloud Computing, by Dr. Steve Turner, Amazon Web Services

The recording of the talk Tony and I gave on building realtime multiplayer games is now online (see the end of the post), along with the accompanying slides.

In this talk we discussed the market opportunity for realtime multiplayer games, the technical challenges you have to face, and the tradeoffs to keep in mind when making decisions such as:

  • do you deploy infrastructure globally or run it from a single (AWS) region?
  • do you build your own networking stack or use an off-the-shelf solution?
  • do you go with a server authoritative approach or implement a lock-step system?
  • how do you write a highly performant multiplayer server on the JVM?
  • how do you load test this system?
  • and many more.

Over the next few weeks we’ll publish the rest of the talks, so don’t forget to check back here once in a while 😉

Building a Custom Terraform Provider for Wavefront

At Space Ape we’re increasingly turning to Golang for creating tools and utilities, for example – De-comming EC2 Instances With Serverless and Go. Inevitably we’ll need to interact with our metric provider – Wavefront. To this end, our colleague Louis has been working on a Go client for interacting with the Wavefront API, which allows us to query Wavefront and create resources such as Alerts and Dashboards. Up until now, we’ve been configuring these components by hand, which worries us – what happens if they disappear or are changed? How do we revert to a known good version or restore a lost Dashboard?

We were set to start creating our own tool for managing Wavefront resources, but as luck would have it Hashicorp released version 0.10.0 of Terraform which splits providers out from the main Terraform code base and allows you to load custom (not managed by Hashicorp) providers without recompiling Terraform.

So we set about creating a custom provider and have so far implemented Alerts, Alert Targets and Dashboards and fully intend to continue adding functionality to both the SDK and the provider in the future.

Now creating an Alert is as simple as:

resource "wavefront_alert" "a_terraform_managed_alert" {
 name                   = "Terraform Managed Alert"
 target                 = "test@example.com"
 condition              = "ts()"
 display_expression     = "ts()"
 minutes                = 4
 resolve_after_minutes  = 4
 additional_information = "This alert is triggered because..."
 severity               = "WARN"

 tags = [
   "terraform",
 ]
}

You can find the latest released version, complete with a binary, here.

Creating our own provider for Wavefront means that we get all the benefits of Terraform (resource graphs, plans, state, versioning and locking) with just a little bit of effort required from us. Hashicorp provides a number of helper methods, which means that writing and testing the provider is relatively simple.

Another benefit of writing a provider is that we can use the import functionality of Terraform to import our existing resources into state. Hopefully Hashicorp will improve this to generate Terraform code soon; in the meantime, it shouldn’t be too difficult to script turning a state file (JSON) into Terraform’s HCL.

Using the Provider

Terraform is clever enough to go and fetch the officially supported providers for you when you run terraform init. Unfortunately, with custom providers, it’s a little more complicated. You need to build the binary (we upload the compiled binary with each of our releases) and place it in ~/.terraform.d/plugins/darwin_amd64/ (or the equivalent for your system). Now when you run terraform init it will be able to find the plugin. After this the setup is pretty simple:

provider "wavefront" {
 address = "foo.wavefront.com"
 token   = "wavefront_token"
}

You can export the address and token as environment variables (WAVEFRONT_ADDRESS and WAVEFRONT_TOKEN respectively) to avoid committing them to source control (we highly recommend you do this for the token!).

Writing your own Provider

If you fancy having a go at writing your own provider then this blog post by Hashicorp is a good way to get started. I’d also recommend taking a look at the Hashicorp supported providers and using them as a reference when writing your own.

 

How to load test a realtime multiplayer mobile game with AWS Lambda and Akka

Over the last 12 months, we have seen a number of team-based multiplayer games hit the market as companies look to replicate the success of Tencent’s King of Glory (known as Arena of Valor in the west), one of the top grossing games in the world in 2017.

Even our partner Supercell has recently dipped into the genre with Brawl Stars, which offers a different take on the traditional MOBA (Multiplayer Online Battle Arena) formula: it is built with mobile in mind, favouring simple controls and maps as well as shorter matches.

Here at Space Ape Games, we have been exploring ideas for a competitive multiplayer game, which is still at the prototype stage, so I can’t talk about it here. However, I can talk about how we use AWS Lambda to load test our homegrown networking stack.

Why Lambda?

The traditional approach of using EC2 servers to drive the load testing has several problems:

  • slow to start : any sizeable load test requires many EC2 instances to generate the desired load. Since it costs you to keep these EC2 instances around, you’ll likely only spawn them when you need to run a load test, which means a 10–15 min lead time before every test just to wait for the EC2 instances to be ready.
  • wastage : when the load test is short-lived (say, < 1 hour) you can incur a lot of wastage, because EC2 instances are billed by the hour with a minimum charge of one hour (per-second billing is coming to non-Windows EC2 instances in Oct 2017, which will address this problem).
  • hard to deploy updates : to update the load test code itself (perhaps to introduce new behaviours to bot players), you need to invest in infrastructure for updating the load test code on the running EC2 instances. This doesn’t have to be difficult (you probably already have similar infrastructure in place for your game servers), but it’s yet another distraction that I would happily avoid.

AWS Lambda addresses all of these problems.

It does introduce its own limitations — especially the 5 min execution time limit. However, as I have written before, you can work around this limit by writing your Lambda function as a recursive function and taking advantage of container reuse to persist local state from one invocation to the next.
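The pattern looks roughly like this (a TypeScript sketch for illustration only; our actual load test client is a JVM function, and runOneBatch is a made-up placeholder):

import { Lambda } from "aws-sdk";

const lambda = new Lambda();

interface LoadTestState {
  startedAt: number;  // epoch ms of the very first invocation
  batchesRun: number; // work completed so far, accumulated across invocations
}

// Does load-test work until ~30s of the 5 minute limit remains, then re-invokes
// itself asynchronously, passing its bookkeeping state in the payload. Heavier
// local state (e.g. open TCP connections) can live in module scope and survive
// between invocations thanks to container reuse.
export async function handler(
  event: Partial<LoadTestState>,
  context: { functionName: string; getRemainingTimeInMillis: () => number }
): Promise<LoadTestState> {
  const state: LoadTestState = {
    startedAt: event.startedAt ?? Date.now(),
    batchesRun: event.batchesRun ?? 0,
  };

  while (context.getRemainingTimeInMillis() > 30000) {
    await runOneBatch(); // placeholder for a unit of load-test work
    state.batchesRun += 1;
  }

  // Recurse: the next invocation carries on exactly where this one left off.
  // (A real implementation would stop recursing once the test duration is up.)
  await lambda
    .invoke({
      FunctionName: context.functionName,
      InvocationType: "Event", // asynchronous, fire and forget
      Payload: JSON.stringify(state),
    })
    .promise();

  return state;
}

async function runOneBatch(): Promise<void> {
  // e.g. simulate a slice of a match and record RTT samples
}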

I’m a big fan of the work the Nordstrom guys have done with the serverless-artillery project. Unfortunately we’re not able to use it here because the game (the client app written in Unity3D) converses with the multiplayer server in a custom protocol via TCP, and in the future that conversation would happen over Reliable UDP too.

Akka

Our multiplayer server is written in Scala with the Akka framework. To help us optimize our implementation we collect lots of metrics about the Akka system as well as the JVM — GC, heap, CPU usage, memory usage, etc.

The Kamon framework is a big help here; it made quick work of getting us insight into the running of the Akka system — no. of actors, no. of messages, how much time a message spends waiting in the mailbox, how much time we spend processing each message, etc.

All of these data points are sent to Wavefront, via Telegraf.

We also have a standalone Akka-based load test client that can simulate many concurrent players. Each player is modelled as an actor, which simulates the behaviour of the Unity3D game client during a match:

  1. find a multiplayer match
  2. connect to the multiplayer server and authenticate itself
  3. play a 4-minute match, sending inputs 15 times a second
  4. report “client side” telemetries so we can collect the RTT (Round-Trip Time) as experienced by the client, and use these telemetries as a qualitative measure for our networking stack

In the load test client, we use the t-digest algorithm to minimise the memory footprint required to track the RTTs during a match. This allows us to simulate more concurrent players in a memory-constrained environment such as a Lambda function.

AWS Lambda + Akka

We can run the load test client inside a Java8 Lambda function and simulate 100 players per invocation. To simulate X concurrent players, we create X/100 concurrent executions of the function via SNS (which has a one-invocation-per-message policy).

To create a gradual ramp up in load, a recursive Orchestrator function gradually dials up the no. of concurrent executions by publishing more messages into SNS, each of which triggers a new recursive load test client function.
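The orchestration step might look roughly like this (a TypeScript sketch; the event shape and names are illustrative, not our production code, though the 100-players-per-invocation convention is the one described above):

import { SNS } from "aws-sdk";

const sns = new SNS();
const PLAYERS_PER_INVOCATION = 100;

interface OrchestratorEvent {
  topicArn: string;        // SNS topic the load test client function is subscribed to
  targetPlayers: number;   // total concurrent players we want to reach
  currentPlayers: number;  // players already being simulated
  rampStepPlayers: number; // how many players to add in this step
}

// Publishes one SNS message per batch of 100 players; SNS invokes the subscribed
// load test client function once per message, so each message adds ~100 players.
export async function handler(event: OrchestratorEvent): Promise<number> {
  const desired = Math.min(
    event.currentPlayers + event.rampStepPlayers,
    event.targetPlayers
  );
  const newInvocations = Math.ceil(
    (desired - event.currentPlayers) / PLAYERS_PER_INVOCATION
  );

  await Promise.all(
    Array.from({ length: newInvocations }, (_, batch) =>
      sns
        .publish({
          TopicArn: event.topicArn,
          Message: JSON.stringify({ players: PLAYERS_PER_INVOCATION, batch }),
        })
        .promise()
    )
  );

  // The real Orchestrator re-invokes itself until desired === targetPlayers,
  // producing the gradual ramp up.
  return desired;
}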

A LoadTest function, triggered by API Gateway, allows us to easily kick off a load test from a Jenkins pipeline.

Using the push-pull pattern (see this post for details), we can track the progress of all the concurrent load test client functions. When they have all finished simulating their matches, we kick off the Aggregator function.

The Aggregator function collects the RTT metrics published by the load test clients and produces a report detailing the various RTT percentiles, like the one below:

{
  "loadTestId": "62db5790-da53-4b49-b673-0f60e891252a",
  "status": "completed",
  "successful": 43,
  "failed": 2,
  "metrics": {    
    "client-interval": {      
      "count": 7430209,
      "min": 0,
      "max": 140,
      "percentile80": 70.000000193967,
      "percentile90": 70.00001559848,
      "percentile99": 71.000000496589,
      "percentile99point9": 80.000690623146,
      "percentile99point99": 86.123610689566
    },    
    "RTT": {      
      "count": 744339,
      "min": 70,
      "max": 320,
      "percentile80": 134.94761466541,
      "percentile90": 142.64720935496,
      "percentile99": 155.30086042676,
      "percentile99point9": 164.46137375328,
      "percentile99point99": 175.90215268392
    }
  }
}

If you would like to learn more about the technical challenges in developing successful mobile games, come join us for an evening of talks, drinks, food and networking in our office on the 12th Oct.

We’re running a free event in partnership with AWS where we will talk about:

  • the opportunities and challenges in building a realtime multiplayer game
  • data science and machine learning
  • serverless with AWS Lambda (by Dr Steve Turner from AWS)

Get your free ticket here!

The problems with DynamoDB Auto Scaling and how it might be improved

Here at Space Ape Games we developed some in-house tech to auto scale DynamoDB throughput and have used it successfully in production for a few years. It’s even integrated with our LiveOps tooling and scales up our DynamoDB tables according to the schedule of live events. This way, our tables are always provisioned just ahead of that inevitable spike in traffic at the start of an event.

Auto scaling DynamoDB is a common problem for AWS customers; I have personally implemented similar tech to deal with this problem at two previous companies, and have applied the same technique to auto scale Kinesis streams too.

When AWS announced DynamoDB Auto Scaling we were excited. However, the blog post that accompanied the announcement illustrated two problems:

  • the reaction time to scaling up is slow (10–15 mins)
  • it did not scale sufficiently to maintain the 70% utilization level

Notice the high no. of throttled operations despite the scaling activity. If you were scaling the table manually, would you have settled for this result?

It looks as though the author’s test did not match the kind of workload that DynamoDB Auto Scaling is designed to accommodate: traffic that changes gradually and somewhat predictably, rather than spiking sharply.

In our case, we also have a high write-to-read ratio (typically around 1:1) because every action a player performs in a game changes their state in some way, so unfortunately we can’t use DAX as a get-out-of-jail-free card.

How DynamoDB Auto Scaling works

When you modify the auto scaling settings on a table’s read or write throughput, it automatically creates/updates CloudWatch alarms for that table — four for writes and four for reads.

DynamoDB auto scaling uses CloudWatch alarms to trigger scaling actions. When the consumed capacity units breach the utilization level on the table (which defaults to 70%) for 5 consecutive minutes, it scales up the corresponding provisioned capacity units.

Problems with the current system, and how it might be improved

From our own tests we found DynamoDB’s lacklustre performance at scaling up is rooted in 2 problems:

  1. The CloudWatch alarms require 5 consecutive threshold breaches. When you take into account the latency in CloudWatch metrics (which are typically a few minutes behind), it means scaling actions occur up to 10 minutes after the specified utilization level is first breached. This reaction time is too slow.
  2. The new provisioned capacity units are calculated from consumed capacity units rather than the actual request count. The consumed capacity units are themselves constrained by the provisioned capacity units, even though it’s possible to temporarily exceed them with burst capacity. What this means is that once you’ve exhausted the saved burst capacity, the actual request count can start to outpace the consumed capacity units, and scaling up is not able to keep pace with the increase in actual request count (see the rough numbers below). We will see the effect of this in the results from the control group later.
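To put rough, illustrative numbers on problem 2: with a 70% target utilization and 100 provisioned write units, suppose actual demand jumps to 250 writes/s. Consumed capacity can only exceed 100 units/s briefly, by spending burst capacity, so a calculation based on consumed units lands somewhere around 100 / 0.7 ≈ 143 units, whereas keeping 250 writes/s at 70% utilization would need roughly 250 / 0.7 ≈ 357 units.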

Based on these observations, we hypothesize that you can make two modifications to the system to improve its effectiveness:

  1. trigger scaling up after 1 threshold breach instead of 5, which is in line with the mantra of “scale up early, scale down slowly”.
  2. trigger scaling activity based on actual request count instead of consumed capacity units, and calculate the new provisioned capacity units using actual request count as well.

As part of this experiment, we also prototyped these changes (by hijacking the CloudWatch alarms) to demonstrate their improvement.

Testing Methodology

The most important thing for this test is a reliable and reproducible way of generating the desired traffic patterns.

To do that, we have a recursive function that will make BatchPut requests against the DynamoDB table under test every second. The items per second rate is calculated based on the elapsed time (t) in seconds so it gives us a lot of flexibility to shape the traffic pattern we want.

Since a Lambda function can only run for a max of 5 mins, when context.getRemainingTimeInMillis() is less than 2000 the function will recurse and pass the last recorded elapsed time (t) in the payload for the next invocation.
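The shape of that function is roughly as follows (a TypeScript sketch; the rate function and the batchPut helper are illustrative stand-ins, not our actual code):

import { DynamoDB, Lambda } from "aws-sdk";

const dynamo = new DynamoDB.DocumentClient();
const lambda = new Lambda();

// Illustrative rate function: ramp from 25 writes/s towards a 300 writes/s peak
// over roughly 45 minutes. Shaping the curve is just a matter of changing this
// function of the elapsed time t.
function writesPerSecond(t: number): number {
  return Math.min(300, 25 + Math.floor(t / 10));
}

export async function handler(
  event: { t?: number; table: string },
  context: { functionName: string; getRemainingTimeInMillis: () => number }
): Promise<void> {
  let t = event.t ?? 0;

  while (context.getRemainingTimeInMillis() > 2000) {
    await batchPut(event.table, writesPerSecond(t)); // this second's worth of writes
    t += 1;
    await new Promise((resolve) => setTimeout(resolve, 1000)); // crude 1s pacing
  }

  // Nearly out of time: recurse, carrying the elapsed time t in the payload.
  await lambda
    .invoke({
      FunctionName: context.functionName,
      InvocationType: "Event",
      Payload: JSON.stringify({ t, table: event.table }),
    })
    .promise();
}

// BatchWriteItem accepts at most 25 items per call, so chunk accordingly.
// The item contents don't matter for this test; random ids will do.
async function batchPut(table: string, n: number): Promise<void> {
  const items = Array.from({ length: Math.round(n) }, () => ({
    PutRequest: { Item: { id: Math.random().toString(36).slice(2) } },
  }));
  for (let i = 0; i < items.length; i += 25) {
    await dynamo
      .batchWrite({ RequestItems: { [table]: items.slice(i, i + 25) } })
      .promise();
  }
}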

The result is a continuous, smooth traffic pattern you see below.

We tested with 2 traffic patterns we see regularly.

Bell Curve

This should be a familiar traffic pattern for most — a slow & steady buildup of traffic from the trough to the peak, followed by a faster drop off as users go to sleep. After a period of steady traffic throughout the night things start to pick up again the next day.

For many of us whose user base is concentrated in the North America region, the peak is usually around 3–4am UK time — the more reason we need DynamoDB Auto Scaling to do its job and not wake us up!

This traffic pattern is characterised by a) steady traffic at the trough, b) slow & steady build up towards the peak, c) fast drop off towards the trough, and repeat.

Top Heavy

This sudden burst of traffic is usually precipitated by an event — a marketing campaign, a promotion by the app store, or in our case a scheduled LiveOps event.

In most cases these events are predictable and we scale up DynamoDB tables ahead of time via our automated tooling. However, in the unlikely event of an unplanned burst of traffic (and it has happened to us a few times) a good auto scaling system should scale up quickly and aggressively to minimise the disruption to our players.

This pattern is characterised by a) a sharp climb in traffic, b) a slow & steady decline, c) traffic staying at a steady level until the anomaly finishes and it goes back to the Bell Curve again.

We tested these traffic patterns against several utilization level settings (default is 70%) to see how it handles them. We measured the performance of the system by:

  • the % of successful requests (ie. consumed capacity / request count)
  • the total no. of throttled requests during the test

These results will act as our control group.

We then tested the same traffic patterns against the 2 hypothetical auto scaling changes we proposed above.

To prototype the proposed changes we hijacked the CloudWatch alarms created by DynamoDB auto scaling using CloudWatch events.

When a PutMetricAlarm API call is made, our change_cw_alarm function is invoked and replaces the existing CloudWatch alarms with the relevant changes — ie. for hypothesis 1, setting EvaluationPeriods to 1 so the alarm fires after a single 1-minute breach.

To avoid an invocation loop, the Lambda function only makes changes to the CloudWatch alarm if the EvaluationPeriods has not already been changed to 1.

The change_cw_alarm function changed the breach threshold for the CloudWatch alarms to 1 min.

For hypothesis 2, we have to take over the responsibility of scaling up the table, as we need to calculate the new provisioned capacity units using a custom metric that tracks the actual request count. This is why the AlarmActions for the CloudWatch alarms are also overridden here.
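A sketch of what the hypothesis 1 half of change_cw_alarm might look like (TypeScript; the event shape is simplified, and the field handling is illustrative rather than our actual code):

import { CloudWatch } from "aws-sdk";

const cw = new CloudWatch();

// Triggered by a CloudWatch Event for the PutMetricAlarm API call (via CloudTrail).
// The shape below is simplified; the real event nests the original request under
// detail.requestParameters.
interface PutMetricAlarmEvent {
  detail: {
    requestParameters: {
      alarmName: string;
      namespace: string;
      metricName: string;
      statistic: string;
      period: number;
      evaluationPeriods: number;
      threshold: number;
      comparisonOperator: string;
      dimensions?: { name: string; value: string }[];
      alarmActions?: string[];
    };
  };
}

export async function handler(event: PutMetricAlarmEvent): Promise<void> {
  const p = event.detail.requestParameters;

  // Guard against an invocation loop: our own PutMetricAlarm call below would
  // trigger this function again, so do nothing if the alarm is already as we want it.
  if (p.evaluationPeriods === 1) {
    return;
  }

  await cw
    .putMetricAlarm({
      AlarmName: p.alarmName,
      Namespace: p.namespace,
      MetricName: p.metricName,
      Statistic: p.statistic,
      Period: p.period,
      EvaluationPeriods: 1, // hypothesis 1: scale after a single threshold breach
      Threshold: p.threshold,
      ComparisonOperator: p.comparisonOperator,
      Dimensions: p.dimensions?.map((d) => ({ Name: d.name, Value: d.value })),
      AlarmActions: p.alarmActions, // for hypothesis 2 these would be overridden too
    })
    .promise();
}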

Result (Bell Curve)

The test is set up as follows:

  1. table starts off with 50 write capacity units
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then increases to peak level (300 writes/s) at a steady rate over the next 45 mins
  4. traffic drops off back to 25 writes/s at a steady rate over the next 15 mins
  5. traffic holds steady at 25 writes/s

All the units in the diagrams are SUM/min, which is how CloudWatch tracks ConsumedWriteCapacityUnits and WriteThrottleEvents; I had to normalise ProvisionedWriteCapacityUnits (which is tracked as a per-second unit) to make them consistent.

Let’s start by seeing how the control group (vanilla DynamoDB auto scaling) performed at different utilization levels from 30% to 80%.

I’m not sure why the total consumed units and total request count metrics didn’t match exactly when the utilization is between 30% and 50%, but seeing as there were no throttled events I’m going to put that difference down to inaccuracies in CloudWatch.

I make several observations from these results:

  1. At 30%-50% utilization levels, write ops are never throttled — this is what we want to see in production.
  2. At 60% utilization level, the slow reaction time (problem 1) caused writes to be throttled early on as the system adjusted to the steady increase in load, but it was eventually able to adapt.
  3. At 70% and 80% utilization levels, things really fell apart. The growth in the actual request count outpaced the growth of consumed capacity units, and more and more write ops were throttled as the system failed to adapt to the new level of actual utilization (as opposed to “allowed” utilization measured by consumed capacity units, ie problem 2).

Hypothesis 1 : scaling after 1 min breach

Some observations:

  1. At 30%-50% utilization level, there’s no difference to performance.
  2. At 60% utilization level, the early throttled writes we saw in the control group are now addressed, because we decreased the reaction time of the system.
  3. At 70%-80% utilization levels, there is negligible difference in performance. This is to be expected, as the poor performance in the control group is caused by problem 2, so improving reaction time alone is unlikely to significantly improve performance in these cases.

Hypothesis 2 : scaling after 1 min breach on actual request count

Scaling on actual request count and using actual request count to calculate the new provisioned capacity units yields amazing results. There were no throttled events at 30%-70% utilization levels.

Even at 80% utilization level both the success rate and total no. of throttled events have improved significantly.

This is an acceptable level of performance for an auto scaling system, one that I’d be happy to use in a production environment. That said, I’d still err on the side of caution and choose a utilization level at or below 70% to give the table enough headroom to deal with sudden spikes in traffic.

Results (Top Heavy)

The test is set up as follows:

  1. table starts off with 50 write capacity units
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then jumps to peak level (300 writes/s) at a steady rate over the next 5 mins
  4. traffic then decreases at a rate of 3 writes/s per minute

Once again, let’s start by looking at the performance of the control group (vanilla DynamoDB auto scaling) at various utilization levels.

Some observations from the results above:

  1. At 30%-60% utilization levels, most of the throttled writes can be attributed to the slow reaction time (problem 1). Once the table started to scale up the no. of throttled writes quickly decreased.
  2. At 70%-80% utilization levels, the system also didn’t scale up aggressively enough (problem 2). Hence we experienced throttled writes for much longer, resulting in a much worse performance overall.

Hypothesis 1 : scaling after 1 min breach

Some observations:

  1. Across the board the performance has improved, especially at the 30%-60% utilization levels.
  2. At 70%-80% utilization levels we’re still seeing the effect of problem 2 — not scaling up aggressively enough. As a result, there’s still a long tail to the throttled write ops.

Hypothesis 2 : scaling after 1 min breach on actual request count

Similar to what we observed with the Bell Curve traffic pattern, this implementation is significantly better at coping with sudden spikes in traffic at all utilization levels tested.

Even at 80% utilization level (which really doesn’t leave you with a lot of head room) an impressive 94% of write operations succeeded (compared with 73% recorded by the control group). Whilst there is still a significant no. of throttled events, it compares favourably against the 500k+ count recorded by the vanilla DynamoDB auto scaling.

Conclusions

I like DynamoDB, and I would like to use its auto scaling capability out of the box, but it just doesn’t quite match my expectations at the moment. I hope this post provides sufficient proof (as you can see from the data above) that there is plenty of room for improvement with relatively small changes needed from AWS.

Feel free to play around with the demo, all the code is available here.

Vault Configuration as Code

Here at Space Ape we use Vault extensively. All of our instances authenticate with Vault using the EC2 auth backend which allows us to restrict the scope of secrets any instance has access to.

Behind Vault, we use Consul as a backend to persist our secrets with a good level of durability and make use of Consul’s snapshot feature to create backups, which means we can restore both Consul and Vault from the backup if the worst case occurred.

Where we’ve struggled with Vault is in managing the configuration: which policies, roles and auth backends do we have? Which of our AWS accounts are set up for the EC2 auth, and how do we update or replicate any of these configurations? If we had to set up a new instance of Vault, or recover an existing one, how long would it take us to get everything set up? Probably a lot longer than it should.

This isn’t something we accept elsewhere in our estate: we use CloudFormation to manage precisely how our AWS infrastructure looks; we use Chef to manage exactly how our instances are set up and applications are configured. All of this configuration is stored in Git. In short, we treat our configuration as code.

For those looking to manage configuration in Vault, help is at hand. In November 2016 Hashicorp’s Seth Vargo penned a blog post that caught our interest – Codifying vault policies and configuration – in which he describes how to use the Vault API to apply configuration from files. There are a few things we can learn from Seth’s post:

  • The API calls are idempotent
  • The script ignores the response as you’ll often get non-200 responses (for instance if a mount already exists)
  • He maps the directory structure to the API, which makes it easy to rewrite the code in any language without having to change your directory structure.
  • API calls need to be applied in the correct order (e.g. an Auth backend must exist before you can apply configuration to it).
  • You can integrate this into your CI lifecycle.

A couple of things that are missing:

  • Code testing
  • Verification that our API calls were successful.

Taking Seth’s blog post as our starting point, we set out to implement configuration-managed Vault clusters using the API.

We use a lot of Ruby here so it makes sense to create a gem to apply our configuration for us and we can take the opportunity to apply unit tests. We can use Jenkins to test applying our actual configuration.

Requirements

  • Code should be tested
  • We should verify that our config has been applied correctly
  • We want a CI pipeline for our configuration.

We quickly realised that a lot of the process is repeated for each API endpoint:

  1. Locate files containing the configuration
  2. Parse the files containing the configuration
  3. Apply the configuration
  4. Verify the configuration

We have a Setup class that handles creating an instance of the Vault Client and locating the relevant files for each configuration type.

We created a Base class that our implementation (policies, auths backends etc) classes can inherit that will parse, apply and verify configuration.

Setup class

To create a Vault client, it’s as simple as using the Vault gem and providing the usual configuration details, such as the address and a token.

We also have methods to locate the relevant files for any configuration item, such as policies. We simply need to supply the path to the directory in which the configuration files reside.

Base class

In the Base class we start by parsing files that the Setup class located for us. We accept hcl, yaml or json files and parse them into a hash.

We then call apply and verify methods which are implemented in classes specific to the configuration item such as Policies or Auths.

Policies

Applying policies is a good starting point as they represent a lot of our configuration and are referenced by other sections of configuration.

We save time by inheriting the Base class we discussed above, and we have an instance of the Setup class so that we can locate the files we need and have a Vault client to use.

We then implement the apply and verify methods. For a Policy the apply method is very simple: it uses the name of the file as the name of the policy, and the contents of the file (which we translate into Json) as the body of the policy:

client.sys.put_policy(name, hash.to_json)

Next we verify that our Policy was correctly applied. The first step of this is to request the policy from Vault, which we can simply ask for by name (also the filename).

client.sys.policy(name)

Then we can:

  1. Check that we received a Policy and not an error or an empty blob of Json.
  2. That the Json we receive matches the Json that we sent. We use the JsonCompare gem to verify each key value pair that is returned.

└── sys
    └── policy
        └── admins.hcl

The directory structure in which we store our policies. /sys/policy/admins would be the API path to POST a policy to if you wanted to use the API directly.

path "*" {
 policy = "sudo"
}

A (really bad) example of a Vault policy that admins.hcl might contain. We parse this as HCL and POST it to /sys/policy/admins.

Testing

One of our requirements was to write tests. Below are our tests for policies.

require "spec_helper"

describe Spaceape::VaultSetup::Policy do
  subject do
    Spaceape::VaultSetup::Policy.new(
      Spaceape::VaultSetup::Setup.new(
        vault_address: "http://vault:8200",
        ssl_verify: false,
        config_dir: "spec/fixtures/main",
        vault_token: vault_token
      ),
      false
    )
  end

  let(:test_policy) do
    {
      "path": {
        "auth/app-id/map/user-id/*": {
          "policy": "write"
        }
      }
    }
  end

  it "applies and verifies a policy" do
    subject.apply("test-policy", test_policy)
    expect { subject.verify("test-policy", test_policy) }
      .to_not raise_error
  end

  it "identifies invalid policy" do
    subject.apply("test-policy", test_policy)
    wrong_role = test_policy.dup
    wrong_role[:path] = "/auth/app-id/map/uuuuuu/*"
    expect { subject.verify("test-policy", wrong_role) }
      .to raise_error(Spaceape::VaultSetup::ItemMismatchError)
  end

  it "applies all policies in config_dir" do
    subject.apply_items(subject.policy_files)
    expect(subject.client.sys.policies)
      .to include("test-policy2", "test-policy")
  end
end

From the test above you can see that we test against a Vault server at vault:8200. We run these tests in Docker and make use of Docker Compose, so we can create a Vault server in dev mode and then a Ruby container, with our code mounted in a volume, to run our tests.

Auths and Mounts

Policies were easy – we parse the file, make a single API call to apply the policy and another to verify it. Auths and Mounts are a bit more complicated. There are essentially three parts to each:

  1. Enable the Auth/Mount
  2. Tune the Auth/Mount
  3. Configure the Auth/Mount

Enabling is pretty simple: you pass the name (what you want to call it), the type (such as secret, github or pki) and an optional description.

We store this information in sys/auth/<name>.ext. The API endpoint is sys/auth/<name>.

└── sys
    └── auth
        └── github-spaceape.json

The contents of this file may look something like this:

{
  "type": "github",
  "description": "spaceape github",
  "config": {
    "max_lease_ttl": "87600h",
    "default_lease_ttl": "3h"
  }
}

Notice it contains the type and description, which we covered above. It also includes a config key; this is actually the tuning we can apply to the Auth/Mount. It is applied to the API endpoint sys/auth/<name>/tune, so it seems to make sense to store it in this file.

So far so good, but now we come onto configuring the Auth or Mount. There’s no standard pattern here and they sometimes require secrets. We decided to exclude any secrets from the config. These can be applied as manual steps later. We can however apply some configuration.

For example, we can set the organisation for the Github auth, but we wouldn’t want to set the AWS credentials for the EC2 auth backend.

The API endpoint for applying configuration to Auths is auth/<name>/config and Mounts is <name>/config/<config_item>. We decided to group our mounts under a mounts directory, veering slightly from the file structure matching the API path.

Our directory structure now looks a little like this:

└── sys
|   └── auth
|   |   └── github-spaceape.json
|   └── mounts
|       └── spaceape-pki
└── auth
|   └── github-spaceape
|       └── config.json
└── mounts
    └── spaceape-pki
        └── config
        |   └── urls.json
        └── roles
            └── example-role.json

This is where mapping the file path to the API comes into its own: we can handle any of the Auths or Mounts without having to explicitly write code for the exact type; we just have to get the structure correct.

Gotchas

There are a few things to look out for.

  1. When verifying that our changes were applied, Vault sometimes gives you more back than you expect. We just verify the fields we pass in.
  2. Time-based fields (like the various ttl fields) are not always returned in the same format; you may get the time in seconds, or days and hours, etc. We found the chronic_duration gem useful for parsing the times for easy comparison.
  3. Some configuration on an Auth or Mount may have to be applied in a specific order; this is where we would have to write custom code to handle that particular type of Auth or Mount. Perhaps a configuration file could define the order in which to apply certain configuration.

Continuous Integration

When we check in code, a Jenkins job is triggered which runs our tests. As mentioned earlier, we run our tests inside Docker containers, which means that we don’t have to worry about clashing gem versions from other Ruby-based builds we have on Jenkins.

More interesting to us is that we can now test our actual Vault configuration. So when we add a new policy we know it applies correctly. Again we use Jenkins to do this. Each time we commit a change to our Vault configuration git repository we trigger a build which attempts to apply the configuration to an instance of Vault running in Dev mode. If any of the configuration fails we can be prompted through Slack to see what caused it.

It’s still up to us to apply the changes to the production instance of Vault after the Jenkins tests have run successfully. This is mainly because we don’t want to give privileged Vault tokens out to Jenkins.

Final Words

The process we’ve described above for managing Vault configuration is just one way you could go about solving this problem. From our experience, it works – we can test our configuration and apply it in a repeatable and programmatic way.

It is, however, still a work in progress, and there will doubtless be a few problems to overcome as we continue development. We hope to open source the code in the future, but right now we feel there are still some improvements to make; for instance, at the moment we test against Vault 0.6.5 (the latest release is 0.7.3) and we’ve only tested against a handful of Mounts and Authentication backends.

 

De-comming EC2 Instances With Serverless and Go

Nothing is certain but Death and Taxes, goes the old idiom, and EC2 instances are not exempt. You will have to pay for them, and at some point they will die. Truly progressive outfits embrace this fact and pick off unsuspecting instances Chaos-Monkey-style; others wait for that obituary email from Amazon. Either way, we all have to make allowances for those dearly departed instances, and tidy them up once they are gone.

This article describes one way of doing so automatically; using Lambda, the Serverless Framework, and Go.

Why?

Why Lambda? The obvious benefit is that there is no need to run and maintain a host to watch for dying instances. Also it integrates nicely with Cloudwatch Events, which is the best way to get notified of them.

Why Serverless? The Serverless Framework is an open-source effort to provide a unified way of building entire serverless architectures. Originally designed specifically for Lambda it is gaining increasing support for other providers too. Beyond just deploying Lambda functions, it allows you to manage all of the supporting infrastructural components (e.g. IAM, ELBs, S3 buckets) in one place, by supplementing your Lambda code with Cloudformation templates.

Why Go? Aside from it being one of our operational languages (along with Ruby), this is perhaps the hardest one to answer, as AWS don’t actually support it natively (yet). However some recent developments have made it more attractive: in Go 1.8, support was added for plugins. These are Go programs that are compiled as shared modules, to be consumed by other programs. The guys at eawsy with their awesome aws-lambda-go-shim immediately saw the potential this had in running Go code from a Python Lambda function. No more spawning a process to run a binary, instead have Python link the shared module and call it directly. Their Github page suggests that this is the second fastest way of executing a Lambda function, faster even than NodeJS, the serverless poster-boy!

It is this shim that we have used to build our EC2 Decomissioner, and we have also borrowed heavily from this idea (we found that we just needed a bit more flexibility, notably in pulling build-time secrets from Vault, outside the scope of this article).

How?

Cloudwatch Events are a relatively recent addition to the AWS ecosystem. They allow us to be notified of various events through one or more targets (e.g. Lambda functions, Kinesis streams).

Pertinently for this application, we can be told when an EC2 instance enters the terminated state, and the docs tell us the event JSON received by the target (in our case a Lambda function) will look like this:

{
   "id":"7bf73129-1428-4cd3-a780-95db273d1602",
   "detail-type":"EC2 Instance State-change Notification",
   "source":"aws.ec2",
   "account":"123456789012",
   "time":"2015-11-11T21:29:54Z",
   "region":"us-east-1",
   "resources":[
      "arn:aws:ec2:us-east-1:123456789012:instance/i-abcd1111"
   ],
   "detail":{
      "instance-id":"i-abcd1111",
      "state":"terminated"
   }
}

The detail is in the…detail, as they say. The rest is just preamble common to all Cloudwatch Events. But here we can see that we are told the instance-id, and the state to which it has transitioned.

So we just need to hook up a Lambda function to a specific type of Cloudwatch Event. This is exactly what the Serverless Framework makes easy for us.

Note: the easiest way to play along is to follow the excellent instructions detailed here; below we configure the setup in a semi-manual fashion, to illustrate what is going on. Either way you’ll need to install the Serverless CLI.

Create a directory to house the project (let’s say serverless-ec2). Then create a serverless.yml file with contents something like this:

service: serverless-ec2
package:
  artifact: handler.zip
provider:
  name: aws
  stage: production
  region: us-east-1
  runtime: python2.7
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "ec2:DescribeTags"
      Resource: "*"
functions:
  terminate:
    handler: handler.HandleTerminate
    events:
      - cloudwatchEvent:
          event:
            source:
              - "aws.ec2"
            detail-type:
              - "EC2 Instance State-change Notification"
            detail:
              state:
               - terminated

This config describes a service (analogous to a project) called serverless-ec2.

The package section specifies that the handler.zip file is the artifact containing Lambda function code that is uploaded to AWS. Ordinarily the framework takes care of the zipping for us, but we will be building our own artifact (more on that in a moment).

The provider section specifies some AWS information, along with an IAM Role that will be created, that allows our function to describe EC2 tags.

Finally the functions section specifies a function, terminate, that is triggered by a Cloudwatch Event of type ‘aws.ec2’, with an additional filter applied to match only those events that have a ‘state’ of ‘terminated’ in the detail section of the event (see above).  The function is to be handled by the handler.HandleTerminate function, which is to be the name of the Go function we will write.

So let’s go ahead and write it. First, run the following to grab the runtime dependency:

go get -u -d github.com/eawsy/aws-lambda-go-core/...

Then we are good to compose our function, create a handler.go with the following content:

package main

import (
	"log"

	"github.com/eawsy/aws-lambda-go-core/service/lambda/runtime"
)

// CloudwatchEvent represents an AWS Cloudwatch Event
type CloudwatchEvent struct {
	ID     string `json:"id"`
	Region string `json:"region"`
	Detail map[string]string
}

// HandleTerminate decomissions the terminated instance
func HandleTerminate(evt *CloudwatchEvent, ctx *runtime.Context) (interface{}, error) {
	log.Printf("instance %s has entered the '%s' state\n", evt.Detail["instance-id"], evt.Detail["state"])
	return nil, nil
}

Some points to note:

  • Your Handle* functions must reside in the main package, but you are free to organise the rest of your code as you wish. Here we have declared HandleTerminate, which is the function referenced in serverless.yml.
  • The github.com/eawsy/aws-lambda-go-core/service/lambda/runtime package provides access to a runtime.Context object that allows you the same access to the runtime context as the official Lambda runtimes (to access, for example, the AWS request ID or remaining execution time).
  • The return value will be JSON marshalled and sent back to the client, unless the error is non-nil, in which case the function is treated as having failed.

Perhaps the most important piece of information here is how the event data is passed into the function. In our case this is the Cloudwatch EC2 Event JSON as shown above, but it may take the form of any number of JSON events. All we need to know is that the event is automatically JSON unmarshalled into the first argument.

This is why we have defined a CloudwatchEvent struct, which will be populated neatly by the raw JSON being unmarshalled. It should be noted that there are already a number of predefined type definitions available here, we are just showing this for explanatory purposes.

The rest of the function is extremely simple, it just uses the standard library’s log function to log that the instance has been terminated (you should use this over fmt as it plays more nicely with Cloudwatch Logs).

With our code in place we can build the handler.zip that will be uploaded by the Serverless Framework. This is where things get a little complicated. Thankfully, the chaps at eawsy have provided us with a Docker image (with Go 1.8, and some tools used in the build process, installed). They also provide a Makefile (with an alternative one here) that you should definitely use, again what follows is just to demystify the process:

Run:

docker pull eawsy/aws-lambda-go-shim:latest

docker run --rm -it -v $GOPATH:/go -v $(pwd):/build -w /build eawsy/aws-lambda-go-shim go build --buildmode=plugin -ldflags='-w -s' -o handler.so

This builds our code as a Go plugin (handler.so) from within the provided Docker container. Next, run:

docker run --rm -it -v $GOPATH:/go -v $(pwd):/build -w /build eawsy/aws-lambda-go-shim pack handler handler.so handler.zip

This runs a custom ‘pack’ script that creates a zip archive (handler.zip) that includes our recently compiled handler.so along with the Python shim required for it to work on AWS. The very same handler.zip referenced in the serverless.yml above!

The final step then is to actually deploy the function, which is as simple as:

sls deploy

Once the Serverless tool has finished doing its thing, you should have a function that logs whenever an EC2 instance is terminated!

Clearly, you want to do more than just log the terminated instance. But the actual decommissioning is subjective. For instance, amongst other things, we remove the instance’s Route53 record, delete its Chef node/client, and remove any locks it might be holding in our Consul cluster. The point is that this is now just Go code – you can do with it whatever you wish.

Note that if you require access to anything inside your VPC as part of the tidying-up process, you need to explicitly state the VPC and subnets/security groups in which Lambda functions will run. But don’t worry, the Serverless tool has you covered.

AWS Lambda – build yourself a URL shortener in 2 hours

An interesting requirement came up at work this week where we discussed potentially having to run our own URL Shortener because the Universal Links mechanism (in iOS 9 and above) requires a JSON manifest at

https://domain.com/apple-app-site-association

Since the OS doesn’t follow redirects, this manifest has to be hosted on the URL shortener’s root domain.

Owing to a limitation with our attribution partner, they’re currently not able to shorten links when you have Universal Links configured for your app. Whilst we could switch to another vendor, it would mean more work for our (already stretched) client devs, and we really like our partner’s support for attribution in links.

Which brings us back to the question

“should we build a URL shortener?”

swiftly followed by

“how hard can it be to build a scalable URL shortener in 2017?”

Well, it turns out it wasn’t hard at all!

Lambda FTW

For this URL shortener we’ll need several things:

  1. a GET /{shortUrl} endpoint that will redirect you to the original URL
  2. a POST / endpoint that will accept an original URL and return the shortened URL
  3. an index.html page where someone can easily create short URLs
  4. a GET /apple-app-site-association endpoint that serves a static JSON response

all of which can be accomplished with API Gateway + Lambda.

Overall, this is the project structure I ended up with:

  • using the Serverless framework’s aws-nodejs template
  • each of the above endpoints has a corresponding handler function
  • the index.html file is in the static folder
  • the test cases are written in such a way that they can be used both as integration as well as acceptance tests
  • there’s a build.sh script which facilitates running
    • integration tests, eg ./build.sh int-test {env} {region} {aws_profile}
    • acceptance tests, eg ./build.sh acceptance-test {env} {region} {aws_profile}
    • deployment, eg ./build.sh deploy {env} {region} {aws_profile}

GET /apple-app-site-association endpoint

Seeing as this is a static JSON blob, it makes sense to precompute the HTTP response and return it every time.
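Assuming the Lambda proxy integration, that can be as simple as building the response object once at module load, outside the handler (a sketch; the manifest contents here are placeholders):

// Built once per container, so repeated invocations just return the same object.
const APPLE_APP_SITE_ASSOCIATION = {
  applinks: {
    apps: [],
    details: [
      { appID: "TEAMID.com.example.app", paths: ["*"] }, // placeholder values
    ],
  },
};

const RESPONSE = {
  statusCode: 200,
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(APPLE_APP_SITE_ASSOCIATION),
};

export async function handler(): Promise<typeof RESPONSE> {
  return RESPONSE;
}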

POST / endpoint

For an algorithm to shorten URLs, you can find a very simple and elegant solution on StackOverflow. All you need is an auto-incremented ID, like the ones you normally get with an RDBMS.

However, I find DynamoDB a more appropriate DB choice here because:

  • it’s a managed service, so no infrastructure for me to worry about
  • OPEX over CAPEX, man!
  • I can scale read & write throughput elastically to match the utilization level and handle any spikes in traffic

However, DynamoDB has no concept of an auto-incremented ID, which the algorithm needs. Instead, you can use an atomic counter to simulate one (at the expense of an extra write unit per request).
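A sketch of that atomic counter using the DynamoDB DocumentClient (the table, key and attribute names here are made up):

import { DynamoDB } from "aws-sdk";

const dynamo = new DynamoDB.DocumentClient();

// Atomically increments a counter item and returns the new value,
// which can then be encoded (base62 or similar) as the short URL token.
async function nextId(): Promise<number> {
  const result = await dynamo
    .update({
      TableName: "ape-shortener",     // hypothetical table name
      Key: { id: "__counter__" },     // a single well-known counter item
      UpdateExpression: "ADD #n :one",
      ExpressionAttributeNames: { "#n": "runningCount" },
      ExpressionAttributeValues: { ":one": 1 },
      ReturnValues: "UPDATED_NEW",
    })
    .promise();

  return result.Attributes!.runningCount as number;
}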


GET /{shortUrl} endpoint

Once we have the mapping in a DynamoDB table, the redirect endpoint is a simple matter of fetching the original URL and returning it as part of the Location header.

Oh, and don’t forget to return the appropriate HTTP status code, in this case a 308 Permanent Redirect.
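Assuming the Lambda proxy integration, the handler could look something like this sketch (the table name and key schema are assumptions):

import { DynamoDB } from "aws-sdk";

const dynamo = new DynamoDB.DocumentClient();

export async function handler(event: { pathParameters: { shortUrl: string } }) {
  const result = await dynamo
    .get({
      TableName: "ape-shortener",                       // hypothetical table name
      Key: { shortUrl: event.pathParameters.shortUrl },
    })
    .promise();

  if (!result.Item) {
    return { statusCode: 404, body: "not found" };
  }

  // 308 Permanent Redirect, as mentioned above.
  return {
    statusCode: 308,
    headers: { Location: result.Item.longUrl },
    body: "",
  };
}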


GET / index page

Finally, for the index page, we’ll need to return some HTML instead (and a different content-type to go with the HTML).

I decided to put the HTML file in a static folder, which is loaded and cached the first time the function is invoked.
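A sketch of that lazy load-and-cache behaviour, relying on module scope surviving between invocations of a warm container (again assuming the proxy integration):

import * as fs from "fs";
import * as path from "path";

// Cached in module scope, so it lives for the lifetime of the container and
// only the first invocation pays the cost of reading the file from disk.
let indexHtml: string | undefined;

export async function handler() {
  if (!indexHtml) {
    indexHtml = fs.readFileSync(path.join(__dirname, "static", "index.html"), "utf8");
  }

  return {
    statusCode: 200,
    headers: { "Content-Type": "text/html" },
    body: indexHtml,
  };
}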


Getting ready for production

Fortunately I have had plenty of practice getting Lambda functions to production readiness, and for this URL shortener we will need to:

  • configure auto-scaling parameters for the DynamoDB table (we have an internal system for managing the auto-scaling side of things)
  • turn on caching in API Gateway for the production stage

Future Improvements

If you put in the same URL multiple times you’ll get back different short URLs; one optimization (for storage and caching) would be to return the same short URL instead.

To accomplish this, you can:

  1. add GSI to the DynamoDB table on the longUrl attribute to support efficient reverse lookup
  2. in the shortenUrl function, perform a GET with the GSI to find existing short url(s)

I think it’s better to add a GSI than to create a new table here because it avoids having “transactions” that span across multiple tables.
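A sketch of that reverse lookup (the table and index names are made up):

import { DynamoDB } from "aws-sdk";

const dynamo = new DynamoDB.DocumentClient();

// Returns an existing short URL token for this long URL, if one exists.
async function findExisting(longUrl: string): Promise<string | undefined> {
  const result = await dynamo
    .query({
      TableName: "ape-shortener",   // hypothetical table name
      IndexName: "longUrl-index",   // hypothetical GSI on the longUrl attribute
      KeyConditionExpression: "longUrl = :u",
      ExpressionAttributeValues: { ":u": longUrl },
      Limit: 1,
    })
    .promise();

  return result.Items && result.Items.length > 0
    ? (result.Items[0].shortUrl as string)
    : undefined;
}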

Useful Links

Devops at Scale: Videos

A big thank you to everyone who came to our offices to see our speakers last Wednesday night at the Devops at Scale event. A big thank you, also, to everyone involved in organising everything for the big night!

We’ve uploaded the videos of our talks, so if you weren’t able to come or are interested in what folk had to say, here they all are!

Steve Lowe: Devops at Scale: A Cultural Change

Sam Pointer: Smashing the Monolith for Fun and Profit: Telemetry-led Infrastructure at Hive

Louis McCormack: Monitoring at Scale