Space Ape are hiring for Devops

Here on the Space Ape Devops Team, we’ve been busy building out the tech for our next generation of mobile games and now it’s time to bring some fresh faces onto the team to help continue our journey. If you’re a passionate technologist, Devops engineer or infrastructure wrangler then we’d love to hear from you.

Being a Devop at Space Ape is an important role. On our existing titles, you’ll be responsible for maintaining the quality of our players’ experience, working with the team to roll out new features and upgrades and finding new ways to optimise the stacks. On our new titles, you’ll be working with the development team to build out new stacks, solve new problems and prepare for big scale launches.

Along the way, you’ll learn how we use tools to build and update our stacks and roll them out without impacting our players and developers. You’ll also learn how we write those tools in Ruby, Angular and sometimes Go. Eventually, you’ll learn what it is that our teams need and start bringing fresh new ideas for how we can make things better; perhaps improving our containerisation platform, serverless workloads or the security of our platforms.

If you’re interested, have a poke round some of our other posts and drop your details in on our careers page where you can find out a bit more about the Devops role and the technology we use.

Trajectory prediction with Unity Physics

In some recent prototyping work, we needed to display a prediction for a projectile trajectory in the game. You’ve probably seen something similar in many games before, such as Angry Birds:

The tutorial from Angry Birds 2. Note the dotted line, showing you the predicted trajectory of your bird, if you released the slingshot now.

Our prototype game was in Unity, and the projectile was set up using Unity’s physics engine. We had several requirements for the prediction:

  • Immediate. Player input can change from frame to frame, and the prediction needs to stay in sync with it.
  • Accurate. The time of flight could be several seconds, and any small error will accumulate to produce significantly incorrect results.
  • Simulates drag. We’re using drag on our rigidbody, which many solutions do not account for.

I assumed this sort of problem came up often and searched online to see what popular implementations were out there. They generally fell into three groups:

  • Accurate, but slow. These solutions introduce an invisible projectile clone into the world and launch it along the flight path, recording its motion over time. As there’s no way to step the Unity physics simulation along yourself, you have to wait for this prediction in real time. This means that a three-second flight takes three seconds to fully predict. This is far too slow – the prediction would constantly lag behind the player’s changing inputs.
  • Doesn’t include drag. There are some good, accurate solutions, but most will specifically rule out drag.
  • Inaccurate. Some combination of incorrect equations, assumptions, and approximations meant that with longer flight times and more drag (or different gravity) the prediction would be wrong.

Perhaps the perfect solution for us is out there, but I couldn’t find it. By combining existing solutions and running some tests, I came up with my own implementation, which is presented below.

 public static Vector2[] Plot(Rigidbody2D rigidbody, Vector2 pos, Vector2 velocity, int steps)
 {
     Vector2[] results = new Vector2[steps];
 
     float timestep = Time.fixedDeltaTime / Physics2D.velocityIterations;
     Vector2 gravityAccel = Physics2D.gravity * rigidbody.gravityScale * timestep * timestep;
     float drag = 1f - timestep * rigidbody.drag;
     Vector2 moveStep = velocity * timestep;
 
     for (int i = 0; i < steps; ++i)
     {
         moveStep += gravityAccel;
         moveStep *= drag;
         pos += moveStep;
         results[i] = pos;
     }
 
     return results;
 }

This function plots the trajectory of a rigidbody under the effect of Unity’s physics by simulating some FixedUpdate iterations and returning the positions of the projectile at each iteration. It uses the global Physics2D.gravity setting, and takes into account rigidbody drag and gravityScale. Note that the mass of the rigidbody is irrelevant.

float timestep = Time.fixedDeltaTime / Physics2D.velocityIterations;

The code attempts to produce the same results as running the normal Unity physics iterations. To do this, it must also run as an iterative solution. A common error here is to assume that one iteration is run every FixedUpdate(). Instead, the number of iterations to be performed is accessible and tweakable – it’s Physics2D.velocityIterations. This helps us compute the timestep.

Vector2 gravityAccel = Physics2D.gravity * rigidbody.gravityScale * timestep * timestep;

We take into account the rigidbody’s gravityScale property when computing the effect of gravity. We found that we wanted a different amount of gravity and drag on each object, so this per-body setting was really helpful.

float drag = 1f - timestep * rigidbody.drag;

Drag acts as a reduction on moveStep in each iteration. We can compute it upfront and then apply it to each step of the iteration, producing a cumulative effect.

for (int i = 0; i < steps; ++i)
 {
     moveStep += gravityAccel;
     moveStep *= drag;
     pos += moveStep;
     results[i] = pos;
 }

Finally, the main loop. Each iteration, you’ll move due to gravity, reduce the movement due to drag, and then accumulate and store the new position in the results.

This solution worked well for us. While not exhaustively tested, we used it for projectiles that had lots of different velocities and drags, and it proved accurate each time, even after 4-5 seconds of flight.

 

Drawbacks

There’s a lot of computation involved for long trajectories. With default settings (a fixed timestep of 0.02 seconds, i.e. 50 physics updates per second, and 8 velocity iterations per update), you have to run the loop 400 times for each second of flight you want to predict. We only have one projectile to predict in our prototype, so we’re just running one prediction, which doesn’t cost very much. If you used this to predict lots of projectiles for lots of different launchers in a large-scale game, perhaps this would begin to be a problem for you.

Also, it’s only simulating the trajectory, and not actually running the physics engine or simulating anything else in the game. This means it doesn’t predict collisions or collision resolution. If you render this path as-is, it’ll just clip through walls or other obstacles in the world, which obviously isn’t actually what will happen when the projectile is launched.

These drawbacks weren’t a problem for our prototype, so it turned out to be pretty useful code. We share this now in the hope that someone else out there is faced with the same requirements and finds it useful too.

 

Possible Future Upgrades

I have some ideas around the drawbacks of this method. This is the main area for improvement, as the actual functionality is fine.

For performance, no profiling or optimisation work has been performed. I’ve just laid things out in the way that made sense to me. It’s hard to guess at optimisations, but perhaps a little profiling would reveal some simple speedups. The bigger step would be to push this code out to a native DLL and get down to nitty-gritty C++ optimisation – perhaps with SIMD instructions. You can’t parallelise the steps (each iteration of the loop depends on the result of the previous), but you could parallelise multiple projectile predictions – e.g. if you have many projectiles, run 4 or 8 predictions in parallel.

The other big upgrade is around prediction. For some games a true prediction would be really valuable – for example, visualising the outcome of collisions and reactions in a pool table game. You’d want to see the predicted path of the ball, even after several bounces. This isn’t going to happen with any simple model if you have any in-depth physics properties. You’d need a big shift in your approach – to run the physics engine yourself. I’d find an appropriate existing physics engine and build it into the game/Unity, which is a shame as it’s duplicating the work that Unity’s already done. But after doing that, you’d have control over the physics simulation and how you update it.

You’d aim for a setup where you could clone the existing simulation and run some update ticks – essentially looking into the future – tracking the future state of the simulation, assuming no inputs change. This would have to be a separate simulation, as you wouldn’t want the actual state of the pool table to change – just to compute the predicted future state. This will be even more expensive than just running the basic prediction code we had above – it’s the full physics simulation.

Go Wavefront!

Long ago we took the decision to outsource our metrics platform. We generate a lot of metrics, and we came to realise that our solution at the time, Graphite, was not up to the task. Instead of spending in-house resource building a new platform, we decided to find an external partner, so we could focus on our core competency – running mobile games.

We eventually settled on Wavefront, which was in private beta at the time. Even in these early stages of their development, we were wildly impressed with the product. The responsiveness of the graphs in the browser and the stability of the metric ingestion platform particularly impressed us.

This was over 2 years ago. Since then we have grown alongside Wavefront and watched as they came out of stealth mode, and continuously added to their bevy of features to offer the world-beating product they have today.

We contributed heavily to their Ruby client, which has been open-sourced and continues to improve. But now we’re happy to announce another OSS project, go-wavefront.

Go-wavefront is a set of Golang libraries and a bundled CLI for interacting with the Wavefront API. It also includes a simple Writer library for sending metrics. It was born out of an itch we needed to scratch: integrating the smattering of Go applications we have with our metric provider. We hope that in opening it up to the wider Wavefront and Golang community we can improve what we have, and be better able to keep up with the new features Wavefront throw at us.
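
Under the hood, a Writer is essentially sending metrics to a Wavefront proxy in the Wavefront plain-text data format. The snippet below is not go-wavefront’s actual API – it’s a minimal, hand-rolled Go illustration of that wire format, with a hypothetical proxy address and metric name:

 package main

 import (
     "fmt"
     "net"
     "time"
 )

 func main() {
     // Connect to a Wavefront proxy; 2878 is the proxy's usual plain-text metrics port.
     // The hostname here is hypothetical.
     conn, err := net.Dial("tcp", "wavefront-proxy.internal:2878")
     if err != nil {
         panic(err)
     }
     defer conn.Close()

     // Wavefront data format: <metric> <value> [<timestamp>] source=<source> [pointTags]
     line := fmt.Sprintf("devops.example.heartbeat 1 %d source=%s env=\"demo\"\n",
         time.Now().Unix(), "build-agent-01")
     if _, err := fmt.Fprint(conn, line); err != nil {
         panic(err)
     }
 }

The library wraps this (and the query API) up rather more nicely – see the project README for the real interface.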

As a cute-but-probably-useless gimmick, the CLI can plot a Wavefront graph live in the terminal window! Check it out – all feedback and pull requests are welcome.

Introducing ComposeECS

The DevOps team here at Space Ape have just open-sourced a small Ruby gem that provides a mechanism to convert Docker Compose specifications to AWS EC2 Container Service task definitions – we’ve called it ComposeECS.

We run the majority of our infrastructure on AWS, using source-controlled CloudFormation templates to manage each of our stacks. Over time we’ve built up a toolchain, known internally as ApeStack, to help us manage changes to our CloudFormation stacks; it incorporates CFNDSL and our internal conventions and processes.

Toward the end of 2015 we started to build out a new platform that would support deployments of containerised applications where it made sense. As heavy users of AWS, EC2 Container Service (ECS) was the obvious choice for running containers in the cloud. The potential advantages of deep integration with AWS services like ELB and IAM have significant implications when it comes to integrating the new platform with our existing stack.

As a team, we are huge fans of specifying configuration in YAML. It’s then perhaps no surprise that we much prefer the syntax of Docker Compose definitions over the JSON-based ECS task definitions. We wanted to be able to specify our task definitions in YAML alongside our CloudFormation templates. Furthermore, we wanted to construct ECS services and their supporting infrastructure with a single command. To this end, we wrote ComposeECS.

ComposeECS reads any Docker Compose file and translates supported attributes (including volume definitions) into an equivalent ECS task definition JSON – sanity-checking your attributes and ensuring compatibility with the ECS task definition specification as it does so (there’s a small before-and-after sketch below the list that follows). The advantages ComposeECS provides include:

  • Container definitions are more readable and therefore easier to maintain.
  • Docker Compose definitions written to run in our local Docker environment now run on ECS with little modification.
  • Unlike services translated with the ECS-CLI, which supports Docker Compose deployment, our services are CloudFormation-managed whilst still taking advantage of the Docker Compose syntax.
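
To make that translation concrete, here’s roughly the shape of the mapping. The service below is hypothetical, and the JSON is hand-written to illustrate an ECS container definition – the exact attribute names and conversions ComposeECS emits may differ:

 es-indexer:
   image: spaceape/es-indexer:1.0.2
   mem_limit: 256m
   cpu_shares: 256
   ports:
     - "8080:8080"
   environment:
     REDIS_HOST: redis.internal

becomes something along the lines of:

 {
   "family": "es-indexer",
   "containerDefinitions": [
     {
       "name": "es-indexer",
       "image": "spaceape/es-indexer:1.0.2",
       "memory": 256,
       "cpu": 256,
       "portMappings": [
         { "containerPort": 8080, "hostPort": 8080 }
       ],
       "environment": [
         { "name": "REDIS_HOST", "value": "redis.internal" }
       ]
     }
   ]
 }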

We’re very pleased with ComposeECS. However, it is but one of many hurdles on the path to a production-ready Space Ape container platform. For us, many questions remain around service discovery, efficient and reliable deployment, injection of configuration, handling scale, and how to effectively monitor our cluster. We look forward to sharing more from our journey.


If you’re interested in integrating ComposeECS into your toolchain, or contributing, head over to the project’s GitHub page.

CoreOS London January Meetup 2016

 

On Wednesday 13th January some members of our Devops team attended the CoreOS London Meetup hosted at BlackRock.

At Space Ape we are taking steps towards running some of our services in containers and are very interested in CoreOS and the ecosystem around it.

The first talk of the evening, by Ric Harvey from ngineered, took us through running CoreOS on AWS, specifically their approach to, and findings from, running Kubernetes.

We started off with an overview of Kubernetes installation on AWS and covered some of the problems you may run into. One to note was pulling Kubernetes from GitHub: if you pull from GitHub across a lot of nodes in a VPC you’ll probably run into API limits. To get around this you can host Kubernetes in an S3 bucket (or another endpoint you control) and pull from there. This approach is also preferable as it allows you more granular control over the version of Kubernetes you are deploying.

Ric talked us through their CoreOS setup and how they’ve worked to implement AWS best practices in their cluster. To start with, they run everything in a VPC and ensure that they deploy the cluster over multiple availability zones (AZs). Within each AZ they deploy at least two subnets: one containing their Elastic Load Balancers (ELBs) and a second containing the CoreOS nodes and shared storage nodes.

The CoreOS nodes are set up in autoscaling groups, which has allowed them to scale the fleet up and down automatically. On top of this they’ve got the Kubernetes replication controller deploying the containers around the cluster, ensuring they’ve always got the desired number. The autoscaling groups can either be scaled manually or they can make use of CloudWatch alarms. One improvement they are working on is making use of custom metrics for scaling (e.g. container count), as right now they depend on the metrics from the hypervisors.

Outside of the autoscaling groups they deploy the Kubernetes master node. This allows them to control the cluster and ensures that the node will not get destroyed during a scaling event.

The approach to load balancers was very interesting. While Kubernetes can control ELB creation, ngineered opted not to use this and instead use ELBs for public access, forwarding the traffic to HAProxy instances. These HAProxy instances run on all of the CoreOS instances and forward the traffic to the right container via IP addresses defined for that service by Kubernetes. It was a good example of controlling costs while still ensuring that the setup was flexible.

It was really good to see a production Kubernetes cluster being deployed on AWS and to get an insight into the challenges you could face doing it: what works, what doesn’t, and how careful design can help with cost control.

The slides from this talk are available on Google Docs.

 
The second talk of the evening was by Joseph Schorr from CoreOS, who told us about the security work CoreOS have been doing at all levels of the stack.

Joseph started off talking about system compromise, asking at what level you can now trust the system. How do you know whether the hardware or bootloaders have been compromised?

To combat this uncertainty CoreOS have been working on signing all levels of the system boot process using keys that are stored in a trusted platform module (TPM). At a high level this allows for a set of keys to be embedded in hardware on the system. First the TPM is verified to ensure it’s authentic. Once that has taken place each component in the system can be loaded, with a signature being checked at each stage. If the signature is validated the component will be loaded. If there are any failures the boot will halt.

This system ensures that by the time the OS has been loaded you’ve got a verified trail of each step of the process which can later be audited.

Within your trusted OS environment you can now deploy your containers. It would be expected that you would build your containers from within a trusted environment and have them signed. Using these signed containers you are able to verify and deploy them into your trusted OS environment.

This is a great step forward for security and integrity of the OS and should give administrators a lot more confidence in the systems they are deploying. CoreOS have blogged on this system providing far more technical detail on how it works.

Joseph then went on to talk about Quay.io and the steps they’ve been implementing to provide security insights into the containers they are hosting. Quay.io are using their new tool Clair to scan all of the containers they host against the CVE database (and common vendor databases) and report when your container contains an insecure package. Each layer of the container is scanned so you will get alerted on insecure packages that you might not know are present on the system. Alerts are triggered via webhooks which allow you to receive notifications in a number of different places.

If you’re using Quay.io to host your container images, the Clair scanning seems like something worth using right away.

The slides from this talk are also available on Google Docs.

This talk really showed that in today’s world, with exploits getting more and more sophisticated, security needs to be thought of at all levels of the system, from the hardware up. It also showed that through the use of clever tooling it’s possible to be alerted to potential problems in your infrastructure very soon after they are made public, and how containers can help you to resolve those problems and quickly prove they are resolved.

Thanks to BlackRock for hosting, to Ric and Joseph for speaking, and to the organisers for organising an enjoyable evening. We’re looking forward to the next one.

The Evolution of a CI System

The process of building and maintaining repeatable infrastructure, a process we now know as configuration management, has evolved over the years. It has had to, to keep up with the seismic shifts within the industry. 

In the beginning there were shell scripts and Kickstart manifests, accompanied by – if you were lucky – lengthy procedural documents. Inevitably some clever folk encapsulated these into tools and frameworks such as CFEngine, Puppet and Chef. With these tools at our disposal we now found we could represent our infrastructure as code and, since it was just code, why not apply some of the principles that our developer cousins had been preaching? Namely, unit and integration tests, code reviews, continuous integration and deployment, and so on.

In keeping with the trend, eventually these configuration management tools were themselves further abstracted. Companies built their own bespoke CI systems to solve their own specific problems. 

This is the story of how Space Ape’s Chef-based CI system evolved. Hopefully it may resonate with others, and even provide inspiration to those facing similar problems.

We started with community cookbooks. A lot of community cookbooks. We had cookbooks wrapping those community cookbooks, we even had cookbooks wrapping those wrapper cookbooks. We had no version constraints; if you pushed some code to the Chef server you pushed it to all environments, instantly.

Versioning cookbooks against environments seemed an obvious place to start, so we did. We used the knife spork tool. Knife spork is a handy knife plugin that will ‘bump’ cookbook versions, and ‘promote’ those new versions through environments. Crucially it leaves your production code running a previous version of a cookbook until such time you decide it is safe to promote.
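
In practice that flow looks something like this (the cookbook name is one of ours, the environment name is illustrative):

 knife spork bump sag_logstash patch
 knife spork promote staging sag_logstash

The bump increments the cookbook’s patch version in its metadata; the promote then pins that new version in the named environment, leaving every other environment on whichever version it was already running.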

Now, the community cookbook paradigm is great for getting things up and running quickly, but the long tail of dependencies soon becomes unwieldy: do you really need code to install Java on Windows, or yum repository management, when you’re running Ubuntu? Why do we have a runit cookbook when we’ve never even used runit? The problem is that community cookbooks need to support all manner of operating systems and frameworks, not just the specific ones you happen to use. So we took a policy of re-writing all of our infrastructure code, removing unwanted cruft and distilling only that which we absolutely needed.

Eventually, as the quality of our cookbook code improved, we found that often we would want to promote cookbooks through all environments. What better way to achieve this than a for loop?

for env in $(knife environment list); do knife spork promote ${env} sag_logstash; done

Any time you find yourself using the same for-loop each day, it’s probably time to write a script, or a shell helper at least. Additionally, the only safeguard we had with the for-loop, in the event of a problem, was to frantically hit Ctrl-C before it hit production.

Enter Space Ape’s first, really, er, rubbish CI system:

Our First CI

Essentially our first tool was that same for loop, with some ASCII art thrown in and some very rudimentary testing between environments (i.e. did the Chef run finish?). It was still a long way from perfect, but a slight improvement. Our main gripe with this approach (apart from the obvious fact that it was indeed a rubbish CI system) was that it still provided very little in the way of safety, and completely ignored our integration tests.

In time we decided we should make some proper use of those tests. A shell script would no longer cut it, ASCII art or not. No, we needed a system we could trust to continuously deploy our cookbook code, dependent on tests, with a proper queueing mechanism and relevant notifications upon failure.

Being decidedly not a ‘not invented here’ Devops team, we investigated some open-source and COTS offerings, but ultimately found them to be not quite suitable or malleable enough for our needs. We decided to build our own.

And so SeaEye was born. OK, it’s a silly name – sorry, an amazing pun – and we already had another amazing pun, ApeEye, a system we use for deploying code, so it made sense.

SeaEye is a Rails app that runs on Docker, uses Sidekiq as a background job processor, and has an AWS RDS database as a backend. It is first and foremost an HTTP API, which just happens to have a nice(-ish) web frontend. This allows us to build command-line tools that poke and poll the API for various purposes.

Beneath the nice(-ish) facade is a hierarchy of stateful workflows, each corresponding to a Sidekiq job and represented as a finite-state machine using the workflow gem. The basic unit of work is the CookbookPush, which is made up of a number of sub-tasks, one for each environment to be pushed through. The CookbookPush is responsible for monitoring the progress of each sub-task, and only when one has successfully completed does it allow the next to run. It makes use of the Consul-based locks we described in this post to add an element of safety to the whole process.

A CookbookPush can be initiated manually, but that is only half of the story. We wanted SeaEye to integrate with our development workflow. Like most Chef shops, we use Test Kitchen to test our cookbooks. Locally we test using Vagrant, and remotely using Travis-CI with the kitchen-ec2 plugin. We perform work on a branch and, once happy, merge the branch into master. What we’d traditionally do is then watch for the tests to pass before manually kicking off the CookbookPush.

We knew we could do better. So we added another stateful workflow, called the CI. The premise here is that SeaEye itself polls GitHub for commits against the master branch. If it finds one, and there is a specific tag against it, it will kick off a Travis build itself. Travis is then polled periodically as to the success (or otherwise) of the build, and CookbookPush-es are created for each cookbook concerned. The DevOps team are kept informed of the progress through Slack messages sent by SeaEye.

There are many ways to skin this particular CI cat, and many off-the-shelf products to help facilitate the skinning. Rolling our own has happened to work well for us, but every team and business is different. We’ve since built a suite of command-line tools, and even integrated SeaEye with ChatOps. Hopefully our experiences will help inspire others facing similar problems.

Is there such a thing as a DevOps Hierarchy of Needs?

In 1943 the psychologist Abraham Maslow proposed the concept of a ‘hierarchy of needs’ to describe human motivation. Most often portrayed as a pyramid, with the more fundamental needs occupying the largest space in the bottom layers, his theory states that only in the fulfilment of the lower-level needs can one hope to progress to the next stratum of the pyramid. The bottom-most need is of course physiological (i.e. food, shelter, beer etc); once this is achieved we can start to think about safety (i.e. let’s make sure nobody takes our beer), then we start looking for Love and Self Esteem before ending up cross-legged in an ashram searching for Self-Actualization and Transcendence.

Is this a Devops blog or what? Yes, yes it is. The suggestion is not that we should all be striving for Devops Transcendence or anything, but that perhaps the general gist of Maslow’s theory could be applied to coin a DevOps Hierarchy of Needs, and we could use the brief history of our own Devops team at Spaceape to bolster this idea.

In the beginning there was one man. This man was tasked with building the infrastructure to run our first game, Samurai Siege; not just the game-serving tier but also a Graphite installation, an ELK stack, a large Redis farm, an OpenVPN server, a Jenkins server, et cetera et cetera. At this juncture we could not even be certain that Samurai Siege would be a success. The remit was to get something that worked, to run our game to the standards expected by our players.

Some sound technological choices were made at this point, chief of which was to build our game within AWS.

With very few exceptions, we run everything in AWS. We’re exceedingly happy with AWS, and it suits our purposes. You may choose a different cloud provider; you may forego a cloud provider altogether and run your infrastructure on-premise. Whichever it is, this is the service that provides the first layer of our DevOps Hierarchy of Needs (DHoN). You need some sort of request-driven IaaS to equate to Maslow’s Physiological layer. Ideally this would include not only VMs but also your virtual network and storage. Without this service, whatever it might be (and it might be as simple as, say, a set of scripts to build KVM instances), you can’t hope to build toward the upper reaches of the pyramid.

Samurai Siege was launched. It was a runaway success. Even under load the game remained up, functional and performant. The one-man Devops machine left the company and Phase 2 in our short history commenced. We now had an in-house team of two and one remote contractor and we set about improving our lot, striving unawares for that next level of needs. It quickly became apparent, however, that we might face some difficulty…

If AWS provided the rock on which we built our proverbial church, we found that the church itself needed some repairs, someone had stolen the lead from its roof.

Another sound technology choice that was made early was to use Chef as the configuration management tool. Unfortunately – and unsurprisingly given the mitigating circumstances – the implementation was less than perfect. It appeared that Chef had only been used in the initial building of the infrastructure, attempts to run it on any sort of interval led inevitably to what we had started to call ‘facepalm moments’. We had a number of worrying 3rd party dependencies and if Chef was problematic, CloudFormation was outright dangerous. We had accrued what is commonly known as technical debt.

Clearly we had a lot of work to do. We set about wresting back control of our infrastructure. Chef was the first victim: we took a knife to our community cookbooks, we introduced unit tests and cookbook versioning, we separated configuration from code, we even co-opted Consul to help us. Once we had Chef back on-side we had little choice but to rebuild our infrastructure in its entirety, underneath a running game. With the backing of our CTO we undertook a policy of outsourcing components that we considered non-core (this was particularly efficacious with Graphite, more on this one day soon). This enabled us to concentrate  our efforts and to deliver a comprehensive game-serving platform, of which we were able to stamp out a new iteration for our now well-under-development second game, Rival Kingdoms.

It would be easy at this point to draw parallels with Maslow’s second tier, Safety. Our systems were resilient and monitored, we could safely scale them up and down or rebuild them. But actually what we had reached at this point was Repeatability. Our entire estate – from the network, load-balancers, security policies and autoscaling groups through to the configuration of Redis and Elasticsearch or the specifics of our deployment process – was represented as code. In the event of a disaster we could repeat our entire infrastructure.

Now, you might think this is a lazy observation. Of course you should build things in a repeatable fashion, especially in this age of transient hosts, build for failure, chaos monkeys, and all the rest of it. The fact is though, that whilst this should be a foremost concern of a Devops team, quite often it is not. Furthermore, there may be genuine reasons (normally business related) why this is so. The argument here is that you can’t successfully attain the higher layers of our hypothetical DHoN until you’ve reached this stage. You might even believe that you can but be assured that as your business grows, the cracks will appear.

At Spaceape we were entering Phase 3 of our Devops journey: the team had by now gained some staff and lost some, and gained a manager. The company itself was blossoming, the release date of Rival Kingdoms had been set, and we were rapidly employing the best game developers and QA engineers in London.

With our now sturdy IaaS and Repeatability layers in place, we were able to start construction of the next layer of our hierarchy – Tooling. Of course we had built some tools in our journey thus far (they could perhaps be thought of as tiny little ladders resting on the side of our pyramid) but it’s only once things are standardised and repeatable that you can really start building effective tooling for the consumption of others. Any software that tries to encompass a non-standard, cavalier infrastructure will result in a patchwork of ugly if..then..else clauses and eventually a re-write when your estate grows to a point where this is unsustainable. At Spaceape, we developed ApeEye (a hilarious play on the acronym API), a RESTful Rails application that just happens to have a nice UI in front of it. Perennially under development, eventually it will provide control over all aspects of our estate but for now it facilitates the deployment of game code to our multifarious environments (we have a lot of environments – thanks to the benefits of standardisation we are able to very quickly spin up entirely new environments contained on a single virtual host).

And so the launch of Rival Kingdoms came and went. It was an unmitigated success; the infrastructure behaved – and continues to behave – impeccably. We have a third game under development, for which building the infrastructure is now down to a fine art. Perched as we are atop our IaaS, Repeatability and Tooling layers, we can start to think about the next layer.

But what is the next layer? It probably has something to do with having the time to reflect, write blog posts, contribute to OSS and speak at Devops events, perhaps in some way analogous to Maslow’s Esteem level. But in all honesty we don’t know; we’ve but scarcely laid the foundations for our Tooling level. More likely is that there is no next level, just a continuous re-hashing of the foundations beneath us as new technologies and challenges enter the fray.

The real point here is a simple truth – only once you have a solid, stable, repeatable and predictable base can you start to build on it to become as creative and as, well, awesome as you’d like to be. Try to avoid the temptation to take shortcuts in the beginning and you’ll reap the benefits in the long term. Incorporate the practices and behaviours that you know you should, as soon as you can. Be kind to your future self.

Chef and Consul

Here at Spaceape, our configuration management tool of choice is Chef. We are also big fans of Consul, the distributed key-value-store-cum-service-discovery-tool from the good folks at Hashicorp. It might not be immediately clear why the two technologies should be mentioned in the same paragraph, but here is the story of how they became strange bedfellows.

In a previous blog post, we told the story of our experiences with Chef. That post goes into far greater detail, but suffice it to say that our infrastructure code base was not always as reliable, configurable or even predictable as it is now. We found ourselves in a dark place where Chef was run on an ad-hoc basis, often with fingers well and truly crossed. To wrest back control and gain confidence we needed to be able to run it on a 15-minute interval.

Simple, you say, write a cron-job. Well yes, that is true. But it’s only a very small part of the story. We would find occasions where initiating a seemingly harmless Chef run could obliterate a server, and yet the same run on an ostensibly similar server would reach the end without incident. In short we had little confidence in our code, certainly not enough to start running it on Production. Furthermore – and this applies still today – often we really don’t want to allow Chef to run simultaneously across a given environment. For example, we may push a change that restarts our game-serving process. I don’t need to expand on what would happen if that change ran across Production at 12:15 one day…

Wouldn’t it be nice, we asked ourselves, if we had some sort of global locking mechanism? To prevent us propagating potential catastrophes? Not only would this allow us to push infrastructure changes through our estate, it might just have other benefits…

Enter Consul!

Like all the Hashicorp products we’ve tried, Consul is solid. Written in Go, it employs the Raft consensus algorithm atop the Serf gossip protocol to provide a highly available distributed system that even passes the Jepsen test. It is somewhat of a swiss army knife that aims to replace or augment your existing service delivery, configuration management and monitoring tools.

The utility we decided to employ for the locking mechanism was the key-value store. We built our own processes and tooling around this, as we’ll see, but it should be noted that more recent versions of Consul than we had available at the time actually have a semaphore offering.

Stored in Consul, we have a number of per-tier, per-environment key-spaces. As an example:

chef/logstash/es-indexer

Logstash is the environment, es-indexer the service. Within this keyspace, the only pre-requisite is a value for the maximum number of concurrent Chef runs we wish to allow, which we call max_concurrent:

chef/logstash/es-indexer/max_concurrent

Generally this value is set to 1, but on some larger environments we set it higher.

When a server wishes to run Chef the first thing it does is to retrieve this max_concurrent value. Assuming the value is a positive integer (a value of -1 will simply allow all Chef runs) it then attempts to acquire a ‘slot’. A slot is obtained by checking this key:

chef/logstash/es-indexer/current

Which is a running count of the number of hosts in the tier currently running Chef. Its absence denotes zero. If the current value is less than the `max_concurrent` value, the server increments the counter and registers itself as ‘running’ by creating a key like this, the value of which is a timestamp:

chef/logstash/es-indexer/running/hostname.of.the.box

The sharper amongst you will have noticed a problem here. What if two hosts try to grab the slot at the same time? To avoid this happening we use Consul’s Check-and-Set feature. The way this works is that, upon the initial read of the current value, a ModifyIndex is retrieved along with the actual Value. If the server decides that current < max_concurrent it attempts to update current by passing a `?cas=ModifyIndex` parameter. If the ModifyIndex does not match that which is stored on Consul, it indicates that something else has updated it in the meantime, and the write fails.

With the slot obtained, the Chef run is allowed to commence. Upon success, the running key is removed and the counter decremented. If, however, the run fails, the lock (or the ‘slot’) is held, no further hosts on the tier are able to acquire a slot, and Chef runs are thereby suspended.
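
For illustration, here is a rough Go sketch of the slot-acquisition half of that logic, talking directly to Consul’s KV HTTP API. It is not our production tooling: key names mirror the example above, and error handling, the registration of the ‘running’ key and the release/decrement path are trimmed for brevity.

 package main

 import (
     "bytes"
     "encoding/base64"
     "encoding/json"
     "fmt"
     "net/http"
     "strconv"
 )

 const consul = "http://127.0.0.1:8500/v1/kv/"

 type kvPair struct {
     Value       string `json:"Value"` // Consul base64-encodes stored values
     ModifyIndex uint64 `json:"ModifyIndex"`
 }

 // readInt fetches a key and returns its integer value plus the ModifyIndex
 // needed for a later check-and-set write. A missing key reads as zero.
 func readInt(key string) (int, uint64, error) {
     resp, err := http.Get(consul + key)
     if err != nil {
         return 0, 0, err
     }
     defer resp.Body.Close()
     if resp.StatusCode == http.StatusNotFound {
         return 0, 0, nil
     }
     var pairs []kvPair
     if err := json.NewDecoder(resp.Body).Decode(&pairs); err != nil {
         return 0, 0, err
     }
     raw, _ := base64.StdEncoding.DecodeString(pairs[0].Value)
     n, _ := strconv.Atoi(string(raw))
     return n, pairs[0].ModifyIndex, nil
 }

 // casPut writes value to key only if its ModifyIndex still matches index.
 // Consul answers "true" on success and "false" if someone beat us to it.
 func casPut(key, value string, index uint64) (bool, error) {
     url := fmt.Sprintf("%s%s?cas=%d", consul, key, index)
     req, err := http.NewRequest(http.MethodPut, url, bytes.NewBufferString(value))
     if err != nil {
         return false, err
     }
     resp, err := http.DefaultClient.Do(req)
     if err != nil {
         return false, err
     }
     defer resp.Body.Close()
     var ok bool
     err = json.NewDecoder(resp.Body).Decode(&ok)
     return ok, err
 }

 func main() {
     base := "chef/logstash/es-indexer/"
     maxConcurrent, _, err := readInt(base + "max_concurrent")
     if err != nil || maxConcurrent == 0 {
         fmt.Println("Chef runs disabled for this tier (or Consul unreachable)")
         return
     }
     current, index, _ := readInt(base + "current")
     if maxConcurrent > 0 && current >= maxConcurrent {
         fmt.Println("all slots taken, skipping this Chef run")
         return
     }
     // Check-and-set: only succeeds if nobody changed 'current' since we read it.
     if ok, _ := casPut(base+"current", strconv.Itoa(current+1), index); !ok {
         fmt.Println("lost the race for the slot, will retry next interval")
         return
     }
     fmt.Println("slot acquired, safe to run Chef")
 }

A failed check-and-set write simply means another host grabbed the slot between our read and our write, so we back off and try again on the next interval.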

Our monitoring tools, by checking the timestamp of the running key, are able to warn us when locks have been held for a certain period of time (i.e. longer than the average Chef run takes), and failures are contained to one (or rather max_concurrent) host.

And so… this all works rather well. Many’s the time we’ve looked at one another, puffed our cheeks, and said, “Thank goodness for the locking system!” Over time it has allowed us to unpick our infrastructure code and get to the smug position in which we find ourselves. Almost never do we see locks being held for anything but trivial problems (Chef server timeouts, for instance), nothing that a little judicious sleep-and-retry doesn’t fix. It also gives us great control over when and what runs Chef, as we can easily disable Chef runs for a given tier by setting max_concurrent to 0.

But the purists amongst you will no doubt be screaming something about a CI system, or better unit tests, or something. And you’d be right. The truth is that we were unable to shoehorn a CI system into infrastructure code that was underpinning a live game and in which we did not have complete confidence. Having the backup parachute of the mechanism described above, though, has enabled us to address this. But that doesn’t mean we’ll be discarding it. On the contrary, it will form the backbone of our CI system, facilitating the automatic propagation of infrastructure code throughout our estate. More on that to follow.