Devops at Scale: Videos

A big thank you to everyone who came to our offices to see our speakers last Wednesday night at the Devops at Scale event. A big thank you, also, to everyone involved in organising everything for the big night!

We’ve uploaded the videos of our talks, so if you weren’t able to come or are interested in what folk had to say, here they all are!

Steve Lowe: Devops at Scale: A Cultural Change

Sam Pointer: Smashing the Monolith for Fun and Profit: Telemetry-led Infrastructure at Hive

Louis McCormack: Monitoring at Scale

Upcoming Event: Devops at Scale

We’re working with Burns Sheehan, Wavefront and Hive to host an event at Space Ape HQ on Wednesday April 12th on the theme of Devops at Scale.

The evening will explore the adoption of DevOps at scale, with talks from businesses and individuals who have successfully driven these new DevOps approaches.

Richard Haigh and Steve Lowe will be speaking from Betfair, telling the story of their shift to DevOps and how pushing for attitudinal change drives effective DevOps implementation. From Hive, Sam Pointer will talk about how they’ve used a telemetry-first approach to break apart a monolithic application and implement infrastructure transformation at scale. Finally, Louis McCormack of Space Ape Games will take a look at the challenges of monitoring everything, when “everything” keeps changing.

You can learn more at the event page, where you can also sign up to attend.


Space Ape are hiring for Devops

Here on the Space Ape Devops Team, we’ve been busy building out the tech for our next generation of mobile games and now it’s time to bring some fresh faces onto the team to help continue our journey. If you’re a passionate technologist, Devops engineer or infrastructure wrangler then we’d love to hear from you.

Being a Devop at Space Ape is an important role. On our existing titles, you’ll be responsible for maintaining the quality of our players’ experience, working with the team to roll out new features and upgrades and finding new ways to optimise the stacks. On our new titles, you’ll be working with the development team to build out new stacks, solve new problems and prepare for big scale launches.

Along the way, you’ll learn how we use tools to build and update our stacks and roll them out without impacting our players and developers. You’ll also learn how we write those tools in Ruby, Angular and sometimes Go. Eventually, you’ll learn what it is that our teams need and start bringing fresh new ideas for how we can make things better; perhaps improving our containerisation platform, serverless workloads or the security of our platforms.

If you’re interested, have a poke round some of our other posts and drop your details in on our careers page where you can find out a bit more about the Devops role and the technology we use.

CoreOS London January Meetup 2016


On Wednesday 13th January some members of our Devops team attended the CoreOS London Meetup hosted at BlackRock.

At Space Ape we are taking steps towards running some of our services in containers and are very interested in CoreOS and the ecosystem around it.

The first talk of the evening by Ric Harvey from ngineered took us through running CoreOS on AWS, specifically talking about their approach and findings running Kubernetes.

We started off with an overview of installing Kubernetes on AWS and some of the problems you may run into. One to note was pulling Kubernetes from GitHub: if you pull from GitHub across a lot of nodes in a VPC you'll probably run into API rate limits. To get around this you can host the Kubernetes release in an S3 bucket (or another endpoint you control) and pull from there. This approach is also preferable because it gives you more granular control over the version of Kubernetes you are deploying.
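
As a rough sketch of that idea (the bucket name and version below are placeholders, not what ngineered actually use), you might mirror a release into S3 once and have every node pull from the bucket:

# Mirror a specific Kubernetes release into a bucket you control (run once, from a build host).
KUBE_VERSION="v1.1.3"
curl -L -o kubernetes.tar.gz \
  "https://github.com/kubernetes/kubernetes/releases/download/${KUBE_VERSION}/kubernetes.tar.gz"
aws s3 cp kubernetes.tar.gz "s3://my-kube-artifacts/${KUBE_VERSION}/kubernetes.tar.gz"

# On each CoreOS node (e.g. via cloud-config), fetch from S3 instead of GitHub.
aws s3 cp "s3://my-kube-artifacts/${KUBE_VERSION}/kubernetes.tar.gz" /opt/kubernetes.tar.gz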

Ric talked us through their CoreOS setup and how they've worked to implement AWS best practices in their cluster. To start with, they run everything in a VPC and ensure that they deploy the cluster over multiple availability zones (AZs). Within each AZ they deploy at least two subnets: one which contains their Elastic Load Balancers (ELBs) and a second which contains the CoreOS nodes and shared storage nodes.

The CoreOS nodes are set up in autoscaling groups, which has allowed them to scale the fleet up and down automatically. On top of this they've got the Kubernetes replication controller deploying containers around the cluster, ensuring they've always got the desired number. The autoscaling groups can either be scaled manually or by CloudWatch alarms. One improvement they are working on is using custom metrics for scaling (e.g. container count), as right now they depend on the metrics from the hypervisors.
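
For illustration only (not their actual implementation), publishing a custom container-count metric that a scaling policy could act on might look something like this:

# Count the containers running on this instance and push the number to CloudWatch as a custom metric.
COUNT=$(docker ps -q | wc -l)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws cloudwatch put-metric-data \
  --namespace "Custom/Containers" \
  --metric-name RunningContainers \
  --dimensions InstanceId="${INSTANCE_ID}" \
  --value "${COUNT}"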

Outside of the autoscaling groups they will deploy the Kubernetes master node. This allows them to control the cluster and to ensure that the node will not get destroyed during a scaling event.

The approach to load balancers was very interesting. While Kubernetes can control ELB creation, ngineered opted not to use this; instead they use ELBs for public access, which forward the traffic to haproxy instances. These haproxy instances run on all of the CoreOS nodes and forward the traffic to the right container via the IP addresses Kubernetes defines for each service. It was a good example of controlling costs while still keeping the setup flexible.
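
As a small, purely illustrative aside, the stable IP that Kubernetes assigns to a service (the kind of address those haproxy backends would forward to) can be looked up with kubectl; the service name here is made up:

# Print the cluster IP Kubernetes has assigned to a service.
kubectl get service my-service -o jsonpath='{.spec.clusterIP}'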

It was really good to see a production Kubernetes cluster being deployed on AWS, to get an insight into the challenges you could face doing it, what works and what doesn't, and how careful design helps with cost control.

The slides from this talk are available on Google Docs.

The second talk of the evening was by Joseph Schorr from CoreOS who was telling us about the security work CoreOS have been doing at all levels of the stack.

Joseph started off talking about system compromise, asking at what level you can still trust the system: how do you know the hardware or bootloaders haven't been compromised?

To combat this uncertainty CoreOS have been working on signing all levels of the system boot process using keys that are stored in a trusted platform module (TPM). At a high level this allows for a set of keys to be embedded in hardware on the system. First the TPM is verified to ensure it’s authentic. Once that has taken place each component in the system can be loaded, with a signature being checked at each stage. If the signature is validated the component will be loaded. If there are any failures the boot will halt.

This system ensures that by the time the OS has been loaded you’ve got a verified trail of each step of the process which can later be audited.

Within your trusted OS environment you can now deploy your containers. The expectation is that you build your containers in a trusted environment and have them signed, so that the signed images can be verified before being deployed into your trusted OS environment.

This is a great step forward for security and integrity of the OS and should give administrators a lot more confidence in the systems they are deploying. CoreOS have blogged on this system providing far more technical detail on how it works.

Joseph then went on to talk about Quay.io and the steps they’ve been implementing to provide security insights into the containers they are hosting. Quay.io are using their new tool Clair to scan all of the containers they host against the CVE database (and common vendor databases) and report when your container contains an insecure package. Each layer of the container is scanned so you will get alerted on insecure packages that you might not know are present on the system. Alerts are triggered via webhooks which allow you to receive notifications in a number of different places.

If you're using Quay.io to host your container images, Clair scanning seems like something worth enabling right away.

The slides from this talk are also available on Google Docs.

This talk really showed that, in today's world, with exploits getting more and more sophisticated, security needs to be thought of at all levels of the system, from the hardware up. It also showed that, with clever tooling, it's possible to be alerted to potential problems in your infrastructure very soon after they are made public, and that containers can help you resolve those problems and quickly prove they are resolved.

Thanks to BlackRock for hosting, to Ric and Joseph for speaking, and to the organisers for organising an enjoyable evening. We're looking forward to the next one.

The Evolution of a CI System

The process of building and maintaining repeatable infrastructure, a process we now know as configuration management, has evolved over the years. It has had to, to keep up with the seismic shifts within the industry. 

In the beginning there were shell scripts and Kickstart manifests, accompanied by – if you were lucky – lengthy procedural documents. Inevitably some clever folk encapsulated these into tools and frameworks such as CFEngine, Puppet and Chef. With these tools at our disposal we found we could represent our infrastructure as code and, since it was just code, why not apply some of the principles that our developer cousins had been preaching? Namely unit and integration tests, code reviews, continuous integration and deployment, and so on.


In keeping with the trend, eventually these configuration management tools were themselves further abstracted. Companies built their own bespoke CI systems to solve their own specific problems. 

This is the story of how Space Ape’s Chef-based CI system evolved. Hopefully it may resonate with others, and even provide inspiration to those facing similar problems.

We started with community cookbooks. A lot of community cookbooks. We had cookbooks wrapping those community cookbooks, we even had cookbooks wrapping those wrapper cookbooks. We had no version constraints; if you pushed some code to the Chef server you pushed it to all environments, instantly.

Versioning cookbooks against environments seemed an obvious place to start, so we did, using knife spork: a handy knife plugin that 'bumps' cookbook versions and 'promotes' those new versions through environments. Crucially, it leaves your production code running a previous version of a cookbook until such time as you decide it is safe to promote.
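
A typical knife spork session looks something like this (the cookbook and environment names are just examples):

knife spork bump sag_logstash patch        # bump the cookbook's patch version
knife spork upload sag_logstash            # upload the new version to the Chef server
knife spork promote staging sag_logstash   # pin the new version in staging; other environments stay put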

Now, the community cookbook paradigm is great for getting things up and running quickly. But the long tail of dependencies soon becomes unwieldy: do you really need code to install Java on Windows, or to manage yum repositories, when you're running Ubuntu? Why did we have a runit cookbook when we'd never even used runit? The problem is that community cookbooks need to support all manner of operating systems and frameworks, not just the specific ones you happen to use. So we took a policy of rewriting all of our infrastructure code, removing unwanted cruft and distilling only that which we absolutely needed.

Eventually, as the quality of our cookbook code improved, we found that often we would want to promote cookbooks through all environments. What better way to achieve this than a for loop?

for env in $(knife environment list); do knife spork promote ${env} sag_logstash; done

Any time you find yourself using the same for-loop each day, it's probably time to write a script, or a shell helper at least. Additionally, the only safeguard we had with the for-loop, in the event of a problem, was to frantically hit Ctrl-C before it hit production.

Enter Space Ape’s first, really, er, rubbish CI system:

Our First CI

Essentially our first tool was that same for-loop, with some ASCII art thrown in and some very rudimentary testing between environments (i.e. did the Chef run finish?). It was still a long way from perfect, but a slight improvement. Our main gripe with this approach (apart from the obvious fact that it was indeed a rubbish CI system) was that it still provided very little in the way of safety, and completely ignored our integration tests.
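
In spirit, minus the ASCII art, it amounted to something like the following (an illustrative sketch, not the actual script):

# Promote the cookbook one environment at a time, kick off a Chef run, and stop at the first failure.
for env in $(knife environment list); do
  knife spork promote "${env}" sag_logstash
  knife ssh "chef_environment:${env}" "sudo chef-client" || { echo "Chef run failed in ${env}"; exit 1; }
done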

In time we decided we should make proper use of those tests. A shell script would no longer cut it, ASCII art or not. No, we needed a system we could trust to continuously deploy our cookbook code, dependent on tests, with a proper queueing mechanism and relevant notifications upon failure.

Being decidedly not a ‘not invented here’ Devops team, we investigated some open-source and COTS offerings, but ultimately found them to be not quite suitable or malleable enough for our needs. We decided to build our own.

And so SeaEye was born. OK, it's a silly name but an amazing pun, and we already had another amazing pun in ApeEye, the system we use for deploying code, so it made sense.

SeaEye is a Rails app that runs on Docker, uses Sidekiq as a background job processor and an AWS RDS database as a backend. It is first and foremost an HTTP API, which just happens to have a nice(-ish) web frontend. This allows us to build command line tools that poke and poll the API for various means.
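
Because the API comes first, the command line tools largely boil down to HTTP calls; something along these lines (the endpoint, payload and ID are hypothetical, purely to illustrate the shape of it):

# Hypothetical example: ask SeaEye to start a push for a cookbook, then poll its status.
curl -s -X POST https://seaeye.example.com/api/cookbook_pushes \
  -H "Content-Type: application/json" \
  -d '{"cookbook": "sag_logstash"}'
curl -s https://seaeye.example.com/api/cookbook_pushes/42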


Beneath the nice(-ish) facade is a hierarchy of stateful workflows, each corresponding to a Sidekiq job and represented as a finite state machine using the workflow gem. The basic unit of work is the CookbookPush, which is made up of a number of sub-tasks, one for each environment to be pushed through. The CookbookPush is responsible for monitoring the progress of each sub-task, and only when one has successfully completed does it allow the next to run. It makes use of the Consul-based locks we described in this post to add an element of safety to the whole process.

A CookbookPush can be initiated manually, but that is only half of the story. We wanted SeaEye to integrate with our development workflow. Like most Chef shops, we use Test Kitchen to test our cookbooks. Locally we test using Vagrant, and remotely using Travis-CI with the kitchen-ec2 plugin. We perform work on a branch and, once happy, merge the branch into master. What we’d traditionally do is then watch for the tests to pass before manually kicking off the CookbookPush.
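
For anyone unfamiliar with that loop, the local side is just the usual Test Kitchen commands (the instance name depends on the suites and platforms in your .kitchen.yml):

kitchen converge default-ubuntu-1404   # spin up the VM with Vagrant and run Chef
kitchen verify default-ubuntu-1404     # run the integration tests against it
kitchen test default-ubuntu-1404       # destroy, converge, verify and destroy in one go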


We knew we could do better, so we added another stateful workflow, called the CI. The premise here is that SeaEye itself polls GitHub for commits against the master branch. If it finds one with a specific tag against it, it kicks off a Travis build itself. Travis is then polled periodically for the success (or otherwise) of the build, and a CookbookPush is created for each cookbook concerned. The DevOps team are kept informed of the progress through Slack messages sent by SeaEye.

There are many ways to skin this particular CI cat, and many off-the-shelf products to help facilitate the skinning. Rolling our own has worked well for us, but every team and business is different. We've since built a suite of command-line tools, and even integrated SeaEye with ChatOps. Hopefully our experiences will help inspire others facing similar problems.