The process of building and maintaining repeatable infrastructure, a process we now know as configuration management, has evolved over the years. It has had to evolve to keep up with the seismic shifts within the industry.
In the beginning there were shell scripts and Kickstart manifests, accompanied by – if you were lucky – lengthy procedural documents. Inevitably some clever folk encapsulated these into tools and frameworks such as CFEngine, Puppet and Chef. With these tools at our disposal we found we could represent our infrastructure as code and, since it was just code, why not apply some of the principles that our developer cousins had been preaching? Namely, unit and integration tests, code reviews, continuous integration and deployment, and so on.
In keeping with the trend, eventually these configuration management tools were themselves further abstracted. Companies built their own bespoke CI systems to solve their own specific problems.
This is the story of how Space Ape’s Chef-based CI system evolved. Hopefully it may resonate with others, and even provide inspiration to those facing similar problems.
We started with community cookbooks. A lot of community cookbooks. We had cookbooks wrapping those community cookbooks, we even had cookbooks wrapping those wrapper cookbooks. We had no version constraints; if you pushed some code to the Chef server you pushed it to all environments, instantly.
Versioning cookbooks against environments seemed an obvious place to start, so we did. We used knife spork, a handy knife plugin that will ‘bump’ cookbook versions and ‘promote’ those new versions through environments. Crucially, it leaves your production code running a previous version of a cookbook until such time as you decide it is safe to promote.
Now, the community cookbook paradigm is great for getting things up and running quickly. But the long tail of dependencies soon becomes unwieldy: do you really need code to install Java on Windows, or yum repository management, when you’re running Ubuntu? Why do we have a runit cookbook when we’ve never even used runit? The problem is that community cookbooks need to support all manner of operating systems and frameworks, not just the specific ones you happen to use. So we adopted a policy of rewriting all of our infrastructure code, removing unwanted cruft and distilling only that which we absolutely needed.
Eventually, as the quality of our cookbook code improved, we found that often we would want to promote cookbooks through all environments. What better way to achieve this than a for loop?
for env in $(knife environment list); do knife spork promote ${env} sag_logstash; done
Any time you find yourself using the same for loop each day, it’s probably time to write a script, or a shell helper at least. Additionally, the only safeguard we had with the for loop, in the event of a problem, was to frantically hit Ctrl-C before it hit production.
Enter Space Ape’s first, really, er, rubbish CI system:
Essentially our first tool was that same for loop, with some ASCII art thrown in, and some very rudimentary testing between environments (i.e. did the Chef run finish?). It was still a long way from perfect, but a slight improvement. Our main gripe with this approach (apart from the obvious fact that it was indeed a rubbish CI system) was that it still provided very little in the way of safety, and completely ignored our integration tests.
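To make the idea concrete, here is a minimal sketch (not our actual tool) of that “for loop plus a rudimentary check” stage: promote a cookbook one environment at a time, test after each promotion, and stop dead on the first failure rather than relying on a frantic Ctrl-C. The command runner is injected, and the `check-chef-runs` helper is a placeholder for whatever “did the Chef run finish?” test you have.

```ruby
# Promote a cookbook through environments in order, halting on the first
# failed promotion or failed post-promotion check.
# `runner` is any callable taking a command string and returning true/false.
def promote_through(cookbook, environments, runner:)
  environments.each do |env|
    return [:promote_failed, env] unless runner.call("knife spork promote #{env} #{cookbook}")
    # Rudimentary safety check between environments (placeholder command)
    return [:check_failed, env] unless runner.call("check-chef-runs #{env}")
  end
  :ok
end
```

The win over the bare for loop is simply that a failure in staging means production is never touched.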
In time we decided we should make proper use of those tests. A shell script would no longer cut it, ASCII art or not. No, we needed a system we could trust to continuously deploy our cookbook code, gated on tests, with a proper queueing mechanism and relevant notifications upon failure.
Being decidedly not a ‘not invented here’ DevOps team, we investigated some open-source and COTS offerings, but ultimately found them not quite suitable or malleable enough for our needs. We decided to build our own.
And so SeaEye was born. OK, it’s a silly name, but it’s an amazing pun, and we already had another amazing pun, ApeEye, a system we use for deploying code, so it made sense.
SeaEye is a Rails app that runs on Docker, uses Sidekiq as a background job processor and an AWS RDS database as a backend. It is first and foremost an HTTP API, which just happens to have a nice(-ish) web frontend. This allows us to build command-line tools that poke and poll the API for various purposes.
Beneath the nice(-ish) facade is a hierarchy of stateful workflows, each corresponding to a Sidekiq job and represented as a finite-state machine using the workflow gem. The basic unit of work is the CookbookPush, which is made up of a number of sub-tasks, one for each environment to be pushed through. The CookbookPush is responsible for monitoring the progress of each sub-task, and only when one has successfully completed does it allow the next to run. It makes use of the Consul-based locks we described in a previous post to add an element of safety to the whole process.
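The real CookbookPush is a Sidekiq job built on the workflow gem; the dependency-free sketch below illustrates the same idea. A push holds an ordered queue of environments and only lets the next environment run once the previous sub-task has succeeded; a single failure halts the whole pipeline. The state names here are illustrative, not SeaEye’s actual states.

```ruby
# Minimal finite-state-machine sketch of a CookbookPush:
# environments are worked through in order, one sub-task at a time.
class CookbookPush
  attr_reader :state, :current

  def initialize(environments)
    @queue   = environments.dup  # environments still to be pushed through
    @current = nil
    @state   = :pending
  end

  def start_next!
    raise "push already #{@state}" if [:succeeded, :failed].include?(@state)
    @current = @queue.shift
    @state   = :running
  end

  def subtask_succeeded!
    # Only now may the next environment's sub-task run
    @state = @queue.empty? ? :succeeded : :waiting
  end

  def subtask_failed!
    @state = :failed  # halt: later environments are never touched
  end
end
```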
A CookbookPush can be initiated manually, but that is only half of the story. We wanted SeaEye to integrate with our development workflow. Like most Chef shops, we use Test Kitchen to test our cookbooks. Locally we test using Vagrant, and remotely using Travis-CI with the kitchen-ec2 plugin. We perform work on a branch and, once happy, merge the branch into master. What we’d traditionally do is then watch for the tests to pass before manually kicking off the CookbookPush.
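For readers unfamiliar with Test Kitchen, a stripped-down `.kitchen.yml` along these lines covers the local Vagrant case; the cookbook name and platform below are placeholders, not our actual configuration (on Travis the driver is swapped for kitchen-ec2 via a separate kitchen config file).

```yaml
# Illustrative Test Kitchen config: Vagrant locally
driver:
  name: vagrant

provisioner:
  name: chef_zero

platforms:
  - name: ubuntu-14.04

suites:
  - name: default
    run_list:
      - recipe[sag_logstash::default]
```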
We knew we could do better. So we added another stateful workflow, called the CI. The premise here is that SeaEye itself polls Github for commits against the master branch. If it finds one with a specific tag against it, it will automatically kick off a Travis build. Travis is then polled periodically for the success (or otherwise) of the build, and a CookbookPush is created for each cookbook concerned. The DevOps team are kept informed of the progress through Slack messages sent by SeaEye.
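One tick of that polling loop can be sketched as below. This is an illustration of the flow, not SeaEye’s code: the Github, Travis and Slack interactions are injected as callables, and the payload shapes are assumptions.

```ruby
# One poll cycle of the CI workflow:
# find a tagged master commit, check its Travis build, then either
# enqueue a CookbookPush per cookbook or report the failure.
def ci_tick(github:, travis:, enqueue_push:, notify:)
  commit = github.call  # latest tagged commit on master, or nil
  return :idle unless commit

  case travis.call(commit)  # :pending, :passed or :failed
  when :pending
    :waiting               # check again on the next tick
  when :passed
    commit[:cookbooks].each { |cb| enqueue_push.call(cb) }
    notify.call("Tests passed for #{commit[:sha]}; pushes queued")
    :pushed
  else
    notify.call("Travis build failed for #{commit[:sha]}")
    :failed
  end
end
```

Keeping the external services behind callables like this also makes the workflow trivial to exercise in unit tests.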
There are many ways to skin this particular CI cat, and many off-the-shelf products to help facilitate the skinning. Rolling our own happens to have worked well for us, but every team and business is different. We’ve since built a suite of command-line tools, and even integrated SeaEye with ChatOps. Hopefully our experiences will help inspire others facing similar problems.