Tackling scalability challenges in realtime multiplayer games with Akka and AWS

We hosted a tech event at our HQ last week and welcomed over 200 attendees to join us for an evening of talks and networking. It was an absolute blast to meet so many talented people all at once! We plan to host a series of similar events in the future so keep coming back here or follow us on Twitter (@SpaceApeGames) to listen for announcements.

We had three talks on the night, covering a range of interesting topics:

  • Scalability & Big Data Challenges In Real Time Multiplayer Games, by Yan Cui and Tony Yang, Space Ape Games
  • Advanced Machine Learning For Small Teams, by Atiyo Ghosh and Dennis Waldron, Space Ape Games
  • Serverless: The Next Evolution of Cloud Computing, by Dr. Steve Turner, Amazon Web Services

The recording of the talk Tony and I gave on building realtime multiplayer games is now online (see end of the post), along with the accompanying slides.

In this talk we discussed the market opportunity for realtime multiplayer games and the technical challenges one has to face, as well as the tradeoffs to keep in mind when making decisions such as:

  • do you deploy infrastructure globally or run it from one (AWS) region?
  • do you build your own networking stack or use an off-the-shelf solution?
  • do you go with a server authoritative approach or implement a lock-step system?
  • how do you write a highly performant multiplayer server on the JVM?
  • how do you load test this system?
  • and many more.

Over the next few weeks we’ll publish the rest of the talks, so don’t forget to check back here once in a while 😉

How to load test a realtime multiplayer mobile game with AWS Lambda and Akka

Tencent’s King of Glory is one of the top grossing games worldwide in 2017 so far.

Over the last 12 months, we have seen a number of team-based multiplayer games hit the market as companies look to replicate the success of Tencent’s King of Glory (known as Arena of Valor in the west), which is one of the top grossing games in the world in 2017.

Even our partner Supercell has recently dipped into the genre with Brawl Stars, which offers a different take on the traditional MOBA (Multiplayer Online Battle Arena) formula.

Supercell’s Brawl Stars offers a different experience to the traditional MOBA format: it is built with mobile in mind and favours simple controls and maps, as well as shorter matches.

Here at Space Ape Games, we have been exploring ideas for a competitive multiplayer game, which is still in prototype so I can’t talk about it here. However, I can talk about how we use AWS Lambda to load test our homegrown networking stack.

Why Lambda?

The traditional approach of using EC2 servers to drive the load testing has several problems:

  • slow to start : any sizeable load test would require many EC2 instances to generate the desired load. Since it costs you to keep these EC2 instances around, it’s likely that you’ll only spawn them when you need to run a load test, which means there’s a 10–15 minute lead time before every test just to wait for the EC2 instances to be ready.
  • wastage : when the load test is short-lived (say, < 1 hour) you can incur a lot of wastage because EC2 instances are billed by the hour with a minimum charge for one hour (per-second billing is coming to non-Windows EC2 instances in Oct 2017, which would address this problem).
  • hard to deploy updates : to update the load test code itself (perhaps to introduce new behaviours to bot players), you need to invest in the infrastructure for updating the load test code on the running EC2 instances. This doesn’t have to be difficult; after all, you probably already have a similar infrastructure in place for your game servers. Nonetheless, it’s yet another distraction that I would happily avoid.

AWS Lambda addresses all of these problems.

It does introduce its own limitations — especially the 5 min execution time limit. However, as I have written before, you can work around this limit by writing your Lambda function as a recursive function and taking advantage of container reuse to persist local state from one invocation to the next.

I’m a big fan of the work the Nordstrom guys have done with the serverless-artillery project. Unfortunately we’re not able to use it here because the game (the client app written in Unity3D) talks to the multiplayer server in a custom protocol over TCP, and in the future that conversation will happen over Reliable UDP too.

Akka

Our multiplayer server is written in Scala with the Akka framework. To help us optimize our implementation we collect lots of metrics about the Akka system as well as the JVM — GC, heap, CPU usage, memory usage, etc.

The Kamon framework is a big help here. It made quick work of getting us insight into the running of the Akka system: no. of actors, no. of messages, how much time a message spends waiting in the mailbox, how much time we spend processing each message, etc.

All of these data points are sent to Wavefront, via Telegraf.

We collect lots of metrics about the Akka system and the JVM.

We also have a standalone Akka-based load test client that can simulate many concurrent players. Each player is modelled as an actor, which simulates the behaviour of the Unity3D game client during a match:

  1. find a multiplayer match
  2. connect to the multiplayer server and authenticate itself
  3. play a 4 minute match, sending inputs 15 times a second
  4. report “client side” telemetry so we can collect the RTT (Round-Trip Time) as experienced by the client, and use this telemetry as a qualitative measure for our networking stack
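The four steps above can be sketched as a plain function. This is only an illustration of the lifecycle, not our load test client: it uses an ordinary Python function rather than an Akka actor, it does not pace the inputs in real time, and `StubServer`, `play_match` and the made-up RTT values are hypothetical. The match length and input rate are the figures from the list.

```python
MATCH_SECONDS = 4 * 60   # from the list above: a 4 minute match
INPUTS_PER_SECOND = 15   # from the list above: 15 inputs a second


class StubServer:
    """Hypothetical stand-in for matchmaking and the multiplayer server."""

    def find_match(self, player_id):
        return "match-1"

    def connect(self, match_id, player_id):
        return True  # pretend authentication succeeded

    def send_input(self, player_id, tick):
        return 70 + (tick % 5)  # made-up RTT in milliseconds


def play_match(player_id, server):
    """Run one simulated player through a whole match."""
    match_id = server.find_match(player_id)             # step 1
    if not server.connect(match_id, player_id):         # step 2
        raise ConnectionError("authentication failed")

    count, lowest, highest = 0, None, None
    for tick in range(MATCH_SECONDS * INPUTS_PER_SECOND):  # step 3
        rtt = server.send_input(player_id, tick)
        count += 1
        lowest = rtt if lowest is None else min(lowest, rtt)
        highest = rtt if highest is None else max(highest, rtt)

    # step 4: the "client side" telemetry for this player
    return {"playerId": player_id, "inputs": count,
            "minRttMs": lowest, "maxRttMs": highest}
```

At these rates each simulated player sends 4 × 60 × 15 = 3,600 inputs per match, which is why the way RTT samples are stored matters.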

In the load test client, we use the t-digest algorithm to minimise the memory footprint required to track the RTTs during a match. This allows us to simulate more concurrent players in a memory-constrained environment such as a Lambda function.
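To show why a compact summary helps, here is a sketch of a bounded-memory RTT tracker. Note that this is not t-digest: it is a much simpler fixed-bucket histogram, swapped in because it fits in a few lines. It trades accuracy (values are rounded down to the bucket size and clamped at `max_ms`) for a memory footprint that does not grow with the number of samples. The class name and the default bucket size and ceiling are assumptions.

```python
import math
from collections import Counter


class RttHistogram:
    """Fixed-bucket histogram; holds at most max_ms / bucket_ms + 1 counters."""

    def __init__(self, bucket_ms=1, max_ms=2000):
        self.bucket_ms = bucket_ms
        self.max_ms = max_ms
        self.counts = Counter()
        self.total = 0

    def record(self, rtt_ms):
        # Clamp outliers so the number of buckets stays bounded.
        clamped = min(max(rtt_ms, 0), self.max_ms)
        self.counts[int(clamped // self.bucket_ms)] += 1
        self.total += 1

    def percentile(self, p):
        """Return the lower edge of the bucket holding the p-th percentile."""
        if self.total == 0:
            raise ValueError("no samples recorded")
        rank = max(1, math.ceil(p * self.total / 100))
        seen = 0
        for bucket in sorted(self.counts):
            seen += self.counts[bucket]
            if seen >= rank:
                return bucket * self.bucket_ms
```

t-digest gives better accuracy at the tails for a similar footprint, which matters when the interesting numbers are the 99.9th and 99.99th percentiles.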

AWS Lambda + Akka

We can run the load test client inside a Java 8 Lambda function and simulate 100 players per invocation. To simulate X concurrent players, we can create X/100 concurrent executions of the function via SNS (which has a one-invocation-per-message policy).
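The fan-out arithmetic can be sketched as below. This is an assumption-laden illustration: `build_messages` and the `clientIndex` and `players` fields are hypothetical, and the actual publishing to SNS is left out so the sketch runs on its own. Each returned string stands for one SNS message, and therefore one execution of the load test client function.

```python
import json
import math

PLAYERS_PER_INVOCATION = 100  # from the post: 100 players per invocation


def build_messages(load_test_id, total_players):
    """Return one JSON message body per load test client execution."""
    invocations = math.ceil(total_players / PLAYERS_PER_INVOCATION)
    remaining = total_players
    messages = []
    for index in range(invocations):
        # The last execution picks up the remainder when X is not a
        # multiple of 100.
        players = min(PLAYERS_PER_INVOCATION, remaining)
        remaining -= players
        messages.append(json.dumps({
            "loadTestId": load_test_id,
            "clientIndex": index,   # hypothetical field
            "players": players,     # hypothetical field
        }))
    return messages
```

For example, 4,500 players would need 45 messages, and 250 players would need three, the last one carrying only 50 players.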

To create a gradual ramp up in load, a recursive Orchestrator function gradually dials up the no. of concurrent executions by publishing more messages into SNS, each triggering a new recursive load test client function.
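One way to plan such a ramp is sketched below. The linear shape, the function name `ramp_schedule` and its parameters are assumptions for illustration; the post does not prescribe a particular curve. Each entry in the returned list is the number of new messages the Orchestrator would publish at that step.

```python
def ramp_schedule(target_executions, steps):
    """Split the target no. of executions into a linear ramp of `steps` steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    plan = []
    started = 0
    for step in range(1, steps + 1):
        # How many executions should be running once this step completes.
        desired = (target_executions * step) // steps
        plan.append(desired - started)
        started = desired
    return plan
```

The entries always add up to the target, so the final step brings the test to full load regardless of rounding.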

A LoadTest function, triggered by API Gateway, allows us to easily kick off a load test from a Jenkins pipeline.

Using the push-pull pattern (see this post for detail), we can track the progress of all the concurrent load test client functions. When they have all finished simulating their matches, we’ll kick off the Aggregator function.

The Aggregator function collects the RTT metrics published by the load test clients and produces a report detailing the various percentile RTTs:

{
  "loadTestId": "62db5790-da53-4b49-b673-0f60e891252a",
  "status": "completed",
  "successful": 43,
  "failed": 2,
  "metrics": {    
    "client-interval": {      
      "count": 7430209,
      "min": 0,
      "max": 140,
      "percentile80": 70.000000193967,
      "percentile90": 70.00001559848,
      "percentile99": 71.000000496589,
      "percentile99point9": 80.000690623146,
      "percentile99point99": 86.123610689566
    },    
    "RTT": {      
      "count": 744339,
      "min": 70,
      "max": 320,
      "percentile80": 134.94761466541,
      "percentile90": 142.64720935496,
      "percentile99": 155.30086042676,
      "percentile99point9": 164.46137375328,
      "percentile99point99": 175.90215268392
    }
  }
}
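The aggregation step can be sketched as a merge of per-client summaries. This is a simplified illustration rather than our Aggregator: it assumes each client publishes a plain histogram (a mapping of RTT in milliseconds to sample count) instead of a t-digest, and the function name `aggregate` is hypothetical. The report keys follow the naming used in the report above.

```python
import math
from collections import Counter

PERCENTILES = {"percentile80": 80, "percentile90": 90, "percentile99": 99}


def aggregate(client_histograms):
    """Merge {rtt_ms: count} histograms and summarise them in one report."""
    merged = Counter()
    for histogram in client_histograms:
        merged.update(histogram)  # adds counts for matching RTT values

    total = sum(merged.values())
    if total == 0:
        raise ValueError("no samples to aggregate")

    values = sorted(merged)
    report = {"count": total, "min": values[0], "max": values[-1]}
    for key, p in PERCENTILES.items():
        rank = max(1, math.ceil(p * total / 100))
        seen = 0
        for value in values:
            seen += merged[value]
            if seen >= rank:
                report[key] = value
                break
    return report
```

Because the summaries are merged before the percentiles are computed, the result reflects all samples across clients, which is not the same as averaging each client's own percentiles.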

If you would like to learn more about the technical challenges in developing successful mobile games, come join us for an evening of talks, drinks, food and networking in our office on the 12th Oct.

We’re running a free event in partnership with AWS where we will talk about:

  • the opportunities and challenges in building a realtime multiplayer game
  • data science and machine learning
  • serverless with AWS Lambda (by Dr Steve Turner from AWS)

Get your free ticket here!