Real-time Kubernetes: How Entain Australia 10x'd throughput with Linkerd

(Note: this transcript has been automatically generated with light editing. It may contain errors! When in doubt, please watch the original talk!)

Hello, my name is Stephen Gray and I’m here today to talk about real-time Kubernetes and how Entain in Australia achieved 10x the throughput with Linkerd.

About Entain [min 00:11]

Entain is a FTSE 100 company and one of the world’s largest sports betting and gaming groups operating in the online and retail sectors. We operate all across the world with brands and teams in dozens of countries globally, and Australia is one of them.

As the Head of Trading Solutions here in Australia, I lead the team of engineers that built and operates our trading platform, pairing our regional business. We’ve transitioned over the last few years from a non-agile release cycle to a lean-agile management practice with ownership and self-support of our solutions. We use chaos engineering, automated tests, continuous deployments, and even spot nodes to optimize our cost base.

The engineering team and the trading platform [min 00:55]

Our team is responsible for managing the torrent of data coming into the business, reacting to it in real-time. Various data providers from all over the world feed us information continuously as sports and races happen all across the world. These come into our trading platform that we’ve built here in Australia.

Our trading team interacts with that platform as their main line of business tool. This then feeds into our transactional platform and our customers interact with that platform. So, specifically, what we do, is this piece here in the middle.

Our software stack is based on very contemporary tools. We use Go as our main programming language. For messaging and distribution of data, we use Apache Kafka, and MongoDB is our main data store. But we do use other technologies such as Python, R, RabbitMQ, and even .NET.core

The entire platform is based on cloud native technologies. We host on Kubernetes, we are using Linkerd as our service mesh, all of our network communications uses gRPC, and we use tracing instrumentation using the Jaeger and Open Tracing projects, and we do our deployments with Helm.

In terms of key metrics about our team: we operate five clusters, there are 250 plus cloud services in that environment, there are zero relational databases, we have hundreds of deployments of our software managed by a team of 13 engineers of which only one is a dedicated DevOps engineer.

Entain’s microservices journey [min 02:35]

It wasn’t always this way. Our microservices journey started off with a single application. It powered the entire business, all of our customer-facing applications but also the entire back of house trading operation. It was so large and unassailable we called it “the monolith.” It was 1.7 million lines of PHP. Almost none of that code was actually testable.

Then we began our microservices journey. In 2018, we had that single platform, our monolith, and it ran on about 100 commodity servers in two data centers that we operated in Sydney. And later that year, we started our journey to improve our data handling pipeline and move away from this monolith. This became three services. These services were focused on taking our sports data away from our legacy platform. It enabled us to break away and start iterating on those components in an agile fashion, away from the release train of the main platform.

At this point in time on our journey, most of the work was actually going into supporting the illusion and keeping up compatibility with the old platform. By mid-2019, these three services had become more than 20 and they were growing as we started to find our feet with microservices design. Incredibly, we were still keeping up the illusion of compatibility with the old platform. Throw in a merger and now suddenly we found we had 40 services but we’re also now driving the same content to multiple customer-facing solutions, our legacy, and new platforms.

That was a flex that was only possible because we hadn’t undertaken this work. In terms of the community’s practice, it was really good for us because the new customer-facing platform was also written entirely in Go and, eventually, the monolith went away.

The current platform [min 04:28]

Today, though, if you look just at our system, we’re around 347 services — just for our team. The wider Australian technology team also has a similar footprint again for the customer-facing side of the show.

In terms of our service mesh design, it’s actually quite chaotic. Here’s a preview from Buoyant Cloud showing the topology map of our services. But, if we zoom in on just a small section of it, you can see that these tiny little blocks are the individual services that make up our fabric.

We process around 10 billion external updates a month and we need to do it reliably, as fast as possible. If our system runs slowly, if our prices are behind the gameplay for sporting events, that’s bad for us. And, if our system is down, effectively, so is our business. So, as a team, we’re in a continuous and rolling fight against latency in all forms. If you look here at this map (courtesy of submarine cable map), you can see that the path from a stadium in Italy to us in Australia is not quite a straight line. And this is the best case. Quite often we find data routes via the United States, even when coming from Europe! So, it’s very much a challenge to keep things sub-second.

Downtime is not an option [min 05:46]

Given the nature of our work, it does mean that we have zero downtime, no planned outages are allowed. We have to have high-performance updating of the prices sub-second, every second, 247.

As you grow a microservices stack, there are a few interesting areas and questions that come up. For us the three major ones were: how do our services find each other in this topology? How do we manage the governance of what service does what? And when does a service need to be deprecated or removed? And also how do we observe and measure our services to make sure that we understand what’s working and what’s not?

Service discovery and connectivity [min 06:37]

Over the years, we actually used a variety of ways to approach service discovery and connectivity between our services. But, as we transitioned to Kubernetes, we transitioned using the Kubernetes logical services. Service A would talk to service B by referencing its service name and route to the individual pods of service B. We can see in this example that pod A is balancing across pods X and Y from service B.

Now we use gRPC for our backbone protocols, and this means that unlike HTTP1, we have fairly long-lived connections. Once gRPC has established a pool of connections, typically these will keep being reused for a fairly long time. What we found was that the individual pods would talk to other pods of other services, because the connections were being reused rather than balancing across multiple instances. With a large enough number of pods on both services, you roughly end up with an even distribution, so it doesn’t sound too bad at this point.

Load balancing with gRPC [min 08:01]

Now, where things got a bit more exciting was, as we rolled out code, we’d start gradually rotating through the instances of service B and we’d shut them down and start a new one. What tended to happen was, the very first service that came up on the new deployment, would start receiving almost all of the connections. As we rotated through the pods, all the connections moved to pod X and that meant that service A was now heavily biased towards one instance.

What would happen here is that we’d see a high load for a period of time after every release. This isn’t insurmountable so, at this point, we’d look at the graphs and go “ah, the traffic’s not balanced” and we would restart service A. Service A would restart and each instance of service A would balance cleanly across all of the instances of service B. The trouble then became that things that now talk to service A themselves, ended up being biased towards particular pods and so we’d roll those and so on and so on. Eventually, it became impractical.

Then we started bulk processing [min 09:00]

So, c’est la vie. We ended up in a situation where we were just letting things settle on their own. We’d do a deployment and suddenly servers would reach for the sky and after a while connections would eventually just balance out.

One day we had to reload a large amount of data, though. We had to process roughly two weeks worth of data in the space for a couple of hours. This went poorly. When we started bulk processing, we suddenly found that there was a lot of pressure in our system. The load was massively imbalanced across our topology and the latency started spiking — this is not a good thing. We tried everything, we rolled services, we restarted deployments, we scaled up, we scaled down. But no matter what we did, we ended up frequently with a chain of hot paths through the system. This would cause our total throughput to vary between near zero and nowhere near fast enough — no amount of restarts would eventually help us.

As we started looking at this issue, the work became more and more critical. Hours became days, we started working late, we start to look at alternatives, including writing our own custom gRPC connection pooling code. Something that would forcibly balance the traffic between the pods of our services. This is something we’ve done before.

Have we considered a service mesh? [min 10:27]

We tried loading in the data during quiter times and, for a week, we basically lived on caffeinated drinks in prayer and then one of our team asked: “Have we considered a service mesh?” We had not. We got to around midnight on day four or five here, and we started looking at the various options for a service mesh.

Our team being quite small, we had to choose things that ticked a couple of different boxes. Given the number of services we had, we had to have a solution that had no code or configuration change required, because making code changes to 300 plus services and then rolling it back out, if it doesn’t go well, is something that’s very hard to do.

It had to be something that was simple to install and easy to operate. We didn’t want to hire an extra person just to look after it. But it also had to balance the traffic better than we could do ourselves. We found an article on the Kubernetes site, after a furious 5-10 minutes of Googling where we found William Morgan from Buoyant talking about gRPC load balancing on Kubernetes and he was talking about our specific problem which was awesome.

And so, after a bit of discussion within the team, some prototype installs of the various options we had and lots of cursing and configuration that we had to do, we kept coming back to that article over and over again. We’d found our front runner, and that was Linkerd.

Giving Linkerd a try [min 12:00]

Before we went into a large-scale enterprise deployment, we took a lot of steps. We joined their Slack channel, talked to a few people. Nobody seemed to think this wouldn’t work but, obviously, we were going from naught to 60 miles an hour very quickly on this.

After a bit of agonizing about the risk of doing this in a production system, we ultimately decided that, because we could uninstall and reinstall very easily, we gave it a shot. In terms of the deployment process, we use Spinnaker for all our deployments. We use the two commands to install Linkerd to the cluster. That was roughly 15 minutes of the process most of which was spent reading the documentation because the installation process was essentially two commands.

The next challenge we had was, for our 340 plus services, we had to redeploy them with some extra annotations to opt them into the service mesh. Now that took us a couple of hours because we had to go through one by one restarting things and doing deployments. But it was far simpler than doing code level changes. Once we got everything installed and running, and we saw our pods start turning up in the mesh, we just let it sit for a few hours because it’d been a long few days and also to see if anything happened under baseline load. We also caught up on some sleep.

The baseline improved dramatically [min 13:20]

After a few hours of letting things sit and without any problems, we decided to kick the tires and start loading the data. From this graph, see if you can see when we started loading. It’s pretty impressive in terms of a change. Our baseline was dramatically improved.

Instead of saturating the 20 gigabit links on just a few of our servers, we were suddenly processing a network line rate on all of our servers. The CPU and load evened out across the cluster and we just kept on trucking. In fact, the first thing we said after we saw the metrics: it’s faster than it’s ever been, enough so that we were worried it might not be correct.

Our hail Mary attempt to fix the performance issue worked — first time. But we didn’t really understand at the time what really had happened or how we’d got these gains. Turns out that because Linkerd was performing the connection pooling and rebalancing for us, our applications were balancing at the request level for each individual call, while at the application level, still believing they had a pool of reused connections. This evened out the workload for all intents and purposes perfectly.

We could suddenly restart services, terminate them, we could go back to our chaos engineering practices, all without interrupting the flow of data in real-time. The way it works is by injecting a proxy into each of your pods. These proxies communicate with each other directly bypassing the other service discovery mechanisms and load balancing that are built into Kubernetes. The Linkerd control plane will watch for your pods being started and stopped using the Kubernetes API and maintain a rolling database of all the endpoints available for a service. Each service, when talking to other services, will use the data from the control plane to intelligently balance and wait towards particular targets.

And suddenly our bills went down [min 15:35]

Our application code was entirely unaware that anything had changed. Like many people, we operate our application in multiple zones within a geographic region. One of the surprising costs for many companies are that the cost of moving data between the individual zones inside a regional city is quite expensive. When we installed Linkerd, one of the things we found within the first couple of days was our bills were coming down. It turns out that because Linkerd is biasing by latency, there is a bias towards traffic staying within the same geographic zone versus spreading out across the geographic area. This alone in retrospect would have been worth the exercise.

Aside from the cost savings, we had a few other small wins. We had service metrics out of the box, so we could get rid of our own metrics collection for the paths from service A to B to C. There was instrumentation and tracing support which was the start of our Jaeger journey and those traces could span across our network proxies and track the flow between services essentially for free.

Additionally, we had mutual TLS support for all communications within the pod enabling us to encrypt everything that was going on inside the cluster, whereas previously we were more or less biased towards doing things at the ingress level.

The reduction in the peak CPU load lets us do things like re-enabling network compression, so our payloads can be further compressed, further reducing bandwidth costs.

What’s next [min 17:07]

In terms of our journey onward, some of the major forward-looking changes we have left to come for our infrastructure include deploying ChaosMesh more fully. We’ve got some stuff working with it now, but we want to take that to the next level. And we’re also continuing our journey with Linkerd. We’re going to switch from having three separate cross zone clusters to having isolated individual clusters per geographic area. We’re also looking at the network policies changes in version 2.11 to secure traffic further within our cluster.

In summary, though, we have a lot of services, we need to do things very fast, and when we’re using gRPC or other long-lived RPC protocols within Kubernetes, it’s very easy to end up out-of-the-box with hot paths in your platform and inefficient load balancing.

We were able to deploy a service mesh to production to hundreds of services with zero prior experience in the space of a few hours, and we reaped immediate operational and cost economy benefits. More information about this is documented over on the CNCF blog where myself and one of my colleagues have written an article talking about this issue.

Additionally, I’m a Linkerd Ambassador and I’m often available on Slack to discuss and help people on their journeys to service mesh. Thank you!