
Entain Australia, a leading sports betting and gaming company, used Linkerd to tame gRPC. Within several months, the organization reduced max server load by over 50% and increased request capacity tenfold, all while getting more nights of uninterrupted sleep.

Backdrop: When latency costs money

Entain is a leading global sports betting and gaming company. It operates iconic brands, including Ladbrokes, Coral, BetMGM, bwin, Sportingbet, Eurobet, partypoker, partycasino, Gala, and Foxy Bingo. The company is listed on the FTSE 100, has licenses in more than 20 countries, and employs a global workforce of more than 24,000 people.

For Entain, speed is everything. Latency literally costs money, especially when you’re on the other side of the planet from most of the games taking place. When Messi scores in Barcelona, Entain’s pricing management systems must process the data within milliseconds.

To tackle this immense challenge, Entain Australia built a data feed processing platform on modern, cloud native technologies including Kubernetes, gRPC, containers, and a microservices approach. These tools allowed the company to build a high-performance, reliable, and scalable system—but it wasn’t perfect.

Initially, the interaction between Kubernetes and gRPC around load balancing caused some parts of the system to run “hot” while other parts ran “cold”. This disrupted the customer experience, created financial risk whenever updates slowed down, and caused too many sleepless nights for the team.

Linkerd – Taming gRPC for Entain Australia

Entain Australia’s Trading Solutions Team is responsible for handling the immense amount of data that comes into the business, managing the pricing systems, and making results available to the rest of the platform as quickly as possible.

Hosted on Kubernetes, Entain’s Australian sports trading platform consists of approximately 300 microservices with over 3,000 pods per cluster in multiple geographic regions.

Handling thousands of requests per second, the platform updates the prices and status for live and upcoming sports and racing events. Any delay in processing, even a small one, impacts revenue, user experience, and price accuracy, so a latency hit can have massive consequences. Why? Entain generates revenue from sports prices that are based on the probability of an outcome, so the user experience must be as close to real time as feasible. You can’t place bets on events that you already know the result of!

A critical challenge for the Trading Solutions Team was that their environment is constantly changing. Despite running the infrastructure 24x7, engineers use AWS spot instances to keep costs low. Chaos engineering tools and practices also ensure a resilient platform and applications. “With nodes coming and going, we needed a way to ensure reliable application performance, despite all the changes under the hood,” said Steve Gray, Entain Australia’s Head of Trading Solutions.

"Although our developers were quick to seize on the efficiency and reliability gains provided by microservices and gRPC, the default load balancing in Kubernetes didn’t give us the best performance out-of-the-box and left us in a position where all requests from one pod in service would end up going to a single pod on another service," — Steve Gray, Head of Trading Solutions

While this approach worked, it had two negative effects on the platform. First, servers had to be exceptionally large to accommodate huge traffic volumes. Second, the team couldn’t make use of horizontal scaling. This prevented Entain from taking advantage of the available spot instances while still processing a high volume of requests in a timely manner.
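The "hot" and "cold" effect described above can be illustrated with a small simulation. This is a minimal sketch and not Entain's code or Kubernetes' actual implementation: it assumes each client pod opens one long-lived gRPC connection, that the connection is assigned to a server pod once at connect time, and that all later requests reuse it. The function names, the client and server counts, and the use of a random pick for both strategies are illustrative assumptions.

```python
import random
from collections import Counter


def connection_level(num_clients, num_servers, requests_per_client, rng):
    """Each client picks one server when it connects and reuses that connection."""
    load = Counter({s: 0 for s in range(num_servers)})
    for _ in range(num_clients):
        server = rng.randrange(num_servers)   # chosen once, at connect time
        load[server] += requests_per_client   # every later request follows it
    return load


def request_level(num_clients, num_servers, requests_per_client, rng):
    """Each request is balanced independently of the connection it arrived on."""
    load = Counter({s: 0 for s in range(num_servers)})
    for _ in range(num_clients * requests_per_client):
        load[rng.randrange(num_servers)] += 1
    return load


rng = random.Random(42)
# Hypothetical sizes: 4 client pods calling a service backed by 8 server pods.
pinned = connection_level(4, 8, 10_000, rng)
spread = request_level(4, 8, 10_000, rng)

print("connection-level max/min:", max(pinned.values()), min(pinned.values()))
print("request-level    max/min:", max(spread.values()), min(spread.values()))
```

With connection-level balancing, at most four of the eight server pods ever receive traffic, so adding more server pods does nothing for the busy ones. That is why the servers had to be large and why horizontal scaling did not help.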

Another issue was a lack of intelligent routing. To hit its ambitious availability targets, the company spans Kubernetes clusters across multiple AWS AZs (availability zones), ensuring no one AZ is a single point of failure for the platform.

“For more casual Kubernetes users this isn’t a problem,” said Gray. “But at Entain’s scale and request volume, cross-AZ traffic began to represent a tangible source of both latency and cost. Data that needlessly crossed an AZ boundary would slow down performance of the platform and incur additional charges from AWS.”

To help address these issues, Gray’s team began looking into a service mesh approach. After considering a variety of solutions, they chose Linkerd, the lightweight, ultra-fast service mesh.

Why Linkerd? Turnkey, easy, and it just works

Linkerd stood out for several reasons.

First, Gray’s team needed a mesh that was Kubernetes-native and would work with their current architecture without having to introduce a large number of new custom resource definitions or force a restructuring of its applications or environments.

Second, the team lacked the bandwidth to learn the ins and outs of a complicated system. “Linkerd was ideal because it is easy to get up and running and requires little overhead,” said Gray. “It took us five to six hours to install and configure and migrate 300 services to the mesh. It’s just the go-get-command and then the install process, and job done! It’s that simple. We installed it and have rarely had to touch it since—it just works.”

Gray’s team also found the Linkerd Slack community and Linkerd’s docs to be extremely helpful as they moved into production—a process that happened overnight.

With Linkerd, the technology and business gains came quickly

Once Linkerd had finished rolling out to all the Kubernetes pods, the Linkerd proxy took over routing requests between instances. This allowed the trading platform to immediately route traffic away from pods that were failing or being spun down.

Immediate improvements in load balancing were also realized. Linkerd’s gRPC-aware load balancing immediately fixed the issues with gRPC load balancing on Kubernetes and started balancing requests properly across all destination pods.

This allowed Gray’s team to achieve two key business gains. “We increased the volume of requests the platform could handle by over tenfold and now use horizontal scaling to add more smaller pods to a service,” explained Gray. “The latter gave us access to a broader range of AWS spot instances so that we could further drive down our compute costs—while delivering better performance to our users.”

In that same vein, Gray’s team saw an unexpected side benefit. Kubernetes’ native load balancing chooses endpoints in a round-robin fashion, essentially rotating through an endpoint list to distribute load. It’s an arbitrary process that can send a request to any endpoint in the cluster without considering latency, saturation, or proximity to the calling service.

With Linkerd, however, proxies consider all potential endpoints and select the “optimal” traffic target based on an exponentially weighted moving average (EWMA) of latency. When the team introduced Linkerd to its clusters, they began to see faster response times and lower cross-AZ traffic costs. Because same-AZ endpoints tend to respond faster, the service mesh’s built-in EWMA routing algorithm automatically keeps more traffic inside an AZ, cutting bandwidth costs by thousands of dollars a day. All without Entain’s platform team needing to configure anything!
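To make the latency-based selection described above more concrete, here is a minimal sketch of EWMA-driven endpoint choice. It is not Linkerd's implementation. The class and function names, the smoothing factor, the zone layout, and the latency figures are all assumptions made for illustration, and the sketch samples two random candidates per request and keeps the one with the lower EWMA, a simplification chosen here so that slower endpoints still get probed occasionally.

```python
import random


class EwmaEndpoint:
    """Tracks an exponentially weighted moving average of observed latency."""

    def __init__(self, name, zone, alpha=0.3):
        self.name = name
        self.zone = zone
        self.alpha = alpha      # weight given to the newest sample (assumed value)
        self.ewma_ms = None     # no estimate until the first response arrives
        self.requests = 0

    def observe(self, latency_ms):
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms
        self.requests += 1

    def score(self):
        # Untried endpoints score 0 so they are probed at least once.
        return 0.0 if self.ewma_ms is None else self.ewma_ms


def pick(endpoints, rng):
    """Sample two distinct endpoints and keep the one with the lower EWMA."""
    a, b = rng.sample(endpoints, 2)
    return a if a.score() <= b.score() else b


def simulate(num_requests=10_000, seed=7):
    rng = random.Random(seed)
    # Hypothetical topology: the calling pod lives in zone "a".
    endpoints = [
        EwmaEndpoint("pod-1", "a"), EwmaEndpoint("pod-2", "a"),
        EwmaEndpoint("pod-3", "b"), EwmaEndpoint("pod-4", "c"),
    ]
    # Assumed mean latencies: same-zone calls are faster than cross-zone calls.
    base_ms = {"a": 1.0, "b": 2.5, "c": 2.5}
    for _ in range(num_requests):
        ep = pick(endpoints, rng)
        ep.observe(base_ms[ep.zone] * rng.uniform(0.8, 1.2))
    return endpoints


endpoints = simulate()
same_zone = sum(e.requests for e in endpoints if e.zone == "a")
total = sum(e.requests for e in endpoints)
print(f"same-zone share: {same_zone / total:.0%}")
```

Half of the endpoints in this sketch are in the caller's zone, so a latency-blind balancer would send about half of the traffic across zones. Here, any sampled pair that includes a same-zone pod resolves in its favor, so most requests stay local without any zone-aware configuration.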

“The difference was night and day,” said Gray. “Our applications began spreading the load evenly between application instances and response times went down. We could see it right away in our monitoring. The bandwidth decreased and the CPU just leveled out everywhere—an improvement across the board. We were so excited; we were literally jumping around. With Linkerd, we went from a fraction of our servers going flat out at their line speed (20gbit/s) to an even balance, with no server hitting above 9gbit/s sustained. Linkerd really made a difference.”

Linkerd day-to-day—runs like a utility service

Within a week of trying Linkerd, Entain Australia was able to take it into large-scale and highly performant Kubernetes clusters. The trading solutions team now runs over 3,000 pods in a single namespace in Kubernetes—a massive deployment that usually requires a team to manage. Yet only one person manages the entire infrastructure.

As Gray explained: “We’ve adopted Linkerd without needing to become experts in service meshes or any particular proxy. In fact, Linkerd just sits in the background and does its job. It’s like with the electricity or water supply. When it’s working well, we really don’t think about it and just enjoy the benefits of having electricity.”

Linkerd has met all of Entain’s needs while keeping things simple. It solved the gRPC load balancing problems and augmented standard Kubernetes constructs in a way that allowed Gray and his team to move forward without reconfiguring applications. “It solved our problems, including some of those we didn’t realize we had.”

Entain never stops taking bets (and the team can sleep again)

Everything that Gray’s team does goes back to reliability, availability, scalability, and cost reduction or control. That is what matters, and it is the yardstick by which they measure everything. Whenever there is a service or application failure and Entain stops taking bets, the business suffers a direct financial impact. “We have to make sure that we’re available, reliable, and rock-solid and Linkerd is a part of our strategy for achieving that,” said Gray.

"Because of Linkerd, we’ve been able to easily increase the platform capacity by over 10x, reduce operating costs, and better hit our availability targets—while continuing to iterate on our business differentiators." — Steve Gray

Linkerd also helped the team from a personal standpoint: they finally get to sleep again. “No more being woken up in the middle of the night is perhaps the most valuable feature of the service mesh.”

Want to give Linkerd a try? You can download and run the production-ready Buoyant Enterprise for Linkerd in minutes. Get started today!