The Trouble with Topology Aware Routing, Part II: Introducing High Availability, Zone-Aware Load Balancing

William Morgan

In Part I of this post, The trouble with Topology Aware Routing: Sacrificing reliability in the name of cost savings, I described Topology Aware Routing (TAR), a Kubernetes feature that restricts traffic within an availability zone from crossing to other zones. This feature is relevant to Kubernetes users deploying multi-zone clusters to the cloud, because it can cut cloud spend, sometimes dramatically. Cloud providers like AWS and Google charge for traffic across zones, and since Kubernetes balances traffic as evenly as possible, high-traffic clusters can rack up cross-zone traffic charges very quickly. By restricting traffic from crossing zone boundaries, TAR can cut this aspect of cloud spend.
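
For reference, TAR is enabled on a per-Service basis with a single annotation. A minimal sketch, using a hypothetical my-app Service (on clusters older than Kubernetes 1.27, the equivalent annotation is service.kubernetes.io/topology-aware-hints):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    # Ask Kubernetes to keep traffic for this Service within the zone
    # whenever its safeguards consider it safe to do so.
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```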

But TAR is not without drawbacks: it can lead to worse overall system reliability. Because it restricts traffic from crossing zones, when it is enabled, the pods in one zone cannot be used to compensate for failures in other zones. In Part I, we saw an extreme example. I deployed a basic app running on a multi-zone cluster, induced a failure in one zone, and showed that with TAR enabled, the application in that zone was unable to recover. Because clients could not reach functioning pods in other zones, the failure persisted and the success rate in the zone dropped to 0%, even though there were healthy pods elsewhere in the cluster.

If using TAR can reduce reliability, what does this mean for Kubernetes adopters who want the best of both worlds—minimal cloud costs and maximal reliability? In this post, I'll describe a way to accomplish exactly that. But first, we need to discuss the different ways things can fail.

The follies of frivolous failures

There was some good discussion in response to Part I about the different types of failure in a distributed system. I demonstrated one type of failure which TAR couldn't handle. But TAR is resilient to other types of failures. Which types? Generally speaking, like Kubernetes itself, TAR is resilient to the types of failures that can be captured by liveness and readiness probes. This includes pods crashing, pods not starting, and pods reporting themselves as unhealthy or unready to handle traffic.

For example, TAR will automatically disable itself when the pod count for a zone drops below 3. If TAR is enabled on a cluster and the pods in a zone die, or report themselves as unhealthy so that Kubernetes takes them out of the pool, then TAR will disable itself and allow cross-zone traffic. This is one example of TAR's "safeguards"—you can read the full list in the TAR docs—which are designed to allow cross-zone traffic in situations where the system may be unhealthy.
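
You can watch these safeguards in action on the Service's EndpointSlices: when TAR is active, each endpoint carries a hints.forZones entry, and when a safeguard trips, the hints disappear and normal cross-zone balancing resumes. A trimmed (and illustrative) example of what kubectl get endpointslices -o yaml shows while TAR is active:

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
endpoints:
  - addresses: ["10.0.1.17"]
    zone: us-east-1a
    hints:
      # TAR is active: clients in us-east-1a will be routed only to
      # endpoints hinted for us-east-1a.
      forZones:
        - name: us-east-1a
```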

The use of liveness and readiness probes is called out-of-band health checking, as the check happens outside the actual work of the application. Out-of-band health checking is central to Kubernetes, but it is limited in the types of failures it can detect. For example, it cannot help with:

  • A buggy deploy is shipped, and the app starts returning errors to its callers.
  • The app talks to a database, but the database is broken or non-responsive, and the app relays that error to callers.
  • More generally, the app talks to another service, and that dependent service is broken or slow, and so the app relays the error (or times out, and returns an error) to callers.

These are all situations where the application will report (correctly) that it is "healthy" via health checks, but the system as a whole, of course, is unhealthy, and the end-user will see errors.
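
For reference, here's what out-of-band health checking looks like in practice: liveness and readiness probes on the pod spec, polled by the kubelet outside the request path. (The paths and ports here are placeholders.)

```yaml
containers:
  - name: my-app
    image: my-app:latest
    livenessProbe:
      # If this fails repeatedly, Kubernetes restarts the container.
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      # While this fails, Kubernetes removes the pod from Service endpoints.
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```

Note that neither probe ever sees the responses the application returns to its real callers, which is exactly why the failures above slip through.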

To handle these situations, you need in-band health checks. In-band health checking involves monitoring the actual response between client and server and determining whether these responses are successes or failures. Linkerd, for example, does in-band health checking—because it sees the actual response between client and server and can parse protocols like HTTP and gRPC, it can calculate the health of an application (typically as a real value called a "success rate"), and can use this measure to make decisions about application health. (Note that health in this case also becomes a continuous value rather than a binary measure.)
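
To make "success rate" concrete, here's a toy sketch (not Linkerd's actual implementation) of how an in-band health check might fold every observed response into a continuous health value, using an exponentially weighted moving average:

```go
package main

import "fmt"

// successRate tracks endpoint health as a continuous value in [0, 1],
// updated from every real response observed in-band.
type successRate struct {
	value float64 // current estimate
	decay float64 // weight given to each new observation, e.g. 0.1
}

// observe folds one response into the running estimate. A 5xx or protocol
// error would count as a failure; everything else as a success.
func (s *successRate) observe(ok bool) {
	sample := 0.0
	if ok {
		sample = 1.0
	}
	s.value = (1-s.decay)*s.value + s.decay*sample
}

func main() {
	sr := &successRate{value: 1.0, decay: 0.1}
	for i := 0; i < 20; i++ {
		sr.observe(true) // steady state: healthy
	}
	for i := 0; i < 5; i++ {
		sr.observe(false) // a burst of errors from a buggy deploy
	}
	fmt.Printf("success rate: %.2f\n", sr.value) // drops well below 1.0
}
```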

An example difference between in-band and out-of-band health checks, showing a situation where the application is "healthy" from the Kubernetes perspective but "unhealthy" from the Linkerd perspective

In Part I, we saw that TAR couldn't handle the situation we created because it was a class of failure that out-of-band health checking couldn't capture. Is the reliance on out-of-band health checks a failure of TAR? Or a failure of Kubernetes as a whole? I would argue no. If anything, it's a consequence of one of Kubernetes's strengths: a well-defined, clearly-delineated scope. In scope for Kubernetes? L4 network handling and out-of-band health checks for pod lifecycle management. Out of scope: L7 network handling, in-band health checks, and deeper traffic management.

That said, the failures that require in-band health checking to detect are not academic or obscure. They are very real situations commonly encountered in distributed systems, and any resilient system must be able to handle them. This is part of what makes service meshes so valuable.

Side note: liveness and readiness probes

At this point you might be asking: could out-of-band health checks capture in-band failures? For example, could the liveness endpoint of an application exercise the code involved in constructing a response to the caller? Even better, could it make network calls to validate that its dependencies are healthy, only reporting itself as healthy if the entire call graph is healthy?

Unfortunately, in all but the simplest cases the health check can't exercise every possible input and code path that might result in an error response. And exercising the call graph as part of the health check is a recipe for disaster: when a dependent service is slow, does the health check hang? When the dependent service fails, does the health check fail, and the pod get booted by Kubernetes, eventually going into CrashLoopBackOff? What if it was a transient failure? Do you end up with pods going in and out of existence as your database struggles?
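
To see why, compare a self-contained liveness handler with one that drags a dependency into the check. A sketch of the anti-pattern (the db handle is a hypothetical stand-in for any dependency):

```go
package main

import (
	"database/sql"
	"net/http"
)

var db *sql.DB // initialized elsewhere; stands in for any dependency

func main() {
	// Reasonable: reports only on the state of this process.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Anti-pattern: couples this pod's liveness to the database. A slow
	// database makes the probe hang; a transient outage gets every pod
	// restarted, amplifying the failure instead of containing it.
	http.HandleFunc("/healthz-bad", func(w http.ResponseWriter, r *http.Request) {
		if err := db.Ping(); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```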

Because of these issues and more, health checks should be designed to summarize the state of the application itself, and no more, or risk catastrophe. The Kubernetes liveness probe documentation starts with a list of cautions and warnings about this, including "Liveness probes must be configured carefully to ensure that they truly indicate unrecoverable application failure, for example a deadlock" and "Incorrect implementation of liveness probes can lead to cascading failures." (Eep!)

So in practice, out-of-band health checks cannot capture these types of failures, and must be combined with in-band health checks to handle them.

If you liked it then you shoulda put a mesh on it

To address the limitations of topology aware routing (and out-of-band health checking as a whole), we'll need to move to the world of service meshes. Like TAR, a service mesh knows whether the source and destination of traffic lie within the same zone and can make a decision about whether this should be allowed.

Unlike TAR, however, a service mesh can make the decision about whether to allow cross-zone traffic in a more powerful way. This is because it can take advantage of signals and capabilities that are unavailable to Kubernetes itself, including:

  1. In-band health checking. A service mesh can parse protocols like HTTP and gRPC to determine whether the responses are failures or successes. This allows it to do in-band health checking and detect the class of failure we're interested in (at least for systems using these protocols).
  2. Fine-grained traffic control. A service mesh can route individual requests, not just connections, and so can take a much more nuanced approach to cross-zone traffic. TAR operates in binary fashion: it either denies all connections across zone boundaries or permits them all. A service mesh, by contrast, can "ease" traffic between zones on a per-request basis, allowing a few requests across zones if a service starts to struggle, then further opening or closing the valve depending on whether the pressure increases or decreases.
  3. Identifying precursors of failure. A service mesh can take into account indications of impending failure, such as increases in latency, which can signal a struggling system on the verge of failing. It can start allowing cross-zone traffic preventatively, heading off the failure before it escalates.
  4. Handling cross-cluster traffic. TAR has a further limitation: it cannot keep traffic between clusters within the same zone boundary. For Kubernetes adopters who run multiple multi-zone clusters with high volumes of traffic between them, TAR cannot prevent cross-zone charges—but since a service mesh can also handle cross-cluster traffic, it can.

With these capabilities in hand, we can produce a substantially more flexible and reliable mechanism for handling cross-zone traffic than TAR. And, in fact, we've recently introduced a new feature called High Availability Zonal Load Balancing (HAZL) that does exactly that.

HAZL is an advanced HTTP and gRPC load balancer in Buoyant Enterprise for Linkerd (BEL), designed to reduce cost while preserving high availability in multi-zone clusters. Just like TAR, HAZL will keep all traffic within the same zone when conditions are stable. Unlike TAR, HAZL will dynamically send traffic to other availability zones when necessary to preserve overall reliability, and it works for cross-cluster traffic as well as in-cluster traffic.

(If you want to play around with it, you can get your hands dirty with HAZL in about 5 minutes on your own cluster.)

In keeping with Linkerd's design philosophy of "it just works", HAZL works on any Kubernetes cluster regardless of:

  • Distribution of pods across zones (i.e. allowing for fewer than 3 pods per zone)
  • Imbalanced traffic across zones
  • Use of horizontal pod autoscaling

... and other factors that can make it hard to use TAR. Most importantly, HAZL is able to determine when a system is failing and, as we described above, ease traffic into other zones as the failure escalates, then reduce it once the failure subsides.

Here's an example screenshot of what happens when we take HAZL and put it in the exact same situation as we did for TAR in Part I. In contrast to TAR, which dropped success rate to 0% in the zone when the failure was induced, HAZL automatically opens up cross-zone traffic as the system starts to fail, keeping the system at almost 100% success rate.

Grafana dashboard showing cross-zone traffic

(If you want to try this yourself, we have a repo here with all the relevant code. We're using our bb application to generate the client and server, and the very cool oha load generator to generate the load.)
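
If you just want to eyeball the load generation, a typical oha invocation looks something like this (the duration, rate, and URL here are illustrative; the exact commands are in the repo):

```
# Send 100 requests/second from 10 workers for five minutes.
oha -z 5m -q 100 -c 10 http://bb-target.default.svc.cluster.local:8080
```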

Note that this demo also makes use of one other Linkerd feature: circuit breaking. As the failing requests start to apply pressure to the system, HAZL starts opening up cross-zone traffic and adding endpoints in other zones. At the same time, circuit breaking starts shutting off the original endpoints, which are consistently generating failures. As a result, traffic is sent to endpoints in other zones, and only the occasional request is sent to the original endpoints to check whether they have recovered.

Not shown in the graphs above (but feel free to replicate at home): recovery from failure is just as smooth. When the original endpoints in the zone are allowed to respond successfully again, circuit breaking closes the circuit, and HAZL removes the cross-zone endpoints once the original endpoints can absorb the full load. In under a minute, everything is back to normal and all traffic is back in-zone, where it belongs.

Cracking open the HAZL-nut

Under the hood, HAZL is conceptually simple: it is based on a metric called load. Linkerd balances at the request level rather than the connection level, so it knows exactly how many outstanding requests are waiting for a particular endpoint. We call the average of this number over a time period the load for an endpoint. When load is low, requests are being processed right away, and the system is healthy. When load is high, that means many requests are waiting to be processed, and the system is unhealthy—things are backing up.

So the HAZL algorithm is simply: when load is high, add additional endpoints; when load is low, remove the additional endpoints. And prefer local endpoints whenever possible, only adding cross-zone endpoints when necessary.
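
Here's a toy sketch of that loop (emphatically not the actual HAZL implementation; the threshold and load numbers are illustrative):

```go
package main

import "fmt"

// endpoint is one candidate backend, tagged with its zone and the average
// number of outstanding requests against it (its "load").
type endpoint struct {
	addr string
	zone string
	load float64
}

// selectEndpoints keeps traffic on local-zone endpoints, widening the pool
// to other zones only while average local load exceeds a high-water mark.
func selectEndpoints(all []endpoint, localZone string, highWater float64) []endpoint {
	var local, remote []endpoint
	totalLoad := 0.0
	for _, e := range all {
		if e.zone == localZone {
			local = append(local, e)
			totalLoad += e.load
		} else {
			remote = append(remote, e)
		}
	}
	// Healthy case: requests are being served promptly, so all traffic
	// stays in-zone (and any previously added cross-zone endpoints are
	// dropped again once a spike subsides).
	if len(local) > 0 && totalLoad/float64(len(local)) < highWater {
		return local
	}
	// Overloaded (or no local endpoints at all): spill over to other
	// zones. The real system eases endpoints in gradually and keeps
	// re-measuring load, rather than flipping all at once.
	return append(local, remote...)
}

func main() {
	eps := []endpoint{
		{"10.0.1.1", "us-east-1a", 40}, // backed up
		{"10.0.1.2", "us-east-1a", 38}, // backed up
		{"10.0.2.1", "us-east-1b", 2},
	}
	for _, e := range selectEndpoints(eps, "us-east-1a", 10) {
		fmt.Println(e.addr, e.zone) // prints all three: cross-zone spillover
	}
}
```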

Two services under different levels of load. On the left, the system is under heavy load, and 10 requests are pending. On the right, the system is under light load with only 1 pending request.

There are two key benefits to this design:

  • First, the use of load as the core metric means it's easy to provide sensible defaults, since load has an intuitive meaning that is independent of the current number of endpoints and the inherent latency of a service. In other words, regardless of whether a service is normally returning responses in 20ms or 20 seconds, or whether it's deployed on 3 pods or 300, a load of 0.5 is low and a load of 100 is high. 
  • Second, since the change that HAZL makes is simply adding and removing endpoints from the pool, it works in conjunction with all the existing reliability features of Linkerd: circuit breaking can remove failing endpoints; retries can smooth over temporary failures; latency-aware load balancing can shift traffic away from slow endpoints; and so on. We saw exactly this in the demo above.

As a result, HAZL ends up being not just powerful but also incredibly easy to drop into any existing service. It doesn't require your application to be deployed in a particular way at a particular scale; it works with systems like horizontal pod autoscaling that dynamically add and remove endpoints; and it pairs with Linkerd's existing rich set of reliability tools like circuit breaking, retries, and timeouts. In short: for the vast majority of services, HAZL works out of the box.

Finally, as expected, HAZL can also preserve zone locality for cross-cluster calls. Since Linkerd knows the zone of remote-cluster endpoints as well as local-cluster ones, HAZL can use that information just as easily. If you have two multi-zone Kubernetes clusters that are making use of Linkerd's multicluster communication capabilities, HAZL can eliminate cross-zone traffic between these clusters! And just like in the single-cluster case, any cross-cluster communication that becomes unhealthy will trigger the ability to cross zone boundaries in order to reduce load on the system.

Try it today!

HAZL is a great example of the kind of transformative capabilities a service mesh can provide just by making use of rich signals that are simply unavailable to Kubernetes. (Again, this is not a failing of Kubernetes, or of features like Topology Aware Routing, but rather a reflection of where Kubernetes sits on the stack and what information it has available to make decisions.) I believe HAZL is the most advanced load balancer available today in any service mesh, and its potential extends beyond the cost savings example we've given today.

HAZL is also dirt simple to try. If you have a functioning multi-zone Kubernetes cluster, you can get it up and running in about five minutes. (And even if you don't have a multi-zone cluster handy, we've got instructions on how to set up a three-zone k3d virtual cluster on your local machine.) Just start here: Enabling HAZL in Buoyant Enterprise for Linkerd.

HAZL is also just the tip of the iceberg of the incredibly powerful enterprise features we're delivering in BEL. Stay tuned for lots more along these lines.