The development of distributed systems is full of strange paradoxes. The reasoning we develop as engineers working on a single computer can break down in unexpected ways when applied to systems made of many computers. In this article, we’ll examine one such case—how the introduction of an additional network hop can actually decrease the end-to-end response time of a distributed system.
Speed matters. Studies have shown that the introduction of as little as 100ms of additional latency can have a marked impact on human behavior. Generally speaking, we’d expect the introduction of a new component in a request path to increase latency.
In this light, we’re often asked about the latency impact of adding linkerd to a system. The simple answer is “network time, plus about 1ms” (our goal is <1ms p95 at 1000qps). The more complex answer is: in most situations, linkerd actually reducesend-to-end latency of your system (and improves success rate).
How is this magic possible? In short, linkerd’s latency-aware load balancing can rapidly detect when instances are slow and shift request traffic away from them. By combining this behavior with circuit breaking and retries, linkerd can dramatically decrease end-to-end latency in a distributed system—despite being “yet another component”.
Let’s take a look at a dramatic example of this behavior with some poorly-behaved services.
First let’s boot up a test environment:
git clone https://github.com/linkerd/linkerd-examples.git cd linkerd-examples/add-steps docker-compose build && docker-compose up -d open http://$(docker-machine ip default):3000 # or equivalent docker ip address
If everything worked correctly, you should see this:
What’s going on here?
You have just booted up parallel test clusters, a linkerd cluster, and a baseline cluster:
We are using our load tester, slow_cooker, to generate traffic. Each cluster consists of ten identical Go servers that serve a simple http endpoint. Instance response times vary between 0 and 2 seconds. The slowest instances also return an error response for a certain percentage of requests.
Comparing these two clusters, we observe the following results:
|requests per second||316rps||109rps|
Note that p50 latency is more than twice as fast with linkerd. p99 latency is 2.8x faster with linkerd. linkerd’s latency-aware load balancer is favoring the faster instances while the baseline cluster’s latency is severely degraded due to the slower instances. Success rate is 100% with linkerd vs. 88.08% without, thanks to linkerd’s automatic retry behavior.
All of these metrics have improved despite adding an extra network hop via linkerd. It’s clear that round robin does not handle degraded hosts well. Host degradation like this is extremely common in the real world for many reasons, including failing hardware, network congestion, and noisy neighbors. These issues get even more common as you scale up.
To achieve these results, we tuned linkerd to favor fast and healthy instances. Here is the relevant snippet from our linkerd config:
responseClassifier: kind: io.l5d.retryableRead5XX client: loadBalancer: kind: ewma failureAccrual: kind: io.l5d.successRate successRate: 0.9 requests: 20 backoff: kind: constant ms: 10000
responseClassifier section, note the use of io.l5d.retryableRead5XX. We have configured linkerd to retry 500s. This enables linkerd to achieve 100% success rate even when some instances are failing. Of course, retrying a failing request adds latency for that request, but this is often a good tradeoff to make, since the alternative is a failed request.
loadBalancer section, note the use of EWMA. This algorithm computes an exponentially-weighted moving average over the latency of each instance, where recent latency performance is upweighted heavily. In our testing, this load-balancing algorithm responds rapidly to changes in instance latency, allowing it to perform well when latency is inconsistent.
failureAccrual section, note the use of io.l5d.successRate. This is also computes an exponentially-weighted moving average, this time based on the success rate of each instance. In our testing, this failure accrual algorithm performs well when success rate is inconsistent.
For more information on configuration options available in linkerd, have a look at ourConfiguration Reference.
In this example, we’ve seen how linkerd can improve system throughput in the presence of failing and slow components, even though linkerd itself adds a small amount of latency to each request. In our experience operating large-scale systems, this test environment demonstrates the types of performance issues and incidents that we have seen in production. A single request from the outside can hit 10s or even 100s of services, each having 10s or 100s of instances, any of which may be slow or down. Setting up linkerd as your service mesh can help ensure latency stays low and success rate stays high in the face of inconsistent performance and partial failure in your distributed systems.
 A Note About Round Robin
It was surprisingly difficult to find a DNS client that would correctly round robin IP addresses. You may have noticed that this demo uses a slow_cooker built on golang 1.8rc2. This was required because golang 1.7 does not do round robin correctly. We also found that curl, ping, and anything relying on glibc’s getaddrinfo employ a sort function based on rfc6724. This results in a preference for certain IP addresses, or in some cases a single ip address.