Dec 10, 2016
Linkerd, our service mesh for cloud-native applications, needs to handle very high volumes of production traffic over extended periods of time. In this post, we’ll describe the load testing strategies and tools we use to ensure Linkerd can meet this goal. We’ll review some of the problems we faced when trying to use popular load testers. Finally, we’ll introduce slow_cooker, an open source load tester written in Go, which is designed for long-running load tests and lifecycle issue identification.
As a service mesh, Linkerd acts as a transparent proxy, taking requests destined for a particular service and adding connection pooling, failure handling, retries, latency-aware load balancing, and more. In order to be a viable production system, Linkerd needs to handle very high request loads over long periods of time and a variety of conditions. Happily, Linkerd is built on top of netty and Finagle, some of the most widely tested and production-vetted network code known to man. But code is one thing; performance in the real world is another.
To assess production behavior, Linkerd needs to be vetted by extensive and rigorous load testing. Furthremore, since Linkerd is a part of underlying infrastructure, Linkerd instances are rarely stopped or restarted—a single Linkerd instance may see billions of requests over a variety of client and service behaviors. This means we must also test for lifecycle issues. For high-throughput network servers like Linkerd, lifecycle issues include memory leaks, socket leaks, bad GC pauses, and periodic network or disk saturation. While these things happen infrequently, if they aren’t handled properly, the results can be catastrophic.
Early on in Linkerd development, we used popular load testers like ApacheBench and hey. (Of course, these are HTTP-specific, and Linkerd proxies a variety of protocols, including Thrift, gRPC, and Mux—but we needed to start somewhere.)
Unfortunately, we quickly found that while these tools were great for getting a quick read on performance, they weren’t great for identifying the lifecycle issues we wanted to capture. These tools would provide a single end-of-run summary, which could mask real issues. They also relied on means and standard deviations, which we knew was a problematic way to characterize system performance.
For capturing lifecycle issues, we wanted both better metrics and to the ability to see how Linkerd was performing over long tests runs of hours or days rather than minutes.
Since we couldn’t find a tool that did what we needed, we built one: slow_cooker. slow_cooker is a load tester designed explicitly for long-running load tests to identify lifecycle issues. We use slow_cooker extensively to find performance issues and test changes in our products. It features incremental progress reports, change detection, and comprehensive metrics.
Today, we’re open sourcing slow_cooker for others to use and contribute to. You can check out the source on GitHub or try out the recently released 1.0 version.
Let’s take a look at some of slow_cooker’s features.
(For the sake of simplicity, we’ll show the output of slow_cooker when we change performance characteristics of the downstream services. In practice, of course, we use slow_cooker primarily to identify problems with Linkerd, not the services it’s talking to.)
slow_cooker has an incremental reporting approach, motivated by our focus on finding lifecycle issues over a long period of time. Too much can get lost when looking at an aggregate report over a very large amount of data—especially for transient issues like GC pressure or network saturation. With incremental reports, we can see throughput and latency trends or changes in a running system.
In the example below, we show slow_cooker output from load testing Linkerd. Our test scenario has Linkerd load balancing across 3 nginx backends, each serving static content. The latencies given are in milliseconds, and we report the min, p50, p95, p99, p999, and max latencies seen during this 10 second interval.
In this report, good% measures throughput: how close we’re getting to the requested RPS (requests per second).
This report looks good—the system is fast and response times are stable. When things go bad, however, we need that fact to come across clearly. We designed slow_cooker’s output to make it easy to visually scan for issues and outliers by using vertical alignment and a change indicator helps us to spot outliers in latency. In the example below, we have a backend server suffering from a catastrophic slow down:
As you can see, the system is fast and responsive except for a hiccup at 2016-11-14T20:58:43Z. During this hiccup, our throughput dropped to 34% and then returned to normal. As a service owner, you’d want to look into your logs or performance metrics and investigate the root cause.
In order to demonstrate how incremental reporting can provide benefits over a single final report, let’s do a simulation of a backend service having GC trouble. In this example, we’ll test directly against a single nginx process serving static content, and in a loop we’ll continually pause and then unpause the process at 5 second intervals (using kill -STOP $PID and kill -CONT $pid). For comparison, let’s start with a ApacheBench’s report:
Here we see mean latency is 1.5ms, but some outliers have high latency. It would be easy to misread this report as healthy even though the backend service is unresponsive for fully half of the test run. If your target SLA is 1 second, then your service is out of SLA for more than half of the test run—but you might never suspect that from this report!
With slow_cooker’s incremental results, however, we can see that there’s a consistent throughput bottleneck that needs deeper investigation. Also, it becomes much more clear that the 99.9th percentile is consistently high; this is not just a few outliers, but a persistent and ongoing problem:
As we see from the ApacheBench example above, some load testing tools will only output average and standard deviation. However, these metrics are usually inappropriate for system latencies. Latency does not follow a standard distribution, and often has very long tails.
With slow_cooker, we discard mean and stddev entirely, showing instead the minimum, maximum, and a handful of higher-order percentiles (50th, 95th, 99th, and 99.9th). This approach has seen increased adoption in modern software systems, where a single request can result in dozens or even hundreds of queries to other systems. In these situations, metrics like the 95th and 99th percentiles represent the dominant latency for end users.
Although writing a load generator is not ultimately particularly difficult, especially with modern, concurrent, network-oriented languages like Go, the details of reporting and measuring can make a significant difference in the utility of the tool.
Today, we use slow_cooker extensively to test Linkerd as well as other projects in the ecosystem (e.g. nginx). We currently run 24x7 tests against Linkerd in the context of complex multi-service software. slow_cooker has helped us not only keep buggy code from being deployed, but it has also identified performance problems in existing code. Usage of slow_cooker has become so pervasive at Buoyant that we refer to load testing a piece of software as “slow cooking” it.
You can get started using slow_cooker today by visiting the Github releases page. Download the tool and fire it at your favorite backend to start vetting it for performance issues. We hope you’ll find it as useful in your setup as we have in our tests of Linkerd.