eBPF, sidecars, and the future of the service mesh

William Morgan

June 7, 2022

eBPF is cool technology with a lot to offer the cloud native world. It’s been a popular choice for the CNI layer of Kubernetes clusters thanks to projects like Cilium. Service meshes like Linkerd are often deployed with CNI layers like Cilium, combining Linkerd’s powerful L7 processing with Cilium’s super-fast L3/4 handling.

But just how powerful of a networking technology is eBPF? Could it allow us to, for example, replace Linkerd’s sidecar proxies entirely, and just do everything in the kernel?

In this article, I’ll do my best to evaluate this possibility—especially as it pertains to impact on the user. I’ll describe what eBPF is and what it can and cannot do. I’ll dig into sidecars vs other models and contrast them from operational and security perspectives. Finally, I’ll lay out my conclusions about what we, the Linkerd team, believe is the future of the service mesh as it pertains to eBPF.

Who am I?

Hi there, I’m William Morgan. I’m one of the creators of Linkerd—the first service mesh, and the one to define the term itself. I’m also the CEO of Buoyant, which helps organizations around the world adopt Linkerd. You might remember me from such long and nerdy missives as The Service Mesh: What every software engineer needs to know about the world’s most over-hyped technology and A Kubernetes engineer’s guide to mTLS: Mutual authentication for fun and profit.

I care a lot about Linkerd, and that’s my bias. But I’m also happy to be pragmatic about its implementation. Ultimately, Linkerd’s goal is to be the simplest service mesh for our users, and how Linkerd achieves that simplicity is an implementation detail. For example, today Linkerd uses sidecars, but earlier 1.x releases of Linkerd were deployed as per-host proxies, and we made that change for operational and security reasons. The idea that eBPF might allow us to further simplify Linkerd—especially operationally—is the kind of thing that gets my attention.

What is eBPF?

Before we get into gory service mesh details, let’s start with eBPF. What is this hot new technology that’s sweeping the Tweet-o-sphere?

eBPF is a feature of the Linux kernel that allows applications to do certain types of work in the kernel itself. eBPF has its origins in the networking world, and while it’s not restricted to networking, that’s where it shines: among other things, eBPF unlocks a whole class of network observability that was simply not possible in the past due to its performance impact.

Let’s say you want your application to process network packets. You can’t access the host machine’s network buffer directly. This buffer is managed by the kernel, and the kernel has to protect it—for example, it has to ensure that one process cannot read the network packets of another process. Instead, the application can request network packet information via a syscall, which is essentially a kernel API call: your application invokes the syscall, the kernel checks whether you have permission to read the packet you requested, and, if so, returns it to you.

Syscalls are portable—your code will work on non-Linux machines—but they are slow. In a modern networking environment, where your machine might process tens of millions of packets per second, writing syscall-based code to do something with every packet is untenable.

Enter eBPF. Rather than our code calling syscalls in a tight loop, passing back and forth between “kernel space” and “user space”, we instead give our code directly to the kernel and tell it to execute it itself! Voila: no more syscalls, and our application can run at full speed. (Of course, as we’ll see below, it’s not this simple.)

eBPF is one of a crop of recent kernel features like io_uring (which Linkerd uses heavily) that change the way that applications and the kernel interact. (ScyllaDB’s Glauber Costa has a great writeup on this: How io_uring and eBPF Will Revolutionize Programming in Linux.) These features work in very different ways: io_uring uses a specific data structure that allows the application and the kernel to share memory in a safe way; eBPF works by allowing applications to submit code to the kernel directly. But in both cases the goal is to deliver performance gains by moving beyond the syscall approach.

eBPF is a big advancement, but it’s not a magic bullet. You cannot run arbitrary applications as eBPF programs. In fact, the things you can do with eBPF are highly limited, and for good reason.

Contended multi-tenancy is hard

Before we can understand why eBPF is so limited, we need to talk about why the kernel itself is so limiting. Why do things like syscalls exist? Why can’t programs just access the network (or memory, or disk) directly?

The kernel operates in a world of contended multi-tenancy. Multi-tenancy means that multiple “tenants” (e.g. people, accounts, or some other form of actor) share the machine, each running programs of their own. Contended means these tenants aren’t friends. They shouldn’t have access to each others’ data, or be able to interfere with each other. The kernel needs to enforce that behavior, while allowing them to execute arbitrary programs. In other words, the kernel needs to isolate the tenants.

This means that the kernel can’t really trust any program it’s instructed to run. At any point, the program of one tenant could attempt to do something bad to the data or programs of another tenant. The kernel must ensure that no program can stop or break another program, or deny it resources, or interfere with its ability to run, or read its data from memory, or the network, or disk, unless given explicit permission to do so.

This is a critical requirement! Almost every software-related security guarantee in the world ultimately comes down to the kernel’s ability to enforce these kinds of protections. A program that can read another program’s memory or network traffic without permission is an avenue for data exfiltration, or worse. A program that can write another program’s memory or network traffic is an avenue for fraud, or worse. Kernel exploits that allow programs to break the rules are a very big deal. And one of the best ways to break these rules is to get access to the kernel’s internal state—if you can read or write kernel memory, then you can get around the rules.

That’s why every interaction between applications and the kernel is so highly scrutinized. The consequences of failure are extremely high. Kernel developers have poured collective millennia of effort into this problem.

This is also why containers are so powerful—they take these same isolation guarantees and apply them to arbitrary packages of applications and dependencies. Thanks to relatively modern kernel magic, we can run containers in isolation from each other, taking full advantage of the kernel’s ability to handle contended multi-tenancy. Earlier ways of achieving this isolation using virtual machines were slow and expensive. The magic of containers is that they give us (most of) the same guarantees in a way that’s dramatically cheaper.

Almost every aspect of what we think of as “cloud native” relies on these isolation guarantees.

eBPF is limited

Back to eBPF. As we discussed, eBPF allows us to hand the kernel code and say “here, please run this in the kernel”. From the perspective of kernel security, we know that’s an incredibly scary thing to do—it would bypass all the barriers between applications and the kernel (like syscalls) and put us directly into security exploit territory.

So to make this safe, the kernel imposes some very significant constraints on the code that it executes. Before they can be run, all eBPF programs must pass through a verifier, which checks them for naughty behavior. If the verifier rejects the program, the kernel won’t run it.

Automatic verification of programs is difficult, and the verifier has to err on the side of being overly restrictive. Thus, eBPF programs are very limited. For example, they cannot block; they cannot have unbounded loops; and they cannot exceed a predefined size. They’re also limited in their complexity—the verifier evaluates all possible execution paths, and if it cannot complete within some limit, or if it cannot prove that every loop has an exit condition, the program does not pass.

There are many perfectly safe programs that violate those constraints. If you want to run one of those programs as eBPF, too bad! You need to rewrite it to satisfy the verifier.1 The good news, if you’re an eBPF fan, is that these restrictions gradually get looser as the verifier gets smarter in each kernel release. There are also some creative ways of working around these limits.
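To make these constraints concrete, here is a sketch of a minimal XDP program in the restricted C dialect the verifier will accept (names are illustrative, and the real work—updating a BPF map—is elided). It would be compiled with clang for the BPF target and loaded into the kernel, so it is not a runnable user-space program:

```c
/* Sketch of a minimal XDP program; illustrative only. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int count_udp(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *end  = (void *)(long)ctx->data_end;

    /* The verifier insists on an explicit bounds check before every
     * packet access; omit one and the program is rejected at load time. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > end)
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_UDP) {
        /* A real program would update a per-CPU BPF map here. Note what's
         * missing: no blocking, no unbounded loops, no heap, and a bounded
         * total instruction count. */
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Even this trivial program has to be written defensively to pass the verifier; anything stateful across packets, let alone a full protocol implementation, gets dramatically harder from here.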

But despite this, the nature of how eBPF is constrained (and must be constrained, for the model to work) means that eBPF programs are extremely limited in what they can do. Even buffering data across multiple packets is non-trivial in eBPF. Serious processing—the kind required, for example, to handle the full scope of HTTP/2 traffic—is far outside the scope of pure eBPF, and terminating TLS’d traffic is impossible.

At best, eBPF can do a fraction of this work, calling out to user-space applications to handle the portions that are too complex to handle directly in eBPF.

eBPF vs the service mesh

With the basics of eBPF under our belt, let’s get back to the service mesh.

A service mesh handles the complexities of modern, cloud-native networking. Linkerd, for example, initiates and terminates mutual TLS; retries requests across connections; transparently upgrades connections from HTTP/1.x to HTTP/2 between proxies for improved performance; enforces access policy based on workload identity; sends traffic across Kubernetes cluster boundaries; and lots, lots more.

Linkerd, like most service meshes, does this by inserting a proxy into each application pod, which intercepts and augments the TCP communication to and from the pod. These proxies run in their own containers alongside the application container—the “sidecar” model. In Linkerd’s case the proxies are ultra-light, ultrafast, Rust-based micro-proxies, but other approaches exist.

Ten years ago, the idea of deploying hundreds or thousands of proxies on your cluster and wiring them up to pair with every instance of every application would have been an operational nightmare. But thanks to Kubernetes, it’s suddenly very easy. And thanks to Linkerd’s clever engineering (if I might toot our own collective horn) it’s also manageable: Linkerd’s micro-proxies do not require tuning and consume a bare minimum of system resources.

In this context, eBPF has been playing nicely with service meshes for years. Kubernetes’s gift to the world is a composable platform with clear boundaries between layers, and the relationship between eBPF and service meshes fits right into that model: the CNI is responsible for L3/L4 traffic, and the service mesh for L7.2

And the service mesh is great for platform owners. It provides functionality like mTLS, request retries, “golden metrics”, etc, at the platform level, which means they no longer need to rely on the application developers to build out these features. But at the cost, of course, of adding lots of proxies everywhere.

So back to our original question: can we do better? Can we get the functionality of the service mesh without the proxies with some form of “eBPF service mesh”?

The eBPF service mesh still requires proxies

Armed now with our understanding of eBPF, we can jump into these murky waters and explore what may lurk within.

Unfortunately, we hit bottom pretty quickly: eBPF’s limitations mean that the full scope of service mesh features, such as routing HTTP/2 traffic based on headers, initiating and terminating mutual TLS, and so on, are very, very far outside the realm of technical feasibility with an eBPF-only approach.

And even if they were feasible, it still wouldn’t make sense to implement them in eBPF! Writing eBPF is hard; debugging it is extremely hard; and these operations are complicated enough already, without having to implement them in a limited programming model that requires us to jump through hoops to even accumulate data across packets.

So, for reasons of both technical limitation and software engineering practices, the idea of a “pure” eBPF service mesh is a non-starter.

What does make sense, however, is to pair eBPF with user-space code that can handle the complex bits. In other words, to use a proxy. And that’s what every “eBPF service mesh” approach extant today does, whether it advertises it or not: eBPF for the bits that make sense, and a user-space proxy for the rest.

Per-host proxies are significantly worse than sidecars

So our eBPF service mesh requires proxies. But does it require sidecar proxies specifically? What if we use per-host proxies—could that give us a sidecar-free, eBPF-powered service mesh?

The answer is yes, but… it’s a bad idea. Unfortunately, we learned this the hard way in Linkerd 1.x. (Sorry, early adopters!) Compared to sidecars, per-host proxies are worse for operations, worse for maintenance, and worse for security.

Why? In the sidecar model, all traffic to a single instance of an application is handled through its sidecar proxy. This allows the proxy to act as part of the application, which is ideal:

  • Proxy resource consumption scales with load to the application. As traffic to the instance goes up, the sidecar consumes more resources, just as the application does. If the application is taking very little traffic, the sidecar doesn’t need to consume many resources. (Linkerd’s proxies have a 2-3MB memory footprint at low traffic levels.) Kubernetes’s existing mechanisms for managing resource consumption, such as resource requests and limits and OOM kills, all continue to work.
  • The blast radius of proxy failure is limited to a pod. Proxy failure is the same as application failure, and is handled by existing Kubernetes mechanisms for failed pods.
  • Proxy maintenance, e.g. upgrading proxy versions, is done via the same mechanisms as for the application itself: rolling updates of Deployments, etc.
  • The security boundary is clear (and small): it’s at the pod level. The sidecar runs in the same security context as the application instance. It’s part of the same pod. It gets the same IP address. It enforces policy and applies mTLS to traffic to and from that pod, and it only needs the key material for that pod.

In a per-host model, these niceties go out the window. Instead of a single application instance, the proxy now handles traffic to an effectively random set of pods of whichever application instances Kubernetes has decided to schedule on the host. The proxy is now completely decoupled from the application, which introduces all sorts of subtle and not-so-subtle issues:

  • Proxy resource consumption is now highly variable: it depends on what Kubernetes has scheduled on the host at any point in time. This means you cannot effectively predict or reason about the resource consumption of a specific proxy, which means it will eventually break and the service mesh team will be blamed.
  • Applications are now vulnerable to “noisy neighbor” traffic. Because all traffic through the host flows through a single proxy, a single high-traffic pod can consume all proxy resources, and the proxy must ensure fairness or the application risks starvation.
  • The blast radius of a proxy is large and ever-changing. Proxy failures and upgrades now affect a random subset of pods across a random set of applications, meaning that any failure or maintenance task has hard-to-predict effects.
  • The security story is now far more complex. To do TLS, for example, a per-host proxy must contain the key material for whichever applications have been scheduled on the host, making it a new attack vector that’s vulnerable to the confused deputy problem—any CVE or vulnerability in the proxy is now a potential key leak.

In short, the sidecar approach keeps the isolation guarantees we’ve gained from moving to containers in the first place—the kernel can enforce all the security and fairness considerations of contended multi-tenancy at the container level, and everything just works. The per-host model moves us out of that world entirely, leaving us with all the problems of contended multi-tenancy.

Of course, per-host proxies do have some advantages. You can lower the number of proxies that a request must traverse from two per hop in the sidecar model to one per hop, which saves on latency. You can have fewer, larger proxies, which may be better for resource consumption if your proxies have a high baseline cost. (Linkerd 1.x was a good example of this—great at scaling up to large traffic volumes; bad at scaling down). And your network architecture diagram is “simpler” because you have fewer boxes.

But these advantages are minor compared to the operational and security problems you incur. And, with the exception of fewer boxes in your network diagram, we can mitigate these differences with good engineering—making sure our sidecars are as fast and as small as possible.

Can we just improve the proxy?

Some of the issues we’ve outlined with per-host proxies come down to our old friend, contended multi-tenancy. In sidecar land, we use the kernel’s existing solutions to contended multi-tenancy via containers. In our per-host proxy model, we can’t do that—but can we fix these issues by having our per-host proxy itself able to handle contended multi-tenancy? For example, one popular proxy is Envoy. Could we address the problems with per-host proxies by adapting Envoy to handle contended multi-tenancy?

The answer is no. Well, the answer is yes, in the sense of “it would not contradict the physical laws of the universe”, but no in the sense of “this would be a huge amount of work that would not be a good use of anyone’s time”. Envoy is not designed for contended multi-tenancy, and changing that would require a massive effort. If you want to get into the details, there is a long and interesting Twitter thread exploring some of what would have to be done—it would require a tremendous amount of very tricky work on the project, and a huge amount of change that would have to be continually weighed against “just running one Envoy per tenant”—i.e. sidecars.

And even if you did this work, at the end, you’d still have the issues of blast radius and security to contend with.

The future of the service mesh

Putting it all together, we’re left with one conclusion: eBPF or no eBPF, the foreseeable future of the service mesh is built from sidecar proxies running in user-space.

Sidecars are not without their problems,3 but they are by far the best answer we have today for handling the full scope of the complexities of cloud native networking while keeping the isolation guarantees provided by containers. And to the extent that eBPF can offload work from the service mesh, it should do that by working with sidecar proxies, not per-host proxies. “Making existing sidecar-based approaches much faster while retaining the operational and security advantages of containerization” doesn’t quite have the same marketing ring as “solve service mesh complexity and performance by getting rid of sidecars”, but from the users’ perspective, it’s a win.

Will eBPF’s capabilities eventually grow to the point where no proxy is needed to handle the full scope of L7 work provided by service meshes? Perhaps—but for reasons outlined above it is unlikely even then that it would make sense to abandon user-space proxies. Will the kernel be able to absorb the full scope of service mesh work through some other mechanism? Perhaps—but there is little appetite today for “service mesh kernel modules” and it’s unclear what would make that a compelling prospect.

So, for the foreseeable future, the Linkerd project will continue our efforts to make our sidecar micro-proxies as small, fast, and operationally negligible as possible, including by offloading work to eBPF when it makes sense. Our fundamental duty is to our users and their operational experience with Linkerd, and it is through that lens that we must always measure every design and engineering tradeoff.

Footnotes

  1. Alternatively, you can get patches merged into the kernel that teach the verifier to accept your program. This is probably a bit harder.
  2. Or possibly L5/L7, or L5-L7… the OSI model, never particularly accurate, now requires footnotes and punctuation gymnastics to approach precision.
  3. Believe me, there are plenty, especially when it comes to that perennial bugbear of Kubernetes, container startup ordering—great fodder for another blog post.