A Kubernetes engineer’s guide to mTLS

Mutual authentication for fun and profit

William Morgan

Introduction

Mutual TLS, or mTLS, is a hot topic in the Kubernetes world, especially for anyone tasked with getting “encryption in transit” for their applications. But what is mTLS, what kind of security does it provide, and why would you want it?

In this guide, I’ll do my best to answer those questions. I’ll cover what mTLS is, how it relates to “regular” TLS, and why it’s relevant to Kubernetes. I’ll also talk about some of the pros and cons of mTLS and its alternatives. At the end, I’ll show you how to add mTLS to your Kubernetes cluster with Linkerd.

A word of caution before we start: I am essentially a mediocre Kubernetes-aware engineer who understands mTLS at a practical level. In this guide I’ve done my best to be accurate and to point out the nuances when I’m aware of them, but this is not an in-depth analysis of mTLS from the security perspective, nor would I be qualified to write such a thing. Corrections and clarifications are welcome—please reach out to me on Twitter or drop me an email at my first name @ buoyant.io.

What is mTLS?

mTLS is simply “regular TLS” with one extra stipulation: the client is also authenticated. But what does that mean and why would we want that?

Before we can answer those questions, we need to start with a basic understanding of TLS. TLS is a connection-level protocol designed to provide security for a TCP connection (we’ll see below exactly what “security” means here). Since TLS works at the connection level, it can be combined with any application-level TCP protocol without that protocol needing to do anything different. For example, HTTPS is HTTP combined with TLS (the “S” in HTTPS stands for “secure”, which originally meant SSL, the predecessor of TLS), and nothing about HTTP needs to change to accommodate TLS.[1]

If you’re a security expert—and I’ll point out again that I am definitely not one—you may have something of a “complex” relationship with TLS. TLS has all sorts of issues and quirks that make it suboptimal from the security perspective. The spec is complex and underspecified, there are bits that don’t really make sense, and implementations are never 100% up to the spec anyways.

Despite those concerns, TLS is everywhere. You’re using TLS right now: this page is served over HTTPS, and you probably see a little lock icon in your browser’s URL bar that gives you some soothing words when you click on it.

What kind of security does TLS provide?

Most people associate TLS with encryption. But TLS is more than that. TLS provides three guarantees for a connection:

  • Authenticity: the parties on either side can prove that they are who they say they are.
  • Confidentiality: no one else can see what data is being exchanged.
  • Integrity: the data received is the same data that was sent.

So while TLS does give you encryption—that’s how it achieves confidentiality—from the TLS perspective, that’s not enough for secure communication: you need all three properties. If you don’t have authenticity then someone could spoof being on the other end of the connection. If you don’t have integrity then someone could modify crucial bits of the communication. And if you don’t have confidentiality, then anyone can listen in.

Of these three guarantees, the most interesting for this discussion is authenticity. We’ll be talking a lot about authenticity and authentication through the rest of the guide.

When is mTLS useful?

Back to our original definition: mTLS is simply “regular TLS” with the extra stipulation that the client is also authenticated. With our basic understanding of TLS, we can now parse this statement. TLS guarantees authenticity, but by default this only happens in one direction: the client authenticates the server but the server doesn’t authenticate the client. mTLS makes the authenticity symmetric.

Why would TLS’s default be to only authenticate in one direction? Because often the client’s identity is irrelevant. For example, in loading this page, your browser has validated that buoyant.io is who it claims to be, but buoyant.io hasn’t validated the identity of your browser. It doesn’t actually care about the identity of your browser. (Frankly, buoyant.io is just happy you’re reading this article.)

Of course, not validating client identity makes sense for serving web pages, but there are plenty of types of communication where the identity of the client is important. API calls are one example: if you’re calling a service like Twilio, then Twilio needs to know who you are—among other reasons, so that they can send you the bill. You can’t make a Twilio API call without providing it with some kind of client identity.

But Twilio doesn’t use mTLS. Instead, you authenticate yourself to Twilio by giving it a secret “authentication token” that you were assigned when you created your account. Twilio could use mTLS, but mTLS is complicated and frankly annoying to set up (lots more on this later), so if you offer a public API like Twilio you will probably just use an auth token.

Authentication with mTLS, however, has some really powerful characteristics that our auth token approach doesn’t. For one, mTLS authentication can be done entirely outside of the application without requiring any app-level features for creating, registering, or managing identities. Before you can make your first Twilio call, you need to log into the website, create an account, and get your token.[2] The Twilio API has to know about this auth token and provide ways to pass it to API calls and to manage it. But with mTLS, a brand new client can authenticate itself right off the bat, even if no one has seen it before. And the application doesn’t need to know anything about authentication or provide endpoints to manage it.

Putting that all together, we see that mTLS is great for situations where a) you need secure communication; b) you care about client identity; and c) you don’t want to build app-level flows for managing identities. And also, practically speaking, when d) you can manage the complexity of actually implementing it.

One situation with all those characteristics is… microservices!

Using mTLS to secure microservices

mTLS is a great way to secure the cross-service communication between microservices, for all the reasons we outlined above.

First, you want secure communication. When we implement our application as multiple services, we end up sending sensitive customer data across the network between these services. Anyone who gets access to the network can potentially read this sensitive data and forge requests.

Second, you care about client identity. For one, you want to be able to tell where your calls are coming from, both for diagnostic purposes and so that things like metrics are recorded properly. Moreover, you probably want to do authorization with these identities (is B even allowed to call A?). We’ll talk more about authorization later.

Third, you don’t really want to build app-level flows for managing service identity. It’s not business logic and developer time would be better spent elsewhere.

Finally, you can actually manage the complexity of implementing mTLS if you control the platform. Or at least, better than Twilio can. In our Twilio example, every user has to solve the challenge of authenticating themselves to Twilio. The harder that challenge is, the worse for the user (and the worse for Twilio’s bottom line). But if we can implement mTLS at the platform level, we’ll pay the cost once rather than for each service or for each user.

In summary, mTLS is a great fit for securing the communication between microservices. But there’s a catch.

The hard part of TLS: certificate management

So far we’ve painted a rosy picture of mTLS. Clients and servers merrily authenticate each other and then security happens. In practice, the massive practical challenge standing in the way of making mTLS work is certificate management.

Authentication in TLS works through public key cryptography and public key infrastructure. These are both tremendous topics in their own right, and in this article we won’t go into the details. But in short, they involve a whole lot of certificates.

TLS authentication is based on something with the delightful name of the X.509 certificate. An X.509 certificate contains, among other things, an identity and a public key. The public key also has a corresponding private key, which is not part of the certificate. The first half of authenticating yourself in TLS is showing your certificate to the other side and then using the private key to prove that the identity inside the certificate is yours. (The magic of public key cryptography is that anyone who copies the certificate can’t do this proof because they don’t have the private key. So you can be very free with the certificate, including sending it across plaintext channels or storing it in the open.)
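To make the “prove the identity is yours” step a little more concrete, here is a rough sketch using the openssl command-line tool. It is an illustration rather than what TLS literally does on the wire: the certificate is self-signed to keep things short, the identity web.emojivoto.example and the file names are made up, and the “challenge” is just a stand-in for the handshake data that a real TLS peer would sign.

```shell
set -eu
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Create a private key and a self-signed certificate for a hypothetical identity.
# (Key-generation progress output on stderr is discarded.)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=web.emojivoto.example" \
  -keyout web.key -out web.crt 2>/dev/null

# The certificate carries the identity and the public key, but not the private key.
subject="$(openssl x509 -in web.crt -noout -subject)"
openssl x509 -in web.crt -noout -pubkey > web.pub

# Prove ownership: sign a challenge with the private key...
printf 'random-challenge-from-peer' > challenge.txt
openssl dgst -sha256 -sign web.key -out challenge.sig challenge.txt

# ...and let anyone holding only the certificate's public key check the signature.
proof="$(openssl dgst -sha256 -verify web.pub -signature challenge.sig challenge.txt)"

echo "$subject"
echo "$proof"
```

Someone who copies web.crt can read the identity and the public key, but without web.key they cannot produce a signature that passes the final check.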

X.509 certificates are signed by a certificate authority (CA) denoting that the CA “trusts” the identity in that certificate. This is used for the second half of TLS authentication: if someone shows you their identity and proves they own it, you now have to decide whether you trust that identity. TLS uses a simple rule here: if the certificate is signed by a CA, and you trust that CA, you also trust the identity. How do you verify the CA’s signature of a certificate? By using the X.509 certificate of the CA itself. How do you know if you trust that CA? Well, basically, you’re told to trust it in some way outside of the TLS protocol.[3]

The CA also issues certificates. To get a certificate, you first create the public key and private key pair. You keep the private key, well, private (you never send it over the network), and you send the CA a certificate signing request (CSR) that contains the public key and your identity. If the CA approves the request, it creates the certificate, signs it, and sends it back to you.
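The CSR flow and the trust rule described above can be sketched by hand with openssl. This is only a toy version under a few assumptions: the CA is a throwaway self-signed one, the names demo-ca, other-ca, and web.emojivoto.example are invented for the example, and it relies on a typical default openssl configuration that marks the self-signed certificate as a CA. A service mesh automates all of this for you.

```shell
set -eu
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# 1. A demo CA: a key pair plus a self-signed certificate that others are told to trust.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=demo-ca" -keyout ca.key -out ca.crt 2>/dev/null

# 2. The service creates its key pair and a CSR holding its identity and public key.
#    The private key (svc.key) never leaves the service.
openssl req -new -newkey rsa:2048 -nodes \
  -subj "/CN=web.emojivoto.example" -keyout svc.key -out svc.csr 2>/dev/null

# 3. The CA approves the request by signing a short-lived certificate.
openssl x509 -req -in svc.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 1 -out svc.crt 2>/dev/null

# 4. Anyone who trusts ca.crt can now check the CA's signature on svc.crt.
trusted="$(openssl verify -CAfile ca.crt svc.crt)"
echo "$trusted"

# 5. A different CA that never signed svc.crt gives no basis for trust.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=other-ca" -keyout other.key -out other.crt 2>/dev/null
if openssl verify -CAfile other.crt svc.crt >/dev/null 2>&1; then
  untrusted="accepted"
else
  untrusted="rejected"
fi
echo "$untrusted"
```

Note that only svc.csr and svc.crt would ever cross the network in a real setup. Both private keys stay where they were created.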

Certificate management, then, is the challenge of creating and distributing all these certificates. We need to make sure there is a CA, that every service has its certificate, that every service can send the CA a CSR, and that the CA can send the certificates back to the service. We also need to make sure that our CA is secure, that no one else ever has access to any service’s private key, and that every service knows its own identity in a way that can’t be altered.

This certificate distribution challenge is compounded by the fact that in environments like Kubernetes, a “service” is in fact an ever-changing set of replicas that can be created or destroyed on the fly, each of which needs its own set of certificates.

And that is further compounded by the fact that, in practice, the best way anyone has found to mitigate certificate loss (i.e. what happens when someone unauthorized gets access to a secret key) is by certificate rotation: you give certificates very short lifetimes and re-issue them before they expire. This means we need to repeat the whole CSR and certificate flow, for each replica, every n hours.
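As a small illustration of the “re-issue before it expires” check, openssl can tell you whether a certificate will still be valid some number of seconds from now. The needs_renewal helper, the file names, and the renewal windows below are all hypothetical choices for this sketch, and the certificate is self-signed to keep it short. In a mesh, the proxy and its CA handle this loop for you.

```shell
set -eu
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# A stand-in certificate that is valid for one day.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=web.emojivoto.example" -keyout svc.key -out svc.crt 2>/dev/null

# needs_renewal CERT SECONDS: succeeds when CERT expires within SECONDS.
# `openssl x509 -checkend N` exits 0 only if the cert is still valid N seconds from now.
needs_renewal() {
  ! openssl x509 -in "$1" -noout -checkend "$2" >/dev/null
}

# With an 8-hour renewal window, the fresh one-day certificate can be kept.
if needs_renewal svc.crt 28800; then short_window="renew"; else short_window="keep"; fi

# With a 2-day renewal window, the same certificate is already due for re-issue.
if needs_renewal svc.crt 172800; then long_window="renew"; else long_window="keep"; fi

echo "8h window: $short_window"
echo "48h window: $long_window"
```

When the check says “renew”, the replica would repeat the CSR flow with a fresh key pair and swap in the new certificate.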

And even that is further compounded by the fact that if we want to extend our secure communication across clusters, we need a way to ensure that the identities generated in one cluster can be consumed by the other cluster, but that if a whole cluster gets compromised, we can disable it without disabling every other cluster. We do this—you guessed it—with more certificates.

In summary, doing mTLS involves a whole lot of certificates, all the time. The complexity of this challenge can be daunting. But despite that, mTLS has seen something of a renaissance in the world of Kubernetes. This is because Kubernetes unlocks a particular type of technology that makes mTLS feasible: the service mesh.

Kubernetes, mTLS, and the service mesh

A service mesh is an amazingly good mechanism for adding mTLS to your cluster. Why? Because something like Linkerd can actually do all of the work for you. It can handle not only the challenge of certificate management, but also the making and receiving of the TLS connections themselves. Linkerd makes “add mTLS to my cluster” a zero-config operation: the moment you install Linkerd on a Kubernetes cluster, all communication between meshed pods is automatically mTLS’d. For something as complex as mTLS, that’s pretty incredible.

This is all possible because Kubernetes makes some things that would otherwise be incredibly complex, like sidecar deployments, tractable. Thanks to the magic of Kubernetes, Linkerd can:

  • Transparently inject a “micro-proxy” into each application pod and route all TCP communication to and from the pod through this proxy.
  • Ship an internal CA as part of its control plane that can issue TLS certs, and securely distribute the certificate for this CA to all proxies.
  • Use this CA to issue short-lived certs to each proxy, tied to the Kubernetes ServiceAccount identity of the pod.
  • Re-issue those certs every n hours.
  • Have each proxy enforce mTLS on all connections to the pod with those certs, ensuring that clients and servers have valid identities on both sides.
  • Apply authorization policy using those identities before the connection hits the application.

That’s all a simplification, of course. For example, Linkerd actually uses two levels of CAs, one at the cluster level and one at the global level, in order to allow for cross-cluster communication. And Linkerd can use multiple trust roots, so that you can rotate your CAs as well. And so on.

But you don’t have to worry about those details. You install Linkerd, mesh your pods, and voila: you have mTLS.

Here’s a 75-second video we made in the early days of Linkerd 2.3, showing what it looks like to use tshark to sniff packets on a GKE cluster before and after mTLS.

Neat, right? Now let’s see exactly how easy it is to set this up for yourself.

Tutorial time

Ok, let’s add mTLS to our Kubernetes cluster in about five minutes with Linkerd. We’re also going to install the Buoyant Cloud extension and its fancy dashboard to allow us to see which traffic is TLS’d and which isn’t, but that part is optional.

Step 1: Install the CLI

Download Linkerd’s command-line interface (CLI) and its Buoyant Cloud extension onto your local machine:

curl -fsL https://run.linkerd.io/install | sh
curl -fsL https://buoyant.cloud/install | sh

(Feel free to inspect the content of these scripts first. This is a security-themed blog post, after all!)

Step 2: Validate your Kubernetes cluster

Validate that your Kubernetes cluster is prepared for Linkerd:

linkerd check --pre

If there are any checks that do not pass, follow the provided links and fix the issues. (And if the linkerd command is not found, be sure you followed the $PATH instructions output by the commands above!)

Step 3: Install the control plane onto your cluster

Install the control plane onto your cluster:

linkerd install | kubectl apply -f -

Now wait for the control plane to be ready:

linkerd check

Install the Buoyant Cloud extension. We’ll use this to inspect which connections are mTLS’d:

linkerd buoyant install | kubectl apply -f -

Validate the control plane one last time:

linkerd check

Finally, let’s install our “emojivoto” sample application:

curl -fsL https://run.linkerd.io/emojivoto.yml | linkerd inject - | kubectl apply -f -

Now our emojivoto application is installed, and thanks to Linkerd, we actually already have mTLS between these components! Congrats!

What traffic does a service mesh actually mTLS?

Funnily enough, one “problem” with Linkerd’s mTLS is that because it is totally transparent, it can be hard to tell whether it’s working. (Our tshark example video is one way of doing this, but it’s not a particularly convenient way.)

But we can spin up our fancy Buoyant Cloud dashboard and take a look:

linkerd buoyant dashboard

Click on the TLS tab on the left, and select the emojivoto namespace. This will take you to a list of all the different types of traffic that Linkerd has seen in your system. You’ll see something that looks like this:

Buoyant Cloud mTLS screenshot

Voila! This is all the TCP traffic in the emojivoto namespace, broken down by destination and TLS status. We can see some interesting things here:

  • Health checks in plaintext. We still have plaintext traffic! But it’s only going to Linkerd’s admin port, 4191. Because this is plaintext traffic we don’t have any identity, but based on the port and the frequency we know that this is simply Kubernetes doing its health checks, which Linkerd handles for us.
  • Metrics scrapes. We also have mTLS traffic to that same (!) admin port. In this case, because it’s mTLS we do get identity, allowing us to see that this is traffic from the linkerd-viz Prometheus instance and separately from the Buoyant Cloud agent. Both of these systems scrape metrics from Linkerd’s metrics endpoint.
  • Application traffic. Finally, we see our application traffic, happily mTLS’d. We see HTTP calls from our ingress (which just has an identity of “default”) as well as gRPC calls from the web service to the emoji and voting services. (You’ll also see that in pictorial form if you click on the Topology tab, but that’s another blog post.)

This is a pretty standard snapshot of an mTLS’d Kubernetes application: there is still plaintext traffic on the cluster in the form of health checks, but that’s ok. Our sensitive application traffic is still encrypted and secured by Linkerd.

One thing not shown in the above screenshot, but that is often a part of a production cluster, is client-initiated TLS. This is often seen with ingress traffic: the user’s client opens a “regular” TLS connection and your ingress controller terminates it. In this case, Linkerd treats that TLS connection as an opaque TCP stream and proxies it without doing anything special. Linkerd really can’t do anything else: the whole point of TLS is that something like Linkerd, which sits in between the client and the server, can’t intercept this traffic and decrypt it (remember our definition of confidentiality above?). At any rate, the traffic is still secure, even if Linkerd can’t give you HTTP-level metrics for it. And Buoyant Cloud will still be able to identify this traffic and count it as secure, even if it can’t tell you much more than that.

You’ve made it!

Congratulations, you’ve made it to the end, and hopefully added mTLS to your Kubernetes cluster! Now go forth and mTLS the world. And if you are planning to go to production, don’t forget to check out Buoyant’s Linkerd production runbook, especially the sections on rotating issuer credentials and rotating webhook credentials, which are common gotchas for long-lived clusters.

Thank you

A big thank you to Alex Ellis, Oliver Gould, and William King for reviewing this. Needless to say, all mistakes are mine.

FAQ: What does mTLS actually protect against?

In truth, mTLS basically only protects against one specific attack vector: unauthorized access to the network. Any such intruder is prevented not just from sniffing the contents of the network calls, but also from impersonating a service and making their own calls. That’s great!

But there’s a lot that mTLS does not protect you against, including unauthorized access to a node. All the mTLS in the world won’t help you if someone can get into a node: they could read the secret keys and sniff or spoof connections, subvert the CA and cause havoc, or carry out any number of other nefarious activities.

In the big picture of “what are the security vulnerabilities that I have to worry about with Kubernetes”, mTLS really addresses only a small fraction. Securing Kubernetes is not easy, and while Linkerd can play a part in it, there’s a lot more to be done.

FAQ: Kubernetes mTLS vs IPsec vs WireGuard

How does mTLS compare to network-layer encryption like IPsec or WireGuard? In Kubernetes, these are often implemented at the CNI level, with a plugin like Calico or Cilium. Like a service mesh, these plugins provide encryption in transit without the application needing to do anything special. And they have one big advantage over the service mesh approach: they don’t require adopting a service mesh. But they also have some downsides.

From the practical standpoint, these CNI solutions typically require specific kernel and networking support that is not always available. A service mesh, by contrast, works on pretty much any conformant Kubernetes cluster.

The bigger downside of the CNI approach, however, is around identity: it doesn’t provide service identity. Connections are authenticated as being “part of the cluster” but no further.

This lack of service identity has two big implications. The first is that we can’t apply authorization policy on top of this identity. If you want to ensure that service A cannot talk to service B, or that only certain types of calls can happen, then with a CNI solution you have to determine the identity of A and B somewhere else in the stack. As we’ve seen above, doing authentication in a way that’s secure (i.e. beyond looking at source IP address, which is fragile and tied to implementation details) is far from trivial.

The second implication is that you can’t do hierarchical identity. Hierarchical identities are core to TLS and allow you to define different scopes of trust. Linkerd makes heavy use of this feature: Linkerd’s pod identities are signed by a per-cluster issuer cert, which in turn is signed by a cross-cluster trust anchor. This means, among other things, that if a particular cluster is compromised and someone steals the issuer keys, invalidating all of that cluster’s identities only requires invalidating one key, and the remaining clusters can continue to function. This sort of thing is simply not possible with cluster-wide identity: either you can’t transmit identity across cluster boundaries, or you have to reset all identities if any cluster is ever compromised.
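Here is a rough sketch of that two-level hierarchy using plain openssl, just to show the X.509 mechanics. It is not how Linkerd itself generates or distributes these certificates. The names demo-trust-anchor, issuer.cluster-a, and web.emojivoto.cluster-a are invented for the example, and the sketch assumes a typical default openssl configuration that marks the self-signed anchor as a CA.

```shell
set -eu
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Extension file that marks a certificate as a CA (needed for the intermediate issuer).
printf 'basicConstraints=critical,CA:TRUE\n' > ca.ext

# Trust anchor, shared across clusters.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=demo-trust-anchor" -keyout anchor.key -out anchor.crt 2>/dev/null

# Per-cluster issuer, signed by the trust anchor.
openssl req -new -newkey rsa:2048 -nodes \
  -subj "/CN=issuer.cluster-a" -keyout issuer.key -out issuer.csr 2>/dev/null
openssl x509 -req -in issuer.csr -CA anchor.crt -CAkey anchor.key -CAcreateserial \
  -extfile ca.ext -days 1 -out issuer.crt 2>/dev/null

# Pod certificate, signed by the cluster's issuer.
openssl req -new -newkey rsa:2048 -nodes \
  -subj "/CN=web.emojivoto.cluster-a" -keyout pod.key -out pod.csr 2>/dev/null
openssl x509 -req -in pod.csr -CA issuer.crt -CAkey issuer.key -CAcreateserial \
  -days 1 -out pod.crt 2>/dev/null

# A verifier that trusts only the anchor can validate the pod via the issuer...
with_issuer="$(openssl verify -CAfile anchor.crt -untrusted issuer.crt pod.crt)"
echo "$with_issuer"

# ...but without the issuer certificate there is no chain back to the anchor.
if openssl verify -CAfile anchor.crt pod.crt >/dev/null 2>&1; then
  without_issuer="accepted"
else
  without_issuer="rejected"
fi
echo "$without_issuer"
```

Every pod identity in the cluster hangs off the one issuer certificate, which is why a single issuer is the natural unit to replace or distrust while the trust anchor and any other clusters’ issuers are left alone.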

Finally, philosophically speaking, the security principle of zero trust suggests that we move our security boundaries to sit at the most granular level possible. In Kubernetes, that unit is the pod. With service mesh mTLS, your security boundary is at the pod level: encryption, authentication, and authorization all take place there. With network-layer encryption, you have the opposite: your security boundary is enforced at the cluster level.

Which is not to say that the two can’t work together—defense in depth—or that either approach couldn’t be used to satisfy the base requirement of “encrypted in transit”. However, in my semi-professional opinion, mTLS will give you a significantly better security posture for all the reasons outlined above.

Footnotes


  1. At least, at the protocol level. In practice, the way HTTP is used has certainly changed with the introduction of HTTPS. For example, features like HSTS are now used to prevent certain types of attacks that can occur with HTTPS.
  2. And even though this client token flow doesn’t use TLS client authentication, it still relies on TLS server authentication for security: TLS ensures that the token is being sent to Twilio and not to an imposter.
  3. For example, your web browser ships with the certificates of well-known public CAs like Verisign, Digicert, etc., which are packaged with it at release time. When you download Firefox, you’re trusting Mozilla to have put the correct certs in the browser. For in-cluster communication, we’ll create our own CAs (more on that below), which means we also have to distribute the certificate of this CA to every part of the cluster that is expected to make a trust decision, in a way that’s secure.