Locking down your Kubernetes cluster with Linkerd
In this hands-on workshop, we cover the basics of locking down in-cluster network traffic using the new traffic policies introduced in Linkerd 2.11. Using Linkerd’s ability to authorize traffic based on workload identity, we cover a variety of practical use cases, including restricting access to a critical service, preventing traffic across namespaces, and locking down traffic while still allowing metrics scrapes, health checks, and other meta-traffic.
You can view the slides here.
(Note: this transcript has been automatically generated with light editing. It may contain errors! When in doubt, please watch the original talk!)
Welcome and intros
William: … yeah, let’s do it. Hi folks, welcome to locking down your Kubernetes cluster with Linkerd, it’s going to be a hands-on workshop. I’m going to have a bunch of slides at the beginning where I’m just going to try and educate you as best I can, about what we’re going to be talking about, and then after that, we’ll have a little follow along section. I don’t have a specific repo for this. We’re going to be using stock Linkerd and Emojivoto, and I’ll kind of walk you through exactly what I’m doing and hopefully, you can follow along. So while you are listening to me, if you do want to follow along, get yourself a fresh, clean Kubernetes cluster.
And if you want to be very advanced, you can go ahead and install Linkerd on there. And then we can save ourselves a little time. Speaking of which I’m going to do that on my own cluster. So give me one second here, just to make sure I’ve got a cluster being created because I’m going to be using Kind. So I’ll be running this on my very own laptop. Okay, I’ve got my Kind cluster starting here. You can’t see it, but know that it’s being started, and then I’m going to Linkerd.
Okay, great. While that’s going on in the background, let’s go ahead and get started. I’m William Morgan. I am the CEO of a company called Buoyant. We created Linkerd, Linkerd of course is the CNCF project and fully open source. It’s a community project that Buoyant created. And today Buoyant spends a lot of its time helping other people run Linkerd, we provide some management tools like Buoyant Cloud, which I’ll show off a little bit, support training, and a bunch of other stuff. So that’s me at the top, Jason Morgan is our MC today, we are not related and I just want to make it absolutely clear, he just had to change his name in order to work on Linkerd.
And then, of course, we have the workshops channel in the Linkerd Slack. So if you’re not there already, feel free to go to slack.linkerd.io, and there should be lots of other friendly Buoyant folks there to help you out. Okay, so with that, let’s dive right in. What I’m going to be talking about is a new feature that was introduced in the latest release of Linkerd, called authorization policy.
Linkerd 2.11: Authorization policy
The 2.11 release came out at the very, very end of September last year. And network authorization policy was the headline feature. You can see there’s a blog post that I wrote. The goal of this feature basically is to give you some control over the types of communication that are allowed in your Kubernetes cluster. So, up until now, Linkerd has always allowed communication. And in fact, it’s done its best to make communication happen, right? If you are a service A, and you’re talking to service B, and Linkerd is in there, Linkerd is going to do its best to make that happen. And we suddenly have now a mechanism for denying that communication under certain circumstances. So we’re going to talk about exactly what those circumstances are.
And one important thing to know, of course, is that Linkerd’s authorization policy is built on top of its mutual TLS feature and uses the same mTLS identity. It’s enforced at the pod level. So if you are a security person and you want to talk about zero-trust and all that stuff, it’s all zero-trust compatible. The enforcement boundary for everything that Linkerd is doing is at the pod level, so the pod doesn’t trust anything. The pod trusts no one. And there’s a lot more to say on that. We’re only going to scratch the surface of mutual TLS here. We’ve had other workshops and I think we’ll have more in the future because that’s a big topic.
What is Authorization policy?
So hopefully so far, so good. Feel free to drop a comment in the chat if anything I say is confusing. So what do we mean by authorization policy? As I said, Linkerd and Kubernetes, by default, allow all communication to and from any pod in the cluster, right? That’s the default behavior in most circumstances. And authorization policy refers to restricting some types of communication. We call it authorization policy because we’re talking about authorization: a connection is not allowed to happen unless it’s properly authorized. And authorization policy gives Linkerd, as I said, the power to say no to some types of communication.
So what types? And here I’m going to tell you what’s in place today and what’s going to be in place in future versions. What is in place today is server-side policies, which means these are policies about traffic coming into a meshed pod. The pod is acting as a server, or the Linkerd proxy in that pod is acting as a server, as opposed to a client. Now, when you install Linkerd, between two meshed pods, Linkerd is typically acting both as a client and as a server. Right now, we’re only talking about the server enforcing traffic, so that’s part one. And then part two is, it is only going to authorize connections, right?
Connection level authorization
As of 2.11, we’re not authorizing individual requests. So to summarize: we are only able to restrict traffic that goes into meshed pods, and we’re only going to be able to restrict connections, not individual requests. Now, 2.12, which is currently deep in the design phase, is going to start adding a lot more; this is just the tip of the iceberg. We want to be able to restrict traffic that comes from meshed pods, which is how we’ll do things like ingress control and a bunch of other features. We want to restrict traffic in a much more fine-grained way, so allowing you to say, hey, only these HTTP verbs are allowed on this port or this path, or these gRPC requests. There’s a lot more, but today we’re only going to talk about what Linkerd can do in 2.11, which is traffic to meshed pods at the connection level.
Network policies vs workload policies
And that’s already going to be complicated enough. That’s already more than one workshop’s worth of stuff. Okay. So if you’ve been in Kubernetes land for more than 30 seconds, you’ve probably heard of network policies. This is a very rough comparison of the two. Network policies are a default feature of Kubernetes. They work by using network identity, so basically the IP address. There’s no encryption component, they’re not built on top of mutual TLS, and they’re enforced typically at the CNI layer, so these are network-level constructs. There’s no layer seven semantics, so we can’t do anything with the verb or the TLS identity or anything; we’re really only going to be making rules about IP addresses. And they’re hard to use. That one obviously is a bit of an opinion, but if you’ve tried to use them for sophisticated things, like, hey, I just want to lock down traffic to a namespace, so that everything in the namespace can communicate and everything outside of the namespace is not allowed with certain exceptions, it gets complicated very quickly, right?
So that’s network policies. Linkerd authorization policies, instead of using network identity, you use workload identity. In Linkerd’s case, that means the service account of that workload is the identity of the workload. And that has all sorts of important and interesting security implications that we’re not really going to go into today other than to say, that’s a lot better because it’s a property of the workload rather than a property of whatever IP address the workload happens to live on.
And you don’t have to trust the network and all this other stuff, but suffice to say, they use workload identity. They do include encryption, since they’re built on top of mTLS, so everything we’re doing with authorization policy is going to be built on top of encryption and all the other guarantees that mTLS gives us. And they’re enforced at the pod level. So it’s not the host that’s making the decision, but individual pods, which is what you want for zero-trust.
You can capture layer 7 semantics. We have just the tip of the iceberg there in 2.11, but we’ll see a lot more of that in 2.12. And they’re ergonomic, or at least as ergonomic as we can make them. Now, kind of the interesting thing here is, this is the first feature we’ve really introduced in Linkerd where we’ve given you enough rope to hang yourself. We’ve finally given you a gun where you can shoot yourself in the foot. Linkerd is still user friendly, but up to now it’s also been very safe to use.
Connection vs request
Now we have ways where you can get yourself into hot water. So we’ll go through some of those scenarios. Okay, so hopefully that makes sense so far. Now, I’m just taking a look at the questions here. What’s the difference between connection and request? Okay. So Jason’s got a good answer there. So yeah, you’ve got to establish a connection and then you can make requests over that connection. But once the connection is authorized, then all requests can happen, any request can happen over that connection until the connection stops being authorized.
We also had someone in Slack ask about, when would slides or recordings be available? So everything will be available after the webinar. And we’ll post that in the workshops channel on Slack. If you don’t see it, you can go to slack.linkerd.io to sign up for the Linkerd Slack. And we’ll probably send you a nice little email too.
Okay, and then the other thing I’ll say is, everything that I’m saying here is also in the docs and we’ve written some blog posts, I’ll talk about one of the blog posts later on. We’ve written blog posts that go through a lot of this in more detail. So if I’m going too fast, or if you miss something or whatever, do not panic, you’ll get the recording. You’ll have docs, you’ll have blog posts. There’s a lot of material around this topic because I love it. This is a really cool topic.
Configuring policy: annotation vs CRDs
Okay, so with that, let’s get into some of the actual mechanics here. So in Linkerd, how do you actually configure policy? There are two core mechanisms. First is there’s an annotation, and the second is there’s a pair of CRDs.
And that annotation configures what we call a default policy. Okay. And then the two CRDs are ways that we specify exceptions to the default policy. And we’re going to see how these things work together, these two parallel mechanisms. All right. And this brings the total number of CRDs in Linkerd to four. So, apologies, we’re trying to keep those things to a minimum, but we’re up to four. Well, one of them’s in the multi-cluster extension, so maybe it’s just three in the core control plane.
All right, let’s take a look at how these things work together. So let’s start with default policies. Every cluster has a cluster-wide default policy, and you can override that at the namespace level or at the workload level with annotations. When you create your cluster, you set this value called the default allow policy. By default that is all-unauthenticated, which is basically the absence of policy, so by default we’re not changing anything and we’re not restricting anything, but you can actually control stuff at the cluster level. You can also override this policy using annotations at the namespace level or at the workload level with the default inbound policy annotation. And I’m going to show you in the next slide, I think I have the list of all the different default policies that are available to you.
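As a quick sketch, assuming the stock annotation key from 2.11 (`config.linkerd.io/default-inbound-policy`) and an example namespace name, a namespace-level override looks something like this:

```yaml
# Deny all inbound traffic by default for every meshed pod in this
# namespace. Pods must be restarted to pick this annotation up.
apiVersion: v1
kind: Namespace
metadata:
  name: emojivoto
  annotations:
    config.linkerd.io/default-inbound-policy: deny
```

The same annotation can also go on a workload’s pod template instead of the namespace.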
Proxy default policy
Now the one important thing to know here: there’s a big difference between annotations and CRDs in Kubernetes land, which is that when you change an annotation, at least the way that Linkerd works, Linkerd typically doesn’t reflect that until you restart the pod. So the proxy default policy is going to be fixed at startup time. If you want to change a default policy, you not only need to set an annotation, but you then need to restart that pod. And if you need to, you can inspect the environment variables of the proxy container and see exactly what default policy it’s using. That’s different from the CRDs, because the CRDs are going to be read dynamically.
Okay, so we’ll talk about CRDs in a minute, but first, let’s take a look at all of the default policies that are available. We already saw all-unauthenticated, which means all traffic is allowed. We’ve got cluster-unauthenticated, which means that if source IP addresses are in the cluster’s network space, they’re allowed; otherwise, they’re denied. The next one is all-authenticated, which means only allow traffic from clients that are using Linkerd mTLS, and it doesn’t matter what identities they have or anything like that.
And then cluster-authenticated, which is kind of a combination of the previous two: they have to be mTLS’d by Linkerd and they have to be within the cluster’s network space. And then of course we’ve got deny: deny everything. So if you want to turn off all traffic, there’s deny. And all of these default policies are very coarse-grained; we’re going to see how to modify them and add exceptions with the Server and ServerAuthorization CRDs. So I’ll pause there for questions or concerns.
Jason: All right. So we have one from Veren, are these default policies ones that can be applied at a pod service level, or how are they applied?
William: Yep. Yep. So all of these can be applied at the cluster level or at the pod level, basically. And to apply it at the pod level, you’re either adding the annotation on the workload itself, so on the deployment spec, or you’re adding it at the namespace level.
Jason: So this is Josh, and thank you, Josh. Can you change the cluster-wide policy at any time? And if so, does that restart any control plane components? Great question.
William: Oh Josh, now you’re asking for the dark secrets. So you can change that thing. Does it restart? I don’t think it restarts anything by default, but you can change that config. And if you change that config, you actually get a little bit of dynamic behavior there, because if you change it while the cluster is running, then any pod that doesn’t have an annotation specifically giving it a default policy will use the cluster’s default policy. So my recommendation is don’t change that thing. Set it once at cluster creation time and don’t mess with it afterwards because it’s weird. Thank you for asking the question. I was hoping no one would ask.
Jason: So we actually got a bunch of really good questions now. And so thank you, everyone, for this. So Jonathan asks, is it possible to add a service outside of Kubernetes to the mesh and use the all authenticated policy?
William: Yeah. So that’s a great question, Jonathan. In 2.11, we don’t have the ability to run the proxy outside of the mesh. So the answer is no, you can’t use… I mean, if you have a service outside of Kubernetes, and you want to connect to something in Kubernetes, you’d go through ingress, or you’d go through the service IP, the load balancer IP, or whatever, directly. And at that point, it’d be treated as non-cluster traffic if you went directly and didn’t go through an ingress layer. If you went through an ingress layer, then it would of course be treated as cluster-internal, because it’d be going through the ingress as a hop.
Jason: Oh, sorry.
William: No, I see another question about the Kubernetes version. I don’t know off the top of my head, it’s whatever 2.11 supports. And is it CNI dependent? No, it’s not CNI dependent. Does it conflict with CNI or with network policies? No. They’re two orthogonal implementations and you can use them together if you want, with the giant caveat, of course, that Linkerd assumes that every proxy can talk to the control plane. So there are ways you can screw that up using network policies and Linkerd can’t really help you there. But in principle, you should be able to use these two things in a complementary way.
Jason: We have a couple of great questions over in Slack. One of them I’m going to go to first. So this is Chad, thank you, Chad. Can you give an example of when it would be appropriate to use all authenticated versus say something like cluster authenticated?
William: Yeah, that’s a great question, Chad. I don’t know that I have a good answer to that.
Jason: My general thought is multi-cluster versus single cluster, right?
William: Yeah. That’s a really good question. Yeah, presumably it’s related to multi-cluster. Yeah. Good question.
Jason: We have from Rama Krishna, and I’m sorry if I said your name incorrectly there, Linkerd takes care of SSL termination, in that case… Oh, so let me reread that question and come back to it. Are there any concerns to keep in mind when using something like OPA with Linkerd?
William: No, I don’t think so. The way that we’ve primarily seen OPA used with Linkerd’s policy stuff is through something like Gatekeeper where you say, okay, I can’t create a workload in this namespace, unless I have some policy objects created for it or something like that, or I can’t create a namespace unless I have the annotation that sets the default deny policy or something like that. But no, those two, Linkerd and OPA should basically work together.
Jason: Okay. And then Preiag, and again, I’m sorry if I said your name incorrectly, says talking about authorization policy makes me wonder whether JWT authentication is supported or not, and that this is just a feature that they’ve been looking for.
William: I’d love to have that feature, it’s not in there today, but I think as soon as we get into the next iteration here where we’re doing fine grained policy, and we’re authorizing individual requests, then suddenly that sort of thing becomes possible. We wanted to do connection-level first and then we can do request level. Yeah. A great question. All right. Let me move on in the next section, because I want to make sure we have enough time to cover everything. Okay. So we’ve talked about the annotation that’s set to a default policy. Now we’re going to talk about the two CRDs.
Server and server authorization CRDs
And there are two: the first one’s called Server and the second’s called ServerAuthorization, and they work together. So the Server CRD basically selects a port over a set of pods within a namespace; everything is within a namespace.
You pick a port and then you pick the set of pods that correspond to that port. Now that set of pods could be an individual workload or it could be everything in the namespace. One common example here is you have a Server that selects over the admin port, Linkerd’s admin port, which exposes health checks and metrics, for everything in the namespace. And then you’re not tying it to individual workloads. Alternatively, you could tie it to an individual workload. So here’s an example Server CRD, called voting-grpc, that selects over every pod matching this voting service app label and then picks this voting-grpc port from the pod manifest. And actually, we’ll see this one a little bit later; it’s specific to an individual service.
And you can see, we actually have a little protocol hint here. So this is optional, but it’s a way of giving Linkerd some information so that you can bypass the protocol detection, it makes things a little faster.
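For reference, that Server looks roughly like this; the field shapes are per the 2.11 `policy.linkerd.io/v1beta1` API, and the names and labels here follow the stock Emojivoto policy manifest:

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  namespace: emojivoto
  name: voting-grpc
  labels:
    app: voting-svc
spec:
  # Select every pod in the namespace carrying this app label.
  podSelector:
    matchLabels:
      app: voting-svc
  # References the named containerPort in the voting pod spec.
  port: voting-grpc
  # Optional protocol hint; lets Linkerd skip protocol detection.
  proxyProtocol: gRPC
```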
Okay, so that’s Server. So remember, a Server’s not doing anything other than selecting one port over a set of pods. Servers can match multiple workloads. All right, yeah. Okay, here’s an example I was thinking of. So here’s a Server that we’re just calling admin. Whoops, let’s go back here. We’re just calling it admin, we’re selecting over this port, and we’re just going to match every pod in the namespace. So it can be decoupled from workloads. That’s Servers. Now, if you just create a Server by itself, you basically turn off traffic to that port. So if you want to then allow traffic to that port, you have to create a ServerAuthorization CRD. So let’s talk about that next. So now that I have a Server defined, which gives me a port, I’m now going to create a ServerAuthorization, which is going to select over that Server.
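A sketch of that admin Server, assuming the injected proxy’s admin port keeps its default name (`linkerd-admin`, port 4191):

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  namespace: emojivoto
  name: admin
spec:
  # Empty selector: match every pod in the namespace.
  podSelector: {}
  # The Linkerd proxy's admin port, which serves probes and metrics.
  port: linkerd-admin
  proxyProtocol: HTTP/1
```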
And it’s going to describe the types of traffic that are allowed to that Server, i.e., to that port on those pods. So in this example, we allow unauthenticated traffic to the admin Server. Here we have a ServerAuthorization, its name is admin-unauthed, and it’s going to select over Servers with the name admin, that’s the one that we saw in the previous slide, and the client definition here, who’s allowed to actually talk to this, is just unauthenticated. So anyone is allowed to talk to this admin port.
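That authorization, roughly (same API group as the Server examples, names as described above):

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: emojivoto
  name: admin-unauthed
spec:
  # First level of selection: target the Server named "admin" by name.
  server:
    name: admin
  client:
    # Anyone may connect; no mTLS required. This is what lets
    # plain-text kubelet probes through.
    unauthenticated: true
```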
Okay. ServerAuthorizations can match multiple Servers. A Server can match multiple pods, however you define that selector, and a ServerAuthorization can also match multiple Servers. So we’ve got two levels of selection in here. In this example, we have a ServerAuthorization called internal-grpc. We’ll see this one in the demo. And you can see that it matches every Server that has a label emojivoto/api equal to internal-grpc. And then it allows, in this case, this client block, previously we had client unauthenticated, this client block says: you must have mTLS from the service mesh, and you must have a service account named web.
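Roughly, that looks like this (again following the stock Emojivoto policy manifest):

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: emojivoto
  name: internal-grpc
spec:
  server:
    # Second level of selection: match Servers by label rather than by name.
    selector:
      matchLabels:
        emojivoto/api: internal-grpc
  client:
    # Only meshed clients running under the "web" service account.
    meshTLS:
      serviceAccounts:
        - name: web
```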
So here we’re authorizing only a particular service account. So basically, to put this together: when a connection comes into a port on a meshed pod, what does Linkerd decide to do? If that pod and port are selected by a Server, and that Server is selected by a ServerAuthorization, then it’s going to follow the ServerAuthorization’s rule. If it’s selected by a Server, but that Server is not selected by any ServerAuthorization, it’s going to deny the connection. And if it’s not selected by a Server at all, then it’s going to use the default policy for that pod.
Q&A break 2
Jason: All right. So we got some great questions in the chat, if you don’t mind.
William: Yeah. Let’s do a couple.
Jason: So Rodrigo asked, and this will be a quick one. Is it possible to do cluster wide policy instead of on a namespace by namespace basis?
William: Sadly, no. CRDs have to sit in a namespace so we just make them apply to the namespace, with the one exception of the cluster-wide default policy.
Jason: Then we have another one from Troy: can Linkerd use SPIFFE workload identity instead of service accounts?
William: That’s on the roadmap now. I’d love to do that. Linkerd right now is very Kubernetes-focused. In that world, you don’t really need SPIFFE because service accounts give you enough. As soon as we start adding mesh expansion, or as soon as we start having the proxy run outside of Kubernetes, then suddenly we need an identity system that is portable to non-Kubernetes environments, and SPIFFE is the obvious candidate there. So that’s on the roadmap. I actually wrote a roadmap blog post, I think it’s the latest blog post on linkerd.io. So you’ll see a little bit about that and some of the other stuff we’re planning to get into Linkerd this year.
Jason: I’ll put that in the chat and in the workshops channel right now.
Denying a connection
William: Great. All right. Anything else or should I move on? Okay. So how does it feel to be rejected? Jason, this probably is very familiar to you. So what does it mean when we deny a connection? It depends on what Linkerd knows about the connection. If Linkerd knows that this is a gRPC connection, then a denial is actually going to be a response with the gRPC status PERMISSION_DENIED, right? If Linkerd knows this is an HTTP connection that’s not a gRPC connection, then denial is a 403 response. So Linkerd is actually going to establish the connection, and it’s going to start returning 403s. And if Linkerd doesn’t know that this is an HTTP connection, if Linkerd’s treating it as a raw TCP connection, then we’re just going to refuse the connection.
So different denial behaviors based on what type of connection Linkerd is treating it as. And these policies are read dynamically, CRDs are read dynamically, so if you update your policies, Linkerd will happily start returning 403s, it’ll happily terminate the TCP connection, it’s going to do its best to enforce that policy as rapidly as it can.
Okay, so we’re almost at the hands-on portion because I want to show this off in practice. I have a couple of gotchas that are important to know. The first is:
Gotcha #1: deny by default policy setup
If you are building a deny-by-default policy setup, so you want to deny everything and then explicitly allowlist stuff in, you need to make sure that the kubelet probes, so things like liveness checks and readiness checks, are authorized. Otherwise, your pod will never start. I showed a demo of this, I think, in our welcome-to-2.11 workshop. But if you just add a deny-by-default policy and you restart your pods, the new pods are not going to spin up, because unless they can get the readiness check through to the admin endpoint, they’re never going to start.
So make sure those kubelet probes are authorized, and note that those probes happen in plain text. And actually, we’ll see an example of that in the hands-on section.
Gotcha #2: default policy is not read dynamically
All right. Gotcha number two. So that was number one. Gotcha number two is: the default policy is not read dynamically. Okay, so I actually said this, with the one minor exception… Oh yeah, here’s the answer, right? You can change the cluster-wide default with linkerd upgrade. But don’t… I don’t know, I guess you could do this. It’s just that the resulting behavior is non-trivial to reason about.
Gotcha #3: ports referenced in the Server CRD must be in the pod spec
Okay, number three. This is an important one and I’ve seen people run into this in Slack: for $REASONS, all the ports that are referenced in your Server CRDs have to be in the pod spec.
And I don’t know why that is, but if you don’t do that, then Linkerd is going to ignore them, and in the log messages it’s going to say something about unknown port. So if you see a log message about unknown port, that’s probably what’s going on. All right, oh, and that was it, those are three gotchas.
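In other words, if a Server references a port by name, that name must be declared as a containerPort in the workload’s pod template. A sketch using the voting-grpc example from earlier (the port number here is illustrative):

```yaml
# In the workload's pod template: the port the Server refers to must
# be declared here, or Linkerd logs an "unknown port" message and
# ignores the Server.
containers:
  - name: voting-svc
    ports:
      - name: voting-grpc
        containerPort: 8080
```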
Okay. So let’s do a little hands-on stuff, we’re already halfway through, and I want to make sure we have time for this. So give me two minutes to share a different portion of my screen. And what we’re going to do is, we’re going to take a look at getting our good old Emojivoto app into a high-security namespace, where we deny everything by default. And there’s a really good blog post on locking down network traffic that I’m basing this on, which actually goes into much more detail by Alejandro. So again, if I’m moving too fast or if you get stuck, or if I’m explaining things in a way that doesn’t make sense, which is very possible, please read the blog post and you’ll get a lot more detail. Okay. So give me just a second to get my screen share set up for hands-on time.
Jason: And for the folks listening, I’m going to share that namespace jail blog in the chat and over in the Linkerd Slack. And thank you, Vikas, for putting that out.
William: All righty. So hopefully you can now see my screen. Jason, can you see it? See my terminal window and Alejandro’s blog post?
Jason: I do. Yeah.
William: Great. Is this terminal text big enough?
Jason: It looks good to me. I’d love to get a thumbs up from any folks in the chat, if you can.
William: All right. So what I’ve done here and I’ll take a look at the chat in a second is, I have a fresh Kubernetes cluster, I’m using kind, as I said in the very beginning, I’ve installed Linkerd on it. The check-
Jason: A couple of folks would appreciate if you could make it a little bit bigger but in general…
William: A little bit bigger. You got it.
Jason: Thank you.
William: Okay. So I’ve got myself a fresh, clean Kubernetes cluster here with Linkerd installed and nothing else on it. And then one thing I’m going to do, because we actually have a lot of policy diagnostic stuff in Buoyant Cloud, is go ahead and install the Buoyant Cloud extension. So we’re going to go for, let’s see, the Buoyant Cloud install via kubectl; it’s going to hook this cluster up. Let’s call it… What are we going to call it? Workshop. It’s called workshop. Okay. So that’s going to apply a little manifest in here. And Buoyant Cloud is going to start computing and doing whatever else it does.
Now, while we’re doing that, let’s also install Emojivoto. Let’s see, I have it in here in my history. Okay, so I’m going to take this emojivoto.yml from run.linkerd.io. I’m going to pipe it into here. We’re going to modify this in a second, so I’m just going to put it in a file here, and then we’re going to do linkerd inject emojivoto.yml and kubectl apply -f. Okay. So that should give us an Emojivoto deployment in here.
And we can see that spinning up. Okay, and then the next thing I’m going to do, and we’re going to start doing policy in here pretty soon, is the way that Emojivoto works, actually, I think I can just show it right here. The way that Emojivoto works is, let’s go to… we’ll give it a second for all the metrics to come in and I’ll show you what the actual topology looks like.
All right, so while we’re doing that, let’s copy emojivoto.yml into emojivoto-bad.yml. I’m going to edit that file. And Emojivoto has this component called vote-bot in here that’s generating traffic. It’s just sending traffic to the Emojivoto app. I’m going to make a copy of it and put it in a different namespace.
Creating the bad namespace
I’m going to put it in the bad namespace. And then once we have that setup, we’re going to start playing around with policy. So far no policy, right? So I’m just going to edit this. We’ll make a bad namespace, and then I’m going to go down, I’m just going to delete everything in here till we get to vote bot, oh, here it is.
Okay. Delete all that, and then delete everything else. And then let’s just make sure we put this in the namespace bad. Okay. So for those of you for whom that was a little too fast: I now have this YAML file that’s just creating this namespace called bad. And then I’m taking that same vote-bot deployment and making a replica of it, a copy of it, in the bad namespace. Does that make sense?
Okay. I’m going to linkerd inject that thing. No tab completion set up here. See, I don’t use Kubernetes enough to even alias kubectl; I’m typing it out by hand, the old-fashioned way.
Okay. Namespace bad, skipped. Oh, namespace bad created. And vote-bot created. Okay, great. So we should see the vote bot here. All right, good. Now let’s go to our topology. Finally, we should start seeing some traffic. So on this topology, I just want to show you what this thing looks like. So this is Emojivoto now. In the emojivoto namespace, we’ve got this vote-bot, which is talking to the web deployment, which is talking to the emoji and the voting deployments. And then we also have this vote-bot over here in the bad namespace. Okay, so hopefully visually this helps you understand what’s happening. So far, no policy involved, so everything’s fine. And in fact, what we can do is… well, no, we’ll do that later. Let’s get some policy in here.
So that same location where you have emojivoto.yml, we have emojivoto-policy.yml. So let me see if I have that in my history somewhere, policy.yml. Okay, so run.linkerd.io/emojivoto-policy.yml, so I’m going to curl that, and we’re going to also modify this later on. And let’s kubectl apply it. This is a set of pre-configured policies that gives Emojivoto the permissions it needs to run. It’s taking a long time. There we go. And then finally, I’m going to annotate the namespace.
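That annotate step sets the namespace’s default inbound policy to deny; with the Linkerd 2.11 annotation name, it looks like this:

```shell
# Any port not matched by a Server/ServerAuthorization pair will now
# deny inbound traffic (for proxies started after this takes effect).
kubectl annotate ns emojivoto config.linkerd.io/default-inbound-policy=deny
```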
I’ll pause for a second. So let’s just review what we’ve done. Okay? We’ve added stock Emojivoto. We’ve made a copy of the vote bot deployment and put it into this new namespace called bad. And then we’ve applied the default Emojivoto policy. So we’ve kind of jumped to the end. Now, let’s take a step back and understand what has happened here. So let’s take a look at this policy manifest. If we go to the policy file, we’re going to see a bunch of servers, of course, and a bunch of server authorizations. Now, there’s one thing that I’ve just remembered I’ve forgotten to do, and I want to do it real quick so that everything works. Does anyone know what I’ve forgotten to do to actually apply this policy?
Jason: I do, but it doesn’t feel fair to guess. Anyone in the chat?
William: Anyone besides Jason know what I’ve forgotten to do?
Jason: We now have some guesses.
William: Yeah. Oh, I’m looking at the chat. Look at this, everyone’s on the ball.
Jason: Also drinking water. Yeah, great one, Alexandros.
William: Okay. Yeah, that’s right. We’ve got to restart all this stuff. So let me do a rollout restart. I’m going to restart all of my Emojivoto deployments. Why? Because I just changed the default policy, and the proxies only pick that up when the pods restart. Okay, so we’re going to restart and pick it up. Now let’s look at this policy file. I’m not going to go into a lot of detail here, but we’ll just run through it quickly.
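The restart itself is one command:

```shell
# Recreate the emojivoto pods; the default-inbound-policy annotation is
# read when the proxy starts, so existing pods have to be restarted.
kubectl -n emojivoto rollout restart deploy
```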
We have a server. So the very first thing is a server called emoji-gRPC, and look, we’ve got all these really nice comments in here. It’s going to match the gRPC port of the emoji service. Here’s our spec, right? It’s gRPC, it’s a port. And then we’re matching every pod that has this app label, which, if we look at our Emojivoto manifest, is the emoji service.
And the server itself also has a label, emojivoto/api equals internal-gRPC, and this is going to be used by a server authorization that we have later on. So hopefully this looks very familiar, because we either saw this exact thing or a version of it in the slides. So that’s the emoji-gRPC server. We’ve got another similar server for voting-gRPC. And then I’ve got a server authorization, and this thing is called internal-gRPC. It’s going to match both of those servers by using this label.
And it’s going to allow only authenticated traffic from the web service account. So again, hopefully, this is very familiar to you; we saw a variant of this in the slides. And then finally, down here we’ve got a server and server authorization pair corresponding to the web service, because the web service is a little different: this one actually has to take traffic from the outside world. And we can see here that what vote bot is doing is talking to the web service. That’s what the manifest looks like. Let’s actually take a look at the web service, and what I’m going to do here is look at this policy tab. So one thing we’ve done here is we’ve given you kind of the interpretation of the policy.
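Putting those pieces together, the Server and ServerAuthorization pair just described looks roughly like this. This is a sketch against the Linkerd 2.11 CRDs; the specific names, labels, and port come from my reading of emojivoto-policy.yml, so check the real file:

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  namespace: emojivoto
  name: emoji-grpc
  labels:
    emojivoto/api: internal-grpc    # matched by the server authorization
spec:
  podSelector:
    matchLabels:
      app: emoji-svc                # the emoji service's pods
  port: grpc
  proxyProtocol: gRPC
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: emojivoto
  name: internal-grpc
spec:
  server:
    selector:
      matchLabels:
        emojivoto/api: internal-grpc  # matches both gRPC servers
  client:
    meshTLS:
      serviceAccounts:
        - name: web                   # only authenticated traffic from web
```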
So if you’re actually looking at an individual workload, we’re going to tell you: here are all the servers that apply, here are all the server authorizations that apply, and here’s the default behavior. And so here’s what you should expect for this workload. So here for this web workload, we see… Actually, you know what, let’s go back. Before we look at web, let’s look at vote, no, let’s look at emoji, because that’s the one we were just looking at. So let’s look at policy here. All right. So this admin port has been allowed. So all access is allowed to the admin port, and it’s using the admin server and the admin-everyone server authorization, which are at the very bottom of the file; we didn’t go over them.
Then we’ve got our emoji-gRPC server and the internal-gRPC server authorization. So this is saying: on port 8080, we only allow mTLS traffic from the web service account, and we deny all other traffic. We have another thing that allows Prometheus traffic, and then all other ports deny traffic. So this is our kind of one-stop view of the policy for the emoji service. Actually, since I increased the font, oh, that’s going to make it look bad. Huh? Let me do this. Let’s do just the right one. There we go. Okay.
So this is like our sanity check, right? And hopefully, this makes sense. Now let’s go back to our web policy here. So the web policy is a little different. We’ve got that same admin server and server authorization pair. And then we’ve got our web server, which matches port 8080, and then we’ve got our web-public. And this is very simple: we’re allowing all traffic for that, and any other traffic is denied. Okay, that makes sense. So far I see 33 new messages in the chat. So Jason, maybe you tell me whether this is making sense to everyone.
Jason: I think so, so we’ve got one question about passing a list to proxy protocol. I think the answer is no, but I asked some of the folks that are a bit stronger with policy just to give you a sure answer.
William: Ah, interesting. So if you have a port that does multiple protocols?
Jason: I’m not quite sure what the context is. And then we also had someone ask for the command, the annotate command that you use, but Amory actually pasted that already.
William: Yeah. Okay, good. Now actually it’s in this blog post as well. It’s setting the default inbound policy: config.linkerd.io/default-inbound-policy equals deny. Okay. Yeah. Sorry, I went through that part pretty fast. Okay. So now I want to start screwing around with policies, and I actually want to start denying traffic from this bad namespace. Before we do that, I just want to take a quick peek at our traffic list here. So this is a list of every… All right, let me make this window a little bigger. Can I make it, there we go. This is a list of all traffic going to any meshed pod. All right? So let me filter this to just the bad and Emojivoto namespaces, we’ll sort by port, and this will be a little easier to interpret.
So here you can see, it’s a whole bunch of traffic to the Linkerd admin port, both from the Buoyant Cloud agent, which is doing metrics scrapes, and also these unauthenticated things, which are the kubelet probes. We’re going to ignore those for now; there’s a lot more to say about them, but that’s not important. Here’s our actual application traffic, and the way to read this is: the emoji workload is taking traffic with this TLS identity of web.emojivoto. So that’s the web service talking to the emoji service. Here’s the web service talking to the voting service. And then here are the two sources of traffic that the web service itself is taking, right? There’s default.bad, and there’s default.emojivoto. We were a little sloppy when we made the Emojivoto manifest and we just used the default service account for the vote bot, right?
So both vote bots just have, literally, the service account named default there. If we were a little fancier, we would’ve given them their own service accounts, but these are basically the two sources of traffic. And you can see the policy that’s allowing this traffic to happen, and way up at the top here, we can see that we don’t have anything denied. So now let’s actually start denying stuff, right? So now comes the fun part. What I’m going to do is go to another tab and read the logs for that service. So let’s see, kubectl -n bad. Let’s see, what’s our pod here, right, kubectl -n bad.
Editing our policy
Let’s do logs, and let’s do -f, and then here’s our pod. And this pod actually has two containers in it. Hopefully, you know what one of them is, and the other one’s called vote bot. Okay, so here is our vote bot. So I’m just going to leave that running, because we’re going to watch it cry. We’re going to hurt it and we’re going to watch it cry. All right, so let’s go edit our policy. So let’s go down, and the policy that I want here is this web-public server authorization that’s currently letting in this traffic, and I don’t want that to happen. So let’s find our web-public server authorization, and let’s delete… so this is the unauthenticated bit. I’m going to delete that. And in here, I’m going to put, I’m actually going to, let’s steal this from Alejandro’s lovely blog post. Where are we here?
There’s something… Here we go. Okay. Yeah. So we haven’t seen this yet in the workshop, but here is a wildcard, let me fix up the spacing here. So this is a wildcard identity match. This says traffic to the web service is only going to be allowed if your identity comes from this Emojivoto namespace. Okay, all right. So now we’re going to apply this policy.
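The edited server authorization ends up looking something like this. The identity string follows Linkerd’s standard serviceaccount.namespace.serviceaccount.identity.linkerd.cluster-domain format; the server name web-http is assumed from the policy file, so check yours:

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: emojivoto
  name: web-public
spec:
  server:
    name: web-http
  client:
    meshTLS:
      identities:
        # Allow any meshed identity from the emojivoto namespace only.
        - "*.emojivoto.serviceaccount.identity.linkerd.cluster.local"
```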
Okay, nothing’s changed except for this one thing, right? Web-public is configured. If we go to our logs, that vote bot in the bad namespace is now giving us this error message, which we should really improve, because this is not a good error message, but it’s crying. It’s crying for help. And if we go back to our traffic tab here, what we’re seeing is we actually have a denial now. Right? So this traffic here, this is the last one minute, so we’re still seeing some vestigial traffic from… oh no, it’s gone. All right, great. Yeah. So the default.emojivoto traffic is still allowed, and then in our denial table, we have default.bad. So we’ve successfully cut off that traffic. And the only workloads that are allowed to talk to the web service now are things that have a meshed TLS identity in that same namespace.
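If you want to reproduce that log tail yourself, it’s something like this; the deployment and container names assume the copied manifest from earlier, and the pod’s other container is the injected linkerd-proxy:

```shell
# Follow the vote bot's logs in the bad namespace; -c picks the app
# container, since the pod also contains the injected Linkerd proxy.
kubectl -n bad logs -f deploy/vote-bot -c vote-bot
```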
So if we click on, just to validate that, let’s go back to our workload. Sorry, I know I’m clicking around really fast, but we’re almost out of time and want to make sure we have time for questions at the end. So here’s our web service. Okay, let’s look at our policy. Here is a new policy, this is the thing we changed. And this webserver authorization is the thing that we changed. And we give you a bunch of other information here. Here are the workloads that match it. Here are the servers that match it and so on. So hopefully if you get yourself really tangled into a policy thing because this stuff gets complicated, you’ll be able to untangle it a little bit. Now, there’s one last thing. And then we’re going to stop with the demo.
…and another gotcha: no success rate change despite denied traffic
One last thing, which I think is interesting, maybe should be a gotcha. So let’s go back to our workload, let’s look at this vote bot in bad. And let’s look at its metrics. So these are the metrics over the past 10 minutes, right? And the success rate, something has changed dramatically in our system. We’ve denied traffic. We can see from the log lines like terrible things are happening, but nothing in the success rate has changed. So here’s my challenge question for the audience, why is that?
I see Josh typing, uh oh, Josh has been hanging out in the Linkerd Slack for a long time. He might know the answer, and he’s got it. He’s got it. That’s right. That’s exactly right. Yeah. So a 403 is considered a successful response from Linkerd. It’s not an error. It’s not like the server died. It’s not a 5XX, it’s not a 500, it’s a 403: you asked the server a question, and the server said no. So that constitutes a successful response, and the success rate doesn’t change. Now, if we had a breakdown here of response codes, which we should add, in my opinion, rather than just showing you whether it was successful or not, then you would actually see a big change. You’d see all those 200s suddenly be replaced by 403s. In this view, we don’t actually see that, we just see success.
Okay. So hopefully this whirlwind tour of policy made a little bit of sense to you. Like I said, the blog post that Alejandro put up has a lot more details and will actually walk you through constructing this policy; it does it in a nicer way than what is in emojivoto-policy.yml, which I think is a little bit overdone. So I definitely recommend reading through the blog post and making sure that you understand it. The Buoyant Cloud aspect is totally optional. It gives you a bunch of visibility, but none of the policy actually requires it, of course; it’s all done internally.
linkerd authz
And we have an interesting command called linkerd authz that will tell you some information as well, which I just realized today is not actually documented on the website. So we’re going to fix that real quick and we’ll have docs for that. Okay. So with that, I think we’ll pause the follow-along section, and we’ll spend the last couple minutes, we’ve got a good seven minutes, just doing Q&A.
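A sketch of the usage, with the caveat that the exact command path may vary by version (in 2.11 it ships with the viz extension):

```shell
# List the servers and authorizations that apply to a workload's
# inbound traffic.
linkerd viz authz -n emojivoto deploy/web
```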
Jason: We have a good one from Veren, and apologies if I got your name wrong there, but it’s basically: why does policy not include a built-in exception for liveness checks on something like port 4191? And is that something we’re considering?
William: Give me one second while I get the slide shared and then I’ll be able to think about that.
Okay. So here, we’re going to stare at the ad for the next workshop while we do this. Yeah, so why doesn’t it have a default exception? That’s a good question. And is that something we would consider changing? Potentially. I don’t think the current state is that great, right? The fact that you have to explicitly make that annotation, or else pods don’t start up, is a little screwy. It’s logical, but it probably hurts more than it helps. We’ve got a couple of options that we’re considering for the next iteration of this.
One, of course, is just to have a built-in way of disabling that. The other option, which I like, is that if we have fine-grained authorization, we can actually have a default policy that allows you to call the health endpoint without necessarily allowing you to hit the metrics endpoint. Right now both things are served on that same port, but we could differentiate the policy based on whether you’re looking for metrics or whether you’re looking for health. And the third thing we could do is split it out into two ports: one port, which just does the health checks, could be open by default, and the other, metrics, would be more locked down. So there’s a bunch of stuff that we’re considering. But yeah, this is definitely something that we’re looking at. Great question, Veren.
William: What else you got?
Jason: Looks like it’s clear, folks. This is a great time to ask any questions you might have around policy.
William: If anyone wants me to revisit any of those commands or what I did in the terminal, I’m happy to do that. Hopefully, you kind of understood it, but I know I was moving a little fast.
Jason: So Lekesh asked a question that I think is really good, which is, essentially: where do we see authorization policies in relation to network policies, and how are they different?
William: Yeah. So I tried to touch on that in one of the slides. I think they work in conjunction, because there are a lot of reasons why you want to have layer three/four control, and there’s a lot of stuff that that level of control can do that Linkerd can’t. You might want to make sure that all traffic goes through an egress, for example, and Linkerd can’t do that. They’re complementary, let’s put it that way. The big difference is that Linkerd can express policy at a higher level of abstraction. So you’ll be able to say: allow traffic from everything in this namespace, or allow traffic from everything that has an identity that matches this wildcard, and in the next version, differentiate authorization based on methods and verbs and stuff like that. So it’s a lot richer, but they can be used together.
Jason: And then we had a great question from Peter about multi-cluster communication. Can we do policy that works between clusters?
William: So yes, you can do policy that works between clusters, I think the big caveat there is that… And here, I’ll be frank, I’m kind of stretching my understanding a bit, but the identity from the source cluster is currently not preserved when it goes through the gateway. So the identity that you get is the gateway identity on the destination side, it’s not the source cluster identity. So yes, you can do policy, you lose something. And I think that’s part of what we want to address in the future. You lose the actual source identity, but you’ll know that it comes from the gateway and so you’ll know it’s multi-cluster traffic.
Jason: Okay. And then, I think it’s Martin, sorry to hear that, says that they’re unable to get Linkerd to work on GKE Autopilot. We’ll take a look at that and see what’s happening there. For anyone who’s been asking, if you’re looking for a link to the recording or the slides, that’ll be available in the workshop channel later today. And that looks like most of it.
William: Okay, great. Well, we’ve got two minutes remaining. So for the remaining two minutes, Jason will serenade us with the latest popular music hit.
Jason: Yeah. Just one thing I’d add on the network policy question: another place where network policy is handy is communication that your service mesh won’t handle, right? Like, we’re going to do TCP traffic and not UDP traffic, as an example. Anything that you want to handle at the network level, that’s the right place to do it. And we posted the links, and thank you, I forgot who asked, but thank you for asking us to paste them in. Anyway, you can click and register for the next workshop directly in Slack. Any issues with QUIC traffic? I know I’ve heard about that.
William: There are no issues, because QUIC is over UDP and Linkerd doesn’t do anything with UDP right now. So no issues whatsoever, Martin. Works great.
Jason: Yeah. And thanks to everyone for attending. We’re super grateful for your time, and hopefully, check out the recording later if you have any further questions or check out some of the articles that we posted.
William: Yeah. Thank you, everyone. We’ll all be hanging out in the Linkerd Slack so feel free to keep asking us questions. Okay, and with that, I think we’re done.
Jason: All right. See you folks.
William: Thanks everyone.