Debugging an application with a service mesh

Microservices are great, no doubt about that. But when it comes to troubleshooting issues in a distributed system, it’s rarely easy. Getting multiple independent teams to agree on a standard set of metrics or debugging strategies makes it even harder. Issues turn into a blame game where DNS, Kubernetes, the network, and the mesh all take turns.

Linkerd has a built-in ability to tap into and analyze traffic to quickly identify and isolate problems. To do that, no code changes are required, nor do app teams need to expose their own metrics or become experts in Kubernetes or the mesh. When a problem occurs, Linkerd users can rely on the mesh as a single source of truth to help quickly identify issues and drive down MTTR.

Transcript

Note: this transcript has been automatically generated with light editing. It may contain errors! When in doubt, please watch the original talk!

Hey folks! Hello, and welcome to ServiceMeshCon EU. Today, we’re going to talk about debugging an application with your service mesh — tap tap.

Speaker and agenda

My name is Jason Morgan — this person here, also that person there. I am a Technical Evangelist at Buoyant, and it’s my job to talk to folks about the Linkerd project, encourage them to use it and evaluate it, and help folks as they move from development through to production with Linkerd. You can find me on Twitter @rjasonmorgan, GitHub @jasonmorgan, and on the Linkerd Slack @jmo. This is about the end of our slides for today. We’re going to do everything as a live demo — or as close to a live demo as we can get, considering the circumstances.

The setup: two apps running on Kubernetes

Let’s talk about what we’ll cover. I’ve got two applications running in my Kubernetes cluster that are having some issues, and we want to diagnose and remedy those problems with our mesh. I’ve got one application, a web frontend backed by two gRPC services, and another that is a series of web applications talking to each other over REST. Emojivoto is the gRPC application. It gives us a web frontend that displays a number of emojis, allows us to vote on them, view the leaderboard, and see what the current voting state is. It seems to be working as designed but we’re having a problem with it. I’m getting reports from my users and want to fix it.

Let’s go ahead and take a look at the service map for emojivoto. What I have here is a vote bot that generates some traffic. It talks to our web service and then the web service makes a gRPC call to voting or emoji. To be clear, I’m not getting that gRPC thing from looking at this graph, I knew that already because I work with the emojivoto app a lot.

Checking communication of cluster components

I want to debug and get things going, so the first thing I’m going to do is talk to my Kubernetes cluster, ensure that I can communicate and that things are running as expected. I can ask about the nodes and see that I’ve got my three nodes in my k3s cluster — this k3s cluster is provided by Civo, who are going to announce something here at KubeCon, so please stay tuned for that. Now that we’ve checked our nodes, we’re going to check the state of our mesh. The Linkerd CLI bundles this check command, which will check the health of the control plane and all of its components. It will also look at any installed Linkerd extensions and run their health checks as well. We see that the results are green check-marked for both — which sure seems good to me — so we’re going to move on and actually troubleshoot our application.

Checking cluster namespaces

First, I want to troubleshoot my app and take a look at the namespaces in the cluster and see what Linkerd sees in terms of the golden metrics for those namespaces. The golden metrics are success rate, request volume, and latency. We can see we’ve got a bunch of namespaces here. podinfo is seeing a ton of requests but is responding great and has a 100 % success rate, and Linkerd viz and Linkerd dashboard also both have a 100 % success rate. One nice thing with Linkerd is that the control plane components are also part of the data plane, so we can use the same debugging techniques we’re going to use today on our various applications to check on the health of our mesh or debug issues should we run into anything.
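To make the three golden metrics concrete, here is a minimal Python sketch of how they could be computed from a window of observed requests. It is an illustration only, not how the Linkerd proxy computes them: the `Request` record, the `golden_metrics` helper, the nearest-rank percentile method, and the sample data are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one observed request."""
    success: bool      # did the response count as a success?
    latency_ms: float  # time taken to answer, in milliseconds


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def golden_metrics(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Success rate, request volume, and latency for one observation window."""
    if not requests:
        raise ValueError("no requests observed in this window")
    latencies = sorted(r.latency_ms for r in requests)
    successes = sum(1 for r in requests if r.success)
    return {
        "success_rate": successes / len(requests),
        "rps": len(requests) / window_seconds,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


# Ten made-up requests over a five second window, one of which failed.
sample = [Request(success=(i != 3), latency_ms=float(i + 1)) for i in range(10)]
metrics = golden_metrics(sample, window_seconds=5)
print(metrics)
```

A namespace showing a sub-100 % value for `success_rate` in a summary like this is the cue to drill down into its deployments.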

Checking application success rate

We have emojivoto and booksapp that are both seeing a sub-100 % success rate, so let’s dive into emojivoto and see if we can’t get this problem fixed. Now that we know we have two namespaces with issues, booksapp and emojivoto, let’s look at the deployments in emojivoto and see what their relative health is. I can see right away that vote bot and emoji seem to be succeeding all the time and have low latency, which is great, but web and voting are both reporting some problems, so let’s dive in a little bit further.

Now that we’ve got the statistics on the deployment objects overall, we can actually dive into each one of those deployments and see what live calls are coming in and what they can tell us about the health of the application. Now to be clear, I haven’t instrumented anything inside these apps. These are normal gRPC apps, and I don’t have tracing enabled or anything in particular. This is what the Linkerd proxy is able to tell me about my application traffic natively with no configuration. I can see what pod a given request is coming from, what pod it’s going to, the method, as well as the path that’s being called, and the number of requests coming in. I can see right away that my web pod is talking to emoji and those should all be successful. We can see the calls are ListAll and FindByShortcode, and, if we scroll over to the right, we see that those are both 100 % successful, which is great and in line with our expectations based on the statistics we’ve seen.

Votebot talking to web has two calls: api/list and api/vote. Let’s see how they’re doing. api/list is at 100 %, but api/vote is at around 85 to 90 %, so it’s not succeeding all the way. Web is talking to voting a lot, and it’s hitting these individual vote paths. Let’s look at their success rate. It looks like it’s good for most things, but VoteDoughnut, which should be our most popular emoji, is actually seeing a zero percent success rate.

Zeroing in on the emojivoto issue

We have probably diagnosed our problem already. What we’d like to do now is just triangulate a little bit. Let’s go check out the voting service and see what it reports. I do a top on that voting deployment and I can see that all the calls to voting, like we saw in our traffic map, are actually coming from the web service. We see the paths they’re taking inside the API, how many there are, and the success rate. So far everything seems to be going great and we don’t have any failures, but let’s give it a minute and see what else pops up.

We have vote doughnut coming in and starting to receive some traffic. We see that it’s actually hitting a zero percent success rate — let’s go ahead and check on that. I can go over to my emojivoto app and click on doughnut, and I see that I’ve got a 404 — so we have a real problem. I have more than enough to package over to my development team so they know where to begin looking for the problem.

Digging deeper with Linkerd’s viz tap command

Let’s continue to dive a little deeper and see if we can’t grab some of that live traffic over to this voting service and see what we get. We’ve got the Linkerd viz tap command here, and tap is going to snoop in on the calls between the two proxies and get a bunch of metadata around them, so we can see the current state of our environment. We run our command and see that we have calls like VoteRunningMan. It indicates that it’s an mTLS call which is the default for Linkerd. When you install it, you get mTLS between all services. We have a status code of 200 — which is great — and a gRPC status.

Let’s look for doughnut. We’ve got a doughnut call here. Now I’ve got a path of emojivoto.v1.VotingService/VoteDoughnut. We can see that, while we still have TLS true and status code 200, we also get a gRPC status of unknown, which is actually a gRPC error.
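The point that an HTTP 200 can still be a failed gRPC call is easy to miss, because gRPC reports its own status separately from the HTTP status code. The following Python sketch shows one way a response could be classified under that rule. It is a simplified illustration under stated assumptions, not the Linkerd proxy’s actual classifier: the `is_success` helper and its fallback rule for non-gRPC traffic (treat 5xx as failure) are hypothetical. The numeric codes themselves come from the gRPC status code list, where 0 is OK and 2 is UNKNOWN.

```python
from enum import IntEnum
from typing import Optional


class GrpcStatus(IntEnum):
    """A few of the standard gRPC status codes."""
    OK = 0
    CANCELLED = 1
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5


def is_success(http_status: int, grpc_status: Optional[int]) -> bool:
    """Classify one response (hypothetical helper for illustration).

    If a gRPC status is present it decides the outcome, because gRPC calls
    normally travel over HTTP 200 even when they fail. Otherwise fall back
    to the HTTP status and treat 5xx as failure.
    """
    if grpc_status is not None:
        return grpc_status == GrpcStatus.OK
    return http_status < 500


# The VoteDoughnut call from the demo: HTTP 200, but gRPC status UNKNOWN.
print(is_success(200, GrpcStatus.UNKNOWN))  # False
# A healthy gRPC call: HTTP 200 and gRPC status OK.
print(is_success(200, GrpcStatus.OK))       # True
```

This is why the status code of 200 on the doughnut call is not reassuring on its own; the gRPC status is the field to watch.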

What we got out of that last call is the particular path to look for VoteDoughnut errors. Let’s narrow the tap from the base URL to the full VotingService/VoteDoughnut path and see what calls come in. This is going to populate as soon as our traffic generator votes for doughnut, or I can actually pop over to my web app and try voting for the doughnut again. We see some of that traffic coming in. We still have TLS true, status code 200, and that gRPC status of unknown. I could save this and pass it to my developers, which would be handy for them, but we can actually get way more detailed information.

Rich data for devs to solve the issue

I’m going to run that same tap command again, but I’m going to change the output format from default terminal to a full JSON output. We’ll see the same basic thing but with a ton more detail. We could save this, bundle it with a message for our developers on the voting service, and let them start solving the problem.

We see that the output of one of these calls is now JSON. I’ve got a bunch of metadata about the source, the destination, the request information, and all the headers involved in this call. If I look a little further, I can look at the headers for another request and see that we’ve got our gRPC status and gRPC error message right there. I save that and send it to my devs. I’ve discharged my duty to the team in terms of debugging emojivoto, and now it’s on the developers to actually get it fixed.

This troubleshooting process actually took longer than it needed to because I was looking around and waiting for things to come up. When I went to the voting service, I didn’t see anything about VoteDoughnut until a fair bit of time had passed, and I had to do that kind of aggregation myself. One of the things that’s really nice about Linkerd is that you install it and don’t have to build or configure a bunch of custom resource definitions in order to make the mesh work. You get all the value with a very simple install and inject on your applications.

Custom resource definition: service profile

One of the two custom resource definitions that Linkerd does use is the service profile. Let me create this and we’ll talk about what we did. I use the Linkerd CLI to look at the proto file for my emoji service. Emoji is a gRPC service that uses these protobufs, and we can look at them to see what the valid calls are for this service. We get this service profile object, which will allow Linkerd to do some more intelligent stuff with the traffic for this application. It will be able to do things like collect and maintain the data about the given routes, right there inside the service, so that we can more quickly debug issues like this when they come up.

I’m going to create and apply a service profile for the emoji service and do the same thing for the voting service, again just using that protobuf file. Once those are done, I’m about ready, but there’s a third service that I care about here, which is the web frontend. Now the web frontend is just a REST service, so I don’t have a gRPC proto file to work with, and I also don’t have a Swagger file, so I need to take a look at the traffic itself. I can either write my own service profile, based on what I know of the application, or I can use Linkerd’s tap functionality to watch the traffic that’s coming to this web service and decide how to build a service profile based on what we see. So, for the next 10 seconds, we’re going to watch the web service, see what comes in, and auto-generate a service profile.

Debugging even faster with service profiles

We have our new web service profile created and will take a look at how we can debug this even faster if we were using service profiles. Looking at the emojivoto namespace, I can click on web, get my route metrics, and see that, as we’re coming in, API list is staying at 100%. Over time, we’re going to see some of those failed transactions come in and this API vote service is going to start to degrade. If I’d left this running for hours, it would be a lot more obvious what the failure rate actually is. This gives us an indication that we’ve got a problem with voting. We look at our voting service, and instead of waiting for a vote doughnut call to come in — which it did right away, but you saw it doesn’t necessarily do that — we can instead look at our route metrics and filter on the success rate.

Imagine the situation: I’ve been paged because there’s an issue, I come in some amount of time later, and I can immediately look at the paths inside my API and see what is either responding slowly or starting to see errors. I can then get the right ticket to the right team so we can get this thing fixed. That is emojivoto, and that’s as far as we’re gonna go because we have some fundamental problem with our application.

Troubleshooting a REST-based application

Let’s change scenarios a bit. Let’s talk not about a gRPC application, but a REST-based application instead. We’ve got our Buoyant booksapp. Books is a bit more complicated. We have a traffic generator — just like we did with emojivoto — and three services but webapp is talking to both authors and books, and books and authors are talking to each other, so there’s some additional dependency. My failures with books are more intermittent so it’s a little bit harder to get a sense of where my problem is.

Luckily, I’ve already had the profile set up, so we’re going to use these service profiles because these are all REST APIs with swagger files. We’re going to use those profiles to get a sense of where our problem is, and we’re actually going to do a little bit to resolve it before our devs have to get involved.

Let’s take a look at the routes for the webapp service because that’s our entry point to this app. We can see a bunch of the calls are seeing a 100 % success rate, but POST /books and POST /books/{id}/edit are problematic. Let’s see if we can’t get this a bit more triangulated. I’m going to see how webapp is doing in its conversations with both authors and books. When I look from webapp over to authors to see how the routes look, it looks like every single call from webapp over to authors is succeeding 100 % of the time. So we’re pretty good on that route but still have problems with the app, so clearly it’s a problem with webapp talking to books.

Zeroing in on the booksapp issue

We look from webapp to books and see that POST /books.json and PUT on a particular /books/{id}.json are failing somewhere around 50 % of the time, so we’re already getting a lot closer to our root cause. Our goal here is to drive down the mean time to detection of the problem so that we can get this resolved as quickly as possible.

Now that we’ve seen this, and since we also have that dependency between books and authors, let’s go in and check what books is talking to authors about and see how those requests are doing. All books is doing is a HEAD request on authors by a particular ID, and we’re seeing that it is only about 50 % successful, so we clearly have an issue. That was quick to diagnose with the routes in place. You’ll see it’s a little bit harder when we look at the authors service directly.

We look at the live calls for the authors service and can see that, sometimes, we’re seeing failures, but it’s not clearly aggregated. I’ve got particular author IDs that are seeing failures and some that are succeeding. But when I view it from a route perspective, where Linkerd’s been aggregating the traffic based on the API calls defined in the Swagger doc, I see that the HEAD request to any author ID is failing about 50 % of the time, so I’ve got the problem identified and that’s good. I can go ahead and alert the authors team that this HEAD request is failing half the time and that they need to look at the code that’s responsible for that response.
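The difference between the per-ID view and the per-route view can be sketched in a few lines of Python. This is only an illustration of the aggregation idea, not Linkerd’s implementation: the `ROUTES` table, the `route_for` and `success_rate_by_route` helpers, the `GET /authors.json` route, and the sample traffic are assumptions made up for this example. Only the `HEAD /authors/{id}.json` route name comes from the talk.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical route table: (route name, HTTP method, path pattern).
# A pattern like this collapses every concrete author ID into one route.
ROUTES = [
    ("HEAD /authors/{id}.json", "HEAD", re.compile(r"/authors/[^/]*\.json")),
    ("GET /authors.json", "GET", re.compile(r"/authors\.json")),
]


def route_for(method: str, path: str) -> str:
    """Return the first route whose method and path pattern match."""
    for name, route_method, pattern in ROUTES:
        if method == route_method and pattern.fullmatch(path):
            return name
    return "[DEFAULT]"  # bucket for traffic that matches no declared route


def success_rate_by_route(requests: Iterable[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Aggregate (method, path, ok) observations into a success rate per route."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for method, path, ok in requests:
        route = route_for(method, path)
        totals[route] += 1
        successes[route] += int(ok)
    return {route: successes[route] / totals[route] for route in totals}


# Made-up traffic: each author ID is seen once, so no single path looks bad.
observed = [
    ("HEAD", "/authors/1.json", True),
    ("HEAD", "/authors/2.json", False),
    ("HEAD", "/authors/3.json", True),
    ("HEAD", "/authors/4.json", False),
    ("GET", "/authors.json", True),
]
rates = success_rate_by_route(observed)
print(rates)
```

Looked at path by path, each author ID has only one data point, but grouped by route the 50 % failure rate on the HEAD request stands out immediately.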

Because it’s failing about half the time and succeeding about half the time, I can actually use my mesh to solve some of this problem. We’re going to look at that service profile for the author service, and change it so we fix the problem tonight and they’re able to take time and deal with it in the morning.

Letting the mesh take responsibility

Our service profile has various routes and we’re going to go into this HEAD request we saw before, so HEAD /authors/{id}.json is failing. We know that it’s safe to retry this call, so we’re going to add a field that says isRetryable: true. We’re telling the mesh, “hey proxy, when you see these calls, go ahead and just retry them.” No app logic has changed and no code change has been made. Instead, I’m letting the mesh take responsibility for trying to solve some of this problem. Now, we look at the routes from books to authors and can see that, while the actual success rate is staying right around 50 %, the effective success rate is going to steadily climb. It’s going to climb all the way up to 100 % because, through those retries, the calls are going to succeed eventually. We’re also going to see our latency go up, but as long as it stays within reason, that’s fine: it’s overnight, we’ll page them first thing in the morning, and they can respond to it in a timely fashion.
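A little arithmetic shows why the effective success rate climbs while the actual one stays near 50 %, and why latency rises with it. The Python sketch below assumes each attempt fails independently with the same probability, which is a simplification, and it ignores the retry budget that Linkerd uses to cap how many retries are issued. The function names and the retry counts are made up for illustration.

```python
def effective_success_rate(actual: float, max_retries: int) -> float:
    """Chance that at least one of (1 + max_retries) attempts succeeds.

    Assumes independent attempts that each succeed with probability `actual`.
    """
    return 1 - (1 - actual) ** (max_retries + 1)


def expected_attempts(actual: float, max_retries: int) -> float:
    """Average number of attempts sent per original request.

    Attempt k + 1 is only sent if the first k attempts all failed.
    """
    return sum((1 - actual) ** k for k in range(max_retries + 1))


# With a 50 % actual success rate, as seen on HEAD /authors/{id}.json:
for retries in (0, 1, 3):
    print(
        retries,
        effective_success_rate(0.5, retries),
        expected_attempts(0.5, retries),
    )
```

Under these assumptions, one retry lifts a 50 % route to 75 % and three retries lift it to about 94 %, at the cost of nearly twice as many calls to the authors service per request. That extra work is where the added latency comes from.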

We’ll break out of this — and again, now I can look at the routes from the webapp to books where we originally saw the issue — and see what this looks like. Now our success rate has gone up to 100 % and we’re feeling good about that result.

Let’s pop back into the slides…

That’s the end of my talk. Thank you so much for staying and listening. Like I said earlier, you can find me on Twitter @rjasonmorgan. If for some reason you want to see my GitHub contributions, you can find me @jasonmorgan and I’d love to hear from you over on Linkerd Slack @jmo. Thanks so much and have a great day, goodbye.