In 2020, the software engineering team at Penn State was tasked with building a system for scheduling COVID testing of students, faculty, and staff for their arrival back on campus in the fall. Linkerd was installed in this HIPAA-compliant testing and scheduling system because of its reputation for security and observability.
The engineering team sent out over 68,000 invites to those returning to campus to log into the system and schedule COVID tests for the spring semester. In this KubeCon talk, members of that team discuss how they used Linkerd to secure, monitor, and troubleshoot this critical system.
(Note: this transcript has been automatically generated with light editing. It may contain errors! When in doubt, please watch the original talk!)
Okay, let’s go ahead and get started. This is a slightly interesting title for this particular presentation. We didn’t really use Linkerd to schedule the tests, but Linkerd greatly facilitated our ability to schedule 68,000 COVID tests in a very short period of time. We’re gonna talk you through how it helped us troubleshoot some problems and get us over some humps.
Introductions first. Dom, do you want to say hi.
Sure, I’m Dom DePasquale. I’m the DevOps architect at Penn State University in the department of software engineering. I do all things Kubernetes, pipelines, and all that fun stuff.
I’m Shaw Smith, the director of software engineering, and we build software for Penn State University.
A little bit of background. Last March, like everybody else, we were all affected by what happened with the COVID outbreak and, since we work in higher education, that meant that we had to find a way to send all of our students home very quickly while trying to keep them engaged and coming up with a plan to bring them back safely in the fall.
We had a bunch of vendors for testing, we had on-site testing, and no way to tie all these things together. So we quickly built a system to pull all the pieces together, going all the way from test results to contact tracing. Fortunately, we’re built on top of a microservice infrastructure, and Dom has done a really great job of building out a lot of our backend infrastructure with Terraform. So we were able to turn things around very quickly.
We changed directions in the spring semester of 2021. It was decided by the university that all the students would have to be tested 72 hours before they could come back and then again within 10 days of return. For those of you who aren’t familiar with Penn State, we’re a pretty large institution. For those returning to campus that equated to about 68,000 scheduled tests in a very short period of time.
We have a bunch of commonwealth campuses and we have the main campus. We were sending out invitations for tests a thousand at a time.
Somebody, who Dom was kind enough not to mention, wanted to see how quickly we could push the system and see what it could take. And again, this is something we didn’t really plan for large-scale load; we had to build it very quickly. We weren’t entirely sure how it was going to work. We had half of the infrastructure on-premises and half in the cloud. We had to figure out what the heck was going to happen when we did this.
Why Linkerd? We had tried other service meshes previously and found some challenges in the configuration; we struggled with some of the tooling we wanted to use. Then, I was at KubeCon a couple of years back and went to a presentation on Linkerd, saw how easily it installed and how smoothly things went, and immediately texted Dom. The person mentioned in the previous slide was also in the presentation and also texted Dom.
What we recognized is that, with Linkerd, we get a lot of capabilities with far less complexity. Mutual TLS is great; I mean, who doesn’t like security? Free retries are great because we got tired of building that into our code. But the real bang we got from Linkerd is the observability. Observability gave us the opportunity to go in and visualize and see things at a whole new level.
Okay, let’s get started with the demo. On my laptop, I have two minikube clusters running, an east cluster and a west cluster, and I’ll show what that means in the next slide. Linkerd 2.8 is running since, at the time of the event, we had Linkerd 2.8 and I didn’t want to change anything.
The load test will be run via k6 and I’ll be quickly stepping up to 200 virtual users to really drive load on my laptop. We’re just doing simple GETs against this endpoint. I’ll mention quickly that, since we’re doing both things on my laptop, running two clusters and the load test tool, there will be some resource contention, and there’s a good chance that some of the performance issues we see aren’t necessarily caused by latent services; it’s just a system resource contention problem. But the way the east and west clusters are laid out on my laptop is similar to what our environment was during the real production outage, well, partial outage.
East represents what we had running in AWS and west represents what is running on-prem at Penn State. The unhappy stick figure here is the user, with their browser launched from the invite, taking them to the scheduling system. The scheduling app in the browser was calling the top-level backend service, demo service X. Demo service X depends on the on-prem demo services A, B, and C.
So X depends on A, B, and C. You’ll also notice that A and B depend on C, and then C depends on two simple httpbin services and demo service D. Now demo service C is actually our RBAC service, which is why everything depends on it.
What demo service C depends on, in what we’re simulating here, is just three random little services. But in reality, it would have been authentication and authorization databases or services that are outside of our software engineering department’s control.
We’re going to jump over to the terminal so I can show you how this is all set up. First, nothing magic, just a quick script to start the minikube clusters; it starts an east and a west cluster. I have port ranges specified so we don’t overlap ports on my laptop, and we just do a basic install of Linkerd on both the east and west clusters.
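That script isn’t shown in detail, but a rough sketch of what it might do could look like this; the profile names, node-port ranges, and exact flags are assumptions, not the actual script:

```bash
# Hypothetical sketch of the cluster bootstrap script: start two minikube
# profiles with non-overlapping NodePort ranges, then do a default Linkerd
# install on each (no custom configuration).
minikube start -p west --extra-config=apiserver.service-node-port-range=30000-31000
minikube start -p east --extra-config=apiserver.service-node-port-range=31500-32500

for ctx in west east; do
  linkerd install --context "$ctx" | kubectl --context "$ctx" apply -f -
  linkerd check --context "$ctx"
done
```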
We can see in my current context, which is the west cluster, we have Linkerd installed…Linkerd installed and running on east…and we have the applications that were in the diagram: a simple service definition, a deployment definition, and the application is dependent, via configuration and environment variables, on three services running in the west cluster.
The address ending in .4 is the IP for the west cluster’s ingress on my laptop; the one ending in .3 is the IP for the east cluster. I won’t spend too much time looking at all of the definitions in the west cluster since there are a lot. But demo service C, which is on the diagram, the RBAC simulator, depends on D and the two simple httpbin services.
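As an illustration of that cross-cluster wiring, a hedged sketch might look like this; the environment variable names, URL paths, and the ingress placeholder are all made up:

```bash
# Hypothetical: point demo service X (east cluster) at the three west-cluster
# services through the west cluster's ingress. Names and paths are placeholders.
WEST_INGRESS=http://<west-ingress-ip>
kubectl --context east set env deploy/demo-service-x \
  SERVICE_A_URL="$WEST_INGRESS/demo-service-a" \
  SERVICE_B_URL="$WEST_INGRESS/demo-service-b" \
  SERVICE_C_URL="$WEST_INGRESS/demo-service-c"
```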
Here’s that setup: demo service C is down here. The deployment definition for demo service D has one replica and an injected delay of 200 milliseconds.
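For illustration, a manifest along those lines might look roughly like this; the image, port, and delay variable name are placeholders, not the actual demo code:

```bash
# Minimal sketch of a demo-service-d deployment: one replica, Linkerd proxy
# injection, and a configurable artificial delay.
kubectl --context west apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service-d
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-service-d
  template:
    metadata:
      labels:
        app: demo-service-d
      annotations:
        linkerd.io/inject: enabled          # mesh this pod
    spec:
      containers:
      - name: demo-service-d
        image: example/demo-service:latest  # placeholder image
        ports:
        - containerPort: 8080
        env:
        - name: RESPONSE_DELAY_MS           # assumed name for the injected 200 ms delay
          value: "200"
EOF
```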
All right, I wrote another quick little script, nothing fancy again, just to make sure that I apply the right configuration to the right cluster. I just run that and it deploys those pods as needed. So, in the west cluster, we have all the components…in east, we have the components. I’m going to switch tabs one more time. A quick test here, just to make sure it’s still running. Yes, so when I call the .3 address, I’m calling demo service X, which then makes calls to these three services, and demo service C calls these three services. That’s the way the traffic flows.
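That deploy script probably boils down to something like the following; the directory layout and the smoke test are assumptions:

```bash
# Hypothetical shape of the deploy script: apply each set of manifests to the
# cluster it belongs to, then smoke-test the top-level service.
kubectl --context west apply -f manifests/west/   # demo services A, B, C, D + httpbin
kubectl --context east apply -f manifests/east/   # demo service X

kubectl --context west get pods
kubectl --context east get pods

# One request through the east ingress should fan out to A, B, and C on west.
curl -s http://<east-ingress-ip>/
```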
The load generator script, like I mentioned earlier, is just going to ramp up quickly to 200 virtual users calling demo service X. It will do that right now.
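The script itself isn’t shown on screen, but a rough k6 equivalent might look like this; the stage timings, file name, and target URL are assumptions:

```bash
# Rough sketch of a k6 load test that steps up to 200 virtual users doing
# simple GETs against demo service X.
cat > loadtest.js <<'EOF'
import http from 'k6/http';

export const options = {
  stages: [
    { duration: '1m', target: 200 },  // ramp up to 200 virtual users
    { duration: '3m', target: 200 },  // hold
    { duration: '30s', target: 0 },   // ramp back down
  ],
};

export default function () {
  http.get(__ENV.TARGET);             // simple GET against the endpoint
}
EOF

k6 run -e TARGET=http://<east-ingress-ip>/ loadtest.js
```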
While that’s launching, in another terminal over here I’m going to start my monitoring. I’m just doing quick, simple port forwards to each cluster for the Linkerd dashboard. Then, in this browser here, I’ll open the localhost address. I’ll increase the font size because I don’t want it to be too small.
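One way to set up those forwards, using the local ports from the demo (the exact commands here are an assumption):

```bash
# Expose each cluster's Linkerd dashboard on a separate local port.
linkerd dashboard --context west --port 8081 &
linkerd dashboard --context east --port 8082 &
```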
Localhost 8081. We’ll see here, in our deployments, we have on the west cluster demo services A, B, D, and C, and in our east cluster, demo service X. Already we’re seeing p95 latencies of 28 seconds. So it’s going bad already, and this is what it was like once we sent out a thousand invites and, all of a sudden, those thousand people or so decided to click on the link to start scheduling their test.
What I would like to do now is show you what we used to see what was going on. We were deep in the Grafana dashboards that day, watching the performance of everything. So let me just launch these two dashboards. Here we have demo service X, and we can see now we don’t have a very high request-per-second rate, but our latency is just terrible over here. The success rate panel seems to be okay. In reality, when we had our problems, the success rate was not 100% the whole way across, and our latency was terrible. We had the worst of both worlds as far as failures go.
What we wanted to share here is that we could see, in demo service X, that the outbound traffic had high latency, so that was something that would help us troubleshoot later. Okay, so we have this terrible outbound latency, but there’s nothing down here in this dashboard telling us that we’re connected to anything. That’s because we have dependencies in another cluster.
This is the other cluster, where we just connect via a simple ingress. It’s not a Linkerd multi-cluster setup and, even then, a few of us in the Linkerd community were chatting: currently, or last time I checked, there wasn’t a way to aggregate metrics between multiple clusters’ Prometheus instances to then render these other outbound deployment dependencies. In other words, to have (back to my diagram here) the Linkerd metrics for services X, A, B, and C, in this separate cluster, all in one dashboard. That’s something we would love to see in the future, and maybe we’ll try to figure it out another day.
This is what gave us our first indication that it must be happening on-prem. So, whatever we’re talking to in our other cluster must be the problem. We jumped into the separate Linkerd dashboard and separate Grafana dashboards to dig in. We started poking around and looking at all the different services our service X was dependent on, and we could see that these are all failing miserably. Latency is really high. And then, of course, we check service C, or demo service C, because it’s our RBAC service, so we usually check it first, and we see that it has terrible latency. Down here, we would have seen our dependent services for service C. This service call was okay and this service call was okay, but it was really this one, demo service D, that was the problem.
If we go to demo service D, we see that this thing has no outbound traffic, it’s super latent. Of course, the first thing we did was to scale that guy up. I have to restart my load test. We’re going to go to demo service D and we’re going to scale it up because, well, maybe it’s a single-threaded app and it just needs some more replicas. We’re going to apply that. This should be the west cluster. Starting up right now. And we’re going to restart that load test. It’s going to ramp up pretty hard here. We’ll watch the demo service X to see what kind of picture we get here. I’m going to change the refresh rate to 30 seconds on these.
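The scale-up itself is a one-liner; the deployment name and replica count here are guesses:

```bash
# Add replicas to demo service D on the west cluster and watch them come up.
kubectl --context west scale deploy/demo-service-d --replicas=3
kubectl --context west get pods -w
```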
Apologies for the pop-ups; it seems like you can’t actually stop everything. We’re at 54 simulated users and it’s going and going and going… So we can see here, our latencies were pegged at 50 seconds but, in reality, if we go back to the load test screen… Well, it’s not letting me scroll up right now.
We had failed requests coming from the load test tool where we were having timeouts. That’s probably exactly the type of thing that the students were feeling when they were trying to schedule their tests.
Let’s go back here and refresh one more time. We have a 20-second p90-p95. So far, it is better. We can see over here that demo service D…its latency is better, but 10 seconds is still not ideal, since everything depends on service C, and service C depends on service D, which has this latency problem.
While this was happening, we were trying to scale out components and we were staring at these graphs trying to understand what was happening. We had teammates looking at the code of service C to determine if there was any inefficient logic in the application and, it turned out, there was. It was checking demo service D with every request that came from the user, all the way through.
Well, it turns out, we didn’t need to do that check. It was…well, I won’t go into the details about why the check was there and why it’s no longer important. But the moral of the story is, it helped us understand that we had this extra code in demo service C checking D for no good reason. And demo service D ended up being a service that was single-threaded and wasn’t meant to handle this type of load. It was also out of our control. So our load test is currently scaling down, I believe. Yes, it is. Let’s see what our pictures look like from that last run. See, it still crept up to a 40-second response time. However, we didn’t have any failures this time.
You know, from a user point of view, you waited for 40 seconds and that’s totally unacceptable. However, we didn’t have any timeouts. So I’m going to make one more change. I’ll turn this back down to one replica because we don’t need it, and we modified the code to not depend on demo service D anymore.
Apply that…check the pods…and the rest. And apologies, I’ve been using aliases this whole time, so kgpo is “kubectl get pods.” Once this guy’s ready to go… it is, all right. We’re gonna run this load test one last time.
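A sketch of that last change, with an assumed image tag standing in for the modified build of demo service C:

```bash
# Scale demo service D back down and roll out the modified demo service C that
# no longer calls D. The image tag is a placeholder for the updated build.
kubectl --context west scale deploy/demo-service-d --replicas=1
kubectl --context west set image deploy/demo-service-c \
  demo-service-c=example/demo-service-c:no-d-check
kubectl --context west rollout status deploy/demo-service-c
kubectl --context west get pods   # the 'kgpo' alias mentioned above
```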
Well, I’ll tell one little story about how this went. The way this happened in real life, there weren’t breaks like there are with my load testing, right? This was just constant load, constant users, and a team of us frantically trying to figure out what was happening. Lots of stress and worrying.
If we didn’t have these pictures, those dashboards that Linkerd has pre-made for us, right? This is all out of the can; I showed a little bit ago that the installation of Linkerd was default, with no custom configuration at all.
I mean, without these metrics that the Linkerd system is giving us, we would have been in trouble for a much longer period of time. I’m sure we would have figured it out eventually by looking at metrics coming out of ingress logs or something like that. But, because we had Linkerd and what it gives us out of the box, we were able to troubleshoot fairly quickly where the bottleneck was, because the latency graph and the outbound traffic latency could tell us: hey, this is upstream or downstream, depending on how you tell your story.
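For what it’s worth, the same out-of-the-box metrics are also available from the Linkerd CLI, which is another quick way to spot a latent dependency (the namespace here is an assumption):

```bash
# Per-deployment success rate and latency percentiles, straight from Linkerd.
linkerd --context west stat deploy -n default

# Latency on the specific edge into the suspect service.
linkerd --context west stat deploy --to deploy/demo-service-d -n default
```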
We’ll see here, demo service D now has no traffic because I turned off the call to it. So, here’s demo service C again, demo service C, our RBAC service. Its latency is super low now because it’s no longer dependent on that problem service. And this is the same type of experience we had the day we got rid of that extra check. All of a sudden, everything just started zooming right along and we were able to get through the rest of our testing, our invites to schedule the testing, in a reasonable amount of time.
This load test is almost done. Why don’t we just wait to see it complete? There we go. So, this is demo service X, our top-level service that the user’s browser is directly connecting to.
We can see our latency now is much lower, in a much more reasonable range. In fact, our p99 is two seconds which, well, compared to 60 seconds, is super good.
I’m looking at my load test app here in the background, seeing that we’re still scaling up to 200 users, so why don’t we let it go the whole way before we end the demo part of our presentation. Plus, it’s always fun to see what happens if you let it run long enough; maybe my laptop runs out of resources. Oh, we’re scaling down.
So, just to zoom in on this time frame: we can see that our p99 was at two seconds and the p95 was even lower. So this is much more acceptable as far as the real-time feel for the human trying to schedule their testing.
All right, with that, we’ll move on. In summary, without the visibility that Linkerd was giving us, we would have been troubleshooting that problem for hours, trying to dig down to where the real performance bottleneck was. And as I mentioned in the demo, if there were the ability to do multi-cluster performance metrics and visualization, that would have been even better and faster. Compared to what I did here in 15-20 minutes, in real life we spent a long time just figuring out where to dig, and I just zoomed through the solution in the demo.
With that, I’d like to thank everybody for watching.
Thanks, folks.