In his ServiceMeshCon talk, Alex Jones shares how his former cloud engineering team at a multinational financial services corporation used Linkerd to introduce rapid experimentation. With Linkerd, they were able to create more resilient services and comparatively test changes. Alex demos how they did it and shares lessons learned.
Note: this transcript has been automatically generated with light editing. It may contain errors! When in doubt, please watch the original talk!
Hello, ServiceMeshCon Europe 2021! Welcome to my talk about rapid experimentation simplified with Linkerd. My name is Alex Jones and I’m the Principal Engineer at Civo. Civo is a cloud computing company focusing on k3s Kubernetes and really driving home developer experience as a first-class citizen.
In my former life, I’ve worked at Microsoft, BSkyB, JPMorgan, American Express to name a few. The financial services industry is what I’m going to focus on today because they have a lot of problems that are really compounded by the fact that delivering versions of application software is very slow. On top of that, comparing the changes is very difficult and that’s why the talk of experimentation is so pertinent right now.
A lot of businesses are going through a transformational process where they’re trying to enable engineers to have the tools they require to be able to perform these low-cost experiments to determine whether the feature change is going to cause an application impact.
Here’s the agenda for today. First, we’re going to talk about why there is a need for experimentation, why firms invest in tooling, and why things like Linkerd are becoming extremely exciting as very low-bar-to-entry ways of measuring those experiments and testing hypotheses.
Second, the apparatus of that experimentation: the technical side of how this is implemented, how difficult it is to use, and the kinds of things I can do. Is that A/B testing? Is that chaos testing? Is that canary testing?
And third, what is the implication of lowering the bar to entry for this experimentation?
Before we go any further, it’s really important to set the scene in terms of why there is a need for experimentation. If we think about this as a simple example, here you have versions 1.1 all the way to 1.4. Let’s say that you’re a product engineering team and you are rolling down the tracks building out these version changes. What happens here is that our SRE team starts telling us that the latency of our application is increasing over time. This is a very coarse-grained approach to understanding the infrastructure footprint impact of our application. Whether that’s compute, manifested as IOPS, or some other signal, we are making a change in the environment which is going to cause additional outcomes over time in terms of how that microservice or application interacts with other systems.
Therefore, it makes a lot of sense to measure scientifically what the delta of change is — not just in code, but also in performance. I think about this in terms of reduced signals as well. I’ll come on to describe what I mean by that in a moment.
A secondary example is, how does that application or service interact in a complex environment? If we’re changing the version of several microservices, how do we know scientifically what the change will be to a queuing mechanism? And how do we understand the knock-on implication? This also shows that there’s a real need to be able not only to inject faults, but also to understand how the environment performs when there’s service degradation.
There’s a joke about the disaster recovery plan that is nothing like what it looks like when you actually have to perform it. That’s because it’s so high cost, in many organizations, to perform a DR exercise, that many of these things are just existential and aren’t testable.
Another part, and illustrated by the previous two examples, is that A/B testing should be easy. The problem with this sort of FaaS-based example is that, between these two functions essentially performing a failover, I have to test the new optimized database table against a new function and then fail back again. This could be a code change, an optimization in the code, not necessarily the database.
But the idea is, you have to manually change the service, whether that’s through automated deployment config or some other human activity. It’s not something you run simultaneously. Equally, this becomes even more compounded when you want to run, say 20 to 30 FaaS changes with many small nuances between them.
So, A/B testing needs to be easy, and this is a really difficult problem to solve for an enterprise in a safe and scalable way. But, wait a minute, many people will tell me that in their environment they can deploy multiple versions and have no problems whatsoever. They can have 1.1, 1.2, etc. on branches, PRs, and it’s fine.
Let’s break that down. You have your microservice alpha v1 and v2. The delta change is the code, and we go through the typical routine of committing that code change, we deploy that through a pipeline, and we get a new replica set.
That’s great, but how do we see what changes there are between these? We have to do some sort of activity where we observe the prior state and then look at the new state. And so we can look at a pattern of change and determine that there’s been a regression. If there’s a regression, we have to go back to the drawing board and we have to figure out what that regression is. That could be latency, saturation, or some other penalty we have to pay. The point I’m trying to illustrate here is that the cycle is fairly long and it also means that it’s extremely arduous to do across multiple versions. Because that’s just comparing v1 to v2, we should be able to compare v1 to v3, and v3 to v2 simultaneously.
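To make the point about comparing every version against every other one concrete, here is a minimal sketch, not code from the talk. It assumes you already have one p99 latency figure per version; the numbers, the version names, and the 1.5x threshold are made up for illustration.

```python
from itertools import combinations

# Hypothetical p99 latencies in milliseconds, one per deployed version.
P99_MS = {"v1": 120.0, "v2": 135.0, "v3": 410.0}


def pairwise_regressions(p99_by_version, threshold=1.5):
    """Compare every pair of versions and report the slow ones.

    Returns (older, newer, ratio) for each pair where the newer version's
    p99 exceeds the older one's by more than `threshold` times.
    Versions are ordered by plain string sorting, which is only good
    enough for short names such as v1..v9.
    """
    findings = []
    for older, newer in combinations(sorted(p99_by_version), 2):
        ratio = p99_by_version[newer] / p99_by_version[older]
        if ratio > threshold:
            findings.append((older, newer, round(ratio, 2)))
    return findings


if __name__ == "__main__":
    for older, newer, ratio in pairwise_regressions(P99_MS):
        print(f"{newer} is {ratio}x slower than {older} at p99")
```

With three versions this checks v1 against v2, v1 against v3, and v2 against v3 in one pass, rather than one deploy-and-observe cycle per pair.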
The challenge is probably quite clear by now. It’s expensive, not only in monetary terms but in time to promote changes to a new environment, especially within financial institutions. Multi-stage, multi-dependency chain of promotion is a really big overhead to have to bear to test a micro version, like a small bet.
The observability of these small bets has to be targeted. A lot of the labeling systems that you get out of the box or in these organizations aren’t dynamic enough to be able to determine these micro-changes. Whether that’s a suffix on a version or a SHA on an image, we need a way of being able to pin the difference between changes and measure those over time.
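One way to pin a micro-change is to derive a label from the image reference itself. The sketch below is illustrative only and not from the talk: the function name version_label and the example image names are hypothetical, and the parsing is deliberately simple rather than a full image-reference parser.

```python
def version_label(image_ref):
    """Derive a short label from a container image reference.

    Prefers the digest (name@sha256:...) because it pins exact content,
    and falls back to the tag (name:tag), or "latest" when neither is set.
    """
    name, _, digest = image_ref.partition("@")
    if digest:
        algo, _, hex_part = digest.partition(":")
        # Shorten the digest so the label stays readable on a dashboard.
        return f"{algo}-{hex_part[:12]}"
    # Only look for a tag after the last "/", so a registry port such as
    # registry.example.com:5000 is not mistaken for a tag.
    last_segment = name.rsplit("/", 1)[-1]
    _, sep, tag = last_segment.partition(":")
    return tag if sep else "latest"


if __name__ == "__main__":
    print(version_label("registry.example.com:5000/team/alpha:1.2-exp7"))
    print(version_label("team/alpha@sha256:" + "ab" * 32))
```

A label like this could be attached to each deployment so that metrics for two near-identical builds can be told apart.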
With all that said, it’s probably clear that this is complex. This is a complex and often impractical set of ideas to try and bring across to an organization where they don’t know where to start. There might be an application team that is delivering 20 different microservices, and each one of those microservices might have a bunch of branches with a bunch of changes in each branch. How do you determine which branch is introducing a regression in terms of infrastructure performance?
Let’s take a step back and think about distilling those requirements. Here’s your typical infrastructure architecture. You’ve got an application that has a microservice and a queue, and it might create something in a database. It stands to reason that we should be able to test in real-time an alternative vision to this architecture. In this case, it’s calling the database directly.
So, this is fairly well known and well-trodden, this kind of path. But we find it difficult to do because it’s hard to be able to tell the API gateway to send data to both of these without a code change in the gateway. And again, that is introducing more change, more unknowns to measure. Equally, we should be able to understand what happens if this service starts faulting without actually having to codify a fault into the service. Therefore, there needs to be a way, thinking back to our disaster recovery illustration, to bring failures and chaos into the system and build more resilient systems.
Lastly, observing the difference across generations is paramount to this succeeding. We can do all this stuff, but if we can’t observe it in a way that isn’t super coarse-grained, then it’s pointless. How do we bring rate, errors, duration, utilization, and saturation signals all to the table and say that, between version one and three, there’s a massive regression? These are things that we need to be able to understand, and understand how to measure.
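As a rough illustration of bringing rate, errors, and duration together per version, here is a small sketch that is not from the talk. It assumes request records arrive as (version, status, latency_ms) tuples, treats status codes of 500 and above as errors, and uses the nearest-rank method for percentiles; all of these are assumptions, and in practice the mesh and Prometheus do this aggregation for you.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds, error_from=500):
    """Summarise rate, error ratio, and latency percentiles per version.

    `requests` is an iterable of (version, status, latency_ms) tuples
    observed during a window of `window_seconds` seconds.
    """
    by_version = defaultdict(list)
    for version, status, latency_ms in requests:
        by_version[version].append((status, latency_ms))

    summary = {}
    for version, records in by_version.items():
        latencies = [latency for _, latency in records]
        errors = sum(1 for status, _ in records if status >= error_from)
        summary[version] = {
            "rate_rps": len(records) / window_seconds,
            "error_ratio": errors / len(records),
            "p50_ms": percentile(latencies, 50),
            "p99_ms": percentile(latencies, 99),
        }
    return summary
```

Putting two versions' summaries side by side is what turns "it feels slower" into a measurable regression.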
That brings me to the apparatus of this experimentation. After looking at a lot of different solutions, what we settled on, time and time again, was Linkerd. The two key tenets of this are traffic splitting and observability, both of which are underpinned by a super easy-to-use DX that has saved us a ton of effort by just working out of the box. When we think about how these things work, it reminds me that there’s a lot of effort that’s been put in at the SMI spec level from the CNCF SIGs, who care about the future of these kinds of implementations and how end-users are going to work with them. That’s very much appreciated because, when we look at the custom resource definition for how a traffic split should work, it’s super easy to understand that in this example 90% of traffic is sent to v1 versus 10% to v2. Equally, the v1alpha4 version of traffic splitting is taking us in a direction where we can start to perform front-end application testing inside of the mesh. That’s exciting because we can start to define headers that we care about and, in this example, it’s a user agent of Firefox. It’s a super exciting future for enabling A/B testing within Linkerd and other SMI implementations!
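The 90/10 split described above maps onto an SMI TrafficSplit resource. The sketch below, which is not taken from the talk, builds one as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The resource and service names (alpha-split, alpha, alpha-v1, alpha-v2), the demo namespace, and the choice of the v1alpha1 API version with milli-unit weights are assumptions for illustration; check which TrafficSplit version the CRD on your cluster serves.

```python
import json


def traffic_split(name, namespace, apex, weights):
    """Build an SMI TrafficSplit manifest as a plain dict.

    `apex` is the service that clients call; `weights` maps each backend
    service name to a weight in milli-units (900 becomes "900m").
    """
    return {
        "apiVersion": "split.smi-spec.io/v1alpha1",  # assumed API version
        "kind": "TrafficSplit",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "service": apex,
            "backends": [
                {"service": backend, "weight": f"{milli}m"}
                for backend, milli in weights.items()
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names: 90% of traffic to alpha-v1, 10% to alpha-v2.
    manifest = traffic_split(
        "alpha-split", "demo", "alpha", {"alpha-v1": 900, "alpha-v2": 100}
    )
    print(json.dumps(manifest, indent=2))
```

Assuming those services exist, piping the printed JSON to kubectl apply -f - would create the split; shifting the weights is then a one-line change rather than a redeploy.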
When we think about another big feature of Linkerd and this idea of traffic splitting, it’s about visualizing that data and about the developer experience. I mentioned two or three times already that the developer experience is super important. That’s because, when you have five or six hundred teams and multiply that by the number of developers on those teams, all trying to work with the mesh, their level of experience is going to vary vastly. When you have super crisp dashboarding and visualizations of what’s going on, then it makes everyone’s lives easier.
This is great because you can double click into this. If you are an engineer that wants to understand what’s going on with the response codes, what’s going on with the internal host headers, you can do that within the mesh. Equally, for SREs, there’s that super deeply ingrained Prometheus and Grafana installation that lets you be a bit more scientific over time. It’s one thing just to deploy a service and say it has introduced some latency, it’s another thing to start pushing a load into that service and looking at how it performs compared to its prior generations.
Let’s look at a small demo. I’ve got this Linkerd demo repository here, I’ve got a client that calls a version, and there is version one, two, or three. What this client does is, it creates a user. The user will then sit in memory. The difference between one, two, and three is pretty small but I’ll show you. In version two, I’ve changed the Swagger spec to represent some code change that an engineer might make.
In this case, it’s adding a required food preference field. My client is super dumb and all it’s doing is hitting the service with the default user fields. What’s going to happen is, I’m going to get a 422 because the service is going to say it can’t process this request, because the food field is missing. Equally, in the OpenAPI v3 service, I’ve introduced some latency. I’ve just put some sleep calls all over the code to emulate what would happen if there was a real sort of regression introduced into that service. What brings this all together is the traffic split. The default behavior of the OpenAPI client is to hit v1, but what we’re saying now is, actually, I want to balance equally between v2 and v3. Let’s go ahead and deploy that.
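The behavior of that demo can be approximated in a few lines. This is a simulation sketch, not the demo repository's code: the handler names, the payload, the latency figures, and the backend names are hypothetical, and in the real setup the weighted selection is done by the Linkerd proxy rather than by application code.

```python
import random
from collections import Counter

DEFAULT_USER = {"name": "alex"}  # hypothetical payload without the new field


def handle_v2(payload):
    """v2 requires a food preference, so the old payload gets a 422."""
    status = 201 if "food_preference" in payload else 422
    return status, 5.0  # (status code, simulated latency in ms)


def handle_v3(payload):
    """v3 accepts the payload but is slow; latency is returned, not slept."""
    return 201, 2000.0


def pick_backend(weights, rng):
    """Pick one backend name in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(n_requests, weights, handlers, seed=0):
    """Send n_requests through the split; return outcome counts and the
    slowest latency seen per backend."""
    rng = random.Random(seed)
    outcomes = Counter()
    slowest = {}
    for _ in range(n_requests):
        backend = pick_backend(weights, rng)
        status, latency_ms = handlers[backend](DEFAULT_USER)
        outcomes[(backend, status)] += 1
        slowest[backend] = max(slowest.get(backend, 0.0), latency_ms)
    return outcomes, slowest


if __name__ == "__main__":
    weights = {"openapi-v2": 500, "openapi-v3": 500}  # equal split
    handlers = {"openapi-v2": handle_v2, "openapi-v3": handle_v3}
    outcomes, slowest = simulate(1000, weights, handlers)
    print(dict(outcomes))
    print(slowest)
```

Roughly half the requests come back as 422 from v2 and the other half succeed slowly on v3, which is the pattern the dashboard surfaces per route.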
I want to visualize this, naturally. Let’s run linkerd viz dashboard and, if we go here, we can see the default behavior as expected: the OpenAPI client is hitting the v1 backend. We can see, though, that the traffic splits have come online. We have some prior data for the existing service and, what will soon happen is, as the live data starts to come through from the new routes, we’ll see these fields start to get populated. You can see one has come up right there.
What’s really exciting and useful is that it just works; there’s no additional configuration. It’s just what you saw: I applied that CRD and away we go. Now, what I’m interested in seeing is whether there is going to be more latency on this route because, as we saw, I’ve added in all sorts of sleeps, and user creation, as it gets round-robined between these, should start to get slower on this route.
We can see almost 10 seconds at the p99 on this route coming through. In addition to this, I could ask what the history of this v3 API is. And again, clicking through to Grafana is great! I can look over the past couple of minutes and see whether this is a known behavioral change in this service or whether this is something that’s absolutely fine.
I love the idea that you can combine this with existing dashboards. If I’m working in a compute-constrained environment and I’m deploying a hundred different micro bets, I can add a default dashboard focused purely on compute. When I have all these different services and — if I refresh now, I should see them — I can tell which one is going to be the most performant change.
These three could represent three different hashing algorithms. It could be anything that you want to test in a very low-bar-to-entry way.
Let me come back to thinking about, what has this done in terms of lowering the bar? To test a hypothesis, we’re essentially creating an experiment factory. We’re creating the ability for developers — on their local machine, in their lower environments, wherever they might be — to put a load of different small changes out there and test the ones that work the best. And when you start cascading this to, not just a single microservice, but multiple microservices, then it gets really exciting. Because, at that point, you can start combining it with chaos testing.
It’s almost like an evolutionary Darwinism of microservices, where you’re starting to see which one survives the best. In this example, let’s say alpha v1 and v2 have different routes. What we can see is that the implication of those routes can be quite significant in terms of which is more resilient. We might find that actually alpha v2, if we’re knocking it down, impacts the beta v1 service in a way we didn’t know was possible.
That is the power of traffic splitting. With the observability that’s afforded to you by Linkerd, you can do A/B testing, chaos testing, and canary releases as well. We’ve found time and time again that developers find it so easy to use, they’re starting to chain this stuff in ways we didn’t even think were possible. It’s affording a lot more resilience at the product engineering level. This means that the candidate releases that go forward to production environments are innately more stable because there’s been a consideration of the infrastructure signals they’re measuring against.
Obviously, there might be the argument that deploying to a lower environment isn’t completely representative of a higher environment. Well, that’s where the adventure gets really exciting because you can start dipping into those higher environments, especially if you federate your mesh across clusters. There is a strategy to scale this out to however wide your risk appetite is.
Emboldening engineers is the crux here. I’ve said throughout this talk that it’s about trying to perform experiments. But ultimately those experiments are about tiptoeing the path that helps to keep our service stability the highest. Rather than having SREs coming to the product engineering team after the fact to say they’ve got a degradation, it’s about building that cultural change in, so that people care about making those resilient services.
The behavior that I expect will be driven from this is that SREs and product engineers start to work at the same kind of level. They’re starting to bring in QA to build these traffic splitting methodologies so that they can start testing these microservices and think about how to break them. That’s the hat you need to wear: how can I break this microservice architecture? Because ultimately, the best genetic variation will survive for the longest.
I feel like there’s a super bright future for this sort of stuff and I hope that you’ve enjoyed this talk. There are so many questions that I have yet to answer and so many things we can talk about, and if anybody wants to follow up with me offline, we can chat about all this stuff. But ultimately, watch this space because canary releasing, A/B testing, and chaos testing are all completely possible within Linkerd and are all being done today in hundreds of companies with a lot of success.