William Morgan

August 13, 2024

Linkerd

Note: this is an expanded version of the post on linkerd.io with additional details.

Today we're happy to announce Linkerd 2.16, a major step forward for Linkerd that adds a whole host of new features, including support for IPv6; an "audit mode" for Linkerd's zero trust security policies; a new implementation of retries, timeouts, and per-route metrics for HTTPRoute and GRPCRoute resources; and much more.

The 2.16 release also introduces two features to Buoyant Enterprise for Linkerd: automation for external workloads (e.g. VM applications) at scale, and a "send a flare" CLI tool to improve remote debugging and support.

See the full release notes or read on for details!

New route metrics, retries, and timeouts

The 2.16 release continues our goal of ensuring Linkerd is the truly future-proof service mesh. We expect the Gateway API to emerge as the standard for traffic configuration in the Kubernetes space, and when that happens, Linkerd users will be ready. To this end, Linkerd 2.16 now publishes metrics for Gateway API HTTPRoute and GRPCRoute resources, so you can capture granular per-route success rates, latencies, request volumes, and other metrics without changing any application code.

Linkerd 2.16 also adds retry and timeout configuration to these same Gateway API resources, bringing the feature sets for Gateway API and ServiceProfiles to parity (as promised in our February Linkerd 2.15 announcement). This configuration is backed by a new implementation that improves upon Linkerd's earlier retry and timeout logic in two key ways:

Requests that time out can now be retried; and
Retries and timeouts can now be combined with circuit breaking.

Enabling Linkerd's new retry and timeout support is as simple as adding annotations to Gateway API resources. For example:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: myapp-default-route
  namespace: myns
  annotations:
    retry.linkerd.io/http: 5xx
    retry.linkerd.io/limit: "2"
    retry.linkerd.io/timeout: 300ms
spec:
  parentRefs:
    - name: myapp
      kind: Service
      group: core
      port: 80
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: "/foo/"
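
The same style of annotation can be applied to gRPC routes. The following is a minimal sketch rather than a definitive configuration. It reuses the `myapp` Service from the example above and assumes a hypothetical `myapp.MyService/GetThing` method. It also assumes the `retry.linkerd.io/grpc` and `timeout.linkerd.io/request` annotations are available in your Linkerd version, and that the GRPCRoute `apiVersion` matches the Gateway API CRDs installed in your cluster.

```yaml
kind: GRPCRoute
apiVersion: gateway.networking.k8s.io/v1   # assumption: depends on the installed Gateway API CRDs
metadata:
  name: myapp-grpc-route                   # hypothetical name
  namespace: myns
  annotations:
    retry.linkerd.io/grpc: unavailable     # retry calls that fail with gRPC status UNAVAILABLE
    retry.linkerd.io/limit: "2"            # at most two retries per request
    retry.linkerd.io/timeout: 300ms        # per-attempt timeout; a timed-out attempt may be retried
    timeout.linkerd.io/request: 1s         # illustrative overall request timeout
spec:
  parentRefs:
    - name: myapp
      kind: Service
      group: core
      port: 80
  rules:
    - matches:
        - method:
            type: Exact
            service: myapp.MyService       # hypothetical gRPC service
            method: GetThing               # hypothetical gRPC method
```

The timeout and limit values are illustrative only; suitable settings depend on the latency profile of the service being called.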

In short, Linkerd's per-route metrics, retries, and timeouts are now provided in a principled, future-proof way: they compose with existing features such as circuit breaking, and they are configured using the Gateway API resources that we believe are the future of service mesh configuration. Learn more.

External workload support

In Linkerd 2.15 we introduced mesh expansion capabilities, allowing Linkerd users to add non-Kubernetes applications, e.g. apps running on VMs, to the mesh and take advantage of Linkerd's secure, reliable, and observable communication without changing application code.

In Buoyant Enterprise for Linkerd 2.16, we've further improved this story through automation, adding several new components designed to make managing these external workloads easier:

A "harness" executable that runs alongside the application and handles the mechanics of network configuration, registration with Linkerd, health probes, and more.
An ExternalGroup CRD that provides a principled way to manage multiple, similar external applications—for example, multiple replicas of the same service.
An auto-registration control plane component that ties these two pieces together to automate population and management of external workloads through Kubernetes resources.

These components combine to make meshing external applications with Buoyant Enterprise for Linkerd significantly easier, especially at scale. Learn more.

Audit mode for security policies

Linkerd's "zero trust" authorization policies provide a powerful and expressive mechanism for controlling which network traffic is allowed. They support a wide variety of approaches to network security, including micro-segmentation and "deny by default" policies. In contrast to ambient or host-proxy approaches, Linkerd's sidecar design provides a clear security boundary that fits directly into the zero trust model, where each pod makes its own authorization decisions independently, maintains its (and only its) TLS keys, and makes policy decisions based on cryptographic workload identity, not IP addresses.

However, introducing authorization policy in a live system can be tricky. To address this, Linkerd 2.16 adds a new audit mode for policies. In this mode, policy violations are logged but not enforced. This allows policies to be rolled out in a lower-risk fashion: they can start in audit mode and move to enforcement only once fully vetted. Audit mode can be enabled cluster-wide, per-namespace, or on specific Server resources by setting the new `accessPolicy` field to `audit` instead of its default, `deny`.

For example:

apiVersion: policy.linkerd.io/v1beta3
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
spec:
  accessPolicy: audit
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: HTTP/1
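
Audit mode can also be applied at a broader scope than a single Server. As a sketch, assuming the namespace-level default policy is set through the `config.linkerd.io/default-inbound-policy` annotation and that this annotation accepts `audit` as a value in 2.16, a namespace-wide equivalent might look like this:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: emojivoto
  annotations:
    # Assumed annotation and value: workloads in this namespace default to
    # audit mode, so unauthorized traffic is logged rather than denied.
    config.linkerd.io/default-inbound-policy: audit
```

Default policies of this kind are typically read when the proxy is injected, so existing pods may need to be restarted before the change takes effect.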

Similarly, the `linkerd policy generate` command in Buoyant Enterprise for Linkerd, which watches live traffic to a system and generates policy scaffolding that accounts for observed traffic, has been updated to use audit mode by default. Learn more.

IPv6

Linkerd 2.16 adds support for IPv6 on IPv6-only and dual-stack clusters. (When enabled on dual-stack clusters, Linkerd will only use IPv6 endpoints.) For backwards compatibility, this feature is disabled by default, but enabling it requires setting only a single boolean flag. Learn more.
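
As an illustration, assuming a Helm-based install in which this flag is exposed as a `disableIPv6` value (the value name here is an assumption; check the chart's documented values for your version), enabling IPv6 might be a one-line change in a values file:

```yaml
# values.yaml (hypothetical file name) passed to the Linkerd control plane chart
disableIPv6: false   # assumed value name; IPv6 support stays off unless this is set to false
```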

Other noteworthy changes

Linkerd 2.16 adds HTTP/2 keep-alive messages by default for all meshed communication. This helps Linkerd proactively detect connections that have been lost by the operating system or underlying network.
All Linkerd CLI commands that output Kubernetes resources now support JSON output.
To mitigate CVE-2024-40632, in which a meshed application that is already vulnerable to an SSRF attack may also leave the proxy open to shutdown, the /shutdown endpoint is now disabled by default unless explicitly enabled.
To prevent accidentally logging sensitive information, HTTP headers are no longer logged in debug or trace output by default, unless explicitly enabled.
To remove unnecessary configuration, resource requests for proxy-init now simply use those of the proxy.

See the full Linkerd 2.16 changelog for more.

Linkerd continues to outperform Istio and Cilium

In May, cloud consultancy LiveWyer published a set of service mesh benchmarks showing that Linkerd delivered lower latency and consumed fewer resources than either Istio or Cilium. This has been the consistent result of service mesh benchmarks since 2021, and we were happy to see this confirmed by another third party.

What's next for Linkerd?

Momentum compounds, and right now Linkerd's momentum is at an all-time high. We've shipped six stable point releases for Buoyant Enterprise for Linkerd 2.15 and 29 hotpatch releases, each with extensive documentation and release notes, all designed to give our customers safe and stable upgrades along with the latest bugfixes and CVE remediations. On the open source side, we've merged 250+ pull requests in Linkerd and published an average of 5 edge releases a month, or more than one a week. There is a lot more great news to report, and early next month we'll publish a deeper retrospective of the past six months. In short, it's an incredibly exciting time to be involved with Linkerd!

We're hard at work on egress functionality, which will provide both visibility into all traffic leaving the cluster and the authorization policies necessary to control it. Our original plan was to deliver this feature in Linkerd 2.16, but we ultimately decided some of the features we had already shipped were too good to delay any longer. Egress is now slated for the upcoming Linkerd 2.17 release, which should follow relatively quickly after 2.16. After egress we have our sights on ingress, plus a couple of other exciting multi-cluster features to make managing clusters at scale a lot easier.

We've discussed deprecating ServiceProfiles in the past. Based on the extensive use within the Linkerd community, we've decided to continue supporting them for the foreseeable future. However, the new Gateway API retry and timeout logic is a separate implementation, and that's where our active development will be focused. We expect the feature gap to grow over time, and encourage you to migrate to the new types.

Come see us at KubeCon!

Many of the maintainers will be at KubeCon NA this November in Salt Lake City, UT, where we have a great lineup of Linkerd talks and where you can meet many of your fellow Linkerd users. If you're attending the conference, please stop by the Linkerd booth in the Project Pavilion and say hi!

Get started with Linkerd today

BEL is our production-ready distribution of Linkerd. It is a complete distribution of Linkerd plus a set of additional tools, features, and testing designed for sustained, production use, including a dynamic zone-aware load balancer which can dramatically cut cloud spend without the reliability sacrifices of Topology-Aware Routing; a Kubernetes operator that automates installs and upgrades; tools for managing external workloads at scale; and much more.

BEL is the distribution of Linkerd that we run in production ourselves. It's free for anyone to use in non-production environments and free for production use at companies with fewer than 50 employees. Get started with BEL in under five minutes!

Announcing Linkerd 2.16! Metrics, retries, and timeouts for HTTP and gRPC routes; IPv6 support; policy audit mode; VM workload automation; and lots more