Buoyant’s Linkerd Production Runbook

A guide to running a service mesh in production

Last update: Sep 24, 2021 / Linkerd 2.10.2

Welcome to Buoyant’s Linkerd Production Runbook! We’re thrilled that you’re taking Linkerd to production. Today, organizations around the globe rely on Linkerd for their mission-critical systems—including us. You’ll be in great company.

The goal of this guide is to provide concrete, practical advice for deploying and operating Linkerd in production environments. We’ve written this guide based on our experience helping organizations large and small adopt Linkerd, as well as our experience running Linkerd ourselves.

Note: this doc covers 2.x versions of Linkerd only. It does not cover versions 1.x or earlier.

How to use this guide

This runbook is meant to be used in conjunction with the official Linkerd docs. While this guide will give you our advice for getting to (and staying in!) production, the official docs contain the majority of the information you’ll actually need to understand to be successful with Linkerd. Whenever possible, we’ll point to the relevant docs section from this guide.

This is a “living document.” Linkerd moves fast. We’ll update this guide for every release and add our release commentary to the Upgrade Notes.

Finally, please read the important Disclaimer below. We do our best to ensure this doc is accurate, but mistakes, omissions, and inaccuracies do happen. Ultimately, you are responsible for your production systems, not us. (But if you do find an error in the guide, please tell us!)

Let’s get started!

Before going to production

Before you can be ready to deploy Linkerd to production, there are some things you should do to prepare.

Join the community

If you’re serious about operating Linkerd in production, you should join the open source community channels around it. This is important for staying aware of important updates and announcements, and for learning from other users who are doing the same thing.

We recommend you join:

  • The Linkerd community Slack (slack.linkerd.io), where most day-to-day discussion and troubleshooting happens.
  • The linkerd-announce mailing list, a low-volume list for release and security announcements.

You can also join:

  • The linkerd-users mailing list, for longer-form questions and discussion over email.
  • The monthly Linkerd community meeting, where maintainers discuss current and upcoming work.

All Linkerd development happens on GitHub. That’s also the best place to submit bug reports and pull requests. Please also star the repo to inflate our vanity metric.

Understand how to get help

If you need help with Linkerd, you have a couple of channels available to you. Our recommendations are:

  • Ask in the #linkerd2 channel of the Linkerd community Slack for interactive troubleshooting help.
  • File a GitHub issue against the linkerd/linkerd2 repo for reproducible bugs and feature requests.

Of course, open source support is provided by the community on a best-effort basis. And don’t forget to help others—this is often the best way to give back!

(Note that if you are a Linkerd commercial support customer of Buoyant, you also have other, dedicated support channels available. Please consult the support onboarding instructions you received.)

Understand how to report security disclosures

In the unlikely event that you discover a security vulnerability in Linkerd, please email the private [email protected] list. We’ll send a confirmation email to acknowledge your report, and a follow-up email once we’ve either confirmed the issue or ruled it out.

To receive notifications of vulnerabilities and critical updates, please subscribe to the linkerd-announce mailing list.

Understand where to get Linkerd

Linkerd is 100% open source, and the open source project contains everything you need to run Linkerd at scale and in production. Linkerd’s code is hosted in the GitHub repo. You may choose to build your own binaries or images from this code, or simply to use one of the published releases.

Open source releases are published in a split fashion: GitHub hosts the CLI binaries, and GitHub Container Registry hosts the container images. These are the canonical binaries and images for Linkerd.

(Note: prior to version 2.9, container images were hosted on GCR rather than GHCR.)
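
For example, one way to fetch a specific CLI version and the matching images is sketched below. This is a sketch based on the official install script at run.linkerd.io (which honors the LINKERD2_VERSION environment variable) and the GHCR image naming used by current releases; double-check the exact tags against the releases page.

  # Install a pinned CLI version via the official install script.
  curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | LINKERD2_VERSION=stable-2.10.2 sh

  # Confirm what was installed.
  linkerd version --client

  # Container images for the same release are published to GHCR, e.g.:
  docker pull ghcr.io/linkerd/proxy:stable-2.10.2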

Understand Linkerd’s versioning scheme

The 2.x branch of Linkerd follows two versioning schemes: one for stable releases and one for “edge” releases.

Stable releases

Linkerd stable releases follow a modified form of semantic versioning. Linkerd version numbers are of the form stable-2.<major>.[<minor>[.<patch>]]. Breaking changes (typically, configuration incompatibilities) and significant changes to functionality are denoted by changes in major version. Non-breaking changes and minor feature improvements are denoted by changes in minor version. Occasionally, we will release critical bugfixes to stable releases by incrementing the patch version.

For example:

  • 2.3.6 -> 2.4: major improvements, possible breaking changes
  • 2.3.6 -> 2.3.7: improvements or bugfixes, no breaking changes

Edge releases

Linkerd 2.x is also published in “edge” releases, typically on a weekly cadence. In contrast to stable releases, Linkerd edge releases follow a flat versioning scheme, of the form edge-<year>.<month>.<number>. Edge releases are provided to the community as a way of getting early access to feature work, and may introduce breaking changes at any point. Sometimes, edge releases are designated informally as a “release candidate” for the upcoming stable release; this designation also provides no guarantees about feature compatibility.

For example:

  • edge-21.4.5: the fifth edge release published in April 2021
  • edge-21.11.1: the first edge release published in November 2021

Understand feature denotations

Sometimes, Linkerd features are denoted as experimental in the documentation. This designation means that, while we feel confident in the viability of the feature, it hasn’t seen enough production use for us to recommend it unreservedly. Caution should be exercised before using an experimental feature in a production environment. The documentation for each experimental feature will describe why it has been classified this way; for example, “this feature has not been tested on all major cloud providers”.

Sometimes, Linkerd features are denoted as deprecated. This means that, while currently supported, we expect to remove the corresponding configuration in an upcoming release.

Rarely, we may denote features as not for production in the documentation. These features may be useful for debugging and getting started, but have known issues when applied to production. The documentation for each not for production feature will describe why it has been classified this way; for example, “this feature has known scaling issues above 10 services”.

Understand which environments are supported

The only requirement for Linkerd is a modern, functioning Kubernetes cluster. Regardless of whether the cluster is on-premises or in the cloud, and regardless of Kubernetes distribution or provider, if it’s running Kubernetes, generally speaking, Linkerd should work. (Of course, Linkerd does require specific capabilities and features of the Kubernetes cluster in order to function. See Preparing your environment for more on this topic.)

Understand which versions of Kubernetes are supported

Generally speaking, Linkerd follows the published policy for “supported” Kubernetes releases: effectively, the three most recent minor Kubernetes versions are supported. Of course, earlier Kubernetes versions may still work.

Going to production

With our preparations out of the way, let’s get into the details. In this section, we cover the basic recommendations for preparing Linkerd for production use.

Your preflight checklist

So you’re ready to take Linkerd into production. Great! Here are the basic steps we suggest for your production deploy:

  1. Prepare your Kubernetes environment. (See Preparing your environment below.)
  2. Configure Linkerd in a production-ready way. (See Configuring Linkerd for Production Use below.)
  3. Set up monitoring and alerting so that you’re informed if Linkerd’s behavior falls outside its normal operating range. (See Monitoring Linkerd below.)
  4. Have fun!

We’ll go over each of these in turn.

Preparing your environment

In this section, we cover how to configure your Kubernetes environment for Linkerd production use. The good news is that much of this preparation can be verified automatically.

Run the automated environment verification

Linkerd automates as much as possible. This includes verifying the pre-installation and post-installation environments. The linkerd check command will automatically validate most aspects of the cluster against Linkerd’s requirements, including operating system, Kubernetes deployment, and network requirements.

To run the automated verification, follow these steps:

  1. Validate that you have the linkerd CLI binary installed locally and that its version matches the version of Linkerd you want to install, by running linkerd version and checking the “Client version” section.
  2. Validate that the kubectl command in your terminal is configured to use the Kubernetes cluster you wish to validate.
  3. Run linkerd check --pre.
  4. Correct any issues in your environment highlighted in the output. Each failing check should contain a URL pointing to a more detailed explanation with possible solutions.
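
For reference, a minimal pre-flight session looks like this:

  # 1. Confirm the CLI version you intend to install.
  linkerd version --client

  # 2. Confirm kubectl is pointed at the cluster you intend to validate.
  kubectl config current-context

  # 3. Run the pre-installation checks; fix anything that fails and re-run.
  linkerd check --pre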

Once linkerd check --pre is passing, you can start to plan the control plane installation using the information in the following sections.

Validate certain exceptional conditions

There are some conditions that may prevent Linkerd from operating properly that linkerd check --pre cannot currently validate. The most common such conditions are those that prevent the Kubernetes API server from contacting Linkerd’s tap and proxy-injector control plane components.

Environments that can exhibit these conditions include:

  1. Private clusters on GKE or AKS
  2. EKS clusters with custom CNI plugins

If you are in one of these environments, you may wish to do a “dry run” installation first to flush out any issues. Remediations include updating internal firewall rules to allow certain ports (see the example in the Cluster Configuration documentation), or, for EKS clusters, switching to the AWS CNI plugin.
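
As an illustration, on a GKE private cluster the remediation is usually a firewall rule that allows the control plane (master) CIDR to reach the webhook ports on your nodes. The sketch below uses placeholder names, tags, and CIDRs; take the exact values, and the full list of required ports, from your own cluster and from the Cluster Configuration documentation (8443 is the proxy-injector’s webhook port).

  # Sketch only: substitute your own network, node tag, and master CIDR,
  # and confirm the full port list against the Cluster Configuration docs.
  gcloud compute firewall-rules create master-to-linkerd-webhooks \
    --network=my-network \
    --source-ranges=172.16.0.0/28 \
    --target-tags=my-gke-node-tag \
    --allow=tcp:8443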

Provide sufficient system resources for Linkerd

Production users should use the high availability, or “HA”, installation path. We’ll get into all the details about HA mode later, in Configuring Linkerd for Production Use. In this section, we’ll describe our recommendations for resource allocation to Linkerd installed in HA mode. These values represent best practices, but you may need to tune them based on the specifics of your traffic and workload.

Control plane resources

The control plane of an HA Linkerd installation requires three separate nodes for replication. As a rule of thumb, we suggest planning for 512 MB of memory and 0.5 cores of consumption on each such node.

Of all the control plane components, linkerd-destination, which handles service discovery information, is the one whose memory consumption scales with the size of the mesh. This component requires monitoring of its consumption, and adjustment of its resource requests and limits as appropriate.
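
In practice, this can be as simple as watching actual usage over time and raising the requests and limits when consumption trends toward them. A sketch (kubectl top assumes metrics-server is installed; the Helm value name mentioned in the comment should be confirmed against the values-ha.yaml shipped with your chart version):

  # Watch actual control plane consumption.
  kubectl top pod -n linkerd

  # Inspect the requests/limits currently configured on the destination pods.
  kubectl -n linkerd get deploy linkerd-destination \
    -o jsonpath='{.spec.template.spec.containers[*].resources}'

  # To raise them, adjust your Helm values (e.g. the destinationResources block
  # in values-ha.yaml) and run a helm upgrade, rather than editing in place.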

In version 2.9 and earlier of Linkerd, the core control plane contained a Prometheus instance. In Linkerd 2.10 and beyond, this instance has been moved to the linkerd-viz extension. This Prometheus instance is worth some extra attention: it stores metrics from the proxies, and thus can have widely variable resource requirements. As a rough starting point, expect 5 MB of memory per meshed pod, but this can vary widely based on traffic patterns. We suggest planning for a minimum of 512 MB for this instance, taking an empirical approach, and monitoring carefully.

(There are configurations of Linkerd that avoid running this control plane Prometheus instance. We’ll discuss these below as well.)

Data plane resources

Generally speaking, for Linkerd’s data plane proxies, resource requirements are a function of network throughput. Our conservative rule of thumb is that, for each 1,000 RPS of traffic expected to an individual proxy (i.e. to a single pod), you should ensure that you have 0.25 CPU cores and 20 MB of memory. Very high throughput pods (>5k RPS per individual pod), such as ingress controller pods, may require setting custom proxy requests and limits (e.g. via the --proxy-cpu-* configuration flags).

In practice, proxy resource consumption is affected by the nature of the traffic, including payload sizes, level of concurrency, protocol, etc. Our guidance here, again, is to take an empirical approach and monitor carefully.
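
For the high-throughput case, the proxy’s requests and limits can also be overridden per workload with the config.linkerd.io annotations on the pod template; these are the per-workload counterpart of the install-time --proxy-cpu-* and --proxy-memory-* flags. The namespace and deployment names below are hypothetical:

  # Override the proxy's resources for one high-traffic workload. The
  # annotations must land on the pod template so the injector picks them up;
  # changing the template also triggers a rolling restart.
  kubectl -n ingress patch deploy ingress-controller --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{
          "config.linkerd.io/proxy-cpu-request": "1",
          "config.linkerd.io/proxy-cpu-limit": "2",
          "config.linkerd.io/proxy-memory-request": "128Mi",
          "config.linkerd.io/proxy-memory-limit": "512Mi"}}}}}'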

Restrict NET_ADMIN permissions if necessary

By default, the act of creating pods with the Linkerd data plane proxies injected requires NET_ADMIN capabilities. This is because, at pod injection time, Linkerd uses a Kubernetes InitContainer to automatically reroute all traffic to the pod through the pod’s Linkerd data plane proxy. This rerouting is done via iptables, which requires the NET_ADMIN capability.

In some environments this is undesirable, as NET_ADMIN privileges grant access to all network traffic on the host. As an alternative, Linkerd provides a CNI plugin which allows Linkerd to run iptables commands within the CNI chain (which already has elevated privileges) rather than in InitContainers.

Using the CNI plugin adds some complication to installation, but may be required by the security requirements of your Kubernetes clusters. See the CNI documentation for more.
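
If you go this route, the plugin is installed before the control plane, and the control plane is then installed with init containers disabled. A minimal sketch using the CLI (Helm charts exist for both steps as well):

  # Install the CNI plugin DaemonSet first.
  linkerd install-cni | kubectl apply -f -

  # Then install the control plane without proxy-init containers.
  linkerd install --linkerd-cni-enabled | kubectl apply -f -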

Ensure that time is synchronized across nodes in the cluster

Clock drift is surprisingly common in cloud environments. When the nodes in a cluster disagree about the time, timestamps won’t line up across log lines, metrics, and other output. Clock skew can also break Linkerd’s ability to validate mutual TLS certificates, which can cause a critical outage of Linkerd’s ability to service requests.

Ensuring clock synchronization in networked computers is outside the scope of this document, but we will note that cloud providers often provide their own dedicated services on top of NTP to help with this problem, e.g. AWS’s Amazon Time Sync Service. A production environment should use those services. Similarly, if you are running on a private cloud, we suggest ensuring that all the servers use the same NTP source.

Ensure the kube-system namespace can function without proxy-injector

One sometimes surprising detail of Linkerd’s HA mode is that it makes the proxy-injector control plane component, which adds the proxy to newly scheduled pods, a hard requirement for scheduling any pod on the cluster. HA mode adds this restriction in order to guarantee that all application pods have access to mTLS—the creation of application pods that cannot have a proxy injected could lead to an insecure system.

However, this means that, in HA mode, if all proxy-injector instances are down, no pod anywhere on the cluster can be scheduled. To avoid a catastrophic scenario in the event of complete proxy-injector failure, you must apply the config.linkerd.io/admission-webhooks=disabled Kubernetes label to the kube-system namespace. This allows system pods to be scheduled even in the absence of a functioning proxy-injector.
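
Applying the label is a one-liner:

  # Allow kube-system pods to schedule even if the proxy-injector is down.
  kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled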

See the Linkerd HA documentation for more.

Configure your ingress for Linkerd

Adding Linkerd’s data plane proxies to ingress pods allows Linkerd to provide end-to-end mTLS and metrics. While in most respects this is identical to adding Linkerd’s data plane proxies to any other pod, there are some specific considerations for certain ingress providers.

Follow the instructions in the Ingress configuration documentation to see if there is specific configuration necessary to allow your ingress solution to be meshed with Linkerd.
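
Beyond any provider-specific annotations, meshing the ingress itself is usually just a matter of injecting its deployment like any other workload; the namespace and deployment names below are placeholders for your own ingress controller:

  # Inject the ingress controller's pods (names are illustrative).
  kubectl -n ingress-nginx get deploy ingress-nginx-controller -o yaml \
    | linkerd inject - \
    | kubectl apply -f -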

Configuring Linkerd for Production Use

Having configured our environment for Linkerd, we now turn to Linkerd itself. In this section, we outline our recommendations for configuring Linkerd for production environments.

Choose your deployment tool

Linkerd supports two basic installation flows: a Helm chart or the CLI. For a production deployment, we recommend using Helm charts, which better allow for a repeatable, automated approach. See the Helm docs and the CLI installation docs for more details.
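
A minimal Helm-based install looks roughly like the following. It assumes you have already generated the mTLS trust anchor and issuer credentials discussed later in this section (the ca.crt, issuer.crt, and issuer.key files below), and the value names should be double-checked against the chart for your Linkerd version.

  # Add the stable chart repository.
  helm repo add linkerd https://helm.linkerd.io/stable
  helm repo update

  # Install the control plane, supplying your own mTLS credentials.
  # (For production, also layer in the HA values file; see Enable HA mode below.)
  exp=$(date -d '+8760 hour' +"%Y-%m-%dT%H:%M:%SZ")   # GNU date; adjust on macOS
  helm install linkerd2 linkerd/linkerd2 \
    --set-file identityTrustAnchorsPEM=ca.crt \
    --set-file identity.issuer.tls.crtPEM=issuer.crt \
    --set-file identity.issuer.tls.keyPEM=issuer.key \
    --set identity.issuer.crtExpiry=$exp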

Decide on your metrics pipeline

Every data plane proxy provides a wealth of metrics data for the traffic it sees, exposed in a Prometheus-compatible format on a dedicated port. How you consume that data is an important choice.

Generally speaking, you have several options for aggregating and reporting these metrics. (Note that these options can be combined.)

  1. Ignore it. Linkerd’s deep telemetry and monitoring may simply be unimportant for your use case.

  2. Aggregate it on-cluster with the linkerd-viz extension. This extension contains an on-cluster Prometheus instance configured to scrape all proxy metrics, as well as a set of CLI commands, and an on-cluster dashboard, that expose those metrics for short-term use. By default, this Prometheus instance holds only 6 hours of data, and will lose metrics data if restarted. (If you are relying on this data for operational reasons, this may be insufficient.) An installation sketch follows this list.

  3. Aggregate it off-cluster with Prometheus. This can be done either by Prometheus federation from the linkerd-viz on-cluster Prometheus, or simply by using an off-cluster Prometheus to scrape the proxies directly, aka “Bring your own Prometheus”.

  4. Use a third-party metrics provider such as Buoyant Cloud. In this option, the metrics pipeline is offloaded entirely. Buoyant Cloud will work out of the box for Linkerd data; other metrics providers should also be able to easily consume data from Prometheus or from the proxies.
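
For options 2 and 3, the on-cluster pieces are installed as an extension, and an external Prometheus can either federate from it or scrape the proxies directly. A sketch:

  # Option 2: install the on-cluster metrics stack.
  linkerd viz install | kubectl apply -f -
  linkerd viz check

  # Option 3: point an external Prometheus at the viz Prometheus's /federate
  # endpoint (prometheus.linkerd-viz.svc.cluster.local:9090), or scrape the
  # proxies directly; see the "Bringing your own Prometheus" documentation.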

[Screenshot: Buoyant Cloud’s hosted Linkerd metrics]

Enable HA mode

Production deployments of Linkerd should use the high availability, or HA, configuration. This mode enables several production-grade behaviors of Linkerd, including:

  • Running three replicas of critical control plane components.
  • Setting production-ready CPU and memory resource requests on control plane components.
  • Setting production-ready CPU and memory resource requests on data plane proxies.
  • Requiring that the proxy auto-injector be functional for any pods to be scheduled.
  • Setting anti-affinity policies on critical control plane components so that they are scheduled on separate nodes and in separate zones.

This mode can be enabled by using the --ha flag to linkerd install or by using the values-ha.yaml Helm file. See the HA docs for more.
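
Either path is a small addition to the installation flow sketched earlier; with Helm, the HA values file ships inside the chart itself:

  # CLI flow:
  linkerd install --ha | kubectl apply -f -

  # Helm flow: fetch the chart to obtain its values-ha.yaml, then layer it in
  # alongside the certificate values shown earlier.
  helm fetch --untar linkerd/linkerd2
  helm install linkerd2 linkerd/linkerd2 \
    -f linkerd2/values-ha.yaml \
    --set-file identityTrustAnchorsPEM=ca.crt \
    --set-file identity.issuer.tls.crtPEM=issuer.crt \
    --set-file identity.issuer.tls.keyPEM=issuer.key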

Set up your mTLS certificate rotation and alerting

For mutual TLS, Linkerd uses three basic sets of certificates: the trust anchor, which is shared across clusters; the issuer certificate, which is set per cluster; and the proxy certificates, which are issued per pod. Each certificate also has a corresponding private key. (See our Kubernetes engineer’s guide to mTLS for more on this topic.)

Proxy certificates are automatically rotated every 24 hours without any intervention on your part. However, Linkerd does not rotate the trust anchor or the issuer certificate. Additionally, multi-cluster communication requires that the trust anchor be the same across clusters.

By default, the Linkerd CLI will generate one-year self-signed certificates for both the trust anchor and the issuer certificate, and will discard the trust anchor key. This allows for an easy installation, but is likely not the configuration you want in production.

If either the trust anchor or the issuer certificate expires without rotation, no Linkerd proxy will be able to communicate with another Linkerd proxy. This can be catastrophic. Thus, for production environments, we recommend the following approach:

  1. Create a trust anchor with a 10-year expiration period, and store the key in a safe location. (Trust anchor rotation is possible, but is very complex, and in our opinion best avoided unless you have very specific requirements.)
  2. Set up automatic rotation for issuer certificates with cert-manager.
  3. Set up certificate monitoring with Buoyant Cloud. Buoyant Cloud will automatically alert you if your certificates are close to expiring.

Certificate management can be subtle. Be sure to read through Linkerd’s full mTLS documentation, including the sections on rotating certificates.
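
For step 1, and for generating the initial issuer certificate that cert-manager will later rotate, the Linkerd documentation uses the step CLI; a sketch matching the recommendation above:

  # Trust anchor: 10 years (87600h). Store ca.key somewhere safe and offline.
  step certificate create root.linkerd.cluster.local ca.crt ca.key \
    --profile root-ca --no-password --insecure \
    --not-after 87600h

  # Issuer certificate: signed by the trust anchor, set per cluster and rotated.
  step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
    --profile intermediate-ca --ca ca.crt --ca-key ca.key \
    --no-password --insecure \
    --not-after 8760h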

Set up automatic rotation of webhook credentials

Linkerd relies on a set of TLS credentials for webhooks. (These credentials are independent of the ones used for mTLS.) These credentials are used when Kubernetes calls the webhook endpoints of Linkerd’s control plane, which are secured with TLS.

As above, by default these credentials expire after one year, at which point Linkerd becomes inoperable. To avoid last-minute scrambles, we recommend using cert-manager to automatically rotate the webhook TLS credentials for each cluster. See the Webhook TLS documentation.

Configure certain protocols

Linkerd proxies perform protocol detection to automatically identify the protocol used by applications. However, there are some protocols which Linkerd cannot automatically detect, for example, because the client does not send the initial bytes of the connection. If any of these protocols are in use, you need to configure Linkerd to handle them.

As of Linkerd 2.10, a connection that cannot be properly handled by protocol detection will incur a 10-second timeout, then be proxied as raw TCP. In earlier versions, the connection would fail after the timeout rather than continuing.

The exact configuration necessary depends not just on the protocol but on whether the default ports are used, and whether the destination service is in the cluster or outside the cluster. For example, any of the following protocols, if the destination is outside the cluster, or if the destination is inside the cluster but on a non-standard port, will require configuration:

  • SMTP
  • MySQL
  • PostgreSQL
  • Redis
  • ElasticSearch
  • Memcache

See the Protocol Detection documentation for guidance on how to configure this, if necessary.
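
As of 2.10, the usual mechanisms are the config.linkerd.io/opaque-ports annotation (for in-cluster, server-speaks-first destinations on non-default ports) and the config.linkerd.io/skip-outbound-ports annotation (for destinations outside the cluster). The namespaces, resource names, and ports below are illustrative:

  # In-cluster MySQL on a non-standard port: mark the port opaque on its Service.
  kubectl -n db annotate service mysql config.linkerd.io/opaque-ports=13306

  # Off-cluster database: bypass the proxy for that port on the client workload.
  kubectl -n app patch deploy api --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{
          "config.linkerd.io/skip-outbound-ports": "3306"}}}}}'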

Secure your tap command if necessary

Linkerd’s tap command allows users to view live request metadata, including request and response gRPC and HTTP headers (but not bodies). In some organizations, this data may contain sensitive information that should not be available to operators.

If this applies to you, follow the instructions in the Securing your Cluster documentation to restrict or remove access to tap.

Validate your installation

After installing the control plane, we recommend running linkerd check to ensure everything is set up correctly. Correct any issues in your environment highlighted in the output. Each failing check should contain a URL pointing to a more detailed explanation with possible solutions.

Once linkerd check passes, congratulations! You have successfully installed Linkerd.

Monitoring Linkerd

Monitoring Linkerd involves monitoring both control plane components and data plane components.

Connect your cluster to Buoyant Cloud

One quick way to get a full suite of automated monitoring for your Linkerd deployment is simply to connect your cluster to Buoyant Cloud. The free tier of Buoyant Cloud provides a comprehensive suite of monitoring and diagnostics for Linkerd, including monitoring of TLS certificate expiration, control plane and data plane health, and more.

[Screenshot: Buoyant Cloud’s monitoring alerts]

If that’s not possible, or if you simply want to set up your own monitoring in parallel, read on!

Monitoring Linkerd control plane metrics

Central to monitoring Linkerd’s health is monitoring its metrics. Since the Linkerd control plane runs on the data plane (its components are themselves meshed), you can use the same metrics pipeline you’ve already set up.

As a starting point, we recommend monitoring the following (example commands follow the list):

  • Existence of control plane components. Each component needs to be running in order for Linkerd to function.
  • Success rate of control plane components. This should never drop below 100%; any failure responses are a sign that something is going wrong.
  • Latency of control plane components. These levels should be set empirically and unexpected changes should be investigated.
  • Optionally, resource consumption of control plane components. This also requires tuning, as some components scale in memory and CPU usage with the overall level of traffic passing through the mesh. However, rapid changes are worth investigating, and consumption that approaches any resource limit should be addressed before it becomes a problem.
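
With the linkerd-viz extension installed, the first three items can be checked directly from the CLI (the same data is available in Prometheus for alerting); resource consumption is visible via kubectl top, assuming metrics-server is present:

  # Success rate, RPS, and latency for each control plane deployment.
  linkerd viz stat deploy -n linkerd

  # Resource consumption of control plane pods.
  kubectl top pod -n linkerd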

Monitoring Linkerd’s data plane

Monitoring of Linkerd’s proxies should focus primarily on resource usage, since the golden metrics reported will be those of the application pod. As with control plane components, the exact thresholds will depend on the traffic to the pod, and thus alerting should focus on rapid changes, or on situations where consumption approaches resource limits.

Accessing Linkerd’s logs

You can view logs from Linkerd’s control plane or data plane through the usual kubectl logs command. For the control plane, both the main container and the linkerd-proxy container of each pod may contain useful information.

By default, the control plane’s log level is set to INFO, which surfaces various events of interest, plus warnings and errors. For diagnostic purposes, it may be helpful to raise the log level to DEBUG; this can be accomplished with the linkerd upgrade --controller-log-level debug command.

Similarly, by default, the proxy’s log level is set to INFO. The log level of a proxy can be modified at runtime if necessary. Note that debug mode can be extremely verbose, especially for high-traffic proxies. Care should be taken to change the level back to INFO after debugging, especially in environments where increased log usage has a financial impact.
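
A few concrete commands, where linkerd-destination stands in for any control plane deployment and the workload names are hypothetical (4191 is the proxy’s admin port):

  # Control plane logs: main containers plus the injected proxy.
  kubectl logs -n linkerd deploy/linkerd-destination --all-containers

  # Raise the control plane log level (triggers a rolling update).
  linkerd upgrade --controller-log-level debug | kubectl apply -f -

  # Change one proxy's log level at runtime via its admin endpoint.
  kubectl -n app port-forward deploy/api 4191 &
  curl -X PUT --data 'linkerd=debug,warn' localhost:4191/proxy-log-level
  # Set it back (e.g. --data 'warn,linkerd=info') when you're done.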

Sending proxy diagnostics to Buoyant

Buoyant Cloud users can use the Send Diagnostics feature to send metrics and log information directly to Buoyant for debugging purposes.

Upgrading Linkerd

Generally speaking, Linkerd is designed for safe, in-place upgrades with no application downtime, when upgraded between consecutive stable versions—for example, from 2.8.1 to 2.9. (Upgrades that skip stable versions are sometimes possible, but are not always guaranteed; see the version-specific release notes for details.)

Note that, due to constraints that Kubernetes imposes, true zero-downtime upgrades are only possible if application components can themselves be “rolled” with zero downtime, as upgrading the data plane involves rolling injected workloads.

Upgrading Linkerd is done in two stages: control plane first, then data plane. To accomplish this, Linkerd’s data plane proxies are compatible with a control plane that is one stable version ahead; e.g. 2.8.1 data plane proxies can safely function with a 2.9 control plane.

Upgrading the control plane is typically done via the linkerd upgrade command. This will trigger a rolling deploy of control plane components, which should allow critical components to be upgraded without downtime. (Note that, in the event something does go wrong, Linkerd’s data plane proxies will continue functioning even if the control plane is unreachable; however, they will not receive service discovery updates.)

Once the control plane has been updated, the proxy-injector component will start injecting data plane proxies from the corresponding (newer) version. Since Kubernetes treats pods as immutable, upgrading the data plane thus requires rolling application components. Fortunately, because of the forward compatibility between data plane proxies and the control plane described above, these data plane upgrades can be done “lazily”—it is not necessary to immediately roll data plane deployments after upgrading the control plane.

Thus, our recommended steps for upgrading are as follows (a command-level sketch appears after the list):

  • Thoroughly read through the version-specific upgrade notes for the new release.
  • Survey the data plane versions at play in the cluster (e.g. via linkerd check --proxy) and ensure that existing data plane versions are within parameters for the new control plane version. Typically, they should all be the same version corresponding to one stable release prior to the version to which you want to upgrade.
  • Upgrade the control plane by following the documentation, and monitoring for changes.
  • Upgrade the data plane by rolling application components when possible.
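
At the command level, the CLI flow for these steps looks roughly like the following; Helm users would run helm upgrade with the new chart version instead. The namespace in the final step is a placeholder for each of your application namespaces:

  # Survey data plane versions and overall health first.
  linkerd check --proxy

  # Upgrade the CLI binary itself to the new version, then the control plane.
  linkerd upgrade | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -

  # Verify, then roll the data plane lazily, namespace by namespace.
  linkerd check
  kubectl -n app rollout restart deploy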

As with all modifications of critical system software, extreme care should be taken during the upgrade process. In our experience, human error is by far the most common source of production failures.

Version-specific upgrade notes are published in the Linkerd Upgrade documentation.

Good luck!

We know that productionizing and being on-call for critical systems can be difficult, stressful, and often thankless. We’ve done our best to make Linkerd as simple as possible to operate, but successfully operating a Kubernetes platform is by no means an easy task. We hope Linkerd treats you well, and from one group of engineers to another: we wish you the best of luck.

(And once you’re up and running, add yourself to ADOPTERS.md and we will send you some Linkerd swag!)

Disclaimer

Buoyant has made best efforts to confirm the accuracy and reliability of the information provided in this document. However, the information is provided “as is” without representation or warranty of any kind. Buoyant does not accept any responsibility or liability for the accuracy, completeness, legality, or reliability of the information in this document. Importantly, the information in this document is of a general nature; applicability and effectiveness will vary user by user according to use case, technical environment, traffic patterns, and integration, among many other factors.

No warranties, promises, or representations of any kind, expressed or implied, are given as to the nature, standard, accuracy, or otherwise of the information provided on this document, nor to the suitability or otherwise of the information to your particular circumstances.

Buoyant shall not be liable for any loss or damage of whatever nature (direct, indirect, consequential, or other) whether arising in contract, tort or otherwise, which may arise as a result of the use of (or failure to use) the information in this document.

Copyright © 2021 Buoyant, Inc. All rights reserved. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Appendix: Upgrade notes

2.10.2

Release summary: This stable release fixes a proxy task leak that could be triggered when clients disconnect while a service is in failfast. It also fixes an issue where the opaque-ports annotation on a namespace would overwrite the annotations on services in that namespace.

Who should upgrade: anyone on 2.10.1 who is experiencing issues with unbounded proxy memory usage, or with overridden service annotations in the presence of an opaque-ports annotation in the namespace.

Before upgrading: Please review the 2.10.2 release notes.

2.10.1

Release summary: This release adds CLI support for Apple Silicon M1 chips and support for SMI’s TrafficSplit v1alpha2. It fixes several proxy issues, including handling FailedPrecondition errors gracefully, inbound TLS detection from non-meshed workloads, and using the correct cached client when the proxy is in ingress mode. The logging infrastructure has also been improved to reduce memory pressure in high-connection environments. Finally, it includes several improvements to the control plane, including support for Host IP lookups in the destination service and an update to the proxy-injector so that it adds the opaque-ports annotation to pods when their namespace has it set.

On the CLI side, the linkerd repair command is now aware of the control plane version, and various bugs have been fixed around the linkerd identity command.

Who should upgrade: all 2.10.0 users.

Before upgrading: Please review the 2.10.1 release notes.

2.10.0

Release summary: This release introduces Linkerd extensions. The default control plane no longer includes Prometheus, Grafana, the dashboard, or tap, which have been moved to a linkerd-viz extension. Similarly, cross-cluster communication is now in the linkerd-multicluster extension and distributed tracing functionality is in the linkerd-jaeger extension.

This release also introduces the ability to mark certain ports as “opaque”, indicating that the proxy should treat the traffic as opaque TCP instead of attempting protocol detection. This allows the proxy to provide TCP metrics and mTLS for server-speaks-first protocols. Finally, it adds support for TCP traffic in multicluster communication.

Who should upgrade: This is a feature release.

Before upgrading: Please review the 2.10.0 upgrade notice and release notes. Pay special attention to the 2.10 ports and protocols upgrade guide as it is very likely you will have to update some of your configuration.

2.9.5

Release summary: This stable release fixes an issue where the destination service is throttled after overwhelming the Kubernetes API server with node topology queries. This results in the destination service failing requests and spiking in latency. By moving to a shared informer for these queries, the information is now fetched asynchronously.

Who should upgrade: anyone on 2.9.4 who is experiencing issues with spiking destination service latency, or failing requests.

Before upgrading: Please review the 2.9.5 release notes.

2.9.4

Release summary: This release fixes an issue that prevented the proxy from being able to speak HTTP/1 with older versioned proxies. This fix was announced in 2.9.3 but wasn’t actually included in the release.

This release also fixed the linkerd install command so that it can properly detect and avoid overwriting already-installed Linkerd instances from versions prior to 2.9.

Who should upgrade: Several classes of users should upgrade to this release. First, all users who upgraded from 2.8.x to 2.9.x should upgrade to this release prior to upgrading to future 2.10 releases. Second, 2.8.x users who were unable to upgrade to 2.9.x due to errors with communication between 2.9.x and 2.8.x proxies over HTTP/1 should upgrade. Finally, users who used cert-manager to automatically rotate webhook certificates should upgrade.

Before upgrading: Please review the upgrade notice for the earlier point release, 2.9.3, as well as the 2.9.3 release notes and 2.9.4 release notes.

2.9.3

Users should upgrade to 2.9.4 instead of this release.

This stable release was an attempt to fix an issue that prevented the proxy from being able to speak HTTP/1 with older versioned proxies. Unfortunately, the fix was not actually included in the release!

It also fixed an issue where the linkerd-config-overrides secret would be deleted during upgrade and provides a linkerd repair command for restoring it if it has been deleted.

2.9.2

Release summary: This stable release fixes an issue that stops traffic to a pod when there is an IP address conflict with another pod that is not in a running state.

It also fixes an upgrade issue when using HA that would lead to values being overridden.

Who should upgrade: Users who are experiencing unexpected traffic stops with Linkerd 2.9.1.

Before upgrading: Please review the 2.9.2 release notes.

2.9.1

Release summary: This stable release contains a number of proxy enhancements: better support for high-traffic workloads, improved performance by eliminating unnecessary endpoint resolutions for TCP traffic and properly tearing down server-side connections when errors occur, and reduced memory consumption on proxies which maintain many idle connections (such as Prometheus’ proxy).

On the CLI and control plane sides, it relaxes checks on root and intermediate certificates (following X509 best practices), and fixes two issues: one that prevented installation of the control plane into a custom namespace and one which failed to update endpoint information when a headless service was modified.

Who should upgrade: Users with high-traffic workloads or who are experiencing issues with the 2.9 release.

Before upgrading: Please review the 2.9.1 release notes.

2.9.0

Release summary: This release extends Linkerd’s zero-config mutual TLS (mTLS) support to all TCP connections, allowing Linkerd to transparently encrypt and authenticate all TCP connections in the cluster the moment it’s installed. Other notable features in this release are: support for ARM architectures, a new multi-core proxy runtime for higher throughput, and support for Kubernetes service topologies.

Who should upgrade: This is a feature release.

Before upgrading: Please review the 2.9.0 upgrade notice and release notes.

2.8.1

Release summary: This release fixes multicluster gateway support on EKS.

Who should upgrade: EKS users who desire cross-cluster connectivity.

Before upgrading: Please review the 2.8.1 release notes.

2.8.0

Release summary: This release introduces a new multi-cluster extension to Linkerd, allowing it to establish connections across Kubernetes clusters that are secure, transparent to the application, and work with any network topology.

Who should upgrade: This is a feature release. However, there is a known issue with multi-cluster connectivity on EKS. Users who desire this feature on EKS should delay upgrading until 2.8.1, expected within a few weeks.

Before upgrading: Please review the 2.8.0 upgrade notice and release notes.

2.7.1

Release summary: This release introduces substantial proxy improvements resulting from continued profiling and performance analysis. It also improves support for Kubernetes 1.17.

Who should upgrade: Users of Kubernetes 1.17, and users who are experiencing missing updates from service discovery (often manifesting as 503 errors).

Before upgrading: Please review the 2.7.1 release notes.

2.7.0

Release summary: This release adds support for integrating Linkerd’s PKI with an external certificate issuer such as cert-manager, as well as streamlining the certificate rotation process in general. For more details about cert-manager and certificate rotation, see the documentation. This release also includes performance improvements to the dashboard, reduced memory usage of the proxy, various improvements to the Helm chart, and much more.

Who should upgrade: This is a feature release.

Before upgrading: Please review the 2.7.0 upgrade notice and release notes.