How Linkerd improved Salt's platform efficiency, reliability, and performance within one week
Chasing API traffic anomalies with minimal downtime
Salt Security pioneered API security. The cybersecurity vendor built a platform that protects APIs across their entire life cycle, shielding its customers from attacks. If the platform is down, API attacks won’t be blocked, so downtime is not an option.
Salt’s platform was built as microservices running on Kubernetes from day one. But as the company grew, the team started seeing backward-compatibility issues. To address them, they decided to move to gRPC, which introduced a different challenge: they needed a way to load balance gRPC requests, something Kubernetes doesn’t support natively. After a little research, they came across Linkerd. To their surprise, Linkerd would not only help with gRPC load balancing, but also improve platform efficiency, reliability, and overall performance.
The backdrop: the platform, infrastructure, and team
Salt’s platform continuously monitors and learns from customer traffic. This allows it to identify anomalies and sophisticated API attacks within seconds. Salt’s customers include household names such as The Home Depot, Telefonica Brasil, and City National Bank.
At any given moment, countless attacks are carried out across the internet. A gateway into an organization’s most valuable data and services, APIs are an attractive target for hackers.
This is how it works: Salt hosts the metadata of their customers’ APIs and, to find and stop attacks, they run AI and ML against that data. Minimizing downtime while ingesting all that customer traffic is imperative. Imagine a customer with traffic averaging around 30,000 RPS (requests per second). A downtime of merely one second would translate into 30,000 opportunities for malicious hackers! While data breaches may only last a few seconds, their consequences are often devastating. They create a perfect window of opportunity for DoS (denial of service) attacks or PII exposure.
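The arithmetic behind that figure is simply request rate multiplied by outage length. As a minimal sketch (the function name is hypothetical, chosen only for illustration):

```python
def requests_missed(rps: int, downtime_seconds: float) -> int:
    """Requests that arrive, uninspected, while the platform is down."""
    return round(rps * downtime_seconds)


print(requests_missed(30_000, 1))     # 30000
print(requests_missed(30_000, 0.25))  # 7500
```

Even a quarter-second gap at that rate leaves thousands of requests unchecked.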
To avoid service disruption, Salt’s platform is composed of about 40 microservices running in various Kubernetes clusters and spanning AWS and Azure regions. If one cloud provider experiences an outage, the platform can immediately trigger a failover, ensuring the availability of all Salt services.
Backward compatibility with gRPC
When Salt started to rapidly grow in 2020, the platform had to scale accordingly. To adapt to these changes, the architecture and services started evolving quickly, as did the messages exchanged between them.
The platform team wanted to enable their developers to move fast and with confidence. They needed a way to ensure that no change to an API call would break existing services, and gRPC seemed to be the perfect candidate.
Why gRPC, or in this case Protobuf specifically? Consider how data is serialized with Protobuf (Protocol Buffers) as opposed to JSON (or any other serialization framework such as Kryo). If a field is removed from a message, the services that exchange it may be rolled out at different times, leaving them out of sync with the change.
Say service A sends service B a message with three fields. Removing or renaming a field may prevent one of those services from deserializing the message. Protobuf helps by identifying fields by number and allowing the numbers of deleted fields to be reserved, so they aren’t reused or read by accident. This, along with tools like Buf, allowed Salt to build a single repository that declares all company-internal APIs. They can also keep it stable and error-free by checking it at compile time in the CI pipeline.
The move to gRPC standardized Salt’s internal API, creating a single source of truth for inter-service messages. This is how they achieved backward compatibility with gRPC, which is why Salt adopted it in the first place.
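To make the field-numbering idea concrete, here is a toy Python model. It is not the real Protobuf wire format or API, and the schemas, field names, and helper functions are all hypothetical. It only illustrates why keying data by a stable number, and reserving the numbers of deleted fields, lets two services on different schema versions keep talking:

```python
# Toy model of Protobuf-style field numbering (NOT the real wire format).
# A "wire message" is a dict keyed by field number rather than field name.

SCHEMA_V1 = {1: "user_id", 2: "endpoint", 3: "status"}

# v2 drops field 2. Its number is listed as reserved so it is never reused
# for a field with a different meaning.
SCHEMA_V2 = {1: "user_id", 3: "status"}
RESERVED_V2 = {2}


def check_schema(schema, reserved):
    """Fail fast (for example in CI) if a reserved field number is reused."""
    reused = set(schema) & set(reserved)
    if reused:
        raise ValueError(f"reserved field numbers reused: {sorted(reused)}")


def encode(schema, values):
    """Turn {name: value} into {field_number: value} according to the schema."""
    by_name = {name: number for number, name in schema.items()}
    return {by_name[name]: value for name, value in values.items()}


def decode(schema, wire):
    """Keep the fields this schema knows about; skip unknown field numbers."""
    return {schema[n]: v for n, v in wire.items() if n in schema}


check_schema(SCHEMA_V2, RESERVED_V2)

# Service A still runs v1, while service B has already moved to v2.
wire = encode(SCHEMA_V1, {"user_id": 7, "endpoint": "/login", "status": 200})
print(decode(SCHEMA_V2, wire))  # {'user_id': 7, 'status': 200}
```

The receiver on the newer schema ignores the field it no longer knows, and the schema check plays the role of the compile-time validation a CI pipeline would run.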
But gRPC has many more benefits, including strongly typed APIs, great performance, and support for more than 10 languages. It also came with challenges, chief among them load balancing gRPC requests. Because gRPC runs over HTTP/2, which multiplexes many requests over a single long-lived connection, Kubernetes’ native connection-level (TCP) load balancing can’t distribute those requests effectively. Since all microservices were replicated for load balancing and HA, being unable to distribute cross-service communication between replicas was not acceptable.
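The difference between balancing connections and balancing requests can be shown with a small simulation. This is a sketch only: the replica names are hypothetical, and plain round-robin stands in for whatever algorithm a real balancer uses, so it should not be read as a description of Linkerd’s internals.

```python
from collections import Counter
from itertools import cycle

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical names


def connection_level(num_connections, requests_per_connection):
    """Pick a replica once per connection; every request on it follows."""
    picker = cycle(REPLICAS)
    hits = Counter()
    for _ in range(num_connections):
        replica = next(picker)  # decided once, at connect time
        hits[replica] += requests_per_connection
    return dict(hits)


def request_level(num_requests):
    """Pick a replica for every individual request."""
    picker = cycle(REPLICAS)
    hits = Counter(next(picker) for _ in range(num_requests))
    return dict(hits)


# One long-lived HTTP/2 connection carrying nine gRPC requests:
print(connection_level(1, 9))  # {'replica-0': 9}
print(request_level(9))        # {'replica-0': 3, 'replica-1': 3, 'replica-2': 3}
```

With a single long-lived connection, connection-level balancing sends every request to the same replica, while request-level balancing spreads them evenly.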
Load balancing gRPC with Linkerd
The platform team researched a number of solutions to address the load balancing issue. Envoy, Istio, and Linkerd were the final contenders. The team was familiar with the term “service mesh,” but had no production experience with any of these tools, so they decided to assess each of them.
As a fast-growing company, Salt has to be smart about how they allocate resources. When evaluating any technology, the level of effort needed to maintain it is always a key criterion.
"We started with Linkerd and it turned out to be so easy to get started that we gave up on evaluating the other two — we loved it from the get-go. Within hours of stumbling over it online, we had deployed it on our dev environment. Three more days and it was up and running on our new service running in production."
— Eli Goldberg, Platform Engineering Lead at Salt Security
That gave them the confidence they needed to start migrating their services to gRPC and adding them to Linkerd.
Newly gained visibility, reliability, and security
After meshing their services, the platform team realized they had something far more powerful than a load balancing tool for gRPC. Linkerd provided them with additional visibility, reliability, and security features.
Inter-service communication was now fully mTLSed, stopping potential malicious hackers from “sniffing” their traffic, should there be an internal breach. Thanks to Linkerd’s latest gRPC retry feature, brief network errors look like small delays and not hard failures that trigger a full-blown investigation.
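From the caller’s point of view, the effect of retries resembles the sketch below. Linkerd performs retries in its proxy, not in application code, so this is only an illustration of the behavior; the exception class, function names, and parameters are hypothetical.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a brief network failure."""


def call_with_retries(call, max_attempts=3, delay_seconds=0.01):
    """Retry a call on transient errors; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)  # the caller only sees a short delay


attempts = []


def flaky():
    """Fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("connection reset")
    return "ok"


print(call_with_retries(flaky))  # ok
```

Two failed attempts are absorbed, and the caller receives a normal response after a small delay.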
They’ve also come to rely on the Linkerd dashboard, where they monitor internal live traffic and how services communicate with one another. A table displaying all request latencies has helped them identify the slightest regression in backend performance. And Linkerd’s “top” feature flagged live requests with excessive calls between services that they weren’t aware of.
They also found the CLI easy to use. Through the check command, they get instant feedback on any deployment issues. For instance, when resource problems cause a control plane pod to be evicted, linkerd check will flag that pod as missing.
The platform team soon realized that Linkerd isn’t just useful in production. Its monitoring and visibility features are similar to those of other logging, metrics, or tracing platforms. If Linkerd allows them to see what’s wrong in production, why not use it before pushing code to prod? Today, Linkerd is a key part of Salt’s dev stack.
Additionally, they found it very simple to operate with no one having to be “in charge” of maintaining the service mesh — it “just works!”
These are quite a few extra benefits the platform engineers didn’t consider critical during their evaluation process. However, now that they’ve gotten used to them, they wouldn’t want to miss them.
The Salt team also appreciated the Linkerd community. The documentation covers all use cases they had so far, as well as those they are planning on implementing next (e.g. traffic splits and canary deployments). Whenever they didn’t find the answer in the docs, Slack turned out to be a valuable resource.
"The community is very helpful and super responsive. You can usually receive an answer within a few hours. People share their thoughts and solutions all the time, which is really nice."
— Omri Zamir, Senior Software Engineer at Salt Security
Increased efficiency, reliability, and performance within one week!
The team started seeing tangible results only one week after implementing Linkerd! The performance, efficiency, and reliability of the Salt platform improved significantly. They are now able to observe and monitor service- and RPC-specific metrics and take action in real time. Since all service-to-service communication is now encrypted, they’ve increased the level of security within their cluster.
In Salt’s line of business, scalability and performance are mission-critical. As they continue to grow rapidly, so do the demands on their platform. More customers translate into more ingested real-time traffic. Recently, the scalability of their platform was put to the test. They were able to increase traffic almost 10x without any issues. This elasticity is key to ensuring flawless service.
They were also able to identify excessive calls that had gone unnoticed and led to wasteful resource consumption. Since they are using Linkerd in their dev environments, they now discover these excessive calls before they go to production, avoiding them altogether.
Finally, since no one is needed to maintain Linkerd, they will now dedicate those valuable development hours to projects like traffic splits, canary deployments, and chaos engineering to make their platform even more robust.
The cloud native effect
Avoiding downtime is absolutely critical to Salt. Especially for the platform team, there is little room for error. Not long ago this would have been an almost insurmountable task. But today, cloud-native technologies make that feasible. With a platform built entirely on Kubernetes, Salt’s microservices communicate via gRPC and the service mesh ensures communication is encrypted. Additionally, Linkerd provides deep, real-time insights into the traffic layer, enabling the team to remain a step ahead of potential problems.
Although the Salt platform team adopted Linkerd to solve the load balancing issue, the service mesh ended up improving overall platform performance. That translates into fewer fire drills and more quality time with family and friends. It also allows them to work on more advanced projects and deliver new features faster.