Aug 1, 2023
This article was originally published on Noobpreneur
Any organization working in the cloud has to manage multiple, complex challenges to security and reliability, while keeping a tight rein on costs. Your brand’s reputation depends on managing these challenges with aplomb, ensuring that you handle threats and failures quickly, transparently, and efficiently. Increasingly, organizations are choosing an open-source service mesh to help avoid downtime—while driving potentially game-changing business benefits, including drastic reductions in cloud spend.
The shift to cloud-native technologies has fundamentally changed application development from a world where applications run on hardware and networks completely controlled by the developing organization to a world where that control is traded for lower costs and speed in the development cycle. In turn, this tradeoff requires the organization to embrace new cloud-native patterns, like microservices, Kubernetes, and the use of a service mesh, so that the application still has needed security and resilience.
These new patterns allow shifting the security boundary entirely from physical data-center security to application security, including ensuring that all data is encrypted both at rest and in transit. The service mesh plays a critical role in this shift, by adding security, reliability, and observability to the application in a way that minimizes developer involvement.
For example, Linkerd, the first open source service mesh to achieve the “graduated” status in the Cloud Native Computing Foundation, uses sophisticated techniques such as mutual TLS to safeguard both confidentiality (encryption) and authenticity (identity validation) of both sides of the connection for all traffic within an application. Linkerd does this completely transparently, without the application needing to change.
Additionally, the mesh’s observability features can allow the operations staff to see problems on a Friday night before they become an emergency. And its reliability features can prevent needing to call in a developer team to work the weekend—instead, the operations staff can simply configure the mesh for automatic retries, preserving the user experience and leaving the more intense problem hunting for Monday.
Service meshes offer direct and secondary cost savings. The mesh can reduce direct cloud costs by allowing organizations to get rid of load balancers in the cloud and reduce some network traffic. In some cases, organizations have been able to eliminate hundreds of paid IP addresses for microservices.
In cases where the organization is running clusters spanning multiple availability zones, some service meshes like Linkerd can even further reduce costs by carefully routing traffic (or handling outages) so that traffic stays within a zone, which costs less than traffic between zones. This can bring dramatic reductions in cloud network spend (millions of dollars a year) while still retaining the failure-resistant properties of multi-zone deployments, as in the case of Entain Australia, which 10x’d throughput and saved thousands of dollars a day by deploying Linkerd.
Service meshes also offer secondary cost savings resulting from the increased efficiency of developers. By delivering critical platform features like mutual TLS, latency-aware load balancing, retries, success rate instrumentation, transparent traffic shifting, and more, service mesh frees developers of these tasks, allowing them to focus on the business logic that drives the organization. These savings are significant. Critical platform maintenance can be incredibly difficult to get right in a large distributed system, placing undue pressure on application developers.
Operational continuity—and the availability of online services—leaves organizations with no slack. Users have come to expect instant access at any time of the day. In a distributed system, IT outages that start as partial failures in one area can quickly escalate into major operational disruptions that impact the customer experience. Issues, errors, or delays reflect on the organization or the brand.
A service mesh delivers a sophisticated set of distributed system reliability features that can help prevent escalation in the first place, including request-level load balancing, timeouts, retries, rate limiting, circuit breaking, and traffic shifting. Some service meshes even provide powerful features like latency-based load balancing and retry budgets to tamp down on partial failures before they escalate.
Uses cases for service mesh come from industry leaders that include Microsoft, Plaid, and Adidas. These companies, all with global users, have realized the business benefits of creating scalable systems with resilient infrastructure that includes automatic retries, circuit breakers to isolate faults, and seamless restore functionality. Service mesh helps them detect where failures are happening with advanced observability and provides Zero Trust security system-wide.
The business impacts of using service mesh can be seen in an organization’s overall uptime (increased), overall spending in networking and engineering (decreased), and developer/engineer productivity (increased). Other changes are more subtle but still measurable, including employee satisfaction for those in networking or development and positive shifts in the organization’s development philosophy.
And, of course, there’s one other important metric: how many problems your customers notice. When customers and users are unaware of issues because service mesh has things covered, your organization is meeting expectations, building trust, and enhancing your reputation.