This case study was originally published on cncf.io.
Founded in 2015, Lunar challenged the banking status quo by reinventing how people interact with their finances. Lunar is for those who want everything money-related in one place — 100% digital, right in their hands.
That means offering customers a smarter way to manage their money, with more control, faster savings, easier investments, and no meaningless fees. That’s how Lunar envisions the future of banking.
Those are big plans, and, to achieve them, Lunar must be able to grow and scale rapidly. In 2021, they acquired Lendify, a Swedish lending company, and PayLike, a Danish fintech startup. With these acquisitions, Lunar gained various new business-critical capabilities, but it also meant the engineering team had to integrate all these systems, ensuring they work together smoothly — a non-trivial task.
Lunar’s team of 150+ full-time engineers pushes about 40 releases to production every day. Twenty of those engineers work across four platform teams, which operate nine Kubernetes clusters across three cloud providers (AWS, Microsoft Azure, and Google Cloud Platform) in multiple availability zones. They run 250+ microservices plus a range of platform services that are part of their self-service developer platform. Lunar wants their teams (or Squads, as they call them) to be autonomous and self-driven. To support this “shift left” mindset, a group of platform Squads builds abstractions and tooling so developers can ship their features quickly, securely, compliantly, and efficiently.
There are multiple reasons why Lunar chose the cloud native path. First, they needed a platform that allowed their teams to manage their own services and be fully autonomous. Second, as a fintech company pioneering cloud-based banking, they had to clearly document how they would avoid cloud vendor lock-in, a regulatory requirement of the Danish FSA. Functioning as an abstraction on top of a cloud provider, Kubernetes helped them achieve both goals and was thus a natural fit. With most dependencies removed, this autonomy allows Lunar to scale easily and positions it well to achieve its ambitious goals.
The platform team also provides Squads with a mix of open source tooling, including Backstage, Prometheus, and Jaeger, as well as custom-built solutions they have open-sourced, such as shuttle and release-manager.
“This multi-cloud strategy and work style support the company’s goal of scaling in terms of the number of employees and mergers and acquisitions. It also allows us to stay technology agnostic and choose the technologies that best fit our needs,” said Kasper Nissen, Lead Platform Architect at Lunar.
The first platform service Lunar centralized was Humio, their log management system. At the time, Lunar was developing failover processes for their production clusters. During development, they discovered that logs in their log management system were going missing. To avoid losing potentially critical data once in production, they removed the log system from their production cluster and centralized it before implementing the failover capability.
After successfully centralizing their log management system, they embarked on a platform services centralization journey. Lunar has multiple environments, and its platform services, including the observability stack, were replicated in each environment — all fairly complex services that consumed lots of resources. Minimizing the number of stateful services in these environments would clearly help. Additionally, running nine replicated setups simply didn’t scale.
Lunar also had multiple endpoints to access tools like Grafana, resulting in duplicated effort managing users and dashboards. This confused the development teams, who had to implement changes in multiple places, and led to drift between environments, among other challenges. Managing users in one system is much more efficient than doing so in nine (or more).
All this factored into Lunar’s decision to create a centralized cluster, owned by the platform team, that would eventually run the entire observability stack, release management, developer tooling, and Cluster API. That effort is ongoing.
Today, Lunar’s log and release management run as centralized services along with Backstage and a handful of other tools. Next is their monitoring setup, a mix of Buoyant Cloud and Prometheus/Grafana.
Once Lunar started centralizing platform services, they had to connect their clusters. At the time, they were only running clusters in AWS and considered VPC peering across accounts. Doing that was somewhat painful due to clashing CIDR ranges. They also evaluated VPNs but aren’t big fans of using technologies with two static boxes on each end. Besides, they wanted to move towards zero trust networking, following the principles of BeyondProd by Google.
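To illustrate why clashing CIDR ranges make VPC peering painful, here is a minimal sketch that detects overlapping address blocks using Python's standard ipaddress module. The VPC names and ranges are hypothetical and do not reflect Lunar's actual network layout; the point is only that two networks cannot be peered cleanly when one range contains or intersects the other.

```python
import ipaddress

# Hypothetical VPC address ranges, chosen for illustration only.
vpcs = {
    "prod": "10.0.0.0/16",
    "staging": "10.0.128.0/17",
    "platform": "10.1.0.0/16",
}


def find_overlaps(cidrs):
    """Return (name, name) pairs, in sorted order, whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


print(find_overlaps(vpcs))  # [('prod', 'staging')]
```

In this made-up layout, "staging" sits inside the "prod" range, so routes between the two would be ambiguous. Every additional account or cluster adds more pairs to check, which is part of why this approach gets harder as the number of environments grows.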
During the 5+ years of running Kubernetes, Lunar continuously evaluated service meshes. In 2017, they even had Linkerd running as a PoC but decided against it. It was still the JVM-based Linkerd 1 and quite complex. They kept following the development and evolution of service meshes. When they heard the Linkerd 2.8 release included multi-cluster capabilities, they realized it was time to give service meshes another shot.
This decision was further reinforced by problems they were experiencing with gRPC load balancing and the need to switch to mTLS for all internal communication. Kubernetes Services balance traffic per connection, so the long-lived HTTP/2 connections that gRPC uses are not balanced per request out of the box. A service mesh made a lot more sense now.
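The gRPC load balancing problem can be shown with a toy simulation. gRPC multiplexes many requests over one long-lived HTTP/2 connection, so a balancer that picks a backend only when a connection opens sends every request on that connection to the same pod. The sketch below contrasts that with per-request balancing, which a proxy that understands HTTP/2 can do. The pod names are hypothetical, and plain round-robin is used here only as a stand-in for whatever policy a real mesh proxy applies.

```python
from collections import Counter
from itertools import cycle

BACKENDS = ["pod-a", "pod-b", "pod-c"]  # hypothetical pod names


def connection_level(num_requests, backends):
    """Choose a backend once, when the connection opens, and reuse it for every request."""
    chosen = backends[0]  # the pod the single long-lived connection landed on
    return Counter(chosen for _ in range(num_requests))


def request_level(num_requests, backends):
    """Choose a backend for each request (simple round-robin for illustration)."""
    rotation = cycle(backends)
    return Counter(next(rotation) for _ in range(num_requests))


print(dict(connection_level(9, BACKENDS)))  # {'pod-a': 9}
print(dict(request_level(9, BACKENDS)))     # {'pod-a': 3, 'pod-b': 3, 'pod-c': 3}
```

With connection-level balancing, one pod absorbs all nine requests while the others sit idle; with request-level balancing, the load is spread evenly.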
While they have always been big fans of Linkerd’s approach of starting with the basics and making them work well, they evaluated both Linkerd and Istio. They dedicated two engineers for a week, one exploring Istio and the other Linkerd.
“We had the Linkerd multi-cluster up and running within an hour! After a few days of struggling with Istio, we gave up on it. Linkerd did the job fast and easily: the perfect mesh for us. It had all the features we needed at the time, was easy to operate, had a great community, and solid documentation,” said Kasper Nissen, Lead Platform Architect.
Since going live, Lunar has also started using Buoyant Cloud for better visibility across all their environments.
As a CNCF End User Member, the Lunar engineering team trusts and uses many CNCF projects. Their stack includes Kubernetes, Prometheus, cert-manager, Jaeger, CoreDNS, Fluent Bit, Flux, Open Policy Agent, Backstage, gRPC, and Envoy, among others. They’ve built an Envoy-based ingress/egress gateway in all clusters to give developers a simple abstraction for exposing services in different clouds.
From a technology perspective, Lunar has now achieved a fairly simple way to provide and connect clusters across clouds. “Kubernetes allows us to run anywhere, Linkerd enables us to seamlessly connect our clusters, and GitOps provides an audited way to manage our environments across multiple clouds with the same tooling and process. From a developer’s perspective, the process is identical whether you deploy on GCP or AWS,” said Nissen.
The business impact has been substantial. With Lunar’s new multi-cloud communication backbone, they are better positioned to support upcoming mergers and acquisitions — a key part of their business strategy. They can extend the Lunar platform in a cloud-agnostic way while selecting the provider that best fits their needs for each use case.
With logs now centralized and no risk of losing logs during failover, they will soon implement quarterly production cluster failovers. That will allow them to know exactly how their system behaves in case of a failure and how to bring it back up. It’s important both from a regulatory perspective and a business perspective. If their customers lost access to their account information, it would have disastrous consequences for their business. That’s why they proactively train for the worst-case scenario. If something were to happen, they would know exactly what to do and how to avert any issues.
Centralizing their platform services has already streamlined many processes and improved developer productivity. All releases, metrics, logs, and traces are tagged with fields such as Squad name and environment, making it easy for developers to find what they are looking for. The tags also make it clear which Squad owns each service.
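As a rough sketch of how consistent tagging helps with both lookup and ownership, the example below filters a mixed list of telemetry records by tag. The record shape, field names, Squad names, and sample data are all hypothetical and are not taken from Lunar's platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    kind: str         # e.g. "log", "metric", "trace", or "release"
    squad: str        # owning Squad (hypothetical names below)
    environment: str  # e.g. "dev" or "prod"
    message: str


# Made-up sample data for illustration.
RECORDS = [
    Record("log", "payments", "prod", "transfer accepted"),
    Record("log", "payments", "dev", "transfer rejected"),
    Record("release", "cards", "prod", "deployed card-service"),
    Record("trace", "payments", "prod", "POST /transfers took 41ms"),
]


def find(records, **tags):
    """Return the records whose fields match every given tag exactly."""
    return [r for r in records if all(getattr(r, k) == v for k, v in tags.items())]


for record in find(RECORDS, squad="payments", environment="prod"):
    print(record.kind, record.message)
```

Because every record carries the same tags, one query answers both "what happened in production for this Squad" and "who owns this artifact", regardless of which tool produced the record.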
Managing the team is also a lot simpler for Nissen. “I don’t have to set up dashboards, help search through logs, etc. — our Squads are truly independent. Because our platform is based on self-service, it is decoupled from the organization allowing our team to focus on implementing the next thing that will help our developers move faster, be more secure, or ensure better quality,” explains Nissen.
Audits have become a lot easier too. Since everything is centralized, they can run audit reports for all clouds and services across groups and environments, which provides peace of mind in the highly regulated financial services industry.
While they aren’t there yet, they expect to save significant engineering time by no longer having to operate and maintain nine copies of the soon-to-be fully centralized stack.
Overall, Lunar feels well-positioned for upcoming acquisitions and organic growth. With a platform able to extend anywhere, they’ve become a truly elastic organization. If your business expects to scale fast, Nissen recommends centralizing your platform services. “It’s not painless, but once it’s all set up, growing your business becomes increasingly easy,” stated Nissen.