How Bink built a fault-tolerant app stack on a dynamic foundation
Powering Barclays’ digital loyalty transactions with Linkerd
Bink, a UK-based fintech company, has made it its mission to reimagine loyalty programs. Committed to making them easier for everyone — banks, shops, and customers alike — Bink developed an app that recognizes loyalty points every time customers shop. With a simple tap, it connects purchases with reward programs.
The Bink app links customer payment cards to any loyalty program and is powered by an extensible platform that meets strict regulatory security and accountability criteria. As a greenfield project, Bink’s engineering team was able to leverage multiple CNCF projects, including Kubernetes, Linkerd, Fluentd, Prometheus, and Flux. Today, their technology stack is performant, scalable, reliable, and secure, and it shields the app from issues caused by transient network problems.
Loyalty programs with a positive impact on retail
The Bink team has years of experience in banking, retail, and loyalty programs. With a deep understanding of banking opportunities, they know how to transform loyalty programs to have a positive impact on retail.
In 2019, Barclays recognized Bink’s immense potential and agreed to a significant investment. Because of this partnership, Bink is now available to millions of Barclays’ customers in the United Kingdom!
Although the platform was initially built mainly by one individual while budgets were tight, the engineering team knew they had to design Bink’s infrastructure so it could keep pace as the startup grew.
"Fast forward three years, and a team of three is supporting our in-house built platform capable of processing millions of transactions per day — a true testament to the amazing technology of the cloud native ecosystem!"
— Mark Swarbrick, Head of Infrastructure at Bink
The engineering team, infrastructure, and platform
Mark Swarbrick, Chris Pressland, and Nathan Read are the three platform engineers in charge of Bink’s platform, which runs on six Kubernetes clusters in Microsoft Azure — two for production in a multi-cluster setup, each running about 57 microservices.
Bink initially had three web servers on bare metal Ubuntu 14.04 instances running a few uWSGI apps load-balanced by NGINX instances — they had no automation of any kind.
In 2016, the team began migrating the applications to Docker containers, moving away from SFTPing code onto production servers and restarting uWSGI pools. They built a container orchestrator in Chef that assigned host ports to containers and updated NGINX’s proxy_pass blocks to route traffic dynamically. This worked well until they discovered that Docker caused numerous kernel panics and other issues on their aging Ubuntu 14.04 infrastructure.
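The mechanics of that home-grown orchestrator can be sketched in a few lines. This is a hypothetical illustration, not Bink’s actual Chef code: assign each container a free host port, then render an NGINX upstream plus a proxy_pass block so traffic reaches the containers at their current ports. All names (`assign_host_ports`, `render_nginx_config`, `uwsgi_app`) are invented for the example.

```python
def assign_host_ports(containers, base_port=8000):
    """Give each container a unique host port, starting at base_port."""
    return {name: base_port + i for i, name in enumerate(containers)}


def render_nginx_config(app_name, port_map):
    """Render an NGINX upstream block and a proxy_pass location for the app."""
    upstream_lines = "\n".join(
        f"    server 127.0.0.1:{port};" for port in sorted(port_map.values())
    )
    return (
        f"upstream {app_name} {{\n{upstream_lines}\n}}\n"
        f"server {{\n"
        f"    listen 80;\n"
        f"    location / {{\n"
        f"        proxy_pass http://{app_name};\n"
        f"    }}\n"
        f"}}\n"
    )


# Rebuild the config whenever containers are (re)started, then reload NGINX.
ports = assign_host_ports(["web-1", "web-2", "web-3"])
print(render_nginx_config("uwsgi_app", ports))
```

Each container restart would regenerate this file and trigger an NGINX reload, which is why the approach worked until the underlying Docker/kernel instability made the churn unmanageable.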
Around that time, the engineering team got approval to evaluate a migration from their data center to the cloud, as their needs were far outgrowing what the data center could offer. Because the company was already a Microsoft customer, Azure became the obvious choice. “We quickly concluded that maintaining our own container orchestrator wasn’t sustainable in the long run and decided to move to Kubernetes,” explained Swarbrick. Since Microsoft didn’t have a Kubernetes offering that met their requirements at the time, the team built their own distribution with Chef.
On Kubernetes? Time to look for a service mesh
Running their in-house Kubernetes distribution was painful at first, primarily due to instability in Microsoft’s networking infrastructure at the time. While that vastly improved over time, for the first few years Bink saw huge numbers of random TCP disconnects, UDP traffic going missing, and other unexpected faults.
In 2017, the engineering team began looking into service meshes, hoping one could solve, or at least mitigate, some of these issues. Around that time, Monzo engineers gave a KubeCon presentation on an outage they had recently experienced and how Linkerd fit in. “Not everyone is transparent about these things, and I really appreciated them sharing what happened so the community can learn from their failure — a big shout-out to the Monzo team for doing that!” said Swarbrick. “That’s when we began looking into Linkerd.” At the time, the Linkerd project was about to release a newer Kubernetes-native version called Conduit, which was later renamed Linkerd2.
Since the industry was leaning towards Envoy-based service meshes, Bink also briefly considered Istio. But they quickly realized Linkerd was very easy to implement and didn’t require writing any application code to deal with transient network failures. Because the added latency was so small, the extra hop in the stack didn’t really make a difference, and Linkerd also provided invaluable traceability capabilities. It seemed like the perfect fit for Bink’s use case.
"As soon as we started experimenting with Linkerd in our non-production clusters, network faults caused by the underpinning Azure instabilities dropped significantly. That gave us the confidence to add Linkerd to our production workloads where we saw similar results,"
— Mark Swarbrick, Head of Infrastructure at Bink
The Linkerd difference
Migrating their application to a cloud native platform was an easy decision, but some architectural components weren’t as performant or stable as they should have been. With Linkerd, Swarbrick’s team was able to implement connection and retry logic at the right level of the stack, providing the needed resilience and reliability. They could now run their existing software stack in the cloud without significant rework. Linkerd proved that placing this logic in the connection layer was the right approach: it allowed the team to focus on product innovation without worrying about network or connection instability. “That really helped reduce operational development costs and time to market,” stated Swarbrick.
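To make the trade-off concrete, here is an illustrative sketch (not Bink’s code) of the retry-with-backoff logic an application would otherwise have to embed to survive transient network faults — exactly the responsibility a mesh like Linkerd moves into the connection layer, where the proxy retries transparently. The exception and function names are invented for the example.

```python
import random
import time


class TransientNetworkError(Exception):
    """Stand-in for a dropped TCP connection or similar transient fault."""


def call_with_retries(request, max_attempts=3, base_delay=0.1):
    """Retry a request with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientNetworkError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the fault to the caller
            # Exponential backoff with jitter avoids synchronized retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

With the mesh proxy handling retries and connection management, none of this boilerplate lives in application code, which is what let the team keep their services focused on business logic.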
Linkerd improved things from a technology and business perspective.
"We had just begun conversations with Barclays and needed to prove we could scale to meet their needs. Linkerd gave us the confidence to adopt a scalable cloud-based infrastructure knowing it would be reliable — any network instability was now handled by Linkerd. This, in turn, allowed us to agree to a latency and success-rate-based SLA. Linkerd was the right place in the stack for us to monitor internal SLOs and track the performance of our software stack,"
— Mark Swarbrick
The engineering team didn’t want to extend an on-prem infrastructure or refactor their app to add retry logic. Linkerd let them achieve this quickly and gave them the metrics to track down platform bottlenecks in a cloud native environment.
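An SLO check of the kind described above can be sketched from mesh metrics. Linkerd exports per-service response counters to Prometheus with a success/failure classification; the sketch below fakes those counter samples as plain tuples rather than querying a live Prometheus server, and the function names and 99.9% target are assumptions for illustration.

```python
def success_rate(responses):
    """Compute the success rate from (classification, count) counter samples."""
    total = sum(count for _, count in responses)
    if total == 0:
        return None  # no traffic observed; nothing to measure
    successes = sum(count for cls, count in responses if cls == "success")
    return successes / total


def meets_slo(responses, target=0.999):
    """True when the observed success rate satisfies the target (or no traffic)."""
    rate = success_rate(responses)
    return rate is None or rate >= target


# Example: 999 successes and 1 failure over the window → 99.9%, SLO met.
window = [("success", 999), ("failure", 1)]
print(meets_slo(window))
```

Because the proxy observes every request, this measurement needs no instrumentation in the application itself, which is what made it a natural place for Bink to track an SLA based on latency and success rate.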
The cloud native effect
When considering the entire stack, cloud native technologies enabled Bink to build a cloud-agnostic platform that scales as needed while letting them keep a close eye on stability and performance. “We’ve tested our platform at load and can perform full disaster recovery in under 15 minutes, recover easily from transitory network issues, and are able to perform root cause analysis of problems quickly and efficiently,” explained Swarbrick.
Shaking up banking and loyalty programs in the UK
Bink had to rapidly grow and mature its infrastructure to meet banking security, stability, and throughput requirements. The cloud native stack allowed them to confidently support and monitor contractual SLA and internal SLO requirements. This positioned Bink to capitalize on the retail space and help retailers provide payment-linked loyalty in partnership with some of the biggest banks in the UK.