Kubernetes can handle dying nodes and failing pods, but it can’t help when your application returns errors, slows down, or is hit by a sudden spike in traffic. In this talk, Nic Cope from Planet Labs takes a deep dive into how Linkerd, the CNCF’s service mesh, and Prometheus provide cluster-wide visibility and analytics for Planet Labs’ global network of micro-satellites.
With these analytics, Nic’s team is able to monitor latency and service failures for all events. All of Planet Labs’ customer-facing latency or service failure warnings are set up to send a pager notification and are triaged according to urgency. If the warning is for a single service and doesn’t immediately affect a customer, the team waits until the service owner is back on shift. If the warning covers multiple failed or slow services, Nic’s team is responsible for determining why the services are slow or have failed. In both scenarios, Nic shares how Linkerd and Prometheus make it easy to quickly rectify the service errors.
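The triage rules described above can be sketched in a few lines of Python. This is only an illustration of the decision logic as summarized here, not Planet Labs’ actual tooling: the `Alert` fields, the `Action` names, and the `triage` function are all hypothetical, and the order in which the rules are checked is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    # Hypothetical outcome names, chosen for this sketch only.
    PAGE_NOW = "page the on-call engineer"
    WAIT_FOR_OWNER = "wait until the service owner is back on shift"
    PLATFORM_TEAM_INVESTIGATES = "platform team investigates"


@dataclass
class Alert:
    # Hypothetical fields: which services are slow or failing,
    # and whether a customer is affected right now.
    services: List[str]
    customer_facing: bool


def triage(alert: Alert) -> Action:
    """Pick a response for a latency or failure warning."""
    # Assumption: a warning spanning several services is checked first
    # and goes to the platform team, who work out the common cause.
    if len(alert.services) > 1:
        return Action.PLATFORM_TEAM_INVESTIGATES
    # A single service that affects customers sends a pager notification.
    if alert.customer_facing:
        return Action.PAGE_NOW
    # A single service with no immediate customer impact can wait.
    return Action.WAIT_FOR_OWNER


print(triage(Alert(services=["tile-api"], customer_facing=False)).value)
```

In a real cluster this kind of routing would more likely live in alerting configuration than in application code; the function is just a compact way to read the policy.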
Nic talks about Linkerd and Prometheus around the 20-minute mark, but the entire talk is full of useful perspective on systematically configuring monitoring and alerting for your Kubernetes applications.