What are the 4 Golden Signals for Monitoring Kubernetes?

Oct 04, 2022 | Isovalent

Golden Signals are the meaningful data insights that we use for monitoring and observability of a system. They are the signal amid the noise, guiding us toward what’s affecting the health of the environment. Metrics and signals can come from anywhere between the application layer and the physical infrastructure.

Why does Kubernetes present challenges with finding the right signals? Kubernetes gives us a common level of abstraction so that developers can just deploy applications without needing to know everything about the underlying infrastructure. The same wondrous abstraction makes it complex and noisy to monitor what’s actually happening in the Kubernetes environment that is affecting our application.

Signal vs. Noise

There is an incredible amount of data available to gather across each layer in the application stack in Kubernetes. Even a simple application can generate a surprising amount of very diverse data. Monitoring metrics can be from the application, network, nodes, hardware, virtualization layers, service mesh, and many other sources.

All you really want to know is what’s happening when your application is failing or degrading. The problem is that you have to gather a lot of data, all the time, and figure out what key metrics matter at any point in time.

As we deploy more containerized and microservices applications, we are seeing that traditional monitoring tools that rely on IP addresses and more static resource allocations do not work in dynamic and ephemeral Kubernetes environments.

There are many common elements in any Kubernetes environment that are consistent across platforms (container manager, API server, Kubelet, etc.). Then there are additional infrastructure elements depending on the hosting platform (e.g. Kubernetes on public cloud, Kubernetes on bare metal, Kubernetes on VMs atop a hypervisor). Add multi-cloud Kubernetes and now you have to monitor distributed clusters running across different cloud platforms that have very different metrics.

Kubernetes platform operators and SREs also have very different consumers and users of their system. We have to answer questions from diverse teams that can include application developers, network operators, cloud platform operators, who all have different ways to view health and performance.

Every Kubernetes Ops team needs to monitor the most important signals, the aptly named “golden signals,” in any Kubernetes environment, regardless of the deployment pattern.

Golden Signals for Kubernetes Ops

The 4 general categories of signals that matter to any system, and especially to a Kubernetes environment, are latency, throughput, errors, and saturation. Each has its own definition of health, with its own metrics and analytics that define the thresholds. These categories originate from the popular Google SRE book, though we will go a little further as we explore how they transfer to Kubernetes Ops in particular.

1. Latency: most often represented as response time in milliseconds (ms) at the application layer. Application response time is affected by latency across all of the core system resources including network, storage, processor (CPU), and memory. Latency at the application layer also needs to be correlated to latency and resource usage that may be happening internally within the application processes, between pods/services, across the network/mesh etc.

2. Throughput: sometimes referred to as traffic, throughput is the volume and types of requests that are being sent and received by services and applications from within and from outside a Kubernetes environment. It describes the demand on the system and is commonly represented as the number of requests per second, for example web requests or API calls per second. It should be measured across all layers to identify requests to and from services, and also which I/O is going further down to the node.

3. Errors: the number of requests (traffic) which are failing, often represented either in absolute numbers or as the percentage of requests with errors versus the total number of requests. Errors may stem from application issues or misconfiguration, and some occur because a policy denies the request. Policy-driven errors may indicate accidental misconfiguration or potentially a malicious process.

4. Saturation:  the overall utilization of resources including CPU (capacity, quotas, throttling), memory (capacity, allocation), storage (capacity, allocation, and I/O throughput), and network. Some resources saturate linearly (e.g., storage capacity) while others (memory, CPU, and network) fluctuate much more with the ephemeral nature of containerized applications. Network saturation is a great example of the complexity of monitoring Kubernetes because there is node networking, service-to-service network throughput, and once a service mesh is in place, there are more paths, and potentially more bottlenecks that can be saturated.
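
As a rough illustration of the four definitions above, the sketch below computes one number per signal for a single service over one time window. The Request record, the function name, the rule that a 5xx status counts as an error, and the use of CPU usage against the CPU limit as a stand-in for saturation are all assumptions made for the example, not part of any particular tool.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed request (hypothetical record shape, for illustration)."""
    duration_ms: float
    status: int


def golden_signals(requests, window_s, cpu_used_cores, cpu_limit_cores):
    """Summarize the four golden signals for one service over one time window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: only HTTP 5xx responses count as errors.
    failed = sum(1 for r in requests if r.status >= 500)
    if len(durations) > 1:
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        p95 = quantiles(durations, n=100, method="inclusive")[94]
    else:
        p95 = durations[0]
    return {
        "latency_p95_ms": p95,
        "throughput_rps": len(requests) / window_s,
        "error_ratio": failed / len(requests),
        # Assumption: CPU usage relative to the limit approximates saturation.
        "cpu_saturation": cpu_used_cores / cpu_limit_cores,
    }
```

A percentile is used for latency rather than a mean because a few very slow requests can hide behind a healthy-looking average.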

Now the question is how we measure health with the golden signals. This leads us to the two most widely referenced monitoring methodologies in the industry.

Seeing RED (Rate, Errors, Duration)

The RED method takes a top-down approach at the application and service level. It does not consider the underlying infrastructure in the measurement, but it gives a strong app-centric view of health.

What’s the USE? (Utilization, Saturation, Errors)

The USE method takes a bottom-up, infrastructure-focused approach to correlate resource issues outside the application itself (e.g., active vs. allocated memory, CPU utilization and throttling, node utilization and congestion).
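
A minimal sketch of a USE-style pass over node resources might look like the following. The ResourceSample shape, the field names, and the 80% utilization threshold are hypothetical; in practice the inputs would come from whatever node and container metrics you already collect.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One resource on one node (hypothetical shape, for illustration)."""
    name: str
    used: float       # amount currently in use, in the resource's own unit
    capacity: float   # total amount available, same unit
    waiting: float    # work that could not be served yet (queue length, throttled periods)
    errors: int       # error events reported for the resource


def use_findings(samples, max_utilization=0.8):
    """Return (resource, reason) pairs worth a closer look, checked in USE order."""
    findings = []
    for s in samples:
        utilization = s.used / s.capacity
        if utilization > max_utilization:
            findings.append((s.name, f"utilization {utilization:.0%}"))
        if s.waiting > 0:
            findings.append((s.name, f"saturation: {s.waiting:g} waiting"))
        if s.errors > 0:
            findings.append((s.name, f"errors: {s.errors}"))
    return findings
```

Utilization and saturation are kept as separate checks on purpose: a resource can sit below its utilization threshold on average and still have work queuing during short bursts.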

Both the RED and USE methods are common across different application hosting environments. The split also follows team lines: Ops teams tend to work from the bottom up (USE), whereas application development and AppOps teams work from the application down (RED).

APM (Application Performance Monitoring) is firmly in the RED category, where health is defined by application KPIs. ITOM (IT Operations Management) and IT Monitoring tools are usually driven by resource utilization as a metric of health. 

The SRE and Platform Ops teams now cross these categories, and Kubernetes opens up the door to a more blended approach. This is where we begin to look at the move towards observability.

Golden Signals and Kubernetes observability

Observability differs from traditional monitoring because it is not just collecting and visualizing metrics. Observability is about inferring the state of the environment based on the output (golden signals); it is about using the same metrics with a different methodology that gives us different outcomes. 

Observability is about being able to ask questions of the system rather than just piling up monitoring data to attempt to correlate it. What are the questions that we can ask the system in order to understand the current state? For example,

  • What is causing slowness for application X?
  • Which services are affecting errors in my front-end application?
  • What is causing congestion on my cluster nodes?
  • Why is my application unable to contact my messaging service in both clouds?

Moving from just graphing and poring over log and trace data to actively asking questions is the evolution towards observability. It relies on the foundation of deep analytics and an understanding of the relationship between them. Monitoring alone cannot deliver the insights needed for Day 2 operations.

Why is Observability in Kubernetes a Multi-Dimensional Challenge?

Our 4 golden signals for observing Kubernetes are especially interesting (and challenging) because each has its own measurement of health. An aggregate combination of golden signals defines the overall system health. Both visually and mathematically, this is a multi-dimensional challenge.  
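
To make the idea of per-signal health rolling up into overall health concrete, here is a small sketch. The threshold values are invented for illustration, and taking the worst individual status as the overall status is only one possible aggregation rule. Throughput is left out of the thresholds on the assumption that high demand is context rather than a failure in itself.

```python
# Hypothetical thresholds; real values depend on the service and its objectives.
THRESHOLDS = {
    "latency_p95_ms": (250.0, 1000.0),   # (warning, critical)
    "error_ratio": (0.01, 0.05),
    "cpu_saturation": (0.80, 0.95),
}

LEVELS = ["ok", "warning", "critical"]   # ordered from best to worst


def signal_status(value, warning, critical):
    """Classify one signal against its own warning and critical limits."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def overall_health(observed, thresholds=THRESHOLDS):
    """Return (overall status, per-signal status); overall is the worst signal."""
    per_signal = {
        name: signal_status(observed[name], *limits)
        for name, limits in thresholds.items()
    }
    worst = max(per_signal.values(), key=LEVELS.index)
    return worst, per_signal
```

Returning the per-signal breakdown alongside the overall status matters here, because the same "critical" result can come from very different combinations of signals.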

How often do SRE and container Ops teams get asked “what’s going on with Application X?” without having a specific application monitoring trigger or alert? Or the question arrives while certain alerts are firing, but none of them is the single reason the application is negatively affected. There could be a combination of latency, utilization, and errors that has to be correlated to find the root cause.

We also have to account for the 4 golden signals across multiple clusters, and possibly multiple data centers or regions. Some applications may even be deployed to Kubernetes spanning multiple cloud providers, which vastly increases the complexity of consistent monitoring and observability across platforms. 

The eBPF advantage for Observability in Kubernetes

Some of what we have described so far is available through out-of-the-box metrics, using Prometheus and other open-source monitoring and visualization tools to graph latency, traffic, errors, and saturation. Legacy monitoring tools, however, are unable to get the deep insights needed for Kubernetes observability. At best, we see many agent-based tools that have a massive impact on performance and increase operational complexity.

Enter eBPF, which provides a low-overhead, programmable, and secure method to gather instrumentation, including all 4 of our golden signals plus deep, process-level visibility. This means built-in access to the metrics needed for monitoring and observability.

We are focusing on the 4 golden signals for now, and we will write a separate piece on the simplicity and security observability benefits with eBPF. 

What can you do with observability and golden signals in Kubernetes?

Once we observe the golden signals, we can correlate events, metrics, traces, and application data to specific issues. For example, the 4 golden signals can collectively help us conclude that 2 out of 4 services are generating HTTP 500 errors and the front-end application is timing out intermittently.

If you dig into the utilization data, you find that those 2 failing services are running on a specific Kubernetes node. Next, you see that the node has extremely spiky utilization, with memory peaking and causing OOM (out of memory) errors and OOMKilled events.

Sep 15 21:07:36 ubuntu kernel: frontend invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=-1, oom_score_adj=0
Sep 15 21:07:36 ubuntu kernel: CPU: 0 PID: 2315 Comm: frontend Not tainted 5.13.0-1019-gcp #23-Ubuntu
Sep 15 21:07:36 ubuntu kernel: Workqueue: events moom_callback
Sep 15 21:07:36 ubuntu kernel: Call Trace:
Sep 15 21:07:36 ubuntu kernel:
Sep 15 21:07:36 ubuntu kernel: show_stack+0x12/0x12
Sep 15 21:07:36 ubuntu kernel: dump_stack+0x1a/0x7b
Sep 15 21:07:36 ubuntu kernel: dump_header+0x34/0x345
Sep 15 21:07:36 ubuntu kernel: oom_kill_process.cold+0xa/0x10
Sep 15 21:07:36 ubuntu kernel: out_of_memory+0x12/0x123
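
If you want to pull such events out of a larger log, a small parser can help. The sketch below assumes the syslog-style line format of the excerpt above and a process name without spaces. The regular expression and the function name are illustrative, and other kernels or log shippers may format these lines differently.

```python
import re

# Matches the "invoked oom-killer" line format shown in the log excerpt above.
OOM_LINE = re.compile(
    r"^(?P<timestamp>\w{3} +\d+ [\d:]+) (?P<host>\S+) kernel: "
    r"(?P<process>\S+) invoked oom-killer:"
)


def oom_invocations(log_lines):
    """Return (timestamp, host, process) for every oom-killer invocation found."""
    hits = []
    for line in log_lines:
        match = OOM_LINE.match(line)
        if match:
            hits.append((match["timestamp"], match["host"], match["process"]))
    return hits
```

Applied to the excerpt above, this would report the frontend process on host ubuntu, which can then be lined up against the time window in which the failing services returned HTTP 500 errors.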

Another, more common example is trying to trace the end-to-end network path from one service to another across multiple nodes and even across clusters. It’s incredibly complex doing flow tracing with iptables and Netfilter to try to understand how packets traverse the environment. 

You can use Cilium and Hubble, which gain deep network insight through eBPF, to dig into a flow-related application issue by observing recent flow events. In this case, we have a pod named tiefighter that we are troubleshooting, so we pull the flow logs to see where communication patterns and issues could be occurring.

$ kubectl exec -n kube-system cilium-77lk6 -- hubble observe --since 3m --pod default/tiefighter
May 4 12:47:08.811: default/tiefighter:53875 -> kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
May 4 12:47:08.811: default/tiefighter:53875 -> kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
May 4 12:47:08.811: default/tiefighter:53875 <- kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
May 4 12:47:08.811: default/tiefighter:53875 <- kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
May 4 12:47:08.811: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: SYN)
May 4 12:47:08.812: default/tiefighter:50214 <- default/deathstar-c74d84667-cx5kp:80 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK)
May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK, PSH)
May 4 12:47:08.812: default/tiefighter:50214 <- default/deathstar-c74d84667-cx5kp:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK, FIN)
May 4 12:47:08.812: default/tiefighter:50214 <- default/deathstar-c74d84667-cx5kp:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK)
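
When there are more flow events than you can read by eye, a short script can summarize them. The sketch below parses only the text format shown in the output above and counts events per remote endpoint and verdict. The regular expression and function name are assumptions for illustration, the text format may differ between versions, and a structured output format would be sturdier to parse where one is available.

```python
import re
from collections import Counter

# Matches flow lines in the text format shown in the output above.
FLOW_LINE = re.compile(
    r"^(?P<timestamp>\w{3} +\d+ [\d:.]+): "
    r"(?P<left>\S+) (?P<direction><-|->|<>) (?P<right>\S+) "
    r"(?P<observation>\S+) (?P<verdict>[A-Z]+) \((?P<detail>.*)\)$"
)


def verdicts_by_peer(lines):
    """Count flow events per (remote endpoint, verdict), skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        match = FLOW_LINE.match(line.strip())
        if match is None:
            continue  # e.g. the command line itself or blank lines
        # Drop the trailing ":port" so events group by endpoint.
        peer, _, _port = match["right"].rpartition(":")
        counts[(peer, match["verdict"])] += 1
    return counts
```

In the output above every event carries the FORWARDED verdict, so a summary like this mostly confirms that DNS lookups and the HTTP connection to deathstar went through. Any other verdict in the counts would be the place to look first.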

You combine this knowledge with your application state information to narrow down whether this is a problem with network access or flow policy.

These are quick examples of where the 4 golden signals come together to surface the real issue in troubleshooting. 

Observability and Kubernetes Beyond Troubleshooting 

One of the most talked about capabilities of running applications on Kubernetes is the ability to build, deploy, and monitor for:

  • Security:  ensuring network and security visibility and integrity throughout the end-to-end path of one or more applications.
  • Performance:  measuring and monitoring application KPIs and mapping to latency and resource utilization.
  • Availability:  tracking issues in the environment causing reduced service and application availability.
  • Resiliency:  maintaining multiple, redundant resources and network paths to ensure available capacity and throughput in the event of partial failures or saturation.

Effective observability can be part of much more than just Day 2 operational troubleshooting. You can use golden signals and insights during design and build of the application because you can see things like utilization and flow patterns. This can influence building network policies by knowing what services need to communicate and why. It also lets you see and understand other potential failure domains like network availability and resiliency. 
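
As one small example of feeding observed flows back into design, the sketch below groups flow records by destination so that you can see which sources reach each service, and on which port and protocol. The tuple layout and function name are hypothetical, and the result is meant as input for a human review of network policies, not as a generated policy.

```python
from collections import defaultdict


def observed_peers(flows):
    """Group flows into destination -> {(source, port, protocol)}.

    Each flow is a (source, destination, port, protocol) tuple; this layout
    is an assumption for the example, not a format produced by any tool.
    """
    by_destination = defaultdict(set)
    for source, destination, port, protocol in flows:
        by_destination[destination].add((source, port, protocol))
    return dict(by_destination)
```

Using a set per destination collapses repeated flows, so the result stays small even when the same pair of services talks constantly.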

Conclusion

The 4 golden signals for monitoring Kubernetes – latency, traffic, errors, and saturation – give us broad coverage of important metrics from which we can derive the state of the environment, including health and utilization. Using eBPF for observability gives deep insights without the resource overhead and operational complexity of agent-based, traditional, legacy monitoring tools.

Kubernetes observability requires insights to come both from the top-down application level and bottom-up infrastructure. This also maps to the different teams who will be application-focused (developers, application ops) and infrastructure-focused (cloud ops, container ops, platform engineering, SRE).

The 4 golden signals for monitoring and observability in Kubernetes will be the core of consistent Day 2 operations and can drive effective design and implementation processes so that applications are built for secure, resilient operations from prototype to production.  If you want to learn more about Kubernetes monitoring and golden signals, schedule a demo with our engineers.

Author: Roland Wolters, Technical Marketing Manager, Isovalent