
Cilium Hubble Series (Part 3): Hubble and Grafana Better Together

Dean Lewis
Cilium Hubble with Grafana - Better Together - Grafana Hubble Data Source Plugin

In this part 3 of the series, we’ll dive deeper into the Cilium Hubble integration with Grafana, our new Hubble data source plugin, and the enhanced features integrated with Hubble Timescape. We’ll also look at some real-world use cases, such as working out whether DNS really is causing issues with our Kubernetes workloads, and how to spot potentially malicious activity in our containers by using Tetragon to understand late process execution events.

To recap part 2 of this series re-introducing Cilium Hubble to the cloud-native world: there, we dived into the enterprise features that bring production-ready observability to your platforms.

This includes features such as:

  • The Network Policy Editor, providing your network engineers the ability to easily improve the security posture of their cluster with Network Policies.
  • Role Based Access Control (RBAC) to easily delegate access to your application developers, so they have access to rich data for their troubleshooting needs. 
  • Hubble Timescape to easily and efficiently query and filter historical network flows to identify the root cause of issues with Kubernetes workloads. 

In the last section of the part 2 blog post, we touched on the Grafana integration, which brings the data from Cilium and Hubble into your existing tooling, helping your teams avoid the dreaded tool sprawl and fatigue.

Now let’s uncover this Cilium Hubble and Grafana integration in more detail!

Hubble data source plugin for Grafana

The Hubble data source plugin has now reached General Availability with the 1.0 release. We created this plugin using the Grafana plugin development tools, focused on integrating three underlying data stores: Hubble Timescape, Prometheus (storing Hubble networking metrics), and Grafana Tempo (used to store OpenTelemetry traces that can be correlated with different application golden signals).

By adopting this new Grafana data source to leverage the Hubble data, your teams can benefit from the rich data from Cilium and Hubble alongside their existing monitoring datasets without having to change tooling or implement complex methods to join datasets together.

The data source implements three query types (Service Map, Flows, and Process Ancestry), each supported by an associated dashboard.

  • Service Map – Showing the communication paths and HTTP protocol metrics between services
    • Dashboard: Hubble HTTP connectivity by Namespace
    • Provides a visual service map of the communication paths, as well as charts of HTTP metrics such as request rate, error rate, and request duration.
  • Flows – Retrieve the raw network flows stored in Hubble Timescape, which can also be linked to trace data with Tempo.
    • Dashboard: Hubble Flows by protocol
    • This dashboard shows all the flows for the workloads, broken down by monitored protocol: HTTP, Kafka, TCP, UDP, and ICMP.
  • Process Ancestry – Visual render of kernel-level events captured by Tetragon and stored in Hubble Timescape. This is to be used with the Process Ancestry panel plugin.
    • Dashboard: Tetragon Process Ancestry
    • Replicates the Process Ancestry view from the Hubble Enterprise UI, detailing the process execution events from the chosen workloads in the cluster.

Let’s dive into some of the use cases covered and see these dashboards and data sources in action.

How can I view my applications’ communication paths and metrics?

One area that really shows the power of having Cilium and Hubble in your Kubernetes cluster is the out-of-the-box visibility provided by the eBPF implementation of the networking stack at the kernel layer.

Without additional instrumentation or day 2 changes to your applications, you can open a world of Layer 7 metrics and insights. This configuration of the data source and associated dashboard is available in both Cilium Hubble OSS and Enterprise offerings.

In the below dashboard, we have the Service Map showing the communication paths between workloads, including traffic arriving via sources such as Cilium Ingress. From this dashboard view, you can easily see the health of each workload and the request rates between them. Clicking on a workload lets you deep dive into its metrics: request rate, P90 request duration, average request duration, and failed request percentage.

Grafana - Hubble - HTTP Connectivity By Namespace - Service Map

Continuing to scroll down this dashboard, we reach the overview graphs of HTTP-specific metrics for all the workloads in the chosen namespace. Each of these graphs is interactive in Grafana; you can drag across the timeline to focus on areas of the plotted metrics.

Grafana - Hubble - HTTP Connectivity By Namespace - HTTP Request Rate and HTTP failed requests

The final part of the dashboard presents the remaining HTTP metrics, which help us understand how responsively our workloads handle the requests they receive.

Plotted in these graphs are exemplars taken from the tracing information pulled from the application and Grafana Tempo. This allows application teams to marry the metrics captured from the HTTP protocol with the traces emitted by the application itself. For example, if we see that request duration has increased between two services, viewing the trace may show several retries between them, which could indicate a resource load issue.

Grafana - Hubble - HTTP Connectivity By Namespace - Median HTTP request duration - 90th Percentile http request duration

I recommend using this “HTTP Connectivity” dashboard alongside the Cilium installation-provided “Hubble L7 HTTP metrics By Workload” dashboard. The latter provides a further breakdown of HTTP-related metrics by source and destination, allowing you to dive deeper into your workloads. Again, this dashboard is available in both OSS and Enterprise configurations via a variation of the Helm values below; some fine-tuning of which Hubble metrics you collect from your environment may be needed.

Below is an example of the Helm values used to configure Hubble metrics and the dashboards (which can be loaded by the Grafana dashboards sidecar).

hubble:
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query;ignoreAAAA
      - drop:sourceContext=identity;destinationContext=identity
      - tcp
      - flow
      - icmp
      # Enable additional labels for L7 flows
      - 'httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_ip,destination_namespace,destination_workload,traffic_direction;sourceContext=workload-name|reserved-identity;destinationContext=workload-name|reserved-identity'
    # Ship the Cilium-provided Grafana dashboards as ConfigMaps for the dashboards sidecar
    dashboards:
      enabled: true
      namespace: monitoring
      annotations:
        grafana_folder: "Hubble"

Now let’s look at how we enrich this data further by combining the Hubble Metrics with application traces, to provide a complete deep dive into the application’s inner workings and communications.

Hubble support for tracing-enabled applications and Grafana Tempo integration

Troubleshooting a distributed system is complex. To be able to drill down and identify the problem with the right level of detail, you might need distributed tracing. Today, we know from experience that many application developers already instrument their apps with tracing and send trace spans to their preferred tracing backend such as Grafana Tempo. Instrumentation can also be achieved automatically, for example using Grafana Beyla.

In Cilium 1.13, Hubble’s Layer 7 HTTP visibility feature was enhanced further to automatically extract the existing, app-specific OpenTelemetry TraceContext headers from Layer 7 HTTP requests into a new field in Hubble flows. TraceContext is a specification that is now widely adopted and supported by Datadog and other observability platforms. This allows distributed traces to be correlated with detailed network-level events. If traces are stored in Grafana Tempo, the Hubble data source plugin will automatically link Layer 7 flows to traces.

Additionally, the trace ID is included in Hubble metrics as OpenMetrics exemplars. This effectively links Hubble metrics to distributed traces, which enables engineers to quickly investigate problems. They can, for example, use Hubble metrics to define an alert on high HTTP latency, and when it fires, use exemplars to jump to a distributed trace, which highlights details of problematic requests.
To learn more, watch this eCHO show recording or head over to the GitHub repository, Cilium Grafana Observability Demo, to try it yourself with Cilium OSS and Grafana.
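
For the exemplar-to-trace jump to work in Grafana, the Prometheus data source needs to know which Tempo data source a trace ID points at. Below is a minimal sketch of a Grafana data source provisioning file that wires this up; the URLs, UIDs, and the exemplar label name (trace_id here) are assumptions and should be adjusted to match your environment.

# Hypothetical Grafana provisioning file, e.g. provisioning/datasources/hubble-demo.yaml
apiVersion: 1
datasources:
  # Prometheus holds the Hubble metrics scraped from the hubble-metrics endpoints
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus.monitoring.svc:9090
    jsonData:
      # Turn the exemplar trace ID label into a clickable link to Tempo
      exemplarTraceIdDestinations:
        - name: trace_id          # assumed exemplar label name; check your Hubble metrics
          datasourceUid: tempo
  # Tempo stores the application traces referenced by the exemplars
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo.monitoring.svc:3100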

In the below image we can see a trace originating from a service called Crawler, which connects to the CoreAPI service via two other chained services, Loader and Resumes. In the trace spans, we can see that there are a number of retries for this connection, indicating that the CoreAPI service has a fault.

How do I dive into all the network flows in Grafana?

The quickest way to look at all the network flows available from Hubble Timescape in Grafana is via the Explore interface in Grafana. Simply select Hubble as the data source and set the query type to Flows.

Below you can see all the available table properties for each flow:

  • Time
  • Verdict
  • Source
  • Destination
  • Trace ID
  • Workloads
  • Protocol
  • IP
  • IP Port
  • L7 Protocol
  • L7 Type
  • HTTP
    • Version
    • Method
    • URL
    • Latency
    • Status

There are a few other properties pertinent to particular protocols, such as Kafka, that are not shown in the screenshot but will be captured for those types of traffic (for example, Kafka version, API key, topic, and correlation ID).

Out of the box, the Hubble data source provides the “Hubble / Flows by protocol” dashboard, which gives a curated view across each protocol that can then be broken down with filters.

This is a great way to really dive into the raw data, but typically you would start to refine these views and queries for your use-cases and add in other visualisations.

Let’s look at one of those use-case areas further.

How would I troubleshoot a specific protocol like DNS?

We’ve all been there: something just isn’t working between our applications and services, and as the joke goes… it’s always DNS!

But how can you prove it really is DNS that is at fault?

Thanks to Hubble Network flows, we can identify DNS protocol traffic, and inspect the requests. Out of the box with Cilium, we have the “Hubble / DNS Overview” Dashboard, which breaks down visualisations of DNS requests per namespace.

Most importantly, we can see Missing DNS responses charted, and DNS Errors by reason. This is a great high-level overview to start to answer the question “Is it DNS?”.

From the view below, we can see that yes, DNS is causing some issues in my namespace. But now we need to dig a little deeper.

Uncovering the full DNS flows from earlier, we can see the DNS query names in use, and for some reason lookups going to “github.com” have been appended with “fritz.box”. We know this DNS name is incorrect, so now we can dive further into the application configuration to see why this is happening.

For the curious amongst you reading this blog post, this particular issue is caused by my home network router, which has an interesting bug in some of the internal loopback lookups and responses it provides! I didn’t know about the issue until I saw it happening in the Hubble flows.
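
If you want to chart DNS behaviour per namespace like this, the Hubble dns metric accepts the same context options as the httpV2 metric shown earlier. Below is a hedged sketch of the relevant Helm values, assuming the chart structure from the earlier example; adjust the labels to whatever breakdown you need.

hubble:
  metrics:
    enabled:
      # Record DNS query metrics, skip AAAA lookups, and label them
      # with the source and destination namespace for per-namespace charts
      - dns:query;ignoreAAAA;labelsContext=source_namespace,destination_namespace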

How do I view the process executions in my containers?

The final area to focus on is security. Using Hubble Timescape, part of Isovalent Enterprise for Cilium, to store the process events from Tetragon gives you a very powerful data store of process-level events from your platform to comb through.
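
As a quick reminder of where these events come from: Tetragon captures process execution events out of the box, and additional kernel hooks can be added with a TracingPolicy. Below is a minimal sketch, based on the upstream Tetragon examples, of a policy that also records TCP connect calls; treat it as illustrative rather than a tuned production policy.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: monitor-tcp-connect
spec:
  kprobes:
    # Hook the kernel tcp_connect function so outbound TCP connections
    # show up alongside the process execution events
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"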

Bringing those events directly into Grafana for all your application teams to consume seems like a no-brainer. However, just having the JSON events parsed into tables isn’t always enough, and that’s why we’ve brought the process ancestry visualisation to Grafana using the Hubble Process Ancestry plugin.

Process Ancestry ties together all the process events into a usable timeline against your Kubernetes workload. This allows you to spot late execution processes, which can be a sign of either misconfigurations at an application level, or worse, an attacker executing commands via an existing workload.

In the provided dashboard “Pod Process Ancestry”, users not only get the full visualisation that ties together the various process events, but also a full table output of the parsed JSON for each event captured for the Kubernetes workload.

Below is an example of a pod running in the environment, with Tetragon recording the process event information and TCP/HTTP requests and parsing them.

Let’s dive into an example where our environment undergoes changes and suspicious behaviour starts to take place, and see how this visualisation helps spot that activity!

Spotting suspicious and malicious behaviour in our Kubernetes workloads

Observing the process events during a container’s startup in the environment is valuable. However, in cyber intrusions, attackers often seek persistent access to platforms. This persistence, often established through a reverse shell, grants hackers remote access even after breach remediation attempts.

When we start to map out an attack, we know that new processes will execute in the environment; this is known as a late-running execution or late process. Put simply, it is a process that runs some time after all the normal startup processes have executed.

In the below screenshot, you can see that I have spun up a new instance of my workload; it is the same workload as in the previous screenshot. The nc command creates network connections; here, we’ve used it to set up a reverse shell attack on the container.

In the Pod Process Ancestry tree, we can now see that these “late running executions” are highlighted in yellow, making it easy to spot changes in our workload. Using Tetragon, we have also parsed the executed commands and displayed the FQDN and public IP addresses that the container is now connecting to.

This is also backed by the full process event information in the table below the visualisation. 

Straight away, we can act on this information with Isovalent Enterprise for Cilium by using the Network Policy Editor feature in Hubble to create a new Cilium network policy that blocks any workload from contacting this malicious address, and then continue to investigate and remediate the issue.
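
Whether you build it in the Network Policy Editor or write it by hand, the resulting policy might look something like the sketch below; the namespace and IP address are hypothetical placeholders for whatever your flow data shows.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: deny-malicious-egress
  namespace: tenant-jobs          # hypothetical namespace of the compromised workload
spec:
  endpointSelector: {}            # apply to every workload in the namespace
  egressDeny:
    # Deny traffic towards the address observed in the reverse shell flows
    - toCIDR:
        - 203.0.113.42/32         # hypothetical malicious IP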

How do I install the Hubble plugins for Grafana and get started?

To get started, you can enable Hubble metrics out of the box and have Prometheus scrape them into the data store.
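
If you run the Prometheus Operator, the Cilium Helm chart can also create a ServiceMonitor for the Hubble metrics endpoint for you. Below is a minimal, hedged sketch of the relevant values, assuming the same chart structure as the earlier example:

hubble:
  metrics:
    # A starter set of Hubble metrics, including the L7 HTTP metrics with exemplars
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - icmp
      - httpV2:exemplars=true
    enableOpenMetrics: true        # exemplars are exposed via the OpenMetrics format
    serviceMonitor:
      enabled: true                # requires the Prometheus Operator CRDs in the cluster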

To see the full value of hooking up Hubble and Grafana together, install both the Hubble Plugin and the Hubble Process Ancestry plugin. In the below screenshot, as my Grafana instance has internet access, I can do this straight from the Plugins view.

Grafana - Hubble Plugins

Once the Hubble plugin is installed, it activates the Hubble data source configuration, which can be used to connect to both Hubble Timescape (for the full data from your workloads, such as network flows with protocol metrics and Tetragon security observability events) and Tempo (for the application traces).

Grafana - Hubble Plugin Data Source Configuration

Selecting the Dashboards tab allows you to install the associated dashboards for the Hubble plugin that we’ve covered in this blog post.

Note that the “Tetragon / Pod Process Ancestry” dashboard needs the Hubble Process Ancestry plugin to be installed; however, no further configuration is needed once the connection to Hubble Timescape is in place.

Where can I learn more?

If you want to dive a bit deeper into the Grafana Hubble plugin, then I recommend my colleague Anna’s guest blog post “How to monitor Kubernetes network and security events with Hubble and Grafana” over on the Grafana website. You can also get hands-on with the Cilium and Hubble integrations with Grafana by starting the Golden Signals with Hubble and Grafana lab.

For those of you who are looking at how Cilium features fit together into an existing platform and integrate with tooling like Grafana, I recommend our new Discovery: Platform Engineer Lab, launched at KubeCon.

Discovery: Platform Engineer

In this short hands-on discovery lab designed for Platform and DevOps Engineers, you will learn about several Cilium features, covering built-in Ingress and Gateway API support, and observability and performance monitoring with Grafana.

Start Lab
