
Isovalent Enterprise Dashboards for Cilium: Operating at Scale

Dean Lewis
Isovalent Enterprise Dashboards - Cover Image

At Isovalent, we focus on removing the operational burden of running cloud native platforms at scale, drawing on our experience as the creators of Cilium, our enterprise-ready software offerings, and our stellar customer success and support experience.

In this blog post, we will dive into another benefit available to all Isovalent customers: Enterprise Dashboards for Cilium, designed in partnership with our customers to help you operate your cloud native platform at scale using Grafana.

We’ve previously covered how Isovalent Enterprise customers benefit from backports of features and fixes between Cilium versions, as well as the programs that make up our Enterprise support offering, with unique capabilities such as Customer Testing Environments (CuTE), maintained by Isovalent Customer Success, which mirror a configuration similar to a customer’s production environment.

Without context, data is meaningless

Out of the box, several per-subsystem dashboards are provided for Cilium. While these provide many data points for platform, network, and security engineers to comb through when managing their respective platform components, the lack of context creates a barrier for platform operators to understand how the data is relevant to their environments.

Ultimately, you need substantial prior knowledge of Cilium, networking, Kubernetes, and potentially even the Linux kernel to make sense of the data. Customers told us these dashboards alienated users who didn’t have a deep understanding of the platform upfront, and imposed further complexity when they tried to create their own meaningful dashboards. If you don’t understand the initial data provided, how can you build something you do understand?

Let’s take a look at one of the existing out-of-the-box per-subsystem Cilium Overview dashboards in our Grafana environment. The per-subsystem view focuses on the underlying components that make up Cilium rather than Cilium as the overall connectivity and security platform. This dashboard provides a wealth of performance information about the Cilium Agent’s health, including the volume of identities, ingress and egress traffic, packet drops, agent warning rates, and drop rates. If the latter two metrics increase, we must troubleshoot the platform. Marrying this against any reports of application issues may help us spot an underlying issue.

Cilium Grafana Dashboard - Per subsystem overview

These dashboards contain useful information and are part of a full observability toolset for your platform. They are for those who understand the platform deeply and want to use all the metrics Cilium offers.

However, to give platform teams a better experience managing their environments without requiring deep specialist knowledge from day one, we’ve worked closely with our customers to design a new set of enterprise dashboards that provide both data and context.

Dashboards, built with context to operate at scale

Moving from the subsystem data point view of the Cilium platform, we devised categories for the overall high-level health components that our customers need to understand and monitor before digging under the hood. 

Breaking this into seven modules within a dashboard makes it easier to consume, especially if you display the visuals in a NOC-type environment with monitors hung on the wall.

This approach regroups the data points provided by subsystem monitoring into more platform-component-focused categories. This change makes it easier to consume and spot underlying platform-level issues that can be easily triaged against reports from application owners and incident response systems such as PagerDuty. Furthermore, as you’ll see as we break down some of these focus areas, having context available alongside the data removes the need to be a master of the platform components to understand the data itself.

Installing these new Isovalent Enterprise Dashboards for Cilium is as simple as running a single helm install command, which can be found in our enterprise documentation. These Grafana dashboards will be installed automatically as part of the dashboard reconciliation process from Grafana. Alternatively, you can simply use the JSON files in disconnected environments. We’ve focused on this installation method to ensure a short time to value, meaning platform teams spend more time using the tooling provided rather than installing and configuring it.
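To illustrate the reconciliation mechanism, here is a sketch of how a Grafana deployment with the dashboard sidecar enabled (as in common kube-prometheus-stack setups) picks up dashboards from ConfigMaps. The ConfigMap name and file name below are illustrative examples, not the actual output of the enterprise helm chart:

```yaml
# Hypothetical example: a ConfigMap carrying one of the dashboard JSON files,
# labelled so the Grafana dashboard sidecar reconciles it automatically.
apiVersion: v1
kind: ConfigMap
metadata:
  name: isovalent-cilium-high-level-health   # example name, not the chart's
  namespace: monitoring
  labels:
    grafana_dashboard: "1"                   # default label the sidecar watches
data:
  high-level-health.json: |
    { ... dashboard JSON from the enterprise documentation ... }
```

In disconnected environments, the same JSON files can be loaded into Grafana directly through its import UI or provisioning directory.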

Cilium Enterprise Dashboard - High Level Health - Sections

The “High-Level Health” dashboard focuses on Cilium’s overall health within the monitored platform. This section of visuals collects metrics and statuses on Cilium components’ health indicators and Kubernetes metadata, such as pod state, pod restarts, etc.

The screenshot below is from my own Talos Linux environment running Cilium. As you can see, at the start of the monitoring period, I had some instability in my cluster (caused by me). Rather than showing pod names with errors, the graph clearly shows that the Operator was not running. This makes it clear and easy for the user to understand which component is at fault.

As with all data tiles on the dashboard, this tile has an “Informational” marker (ⓘ icon), which, when hovered over, provides a detailed explanation of what the tile shows, along with additional information, such as what to expect if the data indicates errors or outages. Clicking on the screenshot will open a new tab to inspect it in further detail.

Cilium Enterprise Dashboard - High Level Health - Cilium Health

All components in this environment have recovered from their errors and are running quite happily. For the remainder of this blog post, to replicate real-world situations, I will run several stress tests in the environment, causing warnings, errors, and failures in various components to show the various sections of the dashboards in action.

If you are using Cilium in your environment today and are considering performing stress tests on your platform, contact our specialists, who can help you plan, design, and validate your deliverables. 

Cilium Agent Errors

The first self-made error crashed the Cilium Agent by altering the container configuration to incompatible values. Below, we can see an outage: the red line in the graph increases as Cilium Agent pods stop running, while the green line for running agents decreases. To the right-hand side is the current visual aid, and the left-hand graph helps identify when the Agents became unavailable. We are starting off with a simple issue and a simple view; if we hover over the “informational” icon for this view, we can understand the following from the associated tooltip:

  • Cilium Agents that are not running will disrupt workload Pod scheduling, may affect policy implementation, and can lead to disruption in L7 proxy functionality (without HA mode).
Cilium Enterprise Dashboard - High Level Health - Cilium Health - Pod State
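If you want alerting to mirror this tile, a minimal sketch of a Prometheus rule is shown below. It assumes kube-state-metrics is installed and that the Cilium Agents run as a DaemonSet named cilium in kube-system (the default for helm installs); the threshold and severity label are illustrative:

```yaml
# Sketch of a Prometheus alerting rule mirroring the pod-state tile,
# assuming kube-state-metrics exposes DaemonSet status metrics.
groups:
  - name: cilium-agent-health
    rules:
      - alert: CiliumAgentsUnavailable
        expr: kube_daemonset_status_number_unavailable{daemonset="cilium", namespace="kube-system"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: >-
            One or more Cilium Agents are not running; workload Pod scheduling,
            policy implementation, and L7 proxy functionality may be disrupted.
```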

Now, let’s focus on CNI (Container Network Interface) Operations and running workloads in our environment. The CNI is the Kubernetes API specification for creating and configuring container networking. Cilium uses this specification for Kubernetes Pod scheduling. If you see errors here, they will likely indicate Pod scheduling issues.

CNI API Rate Limiting

For this part of the demo, I ran a script designed to create a high churn of Pods being added to and deleted from the monitored cluster, capturing the screenshot below. Overall, we see a big discrepancy in the latency of requests that allow Cilium to proceed with the reconciliation of CNI operations. This has pushed the overall max latency into the seconds range, and errors are reported as API rate limits are hit within the platform.

This prevents new workloads from being scheduled promptly: the longer it takes to attach and detach networking from containers, the longer a Pod’s status is held back, as its network interface is not yet provisioned.

Essentially, the slower these operations are, the slower your environment becomes from an end-user experience.
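As a sketch, the API rate limiter wait times this view surfaces come from the agent’s cilium_api_limiter_* metric family, so an alert along the following lines can catch the condition shown above. The threshold is illustrative, and you should check the exact label set against the Cilium metrics reference for your version:

```yaml
# Sketch: alert when the Cilium API rate limiter's maximum wait time climbs
# into the seconds range; label names should be verified against the
# Cilium metrics reference for your release.
groups:
  - name: cilium-cni-operations
    rules:
      - alert: CiliumAPILimiterHighLatency
        expr: max(cilium_api_limiter_wait_duration_seconds{value="max"}) by (api_call) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Requests for {{ $labels.api_call }} are waiting over 1s on the Cilium API rate limiter."
```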

Cilium Enterprise Dashboard - High Level Health - Cilium Health - Pod Scheduling

BPF Map Pressure

The final area I will cover in this blog post is BPF Map health. BPF Maps are used to share data between user space and the kernel. In the specific example below, we will focus on the Policy Map, which stores the policy entries created by platform engineers and hands them off to the BPF programs that implement the policy restrictions at the datapath level.

To create the errors below, I have configured the setting bpf-policy-map-max to the value “512.” This is an incredibly large reduction in the size of the BPF map, whose default setting is “16384” and which can be set to a maximum of “65536”.
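For reference, if you install Cilium via helm, this agent setting is exposed through the bpf.policyMapMax chart value, so the misconfiguration can be reproduced with a values fragment like the one below (shown only to trigger map pressure in a lab; do not do this in production):

```yaml
# Values fragment for the Cilium helm chart reproducing this (mis)configuration.
# 512 is far too small for real clusters; used here only to force map pressure.
bpf:
  policyMapMax: 512   # default 16384, maximum 65536
```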

In the Cilium Health section of the dashboard, we will see that the Cilium Agent’s error rate, based on the number of errors found in the agent log, starts to increase. This is our first warning that there are issues with our platform.
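A minimal sketch of an alert on this signal, built on the cilium_errors_warnings_total counter the agent exposes, might look like the following; the rate threshold is an illustrative assumption, not a recommended value:

```yaml
# Sketch: alert on an elevated Cilium Agent error log rate using the
# cilium_errors_warnings_total counter; the 0.1/s threshold is illustrative.
groups:
  - name: cilium-agent-errors
    rules:
      - alert: CiliumAgentErrorRateHigh
        expr: sum(rate(cilium_errors_warnings_total{level="error"}[5m])) by (pod) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cilium Agent {{ $labels.pod }} is logging errors at an elevated rate."
```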

Cilium Enterprise Dashboard - High Level Health - BPF Map - Cilium Agent Error Rate Current

Going to the “BPF Maps” section of the dashboard, we can quickly see that the error rate logged for Policy maps has increased dramatically, and on the right side, the value for the “Map Pressure” tile is now at 100%. This means that our Policy Map has run out of space, which will hinder Cilium’s functions and our workloads’ ability to communicate on the datapath. This dashboard tile uses the Cilium Agent metric cilium_bpf_map_pressure{map_name="cilium_policy_*"}.
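Because the tile is driven by that gauge, the same signal can feed an alert. The sketch below assumes the gauge reports a fill ratio where 1.0 means full (it is rendered as a percentage in the dashboard); the 90% threshold is illustrative:

```yaml
# Sketch: alert before a BPF map fills completely, using the same
# cilium_bpf_map_pressure gauge the dashboard tile queries.
groups:
  - name: cilium-bpf-maps
    rules:
      - alert: CiliumBPFMapPressureHigh
        expr: cilium_bpf_map_pressure > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "BPF map {{ $labels.map_name }} is over 90% full; new entries may fail with 'no space left on device'."
```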

Cilium Enterprise Dashboard - High Level Health - BPF Map Health

If we look at the Cilium Agent logs, we will see error messages similar to the example below. 

time="2024-06-14T15:25:16Z" level=error msg="Failed to add PolicyMap key" bpfMapKey="{40 16779348 1 0 0}" ciliumEndpointName=tenant-jobs/loader-7d9bbd7f4f-w8662 containerID=8e05c684a9 containerInterface=eth0 datapathPolicyRevision=0 desiredPolicyRevision=10 endpointID=849 error="update map cilium_policy_00849: update: no space left on device" identity=11560 ipv4=10.244.4.252 ipv6= k8sPodName=tenant-jobs/loader-7d9bbd7f4f-w8662 port=0 subsys=endpoint

To resolve this issue, I must increase the size of my Policy map. However, this may not fix the cause of the pressure itself; it may just alleviate the issue. Further troubleshooting of the BPF maps may be needed, and we’ll need to dive further into the subsystem and the Cilium Agent itself. Troubleshooting steps for BPF map pressure are documented here. Cilium users deploying at larger cluster sizes can reach out to the team at Isovalent, who can help design, plan, and execute the delivery of your platform.

Get Started with the Enterprise Dashboards today!

In this blog post, we looked at another enterprise addition for Cilium and how our customer-focused dashboards for Grafana help you operate your cloud native platform.

By combining Isovalent Enterprise offerings with our customer success programs, such as software backports and replica testing environments, to name but a few, we can help you increase operational confidence in your cloud native platforms, allowing you to focus more time on delivering business value.

These dashboards are available today for all existing Isovalent customers. Visit our enterprise documentation for instructions on how to get started; installation couldn’t be simpler than a single helm command.

Are you a platform owner considering switching to Cilium? Are you interested in seeing Isovalent Enterprise in action? Reach out to our team to set up a demo and discuss how our support programs can enhance the delivery of your cloud native platforms. 

Are you a cloud native practitioner who wants to get hands-on experience with the enterprise feature we offer for Cilium? Then head over to our free hands-on labs, covering Kubernetes networking and security using Cilium and Tetragon!

Dean Lewis, Senior Technical Marketing Engineer

