At Isovalent, we focus on removing the operational burden of running your cloud native platforms at scale, drawing on our experience as the creators of Cilium, our enterprise-ready software offerings, and our stellar customer success and support experience.
In this blog post, we will dive into another benefit available to all Isovalent customers: Enterprise Dashboards for Cilium, designed in partnership with our customers to help you operate your cloud native platform at scale using Grafana.
We’ve previously covered how Isovalent Enterprise customers benefit from backports of features and fixes across Cilium versions, as well as the programs that make up our Enterprise support offering itself, with unique capabilities such as Customer Testing Environments (CuTE): environments maintained by Isovalent Customer Success that mirror a configuration similar to the customer’s production environment.
Without context, data is meaningless
Out of the box, several per-subsystem dashboards are provided for Cilium. While these provide many data points for platform, network, and security engineers to comb through when managing their respective platform components, the lack of context creates a barrier for platform operators to understand how the data is relevant to their environments.
Ultimately, you need significant prior knowledge of Cilium, networking, Kubernetes, and potentially even the Linux kernel to make sense of the data. When we spoke to customers, they told us these dashboards alienated users who didn’t already have a deep understanding of the platform, and added complexity when they tried to build their own meaningful dashboards. If you don’t understand the initial data provided, how can you build something you do understand?
Let’s take a look at one of the existing out-of-the-box per-subsystem Cilium Overview dashboards in our Grafana environment. The per-subsystem view focuses on the underlying components that make up Cilium rather than on Cilium as the overall connectivity and security platform. This dashboard provides a wealth of performance information about the Cilium Agent’s health, including the volume of identities, ingress and egress traffic, packet drops, agent warning rates, and drop rates. If the latter two metrics increase, the platform needs troubleshooting, and marrying this against any reports of application issues may help us spot an underlying problem.
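For readers who want to see the raw signals behind panels like these, the sketch below shows the kind of Cilium agent metrics involved (packet drops and agent warnings) queried straight from Prometheus. It is illustrative only: the exact queries in the shipped dashboards may differ, and it assumes Prometheus is port-forwarded on localhost:9090.

```sh
# Illustrative only: raw Cilium agent metrics of the kind these panels chart.
# Assumes Prometheus is reachable on localhost:9090 (e.g. via kubectl port-forward).

# Packet drop rate by reason over the last 5 minutes:
drops='sum(rate(cilium_drop_count_total[5m])) by (reason)'

# Rate of warnings logged by the Cilium agent:
warnings='sum(rate(cilium_errors_warnings_total{level="warning"}[5m]))'

curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=${drops}"
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=${warnings}"
```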
These dashboards contain useful information and are part of a full observability toolset for your platform. They are for those who understand the platform deeply and want to use all the metrics Cilium offers.
However, to give platform teams a better experience managing their environments without needing to become Day-0 specialists, we’ve worked closely with our customers to design a new set of enterprise dashboards that provide both data and context.
Dashboards, built with context to operate at scale
Moving away from the per-subsystem, data-point view of the Cilium platform, we devised categories for the overall high-level health components that our customers need to understand and monitor before digging under the hood.
Breaking this into seven modules within a dashboard makes the data easier to consume, especially if you display these visuals in an NOC-type environment with monitors hung on the wall.
This approach simplifies the overall breakdown by grouping the data points provided by the subsystem monitoring into more platform-component-focused categories. This makes the data easier to consume and makes it simpler to spot underlying platform-level issues, which can then be triaged against reports from application owners and incident response systems such as PagerDuty. Furthermore, as you’ll see as we break down some of these focus areas, having context available alongside the data removes the need to master each platform component to understand the data itself.
Installing these new Isovalent Enterprise Dashboards for Cilium is as simple as running a single helm install command, which can be found in our enterprise documentation. The Grafana dashboards are then installed automatically as part of Grafana’s dashboard reconciliation process. Alternatively, in disconnected environments, you can simply use the JSON files directly. We’ve focused on this installation method to ensure a short time to value, meaning platform teams spend more time using the tooling provided rather than installing and configuring it.
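As a rough illustration of the disconnected-environment route, the sketch below loads one of the dashboard JSON files as a labelled ConfigMap for Grafana’s dashboard sidecar to pick up. It assumes your Grafana runs the standard sidecar (as deployed by kube-prometheus-stack) in a monitoring namespace; the file name here is a placeholder, and the actual helm install command and JSON files live in our enterprise documentation.

```sh
# Minimal sketch, not the documented install: publish a dashboard JSON file as a
# ConfigMap that Grafana's dashboard sidecar will reconcile. The namespace, label,
# and file name are assumptions about your Grafana deployment.
kubectl -n monitoring create configmap cilium-enterprise-dashboards \
  --from-file=cilium-high-level-health.json
kubectl -n monitoring label configmap cilium-enterprise-dashboards grafana_dashboard="1"
```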
The “High-Level Health” dashboard focuses on Cilium’s overall health within the monitored platform. This section of visuals collects metrics and statuses covering Cilium components’ health indicators and Kubernetes metadata, such as pod state and pod restarts.
The screenshot below is from my own Talos Linux environment running Cilium. As you can see, at the start of the monitoring period I had some (self-inflicted) instability in my cluster. Rather than showing pod names with errors, the graph clearly shows that the Operator was not running, making it easy for the user to understand which component is at fault.
As with all data tiles on the dashboard, this tile has an “Informational” marker (ⓘ icon) which, when hovered over, provides a detailed explanation of what the tile is showing, along with additional information such as what to expect if the data shows errors or outages. Clicking on the screenshot opens it in a new tab for closer inspection.
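If you want to cross-check what these panels report without leaving the terminal, a few kubectl commands cover the same ground. This is a sketch that assumes a default Helm install, with the agent DaemonSet named cilium and the operator Deployment named cilium-operator in kube-system.

```sh
# Sketch: the same health signals from the Kubernetes API directly
# (assumes default resource names from a standard Cilium Helm install).

# Desired vs ready Cilium agents:
kubectl -n kube-system get daemonset cilium \
  -o jsonpath='{.status.desiredNumberScheduled} desired / {.status.numberReady} ready{"\n"}'

# Operator availability:
kubectl -n kube-system get deployment cilium-operator

# Pod phase and restart counts for the agents:
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
```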
All components in this environment have since recovered from their errors and are running quite happily. For the remainder of this blog post, to replicate real-world situations, I’ll run several stress tests against the environment that cause warnings, errors, and failures in various components, showing the different sections of the dashboards in action.
If you are using Cilium in your environment today and are considering performing stress tests on your platform, contact our specialists, who can help you plan, design, and validate your deliverables.
Cilium Agent Errors
For the first self-made error, I caused the Cilium Agent to crash by altering its container configuration to incompatible values. Below, we can see an outage: the red line in the graph rises as Cilium Agent pods stop running, while the green line for running agents decreases. The tile on the right-hand side shows the current state, and the graph on the left helps identify when the Agents became unavailable. We are starting with a simple issue and a simple view; hovering over the “informational” icon for this view, we can understand the following from the associated tooltip:
- Cilium Agents that are not running will disrupt workload Pod scheduling, may affect policy implementation, and can lead to disruption in L7 proxy functionality (without HA mode).
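To confirm an outage like this one outside the dashboard, you could query the unavailable-pod count of the agent DaemonSet from kube-state-metrics, or, if you have the cilium CLI installed, ask it for a health summary. A hedged sketch, assuming a port-forwarded Prometheus and default resource names:

```sh
# Unavailable Cilium agent pods, as reported by kube-state-metrics:
q='kube_daemonset_status_number_unavailable{namespace="kube-system", daemonset="cilium"}'
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=${q}"

# Or, with the cilium CLI installed, a one-shot summary of agent and operator health:
cilium status
```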
Now, let’s focus on CNI (Container Network Interface) operations and running workloads in our environment. CNI is the specification Kubernetes uses for creating and configuring container networking, and Cilium implements it to attach networking as Pods are scheduled. If you see errors here, they will likely indicate Pod scheduling issues.
CNI API Rate Limiting
For this part of the demo, I ran a script so the screenshot below could be captured. The script creates a high churn of Pods being added to and deleted from the monitored cluster. Overall, we see a large spread in the latency of the requests that allow Cilium to proceed with the reconciliation of CNI operations. The maximum latency has been pushed into the seconds range, and errors are reported as API rate limits are hit within the platform.
This affects the scheduling of new workloads on the platform: the longer it takes to attach and detach networking from containers, the longer Pods are held up waiting for their status, as the network interface is not yet provisioned. Essentially, the slower these operations are, the slower your environment feels to end users.
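The exact script used for these screenshots isn’t published, but something in the spirit of the sketch below, churning short-lived Pods in a scratch namespace, will exercise the same CNI add and delete paths.

```sh
# Illustrative Pod-churn loop to generate CNI ADD/DEL load (not the original script).
NS=cni-churn
kubectl create namespace "$NS" --dry-run=client -o yaml | kubectl apply -f -
for round in 1 2 3; do
  for i in $(seq 1 30); do
    kubectl -n "$NS" run "churn-${round}-${i}" \
      --image=registry.k8s.io/pause:3.9 --restart=Never &
  done
  wait
  kubectl -n "$NS" delete pods --all --wait=false
done
```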
BPF Map Pressure
The final area I will cover in this blog post is BPF Map health. BPF Maps are used to share data between user space and the kernel. In the specific example below, we will focus on the Policy Map, which stores the policy contents created by platform engineers and hands them off to the BPF programs that implement the policy restrictions at the datapath level.
To create the errors below, I configured the setting bpf-policy-map-max to a value of 512. This is a drastic reduction in the size of this BPF map, whose default is 16384 and which can be set to a maximum of 65536.
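For a Helm-managed installation, the change can be reproduced roughly as below. bpf.policyMapMax corresponds to the bpf-policy-map-max agent setting; this is a lab-only sketch, the release and repository names are assumptions, and you should check your chart version’s values reference before relying on the exact key.

```sh
# Lab-only sketch: shrink the policy BPF map to reproduce the pressure scenario.
# Assumes the release is named "cilium" and was installed from the "cilium" Helm repo.
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set bpf.policyMapMax=512

# Agents pick up the new map size when they restart:
kubectl -n kube-system rollout restart daemonset/cilium
```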
In the Cilium Health section of the dashboard, we see that the number of errors found in the Cilium Agent’s log starts to increase. This is our first warning that there are issues with our platform.
Going to the “BPF Maps” section of the dashboard, we can quickly see that the error rate logged for Policy maps has increased dramatically, and on the right side the value of the “Map Pressure” tile is now at 100%. This means our Policy Map has run out of space, which will hinder Cilium’s functions and our workloads’ ability to communicate on the datapath. This dashboard tile uses the Cilium Agent metric cilium_bpf_map_pressure{map_name="cilium_policy_*"}.
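You can pull the same signal the tile uses straight from Prometheus; the sketch below uses a regex matcher in place of the glob shown above and assumes Prometheus is port-forwarded on localhost:9090.

```sh
# Current policy-map pressure across endpoints (the metric the tile is built on):
q='max(cilium_bpf_map_pressure{map_name=~"cilium_policy.*"})'
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=${q}"
```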
If we look at the Cilium Agent logs, we will see error messages similar to the example below.
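To pull those logs yourself, something like the following works; the container in the agent DaemonSet is named cilium-agent, and the grep pattern is only a rough filter, since the exact error text varies between Cilium versions.

```sh
# Recent Cilium agent logs, filtered for policy-map related errors:
kubectl -n kube-system logs daemonset/cilium -c cilium-agent --since=15m \
  | grep -iE "policy.*map|map.*full" || true
```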
To resolve this issue, I must increase the size of my Policy map. However, this may not fix the cause of the pressure itself; it may just alleviate the symptom. Further troubleshooting of the BPF maps may be needed, which means diving deeper into the subsystem and the Cilium Agent itself. Troubleshooting steps for BPF map pressure are documented here. Cilium users deploying to larger cluster sizes can reach out to the team at Isovalent, who can help you design, plan, and deliver your platform.
Get Started with the Enterprise Dashboards today!
In this blog post, we looked at another enterprise addition for Cilium and how our customer-focused Grafana dashboards help you operate your cloud native platform.
By combining the Isovalent Enterprise offerings with our customer success programs, such as software backports and replica testing environments, to name but a few, you can be assured that Isovalent will help you increase operational confidence in your cloud native platforms, allowing you to focus more of your time on delivering business value.
These dashboards are available today for all existing Isovalent customers. Visit our enterprise documentation for instructions on how to get started; installation couldn’t be simpler, with just a single helm command.
Are you a platform owner considering switching to Cilium? Are you interested in seeing Isovalent Enterprise in action? Reach out to our team to set up a demo and discuss how our support programs can enhance the delivery of your cloud native platforms.
Are you a cloud native practitioner who wants to get hands-on experience with the enterprise feature we offer for Cilium? Then head over to our free hands-on labs, covering Kubernetes networking and security using Cilium and Tetragon!
Dean Lewis is a Senior Technical Marketing Engineer at Isovalent – the company behind the open-source cloud native solution Cilium.
Dean has a varied background in technology, from support to operations to architectural design and delivery at IT solution providers based in the UK, before moving to VMware and focusing on cloud management and cloud native, which remains his primary focus. You can find Dean, past and present, speaking at various technology user groups and industry conferences, as well as on his personal blog.