Addressing Bandwidth Exhaustion with Cilium Bandwidth Manager

Since its inception, a core strength of Kubernetes has been its outstanding orchestration and scheduling capabilities. Kubernetes considers various parameters and application requirements in its algorithm – such as CPU and memory – before placing pods accordingly.

Kubernetes can also enforce and limit how many resources a pod might use. By nature, resources available to the pods are shared resources and preventing compute starvation is critical. But if we are capable of preventing compute starvation, shouldn’t we also prevent network starvation?

Organizations are becoming more efficient at running more pods per node, to take full advantage of available resources. Consequently, it’s important to consider the impact bandwidth consumption can have on overall pod performance.

After all, if a pod began downloading terabytes of data at a huge rate, it could exhaust the node's bandwidth and affect its neighbors. On its own, Kubernetes has limited means of preventing such a scenario, but the Cilium Bandwidth Manager now makes it possible.

In this post we will discuss Cilium Bandwidth Manager, starting with a review of how rate-limiting is done in Kubernetes traditionally and some of its constraints. We will then look at how Cilium addresses some of these limitations before looking at a practical implementation of Cilium Bandwidth Manager.

Note: This blog post is part of a series on Bandwidth Management with Cilium. In the second part, we will dive into a revolutionary TCP congestion technology called BBR and how Cilium became the first cloud native platform to support it (with the help of eBPF of course!). If you’re interested in offering latency and throughput improvements while controlling pod network contention (and really – who isn’t?), then this series is for you.

Congestion Handling

In the hardware networking world, we have had the ability to impose some limits on traffic consumption for a very long time. In simple terms, network engineers would specify a limit on a physical interface and when traffic exceeded that limit, the router would queue up and drop packets. TCP then adjusted the throughput through its congestion control algorithms. (Note: I am intentionally keeping it simple for brevity but if you want to truly understand TCP Congestion Control, please read the TCP Congestion Control e-book.)

Imposing similar limits for pods is not as straightforward but is necessary to avoid a potential “noisy neighbor” scenario where network-hungry pods hog the bandwidth. Kubernetes has provided support for traffic shaping through a bandwidth CNI plugin. An operator would apply pod annotations like the one below and the CNI plugin would rate-limit the pod’s ingress and egress traffic.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
...

Unfortunately, this feature is not only still experimental, but the plugin has severe limitations and does not really scale for production use cases.

In this KubeCon 2022 session presented by my Isovalent teammates Daniel Borkmann and Christopher M. Luciano, the CNI plugin’s limitations are explained in more detail.

To summarize the key challenges with the existing CNI bandwidth plugin implementation:

On the ingress side (controlling traffic entering the pod), the CNI bandwidth plugin implements queueing at the virtual Ethernet (veth) device. By that point, traffic has already entered the host, and packets sit in a queue until the qdisc (short for Queueing Discipline, essentially the Linux traffic control scheduler) processes them through an algorithm called Token Bucket Filter (TBF) and lets them enter the pod.

Tokens roughly correspond to bytes. Initially, the TBF is stocked with tokens which correspond to the amount of traffic that can be burst in one go. Tokens arrive at a steady rate, until the bucket is full.

If no tokens are available, packets are queued, up to a limit. The TBF now calculates the token deficit, and throttles until the first packet in the queue can be sent.

The consequence of TBF is that, by holding on to packets until enough tokens are available, it can increase delay.
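To make the token mechanics above concrete, here is a minimal Python sketch of a token bucket. It is a toy model written for this post, not the kernel's TBF code: the class name, its parameters, and the "one token pays for one byte" rule are simplifying assumptions.

```python
from collections import deque


class TokenBucket:
    """Toy token bucket shaper (illustrative only): one token pays for one byte."""

    def __init__(self, rate_bytes_per_s, burst_bytes, queue_limit):
        self.rate = rate_bytes_per_s    # steady token arrival rate
        self.burst = burst_bytes        # bucket capacity
        self.tokens = burst_bytes       # bucket starts full, allowing an initial burst
        self.queue_limit = queue_limit  # maximum number of waiting packets
        self.queue = deque()            # sizes of packets waiting for tokens
        self.last = 0.0                 # time of the last refill, in seconds

    def _refill(self, now):
        # Tokens arrive at a steady rate until the bucket is full.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def enqueue(self, now, size):
        """Queue a packet; return False if the queue is full and it is dropped."""
        self._refill(now)
        if len(self.queue) >= self.queue_limit:
            return False
        self.queue.append(size)
        return True

    def dequeue(self, now):
        """Send every packet the tokens can pay for.

        Returns (sent_sizes, wait), where wait is how many seconds the head
        of the queue must still be held back because of the token deficit.
        A packet larger than the burst size would wait forever in this toy.
        """
        self._refill(now)
        sent = []
        while self.queue and self.queue[0] <= self.tokens:
            size = self.queue.popleft()
            self.tokens -= size
            sent.append(size)
        wait = 0.0
        if self.queue:
            deficit = self.queue[0] - self.tokens
            wait = deficit / self.rate
        return sent, wait


# 1000 bytes/s with a 1500-byte burst: the first packet leaves at once,
# the second is held until enough tokens have accumulated.
bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500, queue_limit=2)
bucket.enqueue(0.0, 1500)
bucket.enqueue(0.0, 1000)
sent, wait = bucket.dequeue(0.0)
print(sent, wait)
```

The second packet is the interesting one: it is not dropped, it is simply held in the queue, and that holding time is the extra delay described above.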

This type of algorithm is also not built for multi-core and multi-queue scenarios.
A multi-queue NIC leverages multi-CPU cores to handle traffic and balance packets across multiple queues, which can improve network performance significantly.
While the host’s physical NIC is probably multi-queue capable (as most modern NICs are), the fact that the traffic will eventually hit the Token Bucket queue hinders the multi-queue performance gains. It’s a limitation described as a single lock contention point.

On the egress side (controlling traffic leaving the pod), the CNI plugin again has to work around limitations in how traffic can be shaped with TBF. Linux can only shape traffic on an interface's egress, so the traffic is redirected to an Intermediate Functional Block (IFB) device, where the shaping is applied.

The consequence is that a hop is artificially inserted in the network path in order to implement shaping. And of course, the more hops you add in a network, the greater the latency is.

In summary, if we were to use the existing CNI plugin to limit pod bandwidth, we would:

increase resource consumption on the hosts,
sacrifice network performance gains from modern NICs, and
add hops and latency to the mix.

This is counterproductive.

The goal is therefore to implement bandwidth management but without the aforementioned limitations.

The pathway to this goal came with an Internet Hall of Fame inductee’s help.

Introducing the Timing Wheels

Van Jacobson is considered a legend in the networking world.

Not only did he co-write some of the core network diagnostic tools (like traceroute and the BPF-based tcpdump), he also contributed to major improvements around TCP. The Congestion Avoidance and Control paper essentially led to the development of TCP congestion control and to the reliability of today’s Internet.

Today’s TCP congestion control leverages queues to handle shaping but at the Netdev 0x12 keynote, Van proposed that timing wheels would be a more effective mechanism to handle congestion than queues.

The proposal suggests putting an Earliest Departure Time (EDT) timestamp on each packet (depending on the policy and rate) and sending out packets based on the timestamp.

In our rate-limiting context, this means that packets are not sent earlier than their timestamps dictate, which slows down the network flow.
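As a rough illustration of how a rate turns into departure timestamps, here is a minimal Python sketch. It is not Cilium's eBPF code: the `EdtPacer` class and its fields are hypothetical names, and the model assumes a single flow with a fixed rate.

```python
class EdtPacer:
    """Toy pacer (illustrative only): stamps packets with an earliest departure time."""

    def __init__(self, rate_bits_per_s):
        self.rate = rate_bits_per_s
        self.next_free_ns = 0  # earliest time the next packet may leave

    def stamp(self, now_ns, size_bytes):
        # A packet never departs in the past, and never before the
        # previous packet has "used up" its share of the rate.
        departure_ns = max(now_ns, self.next_free_ns)
        # Time this packet occupies at the configured rate.
        tx_time_ns = size_bytes * 8 * 1_000_000_000 // self.rate
        self.next_free_ns = departure_ns + tx_time_ns
        return departure_ns


# At 10 Mbit/s, a 1250-byte packet (10,000 bits) occupies 1 ms.
pacer = EdtPacer(rate_bits_per_s=10_000_000)
stamps = [pacer.stamp(now_ns=0, size_bytes=1250) for _ in range(3)]
print(stamps)
```

Three packets that arrive at the same instant end up with departure times spaced one transmission time apart, so no packet has to be dropped to enforce the rate. A packet arriving after an idle period is stamped with the current time and leaves immediately.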

Timing Wheel — From the Carousel publication

Compared to traditional queueing models, this mechanism provides traffic performance benefits (reducing delay and avoiding unnecessary packet drops) while having a minimal CPU impact on the host.

This is why Cilium implemented Bandwidth Management with EDT.

Cilium’s Bandwidth Manager Implementation

Bandwidth Manager's Implementation — Cilium Bandwidth Manager

Cilium leverages some of the technologies mentioned throughout the post to impose rate-limiting:

The Cilium agent monitors the pod annotations. Operators can configure their bandwidth limits with the same annotations used by the CNI plugin (kubernetes.io/egress-bandwidth).
The Cilium agent pushes the bandwidth requirements into the eBPF datapath.
Enforcement happens on physical interfaces instead of the veth, which avoids bufferbloat and improves TCP TSQ handling.
The Cilium agent implements the EDT timestamps based on the user-provided bandwidth policy.
Cilium is multi-queue aware and automatically sets up multi-queue qdiscs with fair queue leaf qdiscs (in other words, traffic arrives at a main queue (root) and is classified into several sub queues (leaf)).
The fair queues implement the timing wheel mechanism and distribute the traffic according to the packet timestamp.
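The last step in the list can be pictured with a small Python sketch of a timing wheel. This is a sketch of the general idea only, not of the kernel's fair queue qdisc: the `TimingWheel` class, the slot width, and the choice to clamp far-future timestamps into the last slot are assumptions made for illustration.

```python
class TimingWheel:
    """Toy timing wheel (illustrative only): packets are bucketed by departure time."""

    def __init__(self, slot_ns, num_slots):
        self.slot_ns = slot_ns        # width of one slot in nanoseconds
        self.num_slots = num_slots    # how far ahead the wheel can look
        self.slots = [[] for _ in range(num_slots)]
        self.current = 0              # absolute index of the next slot to drain

    def insert(self, packet, departure_ns):
        # Late packets go into the next slot to be drained.
        tick = max(departure_ns // self.slot_ns, self.current)
        # Timestamps beyond the horizon are clamped into the furthest slot.
        tick = min(tick, self.current + self.num_slots - 1)
        self.slots[tick % self.num_slots].append(packet)

    def advance(self, now_ns):
        """Return every packet whose slot is due at now_ns, in slot order."""
        due = []
        last_tick = now_ns // self.slot_ns
        while self.current <= last_tick:
            slot = self.slots[self.current % self.num_slots]
            due.extend(slot)
            slot.clear()
            self.current += 1
        return due


wheel = TimingWheel(slot_ns=1_000_000, num_slots=8)
wheel.insert("a", 0)
wheel.insert("b", 2_000_000)
wheel.insert("c", 1_000_000)
print(wheel.advance(0), wheel.advance(1_500_000), wheel.advance(2_000_000))
```

Inserting a packet is a single list append and no sorting is involved, which is part of why this structure is cheap on the CPU compared with holding packets in an ordered queue.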

In summary:

Cilium removes the need for an IFB, therefore reducing the latency that was introduced with the CNI plugin implementation
Cilium leverages multi-core and multi-queue capabilities, ensuring rate-limiting is not detrimental to performance
Cilium leverages state-of-the-art and optimal congestion avoidance technologies like Earliest Departure Time and Timing Wheel to reduce latency.

Now – platform operators don’t always care how it’s done – they want to know 1) if it works and 2) if it’s easy to operate.

From a performance perspective, we found tremendous performance gains in our testing. During the latency tests, we saw a 4x reduction of p99 latency (p99 is the worst latency that was observed by 99% of all requests – it is the maximum value if you ignore the top 1%).
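If percentiles are new to you, here is a minimal Python sketch of how a p99 can be computed. It uses the nearest-rank method, which is one of several common percentile definitions; the `percentile` helper and the sample latencies are made up for illustration and are not the data from our tests.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it. pct is an integer 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 99 fast requests and one slow outlier.
latencies_ms = [1] * 99 + [50]
print(percentile(latencies_ms, 99), max(latencies_ms))
```

With these made-up samples the single slow request dominates the maximum but does not affect the p99, which is exactly the "ignore the top 1%" behavior described above.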

Cilium's EDT vs TBF performance — Cilium’s EDT latency results

From an operational perspective, platform engineers will find it’s extremely easy to set up – it’s literally one line of YAML.

Time for some practice!

Let’s look at how it works in the lab. You can watch the demo or, if you prefer reading or doing it yourself, follow the steps below:

Bandwidth Manager is not enabled by default so, when installing Cilium, make sure to set up the right Helm flag. For this lab, I am using a GKE cluster. I am primarily following the official Cilium docs.

First, we’re deploying a GKE cluster, using the gcloud cli:

% export NAME="$(whoami)-$RANDOM"
% gcloud container clusters create "${NAME}" \
 --node-taints node.cilium.io/agent-not-ready=true:NoExecute \
 --zone us-west2-a
% gcloud container clusters get-credentials "${NAME}" --zone us-west2-a

 Creating cluster nicovibert-22439 in us-west2-a... Cluster is being health-checked (master is healthy)...done.                                                           

kubeconfig entry generated for nicovibert-22439.
NAME              LOCATION    MASTER_VERSION  MASTER_IP      MACHINE_TYPE  NODE_VERSION    NUM_NODES  STATUS
nicovibert-22439  us-west2-a  1.22.8-gke.202  34.94.100.216  e2-medium     1.22.8-gke.202  3          RUNNING

Fetching cluster endpoint and auth data.
kubeconfig entry generated for nicovibert-22439.

Next, we are installing Cilium with Helm, using the --set bandwidthManager=true flag to make sure the feature is enabled (it’s disabled by default). Alternatively, we could have installed Cilium with the cilium-cli command cilium install.

% NATIVE_CIDR="$(gcloud container clusters describe "${NAME}" --zone "us-west2-a" --format 'value(clusterIpv4Cidr)')"
% echo $NATIVE_CIDR
10.108.0.0/14

% helm search repo cilium
NAME            CHART VERSION   APP VERSION     DESCRIPTION                                       
cilium/cilium   1.11.6          1.11.6          eBPF-based Networking, Security, and Observability
cilium/tetragon 0.8.0           0.8.0           Helm chart for Tetragon       

% helm install cilium cilium/cilium --version 1.11.6 \
  --namespace kube-system \
  --set nodeinit.enabled=true \
  --set nodeinit.reconfigureKubelet=true \
  --set nodeinit.removeCbrBridge=true \
  --set cni.binPath=/home/kubernetes/bin \
  --set gke.enabled=true \
  --set ipam.mode=kubernetes \
  --set ipv4NativeRoutingCIDR=$NATIVE_CIDR \
  --set bandwidthManager=true
NAME: cilium
LAST DEPLOYED: Wed Jun 22 10:42:53 2022
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You have successfully installed Cilium with Hubble.

Your release version is 1.11.6.

After doing a rollout restart of the Cilium DaemonSet and checking that the Cilium pods are running, we also check with “kubectl -n kube-system exec ds/cilium -- cilium status | grep BandwidthManager” that the feature was implemented correctly.

% kubectl -n kube-system rollout restart ds/cilium
daemonset.apps/cilium restarted

% kubectl -n kube-system get pods -l k8s-app=cilium                             
NAME           READY   STATUS    RESTARTS   AGE
cilium-2grmp   1/1     Running   0          21m
cilium-h2rkf   1/1     Running   0          21m
cilium-hshtw   1/1     Running   0          21m


% kubectl -n kube-system exec ds/cilium -- cilium status | grep BandwidthManager
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), wait-for-node-init (init), clean-cilium-state (init)
BandwidthManager:       EDT with BPF   [eth0]

During our tests, we’ll run a network performance test between two pods (server/client).

The YAML below includes the specs for both pods. Note that the kubernetes.io/egress-bandwidth: "10M" annotation is how we specify the bandwidth requirements and that we are using anti-affinity so that the two pods are placed on different nodes.

---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Limits egress bandwidth to 10Mbit/s.
    kubernetes.io/egress-bandwidth: "10M"
  labels:
    # This pod will act as server.
    app.kubernetes.io/name: netperf-server
  name: netperf-server
spec:
  containers:
  - name: netperf
    image: cilium/netperf
    ports:
    - containerPort: 12865
---
apiVersion: v1
kind: Pod
metadata:
  # This Pod will act as client.
  name: netperf-client
spec:
  affinity:
    # Prevents the client from being scheduled to the
    # same node as the server.
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - netperf-server
        topologyKey: kubernetes.io/hostname
  containers:
  - name: netperf
    args:
    - sleep
    - infinity
    image: cilium/netperf

Let’s deploy the pods and start running network throughput tests. You can see below we are doing a TCP_MAERTS network performance test.

What is MAERTS, you may be wondering?

A typical netperf test is called TCP_STREAM and goes from the netperf client to the netperf server. Therefore a stream from the netperf server to the client will be STREAM backwards – MAERTS. This is how servers typically operate, with larger data flows going from server to client.

% kubectl apply -f netperf.yaml 
pod/netperf-server created
pod/netperf-client created
% NETPERF_SERVER_IP=$(kubectl get pod netperf-server -o jsonpath='{.status.podIP}')
% kubectl exec netperf-client -- \
    netperf -t TCP_MAERTS -H "${NETPERF_SERVER_IP}"
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.108.2.172 (10.108.) port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00       9.54

The results? 9.54 Mbps for a specified 10Mbps limit.

When I increase the max bandwidth on the pod manifest from 10Mbps to 100Mbps, re-apply it and re-run the speed test, I get the increased throughput I expect (95 Mbps):

% NETPERF_SERVER_IP=$(kubectl get pod netperf-server -o jsonpath='{.status.podIP}')
% kubectl exec netperf-client -- \
    netperf -t TCP_MAERTS -H "${NETPERF_SERVER_IP}"

MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.108.2.172 (10.108.) port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00      94.97
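To put the netperf numbers next to the configured limits, here is a small Python sketch of the arithmetic. The helper names are hypothetical, and the parser is deliberately simplified: it only understands plain integers and the decimal suffixes k, M and G, whereas real Kubernetes quantities accept more forms.

```python
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}


def annotation_to_bits_per_s(value):
    """Convert an annotation value such as "10M" to bits per second.

    Simplified on purpose: only plain integers and k/M/G are handled.
    """
    suffix = value[-1]
    if suffix in DECIMAL_SUFFIXES:
        return int(value[:-1]) * DECIMAL_SUFFIXES[suffix]
    return int(value)


def achieved_fraction(measured_mbit_per_s, annotation):
    """Fraction of the configured limit that the test actually reached."""
    return measured_mbit_per_s * 10**6 / annotation_to_bits_per_s(annotation)


# Throughput values taken from the two netperf runs in this post.
for measured, limit in [(9.54, "10M"), (94.97, "100M")]:
    print(f"{limit}: {achieved_fraction(measured, limit):.1%} of the limit")
```

Both runs land at roughly 95% of the configured value, so the measured throughput scales with the annotation as expected.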

Summary

To wrap up, pod network rate-limiting is an appealing feature, and Cilium implements it in a very innovative manner. As we discussed, it is easy to set up using only one line of YAML.

It also highlights another great use case for eBPF: Linux gives you a built-in shaper (TBF), but there is a better idea out there (timing wheels), and rather than waiting for a new kernel version, eBPF lets you get the new functionality into the kernel as eBPF code.

And while we have so far focused on traffic entering and exiting Kubernetes pods, eBPF also gives us an opportunity to improve broader Internet-level bandwidth management.

We will explore that in the next blog post. Meanwhile, you can read more about some of the topics highlighted in the post (see links below), talk to the experts behind eBPF and Cilium (book your session), or you can head over to isovalent.com/labs to play with Cilium.

Learn More

Cilium and eBPF Resources:

Isovalent Resources:

Learn more about Isovalent Cilium Enterprise

Technical Resources:

KubeCon Session and Slides: Better Bandwidth Management with eBPF
Official Cilium Docs on Bandwidth Manager: Cilium Docs on Bandwidth Manager
An eBook on TCP Congestion Control: TCP Congestion Control: A Systems Approach
Why Google decided to replace HTB with EDT and BPF (Slides and Video): Netdev 0x14 – Replacing HTB with EDT and BPF
A detailed white paper on EDT (by Google): Carousel: Scalable Traffic Shaping at End Hosts