The Internet remains a miracle to me. How can it serve billions of simultaneous connections across thousands of miles, serving users and systems with vastly different capabilities?
One of the protocols ensuring the Internet’s stability is TCP, and in particular its congestion control. While TCP congestion control has clearly proven its effectiveness, it was invented over thirty years ago, when the Internet was in its infancy.
And with Black Friday soon approaching, many platform operators will want their applications to handle significant traffic spikes and the resulting congestion that come with the busiest online day of the year.
There have been many incremental and significant improvements to TCP congestion control over time, but one particular algorithm – called BBR (Bottleneck Bandwidth and Round-trip propagation time) – is so transformative that Google saw up to a staggering 2,700x improvement in throughput in their tests.
Good news: Cilium is the first cloud native networking platform to support BBR.
If you’re leveraging Kubernetes for Internet-facing services, read on: you might have found a way to significantly improve the user experience for your clients.
We will start this post with a brief recap of TCP congestion control, examine the benefits of BBR and how it was implemented in Cilium, and conclude by going through some test results.
Note: This blog post is part of a series on Bandwidth Management with Cilium. In the first part, we looked at the potential issues with bandwidth starvation in a Kubernetes environment and we explained how Cilium Bandwidth Manager efficiently enforces Pod traffic rate-limiting.
TCP Congestion Control
Congestion control is an essential aspect of networks such as electrical grids, highways, your home Wi-Fi, and the broader Internet: they all have limited resources and numerous clients. Eventually, the number of electrons/cars/devices transiting through the network increases to a point where the underlying system cannot offer enough resources to its clients. This can lead to degradation or, worse, a total collapse.
Thankfully, congestion control is there to prevent or remediate such events. In the world of networking, we have TCP congestion control to thank. To relieve congestion, a device might have to drop packets. Being a connection-oriented protocol, TCP will eventually notice the missing packet acknowledgements, assume the loss was caused by congestion, and then adjust by reducing its sending rate. This approach is often described as “loss-based” or “control-based” as it kicks in after the congestion event.
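To make the “react after the loss” behavior concrete, here is a toy model (an illustration only, not kernel code) of the classic additive-increase/multiplicative-decrease (AIMD) pattern that loss-based TCP congestion control follows: grow the congestion window each round trip, and halve it whenever a loss is detected:

```python
# Toy AIMD model of loss-based congestion control (illustration only,
# not real kernel code). The window grows by one segment per round trip
# and is halved whenever a loss is detected.

def aimd(rounds, loss_rounds, cwnd=1):
    """Return the congestion window (in segments) after each round trip."""
    history = []
    for r in range(rounds):
        if r in loss_rounds:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease on loss
        else:
            cwnd += 1                  # additive increase otherwise
        history.append(cwnd)
    return history

# Losses at rounds 10 and 20 produce the familiar sawtooth pattern:
# the window climbs steadily, then collapses by half at each loss.
print(aimd(25, {10, 20}))
```

The sawtooth this produces is exactly the “overreaction” discussed below: every loss, congestion-related or not, cuts the sending rate in half.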
While there are incremental gains from newer loss-based congestion control algorithms, the problem with all of them is that congestion and packet loss are not always synchronous:
- Many commodity switches in use today have “shallow buffers” that cannot absorb bursty traffic, so they may drop packets during bursts. These drops in turn cause a drastic slowdown.
- We often find edge network equipment with very deep buffers. While these devices handle bursts well, their buffers tend to fill up with queued packets, adding latency (a problem unsurprisingly called bufferbloat).
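To see why deep buffers hurt latency, a back-of-the-envelope calculation helps (the numbers below are illustrative, not measurements from any specific device): the queuing delay a full buffer adds is simply its size divided by the link rate.

```python
# Back-of-the-envelope queuing delay added by a full buffer
# (illustrative numbers, not measurements from any specific device).

def queuing_delay_ms(buffer_bytes, link_bps):
    """Time to drain a full buffer at the link rate, in milliseconds."""
    return buffer_bytes * 8 / link_bps * 1000

# A 1 MiB buffer in front of a 10 Mbit/s uplink adds roughly 839 ms of
# delay when full -- enough to ruin any interactive traffic behind it.
print(f"{queuing_delay_ms(1 * 1024 * 1024, 10_000_000):.0f} ms")
```

Loss-based senders keep pushing until that buffer is full, which is why bufferbloat and loss-based congestion control go hand in hand.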
So what if we solved the congestion issue differently? What if we could anticipate and avoid congestion altogether instead of overreacting to it? This is what “avoidance-based” protocols like Vegas and BBR intend to do.
Bottleneck Bandwidth and Round-trip propagation time
At any time, a TCP connection has exactly one slowest link – what we will refer to as the bottleneck. The ideal behavior is for the sender to transmit at the bottleneck bandwidth’s rate, with the lowest possible delay.
Loss-based congestion control delivers full bottleneck bandwidth, but at the cost of high delay and frequent packet loss (with these losses often triggering drastic throughput reductions).
In other words, loss-based TCP congestion control tends to fill up the network pipe until it becomes saturated and congestion is observed.
TCP Vegas was the first congestion control algorithm to adjust the sending rate before congestion occurs. While not widely adopted, it paved the way for subsequent algorithms, including BBR.
BBR determines the bottleneck’s capacity and paces traffic to match it, while minimizing the number of packets waiting in queues (avoiding bufferbloat).
This model – avoidance-based instead of loss-based – effectively solves the latency issues that come with deep-buffered interfaces while preventing the drastic rate reduction that packet drops cause on shallow-buffered interfaces.
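BBR’s operating point can be sketched with two measurements: the bottleneck bandwidth (BtlBw) and the round-trip propagation time (RTprop). Their product, the bandwidth-delay product (BDP), is how much data should be in flight, and the pacing rate stays near BtlBw. A simplified sketch follows (the real algorithm adds windowed max/min filters, pacing-gain cycling, and probing phases):

```python
# Simplified sketch of BBR's core model (the real algorithm adds
# windowed max/min filters, pacing-gain cycling, and probing phases).

def bbr_targets(btlbw_bps, rtprop_s):
    """Return (pacing rate in bit/s, target bytes in flight)."""
    bdp_bytes = btlbw_bps * rtprop_s / 8  # bandwidth-delay product
    return btlbw_bps, bdp_bytes

# E.g. a 100 Mbit/s bottleneck with a 40 ms round trip: pace at the
# bottleneck rate and keep roughly one BDP (500 kB) in flight.
rate, inflight = bbr_targets(100_000_000, 0.040)
print(rate, inflight)
```

Keeping in-flight data near the BDP is what keeps queues (and therefore latency) low: anything beyond one BDP can only sit in a buffer somewhere.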
Google, who developed this protocol and spent hours each day examining TCP packet header captures from all over the world, saw some staggering results. Even when compared with CUBIC (a highly optimized loss-based congestion control protocol and the default on Linux these days), Google observed significant improvements.
While I don’t expect users will often see a 2,700x gain in throughput, the performance improvements are impressive regardless, especially considering how optimized the YouTube network already is.
Daniel Borkmann, Staff Software Engineer at Isovalent and co-creator of eBPF, came across this innovation and believed that applications running on Kubernetes could reap tremendous performance benefits, if only the underlying network platform supported BBR.
So he took on the challenge of implementing BBR in Cilium.
However, Daniel quickly realized that implementing this functionality came with significant challenges (challenges he shared during this Linux Plumbers Conference 2021 session).
To describe these challenges succinctly: Fair Queue (FQ) schedulers need packet timestamps to determine how to pace traffic to a given rate (ideally, as close as possible to the bottleneck’s bandwidth). The problem was that the Linux kernel would clear packet timestamps as packets traversed network namespaces.
Since each Kubernetes Pod is assigned its own network namespace, this timestamp clearing prevented BBR from being supported for Kubernetes Pods.
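To see why the timestamps matter, here is a simplified sketch (not the kernel implementation) of earliest-departure-time pacing, the model an FQ scheduler follows: each packet is released no earlier than its departure timestamp, and consecutive timestamps are spaced by packet length divided by the pacing rate. If the timestamp is cleared on the way out of the Pod’s namespace, the scheduler has nothing to place on that timeline.

```python
# Simplified sketch of earliest-departure-time (EDT) pacing, the model
# an FQ scheduler follows: each packet carries a departure timestamp,
# and consecutive timestamps are spaced by length/rate.

def edt_schedule(packet_lengths, rate_bps, start=0.0):
    """Assign a departure time (seconds) to each packet, in order."""
    t, times = start, []
    for length in packet_lengths:
        times.append(t)
        t += length * 8 / rate_bps  # next packet waits length/rate
    return times

# Three 1500-byte packets paced at 12 Mbit/s leave 1 ms apart.
print(edt_schedule([1500, 1500, 1500], 12_000_000))
```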
Thankfully, the Linux kernel networking community came together to address the issue.
BBR in Cilium
Daniel and the Cilium team worked with the kernel networking community to address this issue in v5.18 and newer kernels by explicitly retaining egress packet timestamps instead of clearing them upon network namespace traversal.
By combining Cilium’s efficient BPF host routing architecture (described in detail here) with these kernel enhancements, we can carry the original delivery timestamp and retain the packet’s socket association all the way to the physical devices.
Cilium’s Bandwidth Manager’s FQ scheduler (described in the previous blog post) can then deliver packets with BBR at the correct rate.
As mentioned earlier, BBR in Cilium requires a Linux kernel of 5.18 or later and Cilium 1.12 or later.
Enabling BBR simply requires the following Helm flags:
helm upgrade cilium cilium/cilium --version 1.12.1 \
--namespace kube-system \
--set bandwidthManager.enabled=true \
--set bandwidthManager.bbr=true
Once the feature is enabled, all your Pods will benefit from it, but it is particularly relevant for your external-facing Pods.
It is worth highlighting that this feature does not require any change on the (external, Internet-based) clients. Once the feature is enabled, the Cilium-based services exposed on the Internet will automatically deliver these improvements, regardless of who or where the client is.
In the KubeCon session where Daniel and Christopher M. Luciano presented these advancements in Bandwidth Management, they first described the results seen with CUBIC: a very respectable 274 Mbps on average.
As you can see in the picture below, TCP ramped up its rate all the way to 431 Mbps, hit some loss and overreacted to it, dropping the transmit rate down to 256 Mbps momentarily.
With BBR enabled, the behavior is different (see below): BBR works out the bottleneck’s bandwidth (evidently around 430 Mbps in this particular example) and traffic ramps up to that rate much faster than with CUBIC. There’s no overreaction or wild variation in the throughput.
The results? Over a 50% improvement in throughput.
It probably sounds too good to be true, right? Well, we need to acknowledge that BBR is not perfect. BBRv2 is already under development to address some of BBRv1’s shortcomings, the main ones being fairness (when competing with BBRv1 flows, CUBIC flows tend to get far less bandwidth) and a high volume of packet retransmissions.
Regardless – if Google is adopting it across all their Internet-facing services, it must be for a good reason.
Cilium and eBPF Resources:
- KubeCon Session and Slides: Better Bandwidth Management with eBPF
- Official Cilium Docs on Bandwidth Manager: Cilium Docs on Bandwidth Manager
- An eBook on TCP Congestion Control: TCP Congestion Control: A Systems Approach
- Linux Plumber Conference – BPF-datapath extensions for Kubernetes workloads – Slides and Video
- Evaluating BBRv2 on the Dropbox Edge Network – Slides
- Bufferbloat.net – Link