Last updated on 24th August 2023 with BIG TCP support over IPv4.
On a recent transatlantic trip to California, I was admiring the efficiency of the highway express lane when it occurred to me that its fluidity was comparable to a recent Cilium feature.
I know road traffic is an often-used analogy for network congestion (I used it myself in my recent bandwidth blog posts), but it might make it easier for me to explain what could be considered a complex topic: BIG TCP.
The highway in and out of San Francisco has a lane dedicated to public transportation and carpooling. I took full advantage of it on my way home via San Francisco International, smugly looking at the traffic jam in the standard lanes.
Public transportation is rarely perceived as the fastest method of transportation but in the case of the US-101 in the Bay Area – and in the case of BIG TCP – grouping passengers (or packets) together is the fastest way to transit through your network.
A single car taking a single individual to a destination can be considered wasteful, especially when many folks drive their own car to head out to the same destination. A car weighs on average 4,000 lbs compared to 180 lbs for its driver. That’s a lot of overhead and is pretty inefficient – as inefficient as the Linux networking stack processing packet after packet when their payload could just be grouped together.
BIG TCP – the network equivalent of carpooling – is now available with Cilium 1.13 to provide enhanced network performance for your nodes.
100Gbps and beyond
Many of the organizations adopting Cilium – cloud providers, financial institutions and telecommunications providers – have something in common: they want to extract as much performance from the network as possible and are constantly on the lookout for marginal performance gains.
These organizations are building networks capable of 100Gbps and beyond, but with the adoption of 100Gbps network adapters comes an inevitable challenge: how can a CPU deal with over eight million packets per second (assuming a frame size of 1,538 bytes on the wire)? That leaves only about 120 nanoseconds for the system to handle each packet, which is unrealistic. There is also significant overhead in handling all these packets from the interface up to the upper protocol layers.
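A quick back-of-the-envelope calculation illustrates those numbers (using 1,538 bytes as the full on-wire size of a standard 1,500-byte MTU frame):

```bash
# Packets per second at 100Gbps with 1,538-byte frames, and the per-packet
# time budget that leaves the CPU, in nanoseconds.
echo "100 * 10^9 / (1538 * 8)" | bc              # ~8.1 million packets per second
echo "10^9 / (100 * 10^9 / (1538 * 8))" | bc     # ~123 ns per packet
```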
Evidently, this limitation is not new and has been addressed by batching packets together. Within the Linux stack, this is done by GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload). On the receive side, GRO groups packets into a super-sized 64KB packet within the stack and passes it up the networking stack. Likewise, on the transmit side, TSO lets the stack push a super-sized TCP packet down and leaves the segmentation to the NIC.
While that super-sized 64K packet helps, modern CPUs can actually handle much larger packets.
But 64K had remained a hard limit: the length of an IP packet is specified, in octets, in a 16-bit field. Its maximum value is therefore 65,535 bytes (64KB).
What if we could find a bigger header to specify a larger packet size? Having bigger packets would mean reducing the overhead and would theoretically improve throughput and reduce latency.
BIG TCP over IPv6
This is what BIG TCP was designed to do. Created by Linux kernel experts Eric Dumazet and Coco Li from Google, BIG TCP was introduced in the Linux 5.19 kernel.
Unusually, BIG TCP support was first introduced for IPv6 before IPv4. It was easier to introduce it in IPv6 through the use of a 22-year-old RFC (RFC2675) that describes IPv6 jumbograms (packets bigger than 64KB).
IPv6 supports a Hop-by-Hop extension header that can be inserted into the packet. By specifying the payload length in the Hop-by-Hop extension header (and setting the Payload Length field in the IPv6 header to 0 so that it is ignored), we can work around the 64KB limitation.
The Hop-by-Hop extension header uses a 32-bit field for the payload length, which would theoretically allow 4GB packets; to start with, though, the limit is raised from 64KB to 512KB.
If you’ll let me continue the transport analogy: we went from a single car (standard TCP/IP) to a bus (GSO/GRO) to a high-speed train (BIG TCP).
Support for BIG TCP over IPv6 was introduced in Cilium 1.13.
BIG TCP over IPv4
BIG TCP support for IPv4 was introduced in the Linux kernel 6.3. You can read more about it in the Linux kernel mailing list but, in a nutshell, as IPv4 does not have the Hop-by-Hop extension header, the length of the data payload stored in the socket buffer (referred to as skb->len by Linux developers) is used instead to specify the bigger packet size.
BIG TCP over IPv4 was introduced in Cilium 1.14.
Jumbo Frames vs BIG TCP
We are 700 words into this blog post and you might wonder why I still haven’t mentioned MTU (Maximum Transmission Unit). You might also ask yourself: 1) are BIG TCP and Jumbo Frames the same thing? and 2) will you need to change the MTU across all your network devices to support BIG TCP?
Jumbo Frames provide a 6x increase in the Ethernet frame size (from the standard 1,500 bytes to a whopping 9,000 bytes) and, as with BIG TCP, provide a significant improvement in performance. As it operates at Layer 2 of the OSI model, it applies to both IPv4 and IPv6. The main difference is that Jumbo Frames traverse the physical network: they require the Ethernet equipment to support them and every network interface traversed by the frames to be configured accordingly.
BIG TCP does not require the MTU on your network devices to be modified. It happens entirely within the Linux networking stack: on ingress, it is used by GRO to aggregate packets into a 512KB hyper-sized packet (instead of a 64KB super-sized one), and on egress, by GSO to craft a super-sized packet and push it down the stack.
As a former network engineer who still has nightmares about having to troubleshoot mismatched MTUs, it was comforting to know that, contrary to my initial understanding, BIG TCP does not require any change to the physical network.
Let’s now review BIG TCP on Cilium.
BIG TCP on Cilium
BIG TCP over IPv6
BIG TCP was introduced in the first release candidate of Cilium 1.13 (you can read more about it in the PR) and requires the following:
- Kernel >= 5.19
- eBPF Host-Routing
- eBPF-based kube-proxy replacement
- eBPF-based masquerading
- Tunneling and encryption disabled
- Supported NICs: mlx4, mlx5
You can test it yourself – it took me about 15 minutes to get it working. For my tests, I created a VM based on a 22.10 Ubuntu image on GCP (as you can see below, it runs a 5.19 kernel), installed all the required tools (Docker, Helm, Kubectl, Kind, Cilium-cli) and used kind to deploy my cluster.
We use Kind in Dual-Stack mode (you can read more about it in the Dual Stack Tutorial, the Dual Stack video or the IPv6 lab).
The kind config is the following:
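For reference, a minimal dual-stack kind configuration along these lines works (three nodes, default CNI disabled so that Cilium can be installed); the exact config used for the original tests may differ slightly:

```bash
# Write a 3-node, dual-stack kind config with the default CNI disabled.
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  ipFamily: dual
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF
```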
Save this as kind-config.yaml and deploy it with kind create cluster --config kind-config.yaml, and you will have your 3-node cluster. We will install Cilium with BIG TCP enabled. Note that the kind-worker node will run a netperf client while kind-worker2 will run a netperf server.
This time, I used helm to install Cilium (I detailed the many ways to install Cilium in a previous tutorial). Remember to add the cilium repo with helm repo add cilium https://helm.cilium.io/ before installing Cilium with helm:
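Here is a sketch of the Helm install, assuming the value names from the Cilium Helm chart (in particular enableIPv6BigTCP); double-check them, and the full list of prerequisite settings, against the official docs for your Cilium version:

```bash
helm repo add cilium https://helm.cilium.io/
helm repo update

# Cilium 1.13 with the prerequisites listed above: native routing (no tunnel),
# eBPF kube-proxy replacement and masquerading, IPv6 enabled, IPv6 BIG TCP on.
# Depending on your environment you may also need k8sServiceHost/k8sServicePort
# and the IPv6 native-routing CIDR.
helm install cilium cilium/cilium --version 1.13.0 \
  --namespace kube-system \
  --set tunnel=disabled \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set ipv6.enabled=true \
  --set enableIPv6BigTCP=true \
  --set kubeProxyReplacement=strict \
  --set bpf.masquerade=true
```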
Once enabled in Cilium, all nodes will be automatically configured with BIG TCP. There is no cluster downtime when enabling this feature, although the Kubernetes Pods must be restarted for the changes to take effect.
Let’s double-check it’s enabled:
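One way to verify (the key name below follows the Cilium agent's enable-ipv6-big-tcp flag and may vary by version):

```bash
# The Helm value ends up in the cilium-config ConfigMap consumed by the agent.
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i big-tcp
# expected: enable-ipv6-big-tcp: "true"
```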
Let’s now verify the GSO settings on the node:
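Checking from inside the Cilium agent (which ships iproute2) is a convenient way on a kind cluster; with BIG TCP enabled, gso_max_size/gro_max_size should be well above the default 65536:

```bash
# Show the GSO/GRO limits Cilium configured on the node's interface.
kubectl -n kube-system exec ds/cilium -- \
  ip -d link show dev eth0 | grep -oE '(gso|gro)_max_size [0-9]+'
```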
196608 bytes is 192KB: the current optimal GRO/GSO value with Cilium is 192KB, but it can eventually be raised to 512KB if additional performance benefits are observed. The performance results with 192KB were impressive, even for small-sized request/response-type workloads.
Likewise, when I deploy my netperf Pods (using this manifest), I can see they are automatically set up with the right GSO:
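Assuming the Pods are called netperf-client and netperf-server and their image includes iproute2 (both assumptions, since your manifest may differ), the same check can be run inside a Pod:

```bash
# Inspect the GSO/GRO limits from within the netperf client Pod.
kubectl exec netperf-client -- \
  ip -d link show dev eth0 | grep -oE '(gso|gro)_max_size [0-9]+'
```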
Let’s run a quick netperf test (notice how it will be done over IPv6):
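A request/response test along these lines exercises large payloads and reports latency percentiles plus throughput (Pod names and the podIPs index holding the IPv6 address are assumptions; adjust to your cluster):

```bash
# Grab the netperf server Pod's IPv6 address (the index depends on IP family order).
NETPERF_SERVER=$(kubectl get pod netperf-server -o jsonpath='{.status.podIPs[1].ip}')
echo "$NETPERF_SERVER"

# TCP_RR with 80KB request/response payloads.
kubectl exec netperf-client -- \
  netperf -t TCP_RR -H "$NETPERF_SERVER" -- \
  -r 80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
```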
Note that without BIG TCP, if you run the same command to check the GSO, you will see a value of 65536 – the 64KB limit I described earlier in the blog post.
If you run exactly the same test, you should see worse performance results (higher latency and lower throughput) compared to the earlier test, when BIG TCP had been enabled:
In our lab, we saw a 42% increase in transactions per second:
And a 2.2x lower p99 latency between Pods:
We also saw a 15% increase in throughput with our sample app.
Details of the tests can be found in the KubeCon session linked below.
As mentioned previously, support for BIG TCP required certain changes in the Linux kernel, and Cilium BIG TCP therefore requires a 5.19 or later kernel. While I expect large-scale network providers and cloud operators to use it first, the fact that major OS releases such as Ubuntu 22.10 ship a 5.19 kernel means that adoption will soon trickle down to users of any scale. Note that Ubuntu 22.04.2 will be released in February and will include 5.19 kernel support.
Eventually, we expect this feature to be enabled by default for IPv6 clusters.
BIG TCP over IPv4
BIG TCP over IPv4 has very similar requirements to BIG TCP over IPv6. The primary difference is that a newer kernel (6.3) is required, so to try it I had to install a 6.3 kernel on my host. Here is what I did:
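For reference, one way to get a 6.3 kernel on an Ubuntu host is to install the mainline kernel builds; the file names below are placeholders, so download the actual v6.3 .deb packages from the Ubuntu mainline kernel archive first:

```bash
# Download the image, modules and headers .debs for v6.3 from
# https://kernel.ubuntu.com/mainline/v6.3/ (exact file names vary per build),
# then install them and reboot into the new kernel.
sudo dpkg -i linux-image-*6.3*.deb linux-modules-*6.3*.deb linux-headers-*6.3*.deb
sudo reboot
```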
Let’s verify that the Kernel has been updated:
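A quick check of the running kernel:

```bash
uname -r
# should now report a 6.3 kernel, e.g. 6.3.x
```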
Once the kernel is upgraded, the instructions are very similar to the ones above. I deployed my cluster using kind (same instructions as above) and first installed Cilium without BIG TCP for IPv4 enabled.
Let’s verify that BIG TCP is not enabled yet:
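Again checking the agent configuration (key name assumed from the Cilium agent's enable-ipv4-big-tcp flag):

```bash
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i big-tcp
# at this point enable-ipv4-big-tcp should be "false" (or absent altogether)
```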
Let’s deploy the netperf server and client again and extract the IPv4 address of the server:
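Assuming the same netperf-server Pod name as before, its IPv4 address can be pulled from status.podIP (the primary IP family in this cluster):

```bash
NETPERF_SERVER=$(kubectl get pod netperf-server -o jsonpath='{.status.podIP}')
echo "$NETPERF_SERVER"
```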
Let’s now run the performance tests:
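This is the same TCP_RR test as earlier, simply pointed at the IPv4 address:

```bash
kubectl exec netperf-client -- \
  netperf -t TCP_RR -H "$NETPERF_SERVER" -- \
  -r 80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
```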
Without BIG TCP, I saw a minimum latency of around 75 microseconds, a 90th percentile latency of about 170 microseconds, a 99th percentile latency of about 300 microseconds and, finally, a throughput between 5,000 and 6,000 packets per second.
Let’s now enable BIG TCP over IPv4:
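A Helm upgrade along these lines flips it on (the enableIPv4BigTCP value name follows the Cilium Helm chart; verify it against the docs for your version):

```bash
# Re-use the existing values and only switch on IPv4 BIG TCP.
helm upgrade cilium cilium/cilium --version 1.14.0 \
  --namespace kube-system \
  --reuse-values \
  --set enableIPv4BigTCP=true

# Restart the agents so they pick up the new configuration.
kubectl -n kube-system rollout restart daemonset/cilium
```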
I then deleted the netperf pods and re-deployed them before running the same tests:
The results were impressive:
I saw a minimum latency of around 60 microseconds, a 90th percentile latency of around 120 microseconds, a 99th percentile latency of about 180 microseconds and, finally, a throughput between 10,000 and 11,000 packets per second (nearly double!).
To learn more about this feature, watch the KubeCon session presented by Isovalent engineers Daniel Borkmann and Nikolay Aleksandrov, and follow the official docs to try it out.
Thanks for reading.
Learn More
- NetDev session by Eric Dumazet (videos and slides)
- Cilium Docs on IPv6 BIG TCP (docs)
- Cilium Docs on IPv4 BIG TCP (docs)
- KubeCon session by Daniel Borkmann and Nikolay Aleksandrov (video and slides)
- Going big with TCP by Jonathan Corbet (article)
- Credit for FastTrack picture: I-680 Express Lanes Photos Copyright Noah Berger / 2019
- Cilium 1.14 Release Blog post
- Cilium 1.13 Release Blog post
Prior to joining Isovalent, Nico worked in many different roles—operations and support, design and architecture, and technical pre-sales—at companies such as HashiCorp, VMware, and Cisco.
In his current role, Nico focuses primarily on creating content to make networking a more approachable field and regularly speaks at events like KubeCon, VMworld, and Cisco Live.