BIG Performances with BIG TCP on Cilium

Feb 15, 2023 · Cilium

On a recent transatlantic trip to California, I was admiring the efficiency of the highway express lane when it occurred to me that its fluidity was comparable to a recent Cilium feature.

I know road traffic is an often-used analogy for network congestion (I used it myself in my recent bandwidth blog posts), but it might make it easier for me to explain what could be considered a complex topic: BIG TCP.

The highway in and out of San Francisco has a lane dedicated to public transportation and carpooling. I took full benefit of it on my way home via San Francisco International while smugly looking at the traffic jam on the standard lanes.

Public transportation is rarely perceived as the fastest method of transportation but in the case of the US-101 in the Bay Area – and in the case of BIG TCP – grouping passengers (or packets) together is the fastest way to transit through your network.

A single car taking a single individual to a destination can be considered wasteful, especially when many folks drive their own car to head out to the same destination. A car weighs on average 4,000 lbs compared to 180 lbs for its driver. That’s a lot of overhead and is pretty inefficient – as inefficient as the Linux networking stack processing packet after packet when their payloads could just be grouped together.

BIG TCP – the network equivalent of carpooling – is now available with Cilium 1.13 to provide enhanced network performance for your nodes.

100Gbps and beyond

Many of the organizations adopting Cilium – cloud providers, financial institutions and telecommunications providers – all have something in common: they all want to extract as much performance from the network as possible and they are constantly looking out for marginal performance gains. 

These organizations are building networks capable of 100Gbps and beyond, but with the adoption of 100Gbps network adapters comes an inevitable challenge: how can a CPU deal with eight million packets per second (assuming a frame size of 1,538 bytes)? That leaves only about 120 nanoseconds per packet for the system to handle, which is unrealistic. There is also significant overhead in moving all these packets from the interface up to the upper protocol layers.
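
As a quick back-of-the-envelope check, the numbers above come out of plain shell arithmetic (nothing Cilium-specific here):

# 100Gbps expressed in bytes per second, divided by a 1,538-byte frame:
echo $(( 100 * 1000 * 1000 * 1000 / 8 / 1538 ))   # ~8.1 million packets per second
# Nanoseconds available per packet at that rate:
echo $(( 1000 * 1000 * 1000 / (100 * 1000 * 1000 * 1000 / 8 / 1538) ))   # ~123ns per packet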

Evidently, this limitation is not new and has been addressed by batching packets together. Within the Linux stack, it is addressed by GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload). On the receiving end, GRO groups packets into a super-sized 64KB packet within the stack and passes it up the networking stack. Likewise, on the transmitting end, TSO segments super-sized TCP packets into chunks the NIC can handle.
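
If you want to see whether these offloads are active on one of your interfaces, ethtool will show them (a quick check, assuming an interface named eth0; the exact feature list depends on the driver):

ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'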

While that super-sized 64K packet helps, modern CPUs can actually handle much larger packets.

But 64K remained a hard limit: the length of an IP packet is specified, in octets, in a 16-bit field. Its maximum value is therefore 65,535 bytes (just under 64KB).

What if we could find a bigger header to specify a larger packet size? Having bigger packets would mean reducing the overhead and would theoretically improve throughput and reduce latency.

This is what BIG TCP was designed to do. Created by Linux kernel experts Eric Dumazet and Coco Li from Google and introduced in the Linux 5.19 kernel, it is now supported in Cilium 1.13.

BIG TCP support was first introduced for IPv6 (although, as I was writing this blog post, I came across some proposals to introduce support for IPv4). It leverages RFC 2675, an RFC from 1999 that describes IPv6 jumbograms (packets bigger than 64KB).

IPv6 supports a Hop-by-Hop header that can be inserted into the packet. By specifying the payload length in the Hop-by-Hop extension header (and setting the Payload Length field in the IPv6 header to 0 so that it is ignored), we can work around the 64K limitation.

The Hop-by-Hop extension header uses a 32-bit field for the payload length, which would in theory allow 4GB packets; to start with, though, the limit is raised from 64KB to 512KB.

If you’ll allow me to return to the transport analogy: we went from a single car (standard TCP/IP) to a bus (GSO/GRO) to a high-speed train (BIG TCP).

Jumbo Frames vs BIG TCP

We are 700 words into this blog post and you might wonder why I still haven’t mentioned MTU (Maximum Transmission Unit). You might also be asking yourself: 1) are BIG TCP and Jumbo Frames the same thing? and 2) will you need to change the MTU across all your network devices to support BIG TCP?

Jumbo Frames provide a 6x increase in the Ethernet frame size (from a standard 1,500 bytes to a whopping 9,000 bytes) and, as with BIG TCP, provide a significant improvement in performance. As they operate at Layer 2 of the OSI model, they apply to both IPv4 and IPv6. The main difference is that Jumbo Frames traverse the physical network: they require the Ethernet equipment to support them and every network interface traversed by the frames to be configured accordingly.
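
For illustration, enabling jumbo frames means raising the MTU on every hop in the path, for example on a Linux host (assuming an interface named eth0, and that every switch and router in between is configured to match):

ip link set dev eth0 mtu 9000
ip link show dev eth0 | grep mtu   # confirm the new MTU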

BIG TCP does not require the MTU on your network devices to be modified. It all happens within the Linux networking stack: on ingress, GRO aggregates packets into a 512K hyper-sized packet (instead of a 64KB super-sized one); on egress, GSO crafts a super packet and pushes it down the stack.
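
Under the hood, this boils down to raising the per-device GSO/GRO limits above the 64K default, which you can also do by hand on a 5.19+ kernel with a recent iproute2 – roughly what Cilium automates for you (a sketch, assuming an interface named eth0):

# Raise the aggregation limits above the 64K default (requires root).
ip link set dev eth0 gso_max_size 196608 gro_max_size 196608
# Check the resulting values.
ip -d link show dev eth0 | grep -o 'g[sr]o_max_size [0-9]*'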

As a former network engineer who still has nightmares about having to troubleshoot mismatched MTUs, it was comforting to know that, contrary to my initial understanding, BIG TCP does not require any change to the physical network.

Let’s now review BIG TCP on Cilium.

BIG TCP on Cilium

BIG TCP was introduced in the first release candidate of Cilium 1.13 (you can read more about it in the PR) and requires the following:

  • Kernel >= 5.19
  • eBPF Host-Routing
  • eBPF-based kube-proxy replacement
  • eBPF-based masquerading
  • Tunneling and encryption disabled
  • Supported NICs: mlx4, mlx5
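
Before you start, a couple of quick pre-flight checks can save some head-scratching (a sketch; interface and host names will differ in your environment, and the NIC driver check only applies to bare-metal nodes):

uname -r                        # must report 5.19 or newer
ethtool -i eth0 | grep driver   # mlx4/mlx5 are the NIC drivers validated so far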

You can test it yourself – it took me about 15 minutes to get it working. For my tests, I created a VM based on an Ubuntu 22.10 image on GCP (as you can see below, it runs a 5.19 kernel), installed all the required tools (Docker, Helm, kubectl, kind, Cilium CLI) and used kind to deploy my cluster.

root@nvibert-big-tcp-test:~# uname -ar
Linux nvibert-big-tcp-test 5.19.0-1015-gcp #16-Ubuntu SMP Mon Jan 9 13:08:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

We use Kind in Dual-Stack mode (you can read more about it in the Dual Stack Tutorial, the Dual Stack video or the IPv6 lab).

The kind config is the following:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
  ipFamily: dual
  disableDefaultCNI: true

Save this as kind-config.yaml and deploy it with kind create cluster --config kind-config.yaml and you will have your 3-node cluster. We will install Cilium with BIG TCP enabled. Note that the kind-worker node will run a netperf client while kind-worker2 will run a netperf server.
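
Once the cluster is up, you can check that the three nodes are ready and that each node was allocated both an IPv4 and an IPv6 Pod CIDR (a quick sanity check):

kubectl get nodes
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDRs}{"\n"}{end}'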

This time, I used Helm to install Cilium (I detailed the many ways to install Cilium in a previous tutorial). Remember to add the Cilium repo with helm repo add cilium https://helm.cilium.io/ before installing Cilium with Helm:

helm install cilium cilium/cilium --version 1.13.0-rc4 \
  --namespace kube-system \
  --set tunnel=disabled \
  --set bpf.masquerade=true \
  --set ipv6.enabled=true \
  --set enableIPv6Masquerade=false \
  --set kubeProxyReplacement=strict \
  --set ipam.mode=kubernetes \
  --set nodePort.enabled=true \
  --set autoDirectNodeRoutes=true \
  --set hostLegacyRouting=false \
  --set ipv4NativeRoutingCIDR="" \
  --set enableIPv6BIGTCP=true

Once enabled in Cilium, all nodes will be automatically configured with BIG TCP. There is no cluster downtime when enabling this feature, although existing Kubernetes Pods must be restarted for the change to take effect.
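
For example, Pods managed by Deployments can be bounced with a rolling restart (a sketch; adjust the namespace to your environment – standalone Pods, like the netperf Pods below, simply need to be deleted and re-applied from their manifest):

kubectl rollout restart deployment -n default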

Let’s double-check it’s enabled:

root@nvibert-big-tcp-test:~# cilium config view | grep ipv6-big-tcp
enable-ipv6-big-tcp                            true

Let’s now verify the GSO settings on the node:

root@nvibert-big-tcp-test:~# docker exec kind-worker ip -d -j link show dev eth0
root@nvibert-big-tcp-test:~# docker exec kind-worker ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
196608

196608 bytes is 192KB: the current optimal GRO/GSO value with Cilium is 192K but it can eventually be raised to 512K if additional performance benefits are observed. The performance results with 192K were impressive, even for small-sized request/response-type workloads.

Likewise, when I deploy my netperf Pods (using this manifest), I can see they are automatically set up with the right GSO:

root@nvibert-big-tcp-test:~# kubectl exec netperf-server -- ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
196608
root@nvibert-big-tcp-test:~# kubectl exec netperf-client -- ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
196608

Let’s run a quick netperf test (notice how it will be done over IPv6):

root@nvibert-big-tcp-test:~# NETPERF_SERVER=`kubectl get pod netperf-server -o jsonpath='{.status.podIPs}' | jq -r -c '.[].ip | select(contains(":") == true)'`
root@nvibert-big-tcp-test:~# echo $NETPERF_SERVER
fd00:10:244:2::15f9
root@nvibert-big-tcp-test:~# kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
netperf-client   1/1     Running   0          94m
netperf-server   1/1     Running   0          94m
root@nvibert-big-tcp-test:~# kubectl exec netperf-client -- netperf  -t TCP_RR -H ${NETPERF_SERVER} -- -r80000:80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to fd00:10:244:2::15f9 () port 0 AF_INET6 : first burst 0
Minimum      90th         99th         Throughput 
Latency      Percentile   Percentile              
Microseconds Latency      Latency                 
             Microseconds Microseconds            
59           148          290          8663.99 
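
If you also want a raw throughput figure in addition to the request/response numbers, the same client can run a TCP_STREAM test against the server (a sketch; the output columns depend on your netperf build):

kubectl exec netperf-client -- netperf -t TCP_STREAM -H ${NETPERF_SERVER} -- -O THROUGHPUT,THROUGHPUT_UNITS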

Note that without BIG TCP, if you run the same commands to check the GSO, you will see a value of 65536 – the 64K limit I described earlier in the blog post.
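
If you want to reproduce the "before" numbers yourself, BIG TCP can be switched off again with a Helm upgrade (a sketch; --reuse-values keeps the other settings from the earlier install, and the netperf Pods need to be re-created afterwards to pick up the old limits):

helm upgrade cilium cilium/cilium --version 1.13.0-rc4 --namespace kube-system \
  --reuse-values --set enableIPv6BIGTCP=false
kubectl -n kube-system rollout restart daemonset/cilium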

root@nvibert-big-tcp-test:~# docker exec kind-worker ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
65536
root@nvibert-big-tcp-test:~# kubectl exec netperf-server -- ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
65536
root@nvibert-big-tcp-test:~# cilium config view | grep big
enable-ipv6-big-tcp                            false

If you run exactly the same test, you should see worse performance results (higher latency and lower throughput) compared to the earlier test, when BIG TCP had been enabled:

root@nvibert-big-tcp-test:~# kubectl exec netperf-client -- netperf  -t TCP_RR -H ${NETPERF_SERVER} -- -r80000:80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to fd00:10:244:1::1469 () port 0 AF_INET6 : first burst 0
Minimum      90th         99th         Throughput 
Latency      Percentile   Percentile              
Microseconds Latency      Latency                 
             Microseconds Microseconds            
73           201          357          6587.63 

In our lab, we saw a 42% increase in transactions per second, a 2.2x lower p99 latency between Pods, and a 15% increase in throughput with our sample app.

Details of the tests can be found in the KubeCon session linked below.

As mentioned previously, support for BIG TCP required certain changes in the Linux kernel, and Cilium BIG TCP therefore requires a 5.19 or later kernel version. While I expect large-scale network providers and cloud operators to use it first, the fact that major OS releases such as Ubuntu 22.10 ship with a 5.19 kernel means that adoption will soon trickle down to users of any scale. Note that Ubuntu 22.04.2 will be released in February and will also include the 5.19 kernel.

Eventually, we expect this feature to be enabled by default for IPv6 clusters.

Learn more about this feature in this KubeCon session presented by Isovalent engineers Daniel Borkmann and Nikolay Aleksandrov and follow the official docs to try it out.

Thanks for reading.

Learn More

  • NetDev session by Eric Dumazet (videos and slides)
  • Cilium Docs on IPv6 BIG TCP (docs)
  • KubeCon session by Daniel Borkmann and Nikolay Aleksandrov (video and slides)
  • Going big with TCP by Jonathan Corbet (article)
  • Credit for FastTrack picture: I-680 Express Lanes Photos Copyright Noah Berger / 2019
Nico Vibert, Senior Technical Marketing Engineer