
BIG Performances with BIG TCP on Cilium

Nico Vibert

Last updated on 24th August 2023 with BIG TCP support over IPv4.

On a recent transatlantic trip to California, I was admiring the efficiency of the highway express lane when it occurred to me that its fluidity was comparable to a recent Cilium feature.

I know road traffic is an often-used analogy for network congestion (I used it myself in my recent bandwidth blog posts) but it might make it easier for me to explain what could be considered a complex topic: BIG TCP.

The highway in and out of San Francisco has a lane dedicated to public transportation and carpooling. I took full benefit of it on my way home via San Francisco International while smugly looking at the traffic jam on the standard lanes.

Public transportation is rarely perceived as the fastest method of transportation but in the case of the US-101 in the Bay Area – and in the case of BIG TCP – grouping passengers (or packets) together is the fastest way to transit through your network.

A single car taking a single individual to a destination can be considered wasteful, especially when many folks drive their own cars to the same destination. A car weighs on average 4,000 lbs compared to 180 lbs for its driver. That’s a lot of overhead and is pretty inefficient – as inefficient as the Linux networking stack processing packet after packet when their payloads could just be grouped together.

BIG TCP – the network equivalent of carpooling – is now available with Cilium 1.13 to provide enhanced network performance for your nodes.

100Gbps and beyond

Many of the organizations adopting Cilium – cloud providers, financial institutions and telecommunications providers – have something in common: they all want to extract as much performance from the network as possible and are constantly on the lookout for marginal performance gains.

These organizations are building networks capable of 100Gbps and beyond, but with the adoption of 100Gbps network adapters comes an inevitable challenge: how can a CPU deal with eight million packets per second (assuming an MTU of 1,538 bytes)? That leaves only 120 nanoseconds per packet for the system to handle, which is unrealistic. There is also significant overhead in shepherding all these packets from the interface to the upper protocol layers.
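The arithmetic behind that claim is easy to check. A rough back-of-the-envelope calculation (ignoring the Ethernet preamble and inter-frame gap):

```python
# Back-of-the-envelope: per-packet CPU budget on a 100Gbps link.
LINK_BPS = 100e9          # 100 Gbps
FRAME_BYTES = 1538        # MTU-sized Ethernet frame, as in the text

pps = LINK_BPS / (FRAME_BYTES * 8)   # packets per second
budget_ns = 1e9 / pps                # nanoseconds available per packet

print(f"{pps / 1e6:.1f} Mpps, {budget_ns:.0f} ns per packet")
# ~8.1 million packets per second, leaving roughly 123 ns per packet
```

That is in line with the figures quoted above; there is simply not enough time to run the full stack for every wire-sized packet.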

Evidently, this limitation is not new and has been addressed by batching packets together. Within the Linux stack, this is done by GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload). On the receiving end, GRO groups packets into a super-sized 64KB packet within the stack and passes it up the networking stack. Likewise, on the transmitting end, TSO segments super-sized TCP packets for the NIC to handle.
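To get a feel for the batching win, consider how many wire segments collapse into one 64KB super-packet (a rough sketch: 1,448 bytes of TCP payload assumes a standard 1,500-byte MTU minus IPv4, TCP and timestamp-option headers):

```python
# How many MTU-sized TCP segments fit into one 64KB GRO super-packet?
MSS = 1448          # typical TCP payload per 1,500-byte MTU frame
GRO_LIMIT = 65536   # the 64KB super-packet ceiling

segments = GRO_LIMIT // MSS
print(segments)  # 45 -> the stack traverses its layers once instead of ~45 times
```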

While that super-sized 64K packet helps, modern CPUs can actually handle much larger packets.

But 64K had remained a hard limit: the length of an IP packet is specified, in octets, in a 16-bit field. Its maximum value is therefore 65,535 bytes (64KB).

What if we could find a bigger header to specify a larger packet size? Having bigger packets would mean reducing the overhead and would theoretically improve throughput and reduce latency.

BIG TCP over IPv6

This is what BIG TCP was designed to do. Created by Linux kernel experts Eric Dumazet & Coco Li from Google, BIG TCP was introduced in Linux 5.19 kernel.

Unusually, BIG TCP support was first introduced for IPv6 before IPv4. It was easier to introduce in IPv6 thanks to a 22-year-old RFC (RFC 2675) that describes IPv6 jumbograms (packets bigger than 64KB).

IPv6 supports a Hop-by-Hop extension header that can be inserted into the packet. By specifying the payload length in the Hop-by-Hop extension header (and setting the Payload Length field in the IPv6 header to 0 so that it is ignored), we can work around the 64K limitation.

The Hop-by-Hop extension header uses a 32-bit field for the payload length, which would in theory allow 4GB packets. To start with, though, the limit increases from 64KB to 512KB.
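For the curious, the RFC 2675 Jumbo Payload option carried in that Hop-by-Hop header is tiny. A minimal sketch of its byte layout (illustrative only – this is not how the kernel builds it):

```python
import struct

def jumbo_payload_option(length: int) -> bytes:
    """Pack an RFC 2675 Jumbo Payload option: option type 0xC2,
    option data length 4, then the 32-bit jumbo payload length."""
    assert length > 65535, "jumbograms are for payloads beyond the 16-bit limit"
    return struct.pack("!BBI", 0xC2, 4, length)

opt = jumbo_payload_option(512 * 1024)  # a 512KB BIG TCP-sized packet
print(opt.hex())  # c20400080000 -> type 0xc2, len 4, length 0x00080000
```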

If you’d let me resume the transport analogy: we went from a single car (standard TCP/IP) to a bus (GSO/GRO) to a fast-speed train (BIG TCP).

Support for BIG TCP over IPv6 was introduced in Cilium 1.13.

BIG TCP over IPv4

BIG TCP support for IPv4 was introduced in Linux kernel 6.3. You can read more about it in the Linux kernel mailing list but, in a nutshell, as IPv4 does not have a Hop-by-Hop extension header, the length of the data payload stored in the socket buffer (referred to as skb->len by Linux developers) is used to specify the bigger packet size.

BIG TCP over IPv4 was introduced in Cilium 1.14.

Jumbo Frames vs BIG TCP

We are 700 words into this blog post and you might wonder why I still haven’t mentioned MTU (Maximum Transmission Unit). You might also ask yourself: 1) are BIG TCP and Jumbo Frames the same thing? 2) will you need to change the MTU across all your network devices to support BIG TCP?

Jumbo Frames provide a 6x increase in the Ethernet frame size (from a standard 1,500 bytes to a whopping 9,000 bytes) and, as with BIG TCP, provide a significant improvement in performance. As they operate at Layer 2 of the OSI model, they apply to both IPv4 and IPv6. The main difference is that Jumbo Frames traverse the physical network: they require the Ethernet equipment to support them and every network interface traversed by the frames to be configured accordingly.
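As an aside, the efficiency gain from Jumbo Frames is easy to quantify. A rough sketch, assuming 20-byte IPv4 and 20-byte TCP headers and ignoring Ethernet framing overhead:

```python
def payload_efficiency(mtu: int, ip_hdr: int = 20, tcp_hdr: int = 20) -> float:
    """Fraction of each IP packet that is actual TCP payload."""
    return (mtu - ip_hdr - tcp_hdr) / mtu

print(f"{payload_efficiency(1500):.1%}")  # ~97.3% at a standard MTU
print(f"{payload_efficiency(9000):.1%}")  # ~99.6% with jumbo frames
```

Fewer headers per byte of payload, and six times fewer frames for the stack and the NIC to process.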

BIG TCP does not require the MTU on your network devices to be modified. It is designed to happen entirely within the Linux networking stack: on ingress, GRO aggregates packets into a 512K hyper-sized packet (instead of a 64KB super-sized one); on egress, GSO crafts a super packet and pushes it down the stack.

As a former network engineer who still has nightmares about having to troubleshoot mismatched MTUs, it was comforting to know that, contrary to my initial understanding, BIG TCP does not require any change to the physical network.

Let’s now review BIG TCP on Cilium.

BIG TCP on Cilium

BIG TCP over IPv6

BIG TCP was introduced in the first release candidate of Cilium 1.13 (you can read more about it in the PR) and requires the following:

  • Kernel >= 5.19
  • eBPF Host-Routing
  • eBPF-based kube-proxy replacement
  • eBPF-based masquerading
  • Tunneling and encryption disabled
  • Supported NICs: mlx4, mlx5

You can test it yourself – it took me about 15 minutes to get it working. For my tests, I created a VM based on an Ubuntu 22.10 image on GCP (as you can see below, it runs a 5.19 kernel), installed all the required tools (Docker, Helm, Kubectl, Kind, Cilium-cli) and used kind to deploy my cluster.

root@nvibert-big-tcp-test:~# uname -ar
Linux nvibert-big-tcp-test 5.19.0-1015-gcp #16-Ubuntu SMP Mon Jan 9 13:08:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

We use Kind in Dual-Stack mode (you can read more about it in the Dual Stack Tutorial, the Dual Stack video or the IPv6 lab).

The kind config is the following:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
  ipFamily: dual
  disableDefaultCNI: true

Save this as kind-config.yaml and deploy it with kind create cluster --config kind-config.yaml and you will have your 3-node cluster. We will install Cilium with BIG TCP enabled. Note that the kind-worker node will run a netperf client while kind-worker2 will run a netperf server.

This time, I used helm to install Cilium (I detailed the many ways to install Cilium in a previous tutorial). Remember to add the cilium repo with helm repo add cilium https://helm.cilium.io/ before installing Cilium with helm:

helm install cilium cilium/cilium --version 1.13.0-rc4 \
  --namespace kube-system \
  --set tunnel=disabled \
  --set bpf.masquerade=true \
  --set ipv6.enabled=true \
  --set enableIPv6Masquerade=false \
  --set kubeProxyReplacement=strict \
  --set ipam.mode=kubernetes \
  --set nodePort.enabled=true \
  --set autoDirectNodeRoutes=true \
  --set hostLegacyRouting=false \
  --set ipv4NativeRoutingCIDR="10.0.0.0/8" \
  --set enableIPv6BIGTCP=true

Once enabled in Cilium, all nodes will be automatically configured with BIG TCP. There is no cluster downtime when enabling this feature, although the Kubernetes Pods must be restarted for the changes to take effect.

Let’s double-check it’s enabled:

root@nvibert-big-tcp-test:~# cilium config view | grep ipv6-big-tcp
enable-ipv6-big-tcp                            true

Let’s now verify the GSO settings on the node:

root@nvibert-big-tcp-test:~# docker exec kind-worker ip -d -j link show dev eth0
[{"ifindex":9,"link_index":10,"ifname":"eth0","flags":["BROADCAST","MULTICAST","UP","LOWER_UP"],"mtu":1500,"qdisc":"noqueue","operstate":"UP","linkmode":"DEFAULT","group":"default","txqlen":1000,"link_type":"ether","address":"02:42:ac:12:00:04","broadcast":"ff:ff:ff:ff:ff:ff","link_netnsid":0,"promiscuity":0,"min_mtu":68,"max_mtu":65535,"linkinfo":{"info_kind":"veth"},"inet6_addr_gen_mode":"eui64","num_tx_queues":2,"num_rx_queues":2,"gso_max_size":196608,"gso_max_segs":65535}]
root@nvibert-big-tcp-test:~# docker exec kind-worker ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
196608

196608 bytes is 192KB: the current optimal GRO/GSO value with Cilium is 192K but it can eventually be raised to 512K if additional performance benefits are observed. The performance results with 192K were impressive, even for small-sized request/response-type workloads.
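It is no coincidence that the value looks odd in decimal – 196,608 bytes is exactly three of the old 64KB super-packets:

```python
GSO_MAX = 196608    # gso_max_size reported above
OLD_LIMIT = 65536   # the historical 64KB ceiling

assert GSO_MAX == 3 * OLD_LIMIT
print(GSO_MAX / 1024)  # 192.0 (KB), i.e. 3 x 64KB
```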

Likewise, when I deploy my netperf Pods (using this manifest), I can see they are automatically set up with the right GSO:

root@nvibert-big-tcp-test:~# kubectl exec netperf-server -- ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
196608
root@nvibert-big-tcp-test:~# kubectl exec netperf-client -- ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
196608

Let’s run a quick netperf test (notice how it will be done over IPv6):

root@nvibert-big-tcp-test:~# NETPERF_SERVER=`kubectl get pod netperf-server -o jsonpath='{.status.podIPs}' | jq -r -c '.[].ip | select(contains(":") == true)'`
root@nvibert-big-tcp-test:~# echo $NETPERF_SERVER
fd00:10:244:2::15f9
root@nvibert-big-tcp-test:~# kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
netperf-client   1/1     Running   0          94m
netperf-server   1/1     Running   0          94m
root@nvibert-big-tcp-test:~# 
root@nvibert-big-tcp-test:~# kubectl exec netperf-client -- netperf  -t TCP_RR -H ${NETPERF_SERVER} -- -r80000:80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to fd00:10:244:2::15f9 () port 0 AF_INET6 : first burst 0
Minimum      90th         99th         Throughput 
Latency      Percentile   Percentile              
Microseconds Latency      Latency                 
             Microseconds Microseconds            
59           148          290          8663.99 

Note that, without BIG TCP, if you run the same command to check the GSO, you will see a value of 65536 – the 64K limit I described earlier in the blog post.

root@nvibert-big-tcp-test:~# docker exec kind-worker ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
65536
root@nvibert-big-tcp-test:~# kubectl exec netperf-server -- ip -d -j link show dev eth0 | jq -c '.[0].gso_max_size'
65536
root@nvibert-big-tcp-test:~# cilium config view | grep big
enable-ipv6-big-tcp                            false

If you run exactly the same test, you should see worse performance results (higher latency and lower throughput) compared to the earlier test, when BIG TCP had been enabled:

root@nvibert-big-tcp-test:~# kubectl exec netperf-client -- netperf  -t TCP_RR -H ${NETPERF_SERVER} -- -r80000:80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to fd00:10:244:1::1469 () port 0 AF_INET6 : first burst 0
Minimum      90th         99th         Throughput 
Latency      Percentile   Percentile              
Microseconds Latency      Latency                 
             Microseconds Microseconds            
73           201          357          6587.63 
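A quick sanity check on the two runs shown above, using the numbers netperf printed (your own figures will of course vary):

```python
# Compare the two TCP_RR runs above: BIG TCP enabled vs disabled.
tps_bigtcp, tps_baseline = 8663.99, 6587.63  # transactions/s
p99_bigtcp, p99_baseline = 290, 357          # microseconds

print(f"throughput: +{tps_bigtcp / tps_baseline - 1:.0%}")   # ~32% higher
print(f"p99 latency: -{1 - p99_bigtcp / p99_baseline:.0%}")  # ~19% lower
```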

In our lab, we saw a 42% increase in transactions per second, a 2.2x lower p99 latency between Pods, and a 15% increase in throughput with our sample app.

Details of the tests can be found in the KubeCon session linked below.

As mentioned previously, support for BIG TCP required certain changes in the Linux kernel, so Cilium BIG TCP requires a 5.19 or later kernel version. While I expect large-scale network providers and cloud operators to use it first, the fact that major OS releases such as Ubuntu 22.10 ship with a 5.19 kernel means that adoption will soon trickle down to users of any scale. Note that Ubuntu 22.04.2, to be released in February, will also include the 5.19 kernel.

Eventually, we expect this feature to be enabled by default for IPv6 clusters.

BIG TCP over IPv4

BIG TCP over IPv4 has very similar requirements to BIG TCP over IPv6. The primary difference is that a newer kernel (6.3) is required. To try it, I had to install a newer kernel on my host. Here is what I did:

root@server:~# wget https://raw.githubusercontent.com/pimlie/ubuntu-mainline-kernel.sh/master/ubuntu-mainline-kernel.sh
root@server:~# chmod +x ubuntu-mainline-kernel.sh
root@server:~# sudo mv ubuntu-mainline-kernel.sh /usr/local/bin/
root@server:~# ubuntu-mainline-kernel.sh -c
root@server:~# sudo ubuntu-mainline-kernel.sh -i v6.4.0

Let’s verify that the Kernel has been updated:

root@server:~# uname -ar
Linux server 6.4.0-060400-generic #202306271339 SMP PREEMPT_DYNAMIC Tue Jun 27 14:26:34 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Once the kernel is upgraded, the instructions are very similar to the above. I deployed my cluster using Kind (same instructions as above) before installing Cilium without BIG TCP for IPv4 first.

root@server:~# helm repo add cilium https://helm.cilium.io/
"cilium" has been added to your repositories
root@server:~# 
root@server:~# helm install cilium cilium/cilium --version 1.14.0 \
  --namespace kube-system \
  --set tunnel=disabled \
  --set bpf.masquerade=true \
  --set ipv6.enabled=true \
  --set enableIPv6Masquerade=false \
  --set kubeProxyReplacement=strict \
  --set ipam.mode=kubernetes \
  --set nodePort.enabled=true \
  --set autoDirectNodeRoutes=true \
  --set hostLegacyRouting=false \
  --set ipv4NativeRoutingCIDR="10.0.0.0/8" \
  --set enableIPv6BIGTCP=false \
  --set enableIPv4BIGTCP=false
NAME: cilium
LAST DEPLOYED: Thu Aug 24 15:44:34 2023
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You have successfully installed Cilium with Hubble.

Your release version is 1.14.0.

For any further help, visit https://docs.cilium.io/en/v1.14/gettinghelp
root@server:~# 

Let’s verify that BIG TCP is not enabled yet:

root@server:~# cilium config view | grep ipv4-big-tcp
enable-ipv4-big-tcp                            false
root@server:~# 

Let’s deploy the netperf server and client again and extract the IPv4 address of the server:

root@server:~# kubectl apply -f https://raw.githubusercontent.com/NikAleksandrov/cilium/42b93676d85783aa167105a91e44078ce6731297/test/bigtcp/netperf.yaml
pod/netperf-server created
pod/netperf-client created
root@server:~# NETPERF_SERVER=`kubectl get pod netperf-server -o jsonpath='{.status.podIPs}' | jq -r -c '.[].ip | select(contains(":") == false)'`
root@server:~# echo $NETPERF_SERVER
10.244.2.109

Let’s now run the performance tests:

root@server:~# kubectl exec netperf-client -- netperf  -t TCP_RR -H ${NETPERF_SERVER} -- -r80000:80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.244.2.109 (10.244.) port 0 AF_INET : first burst 0
Minimum      90th         99th         Throughput 
Latency      Percentile   Percentile              
Microseconds Latency      Latency                 
             Microseconds Microseconds            
73           171          279          7102.21  

Without BIG TCP, I see a minimum latency around 75 microseconds, a 90th percentile latency around 170 microseconds, a 99th percentile latency around 300 microseconds and finally a throughput between 5,000 and 6,000 transactions per second.

Let’s now enable BIG TCP over IPv4:

root@server:~# cilium config set enable-ipv4-big-tcp true
✨ Patching ConfigMap cilium-config with enable-ipv4-big-tcp=true...
♻️  Restarted Cilium pods

I then deleted the netperf Pods and re-deployed them before running the same tests:

root@server:~# NETPERF_SERVER=`kubectl get pod netperf-server -o jsonpath='{.status.podIPs}' | jq -r -c '.[].ip | select(contains(":") == false)'`
root@server:~# echo $NETPERF_SERVER
10.244.2.244

The results were impressive:

root@server:~# kubectl exec netperf-client -- netperf  -t TCP_RR -H ${NETPERF_SERVER} -- -r80000:80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.244.2.244 (10.244.) port 0 AF_INET : first burst 0
Minimum      90th         99th         Throughput 
Latency      Percentile   Percentile              
Microseconds Latency      Latency                 
             Microseconds Microseconds            
56           116          186          10304.92   

I saw a minimum latency around 60 microseconds, a 90th percentile latency around 120 microseconds, a 99th percentile latency around 180 microseconds and finally a throughput between 10,000 and 11,000 transactions per second (a whopping 50% increase!).
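Plugging the two IPv4 runs shown above into the same quick comparison:

```python
# Compare the two IPv4 TCP_RR runs above: BIG TCP enabled vs disabled.
tps_bigtcp, tps_baseline = 10304.92, 7102.21  # transactions/s
p99_bigtcp, p99_baseline = 186, 279           # microseconds

print(f"throughput: +{tps_bigtcp / tps_baseline - 1:.0%}")   # ~45% higher
print(f"p99 latency: -{1 - p99_bigtcp / p99_baseline:.0%}")  # ~33% lower
```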


To learn more about this feature, watch the KubeCon session presented by Isovalent engineers Daniel Borkmann and Nikolay Aleksandrov, and follow the official docs to try it out.

Thanks for reading.

Learn More

  • NetDev session by Eric Dumazet (videos and slides)
  • Cilium Docs on IPv6 BIG TCP (docs)
  • Cilium Docs on IPv4 BIG TCP (docs)
  • KubeCon session by Daniel Borkmann and Nikolay Aleksandrov (video and slides)
  • Going big with TCP by Jonathan Corbet (article)
  • Credit for FastTrack picture: I-680 Express Lanes Photos Copyright Noah Berger / 2019
  • Cilium 1.14 Release Blog post
  • Cilium 1.13 Release Blog post