Nico Vibert is a Senior Technical Marketing Engineer at Isovalent – the company behind the open-source cloud native solution Cilium. Nico has worked in many different roles – operations and support, design and architecture, technical pre-sales – at companies such as HashiCorp, VMware and Cisco. His focus is primarily on networking, cloud and automation, and he loves creating content and writing books. Nico regularly speaks at events, whether at large-scale conferences such as VMworld and Cisco Live, at smaller forums such as VMware and AWS User Groups, or at virtual events such as HashiCorp HashiTalks. Outside of Isovalent, Nico is passionate about intentional diversity & inclusion initiatives and is Chief DEI Officer at the Open Technology organization OpenUK.
Cilium BIG TCP
With Cilium 1.13 comes an exciting new feature that enables faster performance and lower latency through the network stack: BIG TCP.
Welcome to a Cilium Tech Flash episode on BIG TCP support in Cilium. We’ll start with an explanation of BIG TCP before going into a demo. I published a blog post yesterday which is all about BIG TCP support in Cilium, which came with release 1.13, and I highly recommend you read it. I also recommend you watch the KubeCon session presented by Daniel Borkmann; you will see the links here at the bottom of the page. The idea behind this feature is for customers and users who are looking to roll out large-scale data centers, at 100 Gig and beyond, and the challenge that comes with the volume of packets. Assuming an MTU of 1500 bytes, a 100 Gig network means over eight million packets per second, which is a lot for the system to handle. So the question really is: what can we do to group packets together, so the system handles fewer packets, and ultimately reduce latency and improve throughput? This has already been addressed by using GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload), whereby we batch packets together into a 64 KB super-packet within the Linux network stack. Everything we talk about today happens within the Linux network stack. It’s different from something like jumbo frames, where you typically increase the size of the Ethernet frame on your physical devices; that’s a difference I explain later on in the article as well. So we’ve been using GRO, GSO and TSO to group 1.5K packets into a supersized 64K packet, but modern CPUs can actually handle much larger packets than that; we were simply limited to a maximum size of 64K. Why is that?
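To put those numbers in perspective, here is a quick back-of-the-envelope calculation (a sketch that ignores Ethernet framing overhead, so the real rates are slightly lower):

```shell
# Packets per second needed to saturate a 100 Gbit/s link,
# comparing a 1500-byte MTU against 64K and 192K aggregated super-packets.
link_bps=$((100 * 1000 * 1000 * 1000))    # 100 Gbit/s

echo $((link_bps / (1500 * 8)))           # ~8.3 million packets/s at MTU 1500
echo $((link_bps / (65536 * 8)))          # ~190,000 packets/s with 64K GRO/GSO batches
echo $((link_bps / (196608 * 8)))         # ~63,000 packets/s with 192K BIG TCP batches
```

The fewer packets the stack has to traverse per second, the less per-packet overhead the kernel pays, which is where the latency and throughput gains come from.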
Because the IP header specifies the length of the packet in a 16-bit field, in octets. So the maximum value was 2 to the power of 16, which is 64 KB. The idea behind BIG TCP is to get over that limit by finding room in the packet for a length field bigger than 16 bits, and that’s what BIG TCP is about. It was created recently by a couple of Linux kernel engineers at Google and was introduced in kernel 5.19. It’s first introduced and supported with IPv6, but there are plans to support it for IPv4 as well. The way it works is by leveraging an older RFC that enables IPv6 jumbograms, which are packets bigger than 64 KB. An extra header, called a hop-by-hop header, is inserted into the packet, and it carries a 32-bit length field instead of the 16-bit one. So we could, in principle, have packets as large as 4 GB, but for now BIG TCP enables going from 64K up to 512K. Cilium actually sets that limit at 192K, roughly three times the previous maximum packet size, but we are already seeing significant performance advantages. So again, I recommend you read the blog post, where there is a detailed tutorial, but let’s go through the demo now.
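The arithmetic behind those limits can be checked directly:

```shell
# Maximum packet size expressible in the 16-bit IP length field:
echo $((2**16))        # 65536 bytes, the classic 64K GSO/GRO ceiling

# The IPv6 jumbogram hop-by-hop option (RFC 2675) carries a 32-bit length:
echo $((2**32))        # 4294967296 bytes, the ~4 GB theoretical ceiling

# Cilium's default BIG TCP batch size:
echo $((192 * 1024))   # 196608 bytes, i.e. 192K, three times the old 64K limit
```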
So, let’s have a look at this demo environment. As you can see, we’re running kernel 5.19. I’ve got a GCP image, and I’m going to be running Kind. I’ve got three nodes, running in dual-stack, and I’ve disabled the default CNI because I’m installing Cilium on top of it. There are more details about the configuration in the blog post as well. You can see that my Kind cluster is deployed with three nodes, and if we look at the actual nodes, each has both an IPv4 and an IPv6 address, as you can see here on kind-worker2.
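A Kind configuration along these lines would produce such a cluster (a sketch: the node count matches the demo, but consult the blog post for the exact configuration used):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ipFamily: dual           # allocate both IPv4 and IPv6 addresses
  disableDefaultCNI: true  # leave CNI to Cilium
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

You would then create the cluster with `kind create cluster --config <file>` before installing Cilium.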
We’ve also deployed a couple of netperf pods: one client and one server, deployed on two different nodes. We’re using anti-affinity to prevent the client from being scheduled on the same node as the server. If we look at the pods again, you can see that one is on kind-worker1 and one on kind-worker2. And again, each pod has both an IPv4 and an IPv6 address allocated, so we’re running in dual-stack mode.
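The anti-affinity rule mentioned above can be expressed in the pod spec roughly like this (a sketch: the `app: netperf` label is an assumption about how the demo manifests are labelled):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: netperf        # hypothetical label shared by the netperf pods
        topologyKey: kubernetes.io/hostname
```

With `topologyKey: kubernetes.io/hostname`, the scheduler refuses to place two matching pods on the same node, which is exactly what we want for a node-to-node network benchmark.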
Now, let’s take a look at the Cilium status. I’ve already installed Cilium, running version 1.13. If we filter the configuration on “big,” we can see that BIG TCP is not enabled yet. We’ll have a look at the GSO settings and run some performance tests before we enable the BIG TCP setting.
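Filtering the agent configuration looks roughly like this (a sketch: the exact key name may vary between releases, so check `cilium config view` output on your own cluster):

```shell
# Show Cilium's running configuration and keep only BIG TCP-related keys.
# Before the feature is enabled, this should report it as false/disabled.
cilium config view | grep -i big
```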
I’ve just copy-pasted a couple of commands. The first one looks at the Ethernet settings on the node, and you can see the GSO value, which is 64K: 65,536. It’s the same on the pods. What Cilium does automatically is update the GSO on the pods once the feature has been enabled. So, we take the IPv6 address of the netperf server, and we run a netperf test from the client to the server. You can see the throughput, 6,739, and the P99 latency, 341 microseconds. We’ll compare with these numbers later, once we’ve enabled BIG TCP.
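The test itself looks something like the following (a sketch: the pod names, the jsonpath index for the IPv6 address, and the choice of netperf output selectors are assumptions based on the demo, so adapt them to your manifests):

```shell
# Grab the netperf server's IPv6 address from its pod status
# (in dual-stack, podIPs typically lists IPv4 first, IPv6 second).
NETPERF_SERVER=$(kubectl get pod netperf-server \
  -o jsonpath='{.status.podIPs[1].ip}')

# Run a request/response test and report throughput plus tail latency.
kubectl exec netperf-client -- \
  netperf -t TCP_RR -H "$NETPERF_SERVER" -- \
  -o MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
```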
So, we enable the feature and wait for the Cilium agents to restart. I’ve sped up the demo here to save us some time. Cilium is back up and running, and again, we verify that BIG TCP has been enabled. This time, yes, it is.
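On an existing installation, enabling the feature and restarting the agents can be done with Helm along these lines (a sketch: `enableIPv6BIGTCP` is the Helm value as I understand it in Cilium 1.13, but verify against the docs for your release):

```shell
# Turn on IPv6 BIG TCP while keeping all other installed values.
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set enableIPv6BIGTCP=true

# Restart the Cilium agents so they pick up the new setting,
# then wait for the rollout to finish.
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium
```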
Now, if we look at the GSO max size on the netperf client, it’s still 64K. Why? Because we need to restart the pods once we’ve enabled the feature. I’m going to do it in a quick and dirty way: delete the pods and reapply the manifest. In production, you would do it in a more sensible manner, for example by rolling out your deployment. So, I delete the pods, and I reapply the manifest.
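For a production workload managed by a Deployment, the cleaner equivalent would be a rolling restart (assuming a Deployment named `netperf`, which is hypothetical here):

```shell
# Recreate the pods gradually instead of deleting them outright,
# so each replacement comes up with the updated GSO settings
# while the old pods keep serving until the new ones are ready.
kubectl rollout restart deployment/netperf
kubectl rollout status deployment/netperf
```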
And this time, once the pods are up and running, if we run the command again, you can see that the client and the server have had their GSO updated to 192K. Cilium supports an upper limit of 512K, but for now, by default, the BIG TCP limit is 192K. Now, if we run the performance test again: remember, the first throughput result was 6,739; now we’ve gone up to 8,577, and the P99 latency has gone from 341 down to 280 microseconds. If we run it again, the latency is lower and the throughput is higher. That’s it! Thanks very much for watching. See you later. Bye.