Marginal gains is the theory that small, incremental improvements can add up to monumental results. This methodology was applied to the training of Britain’s cycling team, leading them to great success at recent Olympic events. Cilium’s successive introductions of networking enhancements are another case study in “marginal gains.” Just as the cycling training regime extracted as much performance as possible from the athletes’ bodies and minds, every improvement introduced by Cilium – XDP, BPF host routing, BIG TCP, etc. – was aimed at getting as much performance from the network as possible.
And in modern networks, whether you are running bandwidth-hungry and latency-sensitive AI workloads or not, every millisecond counts.
Available in the 1.16 release, Cilium netkit is the latest Cilium feature designed to optimize container network performance. Cilium is the first public project providing built-in support for netkit, a Linux network device introduced in the 6.7 kernel release and developed by Isovalent engineers.
Netkit is already showing exceptional improvements:
Good news, we have finished the Proof of Concept (PoC) and netkit provided a 12% increase in CPS (Connections Per Second) compared with veth (incredible!). And now we plan to use netkit in our DC as much as possible.
Chen Tang, Cloud Computing Developer at ByteDance
Cilium netkit aims to achieve the otherwise unachievable: network performance matching between host networking and container networking.
Did it come even close? Let’s find out.
Previously in container networking – a series of compromises
The benefits of containerization and cloud native architecture have become so evident over the years that we tend to forget the compromises made along the way. It’s simple – abstraction comes at a cost. Virtual machines could never perform as fast as bare metal, but it was an acceptable trade-off: we knew virtualization provided significant scalability and flexibility benefits.
Likewise for containerization and Kubernetes: by now, we all understand the benefits they bring for infrastructure engineers and application developers alike. But to support this model, we had to make some concessions along the way, with networking performance one of the most severe.
The cumulative impact of these limitations has been hefty: a 35% drop from host network performance to container network performance.
With Cilium 1.16, the performance penalties can now be eliminated.
Let’s review some of the limitations that come with a standard Kubernetes network datapath architecture and how Cilium addresses them, concluding with the latest changes introduced by Cilium 1.16 (if you’re already familiar with Cilium, you can skip directly to the new stuff).
Roadblock #1: Kube-Proxy and iptables
In vanilla Kubernetes deployments, Kube-Proxy handles NAT and Service load balancing through iptables or IPVS. We’ve extensively covered the limitations of Kube-Proxy in a previous blog post and there is no need to repeat ourselves here, especially when the iptables creator himself considers eBPF a better fit.
Simply put: Kube-Proxy’s reliance on iptables – a technology not built for the churn and the scale of Kubernetes – adds significant overhead, as highlighted in the image below.
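To get an intuition for why chain traversal hurts at scale, here is a conceptual sketch in Python (not the actual kernel datapath; the Service count and addresses are made up) contrasting an iptables-style linear rule scan with the hash-map lookup that an eBPF-based replacement performs:

```python
# Conceptual sketch, not the kernel datapath: iptables matches Service NAT
# rules by walking a chain, so cost grows linearly with the number of
# Services; an eBPF replacement does a constant-time hash-map lookup instead.

def chain_match(rules, vip):
    """O(n): walk the rule chain until one matches, as netfilter does."""
    for i, rule_vip in enumerate(rules):
        if rule_vip == vip:
            return i
    return -1

NUM_SERVICES = 20_000
rules = [f"10.96.{i // 256}.{i % 256}" for i in range(NUM_SERVICES)]  # fake ClusterIPs
service_map = {vip: i for i, vip in enumerate(rules)}                 # stand-in for a BPF hash map

worst_case = rules[-1]                                    # last rule in the chain
assert chain_match(rules, worst_case) == NUM_SERVICES - 1 # scanned all 20,000 rules
assert service_map[worst_case] == NUM_SERVICES - 1        # one O(1) lookup
```

Every extra Service makes the worst-case chain walk longer, while the map lookup stays flat – which is exactly what shows up in benchmarks at scale.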
Shortcut #1: eBPF-based Kube-Proxy replacement
Again – Jeremy’s excellent blog post about kube-proxy replacement should tell you everything you need to know about why you should use Cilium’s eBPF-based kube-proxy replacement. Read it for more context, but as a picture speaks louder than words, compare the simplicity of Cilium’s eBPF-based Kube-Proxy replacement to the previous architecture:
eBPF’s kube-proxy replacement simply performs significantly better than the original, especially at scale:
The eBPF kube-proxy replacement is now a foundational aspect of Cilium: it’s also a requirement for the tuning features highlighted below.
Roadblock #2: upper stack forwarding
When traffic leaves a Pod, the Pod’s source IP often needs to be masqueraded to an IP of the host it leaves from.
Enforcing this form of network translation was traditionally done by sending the packet up the networking stack for the netfilter subsystem to change the IP address. Given that masquerading required the connection tracker to see traffic from both directions to avoid drops from invalid connections, the return ingress traffic from the host network card also had to go up the network stack.
Masquerading also requires consulting the kernel’s routing layer in order to know the next hop address/interface and to populate the Destination/Source MAC addresses.
In summary – masquerading traffic would cause egress and ingress traffic to be processed by the internal networking stack. This introduces unnecessary latency and processing but that’s not all: it disrupts TCP flow control mechanisms. When forwarding the traffic up the stack, the packet’s socket association becomes orphaned inside the host stack’s forwarding layer.
In other words, the system notifies the application that the traffic has left the node, even if the packet is still being processed internally. This means that the application might not realize when the system is congested, interfering with TCP backpressure and eventually causing bufferbloat and a significant impact on performance (I wrote about bufferbloat in a previous blog post about another network performance feature – BBR):
Shortcut #2: eBPF-based host routing
Given how much impact the diversion to the upper networking stack caused, the Cilium and eBPF developers implemented eBPF-based host routing to provide a more direct path for packets leaving and entering Pods. They created two Linux kernel BPF helpers – bpf_redirect_peer() and bpf_redirect_neigh() – made the appropriate changes in the Cilium code base, and released eBPF-based host routing in Cilium 1.9.

The bpf_redirect_neigh() and bpf_redirect_peer() helpers forward traffic from and to the Pod respectively, avoiding the need to push packets up the network stack.
bpf_redirect_peer() ensures that the physical NIC can push packets up the stack into the application’s socket residing in a different Pod namespace in one go.
bpf_redirect_neigh() resolves the L2 address of the next hop and redirects the packet down into the NIC driver while maintaining the packet’s socket association. This avoids orphaned sockets and ensures proper backpressure for the TCP stack.
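As a rough illustration, the two helpers are used from tc BPF programs along these lines (a simplified sketch, not Cilium’s actual code; the ifindex constants are hypothetical placeholders that an agent would resolve at attach time):

```c
// SPDX-License-Identifier: GPL-2.0
// Simplified sketch of eBPF-based host routing, NOT Cilium's real datapath.
// Compiled with clang -target bpf and attached at tc ingress on each device.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Hypothetical ifindex values, filled in by the control plane before load.
volatile const __u32 POD_IFINDEX  = 0;  // host-side device of the target Pod
volatile const __u32 PHYS_IFINDEX = 0;  // physical NIC used for egress

SEC("tc")
int ingress_to_pod(struct __sk_buff *skb)
{
    // Push the packet straight into the Pod's namespace in one go,
    // skipping the per-CPU backlog queue and upper-stack forwarding.
    return bpf_redirect_peer(POD_IFINDEX, 0);
}

SEC("tc")
int egress_from_pod(struct __sk_buff *skb)
{
    // With NULL params, the kernel performs the FIB/neighbor lookup itself,
    // then hands the packet to the NIC driver while keeping the skb's
    // socket association intact (preserving TCP backpressure).
    return bpf_redirect_neigh(PHYS_IFINDEX, NULL, 0, 0);
}

char _license[] SEC("license") = "GPL";
```

This is a kernel BPF object rather than a standalone program, so treat it as a reading aid for the helper semantics rather than something to run as-is.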
The improvement in network performance is enormous.
Still, as you can see between the orange bar (“veth + BPF host routing”) and the yellow bar (“host (baseline/best case)”), there remained a significant performance gap.
What else could we do to close it?
Roadblock #3 : undersized TCP packets
The introduction of 100+ Gbps-capable network interface cards had an unexpected consequence: how can a CPU deal with over eight million packets per second (assuming an MTU of 1,538 bytes)?

As I explained in depth in the BIG TCP on Cilium blog post, one way to reduce the number of packets is simply to group them together. While GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload) can batch packets together, we were limited to a 64K packet size by the 16-bit Length field in the IP header.
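The arithmetic behind those packet counts is easy to check (a quick Python sketch; 1,538 bytes is the on-wire frame size used above):

```python
# Back-of-the-envelope packet rates needed to saturate a 100 Gbps link.
LINK_BPS = 100e9  # 100 Gbps line rate

def packets_per_second(chunk_bytes):
    """How many packets (or aggregates) per second fill the link."""
    return LINK_BPS / 8 / chunk_bytes

mtu_rate   = packets_per_second(1538)        # ~8.1 million packets/s at MTU
gso64_rate = packets_per_second(64 * 1024)   # ~190 thousand with 64K GRO/TSO aggregates
big_rate   = packets_per_second(192 * 1024)  # ~64 thousand with 192K BIG TCP packets

assert 8.0e6 < mtu_rate < 8.2e6
assert big_rate < gso64_rate < mtu_rate
```

Each tripling of the aggregate size cuts the per-packet work the CPU must do by the same factor.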
What if we could group packets into even bigger packets?
Shortcut #3: BIG TCP
With BIG TCP on Cilium, we can now group TCP packets together into super-sized 192K packets, reducing the load on the CPU and significantly increasing performance.

Again, the results of the throughput tests were spectacular:
Read more about the history behind BIG TCP and how to use it on Cilium, or test it out yourself in the BIG TCP lab.
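For reference, BIG TCP is switched on at install time. A sketch, assuming the cilium CLI and the Helm values documented by Cilium (IPv6 BIG TCP needs a recent kernel, 5.19 or later, and a supported NIC driver):

```shell
# Sketch: enabling IPv6 BIG TCP at install time (flag names per Cilium docs).
cilium install --version 1.16.0 \
  --set ipv6.enabled=true \
  --set enableIPv6BIGTCP=true
```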
Roadblock #4: A legacy virtual cable
Most Container Network Interface (CNI) plugins, including Cilium up to the 1.15 release, would attach a Kubernetes Pod to the node it’s hosted on by a virtual ethernet device (veth). veth would typically connect a Pod in a network namespace to the node in the host network namespace and is how containers can reach other containers or external workloads.
Let’s look inside a Pod in our cluster (running Cilium 1.15) and check its network interface.
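A check along these lines (the Pod name here is a placeholder) displays the Pod-side interface:

```shell
# Sketch: inspecting the interface from inside a Pod.
kubectl exec -it my-pod -- ip -details link show eth0
# With veth, look for a name like "eth0@if11" pointing at the peer's
# interface index in the host namespace.
```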
You can see in the output above that eth0 is paired, via a virtual Ethernet (veth) interface, with another interface named if11 that lives in another namespace on the host.
When we check on the node, we can see the other side of the virtual ethernet pairing:
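A command along these lines lists the host side of the pairing (interface names vary; Cilium names the host-side devices lxc<hash>):

```shell
# Sketch: listing host-side veth devices on the node.
ip -details link show type veth
```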
While widely adopted as the standard Linux container networking technology, veth was introduced over 15 years ago in Linux kernel 2.6.24 and comes with a couple of drawbacks.
The first one is that veth relies on Layer 2 communications and on ARP for a container to talk to the other end of the veth pair. This is an artificial and unnecessary step: Pods shouldn’t have to do ARP resolutions.
More crucially, veth comes with a performance penalty. When traffic leaves the Pod for another node, or when off-node traffic ingresses into the Pod, the data first has to enter the network stack within the container, switch namespaces, and be processed in the host namespace before being handed to the network interface for transmission. This might seem innocuous, but the whole process – including sending packets through the per-CPU backlog queue – can become costly under pressure.
Shortcut #4: introducing netkit
Netkit was recently introduced into the Linux kernel (6.7) by Isovalent engineers Daniel Borkmann and Nikolay Aleksandrov. Daniel has led many of the improvements highlighted in this blog post and presented them in a recent KubeCon talk:
Netkit is based on an original concept: what if we could load BPF programs directly into the Pods and bring them even closer to the source?
One major benefit would be the ability to make networking decisions earlier. For example, for Pod egress traffic bound for a workload outside of the node, netkit would redirect to the physical device without going via the per-CPU backlog queue.
This is one of the reasons behind the phenomenal performance improvements, as you will see in the charts later.
In this first iteration, netkit devices ship as a pair, with a primary and a peer device. The primary device typically resides in the host namespace, while its peer lives inside the Pod’s namespace.
Only the primary device can manage BPF programs for itself and its peer. This is another benefit introduced by Cilium netkit: no one inside the Pod can remove the BPF programs. Since Cilium owns the BPF programs running on the Pod, this also prevents situations such as BPF programs clashing with each other or, worse, one eBPF-based application unloading BPF programs installed by another.
Cilium is the first public project to benefit from netkit and will set up netkit devices for Pods instead of veth. Cilium netkit will support L3 by default, thus removing the latency and management overhead introduced by ARP (with L2 also a supported option).
Netkit in practice: a detailed look at replacing veth
Let’s test it now. First, let’s install Cilium 1.16 with the netkit option enabled:
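A sketch using the cilium CLI (the bpf.datapathMode value is the one documented for Cilium 1.16; netkit also needs a recent kernel, 6.8 or later per the Cilium docs):

```shell
# Sketch: installing Cilium 1.16 with the netkit datapath.
cilium install --version 1.16.0 \
  --set bpf.datapathMode=netkit
```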
Let’s double-check that Cilium netkit has been enabled:
Let’s run bpftool on the Cilium agent to see which eBPF programs have been loaded. We can see that netkit programs have been loaded for our container virtual device (attached with tcx, a recent rework of the tc BPF datapath and control API, used for the other attachments):
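The command itself can be run from inside the agent Pod, for example (assuming the default DaemonSet name of cilium):

```shell
# Sketch: listing attached BPF programs from the Cilium agent.
kubectl -n kube-system exec ds/cilium -- bpftool net show
```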
Let’s get more detailed information about our network interfaces:
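For instance, with a recent iproute2:

```shell
# Sketch: showing netkit device details on the node.
ip -details link show
# Look for "netkit", "mode l3" and "peer policy blackhole" in the output.
```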
When comparing with the earlier output from Cilium 1.15, the notable differences are:

- netkit is used instead of veth
- mode l3 is now used, replacing L2 and removing the need for ARP
- peer policy blackhole means that no traffic can leak out of the Pod at any time if no BPF program is attached
Let’s now review the final performance tests (you can find more details about how they were run in Daniel’s KubeCon slides). First, latency: we now have Pod-to-Pod latency as low as host-to-host.

The same goes for throughput – TCP performance between Pods is just as high.
We’ve now reached performance parity. Let’s recap one last time how we got there:
| Limitation | Notes | Solution |
| --- | --- | --- |
| Limit #1: iptables-based Kube-Proxy | Kube-Proxy relies on iptables – a technology not designed for the scale and churn of Kubernetes. | Use the eBPF-based high-performance kube-proxy replacement. |
| Limit #2: upper stack forwarding | Traffic to and from the Pod had to be forwarded up the networking stack, causing additional latency and impacting TCP throughput. | Leverage BPF host routing to avoid traffic traversing the networking stack. |
| Limit #3: undersized TCP packets | 100G networks are driving the need for bigger TCP packets than the standard 64K size. | BIG TCP on Cilium triples the size of TCP packets as they enter the networking stack, improving performance and throughput. |
| Limit #4: a legacy virtual cable | Containers are attached via the legacy veth device, which comes with performance drawbacks, especially for off-node traffic. | Enable netkit for a fast network namespace switch for off-node traffic. |
Final Words
In a recent talk at the Linux Plumbers Conference, Martin KaFai Lau (Software Engineer at Meta) shared some recent test results, comparing the software interrupt (softirq) load of veth and netkit against the optimal baseline for live Facebook production traffic.

As you can see above, while there is a visible performance impact between veth and the baseline, netkit’s performance is indistinguishable from the host’s. This has led Martin and his team to exciting plans for netkit:
Our plan is to replace the veth usage with netkit in production.
Martin KaFai Lau, Software Engineer (Meta)
Will you be next?
Learn More
If you’re curious and you’d like to learn more, you can consult the following articles, videos and documentation.
- Article: The BPF-programmable network device
- Video: eCHO Episode 140: Cilium 1.16 with netkit devices
- Video: Turning up Performance to 11: Cilium, netKit Devices, and Going Big with TCP – Daniel Borkmann, Isovalent
- Blog: BIG TCP on Cilium
- Blog: What is Kube-Proxy and why move from iptables to eBPF?
- Documentation: Tuning Guide
- Bonus Reading: Marginal Gains: This Coach Improved Every Tiny Thing by 1 Percent and Here’s What Happened
To learn more about high-performance Kubernetes networking, talk to our experts.
Cover Photo Credit: Full Speed by Dario Belingheri
Prior to joining Isovalent, Nico worked in many different roles—operations and support, design and architecture, and technical pre-sales—at companies such as HashiCorp, VMware, and Cisco.
In his current role, Nico focuses primarily on creating content to make networking a more approachable field and regularly speaks at events like KubeCon, VMworld, and Cisco Live.