Cilium netkit: The Final Frontier in Container Networking Performance

Nico Vibert

Marginal gains is the theory that small yet significant improvements can add up to monumental results. This methodology was applied to the British cycling team’s training, leading them to great success at recent Olympic events. Cilium’s successive networking enhancements are another case study in “marginal gains.” Just as the cycling training regime extracted as much performance as possible from the athletes’ bodies and minds, every improvement introduced by Cilium (XDP, BPF host routing, BIG TCP, and others) was aimed at getting as much performance from the network as possible.

And in modern networks, whether you are running bandwidth-hungry and latency-sensitive AI workloads or not, every millisecond counts.

Available in the 1.16 release, Cilium netkit is the latest Cilium feature designed to optimize container network performance. Cilium is the first public project providing built-in support for netkit, a Linux network device introduced in the 6.7 kernel release and developed by Isovalent engineers.

Netkit is already showing exceptional improvements:

Good news, we have finished the Proof of Concept (PoC) and netkit provided a 12% increase in CPS (Connections Per Seconds) compared with veth (incredible!). And now we plan to use netkit in our DC as much as possible.

Chen Tang, Cloud Computing Developer at ByteDance

Cilium netkit aims to achieve the otherwise unachievable: container networking performance that matches host networking performance.

Did it come even close? Let’s find out.

Previously in container networking – a series of compromises

The benefits of containerization and cloud native architecture have become so evident over the years that we tend to forget the compromises made along the way. It’s simple – abstraction comes at a cost. Virtual machines could never perform as fast as bare metal, but it was an acceptable trade-off: we knew virtualization provided significant scalability and flexibility benefits.

Likewise for containerization and Kubernetes: by now, we all understand the benefits they bring for infrastructure engineers and application developers alike. But to support this model, we had to make some concessions along the way, with networking performance one of the most severe.

Together, these limitations cost container networking a hefty 35% in performance compared with host networking.

With Cilium 1.16, the performance penalties can now be eliminated.

Let’s review some of the limitations that come with a standard Kubernetes network datapath architecture and how Cilium addresses them, concluding with the latest changes introduced by Cilium 1.16 (if you’re already familiar with Cilium, you can skip directly to the new stuff).

Roadblock #1: Kube-Proxy and iptables

In vanilla Kubernetes deployments, Kube-Proxy would handle NAT and Service Load Balancing through the iptables or ipvs systems. We’ve extensively talked about the limitations of Kube-Proxy in a previous blog post and there is no need to repeat ourselves here, especially when the iptables creator himself considers eBPF to be more adequate.

Simply put: Kube-Proxy’s reliance on iptables – a technology not built for the churn and the scale of Kubernetes – adds significant overhead, as highlighted in the image below.

Shortcut #1: eBPF-based Kube-Proxy replacement

Again – Jeremy’s excellent blog post about kube-proxy replacement should tell you everything you need to know about why you should use Cilium’s eBPF-based kube-proxy replacement. Read it for more context, but since a picture speaks louder than words, compare the simplicity of Cilium’s eBPF-based Kube-Proxy replacement to the previous one:

eBPF’s kube-proxy replacement simply performs significantly better than the original, especially at scale:

The eBPF kube-proxy replacement is now a foundational aspect of Cilium: it’s also a requirement for the tuning features highlighted below.

Roadblock #2: upper stack forwarding

When traffic leaves a Pod, it is often required to masquerade the Pod’s source IP to an IP on the host it leaves on.

Masquerading

Enforcing this form of network translation was traditionally done by sending the packet up the networking stack for the netfilter subsystem to change the IP address. Given that masquerading required the connection tracker to see traffic from both directions to avoid drops from invalid connections, the return ingress traffic from the host network card also had to go up the network stack.

Masquerading also requires consulting the kernel’s routing layer in order to know the next hop address/interface and to populate the Destination/Source MAC addresses.

Upper Stack Forwarding causes orphaned sockets

In summary – masquerading traffic would cause egress and ingress traffic to be processed by the internal networking stack. This introduces unnecessary latency and processing but that’s not all: it disrupts TCP flow control mechanisms. When forwarding the traffic up the stack, the packet’s socket association becomes orphaned inside the host stack’s forwarding layer.

In other words, the system notifies the application that the traffic has left the node even though the packet is still being processed by the host stack. As a result, the application might not realize that the system is congested, which affects TCP backpressure, eventually causes bufferbloat, and significantly degrades performance (I wrote about bufferbloat in a previous blog post about another network performance feature, BBR).

Shortcut #2: eBPF-based host routing

Given how much impact the diversion to the upper networking stack caused, the Cilium and eBPF developers looked at implementing eBPF-based host routing to provide a more direct route for packets leaving and entering Pods. They added two BPF helpers to the Linux kernel, bpf_redirect_peer() and bpf_redirect_neigh(), made the corresponding changes in the Cilium code base, and released eBPF-based host routing in Cilium 1.9.

bpf_redirect_neigh() and bpf_redirect_peer() forward traffic from and to the Pod, respectively, and avoid the need to push packets up the network stack.

bpf_redirect_peer() ensures that the physical NIC can push packets up the stack into the application’s socket residing in a different Pod namespace in one go.

bpf_redirect_neigh() will resolve the L2 address of the next hop and redirect the packet down into the NIC driver while maintaining the packet’s socket association. This avoids orphaned sockets and ensures proper backpressure for the TCP stack.

The improvement in network performance is enormous.

Still, as you can see between the orange bar (“veth + BPF host routing”) and the yellow bar (“host (baseline/best case)”), there remained a significant performance gap.

What else could we do to close it?

Roadblock #3: undersized TCP packets

The introduction of 100+ Gbps-capable network interface cards had an unexpected consequence: how can a CPU deal with eight million packets per second (assuming 1,538 bytes on the wire per packet, that is, a 1,500-byte MTU plus Ethernet framing overhead)?
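
As a rough back-of-the-envelope check of that figure, the sketch below divides the line rate by the number of bits each packet occupies. It assumes that 100 Gbps means 100 × 1,000,000,000 bits per second and that every packet takes 1,538 bytes; packets_per_second is a hypothetical helper name used only for illustration.

```shell
# Back-of-the-envelope packet rate: line rate divided by bits per packet.
# packets_per_second is a hypothetical helper, not part of any tool.
packets_per_second() {
    gbps=$1     # link speed in Gbps (assumed: 1 Gbps = 1,000,000,000 bits/s)
    bytes=$2    # bytes each packet occupies on the wire
    echo $(( gbps * 1000000000 / (bytes * 8) ))
}

packets_per_second 100 1538   # prints 8127438, roughly eight million
```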

As I explained in depth in the BIG TCP on Cilium blog post, one way to reduce the number of packets is simply to group them together. While GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload) can batch packets together, we were limited to a 64K packet size by the 16-bit length field in the IP header.

What if we could group packets into even bigger packets?

Shortcut #3: BIG TCP

With BIG TCP on Cilium, we can now group TCP packets together into a super-sized 192K packet, reducing the impact on the CPU and significantly increasing performance.
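
To get a feel for the effect, the sketch below compares how many aggregated packets the stack would have to handle per second at 100 Gbps when batches are capped at 64K versus 192K. It is an idealized estimate that assumes a fully loaded link, perfect aggregation, and K = 1,024 bytes; aggregates_per_second is a hypothetical helper name.

```shell
# Idealized count of aggregated packets per second for a given batch size.
# aggregates_per_second is a hypothetical helper, not part of any tool.
aggregates_per_second() {
    gbps=$1      # link speed in Gbps (assumed: 1 Gbps = 1,000,000,000 bits/s)
    kbytes=$2    # aggregated packet size in K (assumed: 1K = 1,024 bytes)
    echo $(( gbps * 1000000000 / (kbytes * 1024 * 8) ))
}

aggregates_per_second 100 64    # prints 190734
aggregates_per_second 100 192   # prints 63578, about a third as many
```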

Again, the results of the throughput tests were spectacular:

Read more about the history behind BIG TCP and how to use it on Cilium, or test it out yourself in the BIG TCP lab.

Roadblock #4: A legacy virtual cable

Most Container Network Interface (CNI) plugins, including Cilium up to the 1.15 release, would attach a Kubernetes Pod to the node it’s hosted on by a virtual ethernet device (veth). veth would typically connect a Pod in a network namespace to the node in the host network namespace and is how containers can reach other containers or external workloads.

Let’s look inside a Pod in our cluster (running Cilium 1.15) and check its network interface.

# ip a show eth0
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 9a:f9:bd:42:0c:ae brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.2.114/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::98f9:bdff:fe42:cae/64 scope link 
       valid_lft forever preferred_lft forever

# ip -d l
[...]
11: lxc3c34280cf99e@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 86:fe:01:80:08:ac brd ff:ff:ff:ff:ff:ff link-netns cni-8ba2017a-96bd-f18d-d09a-8e48c9284121 promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535 
    veth addrgenmode eui64 numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536

You can see in the output above that eth0 is one end of a virtual Ethernet (veth) pair: the @if11 suffix points to its peer, the interface with index 11, which lives on the host in another namespace.

When we check on the node, we can see the other side of the virtual ethernet pairing:

11: lxcfd04d472bcde@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 2e:f9:fc:35:d1:95 brd ff:ff:ff:ff:ff:ff link-netns cni-f24a815a-65e8-fc9e-be8e-c62d419c8b6f
    inet6 fe80::2cf9:fcff:fe35:d195/64 scope link 
       valid_lft forever preferred_lft forever
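
The @ifN suffix carries the interface index of the peer device, so the two ends of a pair can be matched from `ip` output alone. The sketch below shows one way to do that; peer_ifindex and ifname_by_index are hypothetical helper names, and they only parse text shaped like the sample lines shown in this post.

```shell
# Hypothetical helpers for matching the two ends of a veth pair from `ip` output.

# stdin: first line of `ip link show <dev>`; prints the N from "<dev>@ifN".
peer_ifindex() {
    sed -n 's/^[0-9][0-9]*: [^@:]*@if\([0-9][0-9]*\):.*/\1/p'
}

# $1: interface index; stdin: `ip -o link` output; prints the matching name.
ifname_by_index() {
    awk -v idx="$1" -F': ' '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Sample lines copied from the Pod and node outputs in this post.
pod_line='10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000'
host_line='11: lxcfd04d472bcde@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000'

idx=$(printf '%s\n' "$pod_line" | peer_ifindex)
printf '%s\n' "$host_line" | ifname_by_index "$idx"   # prints lxcfd04d472bcde
```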

While veth is widely adopted for Linux container networking, it was introduced over 15 years ago in Linux kernel 2.6.24 and comes with a couple of drawbacks.

The first one is that veth relies on Layer 2 communications and on ARP for a container to talk to the other end of the veth pair. This is an artificial and unnecessary step: Pods shouldn’t have to do ARP resolutions.

More crucially, veth comes with a performance penalty. When traffic leaves the Pod for another node, or when off-node traffic enters the Pod, the data first has to enter the network stack within the container, switch namespaces, and be processed in the host namespace before the network interface handles it for transmission. This might seem innocuous, but the whole process, including sending packets through the per-CPU backlog queue, can become costly under pressure.

Shortcut #4: introducing netkit

Netkit was recently introduced into the Linux kernel (6.7) by Isovalent engineers Daniel Borkmann and Nikolay Aleksandrov. Daniel has led many of the improvements highlighted in this blog post and presented them in a recent KubeCon talk.

Netkit is based on an original concept: what if we could load BPF programs directly into the Pods and bring them even closer to the source?

One major benefit would be the ability to make networking decisions earlier. For example, for Pod egress traffic bound for a workload outside of the node, netkit would redirect to the physical device without going via the per-CPU backlog queue.

This is one of the reasons behind the phenomenal performance improvements, as you will see in the charts later.

In this first iteration, netkit devices ship as a pair, with a primary and a peer device. The primary device typically resides in the host namespace, while its peer lives inside the Pod’s namespace.

Only the primary device can manage BPF programs for itself and its peer. This is another benefit of Cilium netkit: no one inside the Pod can remove the BPF programs. As Cilium owns the BPF programs attached to the Pod, this also prevents situations such as BPF programs clashing with each other or, worse, one eBPF-based application unloading BPF programs installed by another.

Cilium is the first public project to benefit from netkit and will set up netkit devices for Pods instead of veth. Cilium netkit will support L3 by default, thus removing the latency and management overhead introduced by ARP (with L2 also a supported option).

Netkit in practice: a detailed look at replacing veth

Let’s test it now. First, let’s install Cilium 1.16 with the netkit option enabled:

$ cilium install --version 1.16.0-rc.1 --namespace kube-system \
--set routingMode=native \
--set bpf.masquerade=true \
--set kubeProxyReplacement=true \
--set ipam.mode=kubernetes \
--set autoDirectNodeRoutes=true \
--set ipv4NativeRoutingCIDR="10.0.0.0/8" \
--set bpf.datapathMode=netkit

Let’s double-check that Cilium netkit has been enabled:

$ kubectl exec -n kube-system -it -c cilium-agent cilium-2qjjb -- cilium status | grep "Device Mode"
Device Mode:            netkit

Let’s run bpftool on the Cilium agent to see which eBPF programs have been loaded. We can see that netkit programs have been loaded for our container virtual device (with tcx, a recent rework of the tc BPF data and control used for other attachments):

$ kubectl exec -n kube-system -it -c cilium-agent cilium-2qjjb -- bpftool net
xdp:

tc:
cilium_net(2) tcx/ingress cil_to_host prog_id 3002 link_id 42 
cilium_host(3) tcx/ingress cil_to_host prog_id 2985 link_id 40 
cilium_host(3) tcx/egress cil_from_host prog_id 2986 link_id 41 
eth0(9) tcx/ingress cil_from_netdev prog_id 3006 link_id 43 
eth0(9) tcx/egress cil_to_netdev prog_id 3012 link_id 44 
lxc07d02daab9b3(108) netkit/peer cil_from_container prog_id 4416 link_id 207 
lxc_health(110) netkit/peer cil_from_container prog_id 4429 link_id 208 

flow_dissector:

netfilter:

Let’s get more detailed information about our network interfaces:

$ kubectl exec -it pod-worker -- ip -d l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 allmulti 0 minmtu 0 maxmtu 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 gso_ipv4_max_size 65536 gro_ipv4_max_size 65536 
107: eth0@if108: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535 
    netkit mode l3 type peer policy blackhole numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 gso_ipv4_max_size 65536 gro_ipv4_max_size 65536 

When comparing with the earlier output when running Cilium 1.15, the notable differences are:

  • netkit is used instead of veth
  • mode l3 is now used, replacing L2 and removing the need for ARP
  • peer policy blackhole means that no traffic can leak out the Pod at any time if no BPF program is attached.
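
To spot these differences quickly across several Pods, the device kind can be pulled out of the `ip -d link` output. The following is a minimal sketch: link_kinds is a hypothetical helper name, and the sample input is an abridged copy of the output above.

```shell
# link_kinds is a hypothetical helper: it reads `ip -d link` output on stdin
# and prints "<ifname> <kind>" for each veth or netkit device it finds.
link_kinds() {
    awk '
        /^[0-9]+: /                    { name = $2; sub(/[@:].*$/, "", name) }
        $1 == "veth" || $1 == "netkit" { print name, $1 }
    '
}

# Abridged sample taken from the Pod output above.
link_kinds <<'EOF'
107: eth0@if108: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
    netkit mode l3 type peer policy blackhole numtxqueues 1 numrxqueues 1
EOF
# prints: eth0 netkit
```

Against a live cluster, the same function could be fed from the Pod directly, for example with kubectl exec pod-worker -- ip -d link | link_kinds.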

Let’s now review the final performance tests (you can find more information about how they were run in Daniel’s KubeCon slides). First, latency: we now have Pod-to-Pod latency as low as host-to-host.

The same goes for throughput: TCP performance is as high as on the host.

We’ve now reached performance parity. Let’s recap one last time how we got there:

Limitation | Notes | Solution
Limit #1: iptables-based Kube-Proxy | Kube-proxy relies on iptables – a technology not designed for the scale and churn of Kubernetes. | Use eBPF-based high performance kube-proxy replacement.
Limit #2: upper stack forwarding | Traffic to and from the Pod had to be forwarded up the networking stack, causing additional latency and impacting TCP throughput. | Leverage BPF-host routing to avoid traffic traversing the networking stack.
Limit #3: undersized TCP packets | 100G networks are driving the need for bigger TCP packets than the standard 64K size. | BIG TCP on Cilium triples the size of TCP packets as they enter the networking stack, improving performance and throughput.
Limit #4: a legacy virtual cable | Containers are attached via the legacy veth device, which comes with performance drawbacks, especially for off-node traffic. | Enable netkit for fast network namespace switch for off-node traffic.

Final Words

In a recent talk at the Linux Plumbers Conference, Martin KaFai Lau (Software Engineer at Meta) shared some recent test results, comparing the software interrupt (softirq) load of veth and netkit against the optimal baseline for live Facebook production traffic.

As you can see above, while we can observe an impact on performance between veth and baseline, netkit’s performance is indistinguishable from the host. This is leading Martin and his team to exciting plans for netkit:

Our plan is to replace the veth usage with netkit in production.

Martin KaFai Lau, Software Engineer (Meta)

Will you be next?

Cover Photo Credit: Full Speed by Dario Belingheri

Author: Nico Vibert, Senior Staff Technical Marketing Engineer
