
Cilium 1.13 – Gateway API, mTLS datapath, Service Mesh, BIG TCP, SBOM, SNI NetworkPolicy, …

Thomas Graf

Cilium 1.13 is here and it’s packed with exciting new features! This release brings you a fully-conformant Gateway API implementation. If you don’t feel like switching over to Gateway API just yet, you can take a look at the new annotations that enable L7 load-balancing, such as per-request gRPC balancing, on plain Kubernetes services. The mTLS datapath has been merged, building the foundation for Cilium’s upcoming proxy-free mTLS implementation, which will be available for integration with many different identity management providers. On the networking side, we have implemented BIG TCP, IPAM for LoadBalancer Kubernetes Services, SCTP support, further NAT46/64 improvements, and are giving you a preview into the upcoming veth replacement that will make container networking as fast as host networking.

We are never standing still on the security side: the supply chain security of Cilium has been improved by signing all images with cosign and by creating SBOMs for each image. Even more exciting, NetworkPolicy can now match TLS SNI server names. Tetragon’s adoption and feature set continue to expand, with file integrity monitoring and enhanced L3/L4 network observability.

In Cilium 1.13, we also begin to see the fruits of the partnership between Isovalent and Grafana, with Grafana users benefiting from more detailed insights into the network traffic and Grafana Tempo tracing data enriched with Hubble L7 HTTP Metrics.

Let’s review some of the major changes.

Service Mesh and Ingress

Gateway API

Cilium now provides a fully conformant Gateway API implementation. Gateway API is the new standard for North-South load-balancing and traffic routing into Kubernetes clusters and is the long-term successor to the Ingress API. We plan on maintaining and expanding the Ingress support introduced in the Cilium 1.12 release (see below for some of the new Ingress features), but Gateway API represents the future of traffic management.

The development of Gateway API stemmed from the realization that the Ingress API had some limitations: Firstly, it doesn’t provide the advanced load-balancing capabilities users need; it only natively supports simple content-based request routing of HTTP traffic.

Secondly, it became impractical for users to manage: vendors addressed the lack of functionality in the Ingress API by leveraging annotations. But annotations, while extremely powerful, ended up creating inconsistencies from one Ingress implementation to another.

Thirdly, as the Ingress API is a single API resource, it suffers from operational constraints: it simply is not well-suited for multi-team clusters with shared load-balancing infrastructure.

The Gateway API was designed from the ground up to address the Ingress API limitations. The Gateway API is a Kubernetes SIG-Network project, and its contributors include our own Nick Young.

The Gateway API was designed to cover all core routing requirements and to apply the operational lessons learned from 5+ years of using Ingress resources. One of its key design principles is that it is role-oriented: the API resource model reflects the separation of responsibilities involved in creating and managing traffic infrastructure. It enables network architects to build shared networking infrastructure that can be used by many different, non-coordinating teams.

In Cilium 1.13, the Cilium Gateway API passes all Gateway API conformance tests (v0.5.1) and supports use cases such as:

  • HTTP Routing
  • TLS Termination
  • HTTP Traffic Splitting / Weighting
  • HTTP Header Modification

A Gateway can simply be deployed with the following manifest:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
spec:
  gatewayClassName: cilium
  listeners:
  - protocol: HTTP
    port: 80
    name: web-gw
    allowedRoutes:
      namespaces:
        from: Same

App developers can then deploy specific routes to their applications and the Gateway will route the traffic accordingly:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: http-app-1
spec:
  parentRefs:
  - name: my-gateway
    namespace: default
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /details
    backendRefs:
    - name: details
      port: 9080

Let’s review a couple of Gateway API demos, starting with a short walkthrough of Gateway API.

Another use case for Gateway API is HTTP traffic splitting. When introducing a new version of an app, operators often push some of the traffic to a new backend (while keeping the rest of the user traffic on the existing backend) and observe how users react to the changes. This approach is also known as A/B testing, blue-green deployments, or canary releases.
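
With Gateway API, traffic splitting is expressed as weights on the HTTPRoute backendRefs. As a minimal sketch (the Service names, port, and weights below are hypothetical), a 90/10 split looks like this:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: traffic-split
spec:
  parentRefs:
  - name: my-gateway
    namespace: default
  rules:
  - backendRefs:
    # 90% of requests stay on the current backend
    - name: echo-v1
      port: 8080
      weight: 90
    # 10% of requests are sent to the new backend
    - name: echo-v2
      port: 8080
      weight: 10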

With Cilium Gateway API weights, you can do this natively – no need to install another tool or service mesh, as the following video shows:

For users interested in migrating from Ingress API to Gateway API, we have tested the experimental Ingress2Gateway tool that helps users migrate their Ingress configuration to Gateway API configuration. While still a prototype, the tool accurately converted simple Ingress Resources to Gateway API Resources.

Watch this brief video to learn more:

We will continue working closely on the Gateway API project and the GAMMA initiative to ultimately help users simplify their network traffic requirements, whether North-South (ingress into the cluster) or East-West (within the cluster).

Meanwhile, to learn more you can read the Cilium docs or try the Cilium Gateway API lab.


L7 Load-Balancing for Kubernetes Services with annotations

In Cilium 1.13, you can now use Cilium’s embedded Envoy proxy to achieve L7 load-balancing for existing Kubernetes services, with a simple annotation.

Simply apply the annotation "service.cilium.io/lb-l7": "enabled" and Cilium’s embedded Envoy proxy will automatically apply L7 load-balancing for Kubernetes services.
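
As a minimal sketch, enabling this on a Service only requires the annotation shown above (the Service name, selector, and gRPC port here are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: grpc-backend
  annotations:
    # Ask Cilium's embedded Envoy to load-balance this Service at L7
    service.cilium.io/lb-l7: "enabled"
spec:
  selector:
    app: grpc-backend
  ports:
  - name: grpc
    port: 50051
    targetPort: 50051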

This feature works within a cluster (for East-West traffic) but also across clusters (it works seamlessly with Cilium ClusterMesh).

This feature came from user requests around the operational complexity involved with gRPC load-balancing. As explained in this excellent blog post on gRPC load balancing, gRPC load-balancing is not natively supported by Kubernetes, and an additional tool – such as a proxy or a service mesh – is typically required. That is because the load-balancing decision has to be made at Layer 7, not at Layer 3/4.

Many users didn’t want to have to install an additional tool for this use case – they ideally wanted Cilium to apply L7 load-balancing simply by adding an annotation to a Kubernetes service. With Cilium 1.13, we are leveraging Cilium’s embedded Envoy proxy to provide this capability.

It’s worth mentioning that we’ve been huge fans of the Envoy proxy for a long time – we presented our Cilium & Envoy integration as far back as EnvoyCon 2018 – and Envoy is a critical piece of our service mesh architecture. 

In the demo below, we:

  • Deploy a gRPC-based application
  • Make gRPC requests to a backend service, without Envoy.
  • Enable L7 load-balancing with Envoy and retry the gRPC requests. We can see in the headers that the traffic went through Envoy, and we can verify with Hubble that the traffic is forwarded to the proxy.

If you’d like to learn more, try the lab below or check out the Cilium docs.


Shared LoadBalancer for Ingress Resources

We have seen fast adoption of Cilium’s Ingress support since its launch in Cilium 1.12. Engineers from Isovalent and the community have contributed some additional functionality to the Ingress feature.

Let’s start with a simple feature that will provide evident cost benefits for users: Ingress Resources can now share a Kubernetes LoadBalancer Resource.

In 1.12, for every Ingress Resource created, a LoadBalancer Resource would be created and dedicated to that Ingress and an external IP would be allocated. And we all know that cloud load balancers and public IPs do not come for free.

In 1.13, Cilium Ingress can also be deployed in Shared LoadBalancer mode. In this mode, all Ingress Resources share the same LoadBalancer Resource and the same IP, drastically reducing the overall cost for cloud engineers.
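
If you deploy Cilium with Helm, the mode is a chart value; here is a minimal sketch, assuming the ingressController.loadbalancerMode value exposed by the 1.13 Helm chart:

# values.yaml – Cilium Ingress with a single, shared LoadBalancer
ingressController:
  enabled: true
  # "shared": all Ingress Resources share one LoadBalancer and one IP
  # "dedicated": one LoadBalancer per Ingress (the 1.12 behaviour)
  loadbalancerMode: shared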

To learn more, head over to the Isovalent Resource Library for a demo or read the docs.

mTLS datapath

Many users of existing service meshes consider pod mutual authentication to be a critical feature of a service mesh, and are looking for Cilium to provide an implementation of this feature as part of the Cilium Service Mesh offering. In Cilium 1.13, we are introducing mTLS support on the datapath level. This enables Cilium to authenticate endpoints on peer nodes in the cluster and control data plane connectivity based on successful mutual authentication.

While this is not a user-facing change for now, it builds the foundation for future development of mTLS to become user-ready. We wrote a lengthy post on how we envision mutual authentication with Cilium Service Mesh shortly before Cilium 1.12 was released and with today’s release we are pleased to share our progress.

Existing mutual authentication implementations put several restrictions upon users in order to attain additional security guarantees. These can range from requiring apps to use TCP+TLS for authentication/encryption to requiring (sidecar) proxies in the data plane. These restrictions increase the baseline complexity and processing cost of implementing mutual authentication. We believe that by combining the best of session-based authentication and network-based authentication, we can provide a secure implementation that is faster and more flexible than prior implementations.

Our long-term goals are:

  • To provide a mechanism for pods to opt into peer authentication
  • Pluggable for existing certificate management systems (SPIFFE, Vault, SMI, Istio, cert-manager, …)
  • Configurable certificate granularity
  • Leverage existing encryption protocol support for the data plane

If you are not familiar with the Cilium Service Mesh, you can go and try it out here:


Networking

BIG TCP

Cilium is being adopted by many cloud providers, financial institutions, and telecommunications providers that all have something in common: they want to extract as much performance from the network as possible and are constantly on the lookout for marginal performance gains.

These organizations are building networks capable of 100Gbps and beyond but with the adoption of 100Gbps network adapters comes the inevitable challenge: how can a CPU deal with eight million packets per second (assuming an MTU of 1,538 bytes)? That leaves only 120 nanoseconds per packet for the system to handle, which is unrealistic. There is also a significant overhead with handling all these packets from the interface to the upper protocol layer. 

What if we could group packet payloads into super-sized packets? Bigger packets would mean less overhead, theoretically improving throughput and reducing latency.

That’s what BIG TCP can do.

With Cilium 1.13, you can now use BIG TCP on your cluster and benefit from enhanced performance.
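
Enabling it is a configuration switch. Here is a minimal Helm values sketch, assuming the enableIPv6BIGTCP value from the 1.13 chart (BIG TCP in 1.13 requires the IPv6 datapath and compatible NIC drivers):

# values.yaml – enable BIG TCP (IPv6 datapath required in 1.13)
ipv6:
  enabled: true
enableIPv6BIGTCP: true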

If you would like to learn more about it, read this blog post, check out the Cilium docs, watch the video below or try out the new lab:

veth replacement

Ever since the project’s creation, Cilium engineers have intended to re-imagine Linux networking. Leveraging eBPF enabled high-performance networking use cases such as:

  • Kube-proxy replacement, which enabled users to overcome iptables limitations.
  • Support for XDP, which provided high-performance packet processing by bypassing parts of the kernel network stack.
  • Virtual Ethernet Device Optimization in Cilium 1.9, which further optimized traffic flows within the stack.

While the gains of bypassing the upper network stack can be phenomenal (as highlighted in our previous case study with Seznam), there is still room for improvement.

The pinnacle for Pod networking performance is to be as fast as the host network. With the upcoming meta devices, we are on the verge of reaching our goal.

How? By executing the eBPF program inside the Pod.

The new component, called meta device, is a replacement for veth devices. The goal of meta devices is to have eBPF programs as part of the device inside the Pod, controlled from the host namespace. This helps to reduce latency even further. With meta devices, the eBPF programs are shifted from the Linux Traffic Control (TC) layer to the device itself, eliminating additional queuing and rescheduling. The main goal is to reduce latency and increase performance, making it similar to running the application on the host.

While not yet available on Cilium 1.13, the preliminary test results exceeded expectations: we saw Pod performance on par with the host. To find out more about benchmark testing and to learn more about meta netdevices, watch this recent FOSDEM 2023 talk:

We hope to introduce meta devices in a future Cilium release – meanwhile, we would love to hear from you.

IPAM for LoadBalancer Services and BGP Services Advertisement

In Cilium 1.13, we are expanding the features of Cilium that enable integration and connectivity with external workloads and environments. We are introducing two major features that simplify networking operations and facilitate integration between your existing network fabric and your Cilium-managed Kubernetes clusters: LoadBalancer IP Address Management and BGP Service Advertisement.

LoadBalancer IP Address Management (LB-IPAM) is an elegant new feature that lets Cilium provision IP addresses for Kubernetes LoadBalancer Services. To allocate IP addresses to Kubernetes Services that are exposed outside of a cluster, you need a Service of type LoadBalancer.

When you use Kubernetes on a cloud provider, these resources are automatically managed for you, and their IPs and/or DNS names are automatically allocated. However, if you run a bare-metal cluster, you previously needed another tool like MetalLB to allocate that address. Maintaining yet another networking tool can be cumbersome, and in Cilium 1.13 it is no longer needed: Cilium can allocate IP addresses to Kubernetes LoadBalancer Services.
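
For illustration, here is a minimal pool sketch, assuming the CiliumLoadBalancerIPPool fields of the v2alpha1 API shipped with 1.13 (the CIDR and selector are example values):

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: blue-pool
spec:
  cidrs:
  # LoadBalancer Service IPs are allocated from this example range
  - cidr: "20.0.10.0/24"
  # Optional: only hand out IPs to Services matching this selector
  serviceSelector:
    matchLabels:
      color: blue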

In the quick demo below, we deploy a Cilium IP Pool, and our LoadBalancer Service automatically picks up an IP address from that pool. We also show how to be more specific and assign IP addresses from a specific range, depending on labels, Service names, or namespaces.

Learn more by watching this short video below, take the lab, or read the official docs:

In addition to this feature, we are also introducing BGP Service Advertisement.

In Cilium 1.12, we introduced IPv6 support in Cilium BGP. That support was made possible by leveraging GoBGP as the BGP engine, with a long-term view of replacing MetalLB as the engine for BGP route advertisement. At the time, only Pod CIDRs could be advertised over BGP.

In 1.13, Cilium BGP was enhanced with the introduction of Service address advertisement. Working seamlessly with the LB-IPAM feature, it lets users advertise the IP addresses of Kubernetes LoadBalancer Services over BGP.

This feature works for both IPv4 and IPv6. 
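
As a sketch of how the pieces fit together, assuming the CiliumBGPPeeringPolicy CRD as it ships in 1.13 (the ASNs, peer address, and selectors are example values):

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack0
spec:
  # Only nodes with this label run this virtual router
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 64512
    exportPodCIDR: true
    # Advertise LoadBalancer Service IPs (e.g. allocated by LB-IPAM) matching this selector
    serviceSelector:
      matchLabels:
        color: blue
    neighbors:
    - peerAddress: "10.0.0.1/32"
      peerASN: 64512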

Learn more by watching this short video below:

You can also read the official Cilium docs or take the lab:


NAT46/64

When Cilium was initially built, it was actually designed to run only on IPv6, but due to customer demand, it became evident that IPv4 support was required. But it seems that – and apologies if you’ve heard this many times before – IPv6 is finally reaching an unprecedented level of maturity and adoption.

But with IPv6 adoption comes the inescapable T-word: Transition.

Dual Stack is one way to address it and it’s great news that IPv4/IPv6 Dual Stack support is now GA with Kubernetes 1.23. We fully support this model. Read this tutorial or do this lab to learn more about Dual Stack and IPv6 with Cilium and Hubble.

However, Dual Stack should remain a stop-gap solution and operators should aspire to a full IPv6 environment. But even if you, as an architect or engineer, aspire to build an IPv6 Kubernetes cluster, you will soon realize that the rest of the world is not quite ready for you. Even some of the most used services are not IPv6 ready.

Initially introduced in Cilium 1.12, NAT46/64 on Cilium has seen significant enhancements in 1.13. 

With the Cilium 1.13 NAT46/64 functionality, users can deploy IPv6-only Kubernetes clusters that can access IPv4-only systems (with NAT64) or be reachable from IPv4-only clients (with NAT46).

Let’s review a couple of different use cases:

In the first scenario, User A wants to run an IPv6-only cluster but knows that some IPv4-only clients will require access to the clusters.

In the illustration below, the external client would make a DNS request for foo.com, which would be resolved to 1.2.3.4. The Cilium NAT46 Gateway, listening on the 1.2.3.4 address, would receive the traffic and would leverage the well-known 64:ff9b: prefix to translate to an IPv6 address before forwarding the packet to an Internal Kubernetes Service.

In the second scenario, User B wants to run an IPv6-only cluster but needs to egress to some IPv4-only sites. 

NAT64 actually requires DNS64 to address this issue.

When an IPv6 client looks up an AAAA record and none is available, we can leverage a DNS64 server (like the public Google ones or coreDNS) to get a satisfactory DNS resolution. The following listing shows an example:

nicovibert:~$ nslookup twitter.com
Server:         192.168.1.254
Address:        192.168.1.254#53

Non-authoritative answer:
Name:	twitter.com
Address: 104.244.42.193

nicovibert:~$ nslookup -query=AAAA twitter.com
Server:         192.168.1.254
Address:        192.168.1.254#53

Non-authoritative answer:
*** Can't find twitter.com: No answer

nicovibert:~$ nslookup -query=AAAA twitter.com 2001:4860:4860::6464
Server:		2001:4860:4860::6464
Address:	2001:4860:4860::6464#53

Non-authoritative answer:
Name:	twitter.com
Address: 64:ff9b::68f4:2ac1

Let’s explain what the 64:ff9b::68f4:2ac1 IPv6 address is: the DNS64 server receives the IPv4 address from the authoritative DNS A server and synthesizes a AAAA record by prefixing the address with its NAT64 prefix (64:ff9b::) and converting the IPv4 address to hexadecimal (104 is 0x68 in hexadecimal, 244 is 0xF4, 42 is 0x2A and 193 is 0xC1).

When the IPv6 client’s traffic to 64:ff9b::68f4:2ac1 reaches the Cilium NAT64 Gateway, Cilium will automatically translate the destination IP address to the original IPv4 address 104.244.42.193.

The outcome is an effortless user experience – an IPv6 client can seamlessly access the service over IPv6:

root@zh-lab-node-2:/tmp# curl -6 --head https://twitter.com/ciliumproject/status/1625890412853338114
HTTP/1.1 200 OK
date: Wed, 15 Feb 2023 02:02:26 GMT
perf: 7626143928
expiry: Tue, 31 Mar 1981 05:00:00 GMT
[...]

You can learn more about IPv4, IPv6, Dual Stack, and Cilium in our IPv6 tutorial and in our IPv6 lab.


Introductory support for SCTP on Kubernetes

SCTP is a transport layer protocol often used in the telecommunications industry to support voice-over-IP (VoIP) and other real-time services.

While SCTP support became generally available in Kubernetes 1.20, a CNI still needs to support it. With Cilium 1.13, Cilium can be that CNI.

For now, Cilium 1.13 provides basic SCTP support for:

  • Pod <-> Pod communication
  • Pod <-> Service communication
  • Pod <-> Pod communication with network policies applied to SCTP traffic (see the example policy below)
  • SCTP flow monitoring with Hubble
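
Because SCTP is a first-class protocol in the Kubernetes NetworkPolicy API, policies for SCTP look just like their TCP counterparts. A minimal sketch (labels and port are hypothetical):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-sctp-ingress
spec:
  # The pods receiving SCTP traffic
  podSelector:
    matchLabels:
      app: sctp-server
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: sctp-client
    ports:
    # SCTP can be used wherever TCP or UDP would be
    - protocol: SCTP
      port: 9999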

If you want to learn more about what SCTP looks like in action, watch the following demo where we connect, observe, and secure SCTP traffic:

You can also try out the lab or read the official Cilium docs on SCTP.


Kubernetes Internal Traffic Policy Support

In Cilium 1.13, we are introducing support for a recent Kubernetes traffic engineering feature: Service Internal Traffic Policy. As explained concisely in the Kubernetes 1.26 blog post, an internal traffic policy of Local optimizes traffic originating within the cluster, and it is now supported by Cilium 1.13.

Let’s review this feature. By default, Services use the Cluster internalTrafficPolicy, where traffic is randomly distributed across all endpoints in the cluster.

When internalTrafficPolicy is set to Local, Cilium will forward internal traffic for a Service only if there is an available endpoint that is local to the same Node.
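
Since internalTrafficPolicy is a standard field on the Service spec, opting in is a one-line change; here is a minimal sketch (the Service name, selector, and port are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: node-local-metrics
spec:
  selector:
    app: metrics-agent
  ports:
  - port: 9100
    targetPort: 9100
  # Only route to endpoints on the same node as the client pod
  internalTrafficPolicy: Local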

This is typically used to direct traffic to a logging daemon or metrics agent on the same node (avoiding the latency and potential traffic transfer costs of reaching out to a different node). Hubble itself leverages this feature for optimal traffic routing.

Watch the demo below to learn more or check out the Cilium docs:

SRv6 (Isovalent Cilium Enterprise)

Segment Routing is a powerful and flexible network routing technology that is gaining widespread adoption in the cloud-native application and service provider space. It is a scalable and efficient method of routing network traffic that enables service providers to deliver highly reliable and performant network services to their customers.

Segment Routing over IPv6 (SRv6) has been adopted incredibly quickly by many service providers worldwide and enables network programming like overlay networks and VPNs. With Cilium 1.13, we now support the L3VPNv4/L3VPNv6 use case. These behaviors can be programmed by Cilium CRDs or received via BGP using GoBGP.

Currently, Cilium supports a single SID (Segment Identifier), either in an SRH (Segment Routing Header) or in Reduced Encapsulation mode. While uSIDs (micro SIDs) and multiple SIDs are not supported yet, we are actively evaluating support for these features.

To learn more, watch the introductory demo to SRv6 on Cilium by Isovalent engineer Louis DeLosSantos or the KubeCon North America 2022 session below, with Louis and Daniel Bernier (Technical Director, Bell Canada).

Hubble & Observability

In October, Grafana and Isovalent announced a partnership and unveiled a Cilium integration with Grafana that enables observing the health and performance of the network connectivity between cloud-native applications in Kubernetes using Prometheus metrics, Grafana dashboards and alerts, and Grafana Tempo traces.

In Cilium 1.13, we begin to see the fruits of the partnership in more detail as outlined in the next sections.

If you want to dive right into a hands-on example, check out our zero trust observability lab!

New call-to-action

Hubble Support for Tracing-enabled applications and Grafana Tempo Integration

If you want to troubleshoot distributed systems, you need tracing. Many application developers already have their apps send traces to their preferred tracing backend, such as Grafana Tempo. That way, if they see an unexpected change in their Grafana dashboard, they can click on it and jump directly to the corresponding traces.

However, so far this was only true for the application’s native metrics; network metrics were not included. With Cilium 1.13, we bridge this gap!

In Cilium 1.13, we’ve updated Hubble’s Layer 7 HTTP visibility feature to automatically extract the existing, app-specific OpenTelemetry traceParent headers from Layer 7 HTTP requests into a new field in Hubble flows, effectively linking our Hubble metrics to the traceID from the application. 

With this, users can see their application’s traceIDs in Hubble’s L7 HTTP metrics, allowing them to take advantage of the metrics produced by Hubble, while still being able to jump from metrics to traces.

To learn more, watch this eCHO show or head out to this GitHub repo to try it yourself.

Grafana Plugin in Hubble

Grafana integrates with many different data sources through an extensive plugin system. That allows users to achieve a “single pane of glass” view, regardless of where their data originates and is stored.

With Cilium 1.13, we are adding one more piece to this ecosystem – the Hubble Datasource Plugin. It will enable Grafana users to get detailed insights into network traffic and correlate it with application-level telemetry.

The plugin integrates with Prometheus (storing Hubble metrics) and Tempo (storing distributed traces), as well as Hubble Timescape, the Isovalent Cilium Enterprise observability platform. In the first beta release, it supports a few visualizations:

  • HTTP service map with RED (requests, errors, duration) statistics. It’s generated from Hubble metrics and contains links to detailed Prometheus queries.
  • Raw Hubble network flows with optional links to Tempo traces (enterprise – requires Hubble Timescape)
  • Process Ancestry Tree (enterprise – requires Hubble Timescape & Tetragon)

If you would like to learn more, watch this session by Isovalent engineer Anna Kapuścińska from Open Observability Day 2022:

Tetragon

File Integrity Monitoring (Enterprise)

Malicious activity is often visible at the file system level: attackers access content or modify files.

To help operators in such circumstances, Tetragon introduced File Integrity Monitoring (FIM). FIM is a feature that monitors and detects file changes that could be indicative of malicious activity.

In FIM, we set up a baseline for the monitored files and observe if and when they change, how they change, which user and process changed them, and so on. FIM can be used to monitor static files on servers, databases, email client configurations, network devices, directory servers, and more, and can thus help achieve regulatory compliance standards like PCI-DSS, NIST, or HIPAA. It can also be used to meet best practice frameworks like the CIS security benchmarks.

Tetragon’s File Integrity Monitoring, which is based on eBPF, transparently handles all the different ways to perform file I/O. We generate events while monitoring file I/O in the following cases:

  • Using read/write family system calls. These include: read, readv, pread64, preadv, preadv2, write, writev, pwrite64, pwritev, and pwritev2 system calls.
  • Optimized ways to copy files inside the kernel, including: copy_file_range, sendfile, and splice system calls.
  • Asynchronous ways to do I/O, including io_uring and aio
  • File access through memory-mapped files (i.e. mmap)
  • fallocate* system calls.

As an example, the following listing shows a File Integrity Monitoring Policy:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "file-monitoring"
spec:
  file:
    file_paths:
    - "/etc/"
    - "/root/"
    - "/usr/bin/"
    - "/usr/sbin/testfile"
    file_paths_exclude:
    - "/etc/blog1"
    - "/etc/locale.alias"

In this policy, we monitor the files and directories under “file_paths”. The entries under “file_paths_exclude” are prefixes that the user wants to exclude, for example to reduce the amount of noise generated by the events.

Once the above policy is applied, reading the content of “/etc/passwd” (for example by executing “cat /etc/passwd”) will generate a JSON event in Tetragon. And if a user were to rename “/usr/sbin/testfile” to “/usr/bin/testfile.old” by executing “mv testfile testfile.old”, a comprehensive JSON event would be generated as well.

L3/L4 Network Observability & IPv6 support (Enterprise)

What happens on a socket can tell you a lot about the health of an application and the security of the environment.

For that reason, Tetragon’s L3/L4 Network Observability has been improved with the Process Socket Statistics feature. It collects socket statistics, in bytes, for inbound and outbound connections made by processes and pods. The extracted information makes it possible to determine the connections made by a process and the amount of data transferred. Process Socket Statistics supports UDP and observes the sent and received bytes per socket at both the network stack and the application layer. This makes it possible to see whether there is a difference in the number of bytes between the network and the application layer. For example, a networking and operations team would be able to detect that an application is not reading data as fast as it is arriving, or not sending data as fast as it should.

As an example, the following listing shows a “process_sock_stats” JSON event for a UDP connection:

{
  "process_sock_stats": {
    "process": {
      "exec_id": "OjQzNDAwMDAwMDA6MTA5MA==",
      "pid": 1090,
      "uid": 115,
      "cwd": "/etc/avahi",
      "binary": "/usr/sbin/avahi-daemon",
      "flags": "procFS auid",
      "start_time": "2022-12-12T16:06:10.115240681Z",
      "auid": 4294967295,
      "parent_exec_id": "OjE4MDAwMDAwMDox",
      "refcnt": 18
    },
    "parent": {
      "exec_id": "OjE4MDAwMDAwMDox",
      "pid": 1,
      "uid": 0,
      "cwd": "/",
      "binary": "/usr/lib/systemd/systemd",
      "arguments": "splash",
      "flags": "procFS auid rootcwd",
      "start_time": "2022-12-12T16:06:05.955240984Z",
      "auid": 4294967295,
      "parent_exec_id": "OjE6MA==",
      "refcnt": 143
    },
    "socket": {
      "source_ip": "ff02::fb",
      "source_port": 5353,
      "destination_ip": "fe80::c9e:b4cf:ccc0:18e0",
      "destination_port": 5353,
      "sock_cookie": "32776",
      "protocol": "UDP"
    },
    "stats": {
      "bytes_received": 232,
      "segs_in": 1,
      "bytes_consumed": 232,
      "segs_consumed": 1,
      "rtt": {},
      "udp_latency": {}
    }
  },
  "time": "2023-02-09T12:02:46.758846680Z"
}

The “bytes_received” field represents the amount of data received at the network stack, while the “bytes_consumed” field represents the amount of data processed at the application layer. In this case, both are 232, which means all the received bytes were consumed: the application is reading as fast as the data arrives, and there is no latency between the network stack and the application layer.

The example also shows IPv6 support, as the “source_ip” and “destination_ip” fields are “ff02::fb” and “fe80::c9e:b4cf:ccc0:18e0”, respectively.

RTT / SRTT histograms

How can you best measure the health of a connection? You watch the entire flow of packets. To help operations teams with this, Tetragon can now observe the round-trip time (RTT) and smoothed round-trip time (SRTT) for each packet during a network socket lifecycle.

RTT represents the duration in milliseconds (ms) it takes for a network packet to go from a starting point to a destination and back again, while SRTT provides a more stable value for the round-trip time by utilizing a smoothing algorithm. Based on this information, operations teams can analyze the health of a connection and diagnose the speed and reliability of network connections.

As an example, the following listing shows an RTT/SRTT Histogram Policy:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "tcp-rtt-histogram"
spec:
  parser:
    tcp:
      enable: true
      statsInterval: 10
      histogram:
        enable: true
        min: 0
        max: 4000

This policy sets the minimum RTT to 0 ms (the “min” field) and the maximum RTT to 4000 ms (the “max” field). Based on these values, the RTT for each packet is calculated. Depending on the observed RTT, Tetragon places the events into buckets – <0%, 1%, 10%, 25%, 50%, 75%, 90%, 99%> – where the percentage is relative to the maximum RTT value, in our case 4000 ms.

As an example, have a look at a “process_sock_stats” JSON event generated by the policy above. As we can see, there were 23 packets with RTTs above 99%, while the SRTT was 178 ms. Based on these events, operations teams can also create graphs as seen above, see where the majority of the packets’ RTTs fall, and track how RTT changed over time for each pod.

Interface Metrics (Enterprise)

Information about the traffic on certain interfaces can be crucial for monitoring. Platform and operations teams, for example, often need to set up baselines for the normal number of packets in the queue and monitor the actual state to detect deviations from it.

Tetragon now introduces interface metrics, allowing users to observe all network interfaces on a particular node and collect statistics on them by using advanced in-kernel filtering. 

Since it has direct access to the interface struct in the kernel, we can collect complex metrics including information about the sent and received bytes in the data frames as well as information about the number of the packets in the queue to be processed.

As an example, the following Interface Metrics Policy observes the amount of sent and received data, as well as the queue length, every time a packet is sent by the interface, and emits an “interface_stats” event into user space every 10 seconds:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "interface-stats"
spec:
  parser:
    interface:
      enable: true
      statsInterval: 10
      packet: true

The following JSON event will be generated by Tetragon.

{
  "interface_stats": {
    "interface_name": "enp2s0",
    "interface_ifindex": 2,
    "bytes_sent": 13001,
    "bytes_received": 7580,
    "packets_sent": 27,
    "packets_received": 61,
    "netns": 4026531840,
    "qlen": {
      "buckets": [
        {
          "size": 1,
          "count": 27
        },
        {
          "percentile": 1,
          "size": 9
        },
        {
          "percentile": 10,
          "size": 15
        },
        {
          "percentile": 25,
          "size": 25
        },
        {
          "percentile": 50,
          "size": 25
        },
        {
          "percentile": 75,
          "size": 15
        },
        {
          "percentile": 90,
          "size": 9
        },
        {
          "percentile": 99,
          "size": 1
        }
      ]
    }
  }
}

The listing shows that Tetragon reports the number of bytes sent and received, as well as the number of packets and the queue length whenever a packet is about to be sent.

As a second step, it also creates buckets – <1%, 10%, 25%, 50%, 75%, 90%, >99% –  which represent the percentage of the baseline queue length, and increments a counter whenever a corresponding length is observed. 

Based on this information, operations and networking teams can create graphs, determine the average or minimum/maximum queue length for each pod and network interface, and analyze what might be an indication of an application slowdown.

Security

Cilium and Tetragon container image signing using cosign

Image signing during the development process.

As developers have leaned into cloud-native projects for scale and maintainability, the popularity of containers has exploded. This comes with a lot of security risks. Most containers available today are vulnerable to supply chain attacks because they can be published with nothing more than an API key. If that key leaks, an attacker can publish a legitimate-looking container that contains malware.

One of the best ways to protect users from these kinds of attacks is by signing the image at creation time. Leveraging image signing gives users confidence that the images they got from the container registry are the trusted code that the maintainer built and published.

Starting with Cilium 1.13, all Cilium & Tetragon container images are signed using cosign.

In the following demo, we verify a Cilium image’s signature using the cosign verify command:

$ COSIGN_EXPERIMENTAL=1 cosign verify --certificate-github-workflow-repository cilium/cilium --certificate-oidc-issuer https://token.actions.githubusercontent.com --certificate-github-workflow-name "Image Release Build" --certificate-github-workflow-ref refs/tags/v1.13.0 quay.io/cilium/cilium:v1.13.0 | jq

Verification for quay.io/cilium/cilium:v1.13.0 --
The following checks were performed on each of these signatures:
  - The cosign claims were validated
  - Existence of the claims in the transparency log was verified offline
  - Any certificates were verified against the Fulcio roots.
[
  {
    "critical": {
      "identity": {
        "docker-reference": "quay.io/cilium/cilium"
      },
      "image": {
        "docker-manifest-digest": "sha256:6544a3441b086a2e09005d3e21d1a4afb216fae19c5a60b35793c8a9438f8f68"
      },
      "type": "cosign container image signature"
    },
    "optional": {
      "1.3.6.1.4.1.57264.1.1": "https://token.actions.githubusercontent.com",
      "1.3.6.1.4.1.57264.1.2": "push",
      "1.3.6.1.4.1.57264.1.3": "c9723a8df3cfa336da1f8457a864105d8349acfe",
      "1.3.6.1.4.1.57264.1.4": "Image Release Build",
      "1.3.6.1.4.1.57264.1.5": "cilium/cilium",
      "1.3.6.1.4.1.57264.1.6": "refs/tags/v1.13.0",
      "Bundle": {
        "SignedEntryTimestamp": "MEUCIGBE5GR/tKCotd0A9qowKJtloTLq2HcOGy/KF15jypHXAiEAge67JGyA0EySSman1x6vErvj/WkOy57dgLW3WjhTu1I=",
        "Payload": {
          "body": "eyJhcGlWZXJzaW9uIjoiMC4wLjEiLCJraW5kIjoiaGFzaGVkcmVrb3JkIiwic3BlYyI6eyJkYXRhIjp7Imhhc2giOnsiYWxnb3JpdGhtIjoic2hhMjU2IiwidmFsdWUiOiI1OGZiMDk3ZTA4ODk4ZjAyYmUyN2U1YzdjMWEyYTUwMjQxYmY4Y2ViZDk1ZWM0MTViZDFiMDYxMjM1MjQ1ZTg5In19LCJzaWduYXR1cmUiOnsiY29udGVudCI6Ik1FVUNJRHZiN0IzT2dkQm1MV1RRQnlEcnNqakdod1BuRU5qK21FTkFJdEp6Y0M1dEFpRUF2c1pBVUVrYTM4dUxUZzNjeDJJdXpzVVBpQ2MvYjZpdlpRbmUvMEF6aTlVPSIsInB1YmxpY0tleSI6eyJjb250ZW50IjoiTFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVUnhSRU5EUVhrclowRjNTVUpCWjBsVllYVnZXVzk0Y0ZWd1kySm5NSGRuZERGb05VeDNRMWRQYkhGUmQwTm5XVWxMYjFwSmVtb3dSVUYzVFhjS1RucEZWazFDVFVkQk1WVkZRMmhOVFdNeWJHNWpNMUoyWTIxVmRWcEhWakpOVWpSM1NFRlpSRlpSVVVSRmVGWjZZVmRrZW1SSE9YbGFVekZ3WW01U2JBcGpiVEZzV2tkc2FHUkhWWGRJYUdOT1RXcE5kMDFxUlRGTlZGRXdUMVJCTTFkb1kwNU5hazEzVFdwRk1VMVVVVEZQVkVFelYycEJRVTFHYTNkRmQxbElDa3R2V2tsNmFqQkRRVkZaU1V0dldrbDZhakJFUVZGalJGRm5RVVZTY1dKaVRDOWtUMDFMTVdjeVFUWjRaVWhuUTBjdk1UQlRSVTFuTTJkblltTlNXbllLYzJWbFZ6VnRZVnBqTm1OdVVrUTRabk5hZVRWc2FWWk5RV3c1VUdKUGVITnZNSGcxVUcxTFpIbzJia2RQVW5OaVZYRlBRMEZyTkhkblowcExUVUUwUndwQk1WVmtSSGRGUWk5M1VVVkJkMGxJWjBSQlZFSm5UbFpJVTFWRlJFUkJTMEpuWjNKQ1owVkdRbEZqUkVGNlFXUkNaMDVXU0ZFMFJVWm5VVlZxYzBWMUNreGpaWFZyZURsc1pGZHJVemRVZFhVMmEyaHhTV3ByZDBoM1dVUldVakJxUWtKbmQwWnZRVlV6T1ZCd2VqRlphMFZhWWpWeFRtcHdTMFpYYVhocE5Ga0tXa1E0ZDJKUldVUldVakJTUVZGSUwwSkhUWGRaV1ZwbVlVaFNNR05JVFRaTWVUbHVZVmhTYjJSWFNYVlpNamwwVERKT2NHSkhiREZpVXpscVlWZDRjQXBrVnpCMlRHMWtjR1JIYURGWmFUa3pZak5LY2xwdGVIWmtNMDEyV1c1V2NHSkhVWFJoVnpGb1dqSldla3hZU214aVIxWm9ZekpXZWt4dWJHaGlWM2hCQ21OdFZtMWplVGt3V1Zka2Vrd3pXWGhNYWtWNlRHcEJkMDlSV1V0TGQxbENRa0ZIUkhaNlFVSkJVVkZ5WVVoU01HTklUVFpNZVRrd1lqSjBiR0pwTldnS1dUTlNjR0l5TlhwTWJXUndaRWRvTVZsdVZucGFXRXBxWWpJMU1GcFhOVEJNYlU1MllsUkJVMEpuYjNKQ1owVkZRVmxQTDAxQlJVTkNRVkozWkZoT2J3cE5SRmxIUTJselIwRlJVVUpuTnpoM1FWRk5SVXRIVFRWT2VrbDZXVlJvYTFwcVRtcGFiVVY2VFhwYWExbFVSbTFQUkZFeFRqSkZORTVxVVhoTlJGWnJDazlFVFRCUFYwWnFXbTFWZDBsUldVdExkMWxDUWtGSFJIWjZRVUpDUVZGVVUxY3hhRm95VldkVmJWWnpXbGRHZWxwVFFrTmtWMnh6V2tSQllrSm5iM0lLUW1kRlJVRlpUeTlOUVVWR1FrRXhhbUZYZUhCa1Z6QjJXVEpzYzJGWVZuUk5RamhIUTJselIwRlJVVUpuTnpoM1FWRlpSVVZZU214YWJrMTJaRWRHYmdwamVUa3lUVk0wZUUxNU5IZE5TVWRNUW1kdmNrSm5SVVZCWkZvMVFXZFJRMEpJTUVWbGQwSTFRVWhqUVROVU1IZGhjMkpJUlZSS2FrZFNOR050VjJNekNrRnhTa3RZY21wbFVFc3pMMmcwY0hsblF6aHdOMjgwUVVGQlIwZFdXWG8wVFdkQlFVSkJUVUZUUkVKSFFXbEZRVzVCV2xrd1JHMUhSVGxyU2tGM1NGZ0tXWEphY1daSVZEZzJWRzkxVWtkVlJHMVdia1pSUVZaRlQwNXJRMGxSUXl0S2VVVlJNVm81TkRrMVFWWXlkRkpWY2l0U1UwSlFZM2xXUVdveGJWVkpjQXBITVdGWEwxRm1SVkpxUVV0Q1oyZHhhR3RxVDFCUlVVUkJkMDV1UVVSQ2EwRnFRbWxSU1hwQ1dVY3diMmhvVDJGRFVIRkNSbmhvZWs0d09VOHZPSEJRQ2t0cVpVdHBValpsYkVkVlpHaDZaME1ySzFGVU1FbG9XQzlCVm5oRlQyaGlRV0pCUTAxQlJsQnNWRXRWYW00eVYyOVVRMUJMUkhSbVRDOU1aVUZOZVZnS1MyVlRRa2xGYlhsMWFVZ3paRUV4VG1OUVJsSmtRa1F6ZEhwTmVYUnpUVzVEYUZsdVJuYzlQUW90TFMwdExVVk9SQ0JEUlZKVVNVWkpRMEZVUlMwdExTMHRDZz09In19fX0=",
          "integratedTime": 1676472547,
          "logIndex": 13401801,
          "logID": "c0d23d6ad406973f9559f3ba2d1ca01f84147d8ffc5b8445c224f98b9591801d"
        }
      },
      "Issuer": "https://token.actions.githubusercontent.com",
      "Subject": "https://github.com/cilium/cilium/.github/workflows/build-images-releases.yaml@refs/tags/v1.13.0",
      "githubWorkflowName": "Image Release Build",
      "githubWorkflowRef": "refs/tags/v1.13.0",
      "githubWorkflowRepository": "cilium/cilium",
      "githubWorkflowSha": "c9723a8df3cfa336da1f8457a864105d8349acfe",
      "githubWorkflowTrigger": "push"
    }
  }
]

To learn more, read the official Cilium docs.

Cilium and Tetragon images include a Software Bill of Materials (SBOM)

Open Source Software (OSS) changed modern application development: these days, developers can leverage a vast set of OSS packages to compose new applications. This increase in development speed comes with a heavy price that organizations need to guard against: vulnerabilities in those OSS dependencies can lead to attacks like the Solorigate backdoor malware, popularly known as Sunburst.

Cilium 1.13 introduces a Software Bill of Materials (SBOM) to help mitigate those issues.

A Software Bill of Materials (SBOM) is a generated list of all of a piece of software’s dependencies. While the SBOM provides the data required to cryptographically verify software components, the SBOM itself needs to be signed so that consumers can check the integrity of the document itself.

Starting with Cilium 1.13, all Cilium and Tetragon images include an SBOM. The SBOM is generated in the Software Package Data Exchange (SPDX) format using the bom tool.

We can verify that the SBOM is available and tamper-proof: its signature can be verified using the cosign verify command, and the Issuer and Subject fields of the output show that the SBOM image was signed using GitHub Actions in the Cilium repository.

$ COSIGN_EXPERIMENTAL=1 cosign verify --certificate-github-workflow-repository cilium/cilium --certificate-oidc-issuer https://token.actions.githubusercontent.com --attachment sbom quay.io/cilium/cilium:v1.13.0 | jq

Verification for quay.io/cilium/cilium:sha256-6544a3441b086a2e09005d3e21d1a4afb216fae19c5a60b35793c8a9438f8f68.sbom --
The following checks were performed on each of these signatures:
  - The cosign claims were validated
  - Existence of the claims in the transparency log was verified offline
  - Any certificates were verified against the Fulcio roots.
[
  {
    "critical": {
      "identity": {
        "docker-reference": "quay.io/cilium/cilium"
      },
      "image": {
        "docker-manifest-digest": "sha256:8ad7d7ee6e0fe13695d2ed86cedc1c31b429655b8159cad9b0a9e05504c2a00f"
      },
      "type": "cosign container image signature"
    },
    "optional": {
      "1.3.6.1.4.1.57264.1.1": "https://token.actions.githubusercontent.com",
      "1.3.6.1.4.1.57264.1.2": "push",
      "1.3.6.1.4.1.57264.1.3": "c9723a8df3cfa336da1f8457a864105d8349acfe",
      "1.3.6.1.4.1.57264.1.4": "Image Release Build",
      "1.3.6.1.4.1.57264.1.5": "cilium/cilium",
      "1.3.6.1.4.1.57264.1.6": "refs/tags/v1.13.0",
      "Bundle": {
        "SignedEntryTimestamp": "MEQCID8aEQ+xWBNHV4IfjfcwWFiYc2J84nq1hMquJvt6hTzCAiAT7seq6iaPtVLyTw5Bb32Z2k+GwrbjNz8k322qmThWmg==",
        "Payload": {
          "body": "eyJhcGlWZXJzaW9uIjoiMC4wLjEiLCJraW5kIjoiaGFzaGVkcmVrb3JkIiwic3BlYyI6eyJkYXRhIjp7Imhhc2giOnsiYWxnb3JpdGhtIjoic2hhMjU2IiwidmFsdWUiOiI3MzE1NjhiMjU2OWM0NjMwNzIzOWRlYjhjNzljNjM3ZDRiMTk0NzIyZmU2NjRmN2M0OGU0NDIwYTEzZDQzNWQzIn19LCJzaWduYXR1cmUiOnsiY29udGVudCI6Ik1FVUNJUURGNmg0SGpNZlVaTXhraHpTNm5Ka1hQRHpsWktUSzVZbm5zbmNldUdubVF3SWdlRUVJUGVXUVU4Unc3OW9ibHR4cXdYSWdOcjdUTVp3WmdBTjJSaEdJOEdnPSIsInB1YmxpY0tleSI6eyJjb250ZW50IjoiTFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVUnhSRU5EUVhreVowRjNTVUpCWjBsVlFWbFVLMmRxUm5kVmJrSkZZMlJYYUdJd04wNW1iMnBJYjJkTmQwTm5XVWxMYjFwSmVtb3dSVUYzVFhjS1RucEZWazFDVFVkQk1WVkZRMmhOVFdNeWJHNWpNMUoyWTIxVmRWcEhWakpOVWpSM1NFRlpSRlpSVVVSRmVGWjZZVmRrZW1SSE9YbGFVekZ3WW01U2JBcGpiVEZzV2tkc2FHUkhWWGRJYUdOT1RXcE5kMDFxUlRGTlZGRXdUMVJKTlZkb1kwNU5hazEzVFdwRk1VMVVVVEZQVkVrMVYycEJRVTFHYTNkRmQxbElDa3R2V2tsNmFqQkRRVkZaU1V0dldrbDZhakJFUVZGalJGRm5RVVV3U1dSWWJIWmxiRWhaWWpKbmNrMUNiWEp1YUhWVWNESm1WVTR2T0hoTU9VbFZURWNLYVRKaWRUbDBNSFZKVGs0eGFHdHJSVTVvZEU0ekwxWkVVWEIxTkZrdlpGUTBaVGxwTDNnekwzSkxMelZYYUhaUVprdFBRMEZyZDNkblowcEpUVUUwUndwQk1WVmtSSGRGUWk5M1VVVkJkMGxJWjBSQlZFSm5UbFpJVTFWRlJFUkJTMEpuWjNKQ1owVkdRbEZqUkVGNlFXUkNaMDVXU0ZFMFJVWm5VVlZtYjFsVUNrNDBaUzk1Um5GQmJtYzRLMmxhVnpCaGJXTmFRWEpuZDBoM1dVUldVakJxUWtKbmQwWnZRVlV6T1ZCd2VqRlphMFZhWWpWeFRtcHdTMFpYYVhocE5Ga0tXa1E0ZDJKUldVUldVakJTUVZGSUwwSkhUWGRaV1ZwbVlVaFNNR05JVFRaTWVUbHVZVmhTYjJSWFNYVlpNamwwVERKT2NHSkhiREZpVXpscVlWZDRjQXBrVnpCMlRHMWtjR1JIYURGWmFUa3pZak5LY2xwdGVIWmtNMDEyV1c1V2NHSkhVWFJoVnpGb1dqSldla3hZU214aVIxWm9ZekpXZWt4dWJHaGlWM2hCQ21OdFZtMWplVGt3V1Zka2Vrd3pXWGhNYWtWNlRHcEJkMDlSV1V0TGQxbENRa0ZIUkhaNlFVSkJVVkZ5WVVoU01HTklUVFpNZVRrd1lqSjBiR0pwTldnS1dUTlNjR0l5TlhwTWJXUndaRWRvTVZsdVZucGFXRXBxWWpJMU1GcFhOVEJNYlU1MllsUkJVMEpuYjNKQ1owVkZRVmxQTDAxQlJVTkNRVkozWkZoT2J3cE5SRmxIUTJselIwRlJVVUpuTnpoM1FWRk5SVXRIVFRWT2VrbDZXVlJvYTFwcVRtcGFiVVY2VFhwYWExbFVSbTFQUkZFeFRqSkZORTVxVVhoTlJGWnJDazlFVFRCUFYwWnFXbTFWZDBsUldVdExkMWxDUWtGSFJIWjZRVUpDUVZGVVUxY3hhRm95VldkVmJWWnpXbGRHZWxwVFFrTmtWMnh6V2tSQllrSm5iM0lLUW1kRlJVRlpUeTlOUVVWR1FrRXhhbUZYZUhCa1Z6QjJXVEpzYzJGWVZuUk5RamhIUTJselIwRlJVVUpuTnpoM1FWRlpSVVZZU214YWJrMTJaRWRHYmdwamVUa3lUVk0wZUUxNU5IZE5TVWRLUW1kdmNrSm5SVVZCWkZvMVFXZFJRMEpJYzBWbFVVSXpRVWhWUVROVU1IZGhjMkpJUlZSS2FrZFNOR050VjJNekNrRnhTa3RZY21wbFVFc3pMMmcwY0hsblF6aHdOMjgwUVVGQlIwZFdXVEZQWVhkQlFVSkJUVUZTYWtKRlFXbEJTV280THpSNFQwMXZlR0o0VTI5Rk5tTUtUVTltTW5kMWEzVXlOVWhoVGk5dGNWTXpXVXh6YlVZMGVuZEpaME5IUXpKQ1RFNUZjVFoyYWtkdk5IcFZaemd4ZHpoV05ISmFiVFp2UmpCb09VZGtjd294VDJWTFlTdG5kME5uV1VsTGIxcEplbW93UlVGM1RVUmhVVUYzV21kSmVFRkxkbE52ZWpkNFl5OXRkM0JyVEUwdmMxWmtWbmxFT1VnMlVrWTRZU3RLQ25rNWN6RnVRVzFLUzFwNlZYSmlaelpQVmxJemIxRXlSa2QxYzFwMllWbEZlVUZKZUVGTVYyTlVhalkxZWk5aU1scEtPRzVCV0hkVFZFdGpRV2hrTnpRS01uVTRjVkpsVFhGWWNGQTFhMjB2VTI5RmMwd3pOakJoWW1Od1kxcHBVV05oT0dGcFRIYzlQUW90TFMwdExVVk9SQ0JEUlZKVVNVWkpRMEZVUlMwdExTMHRDZz09In19fX0=",
          "integratedTime": 1676472569,
          "logIndex": 13401821,
          "logID": "c0d23d6ad406973f9559f3ba2d1ca01f84147d8ffc5b8445c224f98b9591801d"
        }
      },
      "Issuer": "https://token.actions.githubusercontent.com",
      "Subject": "https://github.com/cilium/cilium/.github/workflows/build-images-releases.yaml@refs/tags/v1.13.0",
      "githubWorkflowName": "Image Release Build",
      "githubWorkflowRef": "refs/tags/v1.13.0",
      "githubWorkflowRepository": "cilium/cilium",
      "githubWorkflowSha": "c9723a8df3cfa336da1f8457a864105d8349acfe",
      "githubWorkflowTrigger": "push"
    }
  }
]

To learn more, read the official Cilium docs.

TLS SNI support

Server Name Indication (“SNI”) is an extension of the Transport Layer Security (TLS) protocol that allows for multiple domain names to be served by a single IP address. In the context of Kubernetes, this means that multiple Services can share the same IP address and still be able to terminate the client’s SSL/TLS connection and establish a secure connection between the client and the correct service.

In Cilium 1.13, we are introducing support for SNI in Network Policies. This allows operators to restrict the allowed TLS SNIs in their network and provide a more secure environment.

This feature is implemented as an Envoy redirect, which means that the connection is redirected to a different endpoint based on the SNI value in the client’s “Client Hello” message. It’s important to note that this feature does not require TLS termination, as the SNI extension is in clear text in the Client Hello message.

Starting with Cilium 1.13, we have added a new field “ServerNames” to the Cilium Network Policy that allows you to specify a list of allowed TLS SNI values. If the field is not empty, then TLS must be present and one of the provided SNIs must be indicated in the TLS handshake. This feature adds more granularity to your network security controls and allows you to enforce security policies based on the SNI value in the client’s connection request.

With the following policy, you can allow traffic to the amit.cilium.rocks SNI:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
 name: "l7-visibility-tls"
spec:
 description: L7 policy with TLS-SNI
 endpointSelector:
   matchLabels:
     org: empire
     class: mediabot
 egress:
 - toFQDNs:
   - matchName: "amit.cilium.rocks"
   toPorts:
   - ports:
     - port: "443"
       protocol: "TCP"
     serverNames:
     - "amit.cilium.rocks"
 - toPorts:
   - ports:
     - port: "53"
       protocol: ANY
     rules:
       dns:
         - matchPattern: "*"

With Hubble, we can verify that traffic to this SNI is allowed…

… while traffic to google.com is dropped by the policy:

Resilience & Troubleshooting

CiliumNodeConfig

An area of focus for the Cilium team in 2023 is to provide more migration pathways to customers. We very frequently hear – at conferences, on social media, or on the eBPF & Cilium Slack – requests to provide migration tooling from CNIs such as Calico and Flannel to Cilium. 

Per-Node Cilium Configuration has been built for that purpose and for several other use cases highlighted below. It enables users to tweak the configuration on specific nodes, instead of all Cilium nodes sharing the Cilium configuration defined in a ConfigMap.

By rolling out the configuration to specific nodes step-by-step, we can address a number of use cases:

  • User A wants to transition a production cluster to kube-proxy replacement. As this is a potentially disruptive change, it must be rolled out gradually.
  • User B wants to migrate from another network plugin to Cilium.
  • User C owns and runs a cluster managed by a separate platform team. User C would like to tweak the Cilium configuration for their specific use case. 
  • User D needs to enable features that require specific hardware – such as the aforementioned IPv6 BIG TCP or XDP.

To support such requirements, we are introducing in Cilium 1.13 a new Custom Resource Definition: CiliumNodeConfig.

apiVersion: cilium.io/v2alpha1
kind: CiliumNodeConfig
metadata:
  namespace: kube-system
  name: kube-proxy-replacement
spec:
  nodeSelector:
    matchLabels:
      io.cilium.kube-proxy-replacement: ""
  defaults:
    kube-proxy-replacement: strict

The CiliumNodeConfig object lets us override the global Cilium configuration. 

For example, if a user wanted to enable XDP only on nodes meeting the software and hardware requirements, they would have to apply a CiliumNodeConfig such as:

apiVersion: cilium.io/v2alpha1
kind: CiliumNodeConfig
metadata:
  namespace: kube-system
  name: enable-xdp
spec:
  nodeSelector:
    matchLabels:
      io.cilium.xdp-offload: "true"
  defaults:
    bpf-lb-acceleration: native

They would then label the nodes with a matching io.cilium.xdp-offload: "true" label and the node configuration would be overridden.

To learn more, read the official Cilium docs.

Project Hyperjump

During the v1.13 release cycle, a lot of effort was put into refactoring the Cilium CI for running tests. Cilium contributors had identified the CI as a major pain point in Cilium development. The result of the refactoring is a completely new CI infrastructure and a change in the test suites.

One of the challenges of the Cilium CI infrastructure is that it needs wide coverage of multiple kernels. Previously, we ran the Cilium end-to-end (E2E) tests on VirtualBox VMs provisioned with Vagrant. The tests were implemented with the Ginkgo framework. The combination of both, and the fact that we used to change Cilium configurations in the same K8s test cluster during the execution of a test suite, led to test flakiness.

Another issue was the long provisioning time of test infrastructure.

To address these two issues:

  • We created a lightweight qemu-based VM runner called “little-vm-helper” [1]. 
  • Such a VM gets started by a GitHub Actions runner, and then a Kind K8s cluster is installed. For each Cilium configuration, we create a separate cluster. Finally, we run cilium-cli connectivity tests [2].
  • An example of such a CI pipeline is “ci-datapath” [3].
    • Another big win is that the e2e tests run in under 30 minutes, while some e2e runs on the old CI could take over 3 hours.
  • Finally, some e2e tests were converted into the eBPF unit tests [4] and the control-plane integration tests [5].
    • At the time of writing, we still run the old CI, as not all tests have been migrated. We expect that we will be able to discontinue the old CI by the end of the v1.14 release cycle.
  • The same VM tool was used to implement multi-kernel tests [6] for Tetragon [7] as well.

Community

2022 Annual Report

2022 was a big year for the Cilium project – or, as we like to call it, the Year of the CNI. The number of comments on issues and PRs increased by 60% and blogs about Cilium went up by over 4x. You can find all of these numbers and the rest of the highlights from the project in the 2022 Cilium Annual Report. It also really shows how end users across diverse industries like finance, retail, software, and telecommunications are all realizing the benefits of Cilium and eBPF and have shown that it is production ready at scale. These users can adopt Cilium with the assurance that Cilium is a “well-secured project”, as the recent CNCF-commissioned Security Audit of Cilium highlighted.

CiliumCon

The first ever CiliumCon will be held at KubeCon + CloudNativeCon EU in Amsterdam on 18th April. It will cover stories from end users sharing what they learned running Cilium in production and from contributors diving into Cilium’s technology and its use of eBPF to provide high-performance networking, observability, and security. Attendees will also have the opportunity to meet the maintainers and discuss proposals, PRs, and issues.

Graduation

The Cilium community has applied to become a CNCF graduated project by creating a PR in the cncf/toc repository. This is a major milestone for the Cilium community and users. The entire community is grateful to everyone who has helped get Cilium this far, and we are looking forward to working through the graduation process with the CNCF community. Add a 👍 to the PR to show your support!

KubeCon + CloudNativeCon North America

Cilium had 13 talks, a project meeting, and a booth at KubeCon + CloudNativeCon NA in Detroit. Besides applying for graduation, Microsoft announced it was picking Cilium as the CNI for AKS and Grafana released a new integration with Cilium. You can read more about it in the wrap-up blog post.

User Stories

Since the 1.12 release, many new end users have stepped forward to tell their stories running Cilium in production including:

  • Cosmonic – Runs Cilium on Nomad clusters for a Wasm PaaS
  • Datadog – Scaling to 10,000,000,000,000+ data points per day across more than 18,500 customers
  • Nexxiot – 0 network outage with 100,000+ devices in the field
  • Form3 – Cilium connects clusters across multiple clouds for failover
  • Hetzner – Massive increase in RPS and throughput while reducing CPU usage for ingress
  • PostFinance – solved iptables challenges around scale, observability, and latency
  • Publishing Company – Secures 100,000+ RPS in a Multi-Tenant Environment
  • Retail – Connects 390+ Stores and 4.3 Billion Website Visitors with Cilium
  • S&P Global – Cilium is their multi cloud super highway
  • Seznam.cz – Reduced load balancer CPU load by 72x
  • Utmost – Implementing zero trust networking at 4,000 flows per second
  • VSHN – Reduced support burden with Isovalent Cilium Enterprise

Getting Started

If you are interested in getting started with Cilium, use one of the resources below:

Previous Releases

Thomas Graf – CTO & Co-Founder Isovalent, Co-Creator Cilium, Chair eBPF Governing Board
