Cilium 1.12 – Ingress, Multi-Cluster, Service Mesh, External Workloads, and much more

Jul 20, 2022 · Cilium

Today, we are excited to announce the release of Cilium 1.12. This is yet another massive release. While writing this blog, we jokingly mentioned that we will soon have to publish the release blog as a book. The release includes many exciting new features from new contributors, several features maturing to stable, significant enhancements to Cilium’s security model, bug fixes, and a new community program around the documentation.

Among the highlights is a new sidecar-free Service Mesh datapath option that complements the existing sidecar-based Istio integration. It is powered by Envoy and eBPF under the hood and brings new and simpler control plane options integrated into Kubernetes, with Gateway API & SPIFFE support coming soon. It also features a new Envoy CRD to bring the full raw power of Envoy to all our users.

We are also shipping a fully conformant Ingress Controller in this release to provide L7 load-balancing while offering seamless visibility and NetworkPolicy integration.

The release also contains significant multi-cluster and external workload improvements including service affinity for multi-cluster load-balancing as well as major enhancements on the integration of external non-Kubernetes workloads. Earlier this year, we also launched Tetragon as part of the Cilium family which brings the power of eBPF to runtime security for both observability and enforcement use cases.

Huge shoutout and congrats to everybody in the community. This is yet another massive milestone for Cilium.

Want to learn more about Cilium and Isovalent Cilium Enterprise? We are hosting a Cilium 1.12 – Features Overview & Deep Dive webinar on August 4 at 18:00 CEST | 9AM US Pacific with several maintainers of Cilium.

Cilium 1.12 – New Features at a Glance

Service Mesh & Ingress:

  • Integrated Ingress Controller: A fully compliant Kubernetes Ingress controller embedded into Cilium. Additional annotations are supported for more advanced use cases. (More details)
  • Sidecar-free datapath option: A new datapath option for Cilium Service Mesh as an alternative to the Istio integration, allowing users to run a service mesh without sidecars. (More details)
  • Envoy CRD: A new Kubernetes CRD making the full power of Envoy available wherever Cilium runs. You can express your requirements in an Envoy configuration and inject it anywhere in the network. (More details)
  • Gateway API: The work on supporting Gateway API has started. If you are interested in contributing, reach out on Slack.

Cluster Mesh:

  • Topology-aware routing and service affinity: With a single line of YAML annotation, services can be configured to prefer endpoints in the local or remote cluster, if endpoints in multiple clusters are available. (More details)
  • Simplified cluster connections with Helm: New simplified user experience to connect Kubernetes clusters together using Cilium and Helm. (More details)
  • Lightweight Multi-Cluster for External Workloads: Cluster Mesh now supports special lightweight remote clusters (e.g. k3s), allowing them to be used as control planes for external workloads. (More details)

External Workload Improvements:

  • Egress Gateway promoted to Stable: Cilium enables users to route selected cluster-external connections through specific Gateway nodes, masquerading them with predictable IP addresses to allow integration with traditional firewalls that require static IP addresses. A new CiliumEgressGatewayPolicy CRD improves the selection of the Gateway node and its egress interface. Additional routing logic ensures compatibility with ENI interfaces. (More details)
  • Improved BGP control plane: IPv6 support has been added to the BGP control plane. By leveraging a new feature-rich BGP engine, Cilium can now set up IPv6 peering sessions and advertise BGP IPv6 Pod CIDRs. (More details)
  • VTEP support: This new integration allows Cilium to peer with VXLAN Tunnel Endpoint devices in on-prem environments. (More details)

Security:

  • Running Cilium as unprivileged Pod: You can now run Cilium as an unprivileged container/Pod to reduce the attack surface of a Cilium installation. (More details)
  • Reduction in required Kubernetes privileges: The required Kubernetes privileges have been greatly reduced to the least needed for Cilium to operate. (More details)
  • Network policies for ICMP: Cilium users can now allow a subset of ICMP traffic on egress and ingress, with the usual CiliumNetworkPolicies and CiliumClusterwideNetworkPolicies. (More details)

Load-Balancing:

  • L7 Load-balancing: With the addition of Ingress support, Cilium has become capable of performing L7 load-balancing. (More details)
  • NAT46/64 Support for Load Balancer: Cilium L4 load-balancer (L4LB) now supports NAT46 and NAT64 for services. This allows exposing an IPv6-only Pod via an IPv4 service IP or vice versa. This is particularly useful to load-balance IPv4 client traffic at the edge to IPv6-only clusters. (More details)
  • Quarantining Service backends: A new API to quarantine service backends that are unreachable and should be excluded from load-balancing. It can be utilized by health-checker components to drain traffic from unstable backends, or from backends that should be placed into maintenance mode. (More details)
  • Improved Multi-Homing for Load Balancer: Cilium’s datapath is extended to support multiple native network devices and multiple paths. (More details)

Networking:

  • BBR congestion control for Pods: Cilium is now the first CNI to support TCP BBR congestion control for Pods in order to achieve significantly better throughput and lower latency for Pods exposed to lossy networks such as the Internet. (More details)
  • Bandwidth Manager promoted to Stable: The bandwidth manager used to rate-limit Pod traffic and optimize network utilization has been promoted to Stable. (More details)
  • Dynamic Allocation of Pod CIDRs (beta): A new IPAM mode that improves Pod IP address space utilization through dynamic assignment of Pod CIDRs: additional Pod CIDR pools can now be allocated to each node dynamically, based on usage. (More details)
  • Send ICMP unreachable host on Pod deletion: When a Pod is deleted, Cilium can now install a route which informs other endpoints that the Pod is now unreachable. (More details)
  • AWS ENI prefix delegation: Cilium now supports the AWS ENI prefix delegation feature, which effectively increases the allocatable IP capacity when running in ENI mode. (More details)
  • AWS EC2 instance tag filter: A new Cilium Operator flag improves the scalability in large AWS subscriptions. (More details)

Tetragon:

  • Initial release: The initial release of Tetragon, which provides security observability and runtime enforcement using eBPF. (More details)

User Experience:

  • Automatic Helm Value Discovery: The Cilium CLI is now capable of automatically discovering cluster types and generating the ideal Helm config values for the discovered cluster type. (More details)
  • AKS BYOCNI support: Cilium and the Cilium CLI now support AKS clusters created with the new Bring-Your-Own Container Network Interface (BYOCNI) mode. (More details)
  • Improved chaining mode support: Improved integration between Cilium and cloud CNI plugins for Azure and AWS in chaining mode. (More details)
  • Better troubleshooting with Hubble CLI: Many improvements to the Hubble CLI including a better indication of whether a particular connection has been allowed or denied. (More details)

Isovalent Cilium Enterprise

The following new features have been developed during the release window of Cilium 1.12 and are available as part of Isovalent Cilium Enterprise 1.10-CE and 1.11-CE.

  • DNS Proxy HA: A new highly-available version of the DNS proxy to enforce FQDN policies. This keeps DNS connectivity available while the Cilium agent restarts or is upgraded. (More details)
  • OpenShift Certification updates: New certified releases have been published for Isovalent Cilium Enterprise 1.10-CE and 1.11-CE. This includes newly added support for offline installation. (More details)
  • Hubble Timescape: Timescape, a time-series database to store Hubble and Tetragon observability data over time, has seen numerous improvements including performance improvements, additional RBAC support, as well as support for storing additional types of observability data. (More details)
  • Historic Views in Hubble UI: The Hubble UI can now show the service map as well as flow data from Timescape to look back in time. (More details)
  • Advanced Network Observability: New additions to Tetragon Enterprise to provide new observability insights into protocols such as TCP, TLS, and HTTP. (More details)

Features marked with (Beta) are available for testing and should not be used in production without prior testing.


Service Mesh and Ingress

Kubernetes Ingress

Cilium now provides a fully conformant implementation of Kubernetes Ingress out of the box. Ingress services are essential to implement features such as path-based routing, TLS termination, or sharing a single load-balancer IP for many services.

Kubernetes Ingress controller

The integrated Ingress controller, which uses Envoy and eBPF under the hood, can be applied to traffic entering a Kubernetes cluster as well as to traffic within and across clusters to enable rich L7-aware load-balancing and traffic management. Simply set the Helm option ingressController.enabled to enable Ingress in Cilium.
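In Helm terms, enabling the embedded Ingress controller can look like the following values sketch (the ingressController.enabled option is named in the text; the surrounding invocation is illustrative):

```yaml
# values.yaml sketch: enable Cilium's embedded Ingress controller.
# Apply with, e.g.: helm upgrade cilium cilium/cilium -n kube-system -f values.yaml
ingressController:
  enabled: true
```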

The manifest below is an example of how to configure Ingress to implement path-based routing and tie the Ingress configuration to the Cilium ingress class. For more examples, please refer to Getting Started.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
spec:
  ingressClassName: cilium
  rules:
    - http:
        paths:
          - path: /testpath
            pathType: Prefix
            backend:
              service:
                name: test
                port:
                  number: 80

Sidecar-free Service Mesh Datapath Option

With Cilium 1.12, users have a new option to run a service mesh completely without sidecars while supporting a variety of different control plane options. This option complements the existing Istio integration which will continue to operate Cilium in a sidecar model as well.

Choice of service mesh datapath, control planes, and observability integrations

This datapath currently supports Ingress and the Envoy CRD as control plane options as well as a simple option to enable L7 visibility with Prometheus and OpenTelemetry as output. The SPIFFE integration and Gateway API are currently in the works and will be coming next. For more details, check out the now publicly available Service Mesh Roadmap. We are eager to hear your feedback and thoughts.

Envoy CRD – CiliumEnvoyConfig (CEC)

CiliumEnvoyConfig (CEC) is a new CRD powering the Cilium Ingress implementation. CEC allows the Cilium Ingress controller to specify Envoy listeners and other resources, making it possible to transparently redirect traffic destined for specific Kubernetes services to these Envoy listeners. Cilium automatically injects its policy filters into the Envoy listeners in CECs so that Cilium policy enforcement keeps working as expected.

CiliumEnvoyConfig is considered a feature for power users. It is used to simplify the integration of additional service mesh control planes such as Ingress, Gateway API, SMI, and potentially others. Interested users may observe the CEC resources as created by Cilium Ingress controller to see how CiliumEnvoyConfigs are laid out.
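As a rough sketch of the resource's shape (service and listener names are illustrative, and the Envoy listener body is elided rather than complete):

```yaml
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: envoy-example
spec:
  # Kubernetes services whose traffic is redirected to the Envoy resources below
  services:
    - name: my-service
      namespace: default
  # Raw Envoy xDS resources; Cilium injects its policy filters automatically
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
      name: envoy-example-listener
      # filter chains, routes, clusters etc. follow here
```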

Tracing with OpenTelemetry

While Cilium and Hubble already natively include metrics and tracing capabilities, we have enhanced observability further with Cilium 1.12.

Observability data from Hubble and Cilium Service Mesh can now be exported using the OpenTelemetry collector with hubble-otel. It supports both log and trace data formats and maps every L3-L7 flow to a log entry or a span. Trace IDs are generated automatically for related flows, based on source/destination address, port, and Cilium identity. When an L7 flow contains common trace ID headers, those are respected. These trace and flow data can then be imported into tools such as Jaeger or other OpenTelemetry-compatible observability platforms.


Cluster Mesh (Multi-Cluster)

Service Affinity for Multi-Cluster Load-Balancing

Cluster Mesh has become a highly popular Cilium feature to connect Kubernetes clusters together with minimal complexity and overhead. One of its benefits is global services, which leverage standard Kubernetes services with annotations to define services to span across multiple clusters.

With this new release, global services are now capable of understanding topology and affinity. This means that instead of balancing evenly across load-balancing backends, you can choose to prefer backends only in the local or remote clusters.

Cluster Mesh Load Balancing before 1.12

You can steer traffic to prefer the local cluster (for optimal latency and reduced cross-cluster traffic), to prefer a remote cluster (for canary releases or maintenance of a local service), or to stay with the default model.

The newly added annotation io.cilium/service-affinity: "local|remote|none" can be used to specify the preferred endpoint destination.

apiVersion: v1
kind: Service
metadata:
  name: rebel-base
  annotations:
    io.cilium/global-service: 'true'
    # Possible values:
    # - local
    #    preferred endpoints from local cluster if available
    # - remote
    #    preferred endpoints from remote cluster if available
    # - none (default)
    #    no preference. Default behavior if this annotation does not exist
    io.cilium/service-affinity: 'local'
spec:
  type: ClusterIP
  ports:
    - port: 80
  selector:
    name: rebel-base

Local Service Affinity

The above example ensures that incoming traffic is load-balanced to healthy local endpoints.

If none of the local endpoints are healthy, endpoints from remote clusters will be used.

Local Service Affinity in the event of a failure

Users and cluster operators can check endpoint details with the cilium service list CLI command.

kubectl exec -ti ds/cilium -- cilium service list --clustermesh-affinity

ID   Frontend            Service Type   Backend
1    10.96.173.113:80    ClusterIP      1 => 10.244.2.136:80 (active)
                                        2 => 10.244.1.61:80 (active) (preferred)
                                        3 => 10.244.2.31:80 (active) (preferred)
                                        4 => 10.244.2.200:80 (active)

If you would like to learn more, you can read the docs or watch a demo.

Simplified Cluster Connections with Helm

Primary external contributor: Samuel Torres (Form3)

Previously, to configure Cluster Mesh connections on a cluster, there were two options:

  1. Using the Cilium CLI’s cilium clustermesh connect command.
  2. Manually creating the Cluster Mesh etcd ConfigMap and patching the Cilium agent DaemonSet with the Cluster Mesh peers’ host aliases.

These options are not ideal if we want to configure Cluster Mesh declaratively without needing access to all the clusters. This is especially important when Cilium is configured declaratively using GitOps.

With the 1.12 release, we are adding support to configure Cluster Mesh connections through the Helm Chart by adding a configuration section for explicit declaration of the Cluster Mesh connections.
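A declarative connection might look like the following values sketch (cluster names, addresses, and port are illustrative, assuming the clustermesh.config section of the chart):

```yaml
# values.yaml sketch: declare a Cluster Mesh connection via Helm.
clustermesh:
  config:
    enabled: true
    clusters:
      # One entry per remote cluster to connect to
      - name: cluster2
        ips:
          - 172.18.0.100   # clustermesh-apiserver address of the peer
        port: 2379
```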

Lightweight Multi-Cluster for External Workloads

Primary external contributor: Adam Bocim (Seznam.cz)

Starting with 1.12, the Cilium clustermesh-apiserver can be configured to use a remote Kubernetes API server (using the --k8s-kubeconfig-path command-line option).

Previously, CiliumExternalWorkload (CEW) resources could only be deployed into the same Kubernetes cluster where clustermesh-apiserver was running. Now we can operate multiple lightweight “user” Kubernetes clusters (e.g. k3s), into which Cilium CRDs (CEW, Cilium Network Policies, etc.) are deployed, on top of the “infrastructure” Kubernetes cluster where the Cilium components are running.

Each of these k3s servers can be used as a control plane for OpenStack (using the cilium-openstack plugin, currently under development by Ondřej Blažek, Seznam.cz) and bare-metal clusters, so CNPs and CCNPs can be enforced across all of these Kubernetes, OpenStack, and bare-metal environments.


External Workloads

Egress Gateway Promoted to Stable

The Egress Gateway was first introduced as a beta feature in Cilium 1.10. It uses the CiliumEgressNATPolicy CRD to forward cluster-external IPv4 connections through specific Gateway nodes, and masquerades them with predictable IP addresses that are associated with the Gateway node. This allows for the integration with legacy firewalls that require static IP addresses.

Egress Gateway

With Cilium 1.12 this feature is promoted to Stable status. A new CiliumEgressGatewayPolicy CRD improves the selection of the Gateway node (by using a NodeSelector) and allows specifying the masquerade IP either directly or by its associated network interface. The --install-egress-gateway-routes option can be used in combination with ENI interfaces, so that traffic leaves the Gateway node on the correct ENI network interface.

In the example below, all connections from test-app to external endpoints in the 192.0.2.0/24 CIDR are routed through myEgressNode, get masqueraded with 192.0.2.2 and leave the cluster on the associated network interface.

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  selectors:
    - podSelector:
        matchLabels:
          app: test-app
  destinationCIDRs:
    - 192.0.2.0/24
  egressGateway:
    nodeSelector:
      matchLabels:
        kubernetes.io/hostname: 'myEgressNode'
    # specify either 'egressIP' or 'interface':
    egressIP: '192.0.2.2'
    # interface: "eth0"

IPv6 Support for BGP

BGP is not just the foundational protocol behind the Internet; it has become the standard within data centers as well. Modern data center network fabrics are typically based on a “leaf-and-spine” architecture, where BGP is used to propagate endpoint reachability information. Given that such endpoints can be Kubernetes Pods, it was evident that Cilium should introduce support for BGP.

BGP capabilities arrived in the 1.10 release, with further enhancements in 1.11. They enabled platform operators to advertise Pod CIDRs and service IPs and to establish routing peering between Cilium-managed Pods and an existing networking infrastructure. Crucially, this enabled connectivity between Pods and service IPs without the need to install additional components.

These features were for IPv4 only and leveraged MetalLB.

New BGP support in Cilium 1.12

With IPv6’s adoption continuing to grow, it became clear that Cilium would require BGP IPv6 capabilities – including Segment Routing v6 (SRv6). While MetalLB had some limited IPv6 support via FRR, it remained experimental. The Cilium team evaluated various options and decided to pivot instead to the more feature-rich GoBGP package. 

Enabling this feature only requires a flag (--enable-bgp-control-plane=true). 
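In Helm terms, the same can be expressed as a values entry (a sketch; we assume the bgpControlPlane key of the chart corresponds to the agent flag named above):

```yaml
# values.yaml sketch: enable the new GoBGP-based BGP control plane.
# Equivalent to the --enable-bgp-control-plane=true agent flag.
bgpControlPlane:
  enabled: true
```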

With this new release comes a new BGP CRD (“CiliumBGPPeeringPolicy”) that will provide more granularity and scalability for BGP operators. 

The example below enables users to deploy label-based BGP configuration: using nodeSelectors, the same BGP configuration can be applied to multiple nodes.

Each BGP configuration defined in virtualRouters can be thought of as a BGP router instance, with its own local autonomous system number (ASN) and peering parameters.

When set to true, the exportPodCIDR flag will advertise all Pod CIDRs dynamically, removing the need to manually specify which prefixes need to be advertised.

Users will be able to peer with either an IPv4 or IPv6 BGP neighbor by specifying its details in peerAddress and peerASN.

---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
 name: rack0
spec:
 nodeSelector:
   matchLabels:
     rack: rack0
 virtualRouters:
 - localASN: 65010
   exportPodCIDR: true
   neighbors:
   - peerAddress: "10.0.0.1/32"
     peerASN: 65010

From 1.12, GoBGP will be used to provide Pod CIDR advertisement, for both IPv4 and IPv6.

If you would like to learn more, check out the docs or watch the demo below:

Users that require LoadBalancer Virtual IP advertisement should keep using the MetalLB implementation until that functionality is ported to GoBGP (expected to come in the next releases).

VTEP support

Primary external contributor: Vincent Li (F5 Networks)

For users who run their own data centers, one frequent deployment model involves running a tunnel network such as VXLAN (or Geneve) to connect Kubernetes applications with one another. With this model, a question arises about how to connect existing applications in the physical network to applications in the Kubernetes cluster. One solution to this is to introduce a VTEP device that can translate traffic from the physical network into the virtual network, and vice-versa.

VTEP support added in Cilium 1.12 allows the user to configure Cilium with the details about the VTEPs that are deployed in the network, and decide how to route traffic towards these VTEPs based on the IP prefixes. Additionally, by configuring the VTEP with the details about the Cilium virtual network identifiers (VNI), the VTEP can act as a LoadBalancer for traffic coming in from the network. This way, incoming requests from existing network applications can be routed to backends in the Kubernetes cluster.
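Configuration-wise, the VTEP details can be sketched as Helm values like the following (keys as we understand them from the VTEP integration guide; all addresses are illustrative):

```yaml
# values.yaml sketch: tell Cilium about an external VTEP device.
vtep:
  enabled: true
  endpoint: "192.168.1.100"    # VTEP device IP
  cidr: "10.1.0.0/16"          # prefix routed towards this VTEP
  mask: "255.255.0.0"
  mac: "aa:bb:cc:dd:ee:ff"     # VTEP device MAC address
```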

If you would like to know more, check out the docs.


Security

Running Cilium as an unprivileged Pod

The Cilium DaemonSet has always run its containers in privileged mode. Starting with Cilium 1.12, it is now possible to run Cilium as an unprivileged container. By adopting a least-privilege model, we can now grant the exact Linux capabilities required by the Cilium DaemonSet instead of full permissions. This reduces the potential attack surface of a Cilium installation.

securityContext:
  privileged: false
  capabilities:
    drop:
      - ALL
    add:
      # Used to set socket permissions
      - CHOWN
      # Used to terminate the Envoy child process
      - KILL
      # Used since Cilium modifies routing tables, etc.
      - NET_ADMIN
      # Used since Cilium creates raw sockets, etc.
      - NET_RAW
      # Used since cilium monitor uses mmap
      - IPC_LOCK
      # Used by iptables. Consider removing once we are iptables-free
      - SYS_MODULE
      # Needed for now, but may no longer be required on kernels >= 5.11,
      # especially with SYS_RESOURCE granted.
      # On kernels >= 5.8, the BPF and PERFMON capabilities are available instead.
      - SYS_ADMIN
      # Can serve as an alternative to SYS_ADMIN for RLIMIT_NPROC
      - SYS_RESOURCE
      # Both PERFMON and BPF require kernel >= 5.8, container runtime
      # cri-o >= v1.22.0 or containerd >= v1.5.0.
      # If available, SYS_ADMIN can be removed.
      #- PERFMON
      #- BPF

Reduction in required Kubernetes privileges

The 1.12 release contains changes resulting in a significant reduction of the Kubernetes privileges required by Cilium’s ClusterRole, for both Cilium and standard Kubernetes resources.

A Role defines which permissions on which resources are granted to a particular service account. In the Cilium DaemonSet, we have changed these Roles so that Cilium can only read the exact Kubernetes resources that it needs to access.
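As a sketch of what a read-only rule looks like in Kubernetes RBAC terms (the resource list is illustrative, not the exact Cilium ClusterRole):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: read-only-example
rules:
  # Grant only read verbs on the specific resources the agent consumes
  - apiGroups: [""]
    resources: ["nodes", "endpoints", "services"]
    verbs: ["get", "list", "watch"]
```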

ICMP Network Policies

Primary external contributor: Tomoki Sugiura (student at NAIST, part of our Google Summer of Code)

Previous versions of Cilium allowed ICMP traffic through only if all L3 traffic was allowed.

With ICMP network policies added in Cilium 1.12, users can control which specific ICMP types they want to allow. This extension of Cilium network policies could, for example, be used to deny ICMP and ICMPv6 echo messages while allowing all other ICMP messages:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "deny-icmp-echo"
spec:
  description: "ICMP echo policy deny"
  endpointSelector:
    matchLabels:
      id: app1
  egressDeny:
    icmps:
    - fields:
      - type: 8
      - type: 128
        family: IPv6

These new ICMP rules are supported for all our policies, including deny policies as shown above, but also CiliumClusterwideNetworkPolicies and host policies. ICMPv6 messages can be allowed and denied in the same manner.

If you would like to learn more, check out the docs.


Load-Balancing

NAT46/NAT64 support for Load Balancer

NAT46/NAT64

In Cilium 1.10, we paved the way for supporting a generically programmable, high-performance layer 4 load balancer (L4LB) framework, supporting features such as XDP and Maglev in a standalone flavor. As recently described by Seznam.cz, such load-balancers typically live at the edge of the datacenter, replacing expensive dedicated hardware load-balancers, and forward traffic into Kubernetes clusters with massively improved performance.

As users migrate from an IPv4 or IPv4/IPv6 dual-stack setup over to an IPv6 single-stack deployment for better scalability and flexibility, the need for a translation layer arises for routing IPv4 service traffic at the edge of the data center into the IPv6-only cluster. Similarly, traffic originating from the IPv6-only cluster might still require connecting to cluster-external 3rd-party services which are still backed by IPv4 nodes. As such, NAT46 and NAT64 functionality becomes more and more crucial for operators to allow cluster interaction with the rest of the “legacy” world.

Both NAT46 and NAT64 never found their way into the Linux kernel’s iptables/netfilter subsystem, given the complex nature of the networking stack and the address-family-specific metadata tied to each packet (which is hard to migrate from one family to the other, e.g. GRO/GSO-related information). Given that eBPF in the Linux kernel has a simpler model, we implemented NAT46 and NAT64 support for eBPF in the tc layer of the Linux kernel through the bpf_skb_change_proto() helper (fun fact: not only do we use this functionality, but Android netd’s CLAT component also uses it today to connect mobile phones to IPv6-only cell networks).

The Cilium standalone L4LB now supports NAT46 and NAT64 for both XDP and non-XDP operating modes as well as for its data path under Maglev and Random backend selection. Through Cilium’s RPC API, service VIPs in IPv4 with a set of backends in IPv6 can be programmed, or vice versa, service VIPs in IPv6 with a set of backends in IPv4. The edge load balancer then acts as a gateway connecting the two worlds via SNAT’ing the network traffic on the path to a backend.

The following snippet shows a NAT46 load-balancer example entry with an IPv4 frontend address backed by three IPv6 backends…

cilium service list 
ID   Frontend     Service Type   Backend                      
1    1.2.3.4:80   ExternalIPs    1 => [f00d::1]:60 (active)   
                                 2 => [f00d::2]:70 (active)   
                                 3 => [f00d::3]:80 (active)   

…and, vice versa, below is a NAT64 load-balancer example entry with an IPv6 frontend address backed by IPv4 backends:

cilium service list 
ID   Frontend       Service Type   Backend                      
1    [cafe::1]:80   ExternalIPs    1 => 1.2.3.4:8080 (active)   
                                   2 => 4.5.6.7:8090 (active) 

If you want to learn more about the L4LB performance, read the user report by Seznam.cz about their path from hardware load-balancers to an in-house IPVS-based solution and their migration to Cilium’s standalone L4LB.

Quarantining Service Backends

Users can now quarantine service backends using the newly added Cilium services API. The Cilium datapath will not consider such backends for load-balancing of service traffic, meaning traffic to these backends is drained for ongoing connections and they are excluded from new connections.

The primary use case is to enable users to inform Cilium about unreachable service backends using their own out-of-band health-checking mechanism. Users can invoke the same API to set the quarantined service backends as active again. 

In order to quarantine a service backend, users invoke the API to set the state attribute for the backend. The Cilium agent will then update all the services that select the backend with the updated state. The backend states are restored upon Cilium agent restarts. Setting backend states is supported in both the Cilium standalone L4LB as well as Cilium’s eBPF-based kube-proxy replacement.

The following snippet shows the usage of the API.

$ cilium service update --backends 10.244.0.58:80 --states quarantined
Updating backend states
Updated service with 1 backends
 
$ cilium service list
ID   Frontend          Service Type   Backend
4    10.96.95.246:80   ClusterIP      1 => 10.244.0.58:80 (quarantined)
                                      2 => 10.244.1.59:80 (active)
                                      3 => 10.244.1.8:80 (active)
 
$ cilium service update --backends 10.244.0.58:80 --states active
Updating backend states
Updated service with 1 backends
 
$ cilium service list
ID   Frontend          Service Type   Backend
4    10.96.95.246:80   ClusterIP      1 => 10.244.0.58:80 (active)
                                      2 => 10.244.1.59:80 (active)
                                      3 => 10.244.1.8:80 (active)

Improved Multi-Homing for Load Balancer

Previously, Cilium’s kube-proxy replacement could not work properly on nodes that were reachable via multiple network devices or multiple paths. For example, in a cluster in which nodes were connected via ECMP, not all routes could be utilized for forwarding service requests to remote endpoints.

In Cilium 1.12, we extended the routing lookups in Cilium’s datapath to enable various multi-homing network architectures. One example of such an architecture is shown in the diagram below.


Networking

BBR congestion control for Pods

TCP BBR (“Bottleneck Bandwidth and Round-trip Propagation Time”) is a relatively new congestion control algorithm that achieves higher bandwidths and lower latencies for Internet traffic by responding to actual congestion rather than packet loss. In particular, it achieves big TCP throughput improvements on high-speed, long-haul links, and enables significant reductions in latency in last-mile networks that connect users to the Internet.

For example, take a typical server-class node with a 10 Gbps link, sending over a path with a 100 ms round-trip time (e.g. Chicago to Berlin) with a packet loss rate of 1%. In such a case, BBR’s throughput is 2,700x higher than today’s best loss-based congestion control which is set in Linux by default, that is, CUBIC. CUBIC gets about 3.3 Mbps, while BBR achieves over 9.1 Gbps.

Since the end of last year, we have been experimenting with bringing the benefits of TCP BBR to the wider Kubernetes user base by transparently enabling BBR support for Pods. This is particularly useful for Kubernetes services which are exposed to clients connecting to them over the Internet.

But there were significant challenges with bringing this exciting technology to Kubernetes. At any time, a TCP connection has one slowest link: the bottleneck. Evidently, the ideal traffic flow behavior is for the sender to send traffic at the bottleneck’s bandwidth rate, but with the lowest possible delay. 

Standard TCP congestion control tends to send as much traffic as possible and adjusts its rate when it detects loss, which often causes TCP to significantly reduce its rate. It can also lead to severe “bufferbloat” issues, where packets sit in very deep queues on network devices.

The reason BBR is so effective at increasing throughput and reducing latency is that it determines the bottleneck’s capacity and monitors the packet RTT (hence its name). BBR can then optimize how much traffic is sent and can pace data packets, while minimizing delays due to packets being stuck in a queue.

Crucially, BBR relies on timestamps for its algorithm. The problem with using BBR on Kubernetes was that timestamps were being cleared as they traversed network namespaces (watch our Linux Plumbers conference talk where we outlined the issue).     

We worked with the kernel networking community to address this issue in v5.18 and newer kernels by explicitly retaining egress packet timestamps instead of clearing them upon network namespace traversal. 

By combining Cilium’s efficient BPF host routing architecture with the kernel enhancements, we can carry the original delivery timestamp and retain the packet’s socket association all the way to the physical devices. Cilium’s Bandwidth Manager’s FQ scheduler can then properly deliver packets with BBR in a stable manner. For further details, see our deep dive talk on Cilium’s Bandwidth Manager at this year’s KubeCon EU, which covers the technical aspects of how we fixed BBR for Pods.

With the Cilium 1.12 release, Cilium is now the first CNI to support BBR for Kubernetes Pods. To demonstrate how effective BBR can work, we built a video streaming service with the help of Kubernetes where we compare the Linux default (CUBIC) against Cilium with its new BBR support:

Follow the guide for more details around setup parameters for BBR in docs.cilium.io.
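As a reference point, a minimal Helm values sketch for enabling BBR is shown below. It builds on the Bandwidth Manager and, as noted above, requires a v5.18+ kernel and BPF host routing; verify the exact option names and prerequisites against the docs for your version:

```yaml
# Sketch of Helm values for enabling BBR for Pods;
# BBR requires the Bandwidth Manager to be enabled as well.
bandwidthManager:
  enabled: true
  bbr: true
```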

Bandwidth Manager promoted to Stable

Initially released in Cilium 1.9, the Bandwidth Manager feature has been successfully adopted in production environments. We are pleased to announce its promotion to Stable status. 

This innovative feature provides a scalable and efficient way to enforce egress bandwidth limits for Pods, preventing bandwidth starvation and contention issues caused by bandwidth-hungry Pods.

Once the Bandwidth Manager feature is enabled, simply specify the maximum egress bandwidth limit in the Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/egress-bandwidth: 10M
...

As you can see in the speed test below, Cilium prevents the Pod from consuming more than its allocated bandwidth (notice the 9.54 Mbps throughput in the test, with the Pod limited to 10 Mbps):

% kubectl apply -f netperf.yaml 
pod/netperf-server created
pod/netperf-client created
% NETPERF_SERVER_IP=$(kubectl get pod netperf-server -o jsonpath='{.status.podIP}')
kubectl exec netperf-client -- \
    netperf -t TCP_MAERTS -H "${NETPERF_SERVER_IP}"
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.108.2.172 (10.108.) port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00       9.54 

To learn more, read the official docs or watch the demo here.

Dynamic Allocation of Pod CIDRs (beta)

When Cilium is the primary CNI, it is responsible for the allocation and management of IP addresses used by Kubernetes Pods. Cilium has many different IP Address Management (IPAM) modes, but the most popular choices for self-managed Kubernetes installations are the Kubernetes and Cluster Pool IPAM modes.

Both those modes work based on the concept of a PodCIDR: When a new node is added to the cluster, it is assigned a static PodCIDR (e.g. a /24 IPv4 subnet) by either the Kubernetes API server or Cilium Operator. This allows Pod IP allocation to be delegated to each node, as each Cilium Agent instance can independently assign local Pod IPs from its designated PodCIDR. Having a designated PodCIDR per node also allows native routing to be performed based on that CIDR, for example via Cilium’s BGP integration or Cilium’s auto-direct-node-routes feature.

Cluster Pool v2

However, the requirement that all Pod IPs are allocated from a single static PodCIDR can also be limiting, because it implies that all nodes have the same maximum number of Pod IPs available, while the actual number of Pods per node may vary. For example, due to machine sizing, not all nodes may have the same capacity for Pods. Another reason can be that some workloads are inherently unbalanced, meaning some nodes end up running a few large Pods, while other nodes run many smaller Pods.

Therefore, a one-size-fits-all PodCIDR might actually not fit all – and end up being too small or too large for a certain node. If the default PodCIDR size is too small, the number of Pods that a node can run is limited by the number of available IPs. If the PodCIDR is too large, IP addresses end up being wasted, which is a problem if the PodCIDR range is shared with other endpoints on the network. Overly large default PodCIDRs are also problematic because they limit the number of nodes in a cluster (as the global pool of PodCIDRs is exhausted quicker with larger CIDRs).

For these reasons, Cilium 1.12 is introducing a new IPAM mode called Cluster Pool v2 (Beta). It extends the regular cluster-pool IPAM to allow additional PodCIDRs to be dynamically allocated to each node based on usage. With Cluster Pool v2, each Cilium agent instance reports the utilization of its PodCIDRs via the CiliumNode resource:

apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  name: worker-node-3
spec:
  ipam:
    podCIDRs:
      - 10.0.0.0/26
      - 10.0.1.64/26
      - 10.0.2.192/26
status:
  ipam:
    pod-cidrs:
      10.0.0.0/26:
        status: in-use
      10.0.1.64/26:
        status: depleted
      10.0.2.192/26:
        status: released

This allows the operator to detect if a node is running low on available Pod IPs, and to assign an additional PodCIDR to that node. Likewise, if a node has unused PodCIDRs, it can release them, allowing Cilium Operator to re-assign the released PodCIDRs to a different node if needed.

This retains the advantages of PodCIDR-based IPAM (simplified routing and minimal global coordination when allocating Pod IPs), but allows cluster operators to introduce smaller PodCIDRs without limiting the maximum number of Pods each node can run.
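As a sketch, switching to the new mode via Helm values could look like the following (the pool range and mask size are examples; check the docs for the exact option names in your chart version):

```yaml
# Example Helm values for the Cluster Pool v2 (beta) IPAM mode
ipam:
  mode: cluster-pool-v2beta
  operator:
    clusterPoolIPv4PodCIDRList:
      - "10.0.0.0/8"
    # Smaller per-node CIDRs become viable, since a node that runs
    # low on IPs can dynamically be assigned additional PodCIDRs.
    clusterPoolIPv4MaskSize: 26
```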

To find out more, read the docs here.

Send ICMP host unreachable on Pod deletion

Primary external contributor: Laurent Bernaille (Datadog)

When a Pod is deleted, Cilium currently removes all local routes to the old Pod IP. Clients may still try to connect to the old IP until it is deregistered, and when running on providers such as AWS, GCP, or Azure, these attempts will make it to the instance. The packets will then be dropped (either by the operating system or by the cloud provider after routing). Clients attempting these connections will retry according to tcp_syn_retries and only abort after close to 2 minutes with default settings.

To address this scenario, we introduced a new flag, enable-unreachable-routes, which will change the behavior on Pod deletion: instead of removing the route to the old IP, Cilium will replace it with an unreachable route. With this setup, any packet to the old IP will trigger an ICMP error message and clients will immediately be notified that this IP is not reachable anymore.
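Conceptually, the change amounts to swapping a delivery route for an unreachable route in the node’s routing table. A rough illustration follows; the IP address is hypothetical, and Cilium manages these routes internally, so the commands are only meant to show the mechanism:

```shell
# Illustration only: Cilium manages these routes itself.
# Before deletion, a direct route delivers traffic to the Pod IP:
ip route show 10.0.1.23
# With enable-unreachable-routes, deletion effectively performs:
#   ip route replace unreachable 10.0.1.23
# after which any packet to that IP triggers an ICMP "host unreachable"
# error back to the sender, instead of a silent drop and lengthy retries.
```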

This behavior is currently opt-in but we are thinking of making it the default in the future. If you test it and have questions/feedback, reach out to us!

More details in this pull request, and more context on the consequences of the former behavior in this blog post.

AWS ENI prefix delegation

Primary external contributor: Hemanth Malla (Datadog)

Cilium now supports IPv4 prefix delegation in ENI mode. This feature allows for assigning /28 prefixes to network interfaces instead of individual /32 IP addresses.

This effectively increases the allocatable IP capacity on supported nodes by around 16 times. AWS supports ENI prefix delegation only on instances built on the AWS Nitro System. The following are some additional benefits of using prefix delegation:

  • Reduced reliance on the operator for Pod IP allocation.
  • Reduced API calls to AWS resulting in faster Pod startup time and reduced throttling.
  • Reduced cost in Amazon VPC IP Address Manager.

Prefix delegation can be enabled by setting the --aws-enable-prefix-delegation flag on the operator. Please note that currently this would enable prefix delegation only on new nodes that join the cluster. Also note that since the operator now allocates 16 IP addresses at a time, on nodes with lower Pod density, this can be wasteful.
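If you install via Helm, the equivalent value is likely along these lines (verify the exact key against the chart reference for your version):

```yaml
# Sketch of Helm values for AWS ENI prefix delegation;
# corresponds to the operator's --aws-enable-prefix-delegation flag.
eni:
  enabled: true
  awsEnablePrefixDelegation: true
```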

AWS EC2 instance tag filter

Primary external contributor: Prune Sebastien Thomas (The New York Times)

When deployed in AWS EKS and using Cilium IPAM (i.e. Cilium manages IP allocation for Pods), the Cilium Operator has to maintain the list of resources available in AWS, such as VPCs, subnets, instances, and ENIs (Elastic Network Interfaces). On large projects, the number of resources that Cilium needs to maintain can become huge, which impacts the discovery process: latency in allocating an IP for a new Pod, latency in reconciling the resource list, and high memory consumption.

An option was added in Cilium 1.8 to filter the resources based on the subnets so Cilium only maintains a subset of the existing resources. Only the ENIs that are part of the subnets defined in --subnet-ids-filter or having a tag defined in --subnet-tags-filter are maintained.

But filtering on subnets is not ideal as some instances may start without any ENI with attached IPs that match the subnet filter. This is the case when using a split network, where the first ENI is only used for the instance networking and the remaining ENIs (up to the limit depending on your instance size) are used to attach Pod IPs. In Cilium 1.12 we added an instance tag filter option that filters instances and ENIs by tags set on the instances, without consideration of the networks attached to them.

Cilium should mostly only manage the resources that are at play inside one cluster. By using the instance tag filter you can ensure that Cilium will only maintain resources of the instances that are part of the EKS cluster. Use the command line option --instance-tags-filter or the Helm chart value eni.instanceTagsFilter to define the key=value to filter the instances. For example, EKS adds an aws:eks:cluster-name=<name of your cluster> tag to each instance by default, allowing you to easily filter out any unrelated instances with this new option.
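For example, restricting Cilium to the instances of a hypothetical EKS cluster named my-cluster could look like this (the cluster name is an example, and the exact value format may differ slightly by chart version):

```shell
# Set the instance tag filter via Helm, using the default tag
# that EKS adds to each instance of the cluster.
helm upgrade cilium cilium/cilium --namespace kube-system \
  --reuse-values \
  --set eni.instanceTagsFilter="aws:eks:cluster-name=my-cluster"
```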


Tetragon

Tetragon is the latest open-source project in the Cilium family. It provides eBPF-based transparent security observability combined with real-time runtime enforcement.


The deep visibility is achieved without requiring application changes and is provided at low overhead thanks to smart in-kernel filtering and aggregation logic built directly into the eBPF-based kernel-level collector. The embedded runtime enforcement layer is capable of performing access control at the system-call level and other enforcement layers.

For more information, check out the Tetragon GitHub repository.


User Experience

Automatic Helm Value Discovery

We have seen great adoption of the Cilium CLI since its introduction in 1.10. With the ability to discover the type of cluster Cilium is being deployed into, the Cilium CLI automatically sets the right values to deploy Cilium. This auto-detection logic has been available since v0.11.0. With 1.12, Cilium adds support for using this auto-detection logic to automatically generate the ideal Helm installation values for the targeted cluster.

The generated helm-values file can be used either with helm or with the Cilium CLI itself.

$ cilium install --version v1.12.0 --helm-values my-helm-values.yaml
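To generate the values file in the first place, the CLI’s auto-generation flag can be used (a sketch assuming cilium-cli v0.11 or later; verify the flag name against your CLI version):

```shell
# Detect the cluster environment and write the ideal Helm values
# without installing anything.
cilium install --helm-auto-gen-values my-helm-values.yaml
# The same file can then be fed to Helm directly.
helm install cilium cilium/cilium --namespace kube-system \
  --values my-helm-values.yaml
```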

AKS BYOCNI support

Microsoft recently released a new “Bring your own CNI” mode available when creating AKS clusters, allowing users to create clusters with no CNI plugin pre-installed.

In 1.12, Cilium and the Cilium CLI officially support AKS clusters created in BYOCNI mode. Cluster installation and scaling operations are now much simpler than before as users do not need to manage a complex taint system with multiple node pools anymore. With Cilium being the sole CNI, it is now able to ensure application Pods are not scheduled on new nodes before Cilium agents are ready to handle them.
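A minimal sketch of the workflow is shown below (the resource group and cluster names are hypothetical):

```shell
# Create an AKS cluster with no CNI plugin pre-installed (BYOCNI)
az aks create --resource-group my-rg --name my-cluster \
  --network-plugin none
# Fetch credentials, then let the Cilium CLI detect the BYOCNI cluster
az aks get-credentials --resource-group my-rg --name my-cluster
cilium install --version v1.12.0
```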

Please refer to our Getting Started Guides for more details and instructions.

Note: BYOCNI is now the preferred way to run Cilium on AKS, however integration with the Azure stack via Azure IPAM is not available in BYOCNI. If you require Azure IPAM, please stick with the previous Azure IPAM-based Cilium installation.

Improved chaining mode

Primary external contributors: Nitish Malhotra (Microsoft), Heiko Rothe (Idealo)

This release brings improvements to the support for chaining mode when combining Cilium alongside CNI plugins from existing cloud providers. More specifically, Azure users will benefit from the Layer 7 policy support that has been added when chaining Cilium on top of Microsoft’s default Azure CNI plugin.

Additionally, IPv6 EKS clusters deployed using Amazon’s aws-cni can now be chained with Cilium to provide Cilium features on top of these IPv6 deployments.

Better Troubleshooting with Hubble CLI

Hubble CLI is commonly used to troubleshoot network policy issues. To help with this use case, we have improved the readability of policy verdict events in the new Hubble CLI release shipped with Cilium 1.12. It now better indicates whether a particular connection has been allowed or denied by policy, and also displays the Cilium-assigned security identity of the source and destination of each network flow.


This release also comes with quality-of-life improvements specifically designed to help with interacting with Hubble Timescape: time-range-based filters (such as --since and --until) now support abbreviated date formats, making it easier to query for flows within a particular range of dates. The Hubble API has also been extended to support a new --first option, allowing users to query for just the first N flows matching a particular filter set.

The overall CLI experience has been improved with many small changes, such as better shell completion for various filter flags, better command-line usage descriptions, and support for named identity filters.
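For instance, the improved policy verdict output can be combined with the filter flags along these lines (the flag values are illustrative; adjust them to your environment):

```shell
# Show recent policy verdict events, indicating allow/deny decisions
# along with source and destination security identities.
hubble observe --type policy-verdict --since 5m
# Filter flows by the Cilium-assigned security identity of the source.
hubble observe --from-identity 1234
```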

Isovalent Cilium Enterprise – Features & Updates

The following new features and updates are available in Isovalent Cilium Enterprise 1.10-CE and 1.11-CE.

DNS Proxy HA

The Highly-Available (HA) version of the DNS Proxy available as part of Isovalent Cilium Enterprise has seen numerous improvements including additional observability events and a reduced memory footprint optimized for the cluster edge.


When applications make DNS requests towards specific FQDNs such as isovalent.com, Cilium’s built-in DNS proxy monitors for DNS requests and responses in order to associate the IPs of destinations with the names specified in the network policy. This allows app teams and cluster operators to define network policies based on the domain names used by peers rather than specifying IP addresses in the policy.

With the HA version of DNS proxy, the DNS requests are now load-balanced between the Cilium Agent’s DNS proxy and an external DNS proxy. This external DNS proxy runs independently of the main Cilium agent, so the application’s DNS requests continue to flow even if the Cilium agent is unavailable. The HA version of the DNS proxy thus provides uninterrupted datapath operation, keeping FQDN network policies and DNS visibility working while the Cilium agent is unavailable or being upgraded.

OpenShift Certification


Both Isovalent Cilium Enterprise (1.10-CE, 1.11-CE) and Cilium OSS (1.9, 1.10, 1.11, 1.12) are certified on OpenShift. New in the latest releases is the support for offline installation. To get started with Cilium on OpenShift please follow this guide.

Timescape – Observability Storage & Analytics

Timescape, initially announced with the release of Isovalent Cilium Enterprise 1.10, is an observability and analytics platform to store and query all the data that Cilium, Hubble, and Tetragon collect. In other words, Timescape can be seen as a time machine for observability data with powerful analytics capabilities.

Ingestion performance improvements

Primary contributors: Alexandre Perrin, Chance Zibolski, Glib Smaga, Robin Hahling (Isovalent)

As Hubble Timescape has to ingest billions of events in most target environments, collecting and indexing this data should be done as fast as possible. To this end, we’ve implemented smarter ways to list objects in a bucket, reworked our database schemas, improved memory usage and caching patterns, and increased parallelism of the ingestion code. These improvements lead to faster ingestion speed (250k events/s is now easily achieved on a modest single node database instance).

Query performance improvements

Primary contributors: Chance Zibolski, Glib Smaga (Isovalent)

In the area of performance, we also tuned our queries to ensure they return fast even when the database contains billions of events. In particular, the query endpoint that returns events related to a container (e.g. process events) is much faster while consuming less memory. Queries to retrieve Hubble flows are also much faster, especially queries that request a subset of data in a given time-frame (e.g. hubble observe --last 100 --since 1h --from-service ns-a/my-service).

gRPC API updates

Primary contributors: Anna Kapuścińska, Chance Zibolski, Glib Smaga, Kornilios Kourtis (Isovalent)

Hubble Timescape also implements the GetFlows gRPC endpoint that Hubble and Hubble Relay implement. Therefore, users may point their Hubble CLI at the Hubble Timescape service instead of the Hubble Relay one to query historical flow data spanning up to several weeks instead of only the very recent past. The Timescape implementation had some limitations compared to the Hubble Relay API, but these have now been addressed. In particular, L7 filters and CIDR ranges in IP filters are now supported.

Aggregation support for the GetFlows endpoint (an enterprise feature) has also been implemented. Hubble UI users are thus able to select aggregation mode in the history view.

New experimental gRPC endpoints have been added that allow retrieving flow counts matching filters, getting all ancestors of a given process, getting events related to a container, or even constructing arbitrary queries.

Hubble RBAC support

Primary contributors: Alexandre Perrin, Chance Zibolski, Robin Hahling (Isovalent)

Hubble RBAC serves as a way to restrict access to the Hubble gRPC APIs and to the metrics endpoints of Cilium and Hubble. Cilium is typically deployed in Kubernetes, which uses namespaces as isolation boundaries in multi-tenant environments, so Hubble RBAC uses namespaces as its isolation boundary as well.

With the upcoming Cilium 1.12 Enterprise release, Hubble Timescape data access can be protected with the same fine-grained policies. Since the same mechanism is used, all Hubble RBAC features are readily available and can be configured in the same way.

Timescape RBAC also integrates with Hubble UI Enterprise, allowing different teams to have their own view of historical data.

Improvements around deployment and supportability

Primary contributors: Alexandre Perrin, Anna Kapuścińska, Chance Zibolski, Robin Hahling (Isovalent)

Deploying and operating an application such as Hubble Timescape requires good visibility into its operations and ways to secure the deployment. Numerous improvements have been made to the experience of running Hubble Timescape: log messages are now more useful, repetitive messages are rate-limited, health checks have been implemented, Prometheus metrics are now enabled by default for ClickHouse, and much more.

While Hubble Timescape allows defining a TTL to prune old data (which defaults to 2 weeks), a new feature was added that limits the number of events kept in the database. When enabled, it ensures that the database is never overloaded: once the given limit is reached, older events are pruned. This keeps the database within capacity even if the amount of data to handle suddenly spikes.

Historic View in Hubble UI via Timescape

Primary contributors: Dmitry Kharitonov, Renat Tuktarov (Isovalent)

Hubble UI can now visualize the Service Map based on data stored in Hubble Timescape. This allows users to travel back to a selected date/time range and figure out what communication happened between services. The Flows Table supports an aggregated mode in which each flow is grouped by identity and appears only once.

Advanced Network Observability

The Tetragon Enterprise release included in Isovalent Cilium Enterprise provides new network observability insights at multiple layers in the stack, including interface stats, TCP, UDP, and HTTP. These new network observability events include the full enriched Tetragon metadata identifying the process, Kubernetes Pod/container/metadata, and additional network annotations allowing users to filter and perform extensive analysis. Users can easily group events by namespace for network attribution, down to per-process information, to identify potentially slow or problematic processes.

The network metadata includes DNS names, destination Pod names, and other translations. This allows for creating visualizations that target, for example, the latency or throughput of a specific application or endpoint, monitoring for traffic bursts (when an application suddenly sends a large amount of data), identifying DNS errors in the network by application or DNS server, doing traffic accounting, and much more.

Multiple sample dashboards are provided with the enterprise version that show these examples. For example, below we show all interface traffic including interfaces internal to Kubernetes Pods to get a complete view of Kubernetes traffic. Specifically, this allows viewing Pod local traffic to understand network patterns inside a Pod that are traditionally not visible to monitoring applications and the Kubernetes CNI.

Monitoring TCP latency is done through BPF monitoring of the TCP stack, as shown below. We can then visualize this per destination DNS name to understand latency spikes in the network. The data can also be broken down using the additional metadata to show per-process or per-namespace views, and so on.

Additional TCP metrics include drops, retransmits, bytes, and packets, and similar UDP events have been added. Layer 7 protocols can also be parsed in BPF and exported; this release supports the HTTP, DNS, and TLS protocols. Monitoring TLS handshakes allows Tetragon to build a view of TLS usage in the cluster, including TLS versions, ciphers, and other negotiated parameters. This is done transparently to the application and does not require a sidecar or proxy to generate the events. Example dashboards are shown below.

If users want to monitor events directly through the CLI, this is also possible and provides live viewing of system events. Combining the HTTP, DNS, and network parsers allows users to view the entire flow of an HTTP connection, as shown here.

For large clusters, similar dashboards can be created to show common URIs, error codes, latency of HTTP request/responses and so on.

Finally, with the Enterprise release, the above events can be integrated with a customer SIEM, OpenTelemetry, fluentd, Grafana, and other event systems.

Community

Thank You to Contributors & End-Users

We keep saying this for every Cilium release, but this is the biggest Cilium release yet. As a community, we have been able to welcome many new contributors and end-users. We have a new Cilium Adopters page listing many of you out there using and relying on Cilium every day. If you want to see yourself listed, simply open a pull request.

Want to Get Involved?

The Cilium project is always looking for help. There are many awesome opportunities to collaborate and contribute, from writing code by picking good-first-issue tasks, to helping with documentation, giving Cilium presentations, or contributing to Cilium’s YouTube channel eCHO.

See Get Involved to get started on Cilium contributions.

Favorite eCHO Episodes

The following are our favorite recent eCHO episodes. eCHO stands for eBPF & Cilium Office Hours and is a weekly show hosted by Liz Rice and Duffie Cooley. It covers ideas, updates, and news related to eBPF and Cilium.

Mark your calendars for the upcoming eCHO Livestream on July 29th where Duffie will be joined by a couple of Isovalent engineers to discuss Cilium 1.12.

Subscribe to the eCHO YouTube channel to be notified of future episodes.

Getting Started

If you are interested in getting started with Cilium, use one of the resources below:

Previous Releases

Author: Thomas Graf, CTO & Co-Founder Isovalent, Co-Creator Cilium, Chair eBPF Governing Board