Next-Generation Mutual Authentication (mTLS) with Cilium Service Mesh

This is a follow-up to the popular blog post How eBPF will solve Service Mesh – Goodbye Sidecars, which describes how Cilium provides a service mesh without the use of sidecars. In this blog, we expand on the topic of mTLS and look at how Cilium provides sidecar-free mTLS-based authentication with excellent security and performance characteristics.

Mutual authentication has long been a cornerstone of security, and we rely on it every day through protocols and technologies such as SSH, mTLS, and IPsec. A more recent development is the desire to use strong mutual authentication to secure service-to-service communication in Kubernetes and cloud native infrastructure overall. In this blog, we will look at how Cilium and Cilium Service Mesh leverage eBPF to offer a new way of doing identity-based mutual authentication for services, with a high-performance data plane that can support any network protocol without requiring changes to applications or injecting a sidecar. We’ll also look at how the security model can be strengthened considerably by extending the identity concept to include processes, binaries, and the execution context, for example to only allow certain binaries running in an unprivileged context to authenticate each other.

What is Mutual Authentication?

Mutual authentication is the process of two parties, a sender and a receiver, authenticating each other’s identity to ensure that each is talking to the party it intends to communicate with. This is not to be confused with integrity and confidentiality. Integrity ensures that the exchanged messages have not been tampered with. Confidentiality ensures that messages cannot be read by third parties. It is often assumed that “encryption” guarantees all three, but they can and do come separately. In fact, we all use TLS every day to achieve confidentiality, integrity, and server authentication but typically don’t rely on mutual authentication, i.e. the TLS session ensures that we are talking to the right server, but we then rely on a password or a different form of authentication on top to authenticate ourselves with the web service.
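To illustrate that these properties really do come separately, here is a minimal sketch using Python's standard `hmac` module. A message authentication code over a shared key gives the receiver integrity and proof that the sender holds the key, while the message itself travels in cleartext, so there is no confidentiality. The key and message values are made up for the example.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the cleartext message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


# Hypothetical shared key and message, for illustration only.
shared_key = b"demo-key-known-to-both-parties"
message = b"hello from app2"
tag = sign(shared_key, message)

# The receiver accepts the untouched message and rejects a modified one.
# The message is never encrypted at any point.
print(verify(shared_key, message, tag))
print(verify(shared_key, b"hello from app3", tag))
```

Anyone on the path can read the message, but nobody without the key can alter it or forge a new one without being detected.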

Mutual authentication is typically implemented with either public and private key pairs or a single shared key. Both forms rely on performing a handshake using encrypted messages. Below is an example of what the handshake looks like for TLS 1.3:

After the identities at either end of the communication are established through the handshake, an encrypted channel gets set up to carry data between those two identities for the duration of the TLS session.

Mutual TLS or mTLS is one well-known way to achieve mutually authenticated, encrypted traffic, but it’s not the only one. IPsec uses IKE (Internet Key Exchange) as the handshake that authenticates the node endpoints at either end of a communication, and then creates an encrypted data connection between them.
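What distinguishes mTLS from the everyday TLS described earlier is simply that the server also demands and verifies a certificate from the client. A minimal sketch with Python's standard `ssl` module is shown below. The function names and the optional file path parameters are assumptions made for this example, and no certificates are loaded unless paths are passed in.

```python
import ssl
from typing import Optional


def server_context(cert: Optional[str] = None, key: Optional[str] = None,
                   ca: Optional[str] = None) -> ssl.SSLContext:
    """Server side of mTLS: present a certificate and require one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    # A plain TLS server leaves this at CERT_NONE. Requiring a client
    # certificate is what makes the authentication mutual.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert and key:
        ctx.load_cert_chain(cert, key)   # our own identity
    if ca:
        ctx.load_verify_locations(ca)    # CA used to verify the peer
    return ctx


def client_context(cert: Optional[str] = None, key: Optional[str] = None,
                   ca: Optional[str] = None) -> ssl.SSLContext:
    """Client side of mTLS: verify the server and present our own certificate."""
    # PROTOCOL_TLS_CLIENT already enables CERT_REQUIRED and hostname checking.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    if cert and key:
        ctx.load_cert_chain(cert, key)
    if ca:
        ctx.load_verify_locations(ca)
    return ctx
```

With real certificates, each context would be used to wrap a socket, and the handshake fails if either side cannot present a certificate signed by the other side's trusted CA.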

Session vs Network Based Authentication

A key aspect of mutual authentication is the granularity of the identity, i.e., the granularity at which identities and certificates are handed out. Consider an identification document such as a passport: a separate passport could be handed out to each family member, one passport could cover everybody living in the same household, or a single passport could even identify everybody living in the same town. Depending on the granularity you choose for the identification, different levels of authentication can be performed.

Mapping the granularity concept to network communication, we can look at two typical modes of performing mutual authentication: session-based and network-based authentication.

| | mTLS with Sidecar | Node authentication with IPsec/WireGuard |
| --- | --- | --- |
| Approach | An injected sidecar proxy performs a mutual authentication handshake using TLS 1.3 for each connection between services. | Nodes authenticate each other and encapsulate all network traffic to build an authenticated and encrypted virtual network. |
| Pros | Service-level identity/certificate results in a better security posture. TLS features a modern, low-latency handshake that was designed for the internet and can therefore cross complex networks. | Completely transparent to applications, no pod modifications. Able to support any network protocol. Significantly more efficient (3x better latency, 2x better requests/s) and scalable due to not requiring sidecar proxies. |
| Cons | Requires sidecar injection, causing a performance penalty and added complexity. Challenging or impossible to support protocols other than TCP and QUIC. Leaks the application topology and unsupported protocols to the underlying network. | A compromised node identity/certificate leads to all workloads on a node being compromised. Limited compatibility with TLS certificate based identity management systems such as SPIFFE. |

Looking at the above table, it’s clear that session-based and network-based authentication each offer advantages that we would ideally like to have together. Why not combine the two? We want the finer authentication granularity of service identities and the properties of the TLS handshake, together with the transparency, performance, and broad protocol support of a network-based authentication approach.

Separating Authentication Handshake and Payload

If we split the authentication handshake from the payload transport, we can use TLS 1.3 as the handshake protocol while relying on IPsec or WireGuard as a better-performing, more transparent payload channel:

We gain the benefits of both models and achieve many great properties:

Connections don’t need to be terminated anymore: A sidecar-based approach has to split every TCP connection into three segments in order to inject TLS. The sidecar-free approach does not require terminating or manipulating connections.
No sidecars need to be injected: No additional proxies need to be run. The authentication on behalf of the services can be performed by a single node agent. In the case of Cilium, this agent already exists and is aware of all required context. This simplifies management, improves the resource footprint, and improves scalability.
Support Non-TCP & Multicast: We benefit from the great properties of TLS 1.3, such as the low-latency handshake, without TLS limiting the transport. UDP, ICMP, and any other protocol that can be carried by IP are supported.
Support existing Identity & Certificate Management: Any mTLS-based authentication control plane or identity management system can be plugged in and used to provide certificates for services. This includes SPIFFE, Vault, SMI, Istio, …
Handshake caching & Re-authentication: The handshake can be done once and cached, so that communication between already authenticated service pairs happens without additional latency. Even better, services can be re-authenticated with each other on a regular basis.
Optional Integrity and confidentiality: Providing integrity and confidentiality is the most expensive operation, and it is optional. You may want to benefit from authentication without paying for the compute required to encrypt all payload traffic that follows the authentication.

The above diagrams show both models side by side. On the left is the traditional sidecar-based mTLS approach, which relies on sidecars to inject TLS into every connection. On the right is the sidecar-free approach: the payload connection remains intact, while the TLS authentication is performed separately by Cilium, which controls the payload connection with the help of eBPF.
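The handshake caching and re-authentication property can be sketched as a small cache keyed by service pair. This is an illustration under stated assumptions, not Cilium's implementation: `AuthCache`, the `handshake` callback (standing in for the out-of-band TLS 1.3 handshake), and the TTL value are all hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class AuthCache:
    # Hypothetical callback that performs the out-of-band handshake
    # between two service identities and reports success.
    handshake: Callable[[str, str], bool]
    # How long a successful handshake is trusted before re-authentication.
    ttl_seconds: float = 300.0
    clock: Callable[[], float] = time.monotonic
    _expiry: Dict[Tuple[str, str], float] = field(default_factory=dict, init=False)

    def is_authenticated(self, src: str, dst: str) -> bool:
        key = (src, dst)
        now = self.clock()
        expires = self._expiry.get(key)
        if expires is not None and now < expires:
            # Already authenticated pair: payload flows with no extra latency.
            return True
        # Unknown or expired pair: run the handshake (again).
        if self.handshake(src, dst):
            self._expiry[key] = now + self.ttl_seconds
            return True
        self._expiry.pop(key, None)
        return False
```

The payload path only ever consults the cache, so the cost of the handshake is paid once per service pair and per TTL period rather than once per connection.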

Mutual Authentication with Cilium and Cilium Service Mesh

Cilium’s built-in identity concept, which is used to identify services and implement network policies, is the perfect foundation for integrating advanced identity and certificate management such as SPIFFE, Vault, SMI, cert-manager, or Istio. This allows these existing identity and certificate management layers to be used to manage service identities and generate certificates. The certificates are then used to perform the mutual authentication between Cilium identities representing pods or external workloads (VMs, bare-metal machines, …).

Let’s look at what the above could look like from a configuration perspective. We will use an example of the upcoming SPIFFE integration with Cilium. This allows using SPIFFE identities for the selection of workloads when creating Network Policies.

Example 1: Allowing app2 => app1 communication with mutual authentication

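apiVersion: 'cilium.io/v2'
kind: CiliumNetworkPolicy
metadata:
  name: 'auth-rule-spiffe-app1'
spec:
  # Workloads this policy applies to: those with the SPIFFE identity of app1.
  endpointSelector:
    matchLabels:
      'spiffe://mycluster/app1': ''
  ingress:
    # Only workloads with the SPIFFE identity of app2 may connect.
    - fromEndpoints:
        - matchLabels:
            'spiffe://mycluster/app2': ''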

As the above example shows, Network Policies specify the permitted set of endpoints by SPIFFE identity. This makes it very straightforward to take existing policies based on endpoint selectors, and harden them to use certificate-based identities.

An additional Layer of Security

It’s worth noting that mutual authentication at the service level doesn’t simply replace network policy at the network layer. It adds an extra layer of security. Cilium still identifies individual endpoints as it does today, and network segmentation still gets applied for these individual endpoints. If a network policy specifies both the SPIFFE identity and endpoint selectors, it’s not sufficient for a malicious workload to impersonate that service using a compromised service-level certificate.

1. For all traffic leaving a pod or service, the destination must be allowed by the egress policy of the pod. Suppose a particular pod manages to steal the certificate representing the identity of another pod. That malicious pod cannot simply authenticate itself with another pod, even if the certificate would allow this: the egress policy will block such an attempt.
2. Assuming the egress policy layer passes, the destination gets authenticated. Besides validating the certificate of the destination, which is expected from mTLS, this step also performs additional validation that is possible because Cilium is in a unique position to authenticate on behalf of the service: does the certificate of the destination belong to a workload that is intended to run on the destination node? This prevents identity theft, because an attacker must not only steal the service certificate and network identity but also run the impersonating workload on a node that is supposed to run the service.
3. Same as step 2, but here the receiver authenticates the sender, again validating that the certificate used by the sender comes from a node that is supposed to run this workload.
4. Finally, the ingress policy must allow the traffic. If the certificate representing a service has been compromised, the attacker must also be able to impersonate an allowed network identity.
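The four layers described above can be sketched as a chain of checks where a connection is only allowed if every layer passes. This is a simplified model written for this post's explanation, not Cilium code: the `Peer` and `Cluster` types, their fields, and the returned strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Peer:
    network_identity: str  # identity used by network policy
    cert_identity: str     # identity in the presented certificate, e.g. a SPIFFE ID
    node: str              # node the workload is actually running on


@dataclass
class Cluster:
    # Allowed (source, destination) network identity pairs.
    egress_allowed: Set[Tuple[str, str]] = field(default_factory=set)
    ingress_allowed: Set[Tuple[str, str]] = field(default_factory=set)
    # Certificate identities issued by a trusted authority.
    trusted_certs: Set[str] = field(default_factory=set)
    # Nodes on which each certificate identity is supposed to run.
    expected_nodes: Dict[str, Set[str]] = field(default_factory=dict)


def authenticate(cluster: Cluster, peer: Peer) -> bool:
    """Certificate must be trusted AND presented from a node expected to run it."""
    if peer.cert_identity not in cluster.trusted_certs:
        return False
    return peer.node in cluster.expected_nodes.get(peer.cert_identity, set())


def check_connection(cluster: Cluster, src: Peer, dst: Peer) -> str:
    flow = (src.network_identity, dst.network_identity)
    if flow not in cluster.egress_allowed:       # step 1
        return "denied: egress policy"
    if not authenticate(cluster, dst):           # step 2
        return "denied: destination authentication"
    if not authenticate(cluster, src):           # step 3
        return "denied: source authentication"
    if flow not in cluster.ingress_allowed:      # step 4
        return "denied: ingress policy"
    return "allowed"
```

In this model, a workload that presents a stolen app2 certificate from a node that is not supposed to run app2 fails at step 3, and a workload that holds a valid certificate but lacks an allowed network identity fails at step 1 or step 4.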

Performance

How will all of this additional security impact performance? Below are measurements comparing Cilium running on GKE with nighthawk for HTTP benchmarking in different models:

Without additional mutual authentication (baseline)
With WireGuard enabled for integrity and confidentiality
Sidecar mTLS model provided by Istio

Update: The benchmarks below have been updated to also include numbers for Istio with protocol sniffing disabled, which forces Istio into a TCP-only mode. The same effect can be achieved by setting name: tcp in the Service.

The above graph shows P95 latency measurements for a baseline without any HTTP processing, Cilium with an HTTP filter configured, and Istio in its default configuration (protocol sniffing), which automatically performs HTTP parsing when HTTP is detected.

These numbers are more or less in line with the latency benchmarks in the Istio documentation.

This second measurement limits involvement to TCP and disables all HTTP processing. The HTTP filter in Cilium is removed. For Istio this is slightly trickier as the protocol sniffing that is enabled by default can lead to HTTP processing even when no intent to parse HTTP is declared. Therefore protocol sniffing is disabled for this test run.

For all measurements, Istio has been tuned by removing the default concurrency limits as well as by removing the default TCP filter. The scripts to reproduce the performance benchmarks can be found in this git repository.

Conclusion

As Cilium Service Mesh is marked stable with version 1.12, this mutual authentication architecture is the next focus of Cilium’s service mesh work. We are convinced that we can not only bring great integration with existing identity management solutions like SPIFFE, cert-manager, or even Istio as a control plane, but also provide a more elegant, higher-performing, and more secure implementation of the authentication, combined with great datapath properties. The tight integration with NetworkPolicy provides a simple-to-use but highly secure communication pattern that protects against network impersonation as well as service identity theft. Given that we have all the foundations in place already, we expect this mutual authentication functionality to become available in 1.13.
