Cilium Service Mesh – Everything You Need to Know

Jul 20, 2022 · Isovalent

Today we are announcing the availability of the first release of Cilium Service Mesh. It introduces an option to run the service mesh completely without sidecars while supporting various control plane options. It complements the existing Istio integration based on sidecars that has been available as part of Cilium so far.

With this release, we aim to reduce complexity and overhead in the service mesh layer by giving our users a choice: run the service mesh with or without sidecars, depending on what best meets the requirements of their platform.

Enterprise Grade Service Mesh

As more enterprises adopt Kubernetes, the need for an enterprise-grade service mesh has become increasingly important. This brings many new enterprise networking requirements into the service mesh concept, which originally spun out of web-scale application teams. This new set of requirements has brought the Kubernetes networking/CNI layer and the service mesh layer closer together.

Illustration: with wider adoption of Kubernetes, enterprise requirements bring the Kubernetes networking/CNI layer much closer to the service mesh layer, evolving into Cilium Service Mesh.

The result is demand for a new layer that combines the two while providing the following:

  • Well integrated into public cloud and on-prem: Like Kubernetes, service mesh initially focused on deployments backed by public cloud infrastructure. With enterprises starting to consider adoption, the need for equivalent functionality on-prem and the ability to connect cloud and on-prem together is rising quickly. Equally important is the multi-cloud and multi-cluster aspect: providing connectivity, security, and observability across clouds and premises, decoupled from the underlying infrastructure provider.
  • Operate at the network level: Control of the network layer is not only key to integrating with existing enterprise networking components on-prem, but it is also essential to fulfill compliance requirements in the cloud regarding segmentation, encryption, and visibility. This includes providing functionality such as network policies, egress gateways, transparent encryption, BGP, SRv6, and integration with traditional firewalls.
  • Operate at the application protocol level: Understanding of application protocols such as HTTP and gRPC is required to meet the requirements of modern application development principles by implementing functionality such as traffic management, canary rollouts, tracing, and L7 authorization. This is achieved by implementing standards like Ingress and Gateway API.

At Isovalent, we created the highly successful CNCF project Cilium, which has become the de facto standard for cloud native networking and security. Cilium powers infrastructure at major enterprises such as Adobe, Bell Canada, Capital One, and IKEA, as well as a majority of managed Kubernetes platforms, including products from Google Cloud and AWS, and it is the default CNI in numerous Kubernetes distributions. With the introduction of Cilium Service Mesh, we are extending its capabilities at the application protocol level.

What is Service Mesh?

With the introduction of distributed applications, additional visibility, connectivity, and security requirements have surfaced. Application components communicate over untrusted networks across cloud and premises boundaries, load balancing must understand application protocols, resiliency is becoming crucial, and security must evolve to a model where sender and receiver can authenticate each other’s identity. In the early days of distributed applications, these requirements were met by embedding the required logic directly into the applications. A service mesh extracts these features from the application and offers them as part of the infrastructure for all applications to use, so individual applications no longer need to be changed.

Illustration of the basic Service Mesh as a function.

Looking at the feature set of a service mesh today, it can be summarized as follows:

  • Resilient Connectivity: Service to service communication must be possible across boundaries such as clouds, clusters, and premises. Communication must be resilient and fault tolerant.
  • L7 Traffic Management: Load balancing, rate limiting, and resiliency must be L7-aware (HTTP, REST, gRPC, WebSocket, …).
  • Identity-based Security: Relying on network identifiers to achieve security is no longer sufficient; both the sending and receiving services must be able to authenticate each other based on identities instead of network identifiers.
  • Observability & Tracing: Observability in the form of tracing and metrics is critical to understanding, monitoring, and troubleshooting application stability, performance, and availability.
  • Transparency: The functionality must be available to applications in a transparent manner, i.e. without requiring changes to application code.

Why Cilium Service Mesh?

Since its early days, Cilium has been well aligned with the service mesh concept by operating at both the networking and the application protocol layer to provide connectivity, load-balancing, security, and observability. For all network processing, including protocols such as IP, TCP, and UDP, Cilium uses eBPF as a highly efficient in-kernel datapath. Protocols at the application layer such as HTTP, Kafka, gRPC, and DNS are parsed using a proxy such as Envoy. Lastly, for service mesh use cases that go beyond the capabilities of Cilium, Cilium offers an Istio integration. It brings all Istio features to Cilium while allowing Cilium to enforce L7 policies via the Istio-managed sidecar. In parallel, Cilium automatically optimizes some aspects of Istio, such as shortening the network path for sidecar injection and avoiding unencrypted data exposure between the application and the sidecar.

Eliminating the need for pods or proxies per node by moving Service Mesh layer into the kernel using eBPF and Cilium Service Mesh.

When the Cilium community started to discuss the topic of providing a Cilium-native service mesh, we conducted various end-user surveys and listened to our customers. The feedback we received was consistent and clear:

  • Kubernetes-native: Our teams already know how to use Kubernetes. We want to consume service mesh features without learning many new concepts and to provide a Kubernetes-native user experience similar to how Cilium Cluster Mesh uses Kubernetes Services and NetworkPolicy to perform multi-cluster connectivity and security.
  • Reduce Complexity & Overhead: The complexity and overhead impact of sidecars can be severe. Provide us with a simple datapath model that avoids overhead while supporting arbitrary networking protocols. Kelsey Hightower referred to this as “service mess”.

Screenshot of a tweet by Kelsey Hightower, defining Service Mess (noun) as the result of spending more compute resources than your actual business logic dynamically generating and distributing Envoy proxy configs and TLS certificates.

Choice of Sidecar or Sidecar-free

With this first release of Cilium Service Mesh, users now have the choice to run a service mesh with sidecars or without them. Which model is best depends on various factors, including overhead, resource management, failure domain, and security considerations. In fact, the trade-offs are quite similar to those between virtual machines and containers. VMs provide stricter isolation. Containers are lighter, able to share resources, and offer fair distribution of the available resources. Because of this, containers typically increase deployment density, with the trade-off of additional security and resource management challenges. With Cilium Service Mesh, you have both options available in your platform and can even run a mix of the two.

Illustration of Cilium Service Mesh offering of flexibility to run a service mesh layer with or without sidecars.

Performance Impact of a Sidecar

Besides avoiding the sheer number of proxies that must run in a sidecar model, a significant advantage of the sidecar-free model is that it avoids the requirement to run two proxies in between any connection. This is made possible by using Cilium at the network/node level to encrypt and authenticate traffic, or by using the upcoming mTLS model, which separates authentication from transport (for more details, see the blog post Next-Generation Mutual Authentication with Cilium Service Mesh).

Chart illustrating the performance impact (request to response latency) for three use cases - Cilium with no HTTP filter, Cilium + Envoy with HTTP filter, and Cilium + Istio with HTTP filter (in that order of increased latency)

Reducing the number of proxies in the network path and choosing the type of Envoy filter both have a significant impact on performance. The benchmark above illustrates the latency cost of HTTP processing with a single Envoy proxy running the Cilium Envoy filter (brown) compared to a two-sidecar Envoy model running the Istio Envoy filter (blue). Yellow is the baseline latency with no proxy and no HTTP processing.
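To make the proxy arithmetic concrete, here is a minimal Python sketch. It assumes one proxy per pod in the sidecar model and one shared proxy per node in the sidecar-free model. The function names and cluster sizes are hypothetical and are not taken from the benchmark above.

```python
def proxies_deployed(pods: int, nodes: int, sidecar: bool) -> int:
    """Number of proxy instances running in the cluster.

    Assumption: one proxy per pod with sidecars, one shared proxy per node without.
    """
    return pods if sidecar else nodes


def proxies_in_path(sidecar: bool, needs_l7: bool = True) -> int:
    """Number of proxies a single service-to-service connection traverses."""
    if sidecar:
        # Sender-side sidecar plus receiver-side sidecar.
        return 2
    # Sidecar-free: a single proxy when L7 processing is needed, none otherwise.
    return 1 if needs_l7 else 0


if __name__ == "__main__":
    # Hypothetical cluster: 1000 pods spread over 20 nodes.
    print(proxies_deployed(pods=1000, nodes=20, sidecar=True))
    print(proxies_deployed(pods=1000, nodes=20, sidecar=False))
```

The sketch only counts proxies. It says nothing about latency, which also depends on the Envoy filter in use, as the benchmark shows.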

eBPF-Native When Possible

Besides the option to remove sidecars, Cilium Service Mesh can perform a variety of service mesh features directly in eBPF to reduce the overhead even further. When possible, the processing is performed in eBPF at a fraction of the cost. If eBPF is not capable of processing the request, for example when connections need to be spliced, requests need to be rate-limited, or TLS termination is required, the handling falls back to Envoy running in either a sidecar or sidecar-free model. This gives the best of both worlds: eBPF processing when possible for increased performance and reduced latency, with the ability to always fall back to Envoy as needed.
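The fallback rule described above can be sketched as a small decision function. This is an illustrative model only, not Cilium's actual implementation: the feature identifiers are hypothetical, and the sketch assumes that any feature not named in the text as requiring a proxy can be handled in eBPF.

```python
from enum import Enum
from typing import Iterable


class Datapath(Enum):
    EBPF = "ebpf"
    ENVOY = "envoy"


# Features the text names as requiring the proxy. The identifiers are hypothetical.
PROXY_ONLY_FEATURES = frozenset(
    {"connection_splicing", "rate_limiting", "tls_termination"}
)


def choose_datapath(required_features: Iterable[str]) -> Datapath:
    """Prefer eBPF; fall back to Envoy if any required feature needs the proxy."""
    if PROXY_ONLY_FEATURES.intersection(required_features):
        return Datapath.ENVOY
    return Datapath.EBPF


if __name__ == "__main__":
    print(choose_datapath(["l4_load_balancing"]))
    print(choose_datapath(["l4_load_balancing", "tls_termination"]))
```

A single proxy-only feature is enough to route the traffic through Envoy, which matches the "eBPF when possible, Envoy as needed" rule.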

Illustration of Cilium Service Mesh offering a sidecar-free implementation with a "fallback" to Envoy for those use cases that need a sidecar model.

One particularly powerful use case is HTTP/2 visibility, which powers tracing and metrics use cases such as building golden signal dashboards with Prometheus and Grafana. The cost reduction in terms of latency, and thus compute power, is very significant:

Chart illustrating an HTTP benchmark measuring P95 latency. The baseline (no visibility) and eBPF-based HTTP visibility are close, but proxy-based HTTP visibility adds significant latency.

Above is an HTTP request/response benchmark measuring the P95 latency. It compares the latency impact of an eBPF-based HTTP/2 parser (brown) and a sidecar approach (blue) against the baseline (yellow), which has no visibility enabled. The eBPF-based HTTP/2 parser is available in Isovalent Cilium Enterprise. The choice of sidecar proxy does not matter much: Envoy was used in this example, and the results were almost identical for the other proxies we tested, because the main cost stems from the injection of the proxy and the requirement to terminate connections and pass the data between upstream and downstream.
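For readers less familiar with the metric, P95 latency is the value that 95% of requests stay at or below. The sketch below computes it with the nearest-rank method. The benchmark's exact percentile method is not stated, so nearest-rank is an assumption here, and the function name is hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With 100 latency samples, the P95 is the 95th smallest value, so a handful of slow outliers above that rank does not affect it.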

What can be done in eBPF? When is a Proxy needed?

The table below lists the most common service mesh features and whether they need to be routed through a proxy running in either sidecar or sidecar-free mode:

Table listing out most common Service Mesh features and clearly separating when eBPF-native (sidecar-free) approach works and the use cases where a proxy-based (with sidecars) approach is relevant.

Bring your own Control Plane

To address users' second big requirement, reducing the complexity and learning curve of adopting a service mesh, Cilium Service Mesh follows the example of Kubernetes, which has been exceptionally good at providing different abstractions at different levels of complexity. In addition to the existing Istio integration, we are extending the number of supported service mesh control planes to bring the new sidecar-free datapath option to existing service mesh standards.

Illustration of the new and existing service mesh control planes supported by Cilium Service Mesh.

Supported Integrations

Istio (Existing)

Istio is the existing supported service mesh control plane. It currently must be run with the sidecar-based datapath. We are considering bringing the sidecar-free datapath to Istio if there is interest. Talk to us if you think this is interesting.

Prometheus / OpenTelemetry (Existing)

L7 observability has always been a feature of Cilium. The visibility is exported in the form of metrics and tracing data using standard Prometheus and OpenTelemetry.

Kubernetes Ingress (New)

This first release of Cilium Service Mesh includes a fully compliant Kubernetes Ingress Controller, enabling application teams to use L7 load-balancing and traffic management capabilities via the standardized Kubernetes Ingress resource. Kubernetes Ingress load-balancing can be applied to traffic entering the cluster, within the cluster, and across clusters. (See Getting Started with Kubernetes Ingress.)
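As an illustration of the standardized resource, the following Python sketch builds a minimal `networking.k8s.io/v1` Ingress manifest and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The resource name, backend service, and port are hypothetical, and selecting the controller through the ingress class name `cilium` is an assumption to verify against the Getting Started guide.

```python
import json


def build_ingress(name: str, service: str, port: int, path: str = "/") -> dict:
    """Build a networking.k8s.io/v1 Ingress that routes `path` to `service:port`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            # Assumption: the Cilium ingress controller is selected by this class name.
            "ingressClassName": "cilium",
            "rules": [
                {
                    "http": {
                        "paths": [
                            {
                                "path": path,
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service,
                                        "port": {"number": port},
                                    }
                                },
                            }
                        ]
                    }
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names; save the output to a file and apply it with kubectl.
    print(json.dumps(build_ingress("demo-ingress", "demo-service", 8080), indent=2))
```

Because the manifest uses only the standard Ingress API, the same resource works with any compliant ingress controller once the class name is changed.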

Envoy Configuration CRD (New)

An exciting new Envoy Configuration CRD is available, making the entire Envoy proxy feature set available anywhere in the network. This enables users to write Envoy configuration directly and apply it anywhere in the network, enabling advanced use cases that are not even covered by service meshes such as Istio.

Gateway API (Work In Progress)

We are hard at work to support the Kubernetes Gateway API standard as the next supported control plane. It brings additional capabilities on top of Kubernetes Ingress and is likely a feasible option for many application and platform teams as it strikes a good balance between capability and complexity.

SPIFFE (Coming Next)

Finally, the SPIFFE integration is already under way and will bring service-specific certificates and thus proxy-based mTLS support.

mTLS for Any Network Protocol

By splitting the authentication handshake from the payload transport, we can use TLS 1.3 as the handshake protocol while relying on IPsec or WireGuard as a better-performing, more transparent payload channel:

Illustration of Identity and Certification Management with Cilium and eBPF for mTLS termination.

We gain the benefits of both models and achieve many great properties:

  • Connections don’t need to be terminated anymore: Whereas a sidecar-based approach must split every TCP connection into three segments to inject TLS, the sidecar-free approach does not require terminating or manipulating connections.
  • No sidecars need to be injected: No additional proxies need to be run; the authentication on behalf of the services can be performed by a single node agent. In the case of Cilium, this agent already exists and is aware of all required context. This simplifies management, improves the resource footprint, and improves scalability.
  • Support Non-TCP & Multicast: The handshake benefits from the great properties of TLS 1.3, such as low latency, while TLS no longer limits the transport. UDP, ICMP, and any other protocol that can be carried by IP are supported.
  • Support existing Identity & Certificate Management: Any mTLS-based authentication control plane or identity management system can be plugged in and used to provide certificates for services. This includes SPIFFE, Vault, SMI, Istio, etc.
  • Handshake caching & Re-authentication: The handshake can be done once and cached, so communication between already authenticated service-to-service pairs happens without additional latency. Even better, the handshake can be repeated at regular intervals to re-authenticate services with each other.
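The handshake caching and re-authentication property in the list above can be sketched as a small cache keyed by service pair. This is a simplified model, not Cilium's implementation: the class name, the `handshake` callback (standing in for the TLS 1.3 mutual authentication), and the re-authentication interval are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class HandshakeCache:
    """Caches successful mutual-authentication handshakes per service pair."""

    def __init__(
        self,
        handshake: Callable[[str, str], bool],
        reauth_interval: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._handshake = handshake          # hypothetical stand-in for the real handshake
        self._reauth_interval = reauth_interval
        self._clock = clock
        self._authenticated_at: Dict[Tuple[str, str], float] = {}

    def is_authenticated(self, src: str, dst: str) -> bool:
        """Return True if the pair is authenticated, running a handshake if needed."""
        key = (src, dst)
        now = self._clock()
        last = self._authenticated_at.get(key)
        if last is not None and now - last < self._reauth_interval:
            # Cached result: no handshake latency on this path.
            return True
        # First contact or expired entry: run (or repeat) the handshake.
        if self._handshake(src, dst):
            self._authenticated_at[key] = now
            return True
        self._authenticated_at.pop(key, None)
        return False
```

Only the first packet between a pair, and the first one after each interval expires, pays for a handshake. The payload itself would travel over the IPsec or WireGuard channel and is outside the scope of this sketch.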

Conclusion

We are excited about this initial release of Cilium Service Mesh, which builds on the existing networking, security, and observability functionality of Cilium. It gives users choice:

  • Control Plane: Choice of control plane options for the ideal balance of complexity and richness. From simpler options such as Ingress and Gateway API, to richer options with Istio, to the full power of Envoy via the Envoy CRD.
  • Sidecar vs Sidecar-free: Choice of a datapath with or without sidecars. Sidecars offer VM-style resource isolation at increased overhead and cost; the sidecar-free model offers container-style shared resources at the cost of having to manage the shared resource usage.

We are grateful to the community and customers who have been guiding us and are looking forward to continuing the collaboration.

Author

Thomas Graf

CTO & Co-Founder Isovalent, Co-Creator Cilium, Chair eBPF Governing Board

Thomas Graf is a Co-Founder of Cilium and the CTO & Co-Founder of Isovalent, the company behind Cilium. Before that, Thomas spent 15 years as a kernel developer working on the Linux kernel in networking, security and eventually eBPF.