
Networks Are Under AI Pressure: Can Cilium Provide Relief?

Nico Vibert

It’s indisputable: the AI era has well and truly started. The public launch of ChatGPT in late 2022 was the landmark that brought Generative AI and large language models to our consciousness. We are witnessing one of the most transformative technological shifts, one that will drastically alter most industries and many aspects of society.

Any AI model requires a massive volume of data, and it also expects the underlying network infrastructure to support I/O-heavy traffic and protect the highly confidential trained model.

In this blog post, we will highlight why many of the leading artificial intelligence platforms and services rely on Cilium to provide high-scale network performance and security for their cloud native applications.

The Networking Demands of AI

According to the IBM Global AI Adoption Index 2023, 42% of IT professionals at large organizations report that they have actively deployed AI while an additional 40% are actively exploring using the technology.

Most companies running GPU workloads in high-performance computing environments and building advanced training models are using Kubernetes: it’s now the de facto standard for managing GPU-based accelerated computing infrastructure.

But given the cost of GPUs and building training models, the vast majority of AI adopters will consume AI-as-a-service by leveraging the APIs of solutions such as OpenAI ChatGPT, Microsoft Copilot, or Google Bard/Gemini.

Cilium is popular with both the artificial intelligence research institutions that build sophisticated training models and their consumers—let’s call them the “AI publishers” and the “AI consumers.”

Let’s review some of the networking and security challenges for both:

“AI publishers”…

  • …need networking platforms, whether virtual or physical, built to withstand high I/O networking requirements and provide the low latency and high bandwidth that Large Language Model (LLM) training requires.
  • …need massively scalable networking platforms, as most of the infrastructure underpinning AI models is enormous (three years ago, before ChatGPT was even released, OpenAI’s Kubernetes infrastructure already exceeded 7,500 nodes).
  • …need highly reliable networks – network outages might force a complete restart of model training (unless checkpointing has been set up to resume the training from a snapshot).
  • …must protect their intellectual property. AI models result from significant research and CAPEX investments (GPUs are not cheap), and AI model theft could effectively destroy the publisher’s business model and cause irreparable reputational damage.

“AI consumers”…

  • … need to securely access the APIs of AI publishers.
  • … need to manage and observe API traffic to monitor network performance and security.

How do Cilium and Isovalent Enterprise for Cilium address these requirements? Let’s dive in.

High-Performance Cloud Native Networking

High-Performance Computing (HPC) predates AI, but the requirements and expectations of the underlying networks are similar. HPC has traditionally been powered by technologies such as InfiniBand: a standard that addresses the high networking throughput and low latency requirements to solve advanced computation problems.

Large Language Models running on Kubernetes clusters will also require high-performance networking from the container networking platform, which is where Cilium excels.

Cilium was built on a revolutionary Linux technology called eBPF and provides multiple performance benefits:

  • Through eBPF, Cilium drastically reduces the overhead of standard Linux networking. It is the preferred option for large Kubernetes clusters because it removes some of the intrinsic limitations of Kubernetes networking, such as those of kube-proxy.
  • Cilium natively supports eXpress Data Path (XDP), providing DPDK-like performance while using CPU resources efficiently by running eBPF programs directly inside the driver layer. The graph below, from a Cilium case study with Seznam, illustrates a 70x CPU gain when using Cilium for XDP and load balancing. Seznam is the Czech Republic’s biggest search engine and is currently building its own language model.
  • Cilium is the first CNI (Container Network Interface) to support BIG TCP, a technology that enables the transfer of extremely large packets through the Linux networking stack. As you can see from the graph below, we observe a 42% increase in transactions per second when enabling BIG TCP (read more about BIG TCP in this LWN article).

Let’s delve into an example: Meltwater, a global leader in media, social and consumer intelligence. They have been building machine learning models for nearly 20 years and use AI at the heart of their operations for use cases such as natural language processing, speech processing, clustering, and summarization.

The platform supporting their hundreds of Kubernetes nodes, 10k+ pods and 200M+ weekly searches? Cilium (you can read more about it in the Meltwater case study).

Security At Scale

Language models are only as good as the data they’re based on—and GenAI companies are accumulating a lot of it. Artificial intelligence and analytics companies are running sensitive information at scale: adopting a least-privilege approach and zero-trust principles isn’t optional—it’s vital.

For many organizations we work with, the AI models they build are fundamental to their business model. If their model were to be stolen and acquired by competing organizations, their competitive advantage would be lost. Model theft isn’t the only threat: data poisoning can intentionally inject wrong information into the training dataset, and prompt injection attacks can manipulate a model’s behavior. Both can cause severe vulnerabilities, potentially leading to unauthorized access, data breaches, and compromised systems.

Cloud-native networking security at scale is one of Cilium’s strengths and why some of the leading AI companies adopted Cilium for use cases such as:

  • Identity-aware service to service security and observability with mutual authentication
  • Advanced network policies with native HTTP and DNS protocol support (read more below)
  • Efficient datapath encryption using in-kernel IPsec or WireGuard
  • TLS Enforcement via Network Policy, enabling operators to restrict the allowed TLS SNIs in their network and provide a more secure environment.
  • Low-overhead container runtime security and observability with Tetragon.

Encrypting data in transit within a cluster is effortless with Cilium (which is why we call it Transparent Encryption): create a Kubernetes secret with your pre-shared key and the encryption algorithm of your choice, install Cilium with the encryption option enabled, and that’s it. The traffic between nodes will be encrypted:

$ PSK=($(dd if=/dev/urandom count=20 bs=1 2> /dev/null | xxd -p -c 64)) 
$ echo $PSK
$ kubectl create -n kube-system secret generic cilium-ipsec-keys \
    --from-literal=keys="3 rfc4106(gcm(aes)) $PSK 128"
secret/cilium-ipsec-keys created
$ cilium install \
  --set encryption.enabled=true \
  --set encryption.type=ipsec
🔮 Auto-detected Kubernetes kind: kind
✨ Running "kind" validation checks
✅ Detected kind version "0.20.0"
ℹ️  Using Cilium version 1.14.5
🔮 Auto-detected cluster name: kind-kind
🔮 Auto-detected kube-proxy has been installed
$ cilium status
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    disabled (using embedded mode)
 \__/¯¯\__/    Hubble Relay:       disabled
    \__/       ClusterMesh:        disabled

Deployment             cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet              cilium             Desired: 4, Ready: 4/4, Available: 4/4
Containers:            cilium             Running: 4
                       cilium-operator    Running: 1
Cluster Pods:          3/3 managed by Cilium
Helm chart version:    1.14.5
Image versions         cilium    4
                       cilium-operator 1
$ cilium config view | grep enable-ipsec
enable-ipsec                                      true
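
The key generation step above reads 20 random bytes. For rfc4106(gcm(aes)), RFC 4106 defines the keying material as the AES key followed by a 4-byte salt, so 20 bytes corresponds to a 16-byte AES-128 key plus the salt, or 40 hex characters. As a minimal sketch, the check below generates a key and verifies that shape before it goes into the secret. It assumes `od` and `tr` are available as a stand-in for `xxd`, and the helper name `is_valid_psk` is hypothetical, not part of Cilium:

```shell
#!/bin/sh
# Generate 20 random bytes and hex-encode them.
# Assumption: od/tr are used here instead of xxd for portability.
PSK=$(dd if=/dev/urandom bs=1 count=20 2>/dev/null | od -An -v -tx1 | tr -d ' \n')

# Hypothetical helper: succeeds only for exactly 40 lowercase hex characters
# (20 bytes = 16-byte AES-128 key + 4-byte salt).
is_valid_psk() {
  [ "${#1}" -eq 40 ] || return 1
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

if is_valid_psk "$PSK"; then
  # Same "keys" format as in the kubectl command above.
  KEYS="3 rfc4106(gcm(aes)) $PSK 128"
  # Print only the length, never the key itself.
  echo "key looks well-formed: ${#PSK} hex characters"
else
  echo "unexpected key format" >&2
fi
```

The resulting `$KEYS` value could then be passed to `--from-literal=keys=...` in place of the inline string.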

In this recent CNCF case study, Ascend explained how they used Cilium Transparent Encryption to support their use of Apache Spark, the analytics engine for large-scale data processing.

Spark was the other reason that Cilium became a killer feature that we needed to roll out across every cloud. Spark is a great tool, but sometimes their built in encryption will fail at random. Statistically, at some point it will crash so if you’re dealing with a 12 hour job, it’s gonna fail on hour 11 and that is a terrible thing to try to explain to the customer. Cilium with IPsec doesn’t have that problem. Why have Spark be doing encryption when what we really want Spark to be doing is processing data. We chose to have a reasonable isolation of priorities and responsibilities and have Spark be focused on data processing and have the network layer that is responsible for encryption.

Joe Stevens, Member of the Technical Staff, Ascend.

Protecting and Observing APIs

Let’s face it—not everyone can be an OpenAI, a Microsoft, or a Google. Operating enormous clusters of GPU-enabled nodes and training an LLM from scratch is an extremely expensive and complex undertaking. Even fine-tuning an LLM or supplementing one through the unfortunately named RAG (Retrieval-Augmented Generation) is still costly in infrastructure resources and data science skills.

Instead, most organizations will simply leverage the APIs of the likes of Microsoft Azure Copilot, Google Bard/Gemini, and OpenAI. Given that API-to-API connectivity is a common pattern for microservices running in Kubernetes, Cilium has been designed to secure API connectivity and offers features such as:

| Feature | Use Case | AI API Example |
| --- | --- | --- |
| Network Policies based on Fully Qualified Domain Name (FQDN) | Enable users to only allow connections to specific DNS names. | Only allow access to if using the Azure OpenAI APIs |
| Network Policies based on HTTP Path and Method | Enable users to specify the API call used by services. | Only allow HTTP traffic to /openai/deployments/YOUR_DEPLOYMENT_NAME/chat/completions if using Azure OpenAI for chat completion |
| Inspecting TLS Encrypted Connections with Cilium | Enable Cilium API-aware visibility and policy to function even for TLS connections, such as when a client accesses the API service via HTTPS. | Intercept and inspect traffic to if using the Azure OpenAI APIs |
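
To make the FQDN row more concrete, here is a minimal sketch of a CiliumNetworkPolicy that restricts egress to a single DNS name. Everything specific in it is an assumption for illustration: the `app: ai-client` label, the policy name, the file location, and the placeholder domain `api.example.com`, which stands in for whatever API endpoint you actually use. The first egress rule allows DNS lookups through kube-dns so that Cilium can learn which IPs the name resolves to:

```shell
#!/bin/sh
# Hypothetical values; override through the environment.
POLICY_FILE="${POLICY_FILE:-${TMPDIR:-/tmp}/allow-ai-api.yaml}"
API_FQDN="${API_FQDN:-api.example.com}"   # placeholder, not a real AI endpoint

# Write the policy manifest to a file.
cat > "$POLICY_FILE" <<EOF
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-ai-api
spec:
  endpointSelector:
    matchLabels:
      app: ai-client
  egress:
    # Allow DNS queries to kube-dns so FQDN rules can be resolved.
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Allow HTTPS only to the named API endpoint.
    - toFQDNs:
        - matchName: "$API_FQDN"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
EOF

echo "wrote $POLICY_FILE (apply with: kubectl apply -f $POLICY_FILE)"
```

Because an egress policy now selects these pods, any other outbound traffic from them is denied by default. Restricting HTTP paths and methods on an HTTPS connection additionally requires the TLS inspection feature from the last row of the table, which this sketch does not cover.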

Observing and monitoring API communications is as essential as securing them. Cilium’s network observability platform Hubble can, for example, tell us when Jupyter Notebook requests are being dropped. It can also help us drill down into:

  • DNS performance and issues, such as the volume of DNS queries, missing DNS responses and errors, and top DNS queries.
  • HTTP performance, including OpenTelemetry traces, HTTP latency, HTTP request rates, and error rates.
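
As a rough sketch of what this looks like from the command line, the Hubble CLI can filter flows by verdict and by L7 protocol. The namespace `notebooks` and the `run_hubble` wrapper are hypothetical, and the script only prints the commands by default so it can be tried without a cluster. Set `DRY_RUN=0` to execute them against a cluster with Hubble enabled. DNS and HTTP flows typically appear only where L7 visibility is in place, for example through an L7 network policy:

```shell
#!/bin/sh
# Hypothetical namespace where the notebook pods run; override as needed.
NAMESPACE="${NAMESPACE:-notebooks}"
# Dry-run by default: print the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print or execute a hubble command depending on DRY_RUN.
run_hubble() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "hubble $*"
  else
    hubble "$@"
  fi
}

# Recently dropped flows in the namespace.
run_hubble observe --namespace "$NAMESPACE" --verdict DROPPED --last 20
# Recent DNS queries and responses.
run_hubble observe --namespace "$NAMESPACE" --protocol dns --last 20
# Recent HTTP requests and responses.
run_hubble observe --namespace "$NAMESPACE" --protocol http --last 20
```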

Isovalent Enterprise for Cilium provides High Availability capabilities for DNS-aware policies, minimizing downtime in the event of a failure.

Final Thoughts

With the recently approved EU AI Act, providers of high-risk AI systems (used for critical infrastructure, education and vocational training, healthcare, banking, etc.) are required to meet higher standards of training data quality and resilience to errors, interruptions, and cyberattacks. Companies that fail to meet them could face GDPR-like fines.

With most organizations leveraging Kubernetes to build AI-based applications, they will require an underlying networking platform that provides high-performance networking at scale, built-in encryption and API security.

If you’d like to know more about how Cilium and our Enterprise Edition of Cilium can help, feel free to get in touch with our architects for a demo.

Author: Nico Vibert, Senior Staff Technical Marketing Engineer
