It’s indisputable: the AI era has well and truly started. The public launch of ChatGPT in late 2022 was the landmark moment that brought Generative AI and large language models into the public consciousness. We are witnessing one of the most transformative technological shifts of our time, one that will drastically alter most industries and many aspects of society.
Any AI model not only requires a massive volume of data; it also requires an underlying network infrastructure that can support I/O-heavy traffic and protect the highly confidential trained model.
In this blog post, we will highlight why many of the leading artificial intelligence platforms and services rely on Cilium to provide high-scale network performance and security for their cloud native applications.
The Networking Demands of AI
According to the IBM Global AI Adoption Index 2023, 42% of IT professionals at large organizations report that they have actively deployed AI, while an additional 40% are actively exploring the technology.
Most companies running GPU workloads in high-performance computing environments and building advanced training models are using Kubernetes: it’s now the de facto standard for managing GPU-based accelerated computing infrastructure.
But given the cost of GPUs and of building training models, the vast majority of AI adopters will consume AI-as-a-service by leveraging the APIs of solutions such as OpenAI’s ChatGPT, Microsoft Copilot, or Google Bard/Gemini.
Cilium is popular with both the artificial intelligence research institutions that build sophisticated training models and their consumers—let’s call them the “AI publishers” and the “AI consumers.”
Let’s review some of the networking and security challenges for both:
“AI publishers”…
- …need networking platforms, whether virtual or physical, built to withstand high I/O networking requirements and provide the low latency and high bandwidth that training Large Language Models (LLMs) requires.
- …need massively scalable networking platforms, as the infrastructure underpinning AI models is enormous (three years ago, even before ChatGPT was released, OpenAI’s Kubernetes infrastructure already exceeded 7,500 nodes).
- …need highly reliable networks – a network outage might force a complete restart of model training (unless checkpointing has been set up to resume training from a snapshot).
- …must protect their intellectual property. AI models are the result of significant research and CAPEX investments (GPUs are not cheap), and the theft of an AI model could effectively destroy the publisher’s business model and cause irreparable reputational damage.
“AI consumers”…
- … need to securely access the APIs of AI publishers.
- … need to manage and observe API traffic to monitor network performance and security.
How do Cilium and Isovalent Enterprise for Cilium address these requirements? Let’s dive in.
High-Performance Cloud Native Networking
High-Performance Computing (HPC) predates AI, but the requirements and expectations of the underlying networks are similar. HPC has traditionally been powered by technologies such as InfiniBand: a standard that addresses the high networking throughput and low latency requirements to solve advanced computation problems.
Large Language Models running on Kubernetes clusters will also require high-performance networking from the container networking platform, which is where Cilium excels.
Cilium was built on a revolutionary Linux technology called eBPF and provides multiple performance benefits:
- Through eBPF, Cilium removes much of the overhead of the standard Linux networking stack. It is the preferred option for large Kubernetes clusters because it can replace components with intrinsic scaling limitations, such as kube-proxy.
- Cilium natively supports eXpress Data Path (XDP), providing DPDK-like performance while using CPU resources efficiently by running eBPF programs directly at the driver layer. The graph below, from a Cilium case study with Seznam, illustrates a 70x gain in CPU efficiency when using Cilium’s XDP-based load balancing. Seznam is the Czech Republic’s biggest search engine and is currently building its own language model.
- Cilium is the first CNI (Container Network Interface) to support BIG TCP, a technology that lets extremely large packets travel through the Linux networking stack. As you can see from the graph below, we observe a 42% increase in transactions per second when enabling BIG TCP (read more about BIG TCP in this LWN article). See the configuration sketch after this list.
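For orientation, the snippet below is a minimal sketch of how these datapath features are typically switched on through Cilium’s Helm values. It assumes a recent Cilium release and an XDP-capable NIC driver; the exact option names and accepted values vary between versions, so treat it as illustrative rather than definitive.

```yaml
# values.yaml: illustrative sketch only; check the Cilium docs for your release.
kubeProxyReplacement: true   # use Cilium's eBPF datapath instead of kube-proxy
                             # (older releases expect the string "strict")

loadBalancer:
  acceleration: native       # run load balancing in XDP, directly at the driver layer

enableIPv6BIGTCP: true       # allow larger-than-64K GSO/GRO batches (BIG TCP, IPv6)
```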
Let’s delve into an example: Meltwater, a global leader in media, social, and consumer intelligence. They have been building machine learning models for nearly 20 years and use AI at the heart of their operations for use cases such as natural language processing, speech processing, clustering, and summarization.
The platform supporting their hundreds of Kubernetes nodes, 10k+ pods, and 200M+ weekly searches? Cilium (you can read more about it in the Meltwater case study).
Security At Scale
Language models are only as good as the data they’re based on—and GenAI companies are accumulating a lot of it. Artificial intelligence and analytics companies are running sensitive information at scale: adopting a least-privilege approach and zero-trust principles isn’t optional—it’s vital.
For many organizations we work with, the AI models they build are fundamental to their business model. If their model were stolen and acquired by competing organizations, their competitive advantage would be lost. Model theft isn’t the only threat: data poisoning attacks could intentionally inject false information into the training dataset, and prompt injection attacks could manipulate a model’s behavior at inference time, potentially leading to unauthorized access, data breaches, and compromised systems.
Cloud native networking security at scale is one of Cilium’s strengths, and one of the reasons some of the leading AI companies have adopted Cilium for use cases such as:
- Identity-aware service-to-service security and observability with mutual authentication (see the policy sketch after this list)
- Advanced network policies with native HTTP and DNS protocol support (read more below)
- Efficient datapath encryption using in-kernel IPsec or WireGuard
- TLS Enforcement via Network Policy, enabling operators to restrict the allowed TLS SNIs in their network and provide a more secure environment.
- Low-overhead container runtime security and observability with Tetragon.
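To give a flavour of the first item, here is a minimal sketch of a mutual-authentication policy, assuming Cilium 1.14 or later with mutual authentication enabled. The `model-api` and `inference-client` labels are hypothetical placeholders for a model-serving workload and its permitted client.

```yaml
# Hypothetical sketch: only accept traffic to the model API from mutually
# authenticated "inference-client" endpoints.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: model-api-mutual-auth
spec:
  endpointSelector:
    matchLabels:
      app: model-api               # hypothetical label of the serving workload
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: inference-client  # hypothetical label of the allowed client
      authentication:
        mode: "required"           # enforce mutual authentication for this rule
```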
Encrypting data in transit within a cluster is effortless with Cilium (which is why we call it Transparent Encryption): create a Kubernetes secret with your pre-shared key and the encryption algorithm of your choice, install Cilium with the encryption option enabled, and that’s it – the traffic between nodes will be encrypted.
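As a minimal sketch, assuming IPsec with AES-GCM: the pre-shared key lives in a secret named `cilium-ipsec-keys` in the `kube-system` namespace, and encryption itself is switched on at install time through the `encryption.enabled=true` and `encryption.type=ipsec` Helm values. The key below is a placeholder; generate your own from a proper source of randomness.

```yaml
# Sketch of the pre-shared key secret used by Cilium's IPsec-based
# transparent encryption. The hex key is a placeholder, not a real key.
apiVersion: v1
kind: Secret
metadata:
  name: cilium-ipsec-keys
  namespace: kube-system
stringData:
  # Format: "<SPI>+ <algorithm> <key> <ICV length>", per the Cilium IPsec docs.
  keys: "3+ rfc4106(gcm(aes)) 0123456789abcdef0123456789abcdef01234567 128"
```

WireGuard is even simpler to enable (no key material to manage, via `encryption.type=wireguard`), at the cost of less control over the cipher.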
In this recent CNCF case study, Ascend.io explained how they used Cilium Transparent Encryption to support their use of Apache Spark, the analytics engine for large-scale data processing.
Spark was the other reason that Cilium became a killer feature that we needed to roll out across every cloud. Spark is a great tool, but sometimes their built-in encryption will fail at random. Statistically, at some point it will crash, so if you’re dealing with a 12-hour job, it’s gonna fail on hour 11, and that is a terrible thing to try to explain to the customer. Cilium with IPsec doesn’t have that problem. Why have Spark be doing encryption when what we really want Spark to be doing is processing data? We chose to have a reasonable isolation of priorities and responsibilities and have Spark be focused on data processing and have the network layer be responsible for encryption.
Joe Stevens, Member of the Technical Staff, Ascend.io
Protecting and Observing APIs
Let’s face it: not everyone can be an OpenAI, a Microsoft, or a Google. Operating enormous clusters of GPU-enabled nodes and training an LLM from scratch is an extremely expensive and complex undertaking. Even fine-tuning an LLM, or supplementing one through the unfortunately named RAG (Retrieval-Augmented Generation), still requires significant infrastructure resources and data science skills.
Instead, most organizations will simply leverage the APIs of the likes of Microsoft Azure Copilot, Google Bard/Gemini, and OpenAI. Given that API-to-API connectivity is a common pattern for microservices running in Kubernetes, Cilium has been designed to secure API connectivity and offers features such as:
| Feature | Use Case | AI API Example |
| --- | --- | --- |
| Network Policies based on Fully Qualified Domain Name (FQDN) | Enable users to allow connections only to specific DNS names. | Only allow access to openai.azure.com when using the Azure OpenAI APIs |
| Network Policies based on HTTP Path and Method | Enable users to specify the exact API calls a service may make. | Only allow HTTP traffic to /openai/deployments/YOUR_DEPLOYMENT_NAME/chat/completions when using Azure OpenAI for chat completions |
| Inspecting TLS-Encrypted Connections with Cilium | Extend Cilium’s API-aware visibility and policy to TLS connections, such as when a client accesses the API service over HTTPS. | Intercept and inspect traffic to openai.azure.com when using the Azure OpenAI APIs |
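To make the first row concrete, here is a minimal sketch of an egress policy that only allows a workload to reach the Azure OpenAI endpoint. The `app: ai-consumer` label is a hypothetical placeholder; the DNS rule is what lets Cilium observe lookups and map the FQDN to the IP addresses it resolves to.

```yaml
# Hypothetical sketch: restrict the "ai-consumer" workload's egress to Azure OpenAI.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-azure-openai-egress
spec:
  endpointSelector:
    matchLabels:
      app: ai-consumer                 # hypothetical workload label
  egress:
    # Allow DNS lookups via kube-dns so Cilium can learn the resolved IPs.
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*.openai.azure.com"
    # Allow HTTPS only towards the Azure OpenAI FQDNs.
    - toFQDNs:
        - matchPattern: "*.openai.azure.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```

HTTP path and method rules (the second row) are expressed the same way, as `rules.http` entries under `toPorts`; to apply them to HTTPS traffic, they need to be combined with the TLS inspection capability from the third row.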
Observing and monitoring API communications is as essential as securing them. Cilium’s network observability platform, Hubble, can, for example, tell us when Jupyter Notebook requests are being dropped. It can also help us drill down into:
- DNS performance and issues, such as the volume of DNS queries, missing DNS responses and errors, and the top DNS queries.
- HTTP performance, including OpenTelemetry traces, HTTP latency, request rates, and error rates (see the configuration sketch after this list).
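How these signals are surfaced depends on the deployment, but as a minimal sketch, the DNS and HTTP metrics mentioned above can be exposed by enabling the corresponding Hubble metrics in Cilium’s Helm values (metric names and options vary slightly across releases):

```yaml
# Sketch of Helm values enabling Hubble with DNS, drop, and HTTP visibility.
hubble:
  enabled: true
  relay:
    enabled: true        # cluster-wide Hubble API used by the CLI and UI
  metrics:
    enabled:
      - dns              # query volume, missing responses, top queries
      - drop             # dropped flows, e.g. denied Jupyter Notebook requests
      - http             # request rates, latency, response codes
```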
Isovalent Enterprise for Cilium provides High Availability capabilities for DNS-aware policies, minimizing downtime in the event of a failure.
Final Thoughts
With the recently approved EU AI Act, providers of high-risk AI systems (those used for critical infrastructure, education and vocational training, healthcare, banking, etc.) are required to meet higher standards of training data quality and of resilience to errors, interruptions, and cyberattacks. Failing to meet these requirements could see companies fined with GDPR-like penalties.
As most organizations leverage Kubernetes to build AI-based applications, they will require an underlying networking platform that provides high-performance networking at scale, built-in encryption, and API security.
If you’d like to know more about how Cilium and our Enterprise Edition of Cilium can help, feel free to get in touch with our architects for a demo.
Prior to joining Isovalent, Nico worked in many different roles—operations and support, design and architecture, and technical pre-sales—at companies such as HashiCorp, VMware, and Cisco.
In his current role, Nico focuses primarily on creating content to make networking a more approachable field and regularly speaks at events like KubeCon, VMworld, and Cisco Live.