Tetragon 1.0: Kubernetes Security Observability & Runtime Enforcement with eBPF
Cilium Tetragon 1.0 has arrived! Tetragon has taken the security world by storm since it was first announced 18 months ago at KubeCon EU 2022. Since then, Tetragon’s path to this noteworthy 1.0 milestone has been marked by an exciting level of adoption and maturity.
We have seen more than 2K commits from over 150 contributors, 2.7K stars on GitHub, and a growing Slack community sharing tracing policies and explorations with Tetragon. Thank you to the community for all the support with code contributions, valuable feedback, conference talks, and sharing your excitement about the project.
One of our major focuses for 1.0 was to continue improving performance and overhead. We wanted Tetragon to be invisible not only from a transparency perspective but also by imposing a barely noticeable performance overhead. We will dive into the specific overhead numbers for particular use cases as we go through the blog, but it is exciting to understand that Tetragon can provide very meaningful security observability intelligence, as seen in the dashboard below, while imposing minimal performance overhead on the system.
Along the way, technology leaders have become a community of early adopters, with Bell, GitHub, GResearch, Nationwide, Palantir, Ripple, and others implementing Tetragon for a variety of runtime security and observability use cases. A defining milestone, the 1.0 release marks Tetragon as production ready for platform and security teams to gather rich Kubernetes-aware security and observability data efficiently at very low overhead.
What is Tetragon?
Tetragon is a flexible and simple-to-use Kubernetes-native security observability and runtime enforcement tool that applies policy and filtering directly with eBPF. eBPF enables deep observability with minimal performance overhead while tracking process execution, privilege escalations, and file and network activity, among many other things.
Tetragon is simple to use by providing a policy library covering a wide range of easy to use out of the box security observability policies to extract security relevant insights into data such as privileged process and container execution, kubectl exec audit trails, namespace and privilege escalations, system call activity, network observability with DNS/TLS/HTTP protocol support, file integrity monitoring, and much more.
Runtime policies are enforced in-kernel using eBPF rather than out-of-band to implement least privilege runtime security postures to prevent unauthorized actions in the system without being vulnerable to TOCTOU (time-of-check to time-of-use) attacks.
Tetragon operates transparently, without requiring any changes to existing code and without applications knowing they are being monitored. Exporting is easy too: Tetragon’s JSON logs integrate seamlessly with alerting tools and visualization systems such as Prometheus, Grafana, Splunk, Fluentd, Elasticsearch, and OpenTelemetry for real-time insights, comprehensive monitoring, and proactive security measures.
Smart in-kernel collection logic allows filtering and aggregating data as soon as it is generated in the kernel with the eBPF-based collector. This approach minimizes overhead by avoiding unnecessary data transfer between kernel and user space and by not relying on static kernel probes.
Notably, the Cilium Tetragon open source project has been third-party audited by Cure53, who performed a combination of a penetration test and a code audit where testers examined the code line-by-line. The Cure53 team found no Critical, High, or Medium severity issues (CVEs) during their test, instead reporting only five Low severity findings, noting in their report, “the fact that the components in scope evaded any Critical or even High rated limitations was noted with distinction.” The full security report is planned to be published once the Low severity issues have been resolved.
Tetragon’s Core Principles
The journey to 1.0 has brought many amazing conversations and code commits. Along the way, guiding principles have emerged of what Tetragon has come to represent and how it will continue to evolve.
- Kubernetes Native: Built for Kubernetes and Linux workloads, Tetragon is aware of Kubernetes and runs natively in Kubernetes as a DaemonSet. All security observability events are automatically enriched with Kubernetes metadata such as pod names, labels, namespace information, and container SHAs. Observability and enforcement policies can be applied in a fine-grained manner to only apply to certain Kubernetes workloads. All aspects of Tetragon run natively in a distributed Kubernetes environment.
- Minimal Overhead: A result of using eBPF as the core mechanism for observing and filtering events in the kernel, Tetragon is highly efficient and remains invisible in terms of footprint. By limiting the data transferred to userland to only relevant events, Tetragon reduces overhead, removes noise, and eliminates race conditions and unnecessary delay when taking enforcement action.
- Simplified Observability: Installing Tetragon for observability is as simple as selecting policies from the policy library to immediately get a data rich view of what’s happening inside your Linux machines and Kubernetes clusters.
Getting Started with Tetragon Lab
In this lab, learn about Tetragon and explore detecting a container escape!
Tetragon has been designed specifically for security teams looking for a Kubernetes-native runtime security platform with minimal performance overhead that can provide comprehensive security observability data and rich enforcement capabilities. The revolutionary eBPF technology is what made Tetragon possible. eBPF is the foundation that allows Tetragon to significantly reduce overhead compared to prior solutions while also allowing it to mature and evolve together with Kubernetes and surrounding technologies.
Besides performance overhead, the Tetragon team wanted to keep the security capabilities of Tetragon approachable for everybody. While Tetragon makes full use of the powerful eBPF engine, it exposes a simple policy engine that can be fed with out of the box observability and enforcement policy rules to achieve common use cases while allowing for community collaboration to evolve the rulesets.
Observability Benchmarks: Understanding Tetragon Performance
The early design decisions to use eBPF for in-kernel monitoring and filtering have come to fruition. A core tenet of Tetragon is to optimize for collection and filtering in-kernel, and only transfer events of interest into user space. This significantly reduces the operational overhead seen in other tools, where a large amount of CPU cycles are spent in moving events from kernel space to user space before filtering.
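The difference between the two filtering models can be illustrated with a toy model. This is a minimal sketch in plain Python, not eBPF: the `Event` type, the simulated workload, and the `/etc/shadow` rule are all hypothetical. They only serve to count how many events would cross the kernel/user boundary in each model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    binary: str
    path: str


def is_interesting(event: Event) -> bool:
    # Hypothetical policy: only accesses to /etc/shadow are relevant.
    return event.path == "/etc/shadow"


def filter_in_user_space(events):
    """Every event is copied to user space first, then filtered there."""
    transferred = list(events)
    matched = [e for e in transferred if is_interesting(e)]
    return len(transferred), matched


def filter_in_kernel(events):
    """Events are filtered at the source; only matches are copied out."""
    matched = [e for e in events if is_interesting(e)]
    return len(matched), matched


# Simulated workload: many uninteresting reads and a single relevant one.
events = [Event("/usr/bin/postgres", f"/var/lib/db/{i}") for i in range(9_999)]
events.append(Event("/bin/cat", "/etc/shadow"))

user_transfers, user_hits = filter_in_user_space(events)
kernel_transfers, kernel_hits = filter_in_kernel(events)
print(f"user-space filtering transferred {user_transfers} events")
print(f"in-kernel filtering transferred {kernel_transfers} events")
```

Both models report the same match, but the in-kernel model moves far fewer events across the boundary, which is the effect the benchmarks in this section measure on real workloads.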
These early designs around smart collection have resulted in benchmarking that shows near baseline overhead across key use cases such as tracing every executable in the system and monitoring for suspicious activity.
Process Execution Tracking at <2% Overhead
A standard benchmark for Tetragon is to trace all process executions in an environment, which is a core function Tetragon performs. This is used to look for suspicious process execution, privilege tracking and to create forensics audit trails.
In this first benchmark, we evaluate a worst-case scenario with a stress test: building the 6.1.13 Linux kernel, an incredibly process-execution-heavy operation that generates approximately 1.5M security-relevant events at a rate of 2.6K events/s. A typical production workload will not produce this volume of exec and exit events, so this benchmark should provide a good upper bound on the expected overhead.
The above graph shows the resulting overhead while compiling the kernel (baseline), with Tetragon process execution tracking (yellow) adding a minimal 1.68% overhead and only 2.46% overhead when also writing all process execution events as JSON to disk (green).
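To put these percentages into wall-clock terms, the quoted figures can be combined in a rough back-of-the-envelope calculation. The sketch below assumes that the event rate was averaged over the whole build with Tetragon running, and it reuses the same build duration for both configurations. Both are simplifications, not statements from the benchmark itself.

```python
# Figures quoted in the text; both event numbers are approximate.
TOTAL_EVENTS = 1_500_000
EVENTS_PER_SECOND = 2_600
OVERHEAD_TRACKING = 1.68   # percent, process execution tracking
OVERHEAD_JSON = 2.46       # percent, tracking plus JSON export to disk


def implied_duration_seconds(total_events: float, events_per_second: float) -> float:
    """Build duration implied by the event count and the event rate."""
    return total_events / events_per_second


def added_seconds(measured_seconds: float, overhead_percent: float) -> float:
    """Seconds attributable to overhead in a run that already includes it."""
    baseline = measured_seconds / (1 + overhead_percent / 100)
    return measured_seconds - baseline


duration = implied_duration_seconds(TOTAL_EVENTS, EVENTS_PER_SECOND)
print(f"implied build time: {duration:.0f} s")
print(f"tracking only:      {added_seconds(duration, OVERHEAD_TRACKING):.1f} s added")
print(f"tracking + JSON:    {added_seconds(duration, OVERHEAD_JSON):.1f} s added")
```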
Scalable File Monitoring at Minimal Cost
The following benchmark looks at the performance impact of file monitoring. File monitoring is useful to track unauthorized access to files as well as modification of file content.
Monitoring file I/O is traditionally challenging because there is typically an incredibly high number of I/O operations happening on a system, in particular when databases or other data heavy workloads are being run. This benchmark test runs a workload that performs a high volume of reads. We use a workload performing 1K reads in sequence as fast as possible, then sleeping for one nanosecond. This sequence is performed across 32 threads in parallel, resulting in a very I/O intensive workload.
Tetragon is configured with a File Integrity Monitoring filter that matches none of the activity produced by the executed workload. This represents a policy that is looking for “needle in a haystack” suspicious activity and quantifies Tetragon’s ability to filter out noise while monitoring a large number of file operations.
In this benchmark, where no suspicious events occur, it is important that the observability tool can filter out the high volume of uninteresting reads (from trusted binaries or on files that are not interesting from a security policy perspective) and maintain low overhead. The small delta between the workload and Tetragon is all thanks to the in-kernel filters. By pushing as much as we can to the kernel, we continue to reduce the CPU cycles needed to manage these events while still capturing all useful events and metrics.
Tetragon’s low overhead in-kernel filtering means security teams no longer have a resource concern driving decisions on how many files to monitor or whether to enable FIM on systems with extensive I/O operations such as on database servers. Where legacy FIM solutions are overwhelmed by large file paths or scoping due to filtering in userspace, Tetragon filters out non-relevant events that are uninteresting to the policy, repetitive, or part of the normal expected behavior to minimize overhead. The traditional solution bar depicts an equivalent policy in a system using eBPF to gather data, but without in-kernel filtering.
TCP CRR: Logging or Monitoring High Volume Traffic
A crucial part of security observability is being able to monitor all network activity for suspicious activity. This requires tracking new connections made and ports opened. To benchmark this use case, we run a workload that opens new connections at a very high rate using netperf TCP_CRR to create a worst-case scenario and establish the performance overhead ceiling. A real-world scenario where this load may occur is running an L7 load-balancer or a web frontend. We then measure the overhead for two specific use cases: 1) monitoring all traffic for suspicious activity, which involves tracking connections and applying a filter to look for particular activity, and 2) logging every connection attempt as an event in the audit log.
In the first use case, we focus on monitoring network events for threats and unusual behavior. This benchmark demonstrates Tetragon’s minimal impact on monitoring large-scale network events. With Tetragon, the connection rate decreases by 5.88%, maintaining an average rate of approximately 26,250 requests per second. For comparison, we ran the same workload and benchmark with a well-established solution on the market. The benchmark demonstrates the benefit of in-kernel filtering, which discards irrelevant information as early as possible. The more traditional solution has to stream more events to user space for filtering, which is not only more costly but may eventually lead to loss of information if the rate of events exceeds the ability of the system to process them. In that case, the ability to detect suspicious activity may be compromised.
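The loss of information mentioned above can be pictured with a bounded buffer sitting between the kernel and a user-space consumer. The following is a simplified, hypothetical model (fixed per-tick rates, new events dropped when the buffer is full) and does not describe the internals of any particular tool.

```python
def simulate_drops(produced_per_tick: int, consumed_per_tick: int,
                   capacity: int, ticks: int) -> int:
    """Count events dropped when a bounded buffer cannot be drained fast enough."""
    queued = 0
    dropped = 0
    for _ in range(ticks):
        # The producer can only enqueue as many events as there is free space.
        accepted = min(produced_per_tick, capacity - queued)
        dropped += produced_per_tick - accepted
        queued += accepted
        # The consumer drains up to its per-tick processing capacity.
        queued -= min(queued, consumed_per_tick)
    return dropped


# Filtering at the source keeps the stream below what the consumer can process.
print(simulate_drops(produced_per_tick=10, consumed_per_tick=80,
                     capacity=1000, ticks=100))
# Streaming everything exceeds that capacity, so events are eventually lost.
print(simulate_drops(produced_per_tick=100, consumed_per_tick=80,
                     capacity=1000, ticks=100))
```

Once the buffer fills, every further tick loses the difference between the production and consumption rates, no matter how large the buffer is.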
The second benchmark explores the resource-intensive task of logging every network event in JSON format to disk. This requires an event to be streamed to the agent in user space to log every new connection which is more costly than just looking for suspicious activity and alerting.
Tetragon’s flexibility allows security teams to choose at what granularity information should be collected. For example, it may be beneficial to keep the detection of suspicious network activity enabled on all nodes but exclude nodes running L7 load-balancers from the full connection logging if the connection log maintained by the load-balancer is sufficient.
Simple Observability with Default Observability Policies
Minimal overhead is fantastic, but only if usability is not compromised as a trade-off. Tetragon 1.0 introduces a library of default observability policies covering a wide range of common security use cases, such as monitoring kernel module loads, detecting the execution of binaries stored in /tmp, keeping a network audit log of all connections accepted or initiated by sshd, tracking all invocations of sudo, and many more.
Usage of the library is as simple as identifying a default observability policy matching your use cases and applying it in your Kubernetes cluster. The policies can be used as-is or can be used as a template for customization before applying them.
$ kubectl apply -f https://raw.githubusercontent.com/cilium/tetragon/main/examples/policylibrary/modules.yaml
The observability policy library not only provides you with the policies but also with best-practice jq filters to format the raw logs in a way that is useful for the particular use case. Here is an example of how the jq filter provided with the kernel module loading policy shows that an invocation of iptables triggered the loading of the ipt_LOG kernel module.
2023-11-01T04:11:38.390880528Z /sbin/iptables -A OUTPUT -m cgroup --cgroup 1 -j LOG module:ipt_LOG
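The same kind of formatting can be done in any language that reads JSON lines. Below is a minimal Python sketch that turns one exported event into a line like the one above. The sample event is hand-written for illustration, and its field names (`process_kprobe`, `process.binary`, `process.arguments`, `args[].string_arg`) are assumptions modeled on Tetragon’s JSON export. Verify them against real output before relying on them.

```python
import json
from typing import Optional

# Hand-written sample event; the structure is an assumption, not captured output.
SAMPLE_LINE = json.dumps({
    "time": "2023-11-01T04:11:38.390880528Z",
    "process_kprobe": {
        "process": {
            "binary": "/sbin/iptables",
            "arguments": "-A OUTPUT -m cgroup --cgroup 1 -j LOG",
        },
        "args": [{"string_arg": "ipt_LOG"}],
    },
})


def format_module_load(line: str) -> Optional[str]:
    """Render one JSON event as 'time binary arguments module:<name>'."""
    event = json.loads(line)
    kprobe = event.get("process_kprobe")
    if kprobe is None:
        return None  # not a kprobe event, nothing to format
    process = kprobe.get("process", {})
    modules = [arg["string_arg"] for arg in kprobe.get("args", [])
               if "string_arg" in arg]
    if not modules:
        return None  # no module name attached to this event
    return (f'{event["time"]} {process.get("binary", "")} '
            f'{process.get("arguments", "")} module:{modules[0]}')


print(format_module_load(SAMPLE_LINE))
```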
Kubernetes-Native Observability – Powerful & Efficient
By providing label and namespace filters in eBPF, Tetragon not only enables teams to filter on Kubernetes metadata and to apply policies to particular Kubernetes workloads, it does so incredibly efficiently. If a particular policy should only be applied to Kubernetes workloads with a particular label, this logic is encoded directly in the kernel using eBPF, which makes it possible to observe and enforce runtime behavior with very little overhead.
In the policy snippet below, we demonstrate a policy that monitors the setuid() system call and logs all calls to it if the process is running inside a Kubernetes pod with the sensitive-workload label.
After applying the policy, we can demonstrate the observability gained by having a malicious user do a kubectl exec into a payment-backend-service Kubernetes pod which has the sensitive-workload label set and attempt to gain root access via the su - command. Tetragon generates events for each action, which can be visualized via the Tetragon CLI. The output below shows the process executions of kubectl exec spawning a shell and the setuid system call performed by su, with all the Kubernetes identities and workload metadata involved. These events can be tied together to form real-time alerts or enforcement points, or transferred to a SIEM or an S3 bucket for further threat analysis.
🚀 process sensitive-namespace/payment-backend-service /bin/bash
🚀 process sensitive-namespace/payment-backend-service /usr/bin/su -
🔑 setuid sensitive-namespace/payment-backend-service /usr/bin/su 0
🚀 process sensitive-namespace/payment-backend-service /bin/bash
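Tying such events together into an alert can be sketched as follows. This is a hypothetical illustration in Python: the simplified event dictionaries mirror the compact output above rather than Tetragon’s actual event schema, and the rule (a setuid to UID 0 in a pod where a shell was started earlier) is just one example of a correlation.

```python
SHELLS = {"/bin/bash", "/bin/sh"}

# Simplified events mirroring the compact CLI output (not the real schema).
EVENTS = [
    {"kind": "process", "pod": "sensitive-namespace/payment-backend-service",
     "binary": "/bin/bash"},
    {"kind": "process", "pod": "sensitive-namespace/payment-backend-service",
     "binary": "/usr/bin/su"},
    {"kind": "setuid", "pod": "sensitive-namespace/payment-backend-service",
     "binary": "/usr/bin/su", "uid": 0},
    {"kind": "process", "pod": "sensitive-namespace/payment-backend-service",
     "binary": "/bin/bash"},
]


def root_escalations(events):
    """Alert on setuid(0) calls in pods where a shell was started earlier."""
    pods_with_shell = set()
    alerts = []
    for event in events:
        pod = event["pod"]
        if event["kind"] == "process" and event["binary"] in SHELLS:
            pods_with_shell.add(pod)
        elif (event["kind"] == "setuid" and event["uid"] == 0
              and pod in pods_with_shell):
            alerts.append(f"{pod}: {event['binary']} called setuid(0) "
                          "after a shell was started")
    return alerts


for alert in root_escalations(EVENTS):
    print(alert)
```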
The benefits shine both when building policies and when correlating events. By bringing Kubernetes identity awareness, the threshold is lowered for creating and applying a robust set of runtime policies. Additionally, when events are generated, the metadata brings context that simplifies observability at scale. Tetragon also captures information about processes, as we will explore in the use case below when linking parent and child processes together to build out process trees from container start. This flexible Kubernetes identity awareness makes it easy to implement a set of baseline security observability policies that simply plug in to your environment.
Tetragon Policy Library: Plug in a Simplified K8s Aware Policy!
Explore the Tetragon policy library, with ready-to-apply observability and tracing policies.
What are Tetragon’s Use Cases?
Auditing Use of Kubectl Exec
Tetragon can see a tremendous amount of activity in the kernel and in applications. Unless filtered and turned into an actual signal, the total amount of observability data can be overwhelming. It may not be beneficial to log every single process execution into your SIEM.
Instead, let’s look at a use case where we want Tetragon to constantly monitor the entire system but specifically log kubectl exec invocations. This sounds costly and noisy on the surface, but rather than streaming every process exec event to user space and parsing through events there (costing significant overhead and creating race conditions), Tetragon reduces the events that need to be streamed to those that provide actual signal and are relevant to the user.
In the dashboard above, every kubectl exec command is captured and logged while providing fine-grained details on Kubernetes pod, namespace, binary executed, as well as all arguments to the binary.
Using Tetragon for observability, security teams and system admins can promptly identify any suspicious or unauthorized activity and log all relevant data for a forensic investigation. Tetragon’s ability to intelligently capture events makes it efficient and powerful for Kubernetes security and compliance teams operating in highly sensitive production environments.
Process Ancestry Trees + Correlating Network with Runtime Telemetry
Tetragon removes observability silos and makes it easy to correlate security-related network and runtime events to create alerts, enforce actions, and provide meaningful observability. This provides intelligence in two meaningful ways:
- Network activity is often the origin (network-based attack) or the desired follow-up activity (data exfiltration via the network) of a security attack
- Individual events often do not carry sufficient information on their own to produce a strong signal but by understanding the correlation of lateral movement on the network and security relevant activity on a system, strong intelligence can be gained.
Tetragon is capable of tying particular process activity to network connections. It is also capable of understanding the process ancestry activity using a race-free method to understand which process spawned another process.
The process tree above shows a great example where individual events may not provide a compelling signal, but seeing the whole picture makes it obvious that a reverse shell attack has been performed.
In the top left, we see that Tetragon recorded the Kubernetes namespace tenant-jobs and the exact pod in question. The pod is managed by containerd as the container runtime, which itself has spawned off the init process. Inside the pod we see a Node.js app performing some network activity to api.twitter.com and to an Elasticsearch cluster hosted in Kubernetes.
Looking further down the process tree, we see interesting events from a lateral movement attack that happened five minutes after the container was started. A netcat (nc) binary was spawned to establish a reverse shell, which was then used to invoke curl to reach out to the Elasticsearch cluster on TCP port 9200. It is also likely that data was exfiltrated to an S3 bucket using the DNS name “malicious-bucket.s3.amazonaws.com” over TCP port 80. Instead of viewing these process events and network communication in isolation, Tetragon allows for real-time correlation and presentation of the events in a single compelling view.
By connecting network events with detailed process information, Tetragon enables security professionals to paint a comprehensive and real-time picture of their system’s activity. This means not only identifying potential threats but also understanding how they propagate through the network, from the low-level intricacies of Linux processes to the high-level abstractions of Kubernetes pods and namespaces.
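The ancestry linking described in this section can be sketched by joining each process to its parent through a pair of identifiers. The example below is a minimal Python illustration. The `exec_id` and `parent_exec_id` field names are assumptions modeled on Tetragon’s event format, and the sample processes are hypothetical stand-ins loosely following the reverse shell scenario.

```python
from collections import defaultdict

# Hypothetical exec events; IDs and binaries are made up for illustration.
EXEC_EVENTS = [
    {"exec_id": "1", "parent_exec_id": None, "binary": "containerd-shim"},
    {"exec_id": "2", "parent_exec_id": "1", "binary": "init"},
    {"exec_id": "3", "parent_exec_id": "2", "binary": "node"},
    {"exec_id": "4", "parent_exec_id": "3", "binary": "nc"},
    {"exec_id": "5", "parent_exec_id": "4", "binary": "sh"},
    {"exec_id": "6", "parent_exec_id": "5", "binary": "curl"},
]


def build_index(events):
    """Index processes by ID and group child IDs under their parent ID."""
    by_id = {}
    children = defaultdict(list)
    for event in events:
        by_id[event["exec_id"]] = event
        children[event["parent_exec_id"]].append(event["exec_id"])
    return by_id, children


def render(by_id, children, exec_id, depth=0):
    """Return the subtree rooted at exec_id as indented lines."""
    lines = ["  " * depth + by_id[exec_id]["binary"]]
    for child_id in children.get(exec_id, []):
        lines.extend(render(by_id, children, child_id, depth + 1))
    return lines


by_id, children = build_index(EXEC_EVENTS)
# A root is any process whose parent was not observed.
roots = [e["exec_id"] for e in EXEC_EVENTS if e["parent_exec_id"] not in by_id]
tree_lines = [line for root in roots for line in render(by_id, children, root)]
print("\n".join(tree_lines))
```

Seen individually, the `nc`, `sh`, and `curl` executions are unremarkable. Rendered under the `node` process that spawned them, they form the suspicious chain discussed above.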
File Integrity Monitoring at Scale
Traditional file monitoring introduces a number of challenges even for simple tasks such as understanding changes to files. These concerns create limitations in the scope and resilience of monitoring sensitive files:
- Traditional approaches relying on scanning files in some interval are fundamentally racy when file content is reverted back before the next scan occurs.
- Resource usage can be quickly exhausted when watching large directories and subdirectories.
- The resource usage makes traditional file integrity monitoring vulnerable to DoS attacks by creating a large number of files or rapidly changing file content.
- Lack of support for monitoring hard links.
- File deletions often remove the ability to trace further actions on the respective file path.
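The first limitation in the list above is easy to reproduce. The following minimal sketch assumes a hypothetical scanner that compares SHA-256 digests between two scans. It modifies a temporary file and reverts it before the second scan, so the scanner sees nothing.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path: Path) -> str:
    """Content hash as a periodic scanner would record it."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def scan_detects_reverted_change() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "sshd_config"  # hypothetical monitored file
        target.write_text("PermitRootLogin no\n")
        first_scan = digest(target)

        # Between two scans: the file is tampered with, then restored.
        target.write_text("PermitRootLogin yes\n")
        target.write_text("PermitRootLogin no\n")

        second_scan = digest(target)
        return first_scan != second_scan


print("change detected by scanning:", scan_detects_reverted_change())
```

A monitor that observes each write as it happens would have seen both modifications, which is the advantage of event-based monitoring over interval scanning.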
Tetragon has distinct advantages that improve FIM from a perspective of accuracy and scale. By implementing FIM directly in the Linux kernel using eBPF, it can perform real-time monitoring of files and directories and eliminate the usual race conditions. The minimal performance overhead eliminates the DoS attack vector and increases the feasibility to establish wide-reaching FIM usage for scanning of suspicious activity.
Tetragon is Kubernetes aware and capable of selectively applying FIM policies based on the Kubernetes workload or other process characteristics of where the File I/O interaction originates.
As shown in the benchmarking results, this level of granular FIM is achieved at nominal overhead. Below is a snippet of a FIM policy used to monitor a set of sensitive files on the system, including critical configuration files and SSH key storage locations. The policy is divided into a set of files that are monitored for writes and a set monitored for both reads and writes. The policy generates an event whenever one of the monitored files is accessed in a matching way. This example shows the observability side of Tetragon; a modified version of the policy can enforce these rules and prevent malicious file reads and writes automatically.
Let’s evaluate what happens when a Kubernetes workload attempts to read the /etc/shadow file. This file is a particularly sensitive Linux password file storing hashed passwords. When a Kubernetes pod runs cat to read the file on the xwing pod, the following read event is generated by Tetragon showing which pod the event occurred on, what process was run, that the file was read, and that the process exited successfully.
🚀 process default/xwing /bin/bash -c "cat /etc/shadow"
🚀 process default/xwing /bin/cat /etc/shadow
📚 read default/xwing /bin/cat /etc/shadow
💥 exit default/xwing /bin/cat /etc/shadow 0
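The matching logic behind such a policy, with one set of paths watched for writes and another watched for both reads and writes, can be sketched in a few lines. This is an illustration in Python rather than Tetragon policy syntax, and apart from /etc/shadow the path lists are hypothetical.

```python
# Hypothetical path sets; only /etc/shadow is taken from the example above.
READ_WRITE_PREFIXES = ("/etc/shadow", "/root/.ssh")
WRITE_ONLY_PREFIXES = ("/etc/passwd", "/etc/sudoers")


def should_report(path: str, operation: str) -> bool:
    """Decide whether a file access matches the monitoring policy."""
    if operation not in ("read", "write"):
        raise ValueError(f"unsupported operation: {operation}")
    # Prefix matching: "/etc/shadow" also covers "/etc/shadow-".
    if path.startswith(READ_WRITE_PREFIXES):
        return True
    if operation == "write" and path.startswith(WRITE_ONLY_PREFIXES):
        return True
    return False


print(should_report("/etc/shadow", "read"))
print(should_report("/etc/passwd", "read"))
print(should_report("/etc/passwd", "write"))
```

In Tetragon this decision is made in the kernel, so accesses that do not match never reach user space.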
Monitoring Network Connections for Lateral Movement
An essential security use case is to monitor and control your network activity. Since deviations from expected traffic patterns might indicate unauthorized or potentially malicious network activity, security teams rely on having access to Kubernetes identity aware network activity data captured in real-time with low overhead to detect network attacks fast.
eBPF provides deep network visibility for processes on the system, including every Kubernetes workload. Tetragon is capable of tying the observed network connectivity to the process context to identify the sending and/or receiving process of all activity. This allows us to build a comprehensive log of every process that received or initiated network activity, or even just started silently listening on a network port. This information is incredibly valuable for threat hunting and incident investigation and is the basis for understanding and tracing lateral movement during incidents.
A good example is the following vulnerability report “SSRF in Shopify leads to ROOT access in all instances”: an attacker accessed the Google Cloud Metadata to move laterally and gain root access to any container in the environment.
The Kubernetes identity aware runtime security observability data provided by Tetragon gives security teams historical information of all network activity with rich process-level visibility.
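As an illustration of how such a connection log could be queried for the metadata-access pattern from the report above, here is a minimal Python sketch. The event dictionaries are hypothetical and do not follow Tetragon’s schema. 169.254.169.254 is the link-local address on which cloud metadata services, including Google Cloud’s, are commonly reachable.

```python
import ipaddress

METADATA_ADDRESS = ipaddress.ip_address("169.254.169.254")

# Hypothetical connection log entries with process and pod context.
CONNECTIONS = [
    {"pod": "default/web-frontend", "binary": "/usr/bin/node",
     "destination_ip": "10.0.12.7", "destination_port": 5432},
    {"pod": "default/web-frontend", "binary": "/usr/bin/curl",
     "destination_ip": "169.254.169.254", "destination_port": 80},
]


def metadata_connections(events):
    """Return (pod, binary) pairs that connected to the metadata address."""
    hits = []
    for event in events:
        if ipaddress.ip_address(event["destination_ip"]) == METADATA_ADDRESS:
            hits.append((event["pod"], event["binary"]))
    return hits


for pod, binary in metadata_connections(CONNECTIONS):
    print(f"{pod}: {binary} connected to the cloud metadata service")
```

Because each connection carries the pod and binary that initiated it, an investigator can tell immediately which workload reached the metadata service and through which process.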
Conclusion and Get Started!
At Isovalent, we are incredibly proud of all the progress the Tetragon community has made to get to this key 1.0 milestone and excited for all the places Tetragon will grow over the coming releases.
Tetragon AMA and Deep Dive – December 7th
Interested in connecting with the team behind Tetragon and learning more?
Join our webinar on December 7th. Meet Thomas Graf, Co-Founder and CTO, and Natalia Reka Ivanko, Security Product Manager. Together, they will feature an introduction to Tetragon, demos, and the opportunity to ask any questions.
Tetragon at KubeCon + CloudNativeCon Chicago – November 6th-9th
Tetragon will be featuring heavily at KubeCon + CloudNativeCon, and we would love to chat in-person. Below are two accepted talks by Natalia Reka Ivanko and John Fastabend, Principal Engineer and Tetragon Maintainer.
💥Thursday November 9th, 2:55-3:30pm: Paint the Picture! – Detecting Suspicious Data Patterns in Encrypted Traffic with eBPF and KTLS – Natalia Reka Ivanko & John Fastabend, Isovalent
Isovalent will be at booth number #I11 with book signings by Liz Rice (“What is eBPF?”), Bill Mulligan (“eBPF for Kids”) and Natalia Ivanko (“Security Observability with eBPF”) throughout the week. Join the sessions, come by the booth, and bring all your questions.
Try Tetragon Now
Looking to try out Tetragon right now? Begin the ‘Getting started with Tetragon’ sandbox lab walking through detecting a container escape attack.