• Raymond de Jong
    About the speaker: Raymond de Jong, Field CTO

Egress Gateway High Availability

[15:06] In this video, learn with Raymond de Jong how Egress Gateway HA can provide enterprise users with resilience for their egress gateway traffic.

Transcript

Hello everyone, and welcome to this demo of the Egress Gateway HA feature, which is available in Cilium Enterprise. My name is Raymond de Jong, and I’m a Senior Solutions Architect at Isovalent. In this demo, we’ll start with an introduction to the Egress Gateway HA feature. After that, we’ll talk about the requirements and the actual setup of the Cilium Egress NAT Policy CRD. Then, we’ll jump into a demo environment that we’ve prepared on AWS, which has a specific implementation of the Egress Gateway HA solution, and we’ll explain why and how we’ve configured it that way as an example implementation of the feature.

In a perfect Kubernetes environment, you only need to worry about the workloads running inside your Kubernetes clusters. In reality, however, most workloads in enterprise environments running inside Kubernetes clusters need to be able to reach workloads on the physical network or in clouds. This can be, for example, a database running on the physical network or Amazon S3 buckets on AWS. Additionally, Pod IPs and node IPs are ephemeral, which makes dealing with them in enterprise environments a challenge. Most likely, those environments have some kind of routing or firewall infrastructure that needs to allow traffic from your Kubernetes clusters to, let’s say, your database environment, and you most likely don’t want to allow the full Pod CIDR range access to your production database. This means that your network engineer needs predictable IPs that can be allowed from the Kubernetes clusters towards your database.

In short, the egress gateway HA feature allows traffic originating in Pods and destined to specific CIDRs outside your cluster to be routed through particular nodes. When the egress gateway feature is enabled, and egress NAT policies are in place, packets leaving the cluster are masqueraded with selected predictable IPs, which are associated with the gateway nodes. In this diagram, you see a gateway group, which is a set of specific nodes selected to forward traffic from the Kubernetes cluster to external resources, in this case, a legacy app with the IP address 1.2.3.4. This means that traffic from a given Pod trying to reach the legacy app in this example will, using eBPF, be forwarded to a specific gateway group, which consists of a number of nodes configured with some kind of external IP addresses.

Enterprise environments require additional redundancy and high availability of the egress gateway feature in order for the cluster to tolerate node failures. To this end, the Isovalent Cilium Enterprise release brings support for egress gateway high availability, which allows an egress NAT policy to select a group of gateway nodes instead of a single one. This means that multiple nodes acting as egress gateways make it possible both to distribute traffic across different gateways and to have additional gateways in a hot standby setup, ready to take over in case an active gateway fails.

To summarize the example in the diagram, we have a gateway group consisting of three nodes, of which two actively take part in forwarding traffic, which means that, in case of a failure of one of the two active gateway nodes, the other hot standby gateway node will take over forwarding traffic to the remote resources.

Let’s have a look at the requirements for egress gateway HA. Cilium must make use of network-facing interfaces and IP addresses present on your designated gateway nodes. These interfaces and IP addresses must be provisioned and configured by the operator based on your networking environment. This process is obviously highly dependent on your environment.

When working with AWS or EKS environments, and depending on your requirements, you may need to create one or more elastic network interfaces with one or more IP addresses and attach them to the instances that serve as gateway nodes, so that AWS can properly route traffic flowing to and from your instances. Other cloud providers may have similar networking requirements and constructs. In physical environments, it may mean that you need to configure your gateway nodes with specific network-facing interfaces connected to some kind of transit network, connecting to routers or external firewalls, which, in turn, can forward traffic to your external resources. Additionally, enabling the egress gateway feature requires that both BPF masquerading and the kube-proxy replacement are enabled.
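As a rough sketch, the corresponding Helm values might look like the following; the flag for enabling the enterprise egress gateway HA feature is shown here as “egressGatewayHA.enabled” and is an assumption, so check the documentation for your Cilium Enterprise release.

```yaml
# Sketch of Cilium Helm values for the requirements mentioned above.
# Flag names can differ between releases; verify against your version's docs.
bpf:
  masquerade: true              # BPF masquerading must be enabled
kubeProxyReplacement: strict    # kube-proxy replacement must be enabled
egressGatewayHA:
  enabled: true                 # assumed flag name for the enterprise HA feature
```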

We will now continue to look at how the Cilium egress NAT policy CRD is created. First of all, the Cilium egress NAT policy is a cluster-scoped CRD. We need to select the source to which the policy is applied: the “egress” field of the Cilium egress NAT policy selects the source pods using match labels, which lets you match a specific label applied to specific pods. You can also select pods using match expressions, as shown. Multiple pod selectors can be specified, and you can also select Pods belonging to a given namespace using the “io.kubernetes.pod.namespace” label.
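For illustration, a fragment of that source selection might look like this; the label values are examples only, and the exact field names may differ slightly between releases.

```yaml
# Illustrative fragment of a Cilium egress NAT policy spec: selecting source pods.
egress:
- podSelector:
    matchLabels:
      app: my-app                                 # example application label
      io.kubernetes.pod.namespace: my-namespace   # restrict the match to one namespace
- podSelector:
    matchExpressions:                             # selection by expression is also possible
    - key: tier
      operator: In
      values: ["backend"]
```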

Next, we need to select the destination to which this policy applies. You can specify one or more destination CIDR selectors with the “destination-cidr-selector” field. You can specify a single IP (a /32), a subnet, or any destination using “0.0.0.0/0”. Note, however, that any IP belonging to these ranges that is also an internal cluster IP will be excluded from the egress gateway source NAT logic.
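A corresponding destination fragment might look like this; the CIDR values are examples, and the field is shown here as “destinationCIDRs”.

```yaml
# Illustrative fragment: destinations covered by the policy.
destinationCIDRs:
- 10.20.30.40/32     # a single IP
- 192.168.10.0/24    # a subnet
# - 0.0.0.0/0        # or any destination; internal cluster IPs are still excluded
```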

The group of nodes that should act as gateway nodes for a given policy can be configured with the “egress-group” field. The nodes are matched using the “node-selector” field, based on their labels. You can limit the number of active gateway nodes using the “max-gateway-nodes” field. Limiting the number of active gateway nodes is very useful for setting up a hot standby configuration, where only a given number of nodes is active at the same time, so that in case one of them fails, other nodes are ready to take over.

We also need to configure the IP address that should be used to source NAT traffic leaving the nodes. This can be done in a few ways: you can specify the interface, in which case the first IPv4 address assigned to that interface is used; this also lets you leverage multiple interfaces, each with a specific IP address, for specific configurations. Alternatively, you can explicitly specify the egress IP address, which requires that the egress IP is actually assigned to a network device on the node. If you specify neither, the first IPv4 address assigned to the interface of the default route on the node is used. You can also specify multiple egress groups, which are combined together, and you can use different node selectors within each egress group, so you can match different labels for nodes that have specific configurations.
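Put together, the egress group configuration described above might look roughly like this; the label values and the egress IP are placeholders, and the field names follow the narration rather than a specific release.

```yaml
# Illustrative fragment: egress groups with the fields discussed above.
egressGroups:
- nodeSelector:
    matchLabels:
      io.cilium/egress-gateway: "true"   # select gateway nodes by label
  interface: eth1                        # use the first IPv4 address assigned to eth1
  maxGatewayNodes: 2                     # keep any further matched nodes as hot standbys
- nodeSelector:
    matchLabels:
      egress-group: group-b              # a second group with its own selector (example label)
  egressIP: 192.0.2.10                   # or pin an explicit egress IP assigned to the node
```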

Let’s now look at an example of a Cilium egress NAT policy resource that conforms to the specification. In this example, we want to source NAT traffic to any destination, using the destination CIDR “0.0.0.0/0”. Then, we select the pod using the pod selector with match labels: we’re looking for the “netshoot” pod in the “namespace-1” namespace. Then we use the egress groups to select the nodes. First, we select the node which has the label “a-specific-node”; in this example, that node has a specific IP address configured to be used for source NATting traffic to any destination. Additionally, we select the nodes which match the labels for a specific worker group and a specific zone, which is specific to, for example, an AWS implementation. For that group of nodes we use the eth1 interface, and we assume that the first IP address configured on that interface on a given node will be used to source NAT traffic to any destination. We also specify a maximum of two gateway nodes, so if this worker group has three, four, or five nodes in the pool, only two nodes are used actively, which means the other nodes will be hot standbys.
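A rough reconstruction of this example might look as follows; the apiVersion, label keys, and the egress IP are placeholders for illustration and are not taken from the actual slide.

```yaml
# Rough reconstruction of the example policy described above.
apiVersion: cilium.io/v2            # assumed API group/version for the enterprise CRD
kind: CiliumEgressNATPolicy
metadata:
  name: example-egress-policy       # illustrative name
spec:
  destinationCIDRs:
  - 0.0.0.0/0                       # source NAT traffic to any destination
  egress:
  - podSelector:
      matchLabels:
        app: netshoot                                # the netshoot pod (assumed label key)
        io.kubernetes.pod.namespace: namespace-1     # in namespace-1
  egressGroups:
  - nodeSelector:
      matchLabels:
        node: a-specific-node       # the single node with a dedicated egress IP (example label key)
    egressIP: 192.0.2.20            # placeholder for that node's specific egress IP
  - nodeSelector:
      matchLabels:
        worker-group: egress-gateways             # example worker-group label
        topology.kubernetes.io/zone: us-west-2a   # example zone label
    interface: eth1                 # the first IPv4 address on eth1 is used
    maxGatewayNodes: 2              # only two nodes active; the rest are hot standbys
```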

Now it’s time to have a look at a real implementation, in this case on AWS, where we have a Kubernetes cluster running. For this demo, we created a specific auto-scaling group with a capacity of exactly four nodes. These nodes are solely meant for capabilities like the egress gateway, so they’re not running any workloads on the Kubernetes cluster, and they are tainted so they are not schedulable. Here you can see we have a desired capacity of four, and if we look at the instances, we see that we’ve created an instance in each availability zone to make sure we have availability across the zones in the US West region.

Now let’s look at the output of ‘kubectl get nodes -o wide.’ Here you can also see the nodes; this, for example, is one of the nodes allocated for the egress gateway capability. And if we do a ‘kubectl describe node’ for this specific node, for example, we can see we’ve made sure it is labeled with ‘io.cilium/egress-gateway=true’ and also that it has a ‘NoSchedule’ taint to avoid workloads being scheduled on these nodes.
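For reference, the relevant parts of such a Node object would look roughly like this; the label and the NoSchedule effect come from the demo, while the taint key and node name are assumptions.

```yaml
# Excerpt of a gateway Node object as described above (illustrative values).
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.us-west-2.compute.internal   # example EKS node name
  labels:
    io.cilium/egress-gateway: "true"              # marks the node as an egress gateway
    topology.kubernetes.io/zone: us-west-2a       # standard zone label on EKS nodes
spec:
  taints:
  - key: io.cilium/egress-gateway                 # assumed taint key
    value: "true"
    effect: NoSchedule                            # keeps regular workloads off these nodes
```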

So this is the Cilium egress NAT policy we have applied to this cluster. The name is ‘egress-gateway-nat-policy.’ The destination CIDR is the EKS control plane CIDR, which is basically the VPC IP address range we use for this purpose. For this demo, we’ve created an EC2 instance as an echo server, which is also running with an IP in this destination range. Then we select the pods: we use the specific name ‘egress-gateway-monitor,’ which matches the pods monitoring (actually pinging) the echo server in the namespace ‘cilium-egress-gateway.’ And then we select the egress groups.

What we’ve basically done in this implementation is give each node in each zone a specific IP configured as the egress IP, so all the traffic leaving these nodes leaves with that specific IP. We have four nodes, matching our auto-scaling group, and each node has a specific IP we use. We select the nodes using the label ‘io.cilium/egress-gateway=true,’ as shown earlier, but also ‘topology.kubernetes.io/zone=us-west-2a,’ which selects the specific node in the us-west-2a zone with that label. We’ve also created a nice Grafana dashboard showing an even distribution of traffic to all four nodes. We can monitor the load time and the requests per second, and we can highlight any HTTP errors in our requests, any wrong outbound IPs being used, and the HTTP response codes from the echo server for each of the nodes.
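Sketched out, the demo policy described here might look like the following; the VPC CIDR, the egress IPs, and the pod label key are placeholders, and only two of the four per-zone groups are shown.

```yaml
# Sketch of the demo policy as narrated (placeholder CIDR, IPs, and label key).
apiVersion: cilium.io/v2
kind: CiliumEgressNATPolicy
metadata:
  name: egress-gateway-nat-policy
spec:
  destinationCIDRs:
  - 10.0.0.0/16                      # placeholder for the EKS VPC CIDR
  egress:
  - podSelector:
      matchLabels:
        name: egress-gateway-monitor                        # assumed label key for the monitor pods
        io.kubernetes.pod.namespace: cilium-egress-gateway
  egressGroups:
  - nodeSelector:
      matchLabels:
        io.cilium/egress-gateway: "true"
        topology.kubernetes.io/zone: us-west-2a
    egressIP: 10.0.10.100            # placeholder egress IP for the us-west-2a node
  - nodeSelector:
      matchLabels:
        io.cilium/egress-gateway: "true"
        topology.kubernetes.io/zone: us-west-2b
    egressIP: 10.0.11.100            # placeholder egress IP for the us-west-2b node
  # ...one group per remaining availability zone, each with its own egress IP
```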

Enterprises require high availability for egress connectivity. For this demo, I will reboot one of the egress gateway nodes. We should see other nodes take over the traffic and rebalance it on the remaining available hosts. Once the rebooted host comes back, it will rejoin the pool and rebalance the traffic across all four available nodes. Let’s select one of the egress gateway nodes and reboot it.

To follow things more closely, I have set up this dashboard to only show the last five minutes. We can closely observe other nodes taking over the load for external connectivity. I am also watching the Cilium BPF egress list, which currently shows four nodes, as I have four nodes participating as egress gateway nodes. We should see one of the nodes being removed as soon as the Cilium health check notices that the node is no longer available and removes it from the Cilium egress BPF list. On the left is the Pod that is reaching out to the echo server, and it is also providing the metrics for the Grafana dashboard.

As one of the gateway nodes reboots, we can already see that the Cilium BPF egress gateway list has removed that node from the BPF map. It notices this very quickly because the default health check period is two seconds. We also saw a few outbound errors from the pod checking the echo server.

In a moment, we should see the load between the three remaining nodes being redistributed. Now we can clearly see that while one of the nodes is rebooting, the other nodes have redistributed the traffic for the egress gateway. Once the node has completed its startup and is available again, it is added back to the BPF egress gateway list and is available again for providing egress connectivity. We can already see that the traffic is being redistributed again between the four available egress gateway nodes.

This concludes the Cilium egress gateway demo and feature presentation. The team at Isovalent is available to answer your questions and support you in designing and implementing the Cilium egress gateway feature to meet your requirements, either on-prem or in the cloud. Thank you for watching, and until the next video!