
How to Deploy Cilium and Egress Gateway in Azure Kubernetes Service (AKS)

Amit Gupta
Integrating Kubernetes into Traditional Infrastructure with HA Egress Gateway

Kubernetes changes the way we think about networking. In an ideal Kubernetes world, the network would be flat, and the Pod network would control all routing and security between the applications using Network Policies. In many Enterprise environments, though, the applications hosted on Kubernetes need to communicate with workloads outside the Kubernetes cluster, subject to connectivity constraints and security enforcement. Because of the nature of these networks, traditional firewalling usually relies on static IP addresses (or at least IP ranges). This can make it difficult to integrate a Kubernetes cluster, which has a varying and, at times, dynamic number of nodes, into such a network. Cilium’s Egress Gateway feature changes this by allowing you to specify which nodes should be used by a pod to reach the outside world. This blog post will walk you through deploying Cilium and Egress Gateway in AKS (Azure Kubernetes Service) using BYOCNI as the network plugin.

What is an Egress Gateway?

The Egress Gateway feature allows traffic originating in pods and destined to specific CIDRs outside the cluster to be routed through particular nodes.

When the egress gateway feature is enabled and egress gateway policies are in place, packets leaving the cluster are masqueraded with selected, predictable IPs associated with the gateway nodes. This feature can be used with legacy firewalls to allow traffic to legacy infrastructure only from specific pods within a given namespace. These pods typically have ever-changing IP addresses. Even if masquerading were to be used to mitigate this, the IP addresses of nodes can also change frequently over time.

Egress IP in Cilium 1.10

For example, with the following resource:

apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"

With this policy, all pods (the empty podSelector matches every pod) are routed through a gateway node carrying the egress-node: "true" label, and their traffic is masqueraded with that node's egress IP (192.168.11.4 in this example) when they reach out to any address in the 192.168.11.0/24 range outside the cluster.

What is Isovalent Enterprise for Cilium?

Isovalent Enterprise for Cilium is an enterprise-grade, hardened distribution of the open-source projects Cilium, Hubble, and Tetragon, built and supported by the Cilium creators. Cilium enhances networking and security at the network layer, while Hubble ensures thorough network observability and tracing. Tetragon ties it all together with runtime enforcement and security observability, offering a well-rounded solution for connectivity, compliance, multi-cloud, and security concerns.

Why Isovalent Enterprise for Cilium?

While Egress Gateway in Cilium is a great step forward, most enterprise environments should not rely on a single point of failure for network routing. For this reason, Isovalent introduced Egress Gateway High Availability (HA), which supports multiple egress nodes. The egress nodes can be configured using an egressGroups parameter in the IsovalentEgressGatewayPolicy resource specification, which we will detail in Scenario 2 of the tutorial below.

When a policy defines two egress groups (as in the sketch below) and one of them does not contain any healthy nodes, traffic is routed only through the other one. If both groups contain healthy nodes, traffic is distributed randomly between them.
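For illustration, a policy with two egress groups, each selecting gateway nodes in a different availability zone, might look like the sketch below (the zone values match the multi-AZ examples later in this post):

apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-ha-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  egressGroups:
  # first group of gateway nodes
  - nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-1
  # second group of gateway nodes
  - nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-2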

Pre-Requisites

The following prerequisites need to be taken into account before you proceed with this tutorial:

  • An Azure account with an active subscription (create an account for free)
  • Azure CLI version 2.48.1 or later. Run az --version to see the currently installed version. If you need to install or upgrade, see Install Azure CLI.
  • If using ARM templates or the REST API, the AKS API version must be 2022-09-02-preview or later.
  • The kubectl command line tool is installed on your device. The version can be the same as or up to one minor version earlier or later than the Kubernetes version of your cluster. For example, if your cluster version is 1.26, you can use kubectl version 1.25, 1.26, or 1.27 with it. To install or upgrade kubectl, see Installing or updating kubectl.
  • Install Cilium CLI.

Limitations to keep in mind

Keep the following limitations in mind; this list may change over time.

  • The Egress gateway feature is partially incompatible with L7 policies.
    • Specifically, when an egress gateway policy and an L7 policy both select the same endpoint, traffic from that endpoint does not go through the egress gateway, even if the policy allows it.
  • Egress Gateway is incompatible with Isovalent’s Cluster Mesh feature.

Which network plugin can I use for Egress Gateway in AKS?

For this tutorial, we will create an Azure Kubernetes Service (AKS) cluster with Bring Your Own CNI (BYOCNI) as the network plugin and walk through two scenarios.

Scenario 1- Egress Gateways in a single Availability Zone.

Pre-Requisites:

  • The AKS cluster is created in VNET A, subnet A
  • The Egress Gateway is created in VNET A, subnet B
    • VNET= 192.168.8.0/22
    • Subnet A= 192.168.10.0/24
    • Subnet B= 192.168.11.0/24
  • A test VM is created in VNET A, subnet B

Set the subscription

Choose the subscription you want to use if you have multiple Azure subscriptions.

  • Replace SubscriptionName with your subscription name.
  • You can also use your subscription ID instead of your subscription name.
az account set --subscription SubscriptionName
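If you are unsure of the exact subscription name or ID, you can list the subscriptions available to your account first:

az account list --query "[].{Name:name, ID:id}" -o table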

AKS Cluster creation

Create an AKS cluster with the network plugin as BYOCNI.

az group create -l eastus -n byocni

az network vnet create -g byocni --location canadacentral --name byocni-vnet --address-prefixes 192.168.8.0/22 -o none

az network vnet subnet create -g byocni --vnet-name byocni-vnet --name byocni-subnet --address-prefixes 192.168.10.0/24 -o none 

az network vnet subnet create -g byocni --vnet-name byocni-vnet --name egressgw-subnet --address-prefixes 192.168.11.0/24 -o none 

az aks create -l canadacentral -g byocni -n byocni --network-plugin none --vnet-subnet-id /subscriptions/#############################/resourceGroups/byocni/providers/Microsoft.Network/virtualNetworks/byocni-vnet/subnets/byocni-subnet

az aks get-credentials --resource-group byocni --name byocni

Note- You can also create an AKS cluster with BYOCNI using Terraform.

Create an unmanaged AKS nodepool in a different subnet.

Create an AKS nodepool in the egressgw-subnet (created in the previous step).

az aks nodepool add -g byocni --cluster-name byocni -n egressgw --enable-node-public-ip --node-count 1  --vnet-subnet-id /subscriptions/###############################/resourceGroups/byocni/providers/Microsoft.Network/virtualNetworks/byocni-vnet/subnets/egressgw-subnet

Assign a label to the unmanaged nodepool

  • Label the node pool, specifying its name with the --name parameter and the labels with the --labels parameter. Labels must be key/value pairs with valid syntax.
az aks nodepool update --resource-group byocni --cluster-name byocni --name egressgw --labels io.cilium/egress-gateway=true
  • Check the status of the nodes.
kubectl get nodes -o wide

NAME                               STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-byocni-35717205-vmss000000     Ready    <none>   5h34m   v1.29.2   192.168.10.4   <none>          Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
aks-byocni-35717205-vmss000001     Ready    <none>   5h34m   v1.29.2   192.168.10.5   <none>          Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
aks-egressgw-36500661-vmss000000   Ready    <none>   3h52m   v1.29.2   192.168.11.5   52.156.19.241   Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
aks-egressgw-36500661-vmss000001   Ready    <none>   3h52m   v1.29.2   192.168.11.6   53.172.12.120   Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
  • Note: this doesn’t create a new NIC. Traffic from the client pod is redirected by the Egress Gateway to eth0 (192.168.11.5) of the node labeled egress-node: "true", and from there it is automatically NATed to the node’s assigned public IP.

Install Isovalent Enterprise for Cilium

Enable the Egress Gateway High Availability feature by passing the following Helm values when installing or upgrading Isovalent Enterprise for Cilium:
--set egressGateway.enabled=true \
--set enterprise.egressGatewayHA.enabled=true \
--set bpf.masquerade=true \
--set kubeProxyReplacement=true \
--set l7Proxy=false
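For reference, here is a minimal sketch of the Helm invocation these values belong to; the repository alias and chart name (isovalent/cilium) are assumptions and depend on how your Isovalent Enterprise for Cilium access is provisioned:

# chart source and version come from your Isovalent onboarding; the names below are illustrative
helm upgrade --install cilium isovalent/cilium \
  --namespace kube-system \
  --set egressGateway.enabled=true \
  --set enterprise.egressGatewayHA.enabled=true \
  --set bpf.masquerade=true \
  --set kubeProxyReplacement=true \
  --set l7Proxy=false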

Restart Cilium Operator and Cilium Daemonset

  • Restart the cilium operator and cilium daemonset for egress gateway changes to take effect.
kubectl rollout restart ds cilium -n kube-system
kubectl rollout restart deploy cilium-operator -n kube-system
  • Check the status of the pods.
kubectl get pods -o wide -A

NAMESPACE     NAME                                  READY   STATUS    RESTARTS        AGE     IP             NODE                               NOMINATED NODE   READINESS GATES
default       busybox                               1/1     Running   0               3h37m   10.0.0.165     aks-byocni-35717205-vmss000000     <none>           <none>
default       server                                1/1     Running   0               3h38m   10.0.1.84      aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cilium-9q4fx                          1/1     Running   0               3h35m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cilium-gjvft                          1/1     Running   0               3h34m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   cilium-mlxhq                          1/1     Running   0               3h35m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cilium-node-init-kd85n                1/1     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cilium-node-init-p74df                1/1     Running   0               5h27m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   cilium-node-init-t5mrn                1/1     Running   0               5h27m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cilium-operator-7d84bcbbc8-9rxzj      1/1     Running   0               3h34m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cilium-operator-7d84bcbbc8-rrq2f      1/1     Running   0               3h34m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cloud-node-manager-288rv              1/1     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   cloud-node-manager-8hv5x              1/1     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cloud-node-manager-xt222              1/1     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   coredns-5b97789cf4-b27vp              1/1     Running   0               5h26m   10.0.1.105     aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   coredns-5b97789cf4-s4mbm              1/1     Running   0               5h38m   10.0.0.189     aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   coredns-autoscaler-7c88465478-mccff   1/1     Running   7 (4h36m ago)   5h38m   10.0.0.251     aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   csi-azuredisk-node-75bnm              3/3     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   csi-azuredisk-node-85hrg              3/3     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   csi-azuredisk-node-9h9s2              3/3     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   csi-azurefile-node-dwqsk              3/3     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   csi-azurefile-node-lwkrn              3/3     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   csi-azurefile-node-tddf5              3/3     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   konnectivity-agent-844cd49468-dqdkj   1/1     Running   0               4h49m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   konnectivity-agent-844cd49468-tmw6l   1/1     Running   0               4h49m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   kube-proxy-85rm2                      1/1     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   kube-proxy-qsjxx                      1/1     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   kube-proxy-tnrh2                      1/1     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   metrics-server-6bb9c967d6-5cwnh       2/2     Running   0               3h52m   10.0.1.253     aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   metrics-server-6bb9c967d6-8sdft       2/2     Running   0               3h52m   10.0.1.100     aks-byocni-35717205-vmss000001     <none>           <none>

Create an Egress Gateway Policy

  • The API provided by Isovalent to drive the Egress Gateway feature is the IsovalentEgressGatewayPolicy resource.
  • The selectors field of an IsovalentEgressGatewayPolicy resource is used to select source pods via a label selector. This can be done using matchLabels:
selectors:
- podSelector:
    matchLabels:
      labelKey: labelVal
  • One or more destination CIDRs can be specified with destinationCIDRs:
destinationCIDRs:
- "a.b.c.d/32"
- "e.f.g.0/24"
  • The group of nodes that should act as gateway nodes for a given policy can be configured with the egressGroups field. Nodes are matched based on their labels, with the nodeSelector field:
egressGroups:
- nodeSelector:
    matchLabels:
      testLabel: testVal
  • Sample policy as below:
apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"

Testing Egress Gateway

  • Deploy a client pod and apply the IsovalentEgressGatewayPolicy, and observe that the pod’s connection gets redirected through the Gateway node.
  • The client pod gets deployed to one of the two managed nodes, and the IEGP (Isovalent Egress Gateway Policy) selects one or both of the nodes (depending on the egress gateway IPs specified) as the Gateway node.
  • Sample client pod yaml:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox
  name: busybox
spec:
  containers:
  - image: nginx
    name: nginx
    command:
    - /bin/sh
    - -c
    - sleep infinity   # keep the container running so curl commands can be exec'd into it
    securityContext:
      capabilities:
        add:
          - NET_ADMIN  # Add the cap_net_admin capability
    env:
    - name: EGRESS_IPS
      value: 192.168.11.5/24, 192.168.11.4/24
    resources: {}
  dnsPolicy: ClusterFirst
  nodeSelector:
    kubernetes.io/hostname: aks-byocni-35717205-vmss000000
  restartPolicy: Always
status: {}
  • Create the client pod and check that it’s up and running and pinned on one of the worker nodes as specified in the yaml file for the client pod.
kubectl apply -f busyboxegressgw.yaml

kubectl get pods -o wide
NAME      READY   STATUS    RESTARTS   AGE     IP           NODE                             NOMINATED NODE   READINESS GATES
busybox   1/1     Running   0          3h53m   10.0.0.165   aks-byocni-35717205-vmss000000   <none>           <none> 
Apply an Egress Gateway Policy

Apply the following IsovalentEgressGatewayPolicy (the apply command is shown after the manifest):
apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
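Assuming the manifest above is saved as egress-sample-policy.yaml (the filename is just an example), apply it:

kubectl apply -f egress-sample-policy.yaml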
Label the Egress Gateway Node

To let the policy select the node designated as the Egress Gateway, apply the label egress-node=true to it:

kubectl label nodes aks-egressgw-36500661-vmss000000 egress-node=true

kubectl get nodes -o wide --show-labels=true | grep egress-node

aks-egressgw-36500661-vmss000000   Ready    <none>   3h57m   v1.29.2   192.168.11.5   52.156.19.241   Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1   agentpool=egressgw,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_DS2_v2,beta.kubernetes.io/os=linux,egress-node=true,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=0,io.cilium/egress-gateway=true,kubernetes.azure.com/agentpool=egressgw,kubernetes.azure.com/cluster=MC_byocni_byocni_canadacentral,kubernetes.azure.com/consolidated-additional-properties=04f7d3de-0602-11ef-bb36-22c8fb861105,kubernetes.azure.com/kubelet-identity-client-id=###############################,kubernetes.azure.com/mode=user,kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2containerd-202404.09.0,kubernetes.azure.com/nodepool-type=VirtualMachineScaleSets,kubernetes.azure.com/os-sku=Ubuntu,kubernetes.azure.com/role=agent,kubernetes.azure.com/storageprofile=managed,kubernetes.azure.com/storagetier=Premium_LRS,kubernetes.io/arch=amd64,kubernetes.io/hostname=aks-egressgw-36500661-vmss000000,kubernetes.io/os=linux,node.kubernetes.io/instance-type=Standard_DS2_v2,storageprofile=managed,storagetier=Premium_LRS,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=canadacentral,topology.kubernetes.io/zone=0
Create a test VM in the Egress Gateway subnet.
  • Create a VM in the same subnet as Egress Gateway and run a simple service on port 80 (like NGINX) that will respond to traffic sent from a pod on one of the worker nodes.
  • Test VM IP, in this case, is 192.168.11.4
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:22:48:3c:63:51 brd ff:ff:ff:ff:ff:ff
    inet 192.168.11.4/24 brd 192.168.11.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::222:48ff:fe3c:6351/64 scope link
       valid_lft forever preferred_lft forever
Traffic Generation (towards the server in Egress GW subnet)

Send traffic toward the test VM.

kubectl exec busybox -- curl -I 192.168.11.4
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   612    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Mon, 29 Apr 2024 12:44:07 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Mon, 29 Apr 2024 08:55:12 GMT
Connection: keep-alive
ETag: "662f6070-264"
Accept-Ranges: bytes
Traffic Generation (outside of the cluster towards the Internet)
  • Send traffic to a public service.
    • Note the IP it returns is the egress gateway node’s Public IP.
kubectl exec busybox -- curl ifconfig.me
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    13  100    13    0     0    181      0 --:--:-- --:--:-- --:--:--   183
52.156.19.241
  • Take a tcpdump from one of the egress gateway nodes.
    • Install tcpdump on the egress gateway node via apt-get install tcpdump
    • As you can see, 10.0.0.165 is the client pod IP from which the egress gateway node receives packets, and 192.168.11.5 is the egress gateway node’s eth0 IP address.
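    • The capture below was taken with a command along the following lines; the interface name is an assumption, and the output is trimmed to the flow of interest.
sudo tcpdump -ni eth0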
IP 34.117.118.44.80 > 10.0.0.165.45468: Flags [S.], seq 2883623483, ack 1878158107, win 65535, options [mss 1412,sackOK,TS val 1551427192 ecr 3654296066,nop,wscale 8], length 0
IP 10.0.0.165.45468 > 34.117.118.44.80: Flags [.], ack 1, win 507, options [nop,nop,TS val 3654296076 ecr 1551427192], length 0
13:09:10.372213 IP 192.168.11.5.45468 > 34.117.118.44.80: Flags [.], ack 1, win 507, options [nop,nop,TS val 3654296076 ecr 1551427192], length 0
IP 10.0.0.165.45468 > 34.117.118.44.80: Flags [P.], seq 1:76, ack 1, win 507, options [nop,nop,TS val 3654296077 ecr 1551427192], length 75: HTTP: GET / HTTP/1.1
13:09:10.372313 IP 192.168.11.5.45468 > 34.117.118.44.80: Flags [P.], seq 1:76, ack 1, win 507, options [nop,nop,TS val 3654296077 ecr 1551427192], length 75: HTTP: GET / HTTP/1.1
13:09:10.380927 IP 34.117.118.44.80 > 192.168.11.5.45468: Flags [.], ack 76, win 256, options [nop,nop,TS val 1551427202 ecr 3654296077], length 0
IP 34.117.118.44.80 > 10.0.0.165.45468: Flags [.], ack 76, win 256, options [nop,nop,TS val 1551427202 ecr 3654296077], length 0
13:09:10.430485 IP 34.117.118.44.80 > 192.168.11.5.45468: Flags [P.], seq 1:183, ack 76, win 256, options [nop,nop,TS val 1551427251 ecr 3654296077], length 182: HTTP: HTTP/1.1 200 OK
IP 34.117.118.44.80 > 10.0.0.165.45468: Flags [P.], seq 1:183, ack 76, win 256, options [nop,nop,TS val 1551427251 ecr 3654296077], length 182: HTTP: HTTP/1.1 200 OK
IP 10.0.0.165.45468 > 34.117.118.44.80: Flags [.], ack 183, win 506, options [nop,nop,TS val 3654296135 ecr 1551427251], length 0
13:09:10.430983 IP 192.168.11.5.45468 > 34.117.118.44.80: Flags [.], ack 183, win 506, options [nop,nop,TS val 3654296135 ecr 1551427251], length 0
13:09:10.434396 IP 192.168.10.4.44693 > 192.168.11.5.8472: OTV, flags [I] (0x08), overlay 0, instance 53596
IP 10.0.0.165.45468 > 34.117.118.44.80: Flags [F.], seq 76, ack 183, win 506, options [nop,nop,TS val 3654296139 ecr 1551427251], length 0
13:09:10.434476 IP 192.168.11.5.45468 > 34.117.118.44.80: Flags [F.], seq 76, ack 183, win 506, options [nop,nop,TS val 3654296139 ecr 1551427251], length 0
13:09:10.443468 IP 34.117.118.44.80 > 192.168.11.5.45468: Flags [F.], seq 183, ack 77, win 256, options [nop,nop,TS val 1551427264 ecr 3654296139], length 0
IP 34.117.118.44.80 > 10.0.0.165.45468: Flags [F.], seq 183, ack 77, win 256, options [nop,nop,TS val 1551427264 ecr 3654296139], length 0
IP 10.0.0.165.45468 > 34.117.118.44.80: Flags [.], ack 184, win 506, options [nop,nop,TS val 3654296148 ecr 1551427264], length 0
13:09:10.443787 IP 192.168.11.5.45468 > 34.117.118.44.80: Flags [.], ack 184, win 506, options [nop,nop,TS val 3654296148 ecr 1551427264], length 0

Scenario 2- Egress Gateways in a Multi-Availability Zone environment.

Geo-redundancy across availability zones is a must for many enterprises, and combining it with High Availability for the Egress Gateway makes the solution even more attractive.

Pre-Requisites:

  • The AKS cluster is created in VNET A, subnet A
  • The Egress Gateway is created in VNET A, subnet B
    • VNET= 192.168.8.0/22
    • Subnet A= 192.168.10.0/24
    • Subnet B= 192.168.11.0/24
  • A test VM is created in VNET A, subnet B

Set the subscription

Choose the subscription you want to use if you have multiple Azure subscriptions.

  • Replace SubscriptionName with your subscription name.
  • You can also use your subscription ID instead of your subscription name.
az account set --subscription SubscriptionName

AKS cluster creation with nodepools across AZs

Create an AKS cluster with the network plugin as BYOCNI and nodepools across different Availability Zones.

az group create -l eastus -n byocni

az network vnet create -g byocni --location canadacentral --name byocni-vnet --address-prefixes 192.168.8.0/22 -o none

az network vnet subnet create -g byocni --vnet-name byocni-vnet --name byocni-subnet --address-prefixes 192.168.10.0/24 -o none 

az network vnet subnet create -g byocni --vnet-name byocni-vnet --name egressgw-subnet --address-prefixes 192.168.11.0/24 -o none 

az aks create -l canadacentral -g byocni -n byocni --network-plugin none --vm-set-type VirtualMachineScaleSets --zones 1 2 --vnet-subnet-id /subscriptions/###############################/resourceGroups/byocni/providers/Microsoft.Network/virtualNetworks/byocni-vnet/subnets/byocni-subnet

az aks get-credentials --resource-group byocni --name byocni

Create an unmanaged AKS nodepool in a different subnet.

Create an AKS nodepool in the egressgw-subnet (created in the previous step).

az aks nodepool add -g byocni --cluster-name byocni -n egressgw --enable-node-public-ip --node-count 2  --vnet-subnet-id /subscriptions/#######################################/resourceGroups/byocni/providers/Microsoft.Network/virtualNetworks/byocni-vnet/subnets/egressgw-subnet --vm-set-type VirtualMachineScaleSets --zones 1 2

Assign a label to the unmanaged nodepool

  • Label the node pool, specifying its name with the --name parameter and the labels with the --labels parameter. Labels must be key/value pairs with valid syntax.
az aks nodepool update --resource-group byocni --cluster-name byocni --name egressgw --labels io.cilium/egress-gateway=true
  • Check the status of the nodes.
kubectl get nodes -o wide

NAME                               STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-byocni-22760683-vmss000000     Ready    <none>   16d   v1.29.2   192.168.10.4   <none>          Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
aks-byocni-22760683-vmss000001     Ready    <none>   16d   v1.29.2   192.168.10.5   <none>          Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
aks-egressgw-27814974-vmss000000   Ready    <none>   15d   v1.29.2   192.168.11.4   20.151.98.78    Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
aks-egressgw-27814974-vmss000001   Ready    <none>   15d   v1.29.2   192.168.11.5   4.172.207.202   Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1
  • Note: this doesn’t create a new NIC. Traffic from the client pod is redirected by the Egress Gateway to eth0 (192.168.11.5 or 192.168.11.4) of a node labeled egress-node: "true", and from there it is automatically NATed to that node’s assigned public IP.
  • Check that all nodes have been created in different Availability Zones.
kubectl describe nodes | grep -e "Name:" -e "topology.kubernetes.io/zone"

Name:               aks-byocni-22760683-vmss000000
                    topology.kubernetes.io/zone=canadacentral-2
Name:               aks-byocni-22760683-vmss000001
                    topology.kubernetes.io/zone=canadacentral-1
Name:               aks-egressgw-27814974-vmss000000
                    topology.kubernetes.io/zone=canadacentral-1
Name:               aks-egressgw-27814974-vmss000001
                    topology.kubernetes.io/zone=canadacentral-2

Install Isovalent Enterprise for Cilium

As in Scenario 1, enable the Egress Gateway High Availability feature with the following Helm values:
--set egressGateway.enabled=true \
--set enterprise.egressGatewayHA.enabled=true \
--set bpf.masquerade=true \
--set kubeProxyReplacement=true \
--set l7Proxy=false

Restart Cilium Operator and Cilium Daemonset

  • Restart the cilium operator and cilium daemonset for egress gateway changes to take effect.
kubectl rollout restart ds cilium -n kube-system
kubectl rollout restart deploy cilium-operator -n kube-system
  • Check the status of the pods.
kubectl get pods -o wide -A

NAMESPACE     NAME                                  READY   STATUS    RESTARTS        AGE     IP             NODE                               NOMINATED NODE   READINESS GATES
default       busybox                               1/1     Running   0               3h37m   10.0.0.165     aks-byocni-35717205-vmss000000     <none>           <none>
default       server                                1/1     Running   0               3h38m   10.0.1.84      aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cilium-9q4fx                          1/1     Running   0               3h35m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cilium-gjvft                          1/1     Running   0               3h34m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   cilium-mlxhq                          1/1     Running   0               3h35m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cilium-node-init-kd85n                1/1     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cilium-node-init-p74df                1/1     Running   0               5h27m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   cilium-node-init-t5mrn                1/1     Running   0               5h27m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cilium-operator-7d84bcbbc8-9rxzj      1/1     Running   0               3h34m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cilium-operator-7d84bcbbc8-rrq2f      1/1     Running   0               3h34m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   cloud-node-manager-288rv              1/1     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   cloud-node-manager-8hv5x              1/1     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   cloud-node-manager-xt222              1/1     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   coredns-5b97789cf4-b27vp              1/1     Running   0               5h26m   10.0.1.105     aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   coredns-5b97789cf4-s4mbm              1/1     Running   0               5h38m   10.0.0.189     aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   coredns-autoscaler-7c88465478-mccff   1/1     Running   7 (4h36m ago)   5h38m   10.0.0.251     aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   csi-azuredisk-node-75bnm              3/3     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   csi-azuredisk-node-85hrg              3/3     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   csi-azuredisk-node-9h9s2              3/3     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   csi-azurefile-node-dwqsk              3/3     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   csi-azurefile-node-lwkrn              3/3     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   csi-azurefile-node-tddf5              3/3     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   konnectivity-agent-844cd49468-dqdkj   1/1     Running   0               4h49m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   konnectivity-agent-844cd49468-tmw6l   1/1     Running   0               4h49m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   kube-proxy-85rm2                      1/1     Running   0               3h55m   192.168.11.5   aks-egressgw-36500661-vmss000000   <none>           <none>
kube-system   kube-proxy-qsjxx                      1/1     Running   0               5h38m   192.168.10.4   aks-byocni-35717205-vmss000000     <none>           <none>
kube-system   kube-proxy-tnrh2                      1/1     Running   0               5h38m   192.168.10.5   aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   metrics-server-6bb9c967d6-5cwnh       2/2     Running   0               3h52m   10.0.1.253     aks-byocni-35717205-vmss000001     <none>           <none>
kube-system   metrics-server-6bb9c967d6-8sdft       2/2     Running   0               3h52m   10.0.1.100     aks-byocni-35717205-vmss000001     <none>           <none>

Create an Egress Gateway Policy

The API provided by Isovalent to drive the Egress Gateway feature is the IsovalentEgressGatewayPolicy resource.

apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"

Testing Egress Gateway

  • Deploy a client pod and apply the IsovalentEgressGatewayPolicy, and observe that the pod’s connection gets redirected through the Gateway node.
  • The client pod gets deployed to one of the two managed nodes, and the IEGP (Isovalent Egress Gateway Policy) selects one or both of the nodes (depending on the egress gateway IPs specified) as the Gateway node.
  • Sample client pod yaml:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox
  name: busybox
spec:
  containers:
  - image: nginx
    name: nginx
    command:
    - /bin/sh
    - -c
    - sleep infinity   # keep the container running so curl commands can be exec'd into it
    securityContext:
      capabilities:
        add:
          - NET_ADMIN  # Add the cap_net_admin capability
    env:
    - name: EGRESS_IPS
      value: 192.168.11.5/24, 192.168.11.4/24
    resources: {}
  dnsPolicy: ClusterFirst
  nodeSelector:
    kubernetes.io/hostname: aks-byocni-22760683-vmss000000
  restartPolicy: Always
status: {}
  • Create the client pod and check that it’s up and running and pinned on one of the worker nodes as specified in the yaml file for the client pod.
kubectl apply -f busyboxegressgw.yaml

kubectl get pods -o wide
NAME      READY   STATUS    RESTARTS   AGE     IP           NODE                             NOMINATED NODE   READINESS GATES
busybox   1/1     Running   0          3h53m   10.0.0.241   aks-byocni-22760683-vmss000000   <none>           <none> 
Apply an Egress Gateway Policy

Apply the following IsovalentEgressGatewayPolicy:
apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
Label the Egress Gateway Node

To let the policy select the nodes designated as Egress Gateways, apply the label egress-node=true to them:

kubectl label nodes aks-egressgw-27814974-vmss000000 egress-node=true
node/aks-egressgw-27814974-vmss000000 labeled

kubectl label nodes aks-egressgw-27814974-vmss000001 egress-node=true
node/aks-egressgw-27814974-vmss000001 labeled

kubectl get nodes -o wide --show-labels=true | grep egress-node
aks-egressgw-27814974-vmss000000   Ready    <none>   15d   v1.29.2   192.168.11.4   20.151.98.78    Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1   agentpool=egressgw,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_DS2_v2,beta.kubernetes.io/os=linux,egress-node=true,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=canadacentral-1,io.cilium/egress-gateway=true,kubernetes.azure.com/agentpool=egressgw,kubernetes.azure.com/cluster=MC_byocni_byocni_canadacentral,kubernetes.azure.com/consolidated-additional-properties=f57efe65-070c-11ef-a4d6-aab4eb6cd74a,kubernetes.azure.com/kubelet-identity-client-id=f22bbec0-4040-4237-b958-6deee20881a3,kubernetes.azure.com/mode=user,kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2containerd-202404.16.0,kubernetes.azure.com/nodepool-type=VirtualMachineScaleSets,kubernetes.azure.com/os-sku=Ubuntu,kubernetes.azure.com/role=agent,kubernetes.azure.com/storageprofile=managed,kubernetes.azure.com/storagetier=Premium_LRS,kubernetes.io/arch=amd64,kubernetes.io/hostname=aks-egressgw-27814974-vmss000000,kubernetes.io/os=linux,node.kubernetes.io/instance-type=Standard_DS2_v2,storageprofile=managed,storagetier=Premium_LRS,topology.disk.csi.azure.com/zone=canadacentral-1,topology.kubernetes.io/region=canadacentral,topology.kubernetes.io/zone=canadacentral-1
aks-egressgw-27814974-vmss000001   Ready    <none>   15d   v1.29.2   192.168.11.5   4.172.207.202   Ubuntu 22.04.4 LTS   5.15.0-1060-azure   containerd://1.7.15-1   agentpool=egressgw,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_DS2_v2,beta.kubernetes.io/os=linux,egress-node=true,failure-domain.beta.kubernetes.io/region=canadacentral,failure-domain.beta.kubernetes.io/zone=canadacentral-2,io.cilium/egress-gateway=true,kubernetes.azure.com/agentpool=egressgw,kubernetes.azure.com/cluster=MC_byocni_byocni_canadacentral,kubernetes.azure.com/consolidated-additional-properties=f57efe65-070c-11ef-a4d6-aab4eb6cd74a,kubernetes.azure.com/kubelet-identity-client-id=f22bbec0-4040-4237-b958-6deee20881a3,kubernetes.azure.com/mode=user,kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2containerd-202404.16.0,kubernetes.azure.com/nodepool-type=VirtualMachineScaleSets,kubernetes.azure.com/os-sku=Ubuntu,kubernetes.azure.com/role=agent,kubernetes.azure.com/storageprofile=managed,kubernetes.azure.com/storagetier=Premium_LRS,kubernetes.io/arch=amd64,kubernetes.io/hostname=aks-egressgw-27814974-vmss000001,kubernetes.io/os=linux,node.kubernetes.io/instance-type=Standard_DS2_v2,storageprofile=managed,storagetier=Premium_LRS,topology.disk.csi.azure.com/zone=canadacentral-2,topology.kubernetes.io/region=canadacentral,topology.kubernetes.io/zone=canadacentral-2
Create a test VM in the Egress Gateway subnet.
  • Create a VM in the same subnet as Egress Gateway and run a simple service on port 80 (like NGINX) that will respond to traffic sent from a pod on one of the worker nodes.
  • Test VM IP, in this case, is 192.168.11.4
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:22:48:3c:63:51 brd ff:ff:ff:ff:ff:ff
    inet 192.168.11.4/24 brd 192.168.11.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::222:48ff:fe3c:6351/64 scope link
       valid_lft forever preferred_lft forever
Traffic Generation (towards the server in Egress GW subnet)

Send traffic toward the test VM.

kubectl exec busybox -- curl -I 192.168.11.4
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   612    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Mon, 29 Apr 2024 12:44:07 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Mon, 29 Apr 2024 08:55:12 GMT
Connection: keep-alive
ETag: "662f6070-264"
Accept-Ranges: bytes
Traffic Generation (outside of the cluster towards the Internet)
  • Send traffic to a public service (run the command more than once).
    • Note that the IP returned is the public IP of the egress gateway node that handled the request; successive requests can exit through either gateway.
kubectl exec busybox -- curl ifconfig.me
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    13  100    13    0     0    161      0 --:--:-- --:--:-- --:--:--   162

4.172.200.224

kubectl exec busybox -- curl ifconfig.me
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    13  100    13    0     0    155      0 --:--:-- --:--:-- --:--:--   156
20.63.116.127
  • Take a tcpdump from one of the egress gateway nodes.
    • Install tcpdump on the egress gateway node via apt-get install tcpdump.
    • As you can see, 10.0.0.225 is the client pod IP from which the egress gateway node receives packets, and 192.168.11.5 is the egress gateway node’s eth0 IP address.
08:15:22.713376 IP 168.63.129.16.53 > 192.168.11.5.34878: 42594 1/0/1 A 34.117.118.44 (56)
IP 168.63.129.16.53 > 10.0.2.21.34878: 42594 1/0/1 A 34.117.118.44 (56)
IP 10.0.0.225.42814 > 34.117.118.44.80: Flags [S], seq 102756168, win 64860, options [mss 1410,sackOK,TS val 2905874533 ecr 0,nop,wscale 7], length 0
08:15:22.716722 IP 192.168.11.5.42814 > 34.117.118.44.80: Flags [S], seq 102756168, win 64860, options [mss 1410,sackOK,TS val 2905874533 ecr 0,nop,wscale 7], length 0
08:15:22.725284 IP 34.117.118.44.80 > 192.168.11.5.42814: Flags [S.], seq 1883155317, ack 102756169, win 65535, options [mss 1412,sackOK,TS val 3464590701 ecr 2905874533,nop,wscale 8], length 0
IP 34.117.118.44.80 > 10.0.0.225.42814: Flags [S.], seq 1883155317, ack 102756169, win 65535, options [mss 1412,sackOK,TS val 3464590701 ecr 2905874533,nop,wscale 8], length 0
IP 10.0.0.225.42814 > 34.117.118.44.80: Flags [.], ack 1, win 507, options [nop,nop,TS val 2905874545 ecr 3464590701], length 0
08:15:22.727367 IP 192.168.11.5.42814 > 34.117.118.44.80: Flags [.], ack 1, win 507, options [nop,nop,TS val 2905874545 ecr 3464590701], length 0
IP 10.0.0.225.42814 > 34.117.118.44.80: Flags [P.], seq 1:76, ack 1, win 507, options [nop,nop,TS val 2905874545 ecr 3464590701], length 75: HTTP: GET / HTTP/1.1
08:15:22.727389 IP 192.168.11.5.42814 > 34.117.118.44.80: Flags [P.], seq 1:76, ack 1, win 507, options [nop,nop,TS val 2905874545 ecr 3464590701], length 75: HTTP: GET / HTTP/1.1
08:15:22.735433 IP 34.117.118.44.80 > 192.168.11.5.42814: Flags [.], ack 76, win 256, options [nop,nop,TS val 3464590712 ecr 2905874545], length 0
IP 34.117.118.44.80 > 10.0.0.225.42814: Flags [.], ack 76, win 256, options [nop,nop,TS val 3464590712 ecr 2905874545], length 0
08:15:22.765735 IP 34.117.118.44.80 > 192.168.11.5.42814: Flags [P.], seq 1:183, ack 76, win 256, options [nop,nop,TS val 3464590742 ecr 2905874545], length 182: HTTP: HTTP/1.1 200 OK
IP 34.117.118.44.80 > 10.0.0.225.42814: Flags [P.], seq 1:183, ack 76, win 256, options [nop,nop,TS val 3464590742 ecr 2905874545], length 182: HTTP: HTTP/1.1 200 OK
IP 10.0.0.225.42814 > 34.117.118.44.80: Flags [.], ack 183, win 506, options [nop,nop,TS val 2905874585 ecr 3464590742], length 0
IP 10.0.0.225.42814 > 34.117.118.44.80: Flags [F.], seq 76, ack 183, win 506, options [nop,nop,TS val 2905874585 ecr 3464590742], length 0
08:15:22.768788 IP 192.168.11.5.42814 > 34.117.118.44.80: Flags [.], ack 183, win 506, options [nop,nop,TS val 2905874585 ecr 3464590742], length 0
08:15:22.768857 IP 192.168.11.5.42814 > 34.117.118.44.80: Flags [F.], seq 76, ack 183, win 506, options [nop,nop,TS val 2905874585 ecr 3464590742], length 0
08:15:22.777039 IP 34.117.118.44.80 > 192.168.11.5.42814: Flags [F.], seq 183, ack 77, win 256, options [nop,nop,TS val 3464590753 ecr 2905874585], length 0
IP 34.117.118.44.80 > 10.0.0.225.42814: Flags [F.], seq 183, ack 77, win 256, options [nop,nop,TS val 3464590753 ecr 2905874585], length 0
IP 10.0.0.225.42814 > 34.117.118.44.80: Flags [.], ack 184, win 506, options [nop,nop,TS val 2905874596 ecr 3464590753], length 0
08:15:22.778466 IP 192.168.11.5.42814 > 34.117.118.44.80: Flags [.], ack 184, win 506, options [nop,nop,TS val 2905874596 ecr 3464590753], length 0

Availability Zone Affinity

It is possible to control the AZ affinity of egress gateway traffic with azAffinity. This feature relies on the well-known topology.kubernetes.io/zone node label to match or prefer gateway nodes in the same AZ as the source pods (“local” gateways), based on the configured mode of operation.

The following modes of operation are available:

  • disabled: This mode uses all the active gateways available, regardless of their AZ. This is the default mode of operation.
    • By taking a tcpdump from both the egress nodes, we can see that the traffic flows across both the egress nodes.
    • sample egress policy for the mode of operation
apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  azAffinity: disabled
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-1
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-2
  • localOnly: This mode selects only local gateways. If no local gateways are available, traffic will not pass through the non-local gateways and will be dropped.
    • By taking a tcpdump from both the egress nodes, we can see that the traffic flows across one of the local gateways.
    • sample egress policy for the mode of operation
apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  azAffinity: localOnly
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-1
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-2
  • localOnlyFirst: This mode selects only local gateways as long as at least one is available in a given AZ. When no more local gateways are available, non-local gateways will be selected.
    • By taking a tcpdump from both the egress nodes, we can see that the traffic flows across one of the local gateways.
    • sample egress policy for the mode of operation
apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  azAffinity: localOnlyFirst
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-1
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-2
  • localPriority: This mode selects all gateways, but local gateways are picked first. In conjunction with maxGatewayNodes, this can prioritize local gateways over non-local ones, allowing for a graceful fallback to non-local gateways in case the local ones become unavailable.
    • By taking a tcpdump from both the egress nodes, we can see that the traffic flows across one of the local gateways.
    • sample egress policy for the mode of operation
apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  destinationCIDRs:
  - "192.168.11.0/24"
  selectors:
  - podSelector: {}
  azAffinity: localPriority
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-1
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
        topology.kubernetes.io/zone: canadacentral-2

How can you scale the Egress Gateway solution?

With the Isovalent Enterprise for Cilium 1.16 release, we introduced support for Egress Gateway IPAM and combined it with BGP to address Egress Gateway scaling needs.

We have also used Azure Route Server to demonstrate how Cilium can advertise the same Egress Gateway node IPs to multiple BGP peers, and how application VMs in a different VNET, peering with the Azure Route Server, can access the services.

Pre-Requisites

  • Azure Route Server has a limit of 8 BGP Peers.
  • Application VM running on-premise.
  • The Egress Gateway configuration is already described in this section.
    • Note the node IPs of the Egress Gateways. These will be added as peers in the Azure Route Server configuration.
kubectl get nodes -o wide --selector egress-node=true
NAME                                STATUS                     ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-egressgw-26124580-vmss000000    Ready                      <none>   30d   v1.29.10   192.168.8.36   20.185.176.35   Ubuntu 22.04.5 LTS   5.15.0-1073-azure   containerd://1.7.22-1
aks-egressgw-26124580-vmss000001    Ready                      <none>   30d   v1.29.10   192.168.8.37   172.174.59.5    Ubuntu 22.04.5 LTS   5.15.0-1073-azure   containerd://1.7.22-1
  • VNET for peering with the Azure Route Server for propagating routes towards an Application VM.
  • Subnet for the Azure Route Server. This subnet will be in the same VNET where egress gateway nodes are also running.
  • VNET Peering across the two VNETs (the Azure Route Server and the Application VM residing in another VNET).

Egress Gateway IPAM

Egress Gateway IPAM allows users additional control of the IP address distribution to their Egress Gateway nodes.

  • It removes the complexity of manually targeting each Egress Gateway node with a specific configuration for IP address management.
  • The IPAM feature allows you to specify an IP pool in the IsovalentEgressGatewayPolicy from which Cilium leases egress IPs and assigns them to the selected egress interfaces.
  • The IP pool should be specified in the egressCIDRs field of the IsovalentEgressGatewayPolicy and may be composed of one or more CIDRs:

In the policy example below, you can see the new addition of the egressCIDRs key.

apiVersion: isovalent.com/v1
kind: IsovalentEgressGatewayPolicy
metadata:
  name: egress-sample
  labels:
    egw: bgp-advertise
spec:
  destinationCIDRs:
  - 0.0.0.0/0
  egressCIDRs:
  - 10.100.255.50/28
  - 10.100.255.100/28
  selectors:
  - podSelector: {}
  egressGroups:
  -
    nodeSelector:
      matchLabels:
        egress-node: "true"
  • Log in to the nodes to observe the secondary IP on the NIC of the Egress Gateway Node(s).
    • 10.100.255.48/32
    • 10.100.255.49/32
root@aks-egressgw-26124580-vmss000000:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7c:1e:52:02:19:f4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.36/27 metric 100 brd 192.168.8.63 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.100.255.48/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::7e1e:52ff:fe02:19f4/64 scope link
       valid_lft forever preferred_lft forever
3: enP37893s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 7c:1e:52:02:19:f4 brd ff:ff:ff:ff:ff:ff
    altname enP37893p0s2
4: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 92:08:19:75:47:71 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9008:19ff:fe75:4771/64 scope link
       valid_lft forever preferred_lft forever
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 6e:68:8e:7e:55:15 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.96/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::6c68:8eff:fe7e:5515/64 scope link
       valid_lft forever preferred_lft forever
6: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 66:0d:7b:77:ab:1c brd ff:ff:ff:ff:ff:ff
    inet6 fe80::640d:7bff:fe77:ab1c/64 scope link
       valid_lft forever preferred_lft forever
12: lxc1a50f83a196a@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 46:65:34:d7:92:04 brd ff:ff:ff:ff:ff:ff link-netns cni-da359c6e-791a-f9ff-9f94-d90dfc6b4605
    inet6 fe80::4465:34ff:fed7:9204/64 scope link
       valid_lft forever preferred_lft forever
14: lxc2e43536e3411@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether be:ab:9f:05:0f:fb brd ff:ff:ff:ff:ff:ff link-netns cni-983d360e-6a5a-19b9-f92e-18e9f58ce5c3
    inet6 fe80::bcab:9fff:fe05:ffb/64 scope link
       valid_lft forever preferred_lft forever
16: lxcd069cda05692@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether da:c4:dd:a3:d8:e4 brd ff:ff:ff:ff:ff:ff link-netns cni-de740a77-85e4-b74a-af97-9a8ee4b10897
    inet6 fe80::d8c4:ddff:fea3:d8e4/64 scope link
       valid_lft forever preferred_lft forever
18: lxc_health@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 9e:99:89:91:9f:4d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::9c99:89ff:fe91:9f4d/64 scope link
       valid_lft forever preferred_lft forever
root@aks-egressgw-26124580-vmss000001:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7c:1e:52:5b:34:dc brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.37/27 metric 100 brd 192.168.8.63 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.100.255.49/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::7e1e:52ff:fe5b:34dc/64 scope link
       valid_lft forever preferred_lft forever
3: enP37563s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 7c:1e:52:5b:34:dc brd ff:ff:ff:ff:ff:ff
    altname enP37563p0s2
4: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 9e:ab:9f:ad:d7:9f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9cab:9fff:fead:d79f/64 scope link
       valid_lft forever preferred_lft forever
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 26:21:94:f3:0f:c2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.4.182/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::2421:94ff:fef3:fc2/64 scope link
       valid_lft forever preferred_lft forever
6: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 66:1a:c7:ad:4c:9d brd ff:ff:ff:ff:ff:ff
    inet6 fe80::641a:c7ff:fead:4c9d/64 scope link
       valid_lft forever preferred_lft forever
12: lxc_health@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:70:e0:dd:f0:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::5070:e0ff:fedd:f0b8/64 scope link
       valid_lft forever preferred_lft forever

Azure Route Server configuration

  • Sign in to the Azure portal.
  • In the search box at the portal’s top, enter route server, and select Route Server from the search results.
  • On the Route Servers page, select + Create.
  • Take note of the Azure Route Server peer IP addresses.
    • 192.168.8.68
    • 192.168.8.69
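If you prefer the Azure CLI over the portal, the route server can be created roughly as follows; the resource group and VNet names reuse the earlier examples, and the subnet prefix, public IP name, and route server name are illustrative (the dedicated subnet must be named RouteServerSubnet):

az network vnet subnet create -g byocni --vnet-name byocni-vnet --name RouteServerSubnet --address-prefixes 192.168.8.64/27 -o none

az network public-ip create -g byocni -n routeserver-pip --sku Standard

az network routeserver create -g byocni -n myRouteServer --public-ip-address routeserver-pip --hosted-subnet $(az network vnet subnet show -g byocni --vnet-name byocni-vnet -n RouteServerSubnet --query id -o tsv)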

Azure Route Server peering with Cilium

  • Once the deployment is complete, select Go to resource to open myRouteServer.
  • Under Settings, select Peers.
  • Select + Add to add a peer.
    • 192.168.8.36 is the IP address of one of the Egress GW nodes.
    • 192.168.8.37 is the IP address of one of the Egress GW nodes.
  • On the Add Peer page, enter the following information:
    • Name
    • ASN
    • IPv4 Address
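The same peerings can also be created from the CLI; the peer names below are illustrative, and the ASN 65516 matches the localASN used in the BGP configuration later in this post:

az network routeserver peering create -g byocni --routeserver myRouteServer -n cilium-egressgw-0 --peer-ip 192.168.8.36 --peer-asn 65516

az network routeserver peering create -g byocni --routeserver myRouteServer -n cilium-egressgw-1 --peer-ip 192.168.8.37 --peer-asn 65516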

BGP support for Egress Gateway

BGP is one of the most commonly used methods to connect Kubernetes clusters to existing networks, and Cilium’s built-in BGP implementation enables seamless connectivity between Kubernetes services and the existing network. This feature supports advertising the Egress NAT IPs to external routers over BGP.

BGP support can be enabled with the following Helm values:

enterprise:
  bgpControlPlane:
    enabled: true
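If you pass values on the command line rather than through a values file, this corresponds to adding the following flag to the Helm command used for the installation:

--set enterprise.bgpControlPlane.enabled=true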

Sample BGP peering Policy with Azure Route Server:

---
apiVersion: isovalent.com/v1alpha1
kind: IsovalentBGPClusterConfig
metadata:
  name: rack0
spec:
  nodeSelector:
    matchLabels:
      egress-node: "true"
  bgpInstances:
    - name: "cilium-rs0"
      localASN: 65516
      peers:
        - name: "peer-65515-rs0"
          peerASN: 65515
          peerAddress: "192.168.8.68"
          peerConfigRef:
            name: "peer-config"
        - name: "peer-65515-rs1"
          peerASN: 65515
          peerAddress: "192.168.8.69"
          peerConfigRef:
            name: "peer-config"

---
apiVersion: isovalent.com/v1alpha1
kind: IsovalentBGPPeerConfig
metadata:
  name: peer-config
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"

---
apiVersion: isovalent.com/v1alpha1
kind: IsovalentBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "EgressGateway"
      selector:             # <-- select egress gateway policy to advertise
        matchExpressions:
          - { key: egw, operator: In, values: [ bgp-advertise ] }
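Apply the three resources and note that the advertisement only selects egress gateway policies whose labels match its selector, so the corresponding IsovalentEgressGatewayPolicy must carry the egw: bgp-advertise label. A minimal sketch; the file and policy names below are illustrative:

kubectl apply -f isovalent-bgp-config.yaml

# Label the egress gateway policy so the IsovalentBGPAdvertisement above selects it
kubectl label isovalentegressgatewaypolicy egress-sample egw=bgp-advertise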
  • Check the status of the BGP peering between Cilium and the Azure Route Server.
cilium bgp peers
Node                               Local AS   Peer AS   Peer Address   Session State   Uptime      Family         Received   Advertised
aks-egressgw-26124580-vmss000000   65516      65515     192.168.8.68   established     96h51m40s   ipv4/unicast   2          1
                                   65516      65515     192.168.8.69   established     96h58m16s   ipv4/unicast   2          1
aks-egressgw-26124580-vmss000001   65516      65515     192.168.8.68   established     96h50m40s   ipv4/unicast   2          1
                                   65516      65515     192.168.8.69   established     96h58m52s   ipv4/unicast   2          1
  • Check if routes are being exchanged between Cilium and Azure Route Server.
cilium bgp routes
(Defaulting to `available ipv4 unicast` routes, please see help for more options)

Node                               VRouter   Prefix             NextHop   Age         Attrs
aks-egressgw-26124580-vmss000000   65516     10.100.255.48/32   0.0.0.0   725h8m40s   [{Origin: i} {Nexthop: 0.0.0.0}]
aks-egressgw-26124580-vmss000001   65516     10.100.255.49/32   0.0.0.0   725h8m36s   [{Origin: i} {Nexthop: 0.0.0.0}]

cilium bgp routes advertised ipv4 unicast peer 192.168.8.69
Node                               VRouter   Peer           Prefix             NextHop        Age          Attrs
aks-egressgw-26124580-vmss000000   65516     192.168.8.69   10.100.255.48/32   192.168.8.36   725h31m55s   [{Origin: i} {AsPath: 65516} {Nexthop: 192.168.8.36}]
aks-egressgw-26124580-vmss000001   65516     192.168.8.69   10.100.255.49/32   192.168.8.37   725h31m51s   [{Origin: i} {AsPath: 65516} {Nexthop: 192.168.8.37}]

cilium bgp routes advertised ipv4 unicast peer 192.168.8.68
Node                               VRouter   Peer           Prefix             NextHop        Age          Attrs
aks-egressgw-26124580-vmss000000   65516     192.168.8.68   10.100.255.48/32   192.168.8.36   725h31m59s   [{Origin: i} {AsPath: 65516} {Nexthop: 192.168.8.36}]
aks-egressgw-26124580-vmss000001   65516     192.168.8.68   10.100.255.49/32   192.168.8.37   725h31m55s   [{Origin: i} {AsPath: 65516} {Nexthop: 192.168.8.37}]

Common questions about Egress Gateway

  • How is the traffic encapsulated from the worker node to the egress node?
    • Traffic is encapsulated from a worker node to an egress gateway node regardless of the cluster’s routing mode; in this case, the AKS cluster with BYOCNI uses VXLAN as the encapsulation.
  • How can you find the identity of the source endpoint if the traffic is encapsulated?
    • The VNI in the VXLAN header equals the identity of the source endpoint; in this case, the VNI is 53596.
    • You can then track the identity using the Cilium CLI, which indicates that it’s the busybox Pod.
kubectl -n kube-system exec ds/cilium -- cilium identity get 53596

Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), wait-for-node-init (init), clean-cilium-state (init), install-cni-binaries (init)
ID      LABELS
53596   k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
        k8s:io.cilium.k8s.policy.cluster=default
        k8s:io.cilium.k8s.policy.serviceaccount=default
        k8s:io.kubernetes.pod.namespace=default
        k8s:run=busybox
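To see that VNI on the wire, you can capture the tunnel traffic on the egress gateway node. A minimal sketch, assuming Cilium’s default VXLAN port 8472; tcpdump only auto-decodes VXLAN on the IANA port 4789, so the decoder is forced with -T vxlan:

# Run on the egress gateway node; look for "vni 53596" in the decoded output
tcpdump -ni eth0 -T vxlan -vv udp port 8472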
  • How can you find the identity of the remote endpoint if the traffic is encapsulated?
    • Traffic is encapsulated over VXLAN from the server to the busybox pod behind one of the worker nodes.
    • In this case, VNI=6 is the identity of the server VM, which Cilium labels remote-node.
    • You can then look up the identity using the Cilium CLI, which confirms that it is the remote node.
kubectl -n kube-system exec ds/cilium -- cilium identity get 6

Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), wait-for-node-init (init), clean-cilium-state (init), install-cni-binaries (init)
ID   LABELS
6    reserved:remote-node
  • How can you check the learned routes in the Azure Route Server?
    • Using the Azure CLI, you can check the routes that the Azure Route Server has learned.
az network routeserver peering list-learned-routes --name 'cilium-egress-gw-bgp' --resource-group 'byocniars' --routeserver 'myRouteServer'
{
  "RouteServiceRole_IN_0": [
    {
      "asPath": "65516",
      "localAddress": "192.168.8.69",
      "network": "10.100.255.48/32",
      "nextHop": "192.168.8.36",
      "origin": "EBgp",
      "sourcePeer": "192.168.8.36",
      "weight": 32768
    }
  ],
  "RouteServiceRole_IN_1": [
    {
      "asPath": "65516",
      "localAddress": "192.168.8.68",
      "network": "10.100.255.48/32",
      "nextHop": "192.168.8.36",
      "origin": "EBgp",
      "sourcePeer": "192.168.8.36",
      "weight": 32768
    }
  ]
}

az network routeserver peering list-learned-routes --name 'cilium-egress-gw-bgp-1' --resource-group 'byocniars' --routeserver 'myRouteServer'
{
  "RouteServiceRole_IN_0": [
    {
      "asPath": "65516",
      "localAddress": "192.168.8.69",
      "network": "10.100.255.49/32",
      "nextHop": "192.168.8.37",
      "origin": "EBgp",
      "sourcePeer": "192.168.8.37",
      "weight": 32768
    }
  ],
  "RouteServiceRole_IN_1": [
    {
      "asPath": "65516",
      "localAddress": "192.168.8.68",
      "network": "10.100.255.49/32",
      "nextHop": "192.168.8.37",
      "origin": "EBgp",
      "sourcePeer": "192.168.8.37",
      "weight": 32768
    }
  ]
}
  • How can you check the routes advertised by the Azure Route Server?
    • You can use the Azure CLI to check the routes that the Azure Route Server has advertised.
az network routeserver peering list-advertised-routes --name 'cilium-egress-gw-bgp' --resource-group 'byocniars' --routeserver 'myRouteServer'
{
  "RouteServiceRole_IN_0": [
    {
      "asPath": "65515",
      "localAddress": "192.168.8.69",
      "network": "192.168.40.0/22",
      "nextHop": "192.168.8.69",
      "origin": "Igp",
      "weight": 0
    },
    {
      "asPath": "65515",
      "localAddress": "192.168.8.69",
      "network": "192.168.8.0/22",
      "nextHop": "192.168.8.69",
      "origin": "Igp",
      "weight": 0
    }
  ],
  "RouteServiceRole_IN_1": [
    {
      "asPath": "65515",
      "localAddress": "192.168.8.68",
      "network": "192.168.40.0/22",
      "nextHop": "192.168.8.68",
      "origin": "Igp",
      "weight": 0
    },
    {
      "asPath": "65515",
      "localAddress": "192.168.8.68",
      "network": "192.168.8.0/22",
      "nextHop": "192.168.8.68",
      "origin": "Igp",
      "weight": 0
    }
  ]
}

az network routeserver peering list-advertised-routes --name 'cilium-egress-gw-bgp-1' --resource-group 'byocniars' --routeserver 'myRouteServer'
{
  "RouteServiceRole_IN_0": [
    {
      "asPath": "65515",
      "localAddress": "192.168.8.69",
      "network": "192.168.40.0/22",
      "nextHop": "192.168.8.69",
      "origin": "Igp",
      "weight": 0
    },
    {
      "asPath": "65515",
      "localAddress": "192.168.8.69",
      "network": "192.168.8.0/22",
      "nextHop": "192.168.8.69",
      "origin": "Igp",
      "weight": 0
    }
  ],
  "RouteServiceRole_IN_1": [
    {
      "asPath": "65515",
      "localAddress": "192.168.8.68",
      "network": "192.168.40.0/22",
      "nextHop": "192.168.8.68",
      "origin": "Igp",
      "weight": 0
    },
    {
      "asPath": "65515",
      "localAddress": "192.168.8.68",
      "network": "192.168.8.0/22",
      "nextHop": "192.168.8.68",
      "origin": "Igp",
      "weight": 0
    }
  ]
}
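As a final check on the Azure side, you can verify that the learned /32 egress IP routes were actually programmed onto the legacy workload’s NIC by listing its effective routes; a minimal sketch, where the NIC name is hypothetical:

az network nic show-effective-route-table \
  --resource-group 'byocniars' \
  --name 'remote-server-nic' \
  --output table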

Conclusion

Hopefully, this post gave you a good overview of deploying Cilium and Egress Gateway in AKS (Azure Kubernetes Service) using BYOCNI as the network plugin. If you have any feedback on the solution, please share it with us. Talk to us, and let’s see how Isovalent can help with your use case.

Try it out

Start with the Egress Gateway lab and explore Egress Gateway in action.

Further Reading

To dive deeper into the topic of Egress Gateway, check out these two videos:
