About the speaker: Nico Vibert, Senior Staff Technical Marketing Engineer

Cluster Mesh Service Affinity

[12:35] In this video, Senior Staff Technical Marketing Engineer Nico Vibert walks through a new feature in Cilium 1.12: the ability to specify service affinity for load balancing across meshed clusters.

Transcript

Welcome to this Isovalent and Cilium tech flash episode on Cluster Mesh service affinity. I’m Nico, and thanks for watching this video. This is a follow-up to the Cluster Mesh video that my colleague Ray did, which was a great introduction to Cluster Mesh and why you might want to use it to connect your Kubernetes clusters together, and it included a great demo as well.

Now, in Cilium 1.12 we are introducing a new feature called service affinity, which is about the ability to direct traffic either locally or remotely, and to express a preference for how traffic is load balanced within and between your clusters.

Now, let’s have a look at what actually happens by default. Before this feature, if you used Cluster Mesh to connect your Kubernetes clusters together and used the global load balancing feature, traffic entering the service would be load balanced across all the back-ends, whether local or remote. It would be distributed evenly across all back-ends, assuming they were healthy.

So, that worked great, but you might want more granularity in how traffic is distributed across your back-ends, and that’s what we’re introducing with service affinity. There are a couple of options: one is called ‘local’ and one is called ‘remote’. The previous behavior remains the default and is called ‘none’.

So, with local service affinity, we are telling traffic to prefer the local back-ends, assuming those back-ends are healthy. Traffic entering the cluster IP will be load balanced across the local healthy back-ends, and the same applies on both clusters. Now, imagine the local back-ends become unhealthy for whatever reason. Traffic will automatically fail over to the remote cluster, which is a perfect example of distributed, or global, load balancing.

So, that’s brilliant, and it’s just a single line of configuration: you use an annotation to specify that a service’s affinity is ‘local’, ‘remote’, or ‘none’. Or you can stick with the standard behavior, where traffic is distributed evenly across all healthy connected clusters.

We also have another option, which will probably be used less often but can be handy: the ‘remote’ option, where you tell traffic to go to the remote cluster. That could be useful for canary deployments, where you want to do some upgrade on your local back-ends and send traffic away from them towards the remote cluster. And again, if the remote back-ends become unavailable, traffic will fall back to the local ones.

As I mentioned, it’s just one line: you add an annotation to your service specification that says the service affinity is set to ‘local’, and likewise for ‘remote’.
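As a rough sketch of what that one line looks like in practice: the Cilium 1.12 documentation uses the io.cilium/service-affinity annotation, set alongside io.cilium/global-service, which makes the service global across the mesh (check docs.cilium.io for the exact keys in your release). Applied with kubectl to a hypothetical service named ‘local’ in a ‘demo’ namespace, it could look like this:

    # Sketch only: service and namespace names are placeholders,
    # annotation keys are as documented for Cilium 1.12
    kubectl -n demo annotate service local \
        io.cilium/global-service="true" \
        io.cilium/service-affinity="local" --overwrite

In a manifest, the same two keys would simply sit under metadata.annotations in the Service specification.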

Now, let’s have a look at the demo. What we’re going to do here is generate some traffic from a local pod to a service that is globally available across multiple clusters connected in a cluster mesh, and we’re going to try the different types of service affinity to see how traffic is distributed and what happens when a pod fails.

So, let’s have a quick look at what we’ve got here. We’re using kind to run our clusters, with Cilium installed and Cluster Mesh enabled. We’re currently in the first cluster’s context (‘kind-service-affinity-1’), and if we look at the Cluster Mesh status, we can see that the connection to the second cluster (‘kind-service-affinity-2’) is configured and connected. So we’ve got two clusters connected with Cluster Mesh and it’s working fine.
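For reference, the checks described here map roughly onto these commands (the context names are placeholders for whatever kind generated in this demo):

    # Switch to the first cluster's context, then check Cilium and the mesh
    kubectl config use-context kind-service-affinity-1
    cilium status               # Cilium agents are healthy
    cilium clustermesh status   # the connection to the second cluster is configured and connected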

We’ve deployed a couple of DaemonSets in the ‘demo’ namespace, and everything is ready and working. The ‘echoserver’ DaemonSet is essentially our very basic HTTP server, and we’ll be running curl requests from the ‘netshoot’ DaemonSet. The pods are there and healthy.
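Something along these lines shows that state (namespace and workload names as used in this demo):

    kubectl -n demo get daemonsets    # the echoserver and netshoot DaemonSets
    kubectl -n demo get pods -o wide  # their pods, healthy, one per node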

Now, we’re going to have a quick look at the services. We’ve already deployed three services: ‘local’, ‘none’, and ‘remote’. ‘None’, again, is the default mode, the behavior we had before 1.12 was released, which load balances traffic across all healthy endpoints. With ‘local’ we prefer the local cluster, and with ‘remote’ we prefer the remote cluster.

And the way to set this up is with an annotation in the service specification. You can see we’re setting the service affinity to ‘local’, ‘remote’, or ‘none’.
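To inspect the three services and their annotations, something like this works (service names from the demo):

    kubectl -n demo get svc local none remote   # the three cluster IP services
    kubectl -n demo describe svc local          # the Annotations field shows the affinity setting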

Okay, let’s start by generating traffic from our pod to the cluster IP service. We’re generating 100 requests: we exec into the client pod, run the curl requests, then filter and display the responses so you can count how many requests stayed local and how many went to the remote cluster.
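A sketch of that loop, assuming the echo server replies with something identifying the serving pod and the service listens on port 80 (the exact field to filter on depends on the echo image used):

    # Pick one of the client pods in cluster 1; the label and URL are placeholders from this demo
    POD=$(kubectl -n demo get pods -l app=netshoot -o jsonpath='{.items[0].metadata.name}')
    kubectl -n demo exec "$POD" -- sh -c \
        'for i in $(seq 1 100); do curl -s http://none.demo.svc.cluster.local/; done' \
        | grep -i hostname | sort | uniq -c     # count responses per back-end pod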

So, with the default behavior, when we send a hundred requests, what we expect to see is a pretty even distribution across the two clusters, roughly 50/50. Sometimes you will see 48/52 or 47/53, but it’s roughly a round-robin distribution across the two clusters. Okay.

Now, what we can do next is start sending traffic to our ‘remote’ service. Here again, we are sending requests from a pod in cluster one to a globally available service that exists in both cluster one and cluster two. The pods are healthy, but because we’ve set the affinity to ‘remote’, the traffic will automatically be sent to the remote cluster.
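The same loop pointed at the ‘remote’-affinity service (names from the demo, same assumptions as before):

    kubectl -n demo exec "$POD" -- sh -c \
        'for i in $(seq 1 100); do curl -s http://remote.demo.svc.cluster.local/; done' \
        | grep -i hostname | sort | uniq -c     # expect responses to come from cluster 2's back-ends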

Now, let’s run some local connectivity tests from a pod in cluster one to a service with the ‘local’ affinity annotation, to confirm that the traffic stays local. What we expect to see here is traffic preferring the local pod. Okay.
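And against the ‘local’-affinity service:

    kubectl -n demo exec "$POD" -- sh -c \
        'for i in $(seq 1 100); do curl -s http://local.demo.svc.cluster.local/; done' \
        | grep -i hostname | sort | uniq -c     # expect all responses from the local cluster 1 back-end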

Now, to demonstrate that the traffic will fail over to the remote cluster, let’s delete the local pod. Let’s do this in a different window. As it’s part of the DaemonSet, it’s going to be automatically recreated, but while it’s being recreated, we’re going to rerun the command we ran before, and what we expect to see is traffic being redirected to the remote cluster. And you can see all the responses have come from the other cluster’s pod, while the local echo server is still terminating.
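A sketch of that failover step (the pod label is a placeholder; POD is the client pod from earlier):

    # In a second window: delete the local echo server pod; the DaemonSet recreates it
    kubectl -n demo delete pod -l app=echoserver --wait=false
    # While it restarts, rerun the loop against the 'local'-affinity service
    kubectl -n demo exec "$POD" -- sh -c \
        'for i in $(seq 1 100); do curl -s http://local.demo.svc.cluster.local/; done' \
        | grep -i hostname | sort | uniq -c     # responses now come from the remote cluster's pod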

Now, that’s it for the demo. Hopefully you can see how we can use service affinity to get more granularity in how we direct traffic between our clusters, with the ‘local’, ‘remote’, and standard ‘none’ options. I hope this was helpful. If you want to learn more, you can head over to docs.cilium.io, and if you have any other feedback, feel free to jump on the Cilium and eBPF Slack channel and get in touch. Thank you for watching. Bye.