Azure AKS Public IP in Non-standard Resource Group

I've been trying to manage an Azure Kubernetes Service (AKS) instance via Terraform. When I create the AKS instance via the Azure CLI per this MS tutorial, then install an ingress controller with a static public IP, per this MS tutorial, everything works fine. This method implicitly creates a service principal (SP).

When I create an otherwise exact duplicate of the AKS cluster via Terraform, I am forced to supply the service principal explicitly. I gave this new SP "Contributor" access to the cluster's entire resource group. Yet when I get to the step that creates the ingress controller, using the same command that the second tutorial provides:

helm install stable/nginx-ingress --set controller.replicaCount=2 --set controller.service.loadBalancerIP="XX.XX.XX.XX"

the ingress service comes up but never acquires its public IP. The IP status remains "<pending>" indefinitely, and I can find nothing in any log about why. Are there logs that should tell me why my IP is still pending?

Again, I am fairly certain that, other than the SP, the Terraform AKS cluster is an exact duplicate of the one created based on the MS tutorial. Running terraform plan finds no differences between the two. Does anyone have any idea what permission my AKS SP might need or what else I might be missing here? Strangely, I can't find ANY permissions assigned to the implicitly created principal via the Azure portal, but I can't think of anything else that might be causing this behavior.

Not sure if it's a red herring or not, but other users have complained about a similar problem in the context of issues opened against the second tutorial. Their fix always appears to be "tear down your cluster and retry", but that isn't an acceptable solution in this context. I need a reproducible working cluster and azurerm_kubernetes_cluster doesn't currently allow for building an AKS instance with an implicitly created SP.



Solution 1:[1]

I'm going to answer my own question, for posterity. It turns out the problem was the resource group where I created the static public IP. AKS clusters use two resource groups: the group that you explicitly created the cluster in, and a second group that the cluster creates implicitly. That second, implicit resource group always gets a name starting with "MC_" (the rest of the name is derived from the explicit RG, the cluster name, and the region).

Anyhow, the default AKS configuration requires that the public IP be created within that implicit resource group. Assuming that you created the AKS cluster with Terraform, its name will be exported in ${azurerm_kubernetes_cluster.NAME.node_resource_group}.
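For anyone not driving this from Terraform, the same lookup can be done with the Azure CLI. The following is only a sketch: the function name and the resource names in the usage line are placeholders that do not come from the original setup, and it assumes the Standard SKU, which has to match the SKU of the cluster's load balancer.

```shell
# Sketch: create a static public IP inside the cluster's implicit MC_* group.
# The function name and all arguments are illustrative placeholders.

create_ip_in_node_rg() {
  local cluster_rg="$1" cluster_name="$2" ip_name="$3"
  local node_rg

  # Ask AKS for the name of the implicitly created node resource group.
  node_rg="$(az aks show \
    --resource-group "$cluster_rg" \
    --name "$cluster_name" \
    --query nodeResourceGroup \
    --output tsv)" || return 1

  # Create the static IP in that group and print the allocated address.
  # Assumption: Standard SKU; change it if your load balancer uses Basic.
  az network public-ip create \
    --resource-group "$node_rg" \
    --name "$ip_name" \
    --sku Standard \
    --allocation-method Static \
    --query publicIp.ipAddress \
    --output tsv
}

# Usage (requires an authenticated Azure CLI session):
#   create_ip_in_node_rg myResourceGroup myAKSCluster myAKSPublicIP
```

The address it prints is what you would then pass as controller.service.loadBalancerIP.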

EDIT 2019-05-23

Since writing this, we found a use case that the workaround of using the MC_* resource group wasn't good enough for. I opened a support ticket with MS and they directed me to this solution. Add the following annotation to your LoadBalancer (or Ingress controller), and make sure that the AKS SP has at least Network Contributor rights in the destination resource group (myResourceGroup in the example below):

metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: myResourceGroup

This solved it completely for us.

Solution 2:[2]

I can't comment just yet so putting this addition as answer.

Derek is right: you can use an existing IP from a resource group other than the one where the AKS cluster was provisioned. This is covered in the documentation. Just make sure you've done the two steps below:

  1. Add "Network Contributor" role assignment for your AKS service principal to the resource group where your existing static IP is.

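  2. Add the service.beta.kubernetes.io/azure-load-balancer-resource-group annotation to the ingress controller's service, with its value set to the resource group that holds the static IP. For example (here the service is in the ingress namespace and datagate is that resource group; substitute your own names):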

kubectl annotate service ingress-nginx-controller -n ingress service.beta.kubernetes.io/azure-load-balancer-resource-group=datagate
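Step 1 can also be done from the Azure CLI. A rough sketch, assuming you know the application (client) ID of the service principal the cluster runs as and the subscription ID; the function names and arguments below are placeholders for illustration, not values from this thread:

```shell
# Sketch: grant the cluster's service principal "Network Contributor" on the
# resource group that holds the static IP. All arguments are placeholders.

rg_scope() {
  # Build the ARM scope string for a resource group.
  printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
}

grant_network_contributor() {
  local sp_app_id="$1" subscription_id="$2" ip_rg="$3"

  az role assignment create \
    --assignee "$sp_app_id" \
    --role "Network Contributor" \
    --scope "$(rg_scope "$subscription_id" "$ip_rg")"
}

# Usage (requires rights to create role assignments on that group):
#   grant_network_contributor "$SP_APP_ID" "$SUBSCRIPTION_ID" datagate
```

Scoping the assignment to the single resource group keeps the service principal's rights narrower than a subscription-wide role would.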

Solution 3:[3]

Set Static IP Resource Group when Installing Helm Chart

Here is a minimal helm install command for the ingress-nginx controller that works when the static IP is in a different resource group than the cluster's managed node resource group.

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx \
  --set controller.replicaCount=1 \
  --set controller.service.externalTrafficPolicy=Local \
  --set controller.service.loadBalancerIP=$ingress_controller_ip \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-resource-group"=$STATIC_IP_RESOURCE_GROUP

The key is the last override to provide the resource group of the static IP.
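To check that the controller's service really picked up the static IP, something like the following can help. It is a sketch rather than part of the chart: the function name and the POLL_SECONDS variable are made up for illustration. If the address never shows up, the events printed by kubectl describe usually say why.

```shell
# Sketch: poll a LoadBalancer service until it reports the expected IP.
# wait_for_external_ip and POLL_SECONDS are illustrative names.

wait_for_external_ip() {
  local namespace="$1" service="$2" expected_ip="$3" attempts="${4:-30}"
  local ip i=0

  while [ "$i" -lt "$attempts" ]; do
    # Empty output means the address is still pending.
    ip="$(kubectl get service "$service" --namespace "$namespace" \
      --output jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)" || ip=""
    if [ "$ip" = "$expected_ip" ]; then
      echo "assigned: $ip"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-10}"
  done

  # Still pending: dump the service events to stderr for troubleshooting.
  kubectl describe service "$service" --namespace "$namespace" >&2
  return 1
}

# Usage, assuming the release above created a service named
# ingress-nginx-controller:
#   wait_for_external_ip ingress-nginx ingress-nginx-controller "$ingress_controller_ip"
```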

Additional Note: Health Probe Endpoints

Note that you may need to customize the load balancer health probe if your root path doesn't return a successful HTTP response. We do this by additionally adding the following (replace /healthz with your probe endpoint):

--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz

Versions

Kubernetes 1.22.6
ingress-nginx-4.1.0
ingress-nginx/controller:v1.2.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Gleb Teterin
Solution 3