Debugging Kubernetes webhook timeouts with Cilium
When installing operators and applying their CRDs in Kubernetes, I occasionally run into an issue that looks like:
Error from server (InternalError): error when creating "...": Internal error occurred: failed calling webhook "...": failed to call webhook: Post "https://...webhook...svc...?timeout=11s": context deadline exceeded
Below is a summary of the observations and steps I take to solve this, as a reminder to myself.
This guide documents my approach to solving webhook timeouts in Kubernetes clusters that use Cilium for networking. Kubernetes environments differ a lot, but the troubleshooting approach should carry over.
It also assumes some familiarity with Kubernetes admission webhooks, and it focuses specifically on networking issues that make webhooks time out, not on other potential webhook failures.
First, check the basics
Before diving into network policies, verify:
- Webhook pods are running (kubectl get pods -n <namespace>)
- Basic connectivity (kubectl logs <webhook-pod> -n <namespace>)
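It can also be worth confirming that the webhook Service actually has endpoints behind it; if its selector does not match the running pods, the apiserver has nothing to call. A quick check (the service name here is a placeholder) could look like:

# Verify the webhook Service resolves to at least one pod IP:port
kubectl get endpoints <webhook-service> -n <namespace>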
Observe and diagnose the issue
I’m not sure how relevant this is, but the clusters I work on often have Cilium and Kyverno configured to block a lot of traffic by default for security reasons.
To debug, I inspect the network traffic. When I look in Hubble (via Cilium’s port-forward or the Hubble UI in the browser), I see no connections between the kube-apiserver and the webhook at all: nothing dropped and nothing forwarded.
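The Hubble CLI shows the same thing if you prefer the terminal; something along these lines (the namespace is a placeholder) should work:

# Make the Hubble relay reachable locally
cilium hubble port-forward &

# Watch flows heading into the webhook's namespace while the webhook call times out;
# in this situation nothing shows up, neither DROPPED nor FORWARDED
hubble observe --to-namespace <operator-namespace> --follow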
Next, I check whether the webhook is actually serving on its URL. I run a pod in the webhook’s namespace and curl -X POST https://...webhook...svc...?timeout=10s, which should return an error response (and not time out).
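In practice I do this with a throwaway curl pod; a rough sketch, where the image choice and all <...> names are placeholders for your own service, port, and path:

# Start a temporary pod in the webhook's namespace
kubectl run curl-debug --rm -it --restart=Never -n <operator-namespace> \
  --image=curlimages/curl -- sh

# From inside that pod: POST to the webhook service; -k skips TLS verification
# because we only care about reachability here, not the certificate
curl -k -X POST --max-time 10 "https://<webhook-service>.<operator-namespace>.svc:<svc-port>/<path>"

# A quick error response (e.g. a 400) means the webhook is reachable from within
# its own namespace; hanging until the timeout points at the network path instead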
In my specific environment the webhook service is a ClusterIP, and in some cases changing the service configuration can help expose networking issues for debugging purposes.
Fix the webhook
First, change the webhook service’s type to LoadBalancer and set its loadBalancerIP to match its clusterIP. This will probably not fix the issue by itself, but it makes the dropped connections show up in Hubble.
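A minimal way to apply that change, assuming the service name and IP placeholders are filled in from your cluster, is a kubectl patch:

# Look up the current ClusterIP of the webhook service
kubectl get svc <webhook-service> -n <operator-namespace>

# Switch the service to LoadBalancer and pin loadBalancerIP to that ClusterIP,
# purely so the flows become visible in Hubble
kubectl patch svc <webhook-service> -n <operator-namespace> --type merge \
  -p '{"spec":{"type":"LoadBalancer","loadBalancerIP":"<cluster-ip>"}}'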
Now that the dropped connections are visible in Hubble, a (Cilium) network policy solves the final step. Below is an example:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: ingress--kube-apiserver-operator-webhook
  namespace: <operator-namespace>
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/component: webhook
      app.kubernetes.io/instance: <operator-name>
  ingress:
    - fromEntities:
        - kube-apiserver
      toPorts:
        - ports:
            - port: "<svc-port>"
              protocol: TCP
Replace the <vars> and apply this to the cluster. The webhook calls should now go through as expected.
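To verify, I apply the policy, retry the operation that originally failed, and watch Hubble again; the flows that were dropped before should now show up as forwarded. Roughly:

# Apply the policy (file name is arbitrary)
kubectl apply -f cnp-webhook-ingress.yaml

# Retry the kubectl apply / operator install that failed, then confirm the
# webhook traffic is now forwarded instead of dropped
hubble observe --to-namespace <operator-namespace> --verdict FORWARDED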