Webhooks - What’s the worst that could happen?

A presentation at Kubernetes Community Days UK in October 2023 in London, UK by Marcus Noble

Slide 1

Webhooks What’s the worst that could happen?

Kubernetes Community Days UK

October 17th 2023

Slide 2

Hi 👋, I’m Marcus Noble, a platform engineer at Giant Swarm

Find me on Mastodon at: @Marcus@k8s.social

All my profiles / contact details at: MarcusNoble.com

6+ years experience running Kubernetes in production environments.

Slide 3

Webhooks in Kubernetes are ✨ POWERFUL ✨

But with that power comes 😱 RISK 😱

Slide 4

Webhooks in Kubernetes

Kubernetes has four main types of webhooks:

  • ValidatingWebhookConfiguration
    • Introduced in v1.9 (replacing GenericAdmissionWebhook introduced in v1.7)
  • MutatingWebhookConfiguration
    • Introduced in v1.9
  • CustomResourceConversion
    • Introduced in v1.13
  • ValidatingAdmissionPolicy (not exactly a webhook but similar functionality)
    • Introduced in v1.26 and currently Beta as of v1.28
    • Note: This admission plugin is enabled by default, but is only active if you enable the ValidatingAdmissionPolicy feature gate and the admissionregistration.k8s.io/v1alpha1 API (v1beta1 as of v1.28).

Slide 5

Purpose / Use Cases

Defaulting

  • Injecting imagePullSecrets dynamically when pods are created
  • Injecting sidecars into pods
  • Setting default resource limits
  • Injecting proxy environment variables into pods

Policy Enforcement

  • Prevent using latest image tag
  • Require all pods to have resource limits set
  • Block the use of deprecated Kubernetes APIs (e.g. batch/v1beta1)
  • Block use of hostPath

Best Practices

  • Require a PodDisruptionBudget to be set
  • Enforce a standard set of labels / annotations on all resources
  • Restrict allowed namespaces
  • Replace all image registries with an in-house container image proxy / cache

Problem Mitigation

  • Block nodes with known CVEs, based on the kernel version, from joining the cluster (e.g. CVE-2022-0185)
  • Prevent custom nginx snippets from being used
  • Inject Log4Shell mitigation env var into all pods (CVE-2021-44228)
  • Block binding to the cluster-admin role

Slide 6

Example API request

Slide 7

Example API request

Slide 8

Example API request

Slide 9

Sounds great, right!? So where’s the risk?

Slide 10

Admission webhooks can be burdensome to develop and operate. […] Each webhook must be deployed, monitored and have a well defined upgrade and rollback plan. To make matters worse, if a webhook times out or becomes unavailable, the Kubernetes control plane can become unavailable.

https://kubernetes.io/blog/2022/12/20/validating-admission-policies-alpha/

Slide 11

The Kubernetes control plane can become unavailable!

https://kubernetes.io/blog/2022/12/20/validating-admission-policies-alpha/

Slide 12

Let’s look at some numbers

I took a look at 129 clusters* and found that ALL of them had at least one validating and one mutating webhook in place, with an average of 9 validating and 7 mutating webhooks overall. The most validating webhooks in a single cluster was 25 and the most mutating webhooks was 15!

* a mix of production, dev and test with a wide variety of node counts

Slide 13

How bad can things get?

Slide 14

Let’s play a quick game

I’ll show you scenarios of a misconfigured or malicious webhook and, through a show of hands, I want you all to let me know if you think it’ll cause problems in the cluster. 🙋

Slide 15

Different content type

In this scenario we’re going to return a valid Content-Type header but one that doesn’t match the content being returned.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 16

Different content type - ✅

The api-server rejects any webhook responses that aren’t JSON (or YAML) regardless of their Content-Type header. The header value is actually ignored.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 17

Cut off response

In this scenario we’re going to set the Content-Length response header to be longer than the actual response body.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 18

Cut off response - ✅

The api-server returns an error of unexpected EOF.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 19

Redirect

This scenario responds to all webhook requests with a redirect to a service that infinitely redirects the client.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 20

Redirect - ✅

The api-server stops following redirects after 10 redirects.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 21

Reinvocation

This scenario configures two mutating webhooks with a reinvocationPolicy set to IfNeeded. Both webhooks will mutate the object, causing the other webhook to be triggered again.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 22

Reinvocation - ✅

Each webhook is triggered 2 times and then no more. The api-server keeps track of how many times it has called a specific webhook and avoids calling it endlessly. From the Kubernetes documentation:

> if additional invocations result in further modifications to the object, webhooks are not guaranteed to be invoked again.

🙋 I think this will break things
💁 The cluster will handle that just fine
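The reinvocation scenario could be registered roughly as follows. This is a minimal sketch, not the configuration used in the talk: the webhook names, namespace, service names and paths are all hypothetical, and the CA bundle is assumed to be injected separately.

```yaml
# Sketch only: names, namespace, services and paths are hypothetical.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: reinvocation-demo.example.com
webhooks:
  # First webhook: mutates pods, and asks to be called again if a
  # later webhook changes the object.
  - name: first.reinvocation-demo.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    reinvocationPolicy: IfNeeded
    clientConfig:
      service:
        namespace: webhooks
        name: first-mutator
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
  # Second webhook: same rules and the same reinvocation policy, so
  # each webhook's mutation makes the other eligible to run again.
  - name: second.reinvocation-demo.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    reinvocationPolicy: IfNeeded
    clientConfig:
      service:
        namespace: webhooks
        name: second-mutator
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```

The default for reinvocationPolicy is Never, so this ping-pong only becomes possible when both webhooks opt in.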

Slide 23

“Fork bomb”

In this scenario our webhook handler generates a new Event resource to record that the webhook was triggered. This in turn triggers our webhook against the Event resource which generates another Event and so on.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 24

“Fork bomb” - 🤷

The cluster will be fine initially, provided the webhook handler doesn’t wait for successful creation of the Event. But a couple of things might happen here:

  • The api-server may DoS itself with too many requests if the cluster is active enough.
  • In the background the Events in the cluster are building up. This number will keep increasing, consuming etcd storage and resources. Depending on cluster configuration (see the --event-ttl api-server flag) this could eventually take down etcd and cause a cluster outage.

🙋 I think this will break things
💁 The cluster will handle that just fine
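One way to avoid this kind of feedback loop is to scope the webhook's rules to only the resources it needs, so that objects created by the handler itself never match. The sketch below assumes a handler that only cares about pods; the names, namespace and path are hypothetical.

```yaml
# Sketch only: names, namespace and path are hypothetical.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: event-recorder.example.com
webhooks:
  - name: event-recorder.example.com
    admissionReviewVersions: ["v1"]
    # The handler creates Events, so it has side effects outside of
    # the admission response itself.
    sideEffects: NoneOnDryRun
    clientConfig:
      service:
        namespace: webhooks
        name: event-recorder
        path: /validate
    rules:
      # Listing "pods" explicitly means Events created by the handler
      # do not match. A wildcard such as resources: ["*"] would also
      # match Events and allow the loop described above.
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```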

Slide 25

Data overload

In this scenario we’re going to return as much data as possible in the response to the api-server. To achieve this we’re piping random data from the crypto/rand package to the response writer.

🙋 I think this will break things
💁 The cluster will handle that just fine

Slide 26

Data overload - 🔥

The api-server completely locks up and stops responding to any further API calls.

  • Restarting the api-server temporarily fixes the issue until the webhook is next triggered.
  • Reducing the timeoutSeconds seems to help the api-server handle the webhook calls, but the default of 10s causes it to lock up.

🙋 I think this will break things
💁 The cluster will handle that just fine
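For reference, timeoutSeconds is set per webhook and accepts values from 1 to 30. A minimal sketch follows; the value of 2 is an arbitrary assumption rather than a figure from the talk, and the names, namespace and path are hypothetical.

```yaml
# Sketch only: the timeout value and all names are assumptions.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: short-timeout.example.com
webhooks:
  - name: short-timeout.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Give up on the webhook call after 2 seconds instead of the
    # default 10 seconds.
    timeoutSeconds: 2
    clientConfig:
      service:
        namespace: webhooks
        name: example-webhook
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```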

Slide 27

The Actual Risk

Slide 28

The Actual Risks of Webhooks

While webhooks themselves are rarely the cause of cluster problems, they have a tendency to exacerbate them. As webhooks are often on the critical path, e.g. pod creation, if they fail they can cause new problems to surface or existing problems to become harder to handle!

Slide 29

The Risks of Webhooks

tl;dr; - If a webhook isn’t built and configured with resilience in mind it’s possible that a failing webhook can block the creation of critical pods, such as the api-server, and eventually take down the whole cluster if not caught quickly.

Slide 30

What’s the fix?

Slide 31

What’s the fix?

  • Set failurePolicy: Ignore whenever possible
  • Make use of namespaceSelector and objectSelector to limit the impacted resources
  • Auto-remove the webhooks when the pod shuts down (Kyverno does a good job of this)
  • Do you even need the webhook? Can the mutation be done async as an operator?
  • During an incident it might make sense to remove webhooks until things are resolved so they don’t make things worse.
  • Make use of the new ValidatingAdmissionPolicy where possible!
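The first two suggestions could look something like the sketch below. Every name, namespace and label here is hypothetical, and the CA bundle is assumed to be injected separately; treat it as a starting point rather than a recommended configuration.

```yaml
# Sketch only: every name, namespace, label and path is hypothetical.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy.example.com
webhooks:
  - name: example-policy.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        namespace: webhooks
        name: example-webhook
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    # If the webhook is unreachable or errors, admit the request
    # rather than blocking it.
    failurePolicy: Ignore
    # Skip critical namespaces, including the webhook's own.
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "webhooks"]
    # Allow individual objects to opt out via a label.
    objectSelector:
      matchExpressions:
        - key: example.com/skip-webhook
          operator: DoesNotExist
```

Note the trade-off: failurePolicy: Ignore means a policy-enforcing webhook stops enforcing while it is down, which is why the slide says "whenever possible" rather than "always".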

Slide 32

ValidatingAdmissionPolicy

KEP-3488 - CEL for Admission Control

  • Adds expression language (CEL) support to the current validation mechanism, avoiding some cases where webhooks would be needed.
  • Performed by the API server, so it doesn’t require pods/services to be alive for it to work.
  • Follows on from KEP-2876: CRD Validation Expression Language, which introduced similar functionality for CRDs in v1.23.
  • Only for validating resources, not mutating.

Introduced: 2022-09-01 | Status: Alpha in v1.26, Beta in v1.28

Slide 33

ValidatingAdmissionPolicy

ValidatingAdmissionPolicy defines a set of filters and expressions to perform against resources.

matchConstraints defines filters for which resources this policy is compatible with (object and namespace selectors are also possible here).

validations lists the expressions to perform against the resources. Here we’re matching Deployments that have fewer than 3 replicas defined.
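The manifest from this slide is not included in the transcript. A minimal sketch of what such a policy could look like is shown here, reusing the policy name and the expression that appear on the later slides; the failurePolicy value and the exact resourceRules are assumptions.

```yaml
# Sketch only: failurePolicy and resourceRules are assumptions; the
# name and expression are taken from the later slides.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: "ha-replicas.marcusnoble.com"
spec:
  failurePolicy: Fail
  # Which resources this policy is compatible with.
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  # CEL expressions that must evaluate to true for the request to pass.
  validations:
    - expression: "object.spec.replicas >= 3"
```

The expression states the condition that must hold, so a Deployment with fewer than 3 replicas is the case that fails validation.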

Slide 34

ValidatingAdmissionPolicy

ValidatingAdmissionPolicyBinding is needed to make use of our policies.

policyName is the name of the policy from the previous slide.

validationActions is what we should do with matches (Deny, Warn or Audit).

matchResources allows us to indicate what resources to run the policy against. In this case, we’re going to ensure all Deployments in namespaces with an environment label of production have at least 3 replicas.

    apiVersion: admissionregistration.k8s.io/v1beta1
    kind: ValidatingAdmissionPolicyBinding
    metadata:
      name: "ha-replicas-binding.marcusnoble.com"
    spec:
      policyName: "ha-replicas.marcusnoble.com"
      validationActions: [Deny]
      matchResources:
        namespaceSelector:
          matchLabels:
            environment: production

Slide 35

ValidatingAdmissionPolicy

When creating a deployment in a production namespace with only a single replica defined:

    The deployments "nginx-deployment" is invalid: ValidatingAdmissionPolicy 'ha-replicas.marcusnoble.com' with binding 'ha-replicas-binding.marcusnoble.com' denied request: failed expression: object.spec.replicas >= 3

Slide 36

ValidatingAdmissionPolicy

More features:

  • Parameters
  • Warning message expressions
  • Variables
  • Match Conditions
  • Auditing

https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/

Slide 37

Summary

  • Webhooks are rarely the cause of incidents but do have a tendency to make them worse
  • Beware the critical path of API requests
  • Ensure your webhooks are built to be resilient and highly available
  • Where possible, move to using ValidatingAdmissionPolicy instead