[Enhancement]: Add alerts for PodDisruptionBudgets #1028
Hi @darraghjones, great idea! It makes sense to me to have an alert with behavior specific to PDBs. There are quite a few existing alerts around pod and workload health, so I'd be interested to know if any of the following would already cover your particular use case?
What I'm interested in is what a PDB alert should (or shouldn't) look for, given the context of the existing alerts above. Please see this as an open discussion; all ideas are welcome!
Hi. We are using these alerts, and during a recent incident a number of them did fire. I believe it was these ones: The issue was that we only treat these as P3 alerts, and as such did not escalate them out of hours. I do not necessarily think they should be P1 alerts, because in many cases we would have a number of other replicas running, so one being down isn't a big deal. However, in this recent incident we were only running a single replica. My feeling is that an alert (or alerts) specifically around PDBs could be treated as P1s. Hopefully this gives some context. Happy to delve deeper if need be.
Thanks for the additional context. Which severity do you have in mind for a PDB alert based on the below: ... and do you think that would be the default severity for everyone? I know in some cases we have a default lower priority, and that can then be overridden to a higher severity using the config.
From my own perspective, I feel like the severity of such an alert should be Critical: at least from my own understanding, Kubernetes should be trying really hard to honor PodDisruptionBudgets, and if they are violated, then something has likely gone quite badly wrong. Of course, it's hard to say whether that should be the default severity for everyone, but I can't really envisage a scenario where you wouldn't want it to be.
@darraghjones what's your opinion on how quickly this alert should fire? For example, pods can sometimes take a while to come up, and presumably there are genuine cases where a PDB is violated simply because it's taking some time for a pod to become ready. So maybe something like 15 minutes is a reasonable delay to ensure that this really is a critical state?
If you have a deployment with a pod disruption budget specifying a min available of 1, for example, k8s should not reschedule any pods in this deployment unless there is at least one healthy pod running. So the fact that a pod can take some time before becoming ready should not cause the PDB to be violated. AFAIK, the PDB should only become violated due to involuntary disruptions, e.g. hardware failure, or during 'intentional' voluntary disruptions, such as someone manually scaling the deployment to 0. Given this, it makes sense to me for the alert to fire 'immediately'.
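For concreteness, a minimal PDB of the shape being discussed might look like the sketch below; the resource and label names are hypothetical, and only the `minAvailable: 1` setting reflects the example above.

```yaml
# Hypothetical example manifest (names are illustrative).
# A PDB like this only blocks voluntary disruptions (drains, API-initiated
# evictions); involuntary disruptions such as node or hardware failure can
# still leave it violated.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
```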
What if you then decide to increase the existing PDB min available from 1 to 2 and the 2nd pod takes 5-10 minutes to become healthy?
I don't think increasing the min available will actually cause k8s to create a new pod; you should first increase the number of replicas. The PDB will only prevent voluntary disruptions.
Discussed this internally and there's some concern about this not being trivial in the general case. In particular, whilst this may be straightforward for stable workloads and PDBs set using integers, it may be difficult to write an alert for workloads that use a Horizontal Pod Autoscaler and a PDB using percentages. To counter the above, I had wondered about excluding HPA workloads, but this is also apparently difficult.
My original suggestion was to alert if:
These metrics refer to the current and desired number of healthy pods, respectively. They will be integers after having taken into account the PDB spec and the number of replicas set by an HPA, for example.
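To make that suggestion concrete, here is a rough sketch of what such a rule could look like, assuming the PDB metrics exposed by kube-state-metrics (`kube_poddisruptionbudget_status_current_healthy` and `kube_poddisruptionbudget_status_desired_healthy`); the alert name, `for` duration, and severity are placeholders, not the final rule.

```yaml
groups:
  - name: poddisruptionbudget.rules
    rules:
      # Sketch only: fires when a PDB reports fewer healthy pods than it desires.
      # The metric names come from kube-state-metrics; the alert name, the `for`
      # duration and the severity label are illustrative placeholders.
      - alert: KubePdbNotEnoughHealthyPods
        expr: |
          kube_poddisruptionbudget_status_desired_healthy
            - kube_poddisruptionbudget_status_current_healthy > 0
        for: 15m  # the configurable delay discussed above; could also be much shorter
        labels:
          severity: critical
        annotations:
          summary: >-
            PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} has fewer
            healthy pods than desired.
```

Whether the `for` duration should be near-zero (as argued above) or something like 15 minutes is exactly the open question in this thread, which is why making it configurable seems sensible.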
Going to trial this internally first to ensure it's not too noisy; from a quick glance it may need a configurable
What's the general idea for the enhancement?
I'm planning to start using PodDisruptionBudgets to ensure there is always at least one healthy instance of my deployments. However, I've noticed that this project does not appear to have any PDB-related alerts.
Would it make sense to alert on something like:
Apologies if I've missed or mistaken something
What parts of the codebase does the enhancement target?
Alerts
Anything else relevant to the enhancement that would help with the triage process?
No response