[Enhancement]: Add alerts for PodDisruptionBudgets #1028

Open
darraghjones opened this issue Feb 10, 2025 · 11 comments · May be fixed by #1045
Labels: enhancement (New feature or request), keepalive (Use to prevent automatic closing)

Comments

darraghjones commented Feb 10, 2025

What's the general idea for the enhancement?

I'm planning to start using PodDisruptionBudgets to ensure there is always at least one healthy instance of my deployments. However, I've noticed that this project does not appear to have any PDB-related alerts.

Would it make sense to alert on something like:

kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy
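Wrapped in a Prometheus alerting rule, this might look roughly like the following. This is only a sketch: the alert name, pending duration, severity and annotation text are placeholders, not a concrete proposal.

    - alert: KubePodDisruptionBudgetViolated    # placeholder name
      expr: |
        kube_poddisruptionbudget_status_current_healthy
          <
        kube_poddisruptionbudget_status_desired_healthy
      for: 5m                 # placeholder pending duration
      labels:
        severity: warning     # placeholder severity
      annotations:
        summary: PodDisruptionBudget has fewer healthy pods than it needs.
        description: PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} is below its desired number of healthy pods.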

Apologies if I've missed or mistaken something.

What parts of the codebase does the enhancement target?

Alerts

Anything else relevant to the enhancement that would help with the triage process?

No response

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this proposal applies to the default branch of the repository, as of the latest commit at the time of submission.
skl (Collaborator) commented Feb 11, 2025

Hi @darraghjones, great idea! It makes sense to me to have an alert with behavior specific to PDBs. There are quite a few existing alerts around pod and workload health, so I'd be interested to know if any of the following would already cover your particular use case?

  • KubeContainerWaiting
  • KubeDaemonSetMisScheduled
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetRolloutStuck
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeDeploymentRolloutStuck
  • KubeHpaMaxedOut
  • KubeHpaReplicasMismatch
  • KubePodCrashLooping
  • KubePodNotReady
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetUpdateNotRolledOut

What I'm interested in is what a PDB alert should (or shouldn't) look for, given the context of the existing alerts above. Please see this as an open discussion; all ideas are welcome!

skl added the enhancement (New feature or request) and question (Further information is requested) labels on Feb 11, 2025
darraghjones (Author) commented Feb 11, 2025

Hi.

We are using these alerts, and during a recent incident a number of them did fire. I believe it was these ones:
KubeContainerWaiting
KubeDeploymentReplicasMismatch
KubePodCrashLooping

The issue was that we only treat these as P3 alerts, and as such did not escalate them out of hours. I do not necessarily think they should be P1 alerts, because in many cases we would have a number of other replicas running, so one being down isn't a big deal.

However, in this recent incident we were only running a single replica. My feeling is that an alert (or alerts) specifically around PDBs could be treated as P1s.

Hopefully this gives some context. Happy to delve deeper if need be.

skl (Collaborator) commented Feb 11, 2025

Thanks for the additional context. Which severity do you have in mind for a PDB alert based on the below:

... and do you think that would be the default severity for everyone? I know in some cases we have a default lower priority and then that can be overridden to a higher severity using the config.

darraghjones (Author) commented:

From my own perspective, I feel the severity of such an alert should be Critical: at least from my own understanding, Kubernetes should be trying really hard to honor PodDisruptionBudgets, and if they are violated then something has likely gone quite badly wrong.

Of course, it's hard to say if that should be the default severity for everyone, but I can't really envisage a scenario where you wouldn't want it to be.

skl (Collaborator) commented Feb 13, 2025

@darraghjones what's your opinion on how quickly this alert should fire? For example, pods can sometimes take a while to come up, presumably there are genuine cases where a PDB is violated simply because it's taking some time for a pod to become ready - so, maybe something like 15 minutes is a reasonable delay to ensure that this really is a critical state?

darraghjones (Author) commented:

If you have a deployment with a pod disruption budget specifying a min available of 1, for example, k8s should not evict any pods in this deployment unless there is at least one healthy pod running. So the fact that a pod can take some time before becoming ready should not cause the PDB to be violated.

AFAIK, the PDB should only become violated due to involuntary disruptions (e.g. hardware failure), or during 'intentional' voluntary disruptions, such as someone manually scaling the deployment to 0.

Given this, it makes sense to me for the alert to fire 'immediately'.
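For reference, the kind of PDB described above would be a standard manifest along these lines (the name and selector are illustrative only):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb        # illustrative name
    spec:
      minAvailable: 1         # keep at least one pod available during voluntary disruptions
      selector:
        matchLabels:
          app: my-app         # illustrative label; must match the deployment's pods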

skl (Collaborator) commented Feb 14, 2025

What about if you then decide to increase the existing PDB min available from 1 to 2 and the 2nd pod takes 5-10 minutes to become healthy?

darraghjones (Author) commented:

I don't think increasing the min available will actually cause k8s to create a new pod; you should first increase the number of replicas. The PDB will only prevent voluntary disruptions.

skl added the keepalive (Use to prevent automatic closing) label on Mar 11, 2025
skl (Collaborator) commented Mar 24, 2025

Discussed this internally and there's some concern about this not being trivial in the general case.

In particular, whilst this may be straightforward for stable workloads and PDBs set using integers, it may be difficult to write an alert for workloads that use a Horizontal Pod Autoscaler and a PDB set using percentages.

To counter the above, I had wondered about excluding HPA workloads, but this is also apparently difficult.
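For example, a combination along these lines (both manifests illustrative only) is the kind of case that seems hard to reason about statically, because the pod count the percentage resolves to changes as the HPA scales the workload:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb            # illustrative
    spec:
      maxUnavailable: "25%"       # resolves to a different number of pods as replicas change
      selector:
        matchLabels:
          app: my-app
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa            # illustrative
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80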

skl added the help wanted (Extra attention is needed) label and removed the question (Further information is requested) label on Mar 24, 2025
darraghjones (Author) commented:

My original suggestion was to alert if:

kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy

These metrics report the current and desired number of healthy pods respectively. They are integers, already taking into account the PDB spec and, for example, the number of replicas set by an HPA.

skl (Collaborator) commented Mar 25, 2025

Going to trial this internally first, to ensure it's not too noisy - from a quick glance it may need a configurable 'for' value, similar to #989. I'll create a draft PR with the changes whilst this is being evaluated and link it to this issue.
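As a rough way to gauge noisiness before settling on an alert, a PromQL subquery along these lines could show how often the condition would have held per PDB over recent history (a sketch only; the 30d window and 5m resolution are arbitrary):

    count_over_time(
      (
        kube_poddisruptionbudget_status_current_healthy
          <
        kube_poddisruptionbudget_status_desired_healthy
      )[30d:5m]
    )

Each resulting series counts the 5-minute evaluation points over the last 30 days at which that PDB was below its desired number of healthy pods.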

skl self-assigned this on Mar 25, 2025
skl linked a pull request (#1045) on Mar 25, 2025 that will close this issue
skl removed the help wanted (Extra attention is needed) label on Mar 27, 2025