[Enhancement]: Add alerts for PodDisruptionBudgets #1028
Hi @darraghjones, great idea! It makes sense to me to have an alert with behavior specific to PDBs. There are quite a few existing alerts around pod and workload health, so I'd be interested to know if any of the following would already cover your particular use case?
What I'm interested in is what a PDB alert should (or shouldn't) look for, given the context of the existing alerts above. Please see this as an open discussion; all ideas are welcome!
Hi. We are using these alerts, and during a recent incident a number of them did fire. I believe it was these ones: The issue was that we only treat these as P3 alerts, and as such did not escalate them out of hours. I do not necessarily think they should be P1 alerts, because in many cases we would have a number of other replicas running, so one being down isn't a big deal. However, in this recent incident we were only running a single replica. My feeling is that an alert (or alerts) specifically around PDBs could be treated as P1s. Hopefully this gives some context. Happy to delve deeper if need be.
Thanks for the additional context. Which severity do you have in mind for a PDB alert based on the below: ... and do you think that would be the default severity for everyone? I know in some cases we have a default lower priority, and that can then be overridden to a higher severity using the config.
From my own perspective, I feel like the severity of such an alert should be Critical: at least from my own understanding, Kubernetes should be trying really hard to honor PodDisruptionBudgets, and if they are violated, then something has likely gone quite badly wrong. Of course, it's hard to say whether that should be the default severity for everyone, but I can't really envisage a scenario where you wouldn't want it to be.
@darraghjones what's your opinion on how quickly this alert should fire? For example, pods can sometimes take a while to come up, and presumably there are genuine cases where a PDB is violated simply because it's taking some time for a pod to become ready. So maybe something like 15 minutes is a reasonable delay to ensure that this really is a critical state?
If you have a deployment with a pod disruption budget specifying a min available of 1, for example, k8s should not reschedule any pods in this deployment unless there is at least one healthy pod running. So the fact that a pod can take some time before becoming ready should not cause the PDB to be violated. AFAIK, the PDB should only become violated due to involuntary disruptions, e.g. hardware failure, or during 'intentional' voluntary disruptions, such as someone manually scaling the deployment to 0. Given this, it makes sense to me for the alert to fire 'immediately'.
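For concreteness, a minimal PDB of the shape being discussed might look like the sketch below; the resource and label names are hypothetical, and only the `minAvailable: 1` setting reflects the example above.

```yaml
# Hypothetical example manifest (names are illustrative).
# A PDB like this only blocks voluntary disruptions (drains, API-initiated
# evictions); involuntary disruptions such as node or hardware failure can
# still leave it violated.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
```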
What if you then decide to increase the existing PDB min available from 1 to 2 and the 2nd pod takes 5-10 minutes to become healthy?
I don't think increasing the min available will actually cause k8s to create a new pod; you should first increase the number of replicas. The PDB will only prevent voluntary disruptions.
Discussed this internally and there's some concern about this not being trivial in the general case. In particular, whilst this may be straightforward for stable workloads and PDBs set using integers, it may be difficult to write an alert for workloads that use a Horizontal Pod Autoscaler and a PDB using percentages. To counter the above, I had wondered about excluding HPA workloads, but this is also apparently difficult.
My original suggestion was to alert if:
These metrics refer to the current and desired number of healthy pods, respectively. They will be integers after having taken into account the PDB spec and the number of replicas set by an HPA, for example.
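To make that suggestion concrete, here is a rough sketch of what such a rule could look like, assuming the PDB metrics exposed by kube-state-metrics (`kube_poddisruptionbudget_status_current_healthy` and `kube_poddisruptionbudget_status_desired_healthy`); the alert name, `for` duration, and severity are placeholders, not the final rule.

```yaml
groups:
  - name: poddisruptionbudget.rules
    rules:
      # Sketch only: fires when a PDB reports fewer healthy pods than it desires.
      # The metric names come from kube-state-metrics; the alert name, the `for`
      # duration and the severity label are illustrative placeholders.
      - alert: KubePdbNotEnoughHealthyPods
        expr: |
          kube_poddisruptionbudget_status_desired_healthy
            - kube_poddisruptionbudget_status_current_healthy > 0
        for: 15m  # the configurable delay discussed above; could also be much shorter
        labels:
          severity: critical
        annotations:
          summary: >-
            PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} has fewer
            healthy pods than desired.
```

Whether the `for` duration should be near-zero (as argued above) or something like 15 minutes is exactly the open question in this thread, which is why making it configurable seems sensible.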
Going to trial this internally first to ensure it's not too noisy; from a quick glance it may need a configurable
What's the general idea for the enhancement?
I'm planning to start using PodDisruptionBudgets to ensure there is always at least one healthy instance of my deployments. However, I've noticed that this project does not appear to have any PDB-related alerts.
Would it make sense to alert on something like:
Apologies if I've missed or mistaken something
What parts of the codebase does the enhancement target?
Alerts
Anything else relevant to the enhancement that would help with the triage process?
No response