fix(server/etcdserver/api/v3compactor): fix flaky TestPeriodicHourly and others with Retry Mechanism #19748
…with retry mechanism

This commit fixes the flaky TestPeriodicHourly test (issue etcd-io#19499) and similar test cases by introducing an exponential backoff retry mechanism for waiting on actions rather than relying on a fixed timeout.

The root cause of the flakiness was timing inconsistencies in CI environments, where actions sometimes take slightly longer than the expected 10ms timeout set when creating a compactable object.

The new waitWithRetry function attempts to get the expected actions multiple times with exponentially increasing wait periods (10ms, 20ms, 40ms, etc.) before ultimately failing.

The solution was verified by reproducing the issue with an artificial delay and stress-testing the fix with 1000 consecutive runs.

The following test functions were updated to use the new retry mechanism:
- TestPeriodicHourly
- TestPeriodicMinutes
- TestPeriodicPause
- TestPeriodicSkipRevNotChange

Fixes etcd-io#19499

Signed-off-by: amosehiguese <[email protected]>
Hi @amosehiguese. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files: see 25 files with indirect coverage changes.

@@            Coverage Diff             @@
##             main   #19748      +/-   ##
==========================================
- Coverage   68.80%   68.72%   -0.09%
==========================================
  Files         421      421
  Lines       35857    35857
==========================================
- Hits        24672    24642      -30
- Misses       9752     9780      +28
- Partials     1433     1435       +2
Thanks
cc @fuweid
}
// Exponential backoff
backoffTime := time.Duration(10*(1<<retry)) * time.Millisecond
t.Logf("Retry %d: waiting %v before next attempt (last error: %v)", retry+1, backoffTime, lastErr)
nit: %s is more human readable
Suggested change:
- t.Logf("Retry %d: waiting %v before next attempt (last error: %v)", retry+1, backoffTime, lastErr)
+ t.Logf("Retry %d: waiting %s before next attempt (last error: %v)", retry+1, backoffTime, lastErr)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, amosehiguese
You are welcome @ahrtr and thanks for the correction. I guess I can move on to other issues. 😊