Skip to content

fix(server/etcdserver/api/v3compactor): fix flaky TestPeriodicHourly and others with Retry Mechanism #19748

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

amosehiguese
Copy link
Contributor

This commit fixes the flaky TestPeriodicHourly test (issue #19499) and similar test cases by introducing an exponential backoff retry mechanism for waiting on actions rather than relying on a fixed timeout.

The root cause of the flakiness was timing inconsistencies in CI environments, where actions sometimes take slightly longer than the expected 10ms timeout set when creating a compactable object.

The new waitWithRetry function attempts to get the expected actions multiple times with exponentially increasing wait periods (10ms, 20ms, 40ms, etc.) before ultimately failing.

The solution was verified by reproducing the issue with an artificial delay and stress-testing the fix with 1000 consecutive runs.

The following test functions were updated to use the new retry mechanism:

  • TestPeriodicHourly
  • TestPeriodicMinutes
  • TestPeriodicPause
  • TestPeriodicSkipRevNotChange

Fixes #19499

test

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

…with retry mechanism

This commit fixes the flaky TestPeriodicHourly test (issue etcd-io#19499) and similar
test cases by introducing an exponential backoff retry mechanism for waiting
on actions rather than relying on a fixed timeout.

The root cause of the flakiness was timing inconsistencies in CI environments,
where actions sometimes take slightly longer than the expected 10ms timeout
set when creating a compactable object.

The new waitWithRetry function attempts to get the expected actions multiple
times with exponentially increasing wait periods (10ms, 20ms, 40ms, etc.)
before ultimately failing.

The solution was verified by reproducing the issue with an artificial
delay and stress-testing the fix with 1000 consecutive runs.

The following test functions were updated to use the new retry mechanism:
- TestPeriodicHourly
- TestPeriodicMinutes
- TestPeriodicPause
- TestPeriodicSkipRevNotChange

Fixes etcd-io#19499

Signed-off-by: amosehiguese <[email protected]>
@k8s-ci-robot
Copy link

Hi @amosehiguese. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@siyuanfoundation
Copy link
Contributor

/ok-to-test

Copy link

codecov bot commented Apr 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.72%. Comparing base (d3cebae) to head (650232c).
Report is 60 commits behind head on main.

Additional details and impacted files

see 25 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19748      +/-   ##
==========================================
- Coverage   68.80%   68.72%   -0.09%     
==========================================
  Files         421      421              
  Lines       35857    35857              
==========================================
- Hits        24672    24642      -30     
- Misses       9752     9780      +28     
- Partials     1433     1435       +2     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3cebae...650232c. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

cc @fuweid

}
// Exponential backoff
backoffTime := time.Duration(10*(1<<retry)) * time.Millisecond
t.Logf("Retry %d: waiting %v before next attempt (last error: %v)", retry+1, backoffTime, lastErr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: %s is more human readable

Suggested change
t.Logf("Retry %d: waiting %v before next attempt (last error: %v)", retry+1, backoffTime, lastErr)
t.Logf("Retry %d: waiting %s before next attempt (last error: %v)", retry+1, backoffTime, lastErr)

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, amosehiguese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@amosehiguese
Copy link
Contributor Author

You are welcome @ahrtr and thanks for the correction. I guess I can move on to other issues. 😊

@ahrtr ahrtr requested review from ivanvc and fuweid April 17, 2025 12:23
@ahrtr ahrtr merged commit 51e38f4 into etcd-io:main Apr 29, 2025
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Flaky go.etcd.io/etcd/server/v3/etcdserver/api/v3compactor: TestPeriodicHourly
4 participants