-
Notifications
You must be signed in to change notification settings - Fork 10.1k
Add e2e test to reproduce issue #19406 #19608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @miancheng7. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
8f102c6
to
320b1b0
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
... and 31 files with indirect coverage changes @@ Coverage Diff @@
## main #19608 +/- ##
=======================================
Coverage 68.76% 68.77%
=======================================
Files 421 421
Lines 35897 35857 -40
=======================================
- Hits 24686 24661 -25
+ Misses 9783 9765 -18
- Partials 1428 1431 +3 Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
/ok-to-test |
t.Cleanup(func() { require.NoError(t, clus.Stop()) }) | ||
|
||
// produce some data | ||
cli := newClient(t, clus.EndpointsGRPC(), e2e.ClientConfig{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can reuse the v3 client in generateTrafficAndVerifyPutLatency
?
320b1b0
to
16dc8de
Compare
8c4d3fc
to
8563cb0
Compare
8563cb0
to
9507b5d
Compare
cc @ahrtr @serathius @fuweid for a stamp of approval, thanks! |
Hi @ahrtr @serathius @fuweid , can you help review? Thanks in advance |
e203299
to
53b185f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks!
@@ -63,6 +64,7 @@ func (s *store) scheduleCompaction(compactMainRev, prevCompactRev int64) (KeyVal | |||
// gofail: var compactBeforeSetFinishedCompact struct{} | |||
UnsafeSetFinishedCompact(tx, compactMainRev) | |||
tx.Unlock() | |||
dbCompactionPauseMs.Observe(float64(time.Since(start) / time.Millisecond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This metric was introduced in 2015 - b5838edb93.
And in this patch - #11034, the write buffer is explicitly forced into bbolt, except for the last round, which skips the flush and allows the backend to commit the buffer in the background. However, LockOutsideApply() still holds the lock. Therefore, I believe we should record the pause duration for the last round, even if it doesn't flush any data.
tests/e2e/reproduce_19406_test.go
Outdated
// make an http request to fetch all Prometheus metrics | ||
url := httpEndpoint + "/metrics" | ||
resp, err := http.Get(url) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requires.NoErrorf(t, err)
tests/e2e/reproduce_19406_test.go
Outdated
b, err := io.ReadAll(resp.Body) | ||
resp.Body.Close() | ||
if err != nil { | ||
t.Fatalf("fetch error: reading %s: %v", url, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requires.NoErrorf(t, err, "failed to read %s", url)
53b185f
to
7dc1bd2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I do not see much value of this e2e test, also a little hard to understand. Essentially, it's just testing golang's time.After. But I am not strongly against it.
We should spend more effort on the real performance and scalability test.
@@ -63,6 +64,7 @@ func (s *store) scheduleCompaction(compactMainRev, prevCompactRev int64) (KeyVal | |||
// gofail: var compactBeforeSetFinishedCompact struct{} | |||
UnsafeSetFinishedCompact(tx, compactMainRev) | |||
tx.Unlock() | |||
dbCompactionPauseMs.Observe(float64(time.Since(start) / time.Millisecond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a separate minor bug fix. The dbCompactionPauseMs
won't have any data if each time the number of keys to be compacted is less than batchNum
.
Can we fix this separately ?
tests/e2e/reproduce_19406_test.go
Outdated
expectSleepInterval, actualSleepInterval) | ||
} | ||
|
||
func GetEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only used in this test, why export it?
func GetEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) { | |
func getEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) { |
tests/e2e/reproduce_19406_test.go
Outdated
// make an http request to fetch all Prometheus metrics | ||
url := httpEndpoint + "/metrics" | ||
resp, err := http.Get(url) | ||
require.NoErrorf(t, err, "failed to open url %s", url) | ||
b, err := io.ReadAll(resp.Body) | ||
resp.Body.Close() | ||
require.NoErrorf(t, err, "failed to read %s", url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider to move getMetrics
to e2e/framework, and reuse it.
etcd/tests/e2e/metrics_test.go
Lines 344 to 359 in ddeaba7
func getMetrics(metricsURL string) (map[string]*dto.MetricFamily, error) { | |
httpClient := http.Client{Transport: &http.Transport{}} | |
resp, err := httpClient.Get(metricsURL) | |
if err != nil { | |
return nil, err | |
} | |
defer resp.Body.Close() | |
data, err := io.ReadAll(resp.Body) | |
if err != nil { | |
return nil, err | |
} | |
var parser expfmt.TextParser | |
return parser.TextToMetricFamilies(bytes.NewReader(data)) | |
} |
+1 to that, we should have latency SLO that is up-kept during compact instead of testing time.After. |
I think it's not against to introduce performance test. It's not testing time.After. It's to verify that we should pause and handle compaction batch by batch. Even if it was simple refactor on pausing, it's easy to make mistakes. Besides, this project has multiple test case to verify schedule compaction, like hourly periodic job. I don't think it's to test time standard package. Performance regression result could be deceptive, it maybe be caused by external factors, like slow disk or noise neighbor. Like this issue, we still need to put a lot of effort to narrow down root cause. I think this test is to prevent regression from refactor. Maybe current solution is not ideal. But metric is available tool we can use. Anyway, I don't think it's to test time.After. |
Signed-off-by: Miancheng Lin <[email protected]>
7dc1bd2
to
03839c9
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr, chaochn47, fuweid, miancheng7 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
PR Description
This is a follow-up to issue #19406.
The fix in #19405 has been merged, and this PR introduces an e2e test to reproduce the issue. The goal is to ensure that the test will catch the problem if it resurfaces in the future.
Testing
I have tested this locally and confirmed that the test:
✅ Passes consistently with the fix applied.
❌ Fails reliably if the fix in #19405 is rolled back.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.