Skip to content

Add e2e test to reproduce issue #19406 #19608

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 26, 2025

Conversation

miancheng7
Copy link
Contributor

@miancheng7 miancheng7 commented Mar 15, 2025

PR Description

This is a follow-up to issue #19406.

The fix in #19405 has been merged, and this PR introduces an e2e test to reproduce the issue. The goal is to ensure that the test will catch the problem if it resurfaces in the future.

Testing

I have tested this locally and confirmed that the test:
Passes consistently with the fix applied.
Fails reliably if the fix in #19405 is rolled back.

% go test -v -run TestReproduce19406
=== RUN   TestReproduce19406
    testing.go:25: Changing working directory to: /tmp/TestReproduce194063444741838/001
    logger.go:146: 2025-03-14T23:56:18.083Z	INFO	starting server...	{"name": "TestReproduce19406-test-0"}
......
    logger.go:146: 2025-03-14T23:56:29.739Z	INFO	stopping server...	{"name": "TestReproduce19406-test-0"}
    logger.go:146: 2025-03-14T23:56:29.777Z	INFO	stopped server.	{"name": "TestReproduce19406-test-0"}
--- PASS: TestReproduce19406 (11.70s)
PASS
ok  	go.etcd.io/etcd/tests/v3/e2e	11.714s


# rollback the fix and run it again
% go test -v -run TestReproduce19406
=== RUN   TestReproduce19406
    testing.go:25: Changing working directory to: /tmp/TestReproduce194064120080253/001
......
    logger.go:146: 2025-03-14T23:54:38.079Z	INFO	started server.	{"name": "TestReproduce19406-test-0", "pid": 30796}
    reproduce_19406_test.go:57: start compaction...
    reproduce_19406_test.go:99: Test failed: put latency is larger than 100ms
    reproduce_19406_test.go:99: Test failed: put latency is larger than 100ms
......
    logger.go:146: 2025-03-14T23:54:43.592Z	INFO	stopping server...	{"name": "TestReproduce19406-test-0"}
    logger.go:146: 2025-03-14T23:54:43.605Z	INFO	stopped server.	{"name": "TestReproduce19406-test-0"}
--- FAIL: TestReproduce19406 (6.56s)
FAIL
exit status 1
FAIL	go.etcd.io/etcd/tests/v3/e2e	6.579s

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@k8s-ci-robot
Copy link

Hi @miancheng7. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

codecov bot commented Mar 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.77%. Comparing base (5122d43) to head (03839c9).
Report is 160 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
server/storage/mvcc/kvstore_compaction.go 100.00% <ø> (ø)

... and 31 files with indirect coverage changes

@@           Coverage Diff           @@
##             main   #19608   +/-   ##
=======================================
  Coverage   68.76%   68.77%           
=======================================
  Files         421      421           
  Lines       35897    35857   -40     
=======================================
- Hits        24686    24661   -25     
+ Misses       9783     9765   -18     
- Partials     1428     1431    +3     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5122d43...03839c9. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chaochn47
Copy link
Member

/ok-to-test

t.Cleanup(func() { require.NoError(t, clus.Stop()) })

// produce some data
cli := newClient(t, clus.EndpointsGRPC(), e2e.ClientConfig{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can reuse the v3 client in generateTrafficAndVerifyPutLatency?

@miancheng7 miancheng7 force-pushed the e2etestforissue19406 branch from 320b1b0 to 16dc8de Compare March 15, 2025 16:41
@miancheng7 miancheng7 force-pushed the e2etestforissue19406 branch 2 times, most recently from 8c4d3fc to 8563cb0 Compare March 16, 2025 00:38
@miancheng7 miancheng7 force-pushed the e2etestforissue19406 branch from 8563cb0 to 9507b5d Compare March 16, 2025 00:51
@chaochn47
Copy link
Member

cc @ahrtr @serathius @fuweid for a stamp of approval, thanks!

@miancheng7
Copy link
Contributor Author

Hi @ahrtr @serathius @fuweid , can you help review? Thanks in advance

@miancheng7 miancheng7 force-pushed the e2etestforissue19406 branch 2 times, most recently from e203299 to 53b185f Compare April 14, 2025 14:50
Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks!

@@ -63,6 +64,7 @@ func (s *store) scheduleCompaction(compactMainRev, prevCompactRev int64) (KeyVal
// gofail: var compactBeforeSetFinishedCompact struct{}
UnsafeSetFinishedCompact(tx, compactMainRev)
tx.Unlock()
dbCompactionPauseMs.Observe(float64(time.Since(start) / time.Millisecond))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric was introduced in 2015 - b5838edb93.
And in this patch - #11034, the write buffer is explicitly forced into bbolt, except for the last round, which skips the flush and allows the backend to commit the buffer in the background. However, LockOutsideApply() still holds the lock. Therefore, I believe we should record the pause duration for the last round, even if it doesn't flush any data.

// make an http request to fetch all Prometheus metrics
url := httpEndpoint + "/metrics"
resp, err := http.Get(url)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires.NoErrorf(t, err)

b, err := io.ReadAll(resp.Body)
resp.Body.Close()
if err != nil {
t.Fatalf("fetch error: reading %s: %v", url, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires.NoErrorf(t, err, "failed to read %s", url)

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I do not see much value of this e2e test, also a little hard to understand. Essentially, it's just testing golang's time.After. But I am not strongly against it.

We should spend more effort on the real performance and scalability test.

@@ -63,6 +64,7 @@ func (s *store) scheduleCompaction(compactMainRev, prevCompactRev int64) (KeyVal
// gofail: var compactBeforeSetFinishedCompact struct{}
UnsafeSetFinishedCompact(tx, compactMainRev)
tx.Unlock()
dbCompactionPauseMs.Observe(float64(time.Since(start) / time.Millisecond))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a separate minor bug fix. The dbCompactionPauseMs won't have any data if each time the number of keys to be compacted is less than batchNum.

Can we fix this separately ?

expectSleepInterval, actualSleepInterval)
}

func GetEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only used in this test, why export it?

Suggested change
func GetEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) {
func getEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) {

Comment on lines 93 to 99
// make an http request to fetch all Prometheus metrics
url := httpEndpoint + "/metrics"
resp, err := http.Get(url)
require.NoErrorf(t, err, "failed to open url %s", url)
b, err := io.ReadAll(resp.Body)
resp.Body.Close()
require.NoErrorf(t, err, "failed to read %s", url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider to move getMetrics to e2e/framework, and reuse it.

func getMetrics(metricsURL string) (map[string]*dto.MetricFamily, error) {
httpClient := http.Client{Transport: &http.Transport{}}
resp, err := httpClient.Get(metricsURL)
if err != nil {
return nil, err
}
defer resp.Body.Close()
data, err := io.ReadAll(resp.Body)
if err != nil {
return nil, err
}
var parser expfmt.TextParser
return parser.TextToMetricFamilies(bytes.NewReader(data))
}

@serathius
Copy link
Member

We should spend more effort on the real performance and scalability test.

+1 to that, we should have latency SLO that is up-kept during compact instead of testing time.After.

@fuweid
Copy link
Member

fuweid commented Apr 17, 2025

I think it's not against to introduce performance test. It's not testing time.After. It's to verify that we should pause and handle compaction batch by batch. Even if it was simple refactor on pausing, it's easy to make mistakes. Besides, this project has multiple test case to verify schedule compaction, like hourly periodic job. I don't think it's to test time standard package.

Performance regression result could be deceptive, it maybe be caused by external factors, like slow disk or noise neighbor. Like this issue, we still need to put a lot of effort to narrow down root cause. I think this test is to prevent regression from refactor. Maybe current solution is not ideal. But metric is available tool we can use. Anyway, I don't think it's to test time.After.

@miancheng7 miancheng7 force-pushed the e2etestforissue19406 branch from 7dc1bd2 to 03839c9 Compare April 18, 2025 04:04
@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, chaochn47, fuweid, miancheng7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fuweid fuweid merged commit aa8238f into etcd-io:main Apr 26, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants