Add e2e test to reproduce issue #19406 #19608

miancheng7 · 2025-03-15T00:20:57Z

PR Description

This is a follow-up to issue #19406.

The fix in #19405 has been merged, and this PR introduces an e2e test to reproduce the issue. The goal is to ensure that the test will catch the problem if it resurfaces in the future.

Testing

I have tested this locally and confirmed that the test:
✅ Passes consistently with the fix applied.
❌ Fails reliably if the fix in #19405 is rolled back.

% go test -v -run TestReproduce19406
=== RUN   TestReproduce19406
    testing.go:25: Changing working directory to: /tmp/TestReproduce194063444741838/001
    logger.go:146: 2025-03-14T23:56:18.083Z	INFO	starting server...	{"name": "TestReproduce19406-test-0"}
......
    logger.go:146: 2025-03-14T23:56:29.739Z	INFO	stopping server...	{"name": "TestReproduce19406-test-0"}
    logger.go:146: 2025-03-14T23:56:29.777Z	INFO	stopped server.	{"name": "TestReproduce19406-test-0"}
--- PASS: TestReproduce19406 (11.70s)
PASS
ok  	go.etcd.io/etcd/tests/v3/e2e	11.714s


# rollback the fix and run it again
% go test -v -run TestReproduce19406
=== RUN   TestReproduce19406
    testing.go:25: Changing working directory to: /tmp/TestReproduce194064120080253/001
......
    logger.go:146: 2025-03-14T23:54:38.079Z	INFO	started server.	{"name": "TestReproduce19406-test-0", "pid": 30796}
    reproduce_19406_test.go:57: start compaction...
    reproduce_19406_test.go:99: Test failed: put latency is larger than 100ms
    reproduce_19406_test.go:99: Test failed: put latency is larger than 100ms
......
    logger.go:146: 2025-03-14T23:54:43.592Z	INFO	stopping server...	{"name": "TestReproduce19406-test-0"}
    logger.go:146: 2025-03-14T23:54:43.605Z	INFO	stopped server.	{"name": "TestReproduce19406-test-0"}
--- FAIL: TestReproduce19406 (6.56s)
FAIL
exit status 1
FAIL	go.etcd.io/etcd/tests/v3/e2e	6.579s
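
The failing run above reports "put latency is larger than 100ms". As a rough illustration of that kind of check, here is a minimal sketch that times each write and counts the ones over a latency budget. The names `putFunc`, `sleepyPut`, and `countSlowPuts` are hypothetical and are not taken from the test in this PR; a real test would call an etcd client instead of a simulated write.

```go
package main

import (
	"fmt"
	"time"
)

// putFunc is a hypothetical stand-in for a single client write.
type putFunc func() error

// sleepyPut returns a putFunc that simulates a write stalled for d.
func sleepyPut(d time.Duration) putFunc {
	return func() error {
		time.Sleep(d)
		return nil
	}
}

// countSlowPuts issues n writes and counts those slower than threshold.
func countSlowPuts(put putFunc, n int, threshold time.Duration) (int, error) {
	slow := 0
	for i := 0; i < n; i++ {
		start := time.Now()
		if err := put(); err != nil {
			return slow, fmt.Errorf("put %d failed: %w", i, err)
		}
		if time.Since(start) > threshold {
			slow++
		}
	}
	return slow, nil
}

func main() {
	// A write blocked for 150ms exceeds a 100ms budget on every attempt.
	slow, err := countSlowPuts(sleepyPut(150*time.Millisecond), 3, 100*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d of 3 puts exceeded the latency budget\n", slow)
}
```

A test built this way would fail as soon as the count is non-zero, which is the behavior the rolled-back run shows.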

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

k8s-ci-robot · 2025-03-15T00:21:07Z

Hi @miancheng7. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

codecov · 2025-03-15T00:39:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.77%. Comparing base (5122d43) to head (03839c9).
Report is 160 commits behind head on main.

Additional details and impacted files

Files with missing lines	Coverage Δ
server/storage/mvcc/kvstore_compaction.go	`100.00% <ø> (ø)`

... and 31 files with indirect coverage changes

@@           Coverage Diff           @@
##             main   #19608   +/-   ##
=======================================
  Coverage   68.76%   68.77%           
=======================================
  Files         421      421           
  Lines       35897    35857   -40     
=======================================
- Hits        24686    24661   -25     
+ Misses       9783     9765   -18     
- Partials     1428     1431    +3

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5122d43...03839c9. Read the comment docs.

chaochn47 · 2025-03-15T01:19:20Z

/ok-to-test

chaochn47 · 2025-03-15T01:26:20Z

tests/e2e/reproduce_19406_test.go

+	t.Cleanup(func() { require.NoError(t, clus.Stop()) })
+
+	// produce some data
+	cli := newClient(t, clus.EndpointsGRPC(), e2e.ClientConfig{})


Can we reuse the v3 client in generateTrafficAndVerifyPutLatency?

chaochn47 · 2025-03-17T16:35:56Z

cc @ahrtr @serathius @fuweid for a stamp of approval, thanks!

miancheng7 · 2025-04-01T22:05:53Z

Hi @ahrtr @serathius @fuweid, can you help review? Thanks in advance.

fuweid

LGTM

Thanks!

fuweid · 2025-04-14T19:34:44Z

server/storage/mvcc/kvstore_compaction.go

@@ -63,6 +64,7 @@ func (s *store) scheduleCompaction(compactMainRev, prevCompactRev int64) (KeyVal
 			// gofail: var compactBeforeSetFinishedCompact struct{}
 			UnsafeSetFinishedCompact(tx, compactMainRev)
 			tx.Unlock()
+			dbCompactionPauseMs.Observe(float64(time.Since(start) / time.Millisecond))


This metric was introduced in 2015 (b5838edb93).
In #11034, the write buffer is explicitly flushed into bbolt on every round except the last, which skips the flush and lets the backend commit the buffer in the background. However, LockOutsideApply() still holds the lock during that last round. Therefore, I believe we should record the pause duration for the last round, even if it doesn't flush any data.
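
To make the point about the last round concrete, here is a simplified model of a batched loop that records the lock-hold time on every round, including the final one. This is only a sketch under stated assumptions: `compactInBatches`, `countObservations`, and the `observe` callback are hypothetical names, and the loop is not etcd's scheduleCompaction.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// compactInBatches is a simplified model of a batched compaction loop.
// All names here are hypothetical; this is not etcd's code.
// It assumes batchNum > 0.
func compactInBatches(keys, batchNum int, sleep time.Duration, mu *sync.Mutex, observe func(time.Duration)) {
	remaining := keys
	for {
		start := time.Now()
		mu.Lock()
		n := batchNum
		if remaining < n {
			n = remaining
		}
		remaining -= n // pretend to delete n keys while holding the lock
		last := remaining == 0
		mu.Unlock()

		// Record the pause on every round, including the final one,
		// because writers were blocked for that long either way.
		observe(time.Since(start))
		if last {
			return
		}
		time.Sleep(sleep) // give blocked writers a chance to run
	}
}

// countObservations reports how many pause samples a run produces.
func countObservations(keys, batchNum int) int {
	var mu sync.Mutex
	count := 0
	compactInBatches(keys, batchNum, 0, &mu, func(time.Duration) { count++ })
	return count
}

func main() {
	fmt.Println("samples for 50 keys, batch 100:", countObservations(50, 100))
	fmt.Println("samples for 250 keys, batch 100:", countObservations(250, 100))
}
```

In this model a run with fewer keys than one batch still yields one sample. If the observation were made only on non-final rounds, such a run would yield none.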

fuweid · 2025-04-14T19:37:40Z

tests/e2e/reproduce_19406_test.go

+	// make an http request to fetch all Prometheus metrics
+	url := httpEndpoint + "/metrics"
+	resp, err := http.Get(url)
+	if err != nil {


require.NoError(t, err)

fuweid · 2025-04-14T19:38:24Z

tests/e2e/reproduce_19406_test.go

+	b, err := io.ReadAll(resp.Body)
+	resp.Body.Close()
+	if err != nil {
+		t.Fatalf("fetch error: reading %s: %v", url, err)


require.NoErrorf(t, err, "failed to read %s", url)

ahrtr

Honestly, I do not see much value in this e2e test, and it is also a little hard to understand. Essentially, it's just testing Go's time.After. But I am not strongly against it.

We should spend more effort on the real performance and scalability test.

ahrtr · 2025-04-17T15:35:19Z

server/storage/mvcc/kvstore_compaction.go

@@ -63,6 +64,7 @@ func (s *store) scheduleCompaction(compactMainRev, prevCompactRev int64) (KeyVal
 			// gofail: var compactBeforeSetFinishedCompact struct{}
 			UnsafeSetFinishedCompact(tx, compactMainRev)
 			tx.Unlock()
+			dbCompactionPauseMs.Observe(float64(time.Since(start) / time.Millisecond))


This seems to be a separate minor bug fix. dbCompactionPauseMs won't have any data if the number of keys to be compacted is always less than batchNum.

Can we fix this separately?

ahrtr · 2025-04-17T15:36:41Z

tests/e2e/reproduce_19406_test.go

+		expectSleepInterval, actualSleepInterval)
+}
+
+func GetEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) {


It's only used in this test, why export it?

Suggested change

-func GetEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) {
+func getEtcdCompactionMetrics(t *testing.T, httpEndpoint string) (pauseDuration, totalDuration float64, err error) {

ahrtr · 2025-04-17T15:37:29Z

tests/e2e/reproduce_19406_test.go

+	// make an http request to fetch all Prometheus metrics
+	url := httpEndpoint + "/metrics"
+	resp, err := http.Get(url)
+	require.NoErrorf(t, err, "failed to open url %s", url)
+	b, err := io.ReadAll(resp.Body)
+	resp.Body.Close()
+	require.NoErrorf(t, err, "failed to read %s", url)


Please consider moving getMetrics to e2e/framework and reusing it.

etcd/tests/e2e/metrics_test.go

Lines 344 to 359 in ddeaba7

func getMetrics(metricsURL string) (map[string]*dto.MetricFamily, error) {
	httpClient := http.Client{Transport: &http.Transport{}}
	resp, err := httpClient.Get(metricsURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var parser expfmt.TextParser
	return parser.TextToMetricFamilies(bytes.NewReader(data))
}
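
For readers who want to try the scrape step without the Prometheus client libraries, here is a standard-library-only sketch that reads one unlabeled sample from a /metrics endpoint. `metricValue` and `fakeMetricsServer` are hypothetical helpers, the metric name is an assumption about how dbCompactionPauseMs is exposed, and the served values are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// Assumed exposition name of the pause histogram's sum; verify against a real server.
const pauseSum = "etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_sum"

// metricValue fetches metricsURL and returns the first sample whose name
// matches exactly. Labeled samples are not handled in this sketch.
func metricValue(metricsURL, name string) (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			continue // skip HELP and TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == name {
			return strconv.ParseFloat(fields[1], 64)
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("metric %q not found", name)
}

// fakeMetricsServer serves a fixed, made-up metrics page for the demo.
func fakeMetricsServer() *httptest.Server {
	body := "# TYPE etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds histogram\n" +
		pauseSum + " 12\n" +
		"etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_count 3\n"
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, body)
	}))
}

func main() {
	srv := fakeMetricsServer()
	defer srv.Close()
	v, err := metricValue(srv.URL+"/metrics", pauseSum)
	fmt.Println(v, err)
}
```

The getMetrics helper quoted above parses the full exposition format through expfmt, which is the better choice once labels or multiple metric families are involved.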

serathius · 2025-04-17T16:31:26Z

We should spend more effort on the real performance and scalability test.

+1 to that. We should have a latency SLO that is upheld during compaction instead of testing time.After.

fuweid · 2025-04-17T16:52:32Z

I'm not against introducing performance tests. But this is not testing time.After. It verifies that we pause and handle compaction batch by batch. Even a simple refactor of the pausing logic is easy to get wrong. Besides, this project has multiple test cases that verify scheduled compaction, such as the hourly periodic job. I don't think this is testing the time standard package.

Performance regression results can be deceptive: they may be caused by external factors, like a slow disk or a noisy neighbor. As with this issue, we would still need to put in a lot of effort to narrow down the root cause. I think this test is there to prevent regressions from refactoring. Maybe the current solution is not ideal, but the metric is an available tool we can use. Anyway, I don't think it's testing time.After.
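
One way a metrics-based check of batch-by-batch pausing could look is sketched below. It assumes, as an approximation that is not confirmed by this thread, that the time spent sleeping between batches is roughly the total compaction duration minus the accumulated pause duration. The function names and the tolerance are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// sleepMs approximates the time spent sleeping between batches as
// total duration minus lock-hold (pause) duration. This is an assumption.
func sleepMs(totalMs, pauseMs float64) float64 {
	return totalMs - pauseMs
}

// expectedSleepMs assumes one sleep between each pair of consecutive rounds.
func expectedSleepMs(keys, batchNum int, intervalMs float64) float64 {
	if keys <= 0 || batchNum <= 0 {
		return 0
	}
	rounds := (keys + batchNum - 1) / batchNum // ceiling division
	return float64(rounds-1) * intervalMs
}

// withinTolerance reports whether actual is within tolMs of expected.
func withinTolerance(actual, expected, tolMs float64) bool {
	return math.Abs(actual-expected) <= tolMs
}

func main() {
	expected := expectedSleepMs(250, 100, 10) // three rounds, two sleeps
	actual := sleepMs(35, 12)                 // made-up metric values
	fmt.Println(expected, actual, withinTolerance(actual, expected, 5))
}
```

A check like this would flag a refactor that drops the sleep between batches, because the measured sleep time would collapse toward zero while the expected value stays the same.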

Signed-off-by: Miancheng Lin <[email protected]>

k8s-ci-robot · 2025-04-26T00:00:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, chaochn47, fuweid, miancheng7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahrtr,fuweid]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the area/testing label Mar 15, 2025

k8s-ci-robot added needs-ok-to-test size/L labels Mar 15, 2025

miancheng7 force-pushed the e2etestforissue19406 branch from 8f102c6 to 320b1b0 Compare March 15, 2025 00:38

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Mar 15, 2025

chaochn47 reviewed Mar 15, 2025

miancheng7 force-pushed the e2etestforissue19406 branch from 320b1b0 to 16dc8de Compare March 15, 2025 16:41

k8s-ci-robot added size/M and removed size/L labels Mar 15, 2025

miancheng7 force-pushed the e2etestforissue19406 branch 2 times, most recently from 8c4d3fc to 8563cb0 Compare March 16, 2025 00:38

k8s-ci-robot added size/L and removed size/M labels Mar 16, 2025

miancheng7 force-pushed the e2etestforissue19406 branch from 8563cb0 to 9507b5d Compare March 16, 2025 00:51

chaochn47 approved these changes Mar 17, 2025

fuweid reviewed Apr 2, 2025

miancheng7 force-pushed the e2etestforissue19406 branch 2 times, most recently from e203299 to 53b185f Compare April 14, 2025 14:50

fuweid approved these changes Apr 14, 2025

k8s-ci-robot added the approved label Apr 14, 2025

miancheng7 force-pushed the e2etestforissue19406 branch from 53b185f to 7dc1bd2 Compare April 14, 2025 23:32

ahrtr reviewed Apr 17, 2025

Add e2e test to reproduce issue etcd-io#19406

03839c9

Signed-off-by: Miancheng Lin <[email protected]>

miancheng7 force-pushed the e2etestforissue19406 branch from 7dc1bd2 to 03839c9 Compare April 18, 2025 04:04

ahrtr approved these changes Apr 22, 2025

fuweid approved these changes Apr 26, 2025

fuweid merged commit aa8238f into etcd-io:main Apr 26, 2025
31 checks passed
