[CI] DiskThresholdDeciderIT testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores failing #127286

Closed
elasticsearchmachine opened this issue Apr 23, 2025 · 4 comments · Fixed by #127615
Labels: :Distributed Coordination/Allocation, medium-risk, Team:Distributed Coordination, >test-failure

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Apr 23, 2025

Build Scans:

Reproduction Line:

./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores" -Dtests.seed=7102862F2F4E8B3E -Dtests.locale=en-DE -Dtests.timezone=Indian/Comoro -Druntime.java=24

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: 
Expected: a collection with size <1>
     but: collection size was <0>

Issue Reasons:

  • [main] 3 failures in test testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores (1.1% fail rate in 266 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine added the :Distributed Coordination/Allocation and >test-failure labels Apr 23, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 2 failures in test testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores (4.0% fail rate in 50 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Apr 23, 2025
…ldDeciderIT testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores #127286
elasticsearchmachine added the needs:risk and Team:Distributed Coordination labels Apr 23, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

JeremyDahlgren self-assigned this Apr 29, 2025
JeremyDahlgren added the medium-risk label and removed the needs:risk label Apr 30, 2025
@JeremyDahlgren
Contributor

There appears to be a race condition: the tiny node can end up hosting a shard from either the original index or the index copy. The assertion at the end of the test only checks for shards from the original index, so it fails when the tiny node instead holds a single shard from the index copy. To reproduce this more reliably, I forced usableSpace = shardSizes.getSmallestShardSize() and used indexRandom(true, indexName, 100) to build smaller shards and keep the usable space at the minimum.

@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 3 failures in test testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores (1.1% fail rate in 266 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue May 1, 2025
…ldDeciderIT testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores #127286
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue May 1, 2025
The test launches two concurrent restores and wants to verify that
the node with limited disk space is only assigned a single shard from
one of the indices.  The test was asserting that it had one shard from
the first index, but it is possible for it to get one shard from the
index copy instead.  This change allows the shard to be from either
index, but still asserts there is only one assignment to the tiny node.

Closes elastic#127286
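
The change described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the actual test code: the class name, the helper methods `countForIndex` and `countForAnyIndex`, and the index names `index` and `index-copy` are hypothetical, and the shards on the tiny node are modeled simply as a list of index names. It only shows why a check restricted to the first index sees zero shards when the tiny node received its single shard from the index copy, while a check across both indices still sees exactly one.

```java
import java.util.List;
import java.util.Set;

public class TinyNodeAssertionSketch {

    // Strict check (hypothetical): counts only shards on the tiny node
    // that belong to one specific index.
    public static long countForIndex(List<String> shardIndexNamesOnTinyNode, String indexName) {
        return shardIndexNamesOnTinyNode.stream()
            .filter(name -> name.equals(indexName))
            .count();
    }

    // Relaxed check (hypothetical): counts shards on the tiny node that
    // belong to any of the restored indices.
    public static long countForAnyIndex(List<String> shardIndexNamesOnTinyNode, Set<String> indexNames) {
        return shardIndexNamesOnTinyNode.stream()
            .filter(indexNames::contains)
            .count();
    }

    public static void main(String[] args) {
        // One possible outcome of the two concurrent restores: the tiny node
        // holds a single shard, and it comes from the index copy.
        List<String> onTinyNode = List.of("index-copy");

        // The strict check finds no shard of the original index: prints 0.
        System.out.println(countForIndex(onTinyNode, "index"));

        // The relaxed check still finds exactly one assignment: prints 1.
        System.out.println(countForAnyIndex(onTinyNode, Set.of("index", "index-copy")));
    }
}
```

With the strict check, an expectation of size 1 would fail with a size of 0 in this outcome, which matches the shape of the reported failure message. The relaxed check keeps the property the test cares about, a single assignment to the tiny node, regardless of which restore wins.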
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue May 2, 2025
(cherry picked from commit 6263f44)
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue May 2, 2025
(cherry picked from commit 6263f44)

# Conflicts:
#	muted-tests.yml
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue May 2, 2025
(cherry picked from commit 6263f44)

# Conflicts:
#	muted-tests.yml
2 participants