Merge disk space aware take 2 #127613
Conversation
    TimeValue.MINUS_ONE,
    Setting.Property.NodeScope
);
public static final Setting<RelativeByteSizeValue> INDICES_MERGE_DISK_HIGH_WATERMARK_SETTING = new Setting<>(
Here I've opted to define the disk space threshold the same way as the cluster.routing.allocation.disk.* settings, i.e. as a watermark, which can be a ratio/percentage plus an optional max headroom. This keeps the disk limits configuration coherent and allows finer control, but maybe we don't need this much control and a simple absolute disk space limit, e.g. 1, 5, or 10 GB, is OK?
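For illustration, a minimal sketch of how such a watermark could resolve to an absolute free-bytes floor (the helper name and exact semantics below are my assumption, not the PR's actual code):

```java
// Hypothetical helper, not the PR's code: resolve a "used ratio + optional max
// headroom" watermark into the minimum free bytes the merge scheduler requires.
static long freeBytesFloor(long totalBytes, double usedRatioLimit, ByteSizeValue maxHeadroom) {
    // a 96% used-space watermark demands 4% of the disk to stay free
    long fromRatio = (long) (totalBytes * (1.0 - usedRatioLimit));
    if (maxHeadroom == null) {
        return fromRatio;
    }
    // on very large disks the headroom caps how much free space the ratio demands
    return Math.min(fromRatio, maxHeadroom.getBytes());
}
```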
Although I think a percentage default threshold is a better option.
I also like that they are similar - in fact I want it to default to the flood stage, which will require it to be defined in the same way.
FsInfo.Path fsInfo = getFSInfo(dataPath); // uncached
if (leastAvailablePath == null || leastAvailablePath.getAvailable().getBytes() > fsInfo.getAvailable().getBytes()) {
    leastAvailablePath = fsInfo;
}
I think shards can still be distributed across multiple data paths...
Here, I've only considered the path with the least available disk space. If the smallest merge cannot run on that path, all merging will stall (even though there may be merges for shards on other data paths that have enough disk space to run).
The proper solution for multiple data paths is more difficult, similar to the "max merge threads per shard" complication. It would need to figure out which data path a shard resides on (given the merge task), and backlog the merge task if that path currently lacks disk space. There would then be a re-enqueue priority queue per data path, for when disk space becomes available; a rough sketch follows below.
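To make the idea concrete (all helper names here are hypothetical, and the real thing would need synchronization around each per-path queue):

```java
// Hypothetical sketch: one backlog per data path, so a full path only stalls
// the merges of shards that actually live on it.
private final Map<Path, PriorityQueue<MergeTask>> backloggedByPath = new ConcurrentHashMap<>();

void enqueue(MergeTask task) {
    Path dataPath = dataPathOf(task); // would need to resolve the shard's data path
    if (availableBytes(dataPath) < task.estimatedMergeSize()) {
        // not enough disk space on this path right now; re-enqueue when space frees up
        backloggedByPath
            .computeIfAbsent(dataPath, p -> new PriorityQueue<>(Comparator.comparingLong(MergeTask::estimatedMergeSize)))
            .add(task);
    } else {
        submitForExecution(task);
    }
}
```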
If the least available data path I've proposed here is not satisfactory, another option would be to not support this new feature ("prevent merges from filling up disk") when multiple data paths are configured, in the first release, and leave the MDP support as homework.
WDYT @henningandersen ?
I am inclined to use the most available path instead. That is sort of similar to not supporting MDP, except it is still active and could by luck be doing good things sometimes.
The trouble with least available is that it can block merges that should be allowed and this can go on forever.
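Sketched against the excerpt above, the suggestion amounts to flipping the comparison (the surrounding loop is assumed):

```java
// Track the path with the MOST available bytes instead of the least.
FsInfo.Path mostAvailablePath = null;
for (Path dataPath : dataPaths) {
    FsInfo.Path fsInfo = getFSInfo(dataPath); // uncached
    if (mostAvailablePath == null || mostAvailablePath.getAvailable().getBytes() < fsInfo.getAvailable().getBytes()) {
        mostAvailablePath = fsInfo;
    }
}
```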
/** How frequently we check disk usage (default: 5 seconds). */
public static final Setting<TimeValue> INDICES_MERGE_DISK_CHECK_INTERVAL_SETTING = Setting.timeSetting(
    "indices.merge.disk.check_interval",
    TimeValue.timeValueSeconds(5),
This is the 5 sec interval at which we check the available disk space. The interval is fixed, whether there is 10% or 90% disk space available, and the disk space is checked even when there isn't any merging going on.
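For illustration, the fixed-interval check could look roughly like this (a plain JDK scheduler here; the PR itself presumably uses Elasticsearch's ThreadPool, and checkAvailableDiskSpace is a hypothetical name):

```java
// Illustrative only: check disk usage at a fixed interval, regardless of how much
// space is free and regardless of whether any merge is running.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
TimeValue interval = INDICES_MERGE_DISK_CHECK_INTERVAL_SETTING.get(settings);
scheduler.scheduleWithFixedDelay(
    this::checkAvailableDiskSpace, // re-reads fs stats and updates the merge queue's limit
    0,
    interval.millis(),
    TimeUnit.MILLISECONDS
);
```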
Sounds good to me. I wondered about simply checking this every time, but then again, we need some scheduling once we hit the roof, so this looks good.
Comparator.comparingLong(MergeTask::estimatedMergeSize)
private final PriorityBlockingQueueWithMaxLimit<MergeTask> queuedMergeTasks = new PriorityBlockingQueueWithMaxLimit<>(
    MergeTask::estimatedMergeSize,
    Long.MAX_VALUE
If we cannot read the available disk space, all merging will be allowed (and there is some logging complaining about it).
Similarly, merging starts without waiting to read the fs stats.
Generally, my thinking was to let merging execute as before, and only stop new merges from starting once the filesystem stats are available and they show that the remaining disk space is low.
Alternatively, we could allow merging to happen only after filesystem stats are available and the remaining disk space is sufficient.
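That alternative, sketched (field and method names assumed): start the queue with a limit of 0 so nothing runs until the first successful disk check.

```java
// Start "closed": no merge may run until we have read fs stats at least once.
private final PriorityBlockingQueueWithMaxLimit<MergeTask> queuedMergeTasks =
    new PriorityBlockingQueueWithMaxLimit<>(MergeTask::estimatedMergeSize, 0L);

void onDiskCheck(long availableBytes) {
    // opens the queue up to whatever the current disk space allows
    queuedMergeTasks.updateMaxPriorityLimit(availableBytes);
}
```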
WDYT?
}
if (leastAvailablePath == null) {
    LOGGER.error("Cannot read filesystem info");
    return;
If we cannot read the available disk space (because of some errors), merging is allowed to go on, rather than be stopped.
Would we not need to call updateMaxPriorityLimit(Long.MAX_VALUE) to get that effect - in case we could read it and then cannot? We can also assert that the limit is not set.
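Concretely, a sketch of that suggestion in the error branch above:

```java
if (leastAvailablePath == null) {
    LOGGER.error("Cannot read filesystem info");
    // fall back to "allow all merging" rather than keeping a stale limit around
    queuedMergeTasks.updateMaxPriorityLimit(Long.MAX_VALUE);
    return;
}
```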
lock.lockInterruptibly();
E peek;
try {
    while ((peek = priorityQueue.peek()) == null || priorityFunction.applyAsLong(peek) > maxPriorityLimit)
The take method is special here. In addition to blocking if there's no element, it will also block if the smallest element in the heap is larger than a limit.
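Fleshed out from the excerpt above (the condition variable name and the surrounding structure are assumed), the take might look like:

```java
public E take() throws InterruptedException {
    lock.lockInterruptibly();
    try {
        E peek;
        // block while the queue is empty OR the smallest element exceeds the limit
        while ((peek = priorityQueue.peek()) == null || priorityFunction.applyAsLong(peek) > maxPriorityLimit) {
            notEmpty.await(); // signalled by add() and by updateMaxPriorityLimit()
        }
        return priorityQueue.poll();
    } finally {
        lock.unlock();
    }
}
```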
Doing this at the queue seems like a good direction. Left a number of comments.
);
public static final Setting<RelativeByteSizeValue> INDICES_MERGE_DISK_HIGH_WATERMARK_SETTING = new Setting<>(
    "indices.merge.disk.watermark.high",
    "96%",
I wonder if this should default to the flood stage level rather than introduce yet another limit? I think having the setting is fine in order to explicitly override it, but it seems nice that it follows flood stage if configured rather than having to configure both.
private long maxPriorityLimit;

PriorityBlockingQueueWithMaxLimit(ToLongFunction<? super E> priorityFunction, long maxPriorityLimit) {
    this.priorityFunction = priorityFunction;
This function and its name seem slightly confusing. I think it would be simpler to have it be a predicate that says whether a task can run or not? We'd still need the wakeup when disk usage monitoring returns, but that could just be a wakeup function.
That way it can also handle all criteria, like heap, disk, etc., that we come up with in the future.
In fact, it should maybe reserve the capacity too, such that we conservatively ensure we do not run out. We can add that later, no need for it initially (and maybe it is good enough without it).
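One way to read this suggestion (the field and the loop fragment below are mine, purely illustrative):

```java
// Hypothetical: a predicate decides admission; the disk monitor only wakes us up.
private volatile Predicate<MergeTask> canRun = task -> true;

// take() would then block on the predicate instead of on a numeric limit:
while ((peek = priorityQueue.peek()) == null || canRun.test(peek) == false) {
    notEmpty.await(); // the disk-usage monitor calls a wakeup function after each check
}
```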
    }
}

static class PriorityBlockingQueueWithMaxLimit<E> {
Perhaps we can add a comment on why we have this special queue. I think the primary benefit is that we can still process incoming merges that can be executed, whereas if we did the blocking in runMergeTask, we'd not be able to do so once all threads are occupied.
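The suggested comment could be along these lines:

```java
/**
 * A priority queue whose take() also blocks while the head element exceeds a limit.
 * Doing the blocking here (rather than in runMergeTask) means newly enqueued merges
 * that fit under the current disk limit can still be picked up and executed, even
 * while larger merges are held back.
 */
static class PriorityBlockingQueueWithMaxLimit<E> {
```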
if (INDICES_MERGE_DISK_HIGH_WATERMARK_SETTING.exists(settings)) {
    return "-1";
} else {
    return "40GB";
Here we should return the flood stage value.
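A sketch of that change (the settings key is the real allocation flood-stage setting; the raw-string lookup style is my assumption):

```java
if (INDICES_MERGE_DISK_HIGH_WATERMARK_SETTING.exists(settings)) {
    return "-1";
} else {
    // inherit the allocation flood stage watermark (defaults to 95%) instead of "40GB"
    return settings.get("cluster.routing.allocation.disk.watermark.flood_stage", "95%");
}
```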