Skip to content

Generalise exponential bucket histogram, add percentile calculation #127597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
9918299
Publish queue latency metrics from tracked thread pools
nicktindall Jan 21, 2025
b1d8df4
Merge remote-tracking branch 'origin/main' into ES-10531_add_thread_p…
nicktindall Jan 21, 2025
dfef676
Fix metric name
nicktindall Jan 21, 2025
2ceb965
Temporary hack to fix metric name
nicktindall Jan 21, 2025
dbed27f
Propose solution to composite thread-pool names
nicktindall Jan 22, 2025
c04f2ea
Merge remote-tracking branch 'origin/main' into ES-10531_add_thread_p…
nicktindall Jan 22, 2025
c95f625
Fix fixed thread pool names
nicktindall Jan 22, 2025
f850bc9
Merge remote-tracking branch 'origin/main' into ES-10531_add_thread_p…
nicktindall Jan 23, 2025
1b450f0
Merge branch 'main' into ES-10531_add_thread_pool_queue_latency_metric
nicktindall Mar 7, 2025
e7f5bb6
POC using HandlingTimeTracker to track queue latency
nicktindall Mar 11, 2025
d269195
Merge branch 'main' into ES-10531_add_thread_pool_queue_latency_metric
nicktindall Mar 11, 2025
80b8b3f
Tidy
nicktindall Mar 11, 2025
9811299
Merge branch 'main' into ES-10531_add_thread_pool_queue_latency_metric
nicktindall Mar 12, 2025
598d0a8
Fix metric name
nicktindall Mar 13, 2025
4153d27
Generalise HandlingTimeTracker
nicktindall Mar 25, 2025
d4b0818
Merge remote-tracking branch 'origin/main' into ES-10531_add_thread_p…
nicktindall Mar 25, 2025
3e3d9fc
Tidy and fix document/test
nicktindall Mar 25, 2025
be4b96c
Restore field name
nicktindall Mar 26, 2025
8a313ab
Update docs/changelog/120488.yaml
nicktindall Mar 26, 2025
8fa7dc6
Fix changelog area value
nicktindall Mar 26, 2025
2c36b97
Update server/src/main/java/org/elasticsearch/common/metrics/Exponent…
nicktindall Mar 27, 2025
420739e
Setup metrics separately to constructor
nicktindall Mar 28, 2025
3208423
Remove unnecessary change
nicktindall Mar 28, 2025
9dd0457
Merge remote-tracking branch 'origin/main' into ES-10531_add_thread_p…
nicktindall Mar 28, 2025
516a4a9
Merge branch 'main' into ES-10531_add_thread_pool_queue_latency_metric
nicktindall Apr 2, 2025
d6e44ed
Add 99th percentile
nicktindall Apr 6, 2025
dfbd9ac
Merge branch 'main' into ES-10531_add_thread_pool_queue_latency_metric
nicktindall Apr 7, 2025
2b294bd
Record queue latency metrics in beforeExecute
nicktindall Apr 7, 2025
9e227ed
Handle p100 percentile when last bucket is populated
nicktindall Apr 7, 2025
15acd80
Update server/src/main/java/org/elasticsearch/common/metrics/Exponent…
nicktindall May 1, 2025
368a66a
Fix variable naming
nicktindall May 1, 2025
bb84ed9
Update variable names where we use ExponentialBucketHistogram
nicktindall May 1, 2025
273fc23
Merge remote-tracking branch 'origin/main' into ES-10531_add_thread_p…
nicktindall May 1, 2025
778e8b5
Update server/src/main/java/org/elasticsearch/common/metrics/Exponent…
nicktindall May 1, 2025
876ae78
Assert that taskQueueLatency >= 0
nicktindall May 1, 2025
51191eb
Improve documentation of test
nicktindall May 1, 2025
f96674c
Remove queue latency metric changes
nicktindall May 1, 2025
31f2083
getHistogram() -> getSnapshot()
nicktindall May 1, 2025
4b73745
Overload getPercentile to allow re-use of a snapshot
nicktindall May 2, 2025
5d2cbad
Merge branch 'main' into ES-10531_generalise_exponential_bucket_histo…
nicktindall May 5, 2025
221e079
Additional tests
nicktindall May 5, 2025
7e6a6ce
Additional tests
nicktindall May 5, 2025
8ef9745
Minimise change
nicktindall May 5, 2025
91261d6
Validate that bucket count is sane
nicktindall May 5, 2025
a3707eb
Restore comment
nicktindall May 5, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the "Elastic License
* 2.0", the "GNU Affero General Public License v3.0 only", and the "Server Side
* Public License v 1"; you may not use this file except in compliance with, at
* your election, the "Elastic License 2.0", the "GNU Affero General Public
* License v3.0 only", or the "Server Side Public License, v 1".
*/

package org.elasticsearch.common.metrics;

import java.util.Arrays;
import java.util.concurrent.atomic.LongAdder;

/**
* A histogram with a fixed number of buckets of exponentially increasing width.
* <p>
* The bucket boundaries are defined by increasing powers of two, e.g.
* <code>
* (-&infin;, 1), [1, 2), [2, 4), [4, 8), ..., [2^({@link #bucketCount}-2), &infin;)
* </code>
* There are {@link #bucketCount} buckets.
*/
public class ExponentialBucketHistogram {

private final int bucketCount;
private final long lastBucketLowerBound;

public static int[] getBucketUpperBounds(int bucketCount) {
int[] bounds = new int[bucketCount - 1];
for (int i = 0; i < bounds.length; i++) {
bounds[i] = 1 << i;
}
return bounds;
}

private int getBucket(long observedValue) {
if (observedValue <= 0) {
return 0;
} else if (lastBucketLowerBound <= observedValue) {
return bucketCount - 1;
} else {
return Long.SIZE - Long.numberOfLeadingZeros(observedValue);
}
}

private final LongAdder[] buckets;

public ExponentialBucketHistogram(int bucketCount) {
if (bucketCount < 2 || bucketCount > Integer.SIZE) {
throw new IllegalArgumentException("Bucket count must be in [2, " + Integer.SIZE + "], got " + bucketCount);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The observations are long but the bucket bounds are integers, I suppose because the scale of the numbers we've been putting in here are easily contained in the integer range. Perhaps we could revisit this if we need histograms to hold larger values, but that's not the case for the queue latency metric.

this.bucketCount = bucketCount;
this.lastBucketLowerBound = getBucketUpperBounds(bucketCount)[bucketCount - 2];
buckets = new LongAdder[bucketCount];
for (int i = 0; i < bucketCount; i++) {
buckets[i] = new LongAdder();
}
}

public int[] calculateBucketUpperBounds() {
return getBucketUpperBounds(bucketCount);
}

public void addObservation(long observedValue) {
buckets[getBucket(observedValue)].increment();
}

/**
* @return An array of frequencies of handling times in buckets with upper bounds as returned by {@link #calculateBucketUpperBounds()},
* plus an extra bucket for handling times longer than the longest upper bound.
*/
public long[] getSnapshot() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this from getHistogram to getSnapshot because histogram.getHistogram() would look weird.

final long[] histogram = new long[bucketCount];
for (int i = 0; i < bucketCount; i++) {
histogram[i] = buckets[i].longValue();
}
return histogram;
}

/**
* Calculate the Nth percentile value
*
* @param percentile The percentile as a fraction (in [0, 1.0])
* @return A value greater than the specified fraction of values in the histogram
* @throws IllegalArgumentException if the requested percentile is invalid
*/
public long getPercentile(float percentile) {
return getPercentile(percentile, getSnapshot(), calculateBucketUpperBounds());
}

/**
* Calculate the Nth percentile value
*
* @param percentile The percentile as a fraction (in [0, 1.0])
* @param snapshot An array of frequencies of handling times in buckets with upper bounds as per {@link #calculateBucketUpperBounds()}
* @param bucketUpperBounds The upper bounds of the buckets in the histogram, as per {@link #calculateBucketUpperBounds()}
* @return A value greater than the specified fraction of values in the histogram
* @throws IllegalArgumentException if the requested percentile is invalid
*/
public long getPercentile(float percentile, long[] snapshot, int[] bucketUpperBounds) {
assert snapshot.length == bucketCount && bucketUpperBounds.length == bucketCount - 1;
if (percentile < 0 || percentile > 1) {
throw new IllegalArgumentException("Requested percentile must be in [0, 1.0], percentile=" + percentile);
}
Comment on lines +103 to +105
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add a few quick expectThrows() tests for these?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 221e079

final long totalCount = Arrays.stream(snapshot).sum();
long percentileIndex = (long) Math.ceil(totalCount * percentile);
// Find which bucket has the Nth percentile value and return the upper bound value.
for (int i = 0; i < bucketCount; i++) {
percentileIndex -= snapshot[i];
if (percentileIndex <= 0) {
if (i == snapshot.length - 1) {
return Long.MAX_VALUE;
} else {
return bucketUpperBounds[i];
}
}
}
assert false : "We shouldn't ever get here";
return Long.MAX_VALUE;
}

/**
* Clear all values in the histogram (non-atomic)
*/
public void clear() {
for (int i = 0; i < bucketCount; i++) {
buckets[i].reset();
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -9,58 +9,20 @@

package org.elasticsearch.common.network;

import java.util.concurrent.atomic.LongAdder;
import org.elasticsearch.common.metrics.ExponentialBucketHistogram;

/**
* Tracks how long message handling takes on a transport thread as a histogram with fixed buckets.
*/
public class HandlingTimeTracker {
public class HandlingTimeTracker extends ExponentialBucketHistogram {
Copy link
Contributor Author

@nicktindall nicktindall May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to keep the HandlingTimeTracker because the decision to have 17 + 1 buckets is contextual and should be fixed because it's used in serialization. It also reduces the size of the change.

HandlingTimeTracker is now just an ExponentialBucketHistogram of a specific size.


public static int[] getBucketUpperBounds() {
int[] bounds = new int[17];
for (int i = 0; i < bounds.length; i++) {
bounds[i] = 1 << i;
}
return bounds;
}
public static final int BUCKET_COUNT = 18;

private static int getBucket(long handlingTimeMillis) {
if (handlingTimeMillis <= 0) {
return 0;
} else if (LAST_BUCKET_LOWER_BOUND <= handlingTimeMillis) {
return BUCKET_COUNT - 1;
} else {
return Long.SIZE - Long.numberOfLeadingZeros(handlingTimeMillis);
}
public static int[] getBucketUpperBounds() {
return ExponentialBucketHistogram.getBucketUpperBounds(BUCKET_COUNT);
}

public static final int BUCKET_COUNT = getBucketUpperBounds().length + 1;

private static final long LAST_BUCKET_LOWER_BOUND = getBucketUpperBounds()[BUCKET_COUNT - 2];

private final LongAdder[] buckets;

public HandlingTimeTracker() {
buckets = new LongAdder[BUCKET_COUNT];
for (int i = 0; i < BUCKET_COUNT; i++) {
buckets[i] = new LongAdder();
}
super(BUCKET_COUNT);
}

public void addHandlingTime(long handlingTimeMillis) {
buckets[getBucket(handlingTimeMillis)].increment();
}

/**
* @return An array of frequencies of handling times in buckets with upper bounds as returned by {@link #getBucketUpperBounds()}, plus
* an extra bucket for handling times longer than the longest upper bound.
*/
public long[] getHistogram() {
final long[] histogram = new long[BUCKET_COUNT];
for (int i = 0; i < BUCKET_COUNT; i++) {
histogram[i] = buckets[i].longValue();
}
return histogram;
}

}
Original file line number Diff line number Diff line change
Expand Up @@ -460,7 +460,7 @@ public void incomingRequest(final HttpRequest httpRequest, final HttpChannel htt
handleIncomingRequest(httpRequest, trackingChannel, httpRequest.getInboundException());
} finally {
final long took = threadPool.rawRelativeTimeInMillis() - startTime;
networkService.getHandlingTimeTracker().addHandlingTime(took);
networkService.getHandlingTimeTracker().addObservation(took);
final long logThreshold = slowLogThresholdMs;
if (logThreshold > 0 && took > logThreshold) {
logger.warn(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ public void addResponseStats(long contentLength) {
}

public void addResponseTime(long timeMillis) {
responseTimeTracker.addHandlingTime(timeMillis);
responseTimeTracker.addObservation(timeMillis);
}

public HttpRouteStats getStats() {
Expand All @@ -102,7 +102,7 @@ public HttpRouteStats getStats() {
responseStats.count().longValue(),
responseStats.totalSize().longValue(),
responseStats.getHistogram(),
responseTimeTracker.getHistogram()
responseTimeTracker.getSnapshot()
);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -138,7 +138,7 @@ private void messageReceived(TcpChannel channel, InboundMessage message, long st
}
} finally {
final long took = threadPool.rawRelativeTimeInMillis() - startTime;
handlingTimeTracker.addHandlingTime(took);
handlingTimeTracker.addObservation(took);
final long logThreshold = slowLogThresholdMs;
if (logThreshold > 0 && took > logThreshold) {
logSlowMessage(message, took, logThreshold, responseHandler);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -414,7 +414,7 @@ private void maybeLogSlowMessage(boolean success) {
final long logThreshold = slowLogThresholdMs;
if (logThreshold > 0) {
final long took = threadPool.rawRelativeTimeInMillis() - startTime;
handlingTimeTracker.addHandlingTime(took);
handlingTimeTracker.addObservation(took);
if (took > logThreshold) {
logger.warn(
"sending transport message [{}] of size [{}] on [{}] took [{}ms] which is above the warn "
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1001,8 +1001,8 @@ public final TransportStats getStats() {
bytesRead,
messagesSent,
bytesWritten,
networkService.getHandlingTimeTracker().getHistogram(),
outboundHandlingTimeTracker.getHistogram(),
networkService.getHandlingTimeTracker().getSnapshot(),
outboundHandlingTimeTracker.getSnapshot(),
requestHandlers.getStats()
);
}
Expand Down
Loading