[ML] Inference API add configurable connection pool TTL #127585


Merged · 4 commits · May 7, 2025

Conversation

jonathan-buttner (Contributor):

This PR allows the connection pool TTL value to be configured via a setting. This is currently only used for the EIS connection pool to help triage the connection timeouts issue.

@jonathan-buttner added labels >non-issue, :ml (Machine learning), Team:ML, auto-backport (automatically create backport pull requests when merged), v8.19.0, v9.1.0 on Apr 30, 2025
    public static final Setting<TimeValue> CONNECTION_TTL_SETTING = Setting.timeSetting(
        "xpack.inference.elastic.http.connection_ttl",
        // -1 indicates that the TTL never expires

jonathan-buttner (Contributor, Author) commented on Apr 30, 2025:

This is the constructor for the pooling connection manager

    public PoolingNHttpClientConnectionManager(
            ConnectingIOReactor ioReactor,
            NHttpConnectionFactory<ManagedNHttpClientConnection> connFactory,
            Registry<SchemeIOSessionStrategy> ioSessionFactoryRegistry,
            SocketAddressResolver<HttpRoute> socketAddressResolver) {
        // The -1 TTL means pooled connections never expire by age
        this(ioReactor, connFactory, ioSessionFactoryRegistry, socketAddressResolver, -1L, TimeUnit.MILLISECONDS);
    }
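The TTL semantics implied by that default can be sketched as a lease-time age check. This is a stdlib-only illustration, not the actual Apache HttpAsyncClient code; `isExpired` and its parameters are hypothetical names:

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch (not the real pooling code) of how a pool can decide
// at lease time whether a connection has outlived its TTL.
public class TtlCheck {
    // A TTL of -1 disables the age check entirely.
    static boolean isExpired(long createdMillis, long nowMillis, long ttlMillis) {
        if (ttlMillis < 0) {
            return false; // -1: connection never expires by age
        }
        return nowMillis - createdMillis >= ttlMillis;
    }

    public static void main(String[] args) {
        long ttl = TimeUnit.MINUTES.toMillis(2);
        // Created 3 minutes ago with a 2 minute TTL: no longer reused.
        assert isExpired(0L, TimeUnit.MINUTES.toMillis(3), ttl);
        // Same age with TTL -1: still eligible for reuse.
        assert !isExpired(0L, TimeUnit.MINUTES.toMillis(3), -1L);
        System.out.println("ok");
    }
}
```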

    public static final Setting<TimeValue> CONNECTION_TTL_SETTING = Setting.timeSetting(
        "xpack.inference.elastic.http.connection_ttl",
        // -1 indicates that the TTL never expires
        TimeValue.MINUS_ONE,

jonathan-buttner (Contributor, Author) commented:

@maxjakob @prwhelan @dtuck9 any thoughts on a default value here?

dtuck9 commented:

Sorry, I'm not familiar enough with the use case to give specific guidance, but here are some questions to consider:

  1. Is this a high request rate client?
  2. Would having to re-perform the TLS handshake more frequently cause any notable latency/performance issues?
  3. Do you have metrics in place that could monitor any potential negative consequences of setting this?

And to confirm, this setting wouldn't abruptly close connections with in-flight requests, correct?

jonathan-buttner (Contributor, Author) commented on May 1, 2025:

Thanks Diana

Is this a high request rate client?

Yeah, it could be. At the moment likely not, but once we support sparse embedding I can see it having a higher request rate.

Would having to re-perform the TLS handshake more frequently cause any notable latency/performance issues?

I'm not sure

Do you have metrics in place that could monitor any potential negative consequences of setting this?

Not at the moment but we should probably add some 😅

And to confirm, this setting wouldn't abruptly close connections with in-flight requests, correct?

My understanding of the docs is that if the TTL is reached, the pool won't reuse an existing connection. I haven't seen anything about it terminating active connections.

A team member commented:

I believe this is for closing idle connections, so it should be safe to set this to some value so that the connection gets released from the NLB/EIS.

My understanding is it just needs to be lower than the NLB's connection timeout, which is ~350s, otherwise the client tries to reuse the connection and the NLB sends TCP RST which I think surfaces as another connection timeout? The setting would depend on the user's traffic patterns, so I feel like it's hard for us to pick a good value for everyone. Making it configurable is a good idea, then maybe we have the default set to 60s (I'm picking that number randomly)
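The constraint described above can be sketched as a simple bound check: the pool's TTL should sit below the NLB's ~350s idle timeout so the client stops reusing a connection before the NLB resets it. The class and method names here are illustrative, not from the PR:

```java
import java.util.concurrent.TimeUnit;

// Sketch of the bound discussed above: a TTL is only useful (for this
// particular failure mode) if connections expire from the pool before the
// NLB's ~350s timeout would reset them. Values are illustrative.
public class TtlBound {
    static final long NLB_IDLE_TIMEOUT_MILLIS = TimeUnit.SECONDS.toMillis(350);

    static boolean safeTtl(long ttlMillis) {
        // -1 (never expires) and anything above the NLB timeout both risk
        // reusing a connection the NLB has already reset.
        return ttlMillis > 0 && ttlMillis < NLB_IDLE_TIMEOUT_MILLIS;
    }

    public static void main(String[] args) {
        assert safeTtl(TimeUnit.SECONDS.toMillis(60));  // proposed 60s default
        assert !safeTtl(TimeUnit.HOURS.toMillis(24));   // 24h outlives the NLB timeout
        assert !safeTtl(-1L);                           // -1: never expires
        System.out.println("ok");
    }
}
```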

Another team member commented:

The default value should be set so the connection does expire; this will refresh the connection pool and possibly help overcome bugs with broken connections. The setting limits the lifespan of the connection in the pool, and 24 hours seems like a reasonable value. If all the connections in the pool are created at the same time, maybe add some jitter (+/- 1–60 minutes) so they don't all EOL at the same time.
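The jitter idea suggested here can be sketched as follows: spread per-connection TTLs by a random +/- 1–60 minutes around the base value so pooled connections don't all expire at once. The names are illustrative, not from the PR:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Sketch of per-connection TTL jitter: base TTL +/- a random 1..60 minutes,
// so connections created together don't all reach end-of-life together.
public class TtlJitter {
    static long jitteredTtlMillis(long baseMillis, Random random) {
        // Pick a jitter in [1, 60] minutes and randomly add or subtract it.
        long jitter = TimeUnit.MINUTES.toMillis(1 + random.nextInt(60));
        return random.nextBoolean() ? baseMillis + jitter : baseMillis - jitter;
    }

    public static void main(String[] args) {
        long base = TimeUnit.HOURS.toMillis(24);
        long ttl = jitteredTtlMillis(base, new Random());
        long delta = Math.abs(ttl - base);
        // Jitter always lands within the +/- 1..60 minute band.
        assert delta >= TimeUnit.MINUTES.toMillis(1);
        assert delta <= TimeUnit.MINUTES.toMillis(60);
        System.out.println("ok");
    }
}
```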

jonathan-buttner (Contributor, Author) commented:

Thanks for the feedback.

24 hours seems like a reasonable value

One thing to note is that we have an idle connection evictor that runs every minute and checks whether any connection in the pool has been idle for at least a minute. If it has, the connection is closed and removed from the pool. I'm not saying that a 1 minute idle time is the correct value, but if we leave that as is, I suspect a 24 hour TTL won't have much effect (an idle connection will have been closed long before the 24 hours is up), unless a user is constantly sending requests for a 24 hour period. Maybe that'd happen more often when we go multi-tenant (the connection is associated with a URL, not the user) or in a reindex use case.

demjened (Contributor) commented on May 5, 2025:

One thing to note is that we have an idle connection evictor that runs every minute and checks whether any connection in the pool has been idle for at least a minute.

Does this mean that if we set the TTL to anything above 60s, it won't have an effect as the eviction will happen earlier?
And for my understanding, what we're trying to do by setting the right TTL is to find a balance between avoiding unnecessarily frequent TLS handshakes with new connections, vs avoiding unnecessarily long-lived connections that consume resources?

jonathan-buttner (Contributor, Author) commented:

Does this mean that if we set the TTL to anything above 60s, it won't have an effect as the eviction will happen earlier?

I believe the connection will be evicted if it is idle for 60 seconds or more. If we set the TTL to 2 minutes, then once the connection has existed for 2 minutes or more (even if it hasn't been idle for 60 seconds), the pool will not use it when a request is sent. It's basically an upper bound that takes priority over other settings: even if a connection is viable for leasing in every other way (hasn't been idle for a long period of time, etc.), once it has reached the TTL it won't be used.

So it still has an effect if we set it above 60 seconds.
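The interplay described above can be sketched as two independent checks: a connection is reusable only if it is neither past its idle limit nor past its total-lifetime TTL. This is illustrative only; the real checks live in Apache's pooling code:

```java
import java.util.concurrent.TimeUnit;

// Sketch of the reuse decision discussed above: idle eviction and the TTL
// are independent, and failing either one makes the connection unusable.
public class ReuseCheck {
    static boolean reusable(long ageMillis, long idleMillis, long ttlMillis, long maxIdleMillis) {
        boolean ttlOk = ttlMillis < 0 || ageMillis < ttlMillis;   // total lifetime bound
        boolean idleOk = idleMillis < maxIdleMillis;              // idle eviction bound
        return ttlOk && idleOk;
    }

    public static void main(String[] args) {
        long maxIdle = TimeUnit.MINUTES.toMillis(1);
        long ttl = TimeUnit.MINUTES.toMillis(2);
        // Busy connection (never idle for long) still stops being reused at 2 min of age.
        assert !reusable(TimeUnit.MINUTES.toMillis(2), 0L, ttl, maxIdle);
        // Young but idle for 90s: removed by the idle check instead.
        assert !reusable(TimeUnit.SECONDS.toMillis(30), TimeUnit.SECONDS.toMillis(90), ttl, maxIdle);
        // Young and recently used: reusable.
        assert reusable(TimeUnit.SECONDS.toMillis(30), TimeUnit.SECONDS.toMillis(10), ttl, maxIdle);
        System.out.println("ok");
    }
}
```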

A contributor commented:

I don't have a strong opinion, 60s sounds like a good initial setting, then we can revisit this if the workload changes (e.g. ELSER inference bursts at ingest). Some metrics/logs would help to keep an eye on this.

@jonathan-buttner jonathan-buttner marked this pull request as ready for review May 1, 2025 18:17
elasticsearchmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

@jonathan-buttner jonathan-buttner merged commit 3b4e536 into elastic:main May 7, 2025
17 checks passed
jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request May 7, 2025
elasticsearchmachine (Collaborator) commented:

💚 Backport successful to branch 8.19.

elasticsearchmachine pushed a commit that referenced this pull request May 7, 2025
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request May 9, 2025
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request May 9, 2025