Interop issue between msquic and openssl-3.5-dev #4905

Open
1 of 4 tasks
nhorman opened this issue Mar 8, 2025 · 3 comments
Labels
Area: Core Related to the shared, core protocol logic external Proposed by non-MSFT OS: Linux TLS: OpenSSL

@nhorman
Contributor

nhorman commented Mar 8, 2025

Describe the bug

When doing some interop testing with openssl, I noticed a problem on the quic-interop-runner retry test:
https://github.com/openssl/openssl/actions/runs/13733428533/job/38414155590

In attempting to track it down I gathered the following data in the attached zip file

retrydata.zip

without-retry.cap - tcpdump of connection between openssl client and msquic with retry disabled on quicinteropserver

with-retry.cap - tcpdump of connection between openssl client and msquic with retry enabled on quicinteropserver

without_retry.log - log file generated from msquic quicinteropserver with retry disabled

with_retry.log - log file generated from msquic quicinteropserver with retry enabled

without_retry_keys.log - keylog file for tcpdump with retry disabled on msquic server. Unfortunately no keylog file is generated when retry is enabled, as the handshake never progresses that far.

As the logs show, everything works normally when retry is disabled, but the handshake never completes with retry enabled.

I'm having a hard time making out the server side logs, but it appears that when retry is enabled on the server, the full client hello never gets re-assembled.

This may well be relevant here: OpenSSL recently enabled the use of ML-KEM keyshares, which dramatically increases the size of the client hello record (spanning 3 datagrams in the tcpdump). Looking at the with_retry.log file, I think that msquic eventually decodes the client hello on line 277, but then discards the record on line 291.

It's worth noting that if I set the openssl client up such that it only advertises older keyshares (like X25519), everything also works fine, even with retry enabled. It's only when the client hello spans multiple datagrams with retry enabled on the server that the problem manifests.
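
To see why the larger keyshare pushes the client hello past a single datagram, here is a rough back-of-the-envelope sketch. The key share sizes are the standard ones (32 bytes for X25519, 1184 bytes for an ML-KEM-768 encapsulation key). Everything else is an assumption for illustration: that the hybrid X25519MLKEM768 group is what is being offered (the report only says "ML-KEM keyshares"), and the values of `BASE_CLIENT_HELLO` and `CRYPTO_BYTES_PER_DATAGRAM`, which are not measured from the attached captures.

```python
import math

# Key share sizes in bytes.
X25519_SHARE = 32
MLKEM768_ENCAPS_KEY = 1184
X25519MLKEM768_SHARE = MLKEM768_ENCAPS_KEY + X25519_SHARE  # hybrid client share

# Assumed values, for illustration only (not taken from the captures):
BASE_CLIENT_HELLO = 300           # client hello bytes excluding key share data
CRYPTO_BYTES_PER_DATAGRAM = 1100  # CRYPTO payload that fits in one Initial datagram


def datagrams_needed(key_share_sizes):
    """Estimate how many datagrams a client hello with these key shares needs."""
    client_hello_len = BASE_CLIENT_HELLO + sum(key_share_sizes)
    return math.ceil(client_hello_len / CRYPTO_BYTES_PER_DATAGRAM)


classical_only = datagrams_needed([X25519_SHARE])
with_mlkem = datagrams_needed([X25519MLKEM768_SHARE, X25519_SHARE])
print(classical_only, with_mlkem)
```

Under these assumptions the classical-only client hello fits in one datagram and the ML-KEM one does not. The exact count seen on the wire depends on the other extensions and on per-packet overhead.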

Affected OS

  • Windows
  • Linux
  • macOS
  • Other (specify below)

Additional OS information

ubuntu linux 22.04

MsQuic version

main

Steps taken to reproduce bug

  1. Run the msquic quicinteropserver: ./quicinteropserver -retry:1 -listen:127.0.0.1 -port:4433 -root:~/www -file:/home/nhorman/git/openssl/test/certs/servercert.pem -key:/home/nhorman/git/openssl/test/certs/serverkey.pem
  2. Run the openssl quic-hq-interop client: LD_LIBRARY_PATH=/home/nhorman/git/openssl SSLKEYLOGFILE=./keys.log SSL_CERT_FILE=/certs/ca.pem SSL_CERT_DIR=/certs ./quic-hq-interop 127.0.0.1 4433 ./reqfile.txt

Expected behavior

The files listed in reqfile.txt should be transferred from the server to the client

Actual outcome

The client hangs waiting for the handshake to complete. The server sends some encrypted data, but its contents are unknown because the keylog file is never produced. The connection is never established.

Additional details

No response

@nhorman
Contributor Author

nhorman commented Mar 9, 2025

A little more debugging here: it appears that the client is discarding the inbound frames due to a mismatch between the initial source connection ID and the packet's source connection ID (based on section 7.2 of RFC 9000). Given that msquic sent an SCID of 25c6f641.... in frame 6 of with_retry.cap, the openssl client expects subsequent packets from the server to contain that SCID for this connection, but the handshake data in frames 7, 8 and onward contain a different SCID.

This may be related to #3762, though I can't draw any connection between this behavior and the fact that it works if the client hello isn't split over multiple datagrams.

@nhorman
Contributor Author

nhorman commented Mar 9, 2025

Looking further at the with_retry.cap file, I think this is related to #3762. Comparing this to a trace I just took in which I only have the client advertise an x25519 keyshare:

keys.log

test.zip

I can see in the test.cap file (included in test.zip):

Frame 1) Client sends an initial packet with DCID 69252cb875da67e9
Frame 2) Server sends a retry frame selecting SCID cc1f52210a77c28255
Frame 3) Client resends initial packet with updated DCID cc1f52210a77c28255
Frame 4) Server sends an initial packet with server hello containing SCID 3b7839b4eae46e8bc7
Frame 6 and 7) Client sends Handshake finished message with 2nd updated DCID of 3b7839b4eae46e8bc7

And everything works (though wireshark seems a bit confused about decoding the encrypted data after the second change): the client and server transfer data as expected, as seen from the subsequent stream messages.

Comparing this to the previous with_retry.cap file, in which larger keyshares are offered:
Frames 1 and 2) Client sends initial packets with Client hello, using DCID da3583d99cf845fd
Frame 3) Server responds with a retry frame containing SCID 256cf641f71de9427
Frames 4 and 5) Client resends initial packet with new DCID 256cf641f71de9427
Frame 6) Server sends an initial packet containing an ACK with SCID 256cf641f71de9427 (<= First initial frame from server)
Frames 7 and 8) Client sends some more handshake data with DCID 256cf641f71de9427
Frame 9) Server sends its server hello using a new SCID 72c0e5d9311abc995a
At this point the client starts discarding packets from the server because the SCID doesn't match the recorded connection id

From RFC 9000, section 7.2:

Upon first receiving an Initial or Retry packet from the server, the client uses the Source Connection ID supplied by the 
server as the Destination Connection ID for subsequent packets, including any 0-RTT packets. This means that a client 
might have to change the connection ID it sets in the Destination Connection ID field twice during connection 
establishment: once in response to a Retry packet and once in response to an Initial packet from the server. Once a client 
has received a valid Initial packet from the server, it MUST discard any subsequent packet it receives on that connection 
with a different Source Connection ID.

Normally the second SCID change would be honored by all parties, and everything is OK (as it is in test.cap above). But when the initial packet with the client hello spans multiple datagrams, we get an early initial packet from the server (the ACK in Frame 6 in with_retry.cap), which indicates to the client that no further CID updates are expected. The change of CID in the server handshake message therefore violates the RFC requirements in section 7.2, causing the subsequent drops.
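
The difference between the two traces can be replayed with a toy model of the client-side rule quoted above. This is only a sketch of the RFC 9000 section 7.2 behavior, not OpenSSL's implementation: `ClientCidState` and `replay` are hypothetical names, packet protection and Retry validation are ignored, and the connection IDs are copied as written in the frame listings above.

```python
class ClientCidState:
    """Toy model of the client rule in RFC 9000, section 7.2 (illustrative only)."""

    def __init__(self, initial_dcid):
        self.dcid = initial_dcid         # DCID the client currently sends
        self.server_scid_locked = False  # set once a server Initial is accepted

    def on_server_packet(self, packet_type, scid):
        """Return True if the packet is accepted, False if it is discarded."""
        if self.server_scid_locked:
            # After the first valid server Initial, the SCID must not change.
            return packet_type != "retry" and scid == self.dcid
        # Before that, a Retry or the first Initial may each update the DCID.
        self.dcid = scid
        if packet_type == "initial":
            self.server_scid_locked = True
        return True


def replay(initial_dcid, server_packets):
    """Feed (packet_type, scid) pairs to a fresh client; return accept flags."""
    client = ClientCidState(initial_dcid)
    return [client.on_server_packet(ptype, scid) for ptype, scid in server_packets]


# test.cap: Retry, then a single server Initial carrying the server hello.
small_hello = replay("69252cb875da67e9", [
    ("retry", "cc1f52210a77c28255"),
    ("initial", "3b7839b4eae46e8bc7"),
])

# with_retry.cap: Retry, an ACK-only Initial, then the server hello with a new SCID.
large_hello = replay("da3583d99cf845fd", [
    ("retry", "256cf641f71de9427"),
    ("initial", "256cf641f71de9427"),
    ("initial", "72c0e5d9311abc995a"),
])

print(small_hello, large_hello)
```

In this model every server packet in the test.cap sequence is accepted, while the third server packet in the with_retry.cap sequence is discarded, because the ACK-only Initial has already fixed the server's SCID.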

I think the right fix here is that, once the server sends an initial packet after a retry, altering the CID in the connection can no longer be allowed.
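
A minimal sketch of that proposed server-side behavior, under stated assumptions: `ServerScidPolicy` and `scid_for_initial` are hypothetical names, and this is not msquic's code or API. The idea is only that a new SCID may be chosen for the first Initial packet sent after the Retry, and is frozen from then on.

```python
class ServerScidPolicy:
    """Illustrative policy: freeze the SCID once any Initial packet has been sent."""

    def __init__(self, retry_scid):
        self.scid = retry_scid     # SCID advertised in the Retry
        self.initial_sent = False  # becomes True after the first Initial

    def scid_for_initial(self, proposed_scid=None):
        """Return the SCID to place in the next Initial packet."""
        if proposed_scid is not None and not self.initial_sent:
            # A change is still allowed: the client has seen no Initial yet.
            self.scid = proposed_scid
        self.initial_sent = True
        return self.scid


# Multi-datagram client hello: an ACK-only Initial goes out first.
policy = ServerScidPolicy("256cf641f71de9427")
ack_scid = policy.scid_for_initial()
hello_scid = policy.scid_for_initial(proposed_scid="72c0e5d9311abc995a")

# Single-datagram client hello: the server hello is the first Initial.
policy2 = ServerScidPolicy("cc1f52210a77c28255")
first_scid = policy2.scid_for_initial(proposed_scid="3b7839b4eae46e8bc7")

print(ack_scid, hello_scid, first_scid)
```

With this policy the server hello in the multi-datagram case reuses the SCID already sent in the ACK, so the client's check from section 7.2 passes. The single-datagram case still gets its second CID change, as in test.cap.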

nhorman added a commit to nhorman/openssl that referenced this issue Mar 9, 2025
With the addition of larger ml-kem keys in our tls handshake, we've
uncovered an interop failure, as described here:
microsoft/msquic#4905

In short, when we send a client hello that spans multiple datagrams, the
server sends an ACK frame in a datagram prior to sending its server
hello.  msquic however, always recomputes a new SCID when sending its
server hello, which is fine nominally, but because in this test the
server sends a retry frame to update the SCID, followed by an ACK using
that SCID (which is an initial packet), msquic violates the RFC in
section 7.2 which states:

Once a client has received a valid Initial packet from the server, it MUST
discard any subsequent packet it receives on that connection with a
different Source Connection ID

Because msquic sent an initial packet with that ACK frame, we are
required to discard subsequent frames on the connection containing a
different SCID.

Until msquic fixes that in their implementation we are going to fail the
retry interop test, so for now, let's exclude the test.

Also, while we're at it, re-add chrome into the client list for our
server tests, as that seems to have been lost during the merge.

Fixes openssl/project#1132
@anrossi anrossi added OS: Linux TLS: OpenSSL Area: Core Related to the shared, core protocol logic labels Mar 10, 2025
@anrossi anrossi added this to the Future milestone Mar 10, 2025
openssl-machine pushed a commit to openssl/openssl that referenced this issue Mar 12, 2025
Reviewed-by: Saša Nedvědický <[email protected]>
Reviewed-by: Matt Caswell <[email protected]>
(Merged from #27014)
openssl-machine pushed a commit to openssl/openssl that referenced this issue Mar 12, 2025
Reviewed-by: Saša Nedvědický <[email protected]>
Reviewed-by: Matt Caswell <[email protected]>
(Merged from #27014)

(cherry picked from commit 2fb4cfe)
Sashan pushed a commit to Sashan/openssl that referenced this issue Mar 17, 2025
Reviewed-by: Saša Nedvědický <[email protected]>
Reviewed-by: Matt Caswell <[email protected]>
(Merged from openssl#27014)
@nibanks
Member

nibanks commented Mar 18, 2025

Thanks for the bug report! We've been pretty busy lately. We'll try to take a look soon.

@nibanks nibanks added the external Proposed by non-MSFT label Mar 18, 2025
@nibanks nibanks moved this to Planned in DPT Iteration Tracker Apr 24, 2025
Projects
Status: Planned
Development

No branches or pull requests

3 participants