Misc. bug: Server does not always cancel requests for disconnected connections #13262

Open
CyberShadow opened this issue May 2, 2025 · 0 comments

Name and Version

$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu

(Actually version 5161)

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m gemma-3-27b-pt-q4_0.gguf -ngl 9999 --host 127.0.0.1 --port 8000 --threads-http 1

# ...

curl -v --request POST \
    --url http://127.0.0.1:8000/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Five, Four, Three, Two, One, '$RANDOM'\n\n\n\nThe countdown","n_predict": 256, "n_probs":10, "temperature":0,"stream":true}'

Problem description & steps to reproduce

It looks like the server sometimes keeps generating responses for queued HTTP requests whose client has since disconnected.

I can reproduce the problem as follows:

  1. Start the server
  2. Start the curl command above. The key aspect is that the request must be long-running (i.e. n_predict is high).
  3. While it's still running, in another terminal, start and then immediately cancel (with Ctrl+C) the same command a few times, in quick succession.
  4. Start the curl command once more.
  5. Cancel the original curl command in step 2.

Expected behavior: The server should immediately start replying to the request from step 4.
Actual behavior: The server appears to hang, because it is pointlessly generating replies for the requests that were canceled in step 3.
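
For convenience, the manual steps above can also be scripted. The sketch below is only an approximation of the same sequence: it assumes the server is already running on 127.0.0.1:8000 as shown under "Command line", uses plain kill (SIGTERM) in place of Ctrl+C (either way curl aborts and closes the connection), and the sleep durations and loop count are arbitrary.

#!/usr/bin/env bash
# Rough automation of the manual reproduction steps above (sketch only).
# Assumes llama-server is already running on 127.0.0.1:8000 as shown under "Command line".

url=http://127.0.0.1:8000/completion

body() {
  # Same request body as in "Command line"; $RANDOM keeps each prompt unique.
  echo '{"prompt": "Five, Four, Three, Two, One, '$RANDOM'\n\n\n\nThe countdown","n_predict": 256, "n_probs": 10, "temperature": 0, "stream": true}'
}

# Step 2: start a long-running streaming request in the background.
curl -s --no-buffer -X POST "$url" -H "Content-Type: application/json" --data "$(body)" > /dev/null &
long_pid=$!
sleep 1

# Step 3: start and almost immediately cancel the same request a few times in quick
# succession; kill (SIGTERM) stands in for Ctrl+C here.
for _ in 1 2 3; do
  curl -s --no-buffer -X POST "$url" -H "Content-Type: application/json" --data "$(body)" > /dev/null &
  sleep 0.2
  kill "$!" 2> /dev/null
done

# Step 4: issue the request once more and see how long it takes for output to start.
time curl -s --no-buffer -X POST "$url" -H "Content-Type: application/json" --data "$(body)" | head -c 200
echo

# Step 5: cancel the original request from step 2.
kill "$long_pid" 2> /dev/null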

I tried to force the server to handle one request at a time with the --threads-http 1 option, but it doesn't seem to make a difference.
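
As a side note, one way to confirm that the server is still busy with the canceled requests while it appears to hang is the slots monitoring endpoint; depending on the build, it may need to be enabled by starting the server with --slots.

# Inspect the slot state while the request from step 4 appears to hang.
# (The endpoint may need to be enabled with --slots when starting the server.)
curl -s http://127.0.0.1:8000/slots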

First Bad Commit

This seems to be a regression, but it was introduced about a year ago, so the exact change which introduced it is probably not relevant.

Relevant log output
