
Misc. bug: Completions hang after CUDA error, but health endpoint reports all OK #13281

Open
lee-b opened this issue May 3, 2025 · 0 comments

lee-b commented May 3, 2025

Name and Version

ghcr.io/ggerganov/llama.cpp:server-cuda d5709176ee6d

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

docker-compose config (with healthchecks and volumes removed for brevity):

services:
  llama-cpp-chat:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    pull_policy: always
    restart: always
    ports:
      - 8080:8080
    command:
      - "-a"
      - "llama-cpp-chat"

      - "--tensor-split"
      - "22,22,22,10"

      - "--parallel"
      - "4"
      - "--batch-size"
      - "1024"
      - "--ubatch-size"
      - "512"
      - "--threads-http"
      - "4"

      - "-ngl"
      - "500"

      - "-fa"

      - "-m"
      - "/models/unsloth--Llama-4-Scout-17B-16E-Instruct-GGUF/IQ4_XS/Llama-4-Scout-17B-16E-Instruct-IQ4_XS-00001-of-00002.gguf"
      - "-c"
      - "32768"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ "0", "1", "2" ]
              capabilities: [gpu]


$ docker images | grep server-cuda
ghcr.io/ggerganov/llama.cpp                      server-cuda   d5709176ee6d   7 hours ago    2.76GB

Problem description & steps to reproduce

After the following error (which may or may not be a bug in its own right -- it's possibly a PCIe riser bus integrity issue on my side):

/app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
   CUDA error: an illegal memory access was encountered
   current device: 0, in function ggml_cuda_mul_mat_q at /app/ggml/src/ggml-cuda/mmq.cu:145
   cudaMemcpyAsync(ids_host.data(), ids->data, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)

Completions hang (apparently forever):

$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
Sat  3 May 13:41:45 BST 2025
^C
$ date
Sat  3 May 13:42:33 BST 2025

Whereas before such an error occurs (or after the error and a llama-server restart), the same request returns normally:

$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'; date
Sat  3 May 13:40:47 BST 2025
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?"}}],"created":1746276047,"model":"llama-cpp-chat","system_fingerprint":"b5269-1d36b367","object":"chat.completion","usage":{"completion_tokens":10,"prompt_tokens":11,"total_tokens":21},"id":"chatcmpl-shzI6fsLwPfFYsQ6SRArjYUKB6lI1mMD","timings":{"prompt_n":11,"prompt_ms":88.858,"prompt_per_token_ms":8.078000000000001,"prompt_per_second":123.79301807378064,"predicted_n":10,"predicted_ms":198.143,"predicted_per_token_ms":19.8143,"predicted_per_second":50.468600959912784}}Sat  3 May 13:40:47 BST 2025

This is not ideal in itself, but understandable.

However, at a minimum, we would want to be able to restart the service automatically, since a restart is sufficient for recovery (at least with pcie_aspm=off given on the Linux kernel command line).

Docker/k8s could handle the restart, but they would need the health endpoint to report that something is wrong. Instead, I get:

$ curl http://localhost:8080/health
{"status":"ok"}

So, at a minimum, flipping the status from "ok" to "error" (or similar) when this CUDA error occurs would be helpful.
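
Until /health reflects this state, the only workaround I can see is a compose healthcheck that probes an actual completion with a hard timeout rather than trusting /health. A minimal sketch (assuming curl is available inside the server-cuda image; the interval/timeout values and the probe prompt are just placeholders):

services:
  llama-cpp-chat:
    # ... image, command, deploy, etc. as above ...
    healthcheck:
      # Probe a real completion rather than /health, since /health keeps
      # reporting "ok" after the CUDA error; --max-time bounds the hang.
      test:
        - "CMD-SHELL"
        - >-
          curl -sf --max-time 30 -X POST
          http://localhost:8080/v1/chat/completions
          -H 'Content-Type: application/json'
          -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":1}'
          > /dev/null
      interval: 60s
      timeout: 35s
      retries: 3
      start_period: 120s

Note that plain Docker only marks the container unhealthy; something external (an autoheal-style container, or a Kubernetes liveness probe running the same curl command) still has to perform the restart, which is why having /health report the error directly would be much cleaner.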

First Bad Commit

No response

Relevant log output

(see above)
