
Misc. bug: Completions hang after CUDA error, but health endpoint reports all OK #13281

Open
lee-b opened this issue May 3, 2025 · 0 comments

lee-b commented May 3, 2025

Name and Version

ghcr.io/ggerganov/llama.cpp:server-cuda d5709176ee6d

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

docker-compose config (with healthchecks and volumes removed for brevity):

services:
  llama-cpp-chat:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    pull_policy: always
    restart: always
    ports:
      - 8080:8080
    command:
      - "-a"
      - "llama-cpp-chat"

      - "--tensor-split"
      - "22,22,22,10"

      - "--parallel"
      - "4"
      - "--batch-size"
      - "1024"
      - "--ubatch-size"
      - "512"
      - "--threads-http"
      - "4"

      - "-ngl"
      - "500"

      - "-fa"

      - "-m"
      - "/models/unsloth--Llama-4-Scout-17B-16E-Instruct-GGUF/IQ4_XS/Llama-4-Scout-17B-16E-Instruct-IQ4_XS-00001-of-00002.gguf"
      - "-c"
      - "32768"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ "0", "1", "2" ]
              capabilities: [gpu]


$ docker images | grep server-cuda
ghcr.io/ggerganov/llama.cpp                      server-cuda   d5709176ee6d   7 hours ago    2.76GB

Problem description & steps to reproduce

After the following error (which may or may not be a bug in its own right -- it's possibly a PCIe riser bus integrity issue on my side):

/app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
   CUDA error: an illegal memory access was encountered
   current device: 0, in function ggml_cuda_mul_mat_q at /app/ggml/src/ggml-cuda/mmq.cu:145
   cudaMemcpyAsync(ids_host.data(), ids->data, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)

Completions hang (apparently forever):

$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
Sat  3 May 13:41:45 BST 2025
^C
$ date
Sat  3 May 13:42:33 BST 2025

Whereas before such an error occurs (or after the error and a llama-server restart), the same request returns normally:

$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'; date
Sat  3 May 13:40:47 BST 2025
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?"}}],"created":1746276047,"model":"llama-cpp-chat","system_fingerprint":"b5269-1d36b367","object":"chat.completion","usage":{"completion_tokens":10,"prompt_tokens":11,"total_tokens":21},"id":"chatcmpl-shzI6fsLwPfFYsQ6SRArjYUKB6lI1mMD","timings":{"prompt_n":11,"prompt_ms":88.858,"prompt_per_token_ms":8.078000000000001,"prompt_per_second":123.79301807378064,"predicted_n":10,"predicted_ms":198.143,"predicted_per_token_ms":19.8143,"predicted_per_second":50.468600959912784}}Sat  3 May 13:40:47 BST 2025

This is not ideal in itself, but understandable.

However, at a minimum, we would want to be able to restart the service automatically, since a restart is sufficient for recovery (at least with pcie_aspm=off given on the Linux kernel command line).

Docker/k8s could handle the restart, but they would need the health endpoint to report that something is wrong. Instead, I get:

$ curl http://localhost:8080/health
{"status":"ok"}

So, at a minimum, flipping the status from "ok" to "error" (or similar) when this CUDA error occurs would be helpful.
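
Until /health reflects this state, the only workaround I can see is a compose healthcheck that probes an actual completion with a hard timeout rather than trusting /health. A minimal sketch (assuming curl is available inside the server-cuda image; the interval/timeout values and the probe prompt are just placeholders):

services:
  llama-cpp-chat:
    # ... image, command, deploy, etc. as above ...
    healthcheck:
      # Probe a real completion rather than /health, since /health keeps
      # reporting "ok" after the CUDA error; --max-time bounds the hang.
      test:
        - "CMD-SHELL"
        - >-
          curl -sf --max-time 30 -X POST
          http://localhost:8080/v1/chat/completions
          -H 'Content-Type: application/json'
          -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":1}'
          > /dev/null
      interval: 60s
      timeout: 35s
      retries: 3
      start_period: 120s

Note that plain Docker only marks the container unhealthy; something external (an autoheal-style container, or a Kubernetes liveness probe running the same curl command) still has to perform the restart, which is why having /health report the error directly would be much cleaner.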

First Bad Commit

No response

Relevant log output

(see above)
