After the following error (which may or may not be a bug in its own right -- it's possibly a PCIe riser bus integrity issue on my side):
/app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_cuda_mul_mat_q at /app/ggml/src/ggml-cuda/mmq.cu:145
cudaMemcpyAsync(ids_host.data(), ids->data, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
Completions hang (apparently forever):
$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'
Sat 3 May 13:41:45 BST 2025
^C
$ date
Sat 3 May 13:42:33 BST 2025
Whereas before such an error (or after the error and a llama-server restart):
$ date; curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'; date
Sat 3 May 13:40:47 BST 2025
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?"}}],"created":1746276047,"model":"llama-cpp-chat","system_fingerprint":"b5269-1d36b367","object":"chat.completion","usage":{"completion_tokens":10,"prompt_tokens":11,"total_tokens":21},"id":"chatcmpl-shzI6fsLwPfFYsQ6SRArjYUKB6lI1mMD","timings":{"prompt_n":11,"prompt_ms":88.858,"prompt_per_token_ms":8.078000000000001,"prompt_per_second":123.79301807378064,"predicted_n":10,"predicted_ms":198.143,"predicted_per_token_ms":19.8143,"predicted_per_second":50.468600959912784}}Sat 3 May 13:40:47 BST 2025
This is not ideal in itself, but understandable.
However, we would at least want to be able to restart the service automatically, since a restart is sufficient for recovery (at least with pcie_aspm=off given on the Linux kernel command line).
Docker/k8s could handle that restarting, but they would need the health endpoint to tell them that there is a problem. Instead, even after the error, the health endpoint keeps reporting "ok".
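As a minimal sketch of what a truthful health endpoint would enable (not part of my setup; the container name and polling interval below are made up), an external watchdog could be as simple as:

$ while sleep 30; do curl -fsS --max-time 5 http://localhost:8080/health > /dev/null || docker restart llama-server; done

A Kubernetes livenessProbe pointed at the same endpoint would do the same job without the hand-rolled loop, but in both cases the endpoint has to stop answering "ok" when the server is actually broken.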
Name and Version
ghcr.io/ggerganov/llama.cpp:server-cuda d5709176ee6d
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
(see above)
So, at a minimum, it would be helpful if the health endpoint toggled from "ok" to "error" (or similar) when this CUDA error occurs.
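For illustration only (this is the requested behaviour, not something that exists today), a probe could then rely purely on the HTTP status code, e.g.:

$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/health
503

Any non-200 code after such a CUDA error would be enough for Docker/k8s to trigger a restart, without having to parse the response body.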
First Bad Commit
No response
Relevant log output
(see above)