Eval bug: Heavy nondeterminism in Qwen3 MoE (CUDA) #13280

Closed
matteoserva opened this issue May 3, 2025 · 17 comments · Fixed by #13299

Comments

@matteoserva
Contributor

matteoserva commented May 3, 2025

Name and Version

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
version: 5269 (1d36b36)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

rtx 5060ti 16GB + rtx4060ti 16GB

Models

Qwen_Qwen3-30B-A3B-Q6_K.gguf by bartowski.
sha256sum: d511d02955714b08ff1b4354d6eae8ea513179a83fa5498466db2731528074dd

Problem description & steps to reproduce

I'm using a grammar to simulate the no-think Qwen prompt format. Sometimes the output is generated correctly; other times the model outputs the wrong token while still staying aligned with the grammar.

The command I'm using to test:

curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
  "grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
  "temperature": 0.001,
  "n_predict": 6,
  "seed": 42
}'

Correct output: [151667 198 198 151668 ...] <think>\n\n</think>...
Wrong output: [151667 198 198 27 14 ...] <think>\n\n</...

Sometimes the model produces the correct output; other times it produces the wrong one, and the following output breaks since the model cannot see the </think> token. I'm not restarting llama-server between tests and I'm not changing the seed. I expect the model to always output token 151668.

Command line used to launch llama-server: /llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 -fa -ctk q8_0 -ctv q8_0 --slots -a current --temp 0.6

First Bad Commit

No response

Relevant log output

`{"index":0,"content":"<think>\n\n</think"...`
@CISC
Collaborator

CISC commented May 3, 2025

You can't use a grammar to force a specific token, only a specific sequence of characters; if the model can complete that sequence with alternate tokens, that's perfectly acceptable to the grammar.
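
To illustrate with a minimal sketch (toy vocabulary only; the ids for "think" and ">" are hypothetical, while 151668, 27 and 14 are the ids seen in this issue): a character-level grammar accepts any token path that spells the required text, so both outputs above are valid.

# Minimal sketch with a toy vocabulary: the grammar only sees the detokenized
# characters, so the dedicated end-of-think token and a multi-token spelling of
# the same characters are both acceptable.
toy_pieces = {
    151668: "</think>",   # dedicated end-of-think token (id from this issue)
    27: "<",              # id from this issue
    14: "/",              # id from this issue
    26865: "think",       # hypothetical id, for illustration only
    29: ">",              # hypothetical id, for illustration only
}

single_token = [151668]
multi_token = [27, 14, 26865, 29]

assert "".join(toy_pieces[t] for t in single_token) == "</think>"
assert "".join(toy_pieces[t] for t in multi_token) == "</think>"
print("both token sequences satisfy a grammar that requires the text '</think>'")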

@CISC CISC closed this as completed May 3, 2025
@CISC
Collaborator

CISC commented May 3, 2025

If you are just trying to disable thinking, you yourself have already implemented the means to do that; see #13212 (comment) :)

@matteoserva
Contributor Author

This is a different problem. The output shouldn't be random when the seed is fixed and the temperature is zero.

If I let the model generate its output "naturally", the generated tokens are always the same at every run.

If I use the grammar to force the exact same output string, then the output tokens can change randomly, producing extremely low probability tokens.

@CISC
Collaborator

CISC commented May 3, 2025

This is a different problem. The output shouldn't be random when the seed is fixed and the temperature is zero.

Shouldn't, but is, at least with CUDA unfortunately.

If I let the model generate its output "naturally", the generated tokens are always the same at every run.

If I use the grammar to force the exact same output string, then the output tokens can change randomly, producing extremely low probability tokens.

Well, grammar will naturally skew the probabilities; I think you're just seeing the results of the numerical instability of CUDA more frequently due to the limited token selection.
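
A minimal numeric sketch of this effect (the logit values and noise scale are made up, chosen only to roughly match the magnitudes in the logs below): when the grammar leaves just two near-tied, low-probability candidates, a small run-to-run numerical wobble can flip the greedy choice.

import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Two grammar-allowed candidates that the model itself considers very unlikely.
base_logits = {"</think>": -22.5, "<": -23.0}

random.seed(0)
for run in range(10):
    # Emulate run-to-run numerical noise; the winner can differ between runs
    # even though base_logits never change.
    noisy = {tok: lg + random.uniform(-1.5, 1.5) for tok, lg in base_logits.items()}
    winner = max(noisy, key=noisy.get)
    print(run, winner, [round(p, 3) for p in softmax(list(noisy.values()))])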

@CISC
Collaborator

CISC commented May 3, 2025

I'll reopen the issue in case someone wants to take a closer look, but you might want to change the title to a more accurate description of the issue.

@CISC CISC reopened this May 3, 2025
@matteoserva
Contributor Author

matteoserva commented May 3, 2025

I ran some tests and can provide more context. I used the following command to print the probabilities:

curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
  "grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
  "temperature": 0.001,
  "n_predict": 4,
  "seed": 42,
  "n_probs": 4
}'

The 4 output tokens are

  1. <think> \n \n </think> when the output is generated correctly
  2. <think> \n \n < when the generation is wrong. The model then goes out of distribution and generates low quality output.

The logprob of the 4th token is:

  1. </think> : -22.50234603881836,
  2. <: -23.995466232299805

Sometimes I get the wrong result 30 times in a row; sometimes I get the correct result 30 times in a row. If the runs were independent, the chance of 30 identical results in a row would be less than 1 in a billion (2^30), so this doesn't look like correct behavior.

EDIT: nevermind. I have to test with "cache_prompt": false

@matteoserva matteoserva changed the title from "Eval bug: grammar breaks Qwen3 MoE" to "Eval bug: grammar breaks sampling in Qwen3 MoE" on May 3, 2025
@matteoserva
Contributor Author

matteoserva commented May 3, 2025

OK, I was able to narrow down the problem even more with this Python script:

import requests
import json

headers = {
    'Content-Type': 'application/json',
}

# Same grammar-constrained request as above; prompt caching is disabled so
# every run re-evaluates the full prompt.
json_data = {
    'prompt': '<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n',
    'grammar': 'root ::= "<think>\\n\\n</think>\\n\\n" .*',
    'temperature': 0.001,
    'n_predict': 4,
    'seed': 42,
    'n_probs': 1,
    'cache_prompt': False,
}

# Repeat the identical request and print only the 4th generated token and its logprob.
for _ in range(10):
    response = requests.post('http://localhost:8080/completion', headers=headers, json=json_data)
    result = json.loads(response.text)
    fourth = result["completion_probabilities"][3]
    del fourth["top_logprobs"]
    del fourth["bytes"]
    print(fourth)

This is the output of Qwen3-MoE with the CPU backend (debug build):

$ python3 qwen-mode-bug.py 
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
...

This is the output of Qwen3-MoE with the CUDA backend (release build):

$ python3 qwen-mode-bug.py 
{'id': 27, 'token': '<', 'logprob': -24.497241973876953}
{'id': 27, 'token': '<', 'logprob': -25.099708557128906}
{'id': 151668, 'token': '</think>', 'logprob': -22.38550567626953}
{'id': 27, 'token': '<', 'logprob': -22.939809799194336}
{'id': 27, 'token': '<', 'logprob': -23.012990951538086}
{'id': 27, 'token': '<', 'logprob': -24.535411834716797}
{'id': 151668, 'token': '</think>', 'logprob': -26.998727798461914}
{'id': 27, 'token': '<', 'logprob': -25.120460510253906}
{'id': 27, 'token': '<', 'logprob': -26.36814308166504}
{'id': 27, 'token': '<', 'logprob': -25.176483154296875}

This is the output with Qwen3-0.6B (dense), CUDA backend (release build):

{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
...

Problem

The log probabilities for Qwen3-MoE (CUDA) vary wildly between -22 and -27, which is not expected. The inputs are always the same, so the probability distribution should always be the same.
I might be wrong, but I'm not expecting this behaviour.
@JohannesGaessler Maybe you can confirm this?
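
A possible way to quantify this (same assumptions as the script above: llama-server on localhost:8080 with the model loaded) is to repeat the request and summarise the 4th token across runs; on a deterministic backend the counter would contain a single entry and the standard deviation would be zero.

import collections
import statistics

import requests

payload = {
    'prompt': '<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n',
    'grammar': 'root ::= "<think>\\n\\n</think>\\n\\n" .*',
    'temperature': 0.001,
    'n_predict': 4,
    'seed': 42,
    'n_probs': 1,
    'cache_prompt': False,
}

token_counts = collections.Counter()
logprobs = []
for _ in range(30):
    response = requests.post('http://localhost:8080/completion', json=payload)
    fourth = response.json()["completion_probabilities"][3]
    token_counts[fourth["token"]] += 1
    logprobs.append(fourth["logprob"])

print("4th-token counts:", dict(token_counts))
print("logprob mean %.3f, stdev %.3f" % (statistics.mean(logprobs), statistics.pstdev(logprobs)))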

@JohannesGaessler
Collaborator

Given the server parameters the output should, to my knowledge, be deterministic, even with CUDA. I have recently made quite a lot of changes to the MoE code. Unfortunately, as was pointed out to me in ggml-org/ggml#1178 (comment), prior to my changes the results were already non-deterministic, so I think a git bisect will be inconclusive.

Just to rule out the possibility of a hardware issue: do you get deterministic results after running nvidia-smi --lock-gpu-clocks 0,1000 --mode 1? I have a machine with multiple 4090s running off of a single 2050 W PSU and as it turns out even with just 3 of them running in parallel the power spikes can overlap in unlucky ways that either cause bit flips or system crashes.

@matteoserva
Contributor Author

Just to rule out the possibility of a hardware issue: do you get deterministic results after running nvidia-smi --lock-gpu-clocks 0,1000 --mode 1? I have a machine with multiple 4090s running off of a single 2050 W PSU and as it turns out even with just 3 of them running in parallel the power spikes can overlap in unlucky ways that either cause bit flips or system crashes.

It doesn't seem to change anything.

Qwen3-MoE with default settings:

python3 qwen-mode-bug.py 
{'id': 151668, 'token': '</think>', 'logprob': -25.749713897705078}
{'id': 27, 'token': '<', 'logprob': -25.500146865844727}
{'id': 27, 'token': '<', 'logprob': -23.861719131469727}
{'id': 27, 'token': '<', 'logprob': -24.062076568603516}
{'id': 27, 'token': '<', 'logprob': -24.572689056396484}
{'id': 27, 'token': '<', 'logprob': -23.132158279418945}
{'id': 151668, 'token': '</think>', 'logprob': -22.906593322753906}
{'id': 27, 'token': '<', 'logprob': -23.740596771240234}
{'id': 27, 'token': '<', 'logprob': -23.677282333374023}
{'id': 27, 'token': '<', 'logprob': -24.058055877685547}

Qwen3-MoE with nvidia-smi --lock-gpu-clocks 0,1000 --mode 1:

$ python3 qwen-mode-bug.py 
{'id': 27, 'token': '<', 'logprob': -24.385953903198242}
{'id': 27, 'token': '<', 'logprob': -24.010208129882812}
{'id': 151668, 'token': '</think>', 'logprob': -23.4265079498291}
{'id': 151668, 'token': '</think>', 'logprob': -25.16918182373047}
{'id': 151668, 'token': '</think>', 'logprob': -26.014835357666016}
{'id': 151668, 'token': '</think>', 'logprob': -24.031843185424805}
{'id': 27, 'token': '<', 'logprob': -23.161317825317383}
{'id': 151668, 'token': '</think>', 'logprob': -24.846834182739258}
{'id': 151668, 'token': '</think>', 'logprob': -23.85172462463379}
{'id': 27, 'token': '<', 'logprob': -24.06478500366211}

Qwen3 32B (dense) with nvidia-smi --lock-gpu-clocks 0,1000 --mode 1:

$ python3 qwen-mode-bug.py 
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
[...]

@matteoserva matteoserva changed the title from "Eval bug: grammar breaks sampling in Qwen3 MoE" to "Eval bug: Heavy nondeterminism in Qwen3 MoE (CUDA)" on May 4, 2025
@JohannesGaessler
Collaborator

Please check this fix: #13294 . There was a race condition in the new MMQ code that could be causing this.

@matteoserva
Contributor Author

matteoserva commented May 4, 2025

Please check this fix: #13294 . There was a race condition in the new MMQ code that could be causing this.

No change even after 9a0053f.

Is there something (compiler flags, macros) that I could try to help narrow down the issue?

More tests I ran:

  • Using one or two cards doesn't make any difference.
  • Launched with -ngl 0 or -ngl 1, I get correct results.
  • -ngl 2 and higher give slightly wrong results.
  • Increasing -ngl makes the results worse and worse (worse meaning a higher standard deviation: the probability assigned to that token becomes more and more random). A sketch of a sweep harness for reproducing this follows below.
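
A rough harness for that sweep (a sketch only; the binary path, model path, port and the set of -ngl values are assumptions to adjust for your setup):

import statistics
import subprocess
import time

import requests

SERVER_BIN = "./llama-server"              # assumed path to the binary
MODEL = "Qwen_Qwen3-30B-A3B-Q6_K.gguf"     # model from this issue, assumed local path
URL = "http://localhost:8080"

payload = {
    'prompt': '<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n',
    'grammar': 'root ::= "<think>\\n\\n</think>\\n\\n" .*',
    'temperature': 0.001, 'n_predict': 4, 'seed': 42,
    'n_probs': 1, 'cache_prompt': False,
}

def wait_for_server(timeout=600):
    # Poll the server's /health endpoint until it reports ready.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(URL + "/health").status_code == 200:
                return
        except requests.ConnectionError:
            pass
        time.sleep(2)
    raise RuntimeError("llama-server did not come up in time")

for ngl in (0, 1, 2, 8, 32, 99):
    proc = subprocess.Popen([SERVER_BIN, "-m", MODEL, "-ngl", str(ngl), "-c", "32768"])
    try:
        wait_for_server()
        logprobs = []
        for _ in range(5):
            r = requests.post(URL + "/completion", json=payload)
            logprobs.append(r.json()["completion_probabilities"][3]["logprob"])
        # A nonzero spread means the identical request gave different results.
        print("-ngl %3d: 4th-token logprob stdev = %.4f" % (ngl, statistics.pstdev(logprobs)))
    finally:
        proc.terminate()
        proc.wait()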

@JohannesGaessler
Collaborator

Does -ub 1 fix the issue?

Does this patch fix the issue?

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 9fb2134f9..8374c9a35 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1979,22 +1979,6 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
 
     const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
 
-    if (src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32) {
-        if (ne2 == 1) {
-            if (ggml_is_quantized(src0->type)) {
-                ggml_cuda_mul_mat_vec_q(ctx, src0, src1, ids, dst);
-            } else {
-                ggml_cuda_mul_mat_vec(ctx, src0, src1, ids, dst);
-            }
-            return;
-        }
-
-        if (ggml_cuda_should_use_mmq(src0->type, cc, ne12)) {
-            ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
-            return;
-        }
-    }
-
     cudaStream_t stream = ctx.stream();
 
     GGML_ASSERT(nb12 % nb11 == 0);

@matteoserva
Contributor Author

matteoserva commented May 4, 2025

Does -ub 1 fix the issue?

Does this patch fix the issue?

  • With -ub 1 I get correct and deterministic results
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
  • with -ngl 0 I get wrong but deterministic results
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
  • with -ngl 99 I get wrong and non-deterministic results
{'id': 27, 'token': '<', 'logprob': -24.733509063720703}
{'id': 151668, 'token': '</think>', 'logprob': -24.34672737121582}
{'id': 27, 'token': '<', 'logprob': -23.8717098236084}
{'id': 27, 'token': '<', 'logprob': -24.000146865844727}
{'id': 151668, 'token': '</think>', 'logprob': -22.6581974029541}
  • with your patch merged on top of cuda-mmq-fix-race-cond I get a crash:
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2009
  cudaStreamSynchronize(stream)
CUDA error
[New LWP 37411]
[New LWP 37422]
[New LWP 37423]
[New LWP 37424]
[New LWP 37425]
[New LWP 37426]
[New LWP 37427]
[New LWP 37428]
[New LWP 37429]
[New LWP 37430]
[New LWP 37431]
[New LWP 37432]
[New LWP 37433]
[New LWP 37434]
[New LWP 37435]
[New LWP 37436]
[New LWP 37437]
[New LWP 37438]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f01d49f2c17 in __GI___wait4 (pid=37515, stat_loc=0x7ffdc9e09d34, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: File o directory non esistente.
#0  0x00007f01d49f2c17 in __GI___wait4 (pid=37515, stat_loc=0x7ffdc9e09d34, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f01d4e44fd1 in ggml_abort () from /home/matteo/programmi/llama.cpp/build/bin/libggml-base.so
#2  0x00007f01d1c93ae3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/matteo/programmi/llama.cpp/build/bin/libggml-cuda.so
#3  0x00007f01d1c9ede0 in ggml_cuda_mul_mat_id(ggml_backend_cuda_context&, ggml_tensor*) () from /home/matteo/programmi/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007f01d1c9fccb in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/matteo/programmi/llama.cpp/build/bin/libggml-cuda.so
#5  0x00007f01d4e59fe3 in ggml_backend_sched_graph_compute_async () from /home/matteo/programmi/llama.cpp/build/bin/libggml-base.so
#6  0x00007f01d4f7abc9 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#7  0x00007f01d4f7dde6 in llama_context::decode(llama_batch&) () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#8  0x00007f01d4f7f1ab in llama_decode () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#9  0x0000563e617ed899 in server_context::update_slots() ()
#10 0x0000563e617b915f in server_queue::start_loop() ()
#11 0x0000563e61786ee3 in main ()

@JohannesGaessler
Collaborator

Potential fix: #13299

@matteoserva
Contributor Author

Commit 93c4e23 fixed the nondeterminism issue. Now I'm getting the same answer on every consecutive run. Nice!
I still see some differences when I change the parameters. I don't know whether that's an issue.

My launch command: llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 --slots -a current --temp 0.6 -m Qwen3-30B-A3B-Q6_K.gguf

Testing with the Python script I posted:

  • normal launch command (10 identical rows)
{'id': 151668, 'token': '</think>', 'logprob': -23.03097915649414}
  • with -ub 1
{'id': 27, 'token': '<', 'logprob': -22.11159896850586}
  • with -fa
{'id': 151668, 'token': '</think>', 'logprob': -23.786317825317383}
  • with -fa -ub 1
{'id': 151668, 'token': '</think>', 'logprob': -22.095962524414062}

@JohannesGaessler
Copy link
Collaborator

Generally speaking, it is expected that results will not be bit-for-bit identical if you vary some of the parameters. If different code is being run, then the floating-point rounding error will be different (there are potentially other differences too). Such small differences can then blow up into comparatively larger differences in the outputs.
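
A tiny, self-contained illustration of the rounding-order point (unrelated to llama.cpp itself; it just shows that floating-point addition is not associative, so different code paths legitimately produce slightly different numbers):

# Grouping the same additions differently changes the low bits of the result.
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6

# Summing the same values in a different order typically differs in the last bits too.
import random
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100000)]
print(sum(xs) - sum(reversed(xs)))   # usually a tiny, nonzero difference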

@matteoserva
Copy link
Contributor Author

Well. The issue is solved, then.
Thanks for the quick fix and update!

@CISC CISC linked a pull request May 4, 2025 that will close this issue
@CISC CISC closed this as completed May 4, 2025