Eval bug: Heavy nondeterminism in Qwen3 MoE (CUDA) #13280

Closed
matteoserva opened this issue May 3, 2025 · 17 comments · Fixed by #13299

Comments

@matteoserva
Contributor

matteoserva commented May 3, 2025

Name and Version

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
version: 5269 (1d36b36)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

rtx 5060ti 16GB + rtx4060ti 16GB

Models

Qwen_Qwen3-30B-A3B-Q6_K.gguf by bartowski.
sha256sum: d511d02955714b08ff1b4354d6eae8ea513179a83fa5498466db2731528074dd

Problem description & steps to reproduce

I'm using a grammar to simulate the no-think Qwen prompt format. Sometimes the output is generated correctly; other times the model outputs the wrong token while still staying aligned with the grammar.

The command I'm using to test:

curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
  "grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
  "temperature": 0.001,
  "n_predict": 6,
  "seed": 42
}'

Correct output: [151667 198 198 151668 ...] <think>\n\n</think>...
Wrong output: [151667 198 198 27 14 ...] <think>\n\n</...

Sometimes the model produces the correct output; other times it produces the wrong one, and the following output breaks since the model cannot see the </think> token. I'm not restarting llama-server between tests and I'm not changing the seed. I expect the model to always output token 151668.

Command line used to launch llama-server: /llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 -fa -ctk q8_0 -ctv q8_0 --slots -a current --temp 0.6

First Bad Commit

No response

Relevant log output

`{"index":0,"content":"<think>\n\n</think"...`
@CISC
Collaborator

CISC commented May 3, 2025

You can't use a grammar to force a specific token, only a specific sequence of characters; if the model can complete that sequence with alternate tokens, that's perfectly acceptable to the grammar.
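
To illustrate with a minimal sketch (toy vocabulary only; the ids for "think" and ">" are hypothetical, while 151668, 27 and 14 are the ids seen in this issue): a character-level grammar accepts any token path that spells the required text, so both outputs above are valid.

# Minimal sketch with a toy vocabulary: the grammar only sees the detokenized
# characters, so the dedicated end-of-think token and a multi-token spelling of
# the same characters are both acceptable.
toy_pieces = {
    151668: "</think>",   # dedicated end-of-think token (id from this issue)
    27: "<",              # id from this issue
    14: "/",              # id from this issue
    26865: "think",       # hypothetical id, for illustration only
    29: ">",              # hypothetical id, for illustration only
}

single_token = [151668]
multi_token = [27, 14, 26865, 29]

assert "".join(toy_pieces[t] for t in single_token) == "</think>"
assert "".join(toy_pieces[t] for t in multi_token) == "</think>"
print("both token sequences satisfy a grammar that requires the text '</think>'")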

@CISC CISC closed this as completed May 3, 2025
@CISC
Collaborator

CISC commented May 3, 2025

If you are just trying to disable thinking, you yourself have already implemented the means to do that; see #13212 (comment) :)

@matteoserva
Contributor Author

This is a different problem. The output shouldn't be random when the seed is fixed and the temperature is zero.

If I let the model generate its output "naturally", the generated tokens are always the same at every run.

If I use the grammar to force the exact same output string, then the output tokens can change randomly, producing extremely low probability tokens.

@CISC
Collaborator

CISC commented May 3, 2025

This is a different problem. The output shouldn't be random when the seed is fixed and the temperature is zero.

Shouldn't, but is, at least with CUDA unfortunately.

If I let the model generate its output "naturally", the generated tokens are always the same at every run.

If I use the grammar to force the exact same output string, then the output tokens can change randomly, producing extremely low probability tokens.

Well, grammar will naturally skew the probabilities; I think you're just seeing the results of the numerical instability of CUDA more frequently due to the limited token selection.
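
A minimal numeric sketch of this effect (the logit values and noise scale are made up, chosen only to roughly match the magnitudes in the logs below): when the grammar leaves just two near-tied, low-probability candidates, a small run-to-run numerical wobble can flip the greedy choice.

import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Two grammar-allowed candidates that the model itself considers very unlikely.
base_logits = {"</think>": -22.5, "<": -23.0}

random.seed(0)
for run in range(10):
    # Emulate run-to-run numerical noise; the winner can differ between runs
    # even though base_logits never change.
    noisy = {tok: lg + random.uniform(-1.5, 1.5) for tok, lg in base_logits.items()}
    winner = max(noisy, key=noisy.get)
    print(run, winner, [round(p, 3) for p in softmax(list(noisy.values()))])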

@CISC
Collaborator

CISC commented May 3, 2025

I'll reopen the issue in case someone wants to take a closer look, but you might want to change the title to a more accurate description of the issue.

@CISC CISC reopened this May 3, 2025
@matteoserva
Contributor Author

matteoserva commented May 3, 2025

I ran some tests and can provide more context. I used the following command to print the probabilities:

curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
  "grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
  "temperature": 0.001,
  "n_predict": 4,
  "seed": 42,
  "n_probs": 4
}'

The 4 output tokens are

  1. <think> \n \n </think> when the output is generated correctly
  2. <think> \n \n < when the generation is wrong. The model then goes out of distribution and generates low quality output.

The logprob of the 4th token is:

  1. </think> : -22.50234603881836,
  2. <: -23.995466232299805

Sometimes I get the wrong result 30 times in a row; sometimes I get the correct result 30 times in a row. If the runs were independent, the chance of 30 identical results in a row would be less than 1 in a billion (2^30), so this doesn't look like correct behavior.

EDIT: nevermind. I have to test with "cache_prompt": false

@matteoserva matteoserva changed the title from "Eval bug: grammar breaks Qwen3 MoE" to "Eval bug: grammar breaks sampling in Qwen3 MoE" on May 3, 2025
@matteoserva
Contributor Author

matteoserva commented May 3, 2025

OK, I was able to narrow down the problem even more with this Python script:

import requests
import json

headers = {
    'Content-Type': 'application/json',
}

# Same grammar-constrained request as above; prompt caching is disabled so
# every run re-evaluates the full prompt.
json_data = {
    'prompt': '<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n',
    'grammar': 'root ::= "<think>\\n\\n</think>\\n\\n" .*',
    'temperature': 0.001,
    'n_predict': 4,
    'seed': 42,
    'n_probs': 1,
    'cache_prompt': False,
}

# Repeat the identical request and print only the 4th generated token and its logprob.
for _ in range(10):
    response = requests.post('http://localhost:8080/completion', headers=headers, json=json_data)
    result = json.loads(response.text)
    fourth = result["completion_probabilities"][3]
    del fourth["top_logprobs"]
    del fourth["bytes"]
    print(fourth)

This is the output of Qwen3-MoE with the CPU backend (debug build):

$ python3 qwen-mode-bug.py 
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
{'id': 27, 'token': '<', 'logprob': -21.9802303314209}
...

This is the output of Qwen3-MoE with the CUDA backend (release build):

$ python3 qwen-mode-bug.py 
{'id': 27, 'token': '<', 'logprob': -24.497241973876953}
{'id': 27, 'token': '<', 'logprob': -25.099708557128906}
{'id': 151668, 'token': '</think>', 'logprob': -22.38550567626953}
{'id': 27, 'token': '<', 'logprob': -22.939809799194336}
{'id': 27, 'token': '<', 'logprob': -23.012990951538086}
{'id': 27, 'token': '<', 'logprob': -24.535411834716797}
{'id': 151668, 'token': '</think>', 'logprob': -26.998727798461914}
{'id': 27, 'token': '<', 'logprob': -25.120460510253906}
{'id': 27, 'token': '<', 'logprob': -26.36814308166504}
{'id': 27, 'token': '<', 'logprob': -25.176483154296875}

This is the output with Qwen3-0.6B (dense), CUDA backend (release build):

{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
{'id': 27, 'token': '<', 'logprob': -10.674346923828125}
...

Problem

The log probabilities for Qwen3-MoE (CUDA) vary wildly between -22 and -27, which is not expected. The inputs are always the same, so the probability distribution should always be the same.
I might be wrong, but I'm not expecting this behaviour.
@JohannesGaessler Maybe you can confirm this?
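
A possible way to quantify this (same assumptions as the script above: llama-server on localhost:8080 with the model loaded) is to repeat the request and summarise the 4th token across runs; on a deterministic backend the counter would contain a single entry and the standard deviation would be zero.

import collections
import statistics

import requests

payload = {
    'prompt': '<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n',
    'grammar': 'root ::= "<think>\\n\\n</think>\\n\\n" .*',
    'temperature': 0.001,
    'n_predict': 4,
    'seed': 42,
    'n_probs': 1,
    'cache_prompt': False,
}

token_counts = collections.Counter()
logprobs = []
for _ in range(30):
    response = requests.post('http://localhost:8080/completion', json=payload)
    fourth = response.json()["completion_probabilities"][3]
    token_counts[fourth["token"]] += 1
    logprobs.append(fourth["logprob"])

print("4th-token counts:", dict(token_counts))
print("logprob mean %.3f, stdev %.3f" % (statistics.mean(logprobs), statistics.pstdev(logprobs)))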

@JohannesGaessler
Collaborator

Given the server parameters the output should, to my knowledge, be deterministic, even with CUDA. I have recently made quite a lot of changes to the MoE code. Unfortunately, as was pointed out to me in ggml-org/ggml#1178 (comment), prior to my changes the results were already non-deterministic, so I think a git bisect will be inconclusive.

Just to rule out the possibility of a hardware issue: do you get deterministic results after running nvidia-smi --lock-gpu-clocks 0,1000 --mode 1? I have a machine with multiple 4090s running off of a single 2050 W PSU and as it turns out even with just 3 of them running in parallel the power spikes can overlap in unlucky ways that either cause bit flips or system crashes.

@matteoserva
Contributor Author

Just to rule out the possibility of a hardware issue: do you get deterministic results after running nvidia-smi --lock-gpu-clocks 0,1000 --mode 1? I have a machine with multiple 4090s running off of a single 2050 W PSU and as it turns out even with just 3 of them running in parallel the power spikes can overlap in unlucky ways that either cause bit flips or system crashes.

It doesn't seem to change anything.

Qwen3-MoE with default settings:

python3 qwen-mode-bug.py 
{'id': 151668, 'token': '</think>', 'logprob': -25.749713897705078}
{'id': 27, 'token': '<', 'logprob': -25.500146865844727}
{'id': 27, 'token': '<', 'logprob': -23.861719131469727}
{'id': 27, 'token': '<', 'logprob': -24.062076568603516}
{'id': 27, 'token': '<', 'logprob': -24.572689056396484}
{'id': 27, 'token': '<', 'logprob': -23.132158279418945}
{'id': 151668, 'token': '</think>', 'logprob': -22.906593322753906}
{'id': 27, 'token': '<', 'logprob': -23.740596771240234}
{'id': 27, 'token': '<', 'logprob': -23.677282333374023}
{'id': 27, 'token': '<', 'logprob': -24.058055877685547}

Qwen3-MoE with nvidia-smi --lock-gpu-clocks 0,1000 --mode 1:

$ python3 qwen-mode-bug.py 
{'id': 27, 'token': '<', 'logprob': -24.385953903198242}
{'id': 27, 'token': '<', 'logprob': -24.010208129882812}
{'id': 151668, 'token': '</think>', 'logprob': -23.4265079498291}
{'id': 151668, 'token': '</think>', 'logprob': -25.16918182373047}
{'id': 151668, 'token': '</think>', 'logprob': -26.014835357666016}
{'id': 151668, 'token': '</think>', 'logprob': -24.031843185424805}
{'id': 27, 'token': '<', 'logprob': -23.161317825317383}
{'id': 151668, 'token': '</think>', 'logprob': -24.846834182739258}
{'id': 151668, 'token': '</think>', 'logprob': -23.85172462463379}
{'id': 27, 'token': '<', 'logprob': -24.06478500366211}

Qwen3 32B (dense) with nvidia-smi --lock-gpu-clocks 0,1000 --mode 1:

$ python3 qwen-mode-bug.py 
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
{'id': 151668, 'token': '</think>', 'logprob': -8.822237014770508}
[...]

@matteoserva matteoserva changed the title from "Eval bug: grammar breaks sampling in Qwen3 MoE" to "Eval bug: Heavy nondeterminism in Qwen3 MoE (CUDA)" on May 4, 2025
@JohannesGaessler
Collaborator

Please check this fix: #13294 . There was a race condition in the new MMQ code that could be causing this.

@matteoserva
Contributor Author

matteoserva commented May 4, 2025

Please check this fix: #13294 . There was a race condition in the new MMQ code that could be causing this.

No change even after 9a0053f.

Is there something (compiler flags, macros) that I could try to help narrow down the issue?

More tests I ran:

  • Using one or two cards doesn't make any difference.
  • Launched with -ngl 0 or -ngl 1, I get correct results.
  • -ngl 2 and higher give slightly wrong results.
  • Increasing -ngl makes the results worse and worse (worse meaning a higher standard deviation: the probability assigned to that token becomes more and more random). A sketch of a sweep harness for reproducing this follows below.
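
A rough harness for that sweep (a sketch only; the binary path, model path, port and the set of -ngl values are assumptions to adjust for your setup):

import statistics
import subprocess
import time

import requests

SERVER_BIN = "./llama-server"              # assumed path to the binary
MODEL = "Qwen_Qwen3-30B-A3B-Q6_K.gguf"     # model from this issue, assumed local path
URL = "http://localhost:8080"

payload = {
    'prompt': '<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n',
    'grammar': 'root ::= "<think>\\n\\n</think>\\n\\n" .*',
    'temperature': 0.001, 'n_predict': 4, 'seed': 42,
    'n_probs': 1, 'cache_prompt': False,
}

def wait_for_server(timeout=600):
    # Poll the server's /health endpoint until it reports ready.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(URL + "/health").status_code == 200:
                return
        except requests.ConnectionError:
            pass
        time.sleep(2)
    raise RuntimeError("llama-server did not come up in time")

for ngl in (0, 1, 2, 8, 32, 99):
    proc = subprocess.Popen([SERVER_BIN, "-m", MODEL, "-ngl", str(ngl), "-c", "32768"])
    try:
        wait_for_server()
        logprobs = []
        for _ in range(5):
            r = requests.post(URL + "/completion", json=payload)
            logprobs.append(r.json()["completion_probabilities"][3]["logprob"])
        # A nonzero spread means the identical request gave different results.
        print("-ngl %3d: 4th-token logprob stdev = %.4f" % (ngl, statistics.pstdev(logprobs)))
    finally:
        proc.terminate()
        proc.wait()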

@JohannesGaessler
Collaborator

Does -ub 1 fix the issue?

Does this patch fix the issue?

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 9fb2134f9..8374c9a35 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1979,22 +1979,6 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
 
     const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
 
-    if (src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32) {
-        if (ne2 == 1) {
-            if (ggml_is_quantized(src0->type)) {
-                ggml_cuda_mul_mat_vec_q(ctx, src0, src1, ids, dst);
-            } else {
-                ggml_cuda_mul_mat_vec(ctx, src0, src1, ids, dst);
-            }
-            return;
-        }
-
-        if (ggml_cuda_should_use_mmq(src0->type, cc, ne12)) {
-            ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
-            return;
-        }
-    }
-
     cudaStream_t stream = ctx.stream();
 
     GGML_ASSERT(nb12 % nb11 == 0);

@matteoserva
Contributor Author

matteoserva commented May 4, 2025

Does -ub 1 fix the issue?

Does this patch fix the issue?

  • With -ub 1 I get correct and deterministic results
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
{'id': 151668, 'token': '</think>', 'logprob': -24.28609848022461}
  • with -ngl 0 I get wrong but deterministic results
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
{'id': 27, 'token': '<', 'logprob': -25.752641677856445}
  • with -ngl 99 I get wrong and non-deterministic results
{'id': 27, 'token': '<', 'logprob': -24.733509063720703}
{'id': 151668, 'token': '</think>', 'logprob': -24.34672737121582}
{'id': 27, 'token': '<', 'logprob': -23.8717098236084}
{'id': 27, 'token': '<', 'logprob': -24.000146865844727}
{'id': 151668, 'token': '</think>', 'logprob': -22.6581974029541}
  • with your patch merged on top of cuda-mmq-fix-race-cond I get a crash:
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2009
  cudaStreamSynchronize(stream)
CUDA error
[New LWP 37411]
[New LWP 37422]
[New LWP 37423]
[New LWP 37424]
[New LWP 37425]
[New LWP 37426]
[New LWP 37427]
[New LWP 37428]
[New LWP 37429]
[New LWP 37430]
[New LWP 37431]
[New LWP 37432]
[New LWP 37433]
[New LWP 37434]
[New LWP 37435]
[New LWP 37436]
[New LWP 37437]
[New LWP 37438]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f01d49f2c17 in __GI___wait4 (pid=37515, stat_loc=0x7ffdc9e09d34, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: File o directory non esistente.
#0  0x00007f01d49f2c17 in __GI___wait4 (pid=37515, stat_loc=0x7ffdc9e09d34, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f01d4e44fd1 in ggml_abort () from /home/matteo/programmi/llama.cpp/build/bin/libggml-base.so
#2  0x00007f01d1c93ae3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/matteo/programmi/llama.cpp/build/bin/libggml-cuda.so
#3  0x00007f01d1c9ede0 in ggml_cuda_mul_mat_id(ggml_backend_cuda_context&, ggml_tensor*) () from /home/matteo/programmi/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007f01d1c9fccb in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/matteo/programmi/llama.cpp/build/bin/libggml-cuda.so
#5  0x00007f01d4e59fe3 in ggml_backend_sched_graph_compute_async () from /home/matteo/programmi/llama.cpp/build/bin/libggml-base.so
#6  0x00007f01d4f7abc9 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#7  0x00007f01d4f7dde6 in llama_context::decode(llama_batch&) () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#8  0x00007f01d4f7f1ab in llama_decode () from /home/matteo/programmi/llama.cpp/build/bin/libllama.so
#9  0x0000563e617ed899 in server_context::update_slots() ()
#10 0x0000563e617b915f in server_queue::start_loop() ()
#11 0x0000563e61786ee3 in main ()

@JohannesGaessler
Collaborator

Potential fix: #13299

@matteoserva
Contributor Author

Commit 93c4e23 fixed the nondeterminism issue. Now I'm getting the same answer on every consecutive run. Nice!
I still see some differences when I change the parameters. I don't know whether that's an issue.

My launch command: llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 --slots -a current --temp 0.6 -m Qwen3-30B-A3B-Q6_K.gguf

Testing with the Python script I posted:

  • normal launch command (10 identical rows)
{'id': 151668, 'token': '</think>', 'logprob': -23.03097915649414}
  • with -ub 1
{'id': 27, 'token': '<', 'logprob': -22.11159896850586}
  • with -fa
{'id': 151668, 'token': '</think>', 'logprob': -23.786317825317383}
  • with -fa -ub 1
{'id': 151668, 'token': '</think>', 'logprob': -22.095962524414062}

@JohannesGaessler
Copy link
Collaborator

Generally speaking, it is expected that results will not be bit-for-bit identical if you vary some of the parameters. If different code is being run, then the floating-point rounding error will be different (there are potentially other differences too). Such small differences can then blow up into comparatively larger differences in the outputs.
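
A tiny, self-contained illustration of the rounding-order point (unrelated to llama.cpp itself; it just shows that floating-point addition is not associative, so different code paths legitimately produce slightly different numbers):

# Grouping the same additions differently changes the low bits of the result.
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6

# Summing the same values in a different order typically differs in the last bits too.
import random
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100000)]
print(sum(xs) - sum(reversed(xs)))   # usually a tiny, nonzero difference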

@matteoserva
Copy link
Contributor Author

Well. The issue is solved, then.
Thanks for the quick fix and update!

@CISC CISC linked a pull request May 4, 2025 that will close this issue
@CISC CISC closed this as completed May 4, 2025