Eval bug: Heavy nondeterminism in Qwen3 MoE (CUDA) #13280
Comments
You can't use a grammar to force a specific token, only a specific sequence of characters; if the model can complete that sequence with alternate tokens, that's perfectly acceptable to the grammar.
If you are just trying to disable thinking, you yourself have already implemented the means to do that; see #13212 (comment) :)
This is a different problem. The output shouldn't be random when the seed is fixed and the temperature is zero. If I let the model generate its output "naturally", the generated tokens are always the same at every run. If I use the grammar to force the exact same output string, then the output tokens can change randomly, producing extremely low-probability tokens.
Shouldn't, but is, at least with CUDA unfortunately.
Well, grammar will naturally skew the probabilities; I think you're just seeing the results of the numerical instability of CUDA more frequently due to the limited token selection.
I'll reopen the issue in case someone wants to take a closer look, but you might want to change the title to a more accurate description of the issue.
I ran some tests and can provide more context. I used the following command to print the probabilities:
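The exact command isn't preserved in this capture; as a rough sketch of that kind of request, assuming llama-server's `/completion` endpoint with the `n_probs` and `cache_prompt` fields (the response layout varies between server versions), it could look like:

```python
# Hedged sketch, not the original command: query a running llama-server
# instance and print the per-token probabilities it returns.
import json
import urllib.request

payload = {
    # Placeholder prompt; the actual prompt used in the thread is not shown.
    "prompt": "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n",
    "n_predict": 4,
    "temperature": 0.0,
    "seed": 42,
    "n_probs": 10,          # ask the server for top-token probabilities
    "cache_prompt": False,  # avoid reusing cached KV state between runs
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",  # assumed default host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result.get("content"))
# Exact layout of this field differs between llama-server versions.
print(json.dumps(result.get("completion_probabilities"), indent=2))
```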
The 4 output tokens are
The logprob of the 4th token is:
EDIT: never mind. I have to test with "cache_prompt": false
Ok. I am able to narrow down the problem even more with this python script:
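The script itself isn't reproduced here; a minimal sketch of such a determinism check, assuming the same `/completion` endpoint and a placeholder prompt (not the original script), could be:

```python
# Hedged sketch of a determinism check: send the identical zero-temperature
# request several times and report whether the returned probabilities match.
import json
import urllib.request

URL = "http://127.0.0.1:8080/completion"  # assumed server address
payload = {
    "prompt": "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n",  # placeholder
    "n_predict": 4,
    "temperature": 0.0,
    "seed": 42,
    "n_probs": 10,
    "cache_prompt": False,
}

def run_once() -> str:
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Serialize the probability data so runs can be compared verbatim.
    return json.dumps(result.get("completion_probabilities"), sort_keys=True)

runs = [run_once() for _ in range(10)]
print("identical across runs:", all(r == runs[0] for r in runs))
```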
This is the output of Qwen3-MoE for CPU debug:
This is the output of Qwen3-MoE for CUDA Release:
This is the output with Qwen3-0.6 (Dense), CUDA Release:
The problem: the log probabilities in Qwen-MoE (CUDA) vary wildly between -22 and -27, which is not expected. The inputs are always the same, so the probability distribution should always be the same.
Given the server parameters the output should to my knowledge be deterministic, even with CUDA. I have recently made quite a lot of changes to the MoE code. Unfortunately, as was pointed out to me in ggml-org/ggml#1178 (comment), prior to my changes the results were already non-deterministic, so I think a git bisect will be inconclusive. Just to rule out the possibility of a hardware issue: do you get deterministic results after running
It doesn't seem to change anything.
Qwen-moe with default settings:
Qwen-moe with
Please check this fix: #13294. There was a race condition in the new MMQ code that could be causing this.
No change even after 9a0053f. Is there something (compiler flags, macros) that I could try to help narrow down the issue? More tests I made:
Does this patch fix the issue?
I get a crash
Potential fix: #13299
Commit 93c4e23 fixed the nondeterminism issue. Now I'm getting the same answer in each consecutive run. Nice!
My launch command:
Testing with the python script I posted.
Generally speaking, it is expected that results will not be bit-for-bit identical if you vary some of the parameters. If different code is being run, then the floating-point rounding error will be different (there are potentially other differences too). Such small differences can then blow up into comparatively larger differences in the outputs.
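As a toy illustration of that point (my example, not from the thread): even summing the same floating-point values in a different order already changes the result slightly, and such differences can compound through a network:

```python
# Summing the same floats in a different order gives a (usually) slightly
# different result due to rounding.
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(values)
backward = sum(reversed(values))

print(forward, backward, forward - backward)  # tiny but typically nonzero difference
```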
Well. The issue is solved, then.
Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
version: 5269 (1d36b36)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
rtx 5060ti 16GB + rtx4060ti 16GB
Models
Qwen_Qwen3-30B-A3B-Q6_K.gguf by bartowski.
sha256sum: d511d02955714b08ff1b4354d6eae8ea513179a83fa5498466db2731528074dd
Problem description & steps to reproduce
I'm using a grammar to simulate the nothink Qwen prompt format. Sometimes the output is generated correctly; sometimes the model outputs the wrong token while still staying aligned with the grammar.
The command I'm using to test:
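That command isn't reproduced in this capture; as a hedged sketch of the idea, a GBNF grammar forcing the empty-think prefix could be passed in the request's `grammar` field roughly like this (the grammar, prompt, and sampling settings below are my assumptions, not the original test command):

```python
# Hedged sketch, not the original test command: force the "<think>\n\n</think>"
# prefix with a GBNF grammar passed to llama-server's /completion endpoint.
import json
import urllib.request

NOTHINK_GRAMMAR = r'''
root ::= "<think>\n\n</think>" rest
rest ::= [^\x00]*
'''

payload = {
    "prompt": "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n",  # placeholder
    "n_predict": 32,
    "temperature": 0.0,  # greedy sampling for reproducibility (my choice)
    "seed": 42,
    "grammar": NOTHINK_GRAMMAR,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",  # assumed default host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])
```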
Correct output:
[151667 198 198 151668 ...] <think>\n\n</think>...
Wrong output:
[151667 198 198 27 14 ...] <think>\n\n</...
Sometimes the model produces the correct output; sometimes it produces the wrong one, and the subsequent output breaks since the model cannot see the
</think>
token. I'm not restarting llama-server between tests and not changing the seed. I expect the model to always output token 151668.
Command line used to launch llama-server:
/llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 -fa -ctk q8_0 -ctv q8_0 --slots -a current --temp 0.6
First Bad Commit
No response
Relevant log output
`{"index":0,"content":"<think>\n\n</think"...`