Replies: 5 comments 2 replies
-
Can confirm this is happening on my RTX 2060 laptop as well. @JohannesGaessler please take a look; it's outputting complete nonsense now. My settings: ./llama-server -m "Qwen 3\Qwen3-30B-A3B-UD-Q4_K_XL.gguf" -c 32768 -ngl 99 -fa --host 127.0.0.1 --port 5001 -t 6 -ctk q8_0 -ctv q8_0 -ub 2048 -ot ".ffn_.*_exps.=CPU" --jinja
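In case it helps anyone reproduce: with the server started as above, a single request against its OpenAI-compatible endpoint is enough to see the difference (host, port, and the gibberish-vs-normal behaviour are taken from this thread; the prompt is arbitrary). A rough sketch, not a definitive test:

```sh
# Ask the running server (127.0.0.1:5001, as started above) for a short completion.
# Per the reports in this thread, the reply is gibberish with -fa on an affected
# Turing card, and normal after restarting the server without -fa.
curl -s http://127.0.0.1:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Count from 1 to 5."}],"max_tokens":32}'
```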
-
My settings as well, if it's useful: title llama-server
-
Johannes is a machine; he's already fixed it with this PR: #13415
-
Damn, what a beast. Thank you!
-
@JohannesGaessler The latest build, b5335, is still failing on a 4070 with Qwen3 8B.
Flash attention on:
Flash attention off:
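For reference, the same on/off comparison can be done without the server; a rough sketch with llama-cli (the model path is a placeholder, other settings kept minimal):

```sh
# Same prompt twice, identical settings except -fa (model path is a placeholder).
./llama-cli -m Qwen3-8B-Q4_K_M.gguf -ngl 99 -fa -p "Count from 1 to 5." -n 32
./llama-cli -m Qwen3-8B-Q4_K_M.gguf -ngl 99     -p "Count from 1 to 5." -n 32
```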
-
Hey, this change introduced flash attention for DeepSeek on Ampere (RTX 3000 and above), but it somehow broke flash attention on my RTX 2070 for Qwen3/Gemma GGUFs (I haven't tested other models, but I assume it's across the board).
With flash attention turned on, the models now output gibberish. Everything works fine without it.
Can you guys see what could've caused this break?
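If it helps narrow it down, a plain git bisect between a known-good release tag and the current head should point at the offending commit. A rough sketch only; the tag, build flags, and model path are placeholders for whatever you normally use:

```sh
git bisect start
git bisect bad HEAD          # current build produces gibberish with -fa
git bisect good b5200        # placeholder: any tag known to work on your card
# at each step suggested by bisect, rebuild and retest:
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
./build/bin/llama-cli -m model.gguf -ngl 99 -fa -p "Count from 1 to 5." -n 32
git bisect good              # or "git bisect bad", depending on the output
```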