Replies: 5 comments 2 replies
-
Can confirm this is happening on my RTX 2060 laptop as well. @JohannesGaessler please take a look; it's outputting complete nonsense now. My settings: ./llama-server -m "Qwen 3\Qwen3-30B-A3B-UD-Q4_K_XL.gguf" -c 32768 -ngl 99 -fa --host 127.0.0.1 --port 5001 -t 6 -ctk q8_0 -ctv q8_0 -ub 2048 -ot ".ffn_.*_exps.=CPU" --jinja
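In case it helps anyone reproduce: with the server started as above, a single request against its OpenAI-compatible endpoint is enough to see the difference (host, port, and the gibberish-vs-normal behaviour are taken from this thread; the prompt is arbitrary). A rough sketch, not a definitive test:

```sh
# Ask the running server (127.0.0.1:5001, as started above) for a short completion.
# Per the reports in this thread, the reply is gibberish with -fa on an affected
# Turing card, and normal after restarting the server without -fa.
curl -s http://127.0.0.1:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Count from 1 to 5."}],"max_tokens":32}'
```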
-
My settings as well, if it's useful: title llama-server
-
Johannes is a machine; he's already fixed it with this PR: #13415
-
Damn, what a beast. Thank you!
-
@JohannesGaessler The latest build, b5335, is still failing on a 4070 with Qwen3 8B.
Flash attention on:
Flash attention off:
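For reference, the same on/off comparison can be done without the server; a rough sketch with llama-cli (the model path is a placeholder, other settings kept minimal):

```sh
# Same prompt twice, identical settings except -fa (model path is a placeholder).
./llama-cli -m Qwen3-8B-Q4_K_M.gguf -ngl 99 -fa -p "Count from 1 to 5." -n 32
./llama-cli -m Qwen3-8B-Q4_K_M.gguf -ngl 99     -p "Count from 1 to 5." -n 32
```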
-
Hey, this change introduced flash attention for DeepSeek on Ampere (RTX 3000 and above), but it somehow broke flash attention on my RTX 2070 for Qwen3/Gemma GGUFs (I haven't tested other models, but I assume it's across the board).
With flash attention turned on, the models now output gibberish. Everything works fine without it.
Can you guys see what could've caused this break?
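If it helps narrow it down, a plain git bisect between a known-good release tag and the current head should point at the offending commit. A rough sketch only; the tag, build flags, and model path are placeholders for whatever you normally use:

```sh
git bisect start
git bisect bad HEAD          # current build produces gibberish with -fa
git bisect good b5200        # placeholder: any tag known to work on your card
# at each step suggested by bisect, rebuild and retest:
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
./build/bin/llama-cli -m model.gguf -ngl 99 -fa -p "Count from 1 to 5." -n 32
git bisect good              # or "git bisect bad", depending on the output
```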