This is because Flash Attention on Vulkan is only supported on Nvidia GPUs with coopmat2 on an up-to-date driver. For any other device it will fall back to CPU, which is slow.
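If you want to check what your driver actually exposes, vulkaninfo from the Vulkan SDK lists device extensions. Assuming the SDK is installed, something like this on Windows shows whether coopmat2 (the VK_NV_cooperative_matrix2 extension) is available:

vulkaninfo | findstr /i cooperative_matrix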
Thank you. I hadn't noticed this before. I just tested with other models (e.g., gemma-2-27b-it-Q4_K_M) and confirmed that enabling -fa actually slows down performance when using Vulkan.
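For anyone who wants to reproduce the comparison, llama-bench accepts comma-separated values for most parameters, so both settings can be measured in a single run (model path is whatever you have locally):

llama-bench -m gemma-2-27b-it-Q4_K_M.gguf -ngl 999 -fa 0,1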
Name and Version
version: b5233 (ceda28e) with MSVC
Operating systems
Windows
GGML backends
Vulkan
Hardware
AMD 780m (8840u)
Models
Qwen3-30B-A3B-Q4_K_M.gguf
https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-GGUF/blob/main/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
Problem description & steps to reproduce
The AVX2 (CPU) build of llama.cpp gives ~15 t/s, but the Vulkan build drops to ~9 t/s, which is an unexpected slowdown.
Here are the command lines I used:
llama-server.exe -m "bartowski_Qwen3-30B-A3B-Q4_K_M.gguf" --host 0.0.0.0 --port 8090 --slots --props --metrics -np 1 -c 20480 -ngl 999 -ctk f16 -ctv f16 -fa --no-mmap --keep 0
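Given the explanation above, a likely workaround is simply dropping -fa so the attention op stays on the Vulkan backend instead of falling back to CPU. The same command without it would be:

llama-server.exe -m "bartowski_Qwen3-30B-A3B-Q4_K_M.gguf" --host 0.0.0.0 --port 8090 --slots --props --metrics -np 1 -c 20480 -ngl 999 -ctk f16 -ctv f16 --no-mmap --keep 0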
First Bad Commit
No response
Relevant log output