Performance of llama.cpp with Vulkan #10879
71 comments · 116 replies
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs CUDA on the same build/setup:
build: 4da69d1 (4351)
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439) NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low; I think they might be debug builds. The cmake step (at least on Linux) might require
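For anyone unsure whether they benchmarked a debug binary: a release-mode Vulkan build can be configured roughly as sketched below. The flag names match recent llama.cpp checkouts (older trees used `LLAMA_VULKAN` instead of `GGML_VULKAN`), so treat them as an assumption and check your own CMakeLists.

```shell
# Configure with the Vulkan backend and an explicit Release build type,
# so llama-bench isn't accidentally run as a debug binary.
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```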
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: An AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5. build: 914a82d (4452)
-
Latest Arch. For the sake of consistency I run everything in a script and also build every target from scratch. Each benchmark is wrapped like this (pausing every other process for the duration of the run):
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This bit seems to underutilise both GPU and CPU in real conditions based on
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon PRO VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it would be interesting to add the flash attention results to the scoreboard (even if support for it still isn't as mature as CUDA's).
-
I tried, but there was nothing after an hour or so (ok, maybe 40 minutes). Anyway, I ran llama-cli for a sample eval.
Meanwhile, OpenBLAS:
-
ggml_vulkan: Found 2 Vulkan devices:
build: 7538246 (5083) ggml_vulkan: Found 1 Vulkan devices:
build: 7538246 (5083)
-
Here are some results with the Vulkan backend running on the Steam Deck:
ggml_vulkan: Found 1 Vulkan devices:
build: 5368ddd (5164)
-
RTX 5060Ti 16GB Driver Version: 575.51.02 CUDA Version: 12.9 ggml_vulkan: Found 1 Vulkan devices:
build: 658987c (5170) w/Flash Attention
build: 658987c (5170)
-
M3 Ultra (Mac Studio 2025), 24P+8E CPU cores, 80 GPU cores, with Vulkan.
ggml_vulkan: Found 1 Vulkan devices:
build: 2d451c8 (5195) Non-BLAS
build: 2d451c8 (5195)
For comparison, Metal on the same machine:
build: 2d451c8 (5195)
It is interesting that TG with Vulkan is faster than with Metal; faster PP with Metal is as expected.
-
AMD Instinct MI50
build: d5fe4e8 (5192) With ROCm for comparison:
build: d5fe4e8 (5192)
-
AMD Radeon PRO W6800X Duo ggml_vulkan: 0 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
-
AMD Ryzen 5 5600G (Debian 12, Vulkan version 1.3.211, LLVM 15.0.6, DRM 3.61, Linux 6.12.26, Mesa 22.3.6):
[edit]
CPU-only:
GPU clocked at 2.2GHz, RAM at 2667MHz (XMP)
[/edit]
-
AMD Radeon RX 6500 XT ggml_vulkan: Found 1 Vulkan devices:
build: g9fdfcdae
-
Upgrade time! ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1102) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 8ae5ebc (5273)
-
AMD 5600U (Zen 3 APU, 6c/12t, Radeon Vega 7, 2ch 64GB DDR4-3200), Fedora 41, Linux 6.12.13, Vulkan version: 1.4.309, llama.cpp build: 141a908 (5298)
build info:
-- Adding CPU backend variant ggml-cpu: -march=native
-- GL_KHR_cooperative_matrix supported by glslc
-- GL_NV_cooperative_matrix2 supported by glslc
-- GL_EXT_integer_dot_product supported by glslc
-- GL_EXT_bfloat16 not supported by glslc
On average, with various dense models, I have noticed a 1.8-2.5x increase in PP performance with the Vulkan backend compared to this CPU.
Vulkan
Vulkan (Only PP - ngl 0)
CPU only
Just as feedback for the Vulkan maintainers: I noticed the opposite with a MoE model (Qwen3 MoE 30B-A3B): Vulkan is 3-4x slower than the CPU in PP, and about 10% slower in token generation.
Vulkan (Qwen3-30B-A3B-GGUF:Q4_K_XL)
CPU only
-
Tested my new ASUS NUC Pro 14 AI Mini PC with Core Ultra 7 258V and 32GB LPDDR5x-8533 2ch memory. Used the pre-built Windows binary; the WSL Ubuntu version isn't working yet. Vulkan PP still lags behind llama.cpp with the IPEX-LLM backend.
PS C:\Users\julia\Downloads\llama-b5328-bin-win-vulkan-x64> .\llama-bench.exe -m ..\llama-2-7b.Q4_0.gguf -ngl 99
build: 0cf6725 (5328)
IPEX-LLM, same model:
build: 6ecf5e8 (1)
Unfortunately, the memory bandwidth of 136,528 MB/s is nowhere near fully utilized in either case. This is what I can get (WSL):
-
OS: Arch Linux latest
For comparison, with CUDA build:
-
build: de4c07f (5359)
-
It would be cool to have a graph showing CUDA vs Vulkan performance over time/versions.
-
I've noticed that on my RX 7800 XT, the performance of the RADV driver is significantly worse than AMDVLK when using coopmat. In fact, the integer dot implementation ends up being much faster. Has anyone else run into this? It seems like it could be a driver implementation issue, but I'd like to gather some feedback before diving deeper.
COOPMAT RADV
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
COOPMAT AMDVLK
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
INT DOT RADV
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
INT DOT AMDVLK
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 360a9c98 (5379)
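One way to reproduce a RADV-vs-AMDVLK comparison on the same machine is to point the Vulkan loader at a specific ICD manifest. The manifest paths below are typical Linux locations but an assumption on my part (they vary by distro), and `GGML_VK_DISABLE_COOPMAT` is, to my knowledge, the environment variable the Vulkan backend checks to force the non-coopmat path; verify both against your setup before relying on the numbers.

```shell
# Select the RADV (Mesa) driver explicitly; the manifest path varies by distro.
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json \
  ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99

# The same run against AMDVLK.
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json \
  ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99

# Force the int dot path on a coopmat-capable driver.
GGML_VK_DISABLE_COOPMAT=1 ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```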
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs, please run the test on a single GPU using -sm none unless the model is too big to fit in VRAM. Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
If multiple entries are posted for the same device, newer commits with substantial Vulkan updates are prioritized; otherwise the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, memory speed and the number of channels will greatly affect inference speed.
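For a concrete starting point, a scoreboard run on a single GPU looks something like this (the model filename is assumed to match the Q4_0 download above; `-ngl 99` offloads all layers):

```shell
# Bench Llama 2 7B Q4_0 fully offloaded to one GPU; -sm none disables
# splitting across multiple devices so only GPU 0 is measured.
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm none
```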
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)