Performance of llama.cpp with Vulkan #10879
71 comments · 116 replies
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs CUDA on the same build/setup:
build: 4da69d1 (4351)
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439) NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low; I think they might be debug builds. The cmake step (at least on Linux) might require
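For anyone unsure whether they benchmarked a debug binary: a release-mode Vulkan build can be configured roughly as sketched below. The flag names match recent llama.cpp checkouts (older trees used `LLAMA_VULKAN` instead of `GGML_VULKAN`), so treat them as an assumption and check your own CMakeLists.

```shell
# Configure with the Vulkan backend and an explicit Release build type,
# so llama-bench isn't accidentally run as a debug binary.
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```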
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: An AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5. build: 914a82d (4452)
-
Latest Arch. For the sake of consistency I run everything in a script and also build every target from scratch. Each benchmark is wrapped like this (pausing every other process for the duration of the run):
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This bit seems to underutilise both GPU and CPU in real conditions based on
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon PRO VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it would be interesting to add the flash attention results to the scoreboard (even if support for it still isn't as mature as CUDA's).
-
I tried, but there was nothing after an hour or so (ok, maybe 40 minutes). Anyway, I ran llama-cli for a sample eval.
Meanwhile, OpenBLAS:
-
ggml_vulkan: Found 2 Vulkan devices:
build: 7538246 (5083) ggml_vulkan: Found 1 Vulkan devices:
build: 7538246 (5083)
-
Here are some results with the Vulkan backend running on the Steam Deck:
ggml_vulkan: Found 1 Vulkan devices:
build: 5368ddd (5164)
-
RTX 5060Ti 16GB Driver Version: 575.51.02 CUDA Version: 12.9 ggml_vulkan: Found 1 Vulkan devices:
build: 658987c (5170) w/Flash Attention
build: 658987c (5170)
-
M3 Ultra (Mac Studio 2025), 24P+8E CPU cores, 80 GPU cores, with Vulkan.
ggml_vulkan: Found 1 Vulkan devices:
build: 2d451c8 (5195) Non-BLAS
build: 2d451c8 (5195)
For comparison, Metal on the same machine:
build: 2d451c8 (5195)
It is interesting that TG with Vulkan is faster than with Metal; faster PP with Metal is as expected.
-
AMD Instinct MI50
build: d5fe4e8 (5192) With ROCm for comparison:
build: d5fe4e8 (5192)
-
AMD Radeon PRO W6800X Duo ggml_vulkan: 0 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
-
AMD Ryzen 5 5600G (Debian 12, Vulkan version 1.3.211, LLVM 15.0.6, DRM 3.61, Linux 6.12.26, Mesa 22.3.6):
[edit]
CPU-only:
GPU clocked at 2.2GHz, RAM at 2667MHz (XMP)
[/edit]
-
AMD Radeon RX 6500 XT ggml_vulkan: Found 1 Vulkan devices:
build: g9fdfcdae
-
Upgrade time! ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1102) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 8ae5ebc (5273)
-
AMD 5600U (Zen 3 APU, 6c/12t, Radeon Vega 7, 2ch 64GB DDR4-3200), Fedora 41, Linux 6.12.13, Vulkan version: 1.4.309, llama.cpp build: 141a908 (5298)
build info:
-- Adding CPU backend variant ggml-cpu: -march=native
-- GL_KHR_cooperative_matrix supported by glslc
-- GL_NV_cooperative_matrix2 supported by glslc
-- GL_EXT_integer_dot_product supported by glslc
-- GL_EXT_bfloat16 not supported by glslc
On average, with various dense models, I have noticed a 1.8-2.5x increase in PP performance with the Vulkan backend compared to this CPU.
Vulkan
Vulkan (Only PP - ngl 0)
CPU only
Just as feedback for the Vulkan maintainers: I noticed the opposite with a MoE model (Qwen3 MoE 30B-A3B): Vulkan is 3-4x slower than the CPU in PP, and about 10% slower in token generation.
Vulkan (Qwen3-30B-A3B-GGUF:Q4_K_XL)
CPU only
-
Tested my new ASUS NUC Pro 14 AI Mini PC with Core Ultra 7 258V and 32GB LPDDR5x-8533 2ch memory. Used the pre-built Windows binary; the WSL Ubuntu version isn't working yet. Vulkan PP still lags behind llama.cpp with the IPEX-LLM backend.
PS C:\Users\julia\Downloads\llama-b5328-bin-win-vulkan-x64> .\llama-bench.exe -m ..\llama-2-7b.Q4_0.gguf -ngl 99
build: 0cf6725 (5328)
IPEX-LLM, same model:
build: 6ecf5e8 (1)
Unfortunately, the memory bandwidth of 136,528 MB/s is nowhere near fully utilized in either case. This is what I can get (WSL):
-
OS: Arch Linux latest
For comparison, with CUDA build:
-
build: de4c07f (5359)
-
It would be cool to have a graph showing CUDA vs Vulkan performance over time/versions.
-
I've noticed that on my RX 7800 XT, the performance of the RADV driver is significantly worse than AMDVLK when using coopmat. In fact, the integer dot implementation ends up being much faster. Has anyone else run into this? It seems like it could be a driver implementation issue, but I'd like to gather some feedback before diving deeper.
COOPMAT RADV
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
COOPMAT AMDVLK
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
INT DOT RADV
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
INT DOT AMDVLK
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 360a9c98 (5379)
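One way to reproduce a RADV-vs-AMDVLK comparison on the same machine is to point the Vulkan loader at a specific ICD manifest. The manifest paths below are typical Linux locations but an assumption on my part (they vary by distro), and `GGML_VK_DISABLE_COOPMAT` is, to my knowledge, the environment variable the Vulkan backend checks to force the non-coopmat path; verify both against your setup before relying on the numbers.

```shell
# Select the RADV (Mesa) driver explicitly; the manifest path varies by distro.
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json \
  ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99

# The same run against AMDVLK.
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json \
  ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99

# Force the int dot path on a coopmat-capable driver.
GGML_VK_DISABLE_COOPMAT=1 ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```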
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs, please run the test on a single GPU using -sm none unless the model is too big to fit in VRAM. Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
If multiple entries are posted for the same device, newer commits with substantial Vulkan updates are prioritized; otherwise the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, memory speed and the number of channels will greatly affect inference speed.
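For a concrete starting point, a scoreboard run on a single GPU looks something like this (the model filename is assumed to match the Q4_0 download above; `-ngl 99` offloads all layers):

```shell
# Bench Llama 2 7B Q4_0 fully offloaded to one GPU; -sm none disables
# splitting across multiple devices so only GPU 0 is measured.
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm none
```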
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)