Eval bug: Qwen3 30B A3B is slow with CUDA #13211

Open
Nepherpitou opened this issue Apr 30, 2025 · 2 comments

@Nepherpitou

Name and Version

.\llamacpp\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5228 (44cd8d91)
built with MSVC 19.29.30159.0 for

Operating systems

Windows

GGML backends

CUDA

Hardware

  • CPU: Ryzen 7900X
  • CUDA0 RTX 4090 @ x16 - primary GPU for video output
  • CUDA1 RTX 3090 @ x4
  • CUDA2 RTX 3090 @ x1
  • RAM: 64 GB DDR5 @ 6000 MT/s

Models

Qwen3-30B-A3B-Q6_K from https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF

Problem description & steps to reproduce

Using the CUDA backend I get only 40-50 t/s generation speed.
Here are the parameters:

      ./llamacpp/llama-server.exe
      --jinja
      --flash-attn
      --no-mmap
      --no-warmup
      --host 0.0.0.0
      --port 5107
      --metrics
      --slots
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      --ctx-size 65536
      -ctk q8_0
      -ctv q8_0
      -dev 'CUDA1,CUDA2'
      -ts 100,100

With the Vulkan backend I'm getting 80-90 t/s generation speed with:

      ./llamacpp/vulkan/llama-server.exe
      --jinja
      --flash-attn
      --no-mmap
      --no-warmup
      --host 0.0.0.0
      --port 5107
      --metrics
      --slots
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      --ctx-size 65536
      -ctk q8_0
      -ctv q8_0
      -dev 'VULKAN1,VULKAN2'
      -ts 100,100
      -b 384
      -ub 512

But! With a batch size larger than 384 I get an error about an incorrect size, and a BSOD pointing to video memory issues, which never happens with CUDA. I've tested the VRAM with memtest_vulkan-v0.5.0 and everything was fine.
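
To separate raw generation speed from server-side overhead, a minimal llama-bench sketch could be used (assuming llama-bench.exe sits next to llama-server.exe in each build; -ts 0,100,100 skips the 4090 the same way -dev 'CUDA1,CUDA2' does in the server run):

      ./llamacpp/llama-bench.exe
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      -fa 1
      -ts 0,100,100
      -p 512
      -n 128

Running the same command from the Vulkan build directory would give a direct CUDA-vs-Vulkan comparison of prompt processing (pp512) and generation (tg128) without the HTTP layer in between.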

First Bad Commit

No response

Relevant log output

CUDA

main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 1219
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1219, n_tokens = 1219, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1219, n_tokens = 1219
slot      release: id  0 | task 0 | stop processing: n_past = 1738, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    2903.47 ms /  1219 tokens (    2.38 ms per token,   419.84 tokens per second)
       eval time =   11284.06 ms /   520 tokens (   21.70 ms per token,    46.08 tokens per second)
      total time =   14187.52 ms /  1739 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200


VULKAN

[llama-swap] 192.168.1.5 [2025-04-30 16:00:35] "POST /v1/chat/completions HTTP/1.1" 200 117331 "Python/3.11 aiohttp/3.11.11" 27.9262686s
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 492 | processing task
slot update_slots: id  0 | task 492 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 349
slot update_slots: id  0 | task 492 | kv cache rm [3, end)
slot update_slots: id  0 | task 492 | prompt processing progress, n_past = 349, n_tokens = 346, progress = 0.991404
slot update_slots: id  0 | task 492 | prompt done, n_past = 349, n_tokens = 346
slot      release: id  0 | task 492 | stop processing: n_past = 9757, truncated = 0
slot print_timing: id  0 | task 492 |
prompt eval time =     358.06 ms /   346 tokens (    1.03 ms per token,   966.33 tokens per second)
       eval time =  135226.91 ms /  9409 tokens (   14.37 ms per token,    69.58 tokens per second)
      total time =  135584.97 ms /  9755 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
@Nepherpitou
Author

main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 2267
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.903397
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2267, n_tokens = 219, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 2267, n_tokens = 219
slot      release: id  0 | task 0 | stop processing: n_past = 3528, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    1063.41 ms /  2267 tokens (    0.47 ms per token,  2131.82 tokens per second)
       eval time =   13540.57 ms /  1262 tokens (   10.73 ms per token,    93.20 tokens per second)
      total time =   14603.98 ms /  3529 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Checked performance with a single RTX 3090 and got 90 t/s with a Q4 quant. I suspect it's an issue with MoE and the multi-GPU setup.
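
For reference, a single-device run along those lines looks roughly like this (a sketch only; the Q4_K_M filename is an assumption, and -sm none keeps the whole model on one GPU):

      ./llamacpp/llama-server.exe
      --flash-attn
      -m ./models/Qwen3-30B-A3B-Q4_K_M.gguf
      -ngl 99
      --ctx-size 32768
      -dev CUDA1
      -sm none

If the single-GPU run reaches ~90 t/s while the two-GPU layer split stays at 40-50 t/s, the slowdown most likely comes from cross-device traffic on the MoE layers rather than from the kernels themselves.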

@Nepherpitou
Author

Nepherpitou commented May 5, 2025

I'm trying to use -ot "^(?!blk\.[0-9]*\..*(exps)).*$=CUDA1" to move all shared tensors to one GPU and split only the experts across GPUs, but got this error:

main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 2267
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.903397
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\getrows.cu:195: ggml_cuda_get_rows_switch_src0_type: unsupported src0 type: q6_K
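
An alternative worth sketching (it assumes this GGUF uses the usual blk.N.ffn_*_exps names for the expert tensors and has 48 layers, with the rest of the flags kept as in the CUDA command above) is to match the expert tensors positively and split them by layer range, instead of routing everything else through a negative lookahead:

      -ot "blk\.([0-9]|1[0-9]|2[0-3])\.ffn_.*_exps\.=CUDA1"
      -ot "blk\.(2[4-9]|3[0-9]|4[0-7])\.ffn_.*_exps\.=CUDA2"

This leaves the shared tensors under the default split instead of forcing them all onto CUDA1; whether it sidesteps the ggml_cuda_get_rows error is untested.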
