Eval bug: Qwen3 30B A3B is slow with CUDA #13211

Open
Nepherpitou opened this issue Apr 30, 2025 · 2 comments

@Nepherpitou

Name and Version

.\llamacpp\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5228 (44cd8d91)
built with MSVC 19.29.30159.0 for

Operating systems

Windows

GGML backends

CUDA

Hardware

  • CPU: Ryzen 7900X
  • CUDA0 RTX 4090 @ x16 - primary GPU for video output
  • CUDA1 RTX 3090 @ x4
  • CUDA2 RTX 3090 @ x1
  • RAM: 64 GB DDR5 @ 6000 MT/s

Models

Qwen3-30B-A3B-Q6_K from https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF

Problem description & steps to reproduce

Using the CUDA backend I get only 40-50 t/s generation speed.
Here are the parameters:

      ./llamacpp/llama-server.exe
      --jinja
      --flash-attn
      --no-mmap
      --no-warmup
      --host 0.0.0.0
      --port 5107
      --metrics
      --slots
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      --ctx-size 65536
      -ctk q8_0
      -ctv q8_0
      -dev 'CUDA1,CUDA2'
      -ts 100,100

With the Vulkan backend I'm getting 80-90 t/s generation speed with:

      ./llamacpp/vulkan/llama-server.exe
      --jinja
      --flash-attn
      --no-mmap
      --no-warmup
      --host 0.0.0.0
      --port 5107
      --metrics
      --slots
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      --ctx-size 65536
      -ctk q8_0
      -ctv q8_0
      -dev 'VULKAN1,VULKAN2'
      -ts 100,100
      -b 384
      -ub 512

But! With a batch size larger than 384 I get an error about an incorrect size, and a BSOD pointing to video memory issues, which never happens with CUDA. I've tested the VRAM with memtest_vulkan-v0.5.0 and everything was fine.
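
To separate raw generation speed from server-side overhead, a minimal llama-bench sketch could be used (assuming llama-bench.exe sits next to llama-server.exe in each build; -ts 0,100,100 skips the 4090 the same way -dev 'CUDA1,CUDA2' does in the server run):

      ./llamacpp/llama-bench.exe
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      -fa 1
      -ts 0,100,100
      -p 512
      -n 128

Running the same command from the Vulkan build directory would give a direct CUDA-vs-Vulkan comparison of prompt processing (pp512) and generation (tg128) without the HTTP layer in between.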

First Bad Commit

No response

Relevant log output

CUDA

main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 1219
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1219, n_tokens = 1219, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1219, n_tokens = 1219
slot      release: id  0 | task 0 | stop processing: n_past = 1738, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    2903.47 ms /  1219 tokens (    2.38 ms per token,   419.84 tokens per second)
       eval time =   11284.06 ms /   520 tokens (   21.70 ms per token,    46.08 tokens per second)
      total time =   14187.52 ms /  1739 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200


VULKAN

[llama-swap] 192.168.1.5 [2025-04-30 16:00:35] "POST /v1/chat/completions HTTP/1.1" 200 117331 "Python/3.11 aiohttp/3.11.11" 27.9262686s
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 492 | processing task
slot update_slots: id  0 | task 492 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 349
slot update_slots: id  0 | task 492 | kv cache rm [3, end)
slot update_slots: id  0 | task 492 | prompt processing progress, n_past = 349, n_tokens = 346, progress = 0.991404
slot update_slots: id  0 | task 492 | prompt done, n_past = 349, n_tokens = 346
slot      release: id  0 | task 492 | stop processing: n_past = 9757, truncated = 0
slot print_timing: id  0 | task 492 |
prompt eval time =     358.06 ms /   346 tokens (    1.03 ms per token,   966.33 tokens per second)
       eval time =  135226.91 ms /  9409 tokens (   14.37 ms per token,    69.58 tokens per second)
      total time =  135584.97 ms /  9755 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
@Nepherpitou
Author

main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 2267
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.903397
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2267, n_tokens = 219, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 2267, n_tokens = 219
slot      release: id  0 | task 0 | stop processing: n_past = 3528, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    1063.41 ms /  2267 tokens (    0.47 ms per token,  2131.82 tokens per second)
       eval time =   13540.57 ms /  1262 tokens (   10.73 ms per token,    93.20 tokens per second)
      total time =   14603.98 ms /  3529 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Checked performance with a single RTX 3090 and got 90 t/s with a Q4 quant. I suspect it's an issue with MoE and the multi-GPU setup.
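
For reference, a single-device run along those lines looks roughly like this (a sketch only; the Q4_K_M filename is an assumption, and -sm none keeps the whole model on one GPU):

      ./llamacpp/llama-server.exe
      --flash-attn
      -m ./models/Qwen3-30B-A3B-Q4_K_M.gguf
      -ngl 99
      --ctx-size 32768
      -dev CUDA1
      -sm none

If the single-GPU run reaches ~90 t/s while the two-GPU layer split stays at 40-50 t/s, the slowdown most likely comes from cross-device traffic on the MoE layers rather than from the kernels themselves.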

@Nepherpitou
Author

Nepherpitou commented May 5, 2025

I'm trying to use -ot "^(?!blk\.[0-9]*\..*(exps)).*$=CUDA1" to move all shared tensors to one GPU and split only the experts across GPUs, but got this error:

main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 2267
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.903397
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\getrows.cu:195: ggml_cuda_get_rows_switch_src0_type: unsupported src0 type: q6_K
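
An alternative worth sketching (it assumes this GGUF uses the usual blk.N.ffn_*_exps names for the expert tensors and has 48 layers, with the rest of the flags kept as in the CUDA command above) is to match the expert tensors positively and split them by layer range, instead of routing everything else through a negative lookahead:

      -ot "blk\.([0-9]|1[0-9]|2[0-3])\.ffn_.*_exps\.=CUDA1"
      -ot "blk\.(2[4-9]|3[0-9]|4[0-7])\.ffn_.*_exps\.=CUDA2"

This leaves the shared tensors under the default split instead of forcing them all onto CUDA1; whether it sidesteps the ggml_cuda_get_rows error is untested.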
