
Feature Request: Allow disabling offload_op for backends by user #13241

Open · 4 tasks done
hjc4869 opened this issue May 1, 2025 · 2 comments
Labels: enhancement (New feature or request)

Comments

hjc4869 (Contributor) commented May 1, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

llama.cpp currently uses a hardcoded minimum batch size of 32 for offload_op, and there is no option to disable it unless the user manually specifies -ub 16 or less. It would be great if the user could disable offload_op without reducing -ub.
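For reference, the backend-side gate looks roughly like the sketch below. This is a simplified illustration of the hardcoded threshold, not the exact code in ggml-cuda; the helper name and the way the batch size is derived are simplifications.

```cpp
#include "ggml.h"

// Simplified sketch of how a backend's offload_op hook gates offloading today.
// The hardcoded 32 mirrors the threshold described above; deriving the batch
// size from ne[1] is a simplification (MoE ops are handled differently).
static bool example_offload_op(const struct ggml_tensor * op) {
    const int min_batch_size = 32;              // hardcoded, not user-configurable
    const int batch_size     = (int) op->ne[1]; // rows in the op's result

    // only worth copying weights to the device when the batch amortizes the transfer
    return batch_size >= min_batch_size;
}
```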

Motivation

With the introduction of --override-tensor, it has become practical to offload experts to host DRAM in large MoE models while keeping the dense tensors on a GPU with relatively little VRAM. However, in the current implementation, prompt processing performance is poor in some configurations because offload_op is used.

For example, when running Llama 4 400B with -ot exps=CPU using code from the master branch, prompt processing performance is extremely poor when -ub is set to 512 (the default). Setting -ub to 16 bypasses offload_op in the CUDA backend, but performance is still not on par with -ub 512 with offload_op disabled in the source code.

llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512

| model | size | params | backend | ngl | n_ubatch | type_k | type_v | fa | ot | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 93.68 ± 0.70 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.68 ± 0.07 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 23.14 ± 0.01 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.61 ± 0.16 |

With offload_op changed to always return false in the CUDA backend, prompt processing is about 10x faster.
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0

| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 233.66 ± 1.31 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.91 ± 0.10 |
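For reference, the "always return false" experiment amounts to a change along these lines. This is an illustrative sketch, not an exact diff of ggml-cuda.cu; the function name is made up for illustration.

```cpp
#include "ggml.h"

// Illustrative sketch of the experiment above: make the CUDA backend's
// offload_op hook always decline, so ops whose weights live in host memory
// are never pulled onto the GPU for large batches.
static bool example_cuda_offload_op_disabled(const struct ggml_tensor * op) {
    (void) op;
    return false; // never offload; tensors stay where --override-tensor placed them
}
```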

Possible Implementation

In ggml-backend.cpp, add some additional options and checks to the following offload_op call (one possible shape of such a check is sketched after the snippet):

// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
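One possible shape of that check, assuming a new user-settable field on the scheduler. Both `sched->min_offload_batch_size` and `tensor_batch_size()` are hypothetical names used for illustration, not existing ggml API.

```cpp
// Hypothetical sketch: gate the offload_op query on a user-configurable
// threshold stored on the scheduler. min_offload_batch_size and
// tensor_batch_size() are made-up names for illustration.
if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    const bool allow_offload =
        sched->min_offload_batch_size >= 0 &&                        // -1 disables offloading entirely
        tensor_batch_size(tensor) >= sched->min_offload_batch_size;  // e.g. ne[1], or ne[2] for MUL_MAT_ID
    if (allow_offload) {
        for (int b = 0; b < src_backend_id; b++) {
            if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
                SET_CAUSE(tensor, "1.off");
                return b;
            }
        }
    }
}
```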
hjc4869 added the enhancement (New feature or request) label on May 1, 2025
slaren (Member) commented May 1, 2025

I am not against adding an option to disable this, that would be good, but I wonder if the issue here is that these operations, on models with a high number of experts, simply should not be offloaded unless the batch size is much larger. If that's the case, this could be addressed instead by adding a heuristic to the offload_op function that increases the minimum batch size depending on the number of experts.
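For illustration, such a heuristic could look roughly like the sketch below inside a backend's offload_op hook. The base threshold, the linear scaling, and reading the expert count from src[0]->ne[2] are all assumptions, not existing ggml code.

```cpp
#include "ggml.h"

// Sketch of the suggested heuristic: raise the offload threshold for MoE ops
// in proportion to the number of experts, so expert matmuls are only offloaded
// once the batch is large enough to amortize copying the expert weights.
static bool example_offload_op_with_expert_heuristic(const struct ggml_tensor * op) {
    const int base_min_batch = 32;
    int min_batch = base_min_batch;

    if (op->op == GGML_OP_MUL_MAT_ID) {
        const int n_expert = (int) op->src[0]->ne[2]; // expert weights stacked along dim 2 (assumption)
        min_batch = base_min_batch * n_expert;        // one possible scaling; would need tuning
    }

    return (int) op->ne[1] >= min_batch;
}
```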

hjc4869 (Contributor, Author) commented May 5, 2025

Added a common option (-mobs, --min-offload-batch-size) that allows the user to specify a minimum batch size, as well as to disable offloading completely with -mobs -1: hjc4869@5bc63bf

./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -mobs -1,0 -n 0

| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | mobs | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | -1 | pp512 | 233.62 ± 1.34 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | 0 | pp512 | 24.88 ± 0.02 |

If this option is considered appropriate, I'll send a PR to upstream the code.
