Feature Request: Allow the user to disable offload_op for backends
#13241
Labels
enhancement
New feature or request
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32 for offload_op, and there is no way to disable offload_op unless the user manually specifies -ub 16 or less. It would be great if the user could disable offload_op directly, without having to reduce -ub.
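For reference, the threshold mentioned above lives in the CUDA backend's offload_op hook. A simplified sketch of what it looks like (function and helper names are taken from ggml-cuda.cu on current master and may differ between versions):

```cpp
// Simplified sketch of the CUDA backend's offload_op hook (ggml-cuda.cu).
// Ops whose batch dimension is below a hardcoded threshold are not offloaded,
// and nothing exposes this threshold to the user.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32; // hardcoded minimum batch size

    return get_op_batch_size(op) >= min_batch_size;

    GGML_UNUSED(dev);
}
```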
Motivation
With the introduction of --override-tensor, it has become practical to offload experts to host DRAM in large MoEs while keeping the dense tensors on a GPU with relatively small VRAM. However, in the current implementation, prompt processing performance is not ideal in some configurations due to offload_op being used.
For example, when running llama4 400B with -ot exps=CPU using code from the master branch, prompt processing performance is extremely poor when -ub is set to 512 (the default). Setting -ub to 16 bypasses offload_op in the CUDA backend, but performance is still not fully on par with -ub 512 with offload_op disabled in the source code.

```
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
```

With offload_op changed to always return false in the CUDA backend, there is a 10x performance boost.

```
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
```
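The "disabled from source code" variant above refers to patching that CUDA hook; a minimal illustration of the change (against the sketch shown earlier) could be:

```cpp
// Illustrative patch for the comparison above: make the CUDA backend never
// request offloading, so ops stay on the backend that holds their weights.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    GGML_UNUSED(dev);
    GGML_UNUSED(op);
    return false; // previously: get_op_batch_size(op) >= min_batch_size
}
```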
Possible Implementation
In ggml-backend.cpp, add an additional option and check around the following offload_op call.
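A minimal sketch of one way to do this, assuming an environment variable is used as the switch (GGML_SCHED_DISABLE_OFFLOAD_OP and the helper below are placeholder names, not existing API; a llama.cpp CLI flag or an explicit scheduler option could serve the same purpose):

```cpp
// Hypothetical sketch (ggml-backend.cpp): a process-wide switch to disable offload_op.
// GGML_SCHED_DISABLE_OFFLOAD_OP and this helper are assumed names, not existing API.
#include <stdlib.h>

static bool ggml_backend_sched_offload_op_enabled(void) {
    static int enabled = -1; // -1 = not yet determined
    if (enabled < 0) {
        const char * env = getenv("GGML_SCHED_DISABLE_OFFLOAD_OP");
        enabled = (env == NULL || env[0] == '\0' || env[0] == '0') ? 1 : 0;
    }
    return enabled == 1;
}

// Then, where the scheduler currently asks backends whether they want to offload
// an op, add the new check in front of the existing ones, e.g.:
//
//     if (ggml_backend_sched_offload_op_enabled() &&
//         ggml_backend_supports_op(sched->backends[b], tensor) &&
//         ggml_backend_offload_op(sched->backends[b], tensor)) {
//         // offload to the higher-priority backend as before
//     }
```

An environment variable keeps the change contained to ggml, but a command-line option in llama.cpp (or a per-scheduler flag) would be more discoverable; either would satisfy the feature request.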