Feature Request: Allow the user to disable offload_op for backends
#13241
Labels
enhancement
New feature or request
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32 for offload_op, and there is no way to disable offload_op unless the user manually specifies -ub 16 or less. It would be great if the user could disable offload_op directly, without having to reduce -ub.
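For reference, the threshold mentioned above lives in the CUDA backend's offload_op hook. A simplified sketch of what it looks like (function and helper names are taken from ggml-cuda.cu on current master and may differ between versions):

```cpp
// Simplified sketch of the CUDA backend's offload_op hook (ggml-cuda.cu).
// Ops whose batch dimension is below a hardcoded threshold are not offloaded,
// and nothing exposes this threshold to the user.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32; // hardcoded minimum batch size

    return get_op_batch_size(op) >= min_batch_size;

    GGML_UNUSED(dev);
}
```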
Motivation
With the introduction of --override-tensor, it has become practical to offload experts to host DRAM in large MoEs while keeping the dense tensors on a GPU with relatively small VRAM. However, in the current implementation, prompt processing performance is not ideal in some configurations due to offload_op being used.
For example, when running llama4 400B with -ot exps=CPU using code from the master branch, prompt processing performance is extremely poor when -ub is set to 512 (the default). Setting -ub to 16 bypasses offload_op in the CUDA backend, but performance is still not fully on par with -ub 512 with offload_op disabled in the source code.

```
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
```

With offload_op changed to always return false in the CUDA backend, there is a 10x performance boost.

```
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
```
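The "disabled from source code" variant above refers to patching that CUDA hook; a minimal illustration of the change (against the sketch shown earlier) could be:

```cpp
// Illustrative patch for the comparison above: make the CUDA backend never
// request offloading, so ops stay on the backend that holds their weights.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    GGML_UNUSED(dev);
    GGML_UNUSED(op);
    return false; // previously: get_op_batch_size(op) >= min_batch_size
}
```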
Possible Implementation
In ggml-backend.cpp, add an additional option and check around the following offload_op call.
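A minimal sketch of one way to do this, assuming an environment variable is used as the switch (GGML_SCHED_DISABLE_OFFLOAD_OP and the helper below are placeholder names, not existing API; a llama.cpp CLI flag or an explicit scheduler option could serve the same purpose):

```cpp
// Hypothetical sketch (ggml-backend.cpp): a process-wide switch to disable offload_op.
// GGML_SCHED_DISABLE_OFFLOAD_OP and this helper are assumed names, not existing API.
#include <stdlib.h>

static bool ggml_backend_sched_offload_op_enabled(void) {
    static int enabled = -1; // -1 = not yet determined
    if (enabled < 0) {
        const char * env = getenv("GGML_SCHED_DISABLE_OFFLOAD_OP");
        enabled = (env == NULL || env[0] == '\0' || env[0] == '0') ? 1 : 0;
    }
    return enabled == 1;
}

// Then, where the scheduler currently asks backends whether they want to offload
// an op, add the new check in front of the existing ones, e.g.:
//
//     if (ggml_backend_sched_offload_op_enabled() &&
//         ggml_backend_supports_op(sched->backends[b], tensor) &&
//         ggml_backend_offload_op(sched->backends[b], tensor)) {
//         // offload to the higher-priority backend as before
//     }
```

An environment variable keeps the change contained to ggml, but a command-line option in llama.cpp (or a per-scheduler flag) would be more discoverable; either would satisfy the feature request.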