Feature Request: add per-request "reasoning" options in llama-server #13272

Open
ngxson opened this issue May 2, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@ngxson (Collaborator) commented May 2, 2025

Feature Description

As reasoning models become mainstream, we are starting to see some patterns:

  • Most models use <think>, <reasoning>, etc., which is by now essentially a known set of tokens
  • A "reasoning budget" can technically be supported by any model, not just Qwen, by tracking the number of tokens between <think> and </think>
  • "no think" is just a reasoning budget of 0

So I'm thinking about accepting an object like this for each request:

"reasoning": {
    "budget": -1, // number of reasoning tokens budget
                     default: -1 (inf) ; 0 for no think
    "format": "", // equivalent of --reasoning-format
                     if set to "deepseek", reasoning will be returned in "message.reasoning_content"
                     if set to "hide", it will be completely hidden
                     default: "none", return the reasoning with the message as normal
}
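To make the defaults concrete, here is a minimal sketch of how a server could parse and validate this object. This is not actual llama-server code; the function name and the returned dict shape are illustrative assumptions based on the proposal above.

```python
def parse_reasoning_options(request: dict) -> dict:
    """Parse the proposed per-request "reasoning" object, applying the
    defaults described above (budget -1 = unlimited, format "none")."""
    opts = request.get("reasoning", {})
    budget = opts.get("budget", -1)   # -1 = infinite, 0 = no think
    fmt = opts.get("format", "none")  # "none" | "deepseek" | "hide"
    if not isinstance(budget, int) or budget < -1:
        raise ValueError("reasoning.budget must be an integer >= -1")
    if fmt not in ("none", "deepseek", "hide"):
        raise ValueError(f"unknown reasoning.format: {fmt!r}")
    return {"budget": budget, "format": fmt}
```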

The reasoning format "hide" can be implemented via #13214; the "deepseek" format is currently only supported for non-streaming responses, but I think we can modify it a bit to support streaming too.

For the budget, we don't yet have the logic to handle it.
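For illustration, a rough sketch of what that budget logic could look like, operating on a token stream. This is a simplification and an assumption on my part: tokens are plain strings here, and overflow reasoning tokens are filtered after the fact, whereas the real server would work on token ids and would inject </think> into the sequence so the model stops reasoning immediately.

```python
def enforce_reasoning_budget(tokens, budget):
    """Keep at most `budget` tokens between <think> and </think>.
    budget == -1 means unlimited; budget == 0 means no reasoning at all.
    Once the budget is exhausted, a closing </think> is emitted and any
    remaining reasoning tokens are dropped."""
    out, state, used = [], "answer", 0
    for tok in tokens:
        if tok == "<think>":
            out.append(tok)
            if budget == 0:
                out.append("</think>")
                state = "truncated"
            else:
                state = "think"
        elif tok == "</think>":
            if state == "think":
                out.append(tok)
            state = "answer"
        elif state == "think":
            out.append(tok)
            used += 1
            if budget != -1 and used >= budget:
                out.append("</think>")
                state = "truncated"
        elif state == "answer":
            out.append(tok)
        # state == "truncated": drop overflow reasoning tokens
    return out
```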

@ngxson ngxson added the enhancement New feature or request label May 2, 2025
@GreenCappuccino

Another interesting option: maybe expose reasoning_effort as a Jinja templating variable? It could be used with Qwen3, where "low" could pre-fill the <think> block ahead of time, and it would be OpenAI-compatible.
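To sketch the idea: a hypothetical chat-template fragment (the template syntax and control tokens are illustrative, not from any shipped Qwen3 template) could branch on such a variable and pre-fill an empty <think> block so the model skips reasoning:

```
{# Hypothetical: reasoning_effort exposed by the server to the template #}
{%- if reasoning_effort is defined and reasoning_effort == "low" -%}
<|im_start|>assistant
<think>

</think>
{%- else -%}
<|im_start|>assistant
{%- endif -%}
```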
