Prefilling assistant message in openai compatible API #13174

matteoserva · 2025-04-29T07:22:20Z

This adds support for prefilling the assistant's response (or its thought process) through the OpenAI-compatible API.

The feature is used, for example, by Claude.

It can be tested using open-webui or with the following curl command:

curl http://localhost:8080/apply-template \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
 {
    "role": "system",
    "content": "SYSTEM"
 },
 {
    "role": "user",
    "content": "USERMESSAGE"
 },
 {
    "role": "assistant",
    "content": "ASSISTANT"
 }
]
}'
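For scripting the same test, here is a minimal Python sketch using only the standard library. The helper names `build_prefill_payload` and `post_json` are made up for illustration; the URL, headers, and message layout mirror the curl command above. The thread does not describe the response format, so the sketch assumes the server answers with JSON and simply returns it decoded.

```python
import json
import urllib.request


def build_prefill_payload(system, user, assistant_prefix, model="gpt-3.5-turbo"):
    """Build a chat payload whose last message is a partial assistant reply."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            # A trailing assistant message is what triggers the prefill.
            {"role": "assistant", "content": assistant_prefix},
        ],
    }


def post_json(url, payload, api_key="no-key"):
    """POST the payload as JSON; assumes (unverified) a JSON response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs a running server on localhost:8080):
#   payload = build_prefill_payload("SYSTEM", "USERMESSAGE", "ASSISTANT")
#   print(post_json("http://localhost:8080/apply-template", payload))
```

Keeping payload construction separate from the network call makes it easy to check the message layout without a server running.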

Example advanced scenario: time limit for the thinking process

1. Launch a reasoning model and stop its thought process early.
2. Append </think> to its partial response.
3. Prefill the response and let it continue generating tokens.
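The three steps above could be prepared on the client side roughly as follows. This is only a sketch: it assumes the model wraps its reasoning in `<think>` ... `</think>`, that the partial response was already captured from an earlier request that was cut short, and the helper names `close_thinking` and `build_continuation_messages` are hypothetical. The check against a history that already ends with an assistant message echoes the commit title "no more than one assistant message at end of messages", but it is not the server's code.

```python
THINK_CLOSE = "</think>"  # assumed closing tag of the reasoning block


def close_thinking(partial_response, closing_tag=THINK_CLOSE):
    """Append the closing tag unless the model already finished thinking."""
    if closing_tag in partial_response:
        return partial_response
    return partial_response.rstrip() + "\n" + closing_tag + "\n"


def build_continuation_messages(history, partial_response):
    """Return a new list: history plus one assistant message to be continued."""
    if history and history[-1]["role"] == "assistant":
        # Two trailing assistant messages would make the prefill ambiguous.
        raise ValueError("history must not already end with an assistant message")
    prefill = {"role": "assistant", "content": close_thinking(partial_response)}
    return list(history) + [prefill]


# Usage: the returned list becomes the "messages" field of the next request.
history = [{"role": "user", "content": "What is 2 + 2?"}]
messages = build_continuation_messages(history, "<think>\nLet me add the numbers")
```

The resulting final message carries the truncated reasoning followed by the closing tag, so the model resumes with its answer instead of more thinking.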

examples/server/utils.hpp

isaac-mcfadyen · 2025-04-30T04:26:08Z

Just a heads-up that this is potentially a very breaking change, especially because this is an OpenAI compatible API but this is not OpenAI's behavior.

The main situation I can think of is someone wanting to generate a new assistant message after the last one, i.e., for ChatML they want <|im_end|><|im_start|>assistant added between the last message and the new one, rather than having the last message simply continued.

I'd suggest we add this to #9291 at a minimum.
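To make the difference raised in this comment concrete, here is a toy ChatML renderer. It is purely illustrative and is not llama.cpp's template engine; the function name `render_chatml` and the `continue_last_assistant` flag are invented for this sketch.

```python
def render_chatml(messages, continue_last_assistant):
    """Toy ChatML renderer showing 'continue' versus 'new message' behavior."""
    parts = []
    for index, message in enumerate(messages):
        is_last = index == len(messages) - 1
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}")
        if is_last and message["role"] == "assistant" and continue_last_assistant:
            # Prefill: leave the turn open so generation continues this message.
            return "".join(parts)
        parts.append("<|im_end|>\n")
    # No prefill: every turn is closed and a fresh assistant turn is opened.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
prefilled = render_chatml(chat, continue_last_assistant=True)
new_turn = render_chatml(chat, continue_last_assistant=False)
```

With the flag set, the prompt ends in the open assistant turn, so the model extends "Hello". Without it, the prompt closes that turn with `<|im_end|>` and opens a new `<|im_start|>assistant` turn, which is the behavior a client expecting a separate follow-up message would rely on.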

Prefilling assistant message in openai compatible API

e829173

matteoserva requested a review from ngxson as a code owner April 29, 2025 07:22

github-actions bot added examples server labels Apr 29, 2025

fixed indentation

9d96e5c

ngxson reviewed Apr 29, 2025

matteo added 2 commits April 29, 2025 09:46

fixed code convention

496f08e

simplify method usage

79eb825

ngxson reviewed Apr 29, 2025

no more than one assistant message at end of messages

0c316cd

ngxson reviewed Apr 29, 2025

merge checks into prefill code

cb7fe04

ngxson reviewed Apr 29, 2025

Update examples/server/utils.hpp

836015d

ngxson approved these changes Apr 29, 2025

ngxson merged commit e2e1ddb into ggml-org:master Apr 29, 2025
47 of 48 checks passed

ngxson mentioned this pull request Apr 30, 2025

changelog : llama-server REST API #9291

Open
