server : vision support via libmtmd #12898

Merged (79 commits) on May 9, 2025
Commits
466c6cd
server : (experimental) vision support via libmtmd
ngxson Apr 11, 2025
2317e61
mtmd : add more api around mtmd_image_tokens
ngxson Apr 11, 2025
a46b6db
mtmd : add more api around mtmd_image_tokens
ngxson Apr 11, 2025
7ac0b7b
mtmd : ability to calc image hash
ngxson Apr 11, 2025
58c4767
shared_ptr for mtmd_image_tokens
ngxson Apr 12, 2025
d3c3e20
move hash to user-define ID (fixed)
ngxson Apr 12, 2025
a44029a
Merge branch 'xsn/mtmd_image_api' into xsn/server_mtmd
ngxson Apr 13, 2025
5e6c7ba
abstract out the batch management
ngxson Apr 13, 2025
78a76de
Merge branch 'master' into xsn/server_mtmd
ngxson Apr 14, 2025
c734b53
Merge branch 'master' into xsn/server_mtmd
ngxson Apr 21, 2025
a6a3653
small fix
ngxson Apr 21, 2025
f8bc466
refactor logic adding tokens to batch
ngxson Apr 21, 2025
f5420e1
implement hashing image
ngxson Apr 21, 2025
aae2e69
Merge branch 'master' into xsn/server_mtmd
ngxson Apr 23, 2025
cd11585
use FNV hash, now hash bitmap instead of file data
ngxson Apr 23, 2025
8afa952
allow decoding image embedding to be split into batches
ngxson Apr 23, 2025
989730c
rm whitespace
ngxson Apr 23, 2025
19b9fe1
Merge branch 'master' into xsn/server_mtmd
ngxson Apr 24, 2025
2df8c1a
disable some features when mtmd is on
ngxson Apr 24, 2025
b9ef895
fix --no-mmproj-offload
ngxson Apr 25, 2025
add9e21
mtmd_context_params no timings
ngxson Apr 25, 2025
0f39770
Merge branch 'master' into xsn/server_mtmd
ngxson Apr 25, 2025
58100b3
refactor server_inp to server_tokens
ngxson Apr 25, 2025
e82fea8
fix the failing test case
ngxson Apr 25, 2025
4a4f35c
init
ngxson Apr 29, 2025
f6b6517
wip
ngxson Apr 29, 2025
e0806c2
Merge branch 'master' into xsn/mtmd_c_api
ngxson Apr 29, 2025
82f4246
working version
ngxson Apr 29, 2025
f8c27b9
add mtmd::bitmaps
ngxson Apr 29, 2025
3357961
add test target
ngxson Apr 29, 2025
92d2404
rm redundant define
ngxson Apr 29, 2025
111d5af
test: mtmd_input_chunks_free
ngxson Apr 29, 2025
08d0f9c
rm outdated comment
ngxson Apr 29, 2025
a230804
Merge branch 'master' into xsn/mtmd_c_api
ngxson May 2, 2025
863db31
fix merging issue
ngxson May 2, 2025
a0fb701
explicitly create mtmd::input_chunks
ngxson May 2, 2025
6bc7a30
mtmd_input_chunk_copy
ngxson May 2, 2025
4d842eb
add clone()
ngxson May 2, 2025
f91fb97
Merge branch 'master' into xsn/server_mtmd
ngxson May 3, 2025
2cedd18
improve server_input struct
ngxson May 3, 2025
3ee071c
clip : fix confused naming ffn_up and ffn_down
ngxson May 3, 2025
3fbf0bd
rm ffn_i/o/g naming
ngxson May 3, 2025
f3870a6
rename n_embd, n_ff
ngxson May 3, 2025
ae83229
small fix
ngxson May 3, 2025
0009f76
Merge branch 'master' into xsn/clip_ffn_up_down_fix
ngxson May 3, 2025
246a4e0
no check n_ff
ngxson May 3, 2025
57b288f
Merge branch 'xsn/clip_ffn_up_down_fix' into xsn/server_mtmd
ngxson May 3, 2025
5f1fe1b
fix detokenize
ngxson May 3, 2025
06cb595
Merge branch 'master' into xsn/mtmd_c_api
ngxson May 4, 2025
e9f7ff9
add const to various places
ngxson May 4, 2025
049ae24
add warning about breaking changes
ngxson May 4, 2025
91613c0
Merge branch 'xsn/mtmd_c_api' into xsn/server_mtmd
ngxson May 4, 2025
d3fece5
add c api
ngxson May 4, 2025
076e3b9
helper: use mtmd_image_tokens_get_n_pos
ngxson May 4, 2025
574d403
Merge branch 'xsn/mtmd_c_api' into xsn/server_mtmd
ngxson May 4, 2025
036f682
Merge branch 'master' into xsn/server_mtmd
ngxson May 4, 2025
01c623e
fix ctx_shift
ngxson May 4, 2025
a0f2562
fix name shadowing
ngxson May 4, 2025
9149f39
Merge branch 'master' into xsn/server_mtmd
ngxson May 5, 2025
b353038
Merge branch 'master' into xsn/server_mtmd
ngxson May 6, 2025
3304b44
more strict condition
ngxson May 6, 2025
88461f2
support remote image_url
ngxson May 6, 2025
4adce86
Merge branch 'master' into xsn/server_mtmd
ngxson May 6, 2025
a9b21f4
remote image_url log
ngxson May 6, 2025
2f30530
add CI test
ngxson May 6, 2025
5ffde38
do not log base64
ngxson May 6, 2025
aaebc33
add "has_multimodal" to /props
ngxson May 8, 2025
eeda075
remove dangling image
ngxson May 8, 2025
bef122e
speculative: use slot.cache_tokens.insert
ngxson May 8, 2025
7282456
Merge branch 'master' into xsn/server_mtmd
ngxson May 8, 2025
51afc0a
Apply suggestions from code review
ngxson May 9, 2025
f10fc56
rm can_be_detokenized
ngxson May 9, 2025
689035c
on prmpt processing done, assert cache_tokens.size
ngxson May 9, 2025
b2906a9
handle_completions_impl returns void
ngxson May 9, 2025
abfd821
Merge branch 'master' into xsn/server_mtmd
ngxson May 9, 2025
f5fbc03
adapt the new web ui
ngxson May 9, 2025
5fe8d72
update docs and hot topics
ngxson May 9, 2025
b8000fd
rm assert
ngxson May 9, 2025
9ed430c
small fix (2)
ngxson May 9, 2025
3 changes: 2 additions & 1 deletion README.md
@@ -16,8 +16,9 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)

## Hot topics

- 🔥 Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
- **GGML developer experience survey (organized and reviewed by NVIDIA):** [link](https://forms.gle/Gasw3cRgyhNEnrwK9)
- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli`, `gemma3-cli` ([#13012](https://github.com/ggml-org/llama.cpp/pull/13012)) and `qwen2vl-cli` ([#13141]((https://github.com/ggml-org/llama.cpp/pull/13141))), `libllava` will be deprecated
- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli`, `gemma3-cli` ([#13012](https://github.com/ggml-org/llama.cpp/pull/13012)) and `qwen2vl-cli` ([#13141](https://github.com/ggml-org/llama.cpp/pull/13141)), `libllava` will be deprecated
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Universal [tool call support](./docs/function-calling.md) in `llama-server` https://github.com/ggml-org/llama.cpp/pull/9639
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
2 changes: 1 addition & 1 deletion common/arg.cpp
@@ -40,7 +40,7 @@ using json = nlohmann::ordered_json;

std::initializer_list<enum llama_example> mmproj_examples = {
LLAMA_EXAMPLE_LLAVA,
// TODO: add LLAMA_EXAMPLE_SERVER when it's ready
LLAMA_EXAMPLE_SERVER,
};

static std::string read_file(const std::string & fname) {
69 changes: 69 additions & 0 deletions docs/multimodal.md
@@ -0,0 +1,69 @@
# Multimodal

llama.cpp supports multimodal input via `libmtmd`. Currently, two tools support this feature:
- [llama-mtmd-cli](../tools/mtmd/README.md)
- [llama-server](../tools/server/README.md) via OpenAI-compatible `/chat/completions` API

To enable it, use one of the two methods below:

- Use the `-hf` option with a [supported model](../../docs/multimodal.md)
  - To load a model using `-hf` while disabling multimodal, use `--no-mmproj`
  - To load a model using `-hf` while using a custom mmproj file, use `--mmproj local_file.gguf`
- Use the `-m model.gguf` option with `--mmproj file.gguf` to specify the text model and the multimodal projector, respectively

By default, the multimodal projector is offloaded to the GPU. To disable this, add `--no-mmproj-offload`.

For example:

```sh
# simple usage with CLI
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF

# simple usage with server
llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# using local file
llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf

# no GPU offload
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
```
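
Once the server is running with a vision-capable model, images can be sent through the OpenAI-compatible `/chat/completions` endpoint. The request below is a minimal sketch, assuming the default `localhost:8080` address and a placeholder image URL:

```sh
# minimal sketch: text prompt plus a remote image, OpenAI-style payload
# (assumes default host/port; the image URL is a placeholder)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image." },
          { "type": "image_url", "image_url": { "url": "https://example.com/some-image.jpg" } }
        ]
      }
    ]
  }'
```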

## Pre-quantized models

These are ready-to-use models; most of them come with `Q4_K_M` quantization by default.

Replace `(tool_name)` with the name of the binary you want to use, for example `llama-mtmd-cli` or `llama-server`.

NOTE: some models may require a large context window, for example `-c 8192`

```sh
# Gemma 3
(tool_name) -hf ggml-org/gemma-3-4b-it-GGUF
(tool_name) -hf ggml-org/gemma-3-12b-it-GGUF
(tool_name) -hf ggml-org/gemma-3-27b-it-GGUF

# SmolVLM
(tool_name) -hf ggml-org/SmolVLM-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM-256M-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM-500M-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

# Pixtral 12B
(tool_name) -hf ggml-org/pixtral-12b-GGUF

# Qwen 2 VL
(tool_name) -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF

# Qwen 2.5 VL
(tool_name) -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF

# Mistral Small 3.1 24B (IQ2_M quantization)
(tool_name) -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
```
33 changes: 1 addition & 32 deletions tools/mtmd/README.md
@@ -16,38 +16,7 @@ The naming and structure related to multimodal support have evolved, which might

## Pre-quantized models

These are ready-to-use models, most of them come with `Q4_K_M` quantization by default:

```sh
# Gemma 3
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
llama-mtmd-cli -hf ggml-org/gemma-3-12b-it-GGUF
llama-mtmd-cli -hf ggml-org/gemma-3-27b-it-GGUF

# SmolVLM
llama-mtmd-cli -hf ggml-org/SmolVLM-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM-256M-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM-500M-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

# Pixtral 12B
llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF

# Qwen 2 VL
llama-mtmd-cli -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF

# Qwen 2.5 VL
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF

# Mistral Small 3.1 24B (IQ2_M quantization)
llama-mtmd-cli -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
```
See the list of pre-quantized models [here](../../docs/multimodal.md)

## How it works and what is `mmproj`?

3 changes: 2 additions & 1 deletion tools/server/CMakeLists.txt
@@ -34,8 +34,9 @@ endforeach()
add_executable(${TARGET} ${TARGET_SRCS})
install(TARGETS ${TARGET} RUNTIME)

target_include_directories(${TARGET} PRIVATE ../llava)
target_include_directories(${TARGET} PRIVATE ${CMAKE_SOURCE_DIR})
target_link_libraries(${TARGET} PRIVATE common ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE common mtmd ${CMAKE_THREAD_LIBS_INIT})

if (LLAMA_SERVER_SSL)
find_package(OpenSSL REQUIRED)
12 changes: 12 additions & 0 deletions tools/server/README.md
@@ -193,6 +193,12 @@ services:
LLAMA_ARG_PORT: 8080
```
### Multimodal support
Multimodal support was added in [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) and is currently an experimental feature.
For more details, please refer to the [multimodal documentation](../../docs/multimodal.md)
## Build
`llama-server` is built alongside everything else from the root of the project
@@ -749,6 +755,9 @@ This endpoint is public (no API key check). By default, it is read-only. To make
"total_slots": 1,
"model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
"chat_template": "...",
"modalities": {
"vision": false
},
"build_info": "b(build number)-(build commit hash)"
}
```
@@ -757,6 +766,7 @@ This endpoint is public (no API key check). By default, it is read-only. To make
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `model_path` - the path to model file (same with `-m` argument)
- `chat_template` - the model's original Jinja2 prompt template
- `modalities` - the list of supported modalities
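
As a quick check that vision is actually enabled for the loaded model, the new field can be inspected directly — a small sketch, assuming the default address and that `jq` is installed:

```sh
# sketch: read the vision flag from /props (assumes default host/port and jq)
curl -s http://localhost:8080/props | jq '.modalities.vision'
# should print true when an mmproj is loaded, false otherwise
```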

### POST `/props`: Change server global properties.

@@ -1069,6 +1079,8 @@ print(completion.choices[0].text)

Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.

If the model supports multimodal input, you can pass media files via the `image_url` content part. Both base64-encoded data and remote URLs are accepted as input. See the OpenAI documentation for more details.
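
For base64 input, the image is embedded as a data URI in the same `image_url` field. The snippet below is a sketch, assuming a local `image.jpg`, the default server address, and the OpenAI-style `data:` URI convention:

```sh
# sketch: embed a local image as a base64 data URI (assumes default host/port and a local image.jpg)
IMG_B64=$(base64 -w0 image.jpg)   # on macOS use: base64 -i image.jpg
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is shown in this picture?" },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,${IMG_B64}" } }
      ]
    }
  ]
}
EOF
```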

*Options:*

See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). llama.cpp `/completion`-specific features such as `mirostat` are also supported.