llama : Add Gemma 3 support (+ experimental vision capability) #12343
Conversation
Thanks so much for your efforts in getting this available quickly! |
Thanks for the approval, I'll merge once the 3 remaining workflows are green |
This is the work of a world-class AI expert, with valuable support from the Google team and the Hugging Face team, and it came together really quickly. By the way, I'm a bit curious about the plan or roadmap for Qualcomm's official ggml-qnn backend. Does Qualcomm China have an independent team working on this? |
Yes, part of the reason I could get it working so quickly is the clean code and documentation from HF and Google! CI is now 3/4 green, should be able to merge soon 🚀 |
The iOS workflow failure is unrelated to this PR; I'll merge once the Ubuntu and macOS runs both pass |
…org#12343)
* llama : Add Gemma 3 text-only support
* fix python coding style
* fix compile on ubuntu
* python: fix style
* fix ubuntu compile
* fix build on ubuntu (again)
* fix ubuntu build, finally
* clip : Experimental support for Gemma 3 vision (ggml-org#12344)
* clip : Experimental support for Gemma 3 vision
* fix build
* PRId64
I've heard multiple reports of VRAM usage skyrocketing even at fairly low context sizes. It needs even more than Gemma 2, which was already far less efficient than Mistral, Llama, and Qwen models. That's a bit disappointing, especially considering the architecture paper mentions memory savings. Is there optimization potential left? I remember reading in the paper that they use SWA for lower memory consumption, which AFAIK is not supported by llama.cpp. |
Amazing work! It works quite well. Is there any chance of adding vision support to llama-server as well? |
100% chance, but we are just not sure when |
Super happy to hear that! If I may ask, is the "not sure when" part closer to "a couple of days" or to "a couple of months"? I mean, I know it won't change anything and it'll be done when it's done, but in case it's likely to take a large amount of time I'd be really glad to know what the situation looks like approximately. |
It is most likely a couple of months. The main issue is that we also want a better project structure that supports other modalities in the future (like TTS or speech-in/speech-out), so from an outside POV, things will look quite slow over the next few months. |
I see, good luck with the project then! |
Hello, even with that PR implemented, the KV cache sizes are still huge. For example, Gemma 3 4B IT takes roughly the same amount of KV cache at 4 bit for 4K context as Llama 3.1 8B does, despite Llama being much bigger. This may be because interleaved sliding-window attention is not implemented in llama.cpp (#12637), and with bigger models and context sizes it's even more noticeable, which makes them hard to run for many people. Do you think there are any limitations in llama.cpp that prevent iSWA from being supported? Curious to hear your thoughts. |
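For context on why the cache grows like this: the usual accounting for a full (non-sliding-window) KV cache is 2 × n_layers × n_kv_heads × head_dim × n_ctx × bytes_per_element, so a deep model with a large head dimension can out-consume a nominally bigger one, and iSWA would cap most layers at the window size instead of the full context. A minimal back-of-the-envelope sketch follows; every parameter value in it is an illustrative assumption, not a number taken from this thread or from any official model config:

```bash
# Back-of-the-envelope size of a full (non-SWA) KV cache, in bytes.
# All parameter values are illustrative assumptions only.
n_layers=34; n_kv_heads=4; head_dim=256; n_ctx=4096; bytes_per_elem=2  # f16 cache
echo $(( 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem ))  # K and V, per layer/head/position
```

At these assumed values the result is about 544 MiB; halving the context, or limiting most layers to a sliding window, shrinks it roughly proportionally.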
Model info
Important
This PR only covers text inference. The vision tower will be ignored when converting to GGUF (a conversion sketch follows below).
PR for vision support: #12344
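For reference, a minimal sketch of the conversion step mentioned in the note above, using the repo's convert_hf_to_gguf.py on a local Hugging Face checkpoint; the input path, output filename, and dtype below are placeholders, not values from this PR:

```bash
# Sketch only: input path and output name are placeholders.
# The vision tower is skipped, so the resulting GGUF is text-only.
python convert_hf_to_gguf.py /path/to/gemma-3-4b-it \
    --outfile gemma-3-4b-it-f16.gguf \
    --outtype f16
```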
Demo running text-only model via llama-cli
Download the pre-quantized GGUF from the link above
Clone this project and build llama-cli:
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-cli
Run it:
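A minimal example invocation (the GGUF filename and prompt are placeholders; point -m at whichever file you downloaded):

```bash
# Interactive chat with the downloaded model; the filename is a placeholder.
./build/bin/llama-cli -m gemma-3-4b-it-Q4_K_M.gguf -cnv \
    -p "You are a helpful assistant."
```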
Huge thanks to the Google and Hugging Face teams for providing invaluable support on this PR 🤗