llama : Add Gemma 3 support (+ experimental vision capability) #12343

Merged: 8 commits merged into master on Mar 12, 2025

Conversation

ngxson (Collaborator) commented Mar 12, 2025

Model info

Important

This PR only covers text inference. The vision tower will be ignored when converting to GGUF.
PR for vision support: #12344

Demo: running the text-only model via llama-cli

Download the pre-quantized GGUF from the link above

Clone this project and build llama-cli:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-cli

Run it

./build/bin/llama-cli -m ./gemma-3-4b-it-Q4_K_M.gguf

> who are you
I'm Gemma, a large language model created by the Gemma team at Google DeepMind. I’m an open-weights model, which means I’m widely available for public use! 

I can take text and images as inputs and generate text-based outputs. 

You can learn more about me on the Gemma project page: [https://ai.google.dev/gemma](https://ai.google.dev/gemma)

Huge thanks to the Google and Hugging Face teams for providing invaluable support on this PR 🤗

ngxson requested review from ggerganov and slaren on Mar 12, 2025 at 06:39
github-actions bot added the "python" (python script changes) label Mar 12, 2025
bartowski1182 (Contributor) commented:

Thanks so much for your efforts in getting this available quickly!

ngxson (Collaborator, Author) commented Mar 12, 2025

Thanks for the approval, I'll merge once the 3 remaining workflows are green.

ngxson (Collaborator, Author) commented Mar 12, 2025

(Small correction: because this PR is the base of #12344, I'll first merge #12344 into this one, then merge this one into master.)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
ngxson changed the title from "llama : Add Gemma 3 text-only support" to "llama : Add Gemma 3 support (+ experimental vision capability)" on Mar 12, 2025
zhouwg (Contributor) commented Mar 12, 2025

> Thanks so much for your efforts in getting this available quickly!

This is the work of a world-class AI expert, with valuable support from the Google and Hugging Face teams, and it was still done really quickly.

BTW, I'm a bit curious: what is the plan or roadmap for Qualcomm's official ggml-qnn backend? Does Qualcomm China have an independent team working on this?

ngxson (Collaborator, Author) commented Mar 12, 2025

Yes, part of the reason I could get it working so quickly is the clean code and documentation from HF and Google!

CI is now 3/4 green, should be able to merge soon 🚀

ngxson (Collaborator, Author) commented Mar 12, 2025

The iOS workflow failure is unrelated to this PR; I'll merge once the Ubuntu and macOS jobs both pass.

ngxson merged commit 7841fc7 into master Mar 12, 2025
49 of 50 checks passed
ishaangandhi pushed a commit to ishaangandhi/llama.cpp that referenced this pull request Mar 12, 2025
…org#12343)

* llama : Add Gemma 3 text-only support

* fix python coding style

* fix compile on ubuntu

* python: fix style

* fix ubuntu compile

* fix build on ubuntu (again)

* fix ubuntu build, finally

* clip : Experimental support for Gemma 3 vision (ggml-org#12344)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
Dampfinchen commented:

I've heard multiple reports of VRAM usage skyrocketing even at fairly low context lengths. It needs even more than Gemma 2, which was already far less efficient than Mistral, Llama and Qwen models. That's a bit disappointing, especially since the architecture paper mentions memory savings.

Is there optimization potential left? I remember reading in the paper that they use sliding-window attention (SWA) for lower memory consumption, which AFAIK is not supported by llama.cpp.

ngxson (Collaborator, Author) commented Mar 12, 2025

> Is there optimization potential left?

Yes, we currently allocate the full KV cache even for sliding-window layers. We plan to fix this; see #11213 and #12181.
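
To make the memory impact concrete, here is a small editorial sketch (not code from this PR; the layer split, window size, and head dimensions are illustrative placeholders) comparing a full-context KV allocation on every layer with an allocation that caps sliding-window layers at the window size:

# Editorial sketch, not code from llama.cpp: rough KV-cache size estimate.
# All dimensions below are illustrative placeholders, not an exact Gemma 3 config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

n_ctx = 8192                   # requested context length
window = 1024                  # assumed sliding-window size
n_global, n_local = 6, 28      # assumed split of global vs sliding-window layers
n_kv_heads, head_dim = 4, 256  # assumed KV heads / head dimension
f16 = 2                        # bytes per element for an f16 cache

# Current behaviour described above: every layer gets a full-context cache.
full = kv_cache_bytes(n_global + n_local, n_kv_heads, head_dim, n_ctx, f16)

# With SWA-aware allocation, sliding-window layers only need `window` cells.
swa = (kv_cache_bytes(n_global, n_kv_heads, head_dim, n_ctx, f16)
       + kv_cache_bytes(n_local, n_kv_heads, head_dim, min(n_ctx, window), f16))

print(f"full: {full / 2**20:.0f} MiB  vs  SWA-aware: {swa / 2**20:.0f} MiB")

With these placeholder numbers the full allocation is several times larger than the SWA-aware one, which is the gap the linked issues aim to close.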

ikcikoR commented Mar 13, 2025

Amazing work! It works quite well. Is there any chance of adding support for vision in llama-server as well?

ngxson (Collaborator, Author) commented Mar 13, 2025

100% chance, but we are just not sure when

ikcikoR commented Mar 13, 2025

> 100% chance, but we are just not sure when

Super happy to hear that! If I may ask, is the "not sure when" closer to "a couple of days" or "a couple of months"? I know it won't change anything and it'll be done when it's done, but if it's likely to take a long time, I'd be glad to know roughly what the situation looks like.

ngxson (Collaborator, Author) commented Mar 13, 2025

It's most likely a couple of months. The main issue is that we also want a better project structure that supports other modalities in the future (like TTS or speech-in/speech-out), so from an outside POV things will look quite slow over the next few months.

ikcikoR commented Mar 13, 2025

I see, good luck with the project then!

jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
Dampfinchen commented:

> Is there optimization potential left?
>
> Yes, we currently allocate the full KV cache even for sliding-window layers. We plan to fix this; see #11213 and #12181.

Hello, even with that PR implemented, the KV cache sizes are still huge. For example, Gemma 3 4B IT takes roughly the same amount of KV cache at 4-bit for 4K context as Llama 3.1 8B does, despite Llama being a much bigger model. This may be because interleaved sliding-window attention (iSWA) is not implemented in llama.cpp (#12637). With bigger models and context sizes it's even more noticeable, which makes them hard to run for many people. (See the rough arithmetic sketch below.)

Do you think there are any limitations in llama.cpp that prevent iSWA from being supported? Curious to hear your thoughts.
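
For a rough sense of the numbers quoted above, here is an editorial back-of-the-envelope sketch; the per-model layer and head counts are assumed from the models' published configs and may be inexact, and quantization block overhead is ignored:

# Editorial back-of-the-envelope check of the comparison above; not llama.cpp code.
# Layer/head counts are assumed and may be off; treat results as approximate.

def kv_elems_per_token(n_layers, n_kv_heads, head_dim):
    # K and V each store n_kv_heads * head_dim values per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim

gemma3_4b  = kv_elems_per_token(n_layers=34, n_kv_heads=4, head_dim=256)  # assumed config
llama31_8b = kv_elems_per_token(n_layers=32, n_kv_heads=8, head_dim=128)  # assumed config

n_ctx = 4096
bytes_per_elem = 0.5  # ~4-bit quantized cache

for name, per_tok in (("Gemma 3 4B", gemma3_4b), ("Llama 3.1 8B", llama31_8b)):
    mib = per_tok * n_ctx * bytes_per_elem / 2**20
    print(f"{name}: ~{mib:.0f} MiB KV cache at {n_ctx} ctx without SWA savings")

Under these assumptions the two per-token KV footprints come out nearly equal, which matches the observation above; iSWA would shrink the Gemma 3 figure by limiting most layers to the sliding window.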
