llama : Add Gemma 3 support (+ experimental vision capability) #12343
Conversation
Thanks so much for your efforts in getting this available quickly! |
Thanks for the approval, I'll merge once the 3 remaining workflows are green |
This is the work of a world-class AI expert, with valuable support from the Google team and the Hugging Face team, and it came together really quickly. By the way, I'm a bit curious about the plan or roadmap for Qualcomm's official ggml-qnn backend. Does Qualcomm China have an independent team working on this? |
Yes, part of the reason I could get it working so quickly is the clean code and documentation from HF and Google! CI is now 3/4 green, should be able to merge soon 🚀 |
The iOS workflow failure is unrelated to this PR; I'll merge once the Ubuntu and macOS runs both pass |
…org#12343)
* llama : Add Gemma 3 text-only support
* fix python coding style
* fix compile on ubuntu
* python: fix style
* fix ubuntu compile
* fix build on ubuntu (again)
* fix ubuntu build, finally
* clip : Experimental support for Gemma 3 vision (ggml-org#12344)
* clip : Experimental support for Gemma 3 vision
* fix build
* PRId64
I've heard multiple reports of VRAM usage skyrocketing even at fairly low context sizes. It needs even more than Gemma 2, which was already far less efficient than Mistral, Llama, and Qwen models. That's a bit disappointing, especially considering the architecture paper mentions memory savings. Is there optimization potential left? I remember reading in the paper that they use SWA for lower memory consumption, which AFAIK is not supported by llama.cpp. |
Amazing work! It works quite well. Is there any chance of adding vision support to llama-server as well? |
100% chance, but we are just not sure when |
Super happy to hear that! If I may ask, is the "not sure when" part closer to "a couple of days" or to "a couple of months"? I mean, I know it won't change anything and it'll be done when it's done, but in case it's likely to take a large amount of time I'd be really glad to know what the situation looks like approximately. |
It is most likely a couple of months. The main issue is that we also want a better project structure that supports other modalities in the future (like TTS or speech-in/speech-out), so from an outside POV, things will look quite slow over the next few months. |
I see, good luck with the project then! |
Hello, even with that PR implemented, the KV cache sizes are still huge. For example, Gemma 3 4B IT takes roughly the same amount of KV cache at 4 bit for 4K context as Llama 3.1 8B does, despite Llama being much bigger. This may be because interleaved sliding-window attention is not implemented in llama.cpp (#12637), and with bigger models and context sizes it's even more noticeable, which makes them hard to run for many people. Do you think there are any limitations in llama.cpp that prevent iSWA from being supported? Curious to hear your thoughts. |
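For context on why the cache grows like this: the usual accounting for a full (non-sliding-window) KV cache is 2 × n_layers × n_kv_heads × head_dim × n_ctx × bytes_per_element, so a deep model with a large head dimension can out-consume a nominally bigger one, and iSWA would cap most layers at the window size instead of the full context. A minimal back-of-the-envelope sketch follows; every parameter value in it is an illustrative assumption, not a number taken from this thread or from any official model config:

```bash
# Back-of-the-envelope size of a full (non-SWA) KV cache, in bytes.
# All parameter values are illustrative assumptions only.
n_layers=34; n_kv_heads=4; head_dim=256; n_ctx=4096; bytes_per_elem=2  # f16 cache
echo $(( 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem ))  # K and V, per layer/head/position
```

At these assumed values the result is about 544 MiB; halving the context, or limiting most layers to a sliding window, shrinks it roughly proportionally.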
Model info
Important
This PR only covers text inference. The vision tower will be ignored when converting to GGUF (a conversion sketch follows below).
PR for vision support: #12344
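For reference, a minimal sketch of the conversion step mentioned in the note above, using the repo's convert_hf_to_gguf.py on a local Hugging Face checkpoint; the input path, output filename, and dtype below are placeholders, not values from this PR:

```bash
# Sketch only: input path and output name are placeholders.
# The vision tower is skipped, so the resulting GGUF is text-only.
python convert_hf_to_gguf.py /path/to/gemma-3-4b-it \
    --outfile gemma-3-4b-it-f16.gguf \
    --outtype f16
```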
Demo running text-only model via llama-cli
Download the pre-quantized GGUF from the link above
Clone this project and build llama-cli:
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-cli
Run it:
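A minimal example invocation (the GGUF filename and prompt are placeholders; point -m at whichever file you downloaded):

```bash
# Interactive chat with the downloaded model; the filename is a placeholder.
./build/bin/llama-cli -m gemma-3-4b-it-Q4_K_M.gguf -cnv \
    -p "You are a helpful assistant."
```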
Huge thanks to the Google and Hugging Face teams for providing invaluable support on this PR 🤗