kv-cache : add SWA support #13194

Open: wants to merge 13 commits into master


ggerganov
Member

@ggerganov ggerganov commented Apr 29, 2025

Overview

Add class llama_kv_cache_unified_iswa for interleaved SWA attention support.

The implementation internally utilizes 2 instances of the existing llama_kv_cache_unified - one for the non-SWA and one for the SWA layers of the model. To achieve that, the llama_kv_cache_unified implementation is updated to be able to cache a subset of the model's layers (instead of always caching all layers as it is on master). The 2 internal caches behave in almost exactly the same way, with 2 main differences:

  • The SWA cache is much smaller
  • The SWA cache automatically "forgets" old tokens upon successful commit (i.e. successful batch decode)

The size of the SWA cache is computed as:

PAD(n_swa*n_seq_max + n_batch)

This way we can store the last n_swa tokens for all sequences and we also have room to evaluate a new batch of tokens with size up to n_batch.
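
As a rough illustration of the sizing formula above, here is a minimal sketch. The helper names and the `n_pad` padding granularity are hypothetical and are not taken from the PR; only the expression `n_swa*n_seq_max + n_batch` comes from the description.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n (n > 0).
static uint32_t pad_to(uint32_t x, uint32_t n) {
    assert(n > 0);
    return ((x + n - 1) / n) * n;
}

// Number of cells for the SWA cache: the last n_swa tokens of every sequence
// plus room for one batch of up to n_batch new tokens.
// n_pad is a hypothetical padding granularity (an assumption of this sketch).
static uint32_t swa_cache_size(uint32_t n_swa, uint32_t n_seq_max, uint32_t n_batch, uint32_t n_pad) {
    return pad_to(n_swa*n_seq_max + n_batch, n_pad);
}
```

For example, with n_swa = 1024, n_seq_max = 1, n_batch = 2048 and an assumed padding of 256, this yields 3072 cells.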

Note that by pruning the SWA tokens in the llama_kv_cache_unified_iswa::commit() call, we are able to correctly handle errors during llama_decode() and thus restore the KV cache to its original state in such cases.

The new llama_kv_cache_unified_iswa can be used for non-SWA models with n_swa = n_ctx_train.

Note that advanced cache operations such as removing tokens or shifting their positions are not mathematically equivalent to full processing when using iSWA caches. For such cases, we can fall back to the old implementation by expanding the SWA cache size to the full context and disabling the SWA token pruning. This of course would lead to more memory usage. However, this logic is currently disabled because the results appear to be good enough even when the SWA cache has been pruned in such cases. This needs more testing and verification.


  • Move KV cache store and view logic from llama-graph to llama-kv-cache
  • Move KV cache mask creation logic from llama-graph to llama-kv-cache
  • The inputs to build_attn_mha() are now not permuted
  • The QKV self-attention code is now more harmonious:
      const llama_kv_cache_unified * kv_self = static_cast<const llama_kv_cache_unified *>(memory);
    
      // store to KV cache
      {
          ggml_build_forward_expand(gf, kv_self->cpy_k(ctx0, k_cur, il));
          ggml_build_forward_expand(gf, kv_self->cpy_v(ctx0, v_cur, il));
      }
    
      const auto & kq_mask = inp->get_kq_mask();
    
      ggml_tensor * q = q_cur;
      ggml_tensor * k = kv_self->get_k(ctx0, il);
      ggml_tensor * v = kv_self->get_v(ctx0, il);
    
      ggml_tensor * cur = build_attn_mha(gf, q, k, v, kq_b, kq_mask, v_mla, kq_scale);
      cb(cur, "kqv_out", il);
  • Add enum hparams.swa_type to support chunked and non-chunked SWA (remove hparams.n_attn_chunk)
  • Add class llama_kv_cache_unified_iswa - new iSWA cache that internally utilizes 2 standard llama_kv_cache_unified instances
  • Make the llama_kv_cache_unified implementation more private and polish the interface
  • Move the Llama 4 build function to a new llm_build_llama_iswa()
  • llama-server now respects llama_kv_self_can_shift(ctx)
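
The difference between chunked and non-chunked SWA mentioned in the list above can be sketched as a masking predicate. This is an illustration under assumptions, not code from the PR: the enum and function names are hypothetical, and the semantics are assumed to be "a key is visible only within the last n_swa positions" for the standard window and "a key is visible only within the query's own chunk of n_swa positions" for the chunked variant.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the PR only states that hparams.swa_type selects the mode.
enum class swa_type_t { none, standard, chunked };

// Returns true if the key at position p0 is hidden from the query at position p1 (p0 <= p1).
static bool is_masked_swa(swa_type_t type, int32_t n_swa, int32_t p0, int32_t p1) {
    assert(p0 >= 0 && p1 >= p0);
    switch (type) {
        case swa_type_t::none:
            return false; // full attention: nothing is hidden by the window
        case swa_type_t::standard:
            assert(n_swa > 0);
            return p1 - p0 >= n_swa; // key is too far behind the query
        case swa_type_t::chunked: {
            assert(n_swa > 0);
            // first position of the chunk that contains the query
            const int32_t chunk_start = (p1 / n_swa) * n_swa;
            return p0 < chunk_start; // key belongs to an earlier chunk
        }
    }
    return false;
}
```

With n_swa = 4 and a query at position 9, the standard window hides keys at positions 5 and below, while the chunked variant hides everything before position 8.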
outdated

This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention (SWA) in order to reduce the memory usage for models such as Gemma3.

However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

The reason we cannot do context caching with SWA enabled is because when the window slides, we "forget" the old KV stuff and there is no way to recover it without recomputing it. This means, no prefix cache in llama-server (ok, just last-prefix caching works), no context shift, no context reuse, etc. So I am having some doubts if this is really worth supporting.

Any thoughts?

TODO

  • Cut-off old SWA tokens in llama_kv_cache_unified_iswa::commit()
  • Pass n_seq_max and n_batch to the KV cache and utilize it to determine SWA cache size
  • Allow KV shift when SWA window size is big enough
  • Add limits to batch size based on SWA window
  • llama-server check for llama_kv_self_can_shift
  • Add context parameter for adjusting SWA window size (i.e. switch between small and large SWA cache; see the comment in kv-cache : add SWA support #13194)

Testing

Any help with testing the following scenarios and reporting the results is highly appreciated:

  • Llama 4
  • Phi 3
  • Gemma 2
  • Gemma 3
  • Cohere 2
  • Multi-user
  • Context shift
  • Context reuse
  • Speculative decoding?

Next PRs

  • Split KV cache implementations in separate source files
  • Remove llama_kv_cache_view API (not useful, can be replaced with internal debugging functions)

@slaren
Member

slaren commented Apr 29, 2025

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

@ngxson
Collaborator

ngxson commented Apr 29, 2025

> However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

Yes, this is what I have been thinking about for months now. There is no better solution than to disable context caching in this case.

An alternative solution is to allow the user to choose one of the two: either a proper SWA cache (good for memory) or a full allocation (good for reusing the cache).

> So I am having some doubts if this is really worth supporting.

I'm feeling 50/50 here. One of the biggest use cases would be processing a large and diverse set of documents locally. In this case, the user may never reuse the cache because each new request is a new document.

@ggerganov
Member Author

> It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

The way I am approaching it is to have the "KV cells" information maintained separately for the non-SWA and SWA layers. This way, upon each KV cache commit (see #12799), we can do a pass over the SWA cells and automatically remove those that have position pos < pos_max(seq_id) - n_swa. Note that such tokens are only pruned from the SWA cells, while they remain in the non-SWA cells. When constructing the KQ mask for the graph, we use the non-SWA cells to construct the kq_mask and the SWA cells to construct the kq_mask_swa.

The rest of the logic is the same - it just operates on both sets of cells. For example, find_slot searches in both the non-SWA and SWA cells.
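
The pruning pass described in this comment could look roughly like the sketch below. The `swa_cell` struct and `prune_swa_cells` function are hypothetical simplifications (one sequence per cell), not the actual llama_kv_cache_unified data structures; only the rule `pos < pos_max(seq_id) - n_swa` comes from the comment.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified stand-in for a KV cell: one position, one sequence.
struct swa_cell {
    int32_t pos    = -1; // -1 means the cell is empty
    int32_t seq_id = -1;
};

// Free every SWA cell with pos < pos_max(seq_id) - n_swa.
// Returns the number of cells that were freed.
static int prune_swa_cells(std::vector<swa_cell> & cells, int32_t n_swa) {
    assert(n_swa > 0);

    // first pass: find the largest position of each sequence
    std::map<int32_t, int32_t> pos_max;
    for (const auto & c : cells) {
        if (c.pos < 0) {
            continue;
        }
        auto it = pos_max.find(c.seq_id);
        if (it == pos_max.end() || c.pos > it->second) {
            pos_max[c.seq_id] = c.pos;
        }
    }

    // second pass: clear the cells that fell out of the window
    int n_freed = 0;
    for (auto & c : cells) {
        if (c.pos >= 0 && c.pos < pos_max[c.seq_id] - n_swa) {
            c = swa_cell{};
            n_freed++;
        }
    }
    return n_freed;
}
```

In this sketch the non-SWA cells would simply never be passed to `prune_swa_cells`, so they keep all tokens.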

@JohannesGaessler
Collaborator

My experience with the Gemma models in the context of Elo HeLLM has been that they required a disproportionate amount of computational resources to run benchmarks. The reason is that I was able to fit comparatively fewer parallel slots on 1 or 2 GPUs and my throughput was lower as a consequence. At least for my use case I value low memory usage for the context more than I value prompt caching because I have O(10000) short prompts and I'm bottlenecked mostly by generation throughput.

@ggerganov
Member Author

Continuing to think about the logic for when to discard tokens from the cache: it's indeed tricky and not very clear how to do it. For example, when doing speculative decoding, we can submit a draft batch with D tokens to the target model. If we apply the pruning logic from my previous comment strictly, this would cause the SWA layers to "forget" D-1 of the oldest tokens, which would be problematic if the draft gets rejected. This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.
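
To make the effect of the extra margin concrete, here is a small sketch of the relaxed prune rule. The function names are hypothetical; the two expressions are the ones proposed in the comment above, and the sketch assumes the draft size D does not exceed n_batch.

```cpp
#include <cassert>
#include <cstdint>

// Prune rule with an n_batch safety margin: pos < pos_max - n_swa - n_batch.
// Passing n_batch = 0 gives the strict rule from the earlier comment.
static bool can_prune(int32_t pos, int32_t pos_max, int32_t n_swa, int32_t n_batch) {
    assert(n_swa > 0 && n_batch >= 0);
    return pos < pos_max - n_swa - n_batch;
}

// Per-sequence SWA capacity proposed in the comment: n_swa + 2*n_batch.
static int32_t swa_cells_per_seq(int32_t n_swa, int32_t n_batch) {
    return n_swa + 2*n_batch;
}
```

Take n_swa = 1024 and a draft of 8 tokens that moves pos_max from 2000 to 2008. The strict rule would drop position 980, which is still inside the window if the draft is rejected and the sequence returns to position 2000. With a margin of n_batch = 8, position 980 is kept.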

@ggerganov ggerganov force-pushed the gg/llama-kv-cache-v6 branch from e37f112 to 7e4b545 Compare April 30, 2025 07:22
@ymcki
Contributor

ymcki commented Apr 30, 2025

> It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

I second slaren's opinion. As far as I know, vllm also doesn't support iSWA while hf transformers and ollama do. vllm is geared toward the multi-user server use case. I suppose that's why they don't support it.

Ideally, it should be implemented as a switch to let the user choose which one to use. By default, iSWA should be on for llama-cli but off for llama-server.

@ngxson
Collaborator

ngxson commented Apr 30, 2025

> This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.

Yes, I was thinking about this too. I think it can be a bit complicated to manage this case, but it is totally possible.

We can let the user specify how many tokens are allocated in the sliding layers. For example, given n_swa=512, if llama_context is created with n_ctx=4096 and n_ctx_swa=1024, this will allow the user to roll back as far as n_past - (1024 - 512).

We can further let n_ctx_swa = n_ctx * scale by default to make it transparent to the end user, with scale=0.5 by default for example. If scale=-1 then n_ctx_swa=n_swa.

And finally, we may need to add an API to return the furthest n_past that the user can roll back to, maybe something like llama_kv_self_get_minimum_pos?
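
The arithmetic in this proposal can be sketched as follows. Both helpers are hypothetical illustrations, not an existing llama.cpp API; the clamping of n_ctx_swa to the range [n_swa, n_ctx] is an assumption of the sketch and is not stated in the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Size of the SWA part of the context.
// scale < 0 selects the minimal cache (n_ctx_swa = n_swa); otherwise
// n_ctx_swa = n_ctx*scale, clamped to [n_swa, n_ctx] (the clamp is an assumption).
static int32_t swa_ctx_size(int32_t n_ctx, int32_t n_swa, float scale) {
    assert(n_swa > 0 && n_ctx >= n_swa);
    if (scale < 0.0f) {
        return n_swa;
    }
    const int32_t n = (int32_t) (n_ctx*scale);
    return std::min(n_ctx, std::max(n_swa, n));
}

// Earliest position that can still be rolled back to: n_past - (n_ctx_swa - n_swa).
static int32_t min_rollback_pos(int32_t n_past, int32_t n_ctx_swa, int32_t n_swa) {
    assert(n_past >= 0 && n_ctx_swa >= n_swa);
    return std::max<int32_t>(0, n_past - (n_ctx_swa - n_swa));
}
```

With n_swa = 512 and n_ctx_swa = 1024, a sequence at n_past = 3000 could be rolled back to position 2488 at the earliest; with n_ctx_swa = n_swa no rollback is possible.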

@isaac-mcfadyen
Contributor

isaac-mcfadyen commented Apr 30, 2025

I'd +1 the ability to allow the user to switch.

Some use-cases benefit greatly from the prefix caching (example: on Metal systems with 48GB of RAM/VRAM, where pp is much slower than non-Metal pp and we have plenty of VRAM anyway) so allowing the user to choose would be optimal.

@ExtReMLapin
Contributor

> It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

Is llama.cpp single-user mode the most used case because that's what the user base prefers, or is it like that because the server performance goes down a lot with more than 3 users? (#10860)

We are really thankful for all the work you main contributors do on this project, but please do not fall into this « self-fulfilling prophecy » trap.

@aviallon
Contributor

aviallon commented May 1, 2025

I personally use llama.cpp for server use (with multiple users).
I wonder if we could do something hybrid between iSWA and what is currently done.
I wonder if partial KV cache offload could work, with iSWA on the accelerator, and slower cache on RAM.

@ggerganov ggerganov force-pushed the gg/llama-kv-cache-v6 branch 2 times, most recently from 58115a2 to 7e79a42 Compare May 2, 2025 13:02
Base automatically changed from gg/llama-kv-cache-v6 to master May 2, 2025 14:48
@Dampfinchen

According to the Gemma3 paper, interleaved Sliding Window Attention reduces KV Cache memory usage by 1/5, so it would be much easier to run as right now KV Cache size is much heavier than comparable models.

If the drawback is the absence of prompt caching, then indeed it would make sense to give the option to the user and let them decide on a per use case basis. I think for cases where you use RAG/Vector DB it would prove to be very useful as prompt caching does not work when beginning of the context changes anyway. I would personally agree with Johannes here, faster token generation thanks to SWA would be more useful for me as well since I'm using vector DB.

So for the use cases short prompts/RAG it would make a lot of sense. For simple chat use cases without any RAG, prompt caching would probably make it faster overall compared to SWA and no prompt cache. Overall, I think having the option would be a great addition to llama.cpp.

If it helps, Ollama implemented iSWA support for Gemma 3. Since the project is pretty similar to llama.cpp, perhaps it's useful for getting a rough idea of how to implement it (although Ollama is written in a different programming language): https://github.com/ollama/ollama/blob/2fec73eef6e9482f606f185ebb2ae4f75ad1a37c/model/models/gemma3/model_text.go#L190

I've been thinking, does Ollama support prompt caching? Since Gemma 3 SWA is supported in Ollama, how did they handle it?

@ggerganov ggerganov force-pushed the gg/swa branch 3 times, most recently from 1c69466 to 1e10743 Compare May 9, 2025 12:15
@LostRuins
Collaborator

Some people recently mentioned concerns with this PR - I think caching is quite important for a subset of users who don't have GPUs and run purely CPU only.

They are fine spending the initial minutes or more ingesting large initial prompts, which they then reuse for many future turns - generation speed itself is usable, but the inability to cache would be crippling for such users.

@ggerganov
Member Author

Both the old cache (i.e. more memory usage, but with advanced caching supported) and the new cache (less memory with just last-prefix caching) will be supported. Still figuring out the implementation details - will likely be supported via a flag or a parameter.

@ggerganov
Member Author

Thanks for all the feedback in this discussion. This branch should be ready for testing - I've listed some important use cases that need to be exercised. If something does not work, please let me know - at the moment I've done very little testing, so there could be some issues remaining.

I will soon write up a detailed summary of the changes and the approach taken. And after that will add some comments to the code and open the PR for review.

Regarding the parameter for controlling the size of the SWA cache - for now I haven't introduced it because some initial tests show that Gemma 3 remains coherent even when it "forgets" the local SWA cache - likely thanks to the data in the non-SWA cache. So I am thinking about giving this approach a try because it keeps the UX simple (i.e. we won't have to add a new parameter and handle the use cases where context editing is not possible). If we determine that this breaks some important use cases, we can add the parameter - the libllama change is simple and the behavior would basically fall back to what currently happens on master.

@ExtReMLapin
Contributor

ExtReMLapin commented May 11, 2025

To people who have the bandwidth to test models, FYI Cohere 2 arch includes R7B which is much smaller than Command-A

@andportnoy

> for now I haven't introduced it because some initial tests show that Gemma 3 remains coherent even when it "forgets" the local SWA cache

Does this mean in the current implementation the model isn't executed correctly?

@andportnoy

FWIW, Gemma 3 worked better for me on main with Q8 cache quantization than on this branch + unquantized kv cache.

@ggerganov
Member Author

ggerganov commented May 11, 2025

@andportnoy It's evaluated correctly, as long as you don't use context shift, cache reuse or branching from old states. Do you do any of that in your tests? Can you provide a repro?

Edit: Also don't change 2 things at the same time when testing. Use the same KV cache type, so we can rule out differences that are not relevant to the changes in this branch.

@ggerganov
Member Author

So there is one nasty case that I found today. Let's assume n_swa = 1024, n_seq_max = 1 and n_batch = 2048. Also, defrag logic is disabled. We create the cache with size 1024*1 + 2048 = 3072.

First the SWA cache is empty:

# one dot is 256 cells
............

Next we decode 2048 tokens, placing them at the start of the cache:

xxxxxxxx....

We slide the window to prune the old tokens and keep the last 1024:

....xxxx....

Now we try to process the next batch of 2048 tokens, but it fails because it cannot find a contiguous empty slot of the necessary size to fit the batch.

So it seems like defrag must always be enabled for SWA models. My proposal is to make llama_context auto-enable it if the model is SWA-based and the unified KV cache is used.
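
The failure case above can be reproduced with a toy model of the cache in which each cell is just a used/free flag. This is a sketch for illustration only: `find_contiguous_slot` is a hypothetical, simplified first-fit search and not the actual find_slot implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the first index of a run of n consecutive free cells, or -1 if there is none.
// used[i] == true means that cell i holds a token.
static int32_t find_contiguous_slot(const std::vector<bool> & used, int32_t n) {
    assert(n > 0);
    int32_t run = 0; // length of the current run of free cells
    for (int32_t i = 0; i < (int32_t) used.size(); ++i) {
        run = used[i] ? 0 : run + 1;
        if (run == n) {
            return i - n + 1;
        }
    }
    return -1;
}

// Rebuilds the state from the example: a cache of 3072 cells, 2048 tokens decoded
// at the start, then the oldest 1024 pruned so that only the last 1024 remain.
static std::vector<bool> make_fragmented_swa_cache() {
    std::vector<bool> used(3072, false);
    for (int i = 0; i < 2048; ++i) {
        used[i] = true;  // decode 2048 tokens
    }
    for (int i = 0; i < 1024; ++i) {
        used[i] = false; // slide the window
    }
    return used;
}
```

In this state 2048 cells are free in total, but the longest free run is only 1024 cells, so a search for a slot of 2048 cells returns -1.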

@slaren
Member

slaren commented May 14, 2025

It should probably run a defragmentation if there is not enough contiguous space but there are enough free slots to run the evaluation, regardless of the auto defrag setting.
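
A minimal sketch of this "defragment only when needed" decision, using a toy cache where each cell is a used/free flag. The names `slot_action`, `plan_slot` and `compact` are hypothetical, and the compaction here only rearranges flags, whereas a real defragmentation would also have to move the cached tensor data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

enum class slot_action { place, defrag_then_place, fail };

// Decide how to handle a batch of n tokens given the cell occupancy.
static slot_action plan_slot(const std::vector<bool> & used, int32_t n) {
    assert(n > 0);
    int32_t n_free = 0, run = 0, max_run = 0;
    for (bool u : used) {
        if (u) {
            run = 0;
        } else {
            ++n_free;
            ++run;
            max_run = std::max(max_run, run);
        }
    }
    if (max_run >= n) {
        return slot_action::place;             // a contiguous slot already exists
    }
    if (n_free >= n) {
        return slot_action::defrag_then_place; // enough cells, but fragmented
    }
    return slot_action::fail;                  // not enough free cells at all
}

// Toy compaction: move all used cells to the front of the cache.
static void compact(std::vector<bool> & used) {
    const auto n_used = std::count(used.begin(), used.end(), true);
    std::fill(used.begin(), used.begin() + n_used, true);
    std::fill(used.begin() + n_used, used.end(), false);
}
```

For the 3072-cell example with 1024 used cells in the middle, `plan_slot` asks for a defragmentation before a 2048-token batch, and after `compact` the same batch fits directly.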

@ggerganov
Member Author

Yes, that would work. Btw, there is one more option - retry with an ubatch of half the size. Not sure which is better.

@slaren
Member

slaren commented May 14, 2025

I would guess that a defragmentation would be faster overall, especially since the SWA cache is almost always going to be small.

@ggerganov ggerganov force-pushed the gg/swa branch 2 times, most recently from cefd037 to 02d9a19 Compare May 15, 2025 13:20
@stduhpf
Contributor

stduhpf commented May 15, 2025

I sometimes get <unused32> spam when running gemma3 27B qat with this branch.
[screenshot]
(It also happens without images)
I also got a case of the model referencing something from the context of the previous conversation (with llama-server)

@ngxson
Collaborator

ngxson commented May 15, 2025

@stduhpf which config (n_batch, n_ctx, etc) are you using?

@ggerganov I can re-run another ppl test if you like, lmk when you're ready to do that

@stduhpf
Contributor

stduhpf commented May 15, 2025

> @stduhpf which config (n_batch, n_ctx, etc) are you using?

-ngl 99 -c 8192 -b 1024 -ub 512 --no-mmproj-offload (vulkan backend)

@ggerganov
Member Author

I can't seem to reproduce:

./build-chat/bin/llama-server \
    -hf bartowski/google_gemma-3-27b-it-qat-GGUF \
    --host 0.0.0.0 --port 8013 \
    -ngl 99 -c 8192 -b 1024 -ub 512 --no-mmproj-offload

[screenshot]

Does this config work on master? Are you on the latest commit here (cf33051)?

@stduhpf
Contributor

stduhpf commented May 15, 2025

Yes it works on master and I'm on cf33051

Forgot to mention I had a custom system prompt when I took that screenshot:

You are Emma, a large language model trained by HuggingFace.
Knowledge cutoff: 2024-06
Current date: 2025-05-15

Over the course of conversation, adapt to the user’s tone and preferences. Try to match the user’s vibe, tone, and generally how they are speaking. You want the conversation to feel natural. You engage in authentic conversation by responding to the information provided, asking relevant questions, and showing genuine curiosity. If natural, use information you know about the user to personalize your responses and ask a follow up question.

I also have trouble reproducing it right now, I will get back to you if I can get it to misbehave again

@stduhpf
Contributor

stduhpf commented May 15, 2025

> I also got a case of the model referencing something from the context of the previous conversation (with llama-server)

I got an example for reproducing this other issue though, it seems consistent with this setup:

| first conv | new conv |
| --- | --- |
| [screenshot] | [screenshot] |
| conversation_conv-1747332526110.json | conversation_conv-1747332608477.json |

(This is a complete hallucination, there are no actual images in the second prompt, just broken links to images that don't even look anything like lenna)

@ggerganov
Member Author

Hm, I'm not sure what's wrong. I can't find a repro on the Mac. Let me know if you find some more hints.
