
llama : add Xiaomi Mimo (with proper MTP - multi token predict) #13236


Draft: ngxson wants to merge 6 commits into master

Conversation

ngxson (Collaborator) commented on May 1, 2025

This is a WIP.

Given N input tokens, I can now generate either token N+1 or token N+2, but not yet both tokens at the same time.

The way it works is:

  • The model has 36+1 layers: the first 36 are normal layers and the 1 extra is the MTP layer
  • To generate token N+1, we pass the N input tokens through the 36 layers, then pass the output to lm_head
  • To generate token N+2, we take the output of the 36 layers plus the input embedding, pass it through the MTP layer, and then finally go to lm_head (see the sketch below)
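
A minimal sketch of those two paths, for illustration only; forward_main_layers, forward_mtp_layer and lm_head are hypothetical placeholders, not existing llama.cpp functions:

```cpp
#include <vector>

// Hypothetical stand-ins for illustration only - these are not llama.cpp APIs.
using tensor = std::vector<float>;  // think of it as a [n_tokens, n_embd] activation

tensor forward_main_layers(const tensor & tok_emb);                   // the 36 "normal" layers
tensor forward_mtp_layer  (const tensor & h, const tensor & tok_emb); // the 1 extra MTP layer
tensor lm_head            (const tensor & h);                         // shared output head

// Path A: logits for token N+1
tensor predict_n_plus_1(const tensor & tok_emb) {
    const tensor h = forward_main_layers(tok_emb);
    return lm_head(h);
}

// Path B: logits for token N+2 - the MTP layer sees both the main-model output h
// and the raw input embeddings (the embeddings bypass the 36 main layers)
tensor predict_n_plus_2(const tensor & tok_emb) {
    const tensor h = forward_main_layers(tok_emb);
    return lm_head(forward_mtp_layer(h, tok_emb));
}
```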

If we had an API for multiple output heads, this might be easier. WDYT @ggerganov?


Illustration from their technical report:

[image]

@ngxson linked an issue on May 1, 2025 that may be closed by this pull request
@github-actions bot added the python (python script changes) label on May 1, 2025

ngxson (Collaborator, Author) commented on May 1, 2025

Hmm, ok, I could be missing something here: I'm reusing the same set of input tokens for both the N+1 and N+2 steps, while the N+2 step needs the sampled N+1 token as input.

Not sure yet how we can do this

[image]

sorasoras commented:

Would this implementation work on DeepSeek V3?

ngxson (Collaborator, Author) commented on May 1, 2025

@sorasoras Judging from this illustration, yes it's the same:

https://dataturbo.medium.com/deepseek-technical-analysis-3-multi-token-prediction-f8f3ea7eaf9c

[image: DeepSeek MTP illustration]

ggerganov (Member) commented:

Maybe this can be implemented by loading the MTP layers as a separate draft model and reusing the speculative decoding functionality. AFAICT, the predicted tokens from the MTP blocks are technically draft tokens and they have to be accepted by the main decoder.

Btw, based on these 2 diagrams, if they are correct, there is a small difference between DS and Mimo - Mimo uses the same h from the main decoder for all MTP blocks, while DS updates the h after each MTP block.

What is not clear to me is how big N is. In both diagrams, we have N = 4. Is this a parameter?
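
A minimal sketch of what that accept/reject flow could look like with a single draft token, assuming greedy sampling; mtp_draft and main_verify below are hypothetical helpers, not existing llama.cpp or speculative-decoding functions:

```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Hypothetical helpers for illustration only (not existing llama.cpp APIs):
//   mtp_draft(ctx)          - forward pass of the MTP layer, proposes the token after ctx
//   main_verify(ctx, draft) - one batched main-model pass over ctx + draft, returning
//                             greedy tokens for the last two positions
struct verify_result {
    llama_token after_ctx;    // main-model prediction right after ctx
    llama_token after_draft;  // main-model prediction right after the draft token
};
llama_token   mtp_draft  (const std::vector<llama_token> & ctx);
verify_result main_verify(const std::vector<llama_token> & ctx, llama_token draft);

// The MTP output is only a draft and must be accepted by the main decoder,
// in the same spirit as the existing speculative decoding loop.
void generate(std::vector<llama_token> & ctx, int n_predict) {
    for (int n = 0; n < n_predict; ) {
        const llama_token   draft = mtp_draft(ctx);
        const verify_result res   = main_verify(ctx, draft);

        ctx.push_back(res.after_ctx); n++;           // token N+1 always comes from the main model
        if (n < n_predict && draft == res.after_ctx) {
            ctx.push_back(res.after_draft); n++;     // draft accepted: token N+2 from the same main pass
        }
    }
}
```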

ngxson (Collaborator, Author) commented on May 1, 2025

Ok, thanks for the clue. It sounds like what you suggest is exactly what they did in the vLLM implementation.

I think N is not important, since the MTP layer has its own KV cache. The number of tokens N just needs to correspond to the number of embedding vectors in h. In other words, this is a way to implement a residual connection, basically letting the input token embeddings bypass the whole 36 "normal" layers.
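
For illustration, assuming a DeepSeek-V3-style MTP input (normalize both streams, concatenate, project back to n_embd, as the diagrams above suggest), this part of the graph could be built with existing ggml ops roughly as below; the tensor names are hypothetical, not taken from the actual model conversion:

```cpp
#include "ggml.h"

// Sketch only: how the MTP input might be assembled from the main-model output and the
// input token embeddings that bypass the 36 "normal" layers. Names are hypothetical.
static struct ggml_tensor * build_mtp_input(
        struct ggml_context * ctx,
        struct ggml_tensor  * h_main,    // [n_embd, n_tokens] output of the 36 main layers
        struct ggml_tensor  * tok_emb,   // [n_embd, n_tokens] input token embeddings
        struct ggml_tensor  * mtp_proj,  // [2*n_embd, n_embd] projection back to n_embd
        float eps) {
    struct ggml_tensor * a = ggml_rms_norm(ctx, h_main,  eps);
    struct ggml_tensor * b = ggml_rms_norm(ctx, tok_emb, eps);
    struct ggml_tensor * c = ggml_concat (ctx, a, b, 0);   // [2*n_embd, n_tokens]
    return ggml_mul_mat(ctx, mtp_proj, c);                 // [n_embd, n_tokens]
}
```

The result would then go through the single MTP transformer layer and the shared lm_head, as in the diagrams.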

ngxson (Collaborator, Author) commented on May 4, 2025

I've been thinking more about the implementation today. I have 2 main ideas in mind, but both have downsides:

The first idea is to have an API like llama_model_get_mtp(struct llama_model * model, int32_t i) which returns a shallow copy of the llama_model object, meaning only the pointers to tensors are copied, not the actual data. The copied llama_model object would have a different layer index (or n_layer) to specify that it is a "child" model.

The downsides are:

  • Because this is a shallow copy, llama_model_free would free the tensors of both the main and the child model
  • More importantly, this means we need 2 different llama_context objects, which in turn means we need yet another API to pass the hidden embeddings from one context to the other

My second idea is to have something equivalent to llama_set_causal_attn, meaning there would be an attribute in llama_cparams to specify whether the next llama_decode should run the main layers or the MTP layer. However, managing the KV cache in this case is a bit tricky, and I don't yet have a good idea how to handle it.
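
For reference, a rough sketch of what the first idea's API surface could look like; every declaration below is hypothetical, including the extra call for passing hidden embeddings between contexts mentioned in the second downside:

```cpp
// Hypothetical additions to llama.h - nothing here exists today, sketch of idea 1 only.

// Return a shallow copy of the model exposing only the i-th MTP layer
// (tensor pointers are shared with the parent model; no weight data is duplicated).
LLAMA_API struct llama_model * llama_model_get_mtp(struct llama_model * model, int32_t i);

// Because the MTP layer would run in its own llama_context, an extra API would be
// needed to feed the hidden embeddings produced by ctx_main into ctx_mtp.
LLAMA_API int32_t llama_set_embd_from_ctx(struct llama_context * ctx_mtp,
                                          struct llama_context * ctx_main);
```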

Labels: python (python script changes)
Successfully merging this pull request may close these issues:
  • Feature Request: XiaomiMiMo/MiMo-7B-RL
3 participants