llama : (mrope) allow using normal 1D position for text token #13138
Conversation
src/llama-graph.cpp (Outdated)

@@ -1031,11 +1042,12 @@ ggml_tensor * llm_graph_context::build_inp_pos() const {
 }

 ggml_tensor * llm_graph_context::build_inp_attn_scale() const {
-    auto inp = std::make_unique<llm_graph_input_attn_temp>(n_pos_per_token(), hparams.n_attn_temp_floor_scale, hparams.f_attn_temp_scale);
+    auto inp = std::make_unique<llm_graph_input_attn_temp>(n_pos_per_embd(), hparams.n_attn_temp_floor_scale, hparams.f_attn_temp_scale);
@ggerganov Because `build_inp_attn_scale` is currently used exclusively by Llama 4, do you think we should get rid of `n_pos_per_embd` here and replace it with a `GGML_ASSERT(n_pos_per_embd() == 1)`?
The main motivation is to make this code look less complicated, as there is a ~0% chance that a Qwen model will ever use this.
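For illustration, a minimal standalone sketch of that suggestion (not the actual llama.cpp source; the helper functions and values below are hypothetical stand-ins): the positions-per-embedding count is asserted to be 1 instead of being passed into the input object.

```cpp
// Standalone sketch of the proposed simplification. The names mirror the
// discussion above, but the bodies are placeholders, not llama.cpp code.
#include <cassert>

// Hypothetical stand-in for n_pos_per_embd(): 4 for M-RoPE models, 1 otherwise.
static int n_pos_per_embd(bool is_mrope) { return is_mrope ? 4 : 1; }

static void build_inp_attn_scale(bool is_mrope) {
    // Only Llama 4 builds this input today and it uses plain 1D positions,
    // so the M-RoPE case is not expected to reach this point.
    assert(n_pos_per_embd(is_mrope) == 1 && "attn temp scale expects 1 position per embedding");
    // ... construct the attention-temperature input from the two hparams ...
}

int main() {
    build_inp_attn_scale(/*is_mrope=*/false); // OK
    // build_inp_attn_scale(/*is_mrope=*/true); // would trip the assert
    return 0;
}
```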
Yes, we can do that.
On second thought, `build_inp_attn_scale` should work well even in the case of N positions per token. That's because the scale is applied per embedding, and the number of embeddings is independent of the number of positions per token.
In any case, I removed `n_pos_per_embd` in 9cd16a3; merging this PR once the CI is green.
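A small standalone example of that reasoning (the sizes, hparam values, and scaling formula below are illustrative placeholders, not lifted from llama.cpp): the position buffer grows with the number of positions per token, while the temperature scale keeps exactly one entry per token, so its size does not depend on M-RoPE.

```cpp
// Toy illustration: positions scale with n_pos_per_embd, the attention
// temperature scale does not.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int   n_tokens       = 3;
    const int   n_pos_per_embd = 4;       // e.g. an M-RoPE model
    const float floor_scale    = 8192.0f; // placeholder hparam values
    const float attn_scale     = 0.1f;

    // Position buffer: n_tokens * n_pos_per_embd entries.
    std::vector<int> pos(n_tokens * n_pos_per_embd, 0);

    // Temperature scale: one entry per token (per embedding), regardless of n_pos_per_embd.
    std::vector<float> scale(n_tokens);
    for (int i = 0; i < n_tokens; ++i) {
        // Llama 4-style scaling as a placeholder formula, driven by a 1D token position.
        scale[i] = std::log(std::floor((i + 1.0f) / floor_scale) + 1.0f) * attn_scale + 1.0f;
    }

    std::printf("pos entries: %zu, scale entries: %zu\n", pos.size(), scale.size());
    return 0;
}
```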
For M-RoPE, we want to use a normal 1D position for text tokens.
This is done to simplify the use case of `llama_decode()` with text tokens, which is needed for adding Qwen2VL to `libmtmd` and to `server.cpp`.
This should also align with #11875, because in the future we want text positions to be tracked internally by `libllama`.
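As a rough illustration of what this enables for callers such as `libmtmd` or `server.cpp` (a conceptual sketch, not the llama.cpp API; it assumes that for text tokens the M-RoPE position components are just the 1D position repeated, and the helper below is hypothetical):

```cpp
// Conceptual sketch: a caller keeps ordinary 1D positions for text tokens,
// and the M-RoPE layout can be derived from them internally.
#include <cstdio>
#include <vector>

// Hypothetical helper: expand 1D text positions into n_pos_per_embd sections.
static std::vector<int> expand_text_pos_1d(const std::vector<int> & pos_1d, int n_pos_per_embd) {
    std::vector<int> out;
    out.reserve(pos_1d.size() * n_pos_per_embd);
    for (int d = 0; d < n_pos_per_embd; ++d) {
        // For text tokens the components are identical, so each section is a copy.
        out.insert(out.end(), pos_1d.begin(), pos_1d.end());
    }
    return out;
}

int main() {
    // What a text-only caller tracks: plain 1D positions.
    std::vector<int> pos_1d = {0, 1, 2, 3};

    // What an M-RoPE graph ultimately consumes (here: 4 position sections).
    std::vector<int> pos_mrope = expand_text_pos_1d(pos_1d, 4);

    std::printf("1D positions: %zu, expanded M-RoPE positions: %zu\n",
                pos_1d.size(), pos_mrope.size());
    return 0;
}
```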