Private batch API (for AI review) #14
Conversation
Walkthrough
This pull request refactors the batch processing system throughout the project: the legacy API built around `llama_batch` is replaced by the extended `llama_batch_ext` API.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Caller
    participant BatchAPI
    participant Context
    Caller->>BatchAPI: Initialize batch (llama_batch_ext_init_from_text)
    BatchAPI-->>Caller: Return new batch pointer
    Caller->>BatchAPI: Add tokens (llama_batch_ext_add_text)
    Caller->>BatchAPI: Clear batch if needed (llama_batch_ext_clear)
    Caller->>Context: Decode tokens (llama_decode_ext(ctx, batch))
    Context-->>Caller: Return decoded output
```
Actionable comments posted: 13
🔭 Outside diff range comments (1)
src/llama-batch.cpp (1)
356-386
: ⚠️ Potential issue: Potential out-of-bounds writes when `embd != 0` but `n_tokens_alloc = 0`.
If `llama_batch_ext_init_from_embd` is called with `n_tokens_alloc = 0` and `n_embd > 0`, the code copies data (`memcpy(...)`) into a buffer of zero length. This is a critical bug.
```diff
-static struct llama_batch_ext * llama_batch_ext_init_impl(int32_t n_tokens_alloc, int32_t embd, int32_t n_seq_max) {
-    ...
-    if (embd) {
-        batch->embd = (float *) malloc(sizeof(float) * n_tokens_alloc * embd);
-    }
-    ...
-    memcpy(batch->embd, embd, n_embd * sizeof(float)); // potential OOB if n_tokens_alloc = 0
+    // Fix by using `n_tokens_alloc = n_embd` if we treat each float as a separate token,
+    // or re-check the logic to handle embeddings properly
 }
```
🧹 Nitpick comments (46)
common/common.cpp (2)
1580-1584
: Deprecated function: common_batch_clear.
Since this is now marked DEPRECATED, verify if it is still referenced elsewhere. Removing dead code could reduce maintenance overhead.
1585-1604
: Deprecated function: common_batch_add.
Similar to `common_batch_clear`, confirm whether any call sites remain. If unused, consider removing to keep the codebase clean.
common/speculative.cpp (1)
242-242
: Repeated clearing.
Be mindful of performance overhead from frequent batch clears. If performance is a concern, consider re-adding tokens incrementally in a single batch.
examples/imatrix/imatrix.cpp (1)
516-517
: Parameter check for llama_batch_ext_add_text.
You're using a single sequence ID (`seq_id = 0`) and passing `true` for the last parameter each iteration. Verify whether this semantics (indicating end-of-sequence or final token) is intended at this stage of the code.
examples/gritlm/gritlm.cpp (4)
16-16
: Check return from llama_batch_ext_init.
Ensure you handle possible allocation failures or consider adding a quick null check.
44-45
: Ensure correct usage of sequence ID and final token flag.
Using the last parameter `j >= n_inst` sets `is_end_of_token` conditionally. Validate that this logic aligns with your intended design (i.e., ignoring instruction tokens in embeddings).
121-122
: Sequence ID usage.
Similarly, passing a zeroed `seq_id` each time is valid if you have a single sequence context. If you plan to support multiple sequences, consider a dynamic approach.
129-129
: Sampling from last token.
`llama_batch_ext_get_n_tokens(bat) - 1` is correct for indexing the last token. Ensure there are no off-by-one scenarios when the batch is empty.
examples/llava/gemma3-cli.cpp (3)
77-77
: Initialize batch safely.
You might want to throw/log an error if `batch` allocation fails, to avoid null pointer usage.
184-185
: Sequence ID usage for next token.
Passing `ctx.n_past++` ensures the correct offset. Using the last parameter as `true` for the final token might be by design. Double-check whether you want the model to consider each token final.
186-186
: Decoding each new token.
Immediately decoding the newly added token is a valid approach for token-by-token generation. Confirm it matches your performance and concurrency requirements if you have advanced usage.
examples/run/run.cpp (1)
911-911
: Validate the batch pointer
Before calling `llama_batch_ext_get_n_tokens()`, consider checking that `batch.get()` is valid to avoid unexpected null pointer issues.
common/common.h (1)
572-576
: Deprecated functions
Marking `common_batch_clear` and `common_batch_add` as deprecated is fine. If they're fully replaced by the new extended batch mechanism, consider removing them in a future cleanup.
examples/parallel/parallel.cpp (1)
278-278
: Index handling
`client.i_batch = llama_batch_ext_get_n_tokens(batch) - 1` is correct for referencing the last token's index, but ensure no negative results if the batch is empty.
examples/lookahead/lookahead.cpp (3)
120-120
: Use a smart pointer for consistency.
You already use `llama_batch_ext_ptr` in lines 95 and 96. For consistency, consider returning a `llama_batch_ext_ptr` here as well, which will simplify memory management.
```diff
-llama_batch_ext * batch = llama_batch_ext_init(params.n_ctx, W + G + 1);
+llama_batch_ext_ptr batch(llama_batch_ext_init(params.n_ctx, W + G + 1));
```
250-251
: Consider improving readability by avoiding line breaks in function parameters.
The parameter list is split across lines. Keeping the function call on one line, or using a local struct/object, can enhance clarity.
483-483
: Consider unified resource management.
Here, you call `llama_batch_ext_free(batch)` directly, whereas lines 95 and 96 rely on `llama_batch_ext_ptr`. For uniformity and safer resource handling, consider adopting the same pattern.
examples/speculative/speculative.cpp (1)
48-648
: Disabling a large block of logic with `#if 0`.
You have effectively deactivated a significant portion of the logic by wrapping lines 49–648 in `#if 0` / `#endif`. This is a convenient way to temporarily disable code paths, but ensure this block of code is either intentionally deprecated or will be re-enabled after the experiments/tests. If it's no longer needed, consider removing it to keep the codebase clean.
examples/perplexity/perplexity.cpp (3)
904-910
: Randomization or partial evaluation logic changes.
If these lines handle random selection or partial usage of tasks, confirm that the distribution logic remains correct, especially for concurrency.
Consider extracting the sampling logic into a helper function for clearer test coverage and potential reusability.
999-1011
: Parsing CSV or user input for tasks.
Be sure input length checks are robust. If quoting logic changed, watch for edge cases where quotes or commas appear unescaped.
Include validation steps or fallback for malformed lines instead of just logging an error.
Also applies to: 1014-1014
1264-1271
: Loading or deserializing tasks.
Changes likely revolve around ensuring data from the new format is read consistently. Make sure the new logic handles large tasks or malformed input gracefully.
Consider adding an extra check for string lengths and re-check stream errors to reduce potential parsing failures.
Also applies to: 1274-1287, 1290-1291, 1293-1294
examples/passkey/passkey.cpp (1)
147-148
: Consider varying seq_id.
Currently, `seq_id` is always 0. If you anticipate handling multiple sequences in the future, consider incrementing or specifying distinct IDs.
src/llama-batch.h (3)
8-21
: Doc comments are helpful.
These clarifications are valuable. Consider addressing the TODO to rename `logits` to `output` now to avoid confusion.
76-76
: Pointer ownership.
Storing `const llama_batch_ext *` here may obscure ownership. If shared ownership is intended, consider a reference-counted or RAII interface to prevent accidental misuse.
104-104
: Potential naming conflict.
Within `llama_batch_allocr`, using the field name `batch` can be confusing given other usage. Rename to something like `ext_batch` for clarity.
examples/batched/batched.cpp (2)
116-116
: Assertion clarity.
`GGML_ASSERT` may halt execution abruptly. Consider providing a user-friendly error message for better diagnostics.
160-160
: Naming.
`n_cur` might be interpreted differently. A more self-explanatory name could improve readability.
examples/simple/simple.cpp (1)
183-183
: Check usage sequence of `llama_batch_ext_clear`.
Clearing the batch right before repopulating or freeing it might be unnecessary. Evaluate whether the call can be removed for cleaner, more efficient code.
examples/retrieval/retrieval.cpp (4)
84-86
: Improve naming consistency
`batch_decode` handles both encoder-only and decoder-only paths, yet the name suggests decoder logic. Consider renaming or commenting to reduce confusion for future maintainers.
129-129
: Document the single-sequence initialization
`common_batch(n_batch, 1)` implies exactly one sequence. If you're planning multi-sequence usage, expand or comment on how to adapt the second parameter.
150-150
: Log or debug large input handling
This capacity check helps avoid overfilling the batch. Logging when the threshold is reached may aid in diagnosing errors.
274-274
: Encourage helper function reuse
Code that creates `common_batch(n_batch, 1)` reappears across the file. Extracting a small helper can improve maintainability and consistency.
src/llama-context.cpp (2)
2724-2729
: Deprecated `llama_encode`.
Maintaining the old method enhances backward compatibility. If deprecation is planned, ensure existing references migrate to `llama_encode_ext` before removal.
2731-2736
: Retaining old `llama_decode`.
Similar to `llama_encode`, this preserves a legacy function. Confirm that new code paths adopt `llama_decode_ext`, and plan for eventual phase-out.
examples/save-load-state/save-load-state.cpp (3)
135-137
: Repeated code pattern.
Lines 135–137 duplicate the logic used in lines 82–84 (and later 217–219) for clearing and adding a token before decoding. Consider refactoring these steps into a small helper function to avoid duplication.
217-219
: Repeated code pattern.
Again, identical logic to lines 82–84, 135–137. Factor out common logic for maintainability.
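A sketch of what such a helper could look like, built from the `llama_batch_ext` calls named in this review (the exact `llama_batch_ext_add_text` parameter order is an assumption); error handling stays with the caller:

```cpp
#include "llama.h"

// Sketch: clear the batch, append one token, and decode it. Returns the
// llama_decode_ext status so callers can free resources on failure.
static int decode_one_token(llama_context * ctx, llama_batch_ext * batch,
                            llama_token token, llama_pos pos, llama_seq_id seq_id) {
    llama_batch_ext_clear(batch);
    llama_batch_ext_add_text(batch, token, pos, &seq_id, 1, /*output=*/true);
    return llama_decode_ext(ctx, batch);
}
```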
221-223
: Proper error check but repeated code.
As with the earlier blocks, consider DRY ("Don't Repeat Yourself") to keep the code maintainable.
src/llama-batch.cpp (3)
392-406
: Consider clarifying the “one float per token” approach.
`llama_batch_ext_init_from_embd` duplicates logic to treat each float index as a position. Make sure this design is correct for your embeddings scenario. If you intend multi-float embeddings per token, you need a different dimensioning strategy.
436-453
: Potential performance bottleneck.
`llama_batch_ext_set_output` does a nested loop to find the matching `seq_id`. If usage grows large, consider a more efficient lookup structure (e.g., hashing).
464-466
: `llama_batch_ext_clear()` resets only `n_tokens`.
While intentional, remain aware that leftover data (token arrays, embeddings, etc.) is not zeroed out. In contexts requiring data scrubbing (e.g., security or privacy), you might also want to wipe memory.
examples/batched-bench/batched-bench.cpp (2)
122-122
: Looping over “pos” with the same token.
Similarly, you pass `token=0`, `pos=i`, `seq_ids=&j`. If multiple sequences or advanced usage arises, confirm this aligns with your text vs. token logic.
150-150
: Adding tokens for each sequence.
Here, `token=0` plus `pos = pp + i` suggests a minimal placeholder approach. Confirm that rolling `pos` is correct if you expect to track larger conversation history.
include/llama.h (4)
921-927
: Validate Parameters
Consider checking `n_tokens` and `n_seq_max` to guard against negative or zero values, which may cause allocation issues.
928-939
: Initialization From Text
Usage and documentation around `output_last` seem clear. However, clarifying edge cases (e.g., `n_tokens <= 0`) would improve robustness.
950-952
: Return Type Consideration
`llama_batch_ext_get_n_tokens` returns `int32_t`. If a large batch is expected, consider using a wider type like `size_t`.
953-965
: Error Code Enumeration
`llama_batch_ext_add_text` returns negative values as error codes. Defining named constants or an enum for these codes would aid clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (38)
common/common.cpp (3 hunks)
common/common.h (1 hunks)
common/speculative.cpp (5 hunks)
examples/batched-bench/batched-bench.cpp (5 hunks)
examples/batched/batched.cpp (6 hunks)
examples/cvector-generator/cvector-generator.cpp (1 hunks)
examples/embedding/embedding.cpp (4 hunks)
examples/eval-callback/eval-callback.cpp (1 hunks)
examples/gritlm/gritlm.cpp (6 hunks)
examples/imatrix/imatrix.cpp (3 hunks)
examples/infill/infill.cpp (1 hunks)
examples/llama-bench/llama-bench.cpp (2 hunks)
examples/llava/gemma3-cli.cpp (6 hunks)
examples/llava/llava-cli.cpp (1 hunks)
examples/llava/llava.cpp (2 hunks)
examples/llava/minicpmv-cli.cpp (1 hunks)
examples/llava/qwen2vl-cli.cpp (2 hunks)
examples/lookahead/lookahead.cpp (6 hunks)
examples/lookup/lookup.cpp (5 hunks)
examples/main/main.cpp (2 hunks)
examples/parallel/parallel.cpp (7 hunks)
examples/passkey/passkey.cpp (6 hunks)
examples/perplexity/perplexity.cpp (16 hunks)
examples/retrieval/retrieval.cpp (5 hunks)
examples/run/run.cpp (3 hunks)
examples/save-load-state/save-load-state.cpp (5 hunks)
examples/server/server.cpp (18 hunks)
examples/simple-chat/simple-chat.cpp (2 hunks)
examples/simple/simple.cpp (3 hunks)
examples/speculative-simple/speculative-simple.cpp (5 hunks)
examples/speculative/speculative.cpp (6 hunks)
examples/tts/tts.cpp (6 hunks)
include/llama-cpp.h (1 hunks)
include/llama.h (5 hunks)
src/llama-batch.cpp (4 hunks)
src/llama-batch.h (4 hunks)
src/llama-context.cpp (4 hunks)
src/llama-context.h (1 hunks)
🔥 Files not summarized due to errors (4)
- examples/llava/llava-cli.cpp: Error: Disallowed special token found: <|im_end|>
- examples/llava/minicpmv-cli.cpp: Error: Disallowed special token found: <|im_start|>
- examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
- examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: android-build
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: server-windows
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-rpc
🔇 Additional comments (206)
examples/cvector-generator/cvector-generator.cpp (1)
346-347
: Refactoring to use extended batch API improves resource management.
The code has been updated to use the new `llama_batch_ext_ptr` with RAII semantics that automatically manages memory. This replaces the legacy batch API and follows the broader transition in the codebase.
examples/eval-callback/eval-callback.cpp (1)
137-138
: Good transition to extended batch API.
The code correctly implements the new extended batch API, which offers improved resource management through RAII. The batch is properly initialized with token data and automatically cleaned up when it goes out of scope.
include/llama-cpp.h (2)
27-29
: Well-designed deleter for batch extension resources.
The implementation follows the established pattern in the codebase for resource management. The deleter ensures proper cleanup of `llama_batch_ext` resources.
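For reference, the wrapper pattern being praised here is essentially a functor deleter plus a `unique_ptr` alias; a sketch mirroring what include/llama-cpp.h adds in this PR:

```cpp
#include <memory>
#include "llama.h"

// A llama_batch_ext is freed automatically when the pointer leaves scope.
struct llama_batch_ext_deleter {
    void operator()(llama_batch_ext * batch) { llama_batch_ext_free(batch); }
};
typedef std::unique_ptr<llama_batch_ext, llama_batch_ext_deleter> llama_batch_ext_ptr;
```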
35-35
: Good addition of smart pointer typedef for batch extension.
This typedef enables automatic resource management for batch extensions, consistent with the existing pattern for models, contexts, and other resources in the codebase.
examples/llava/llava.cpp (2)
examples/llava/llava.cpp (2)
5-5
: Appropriate header inclusion.
Adding the llama-cpp.h header is necessary to access the newly defined `llama_batch_ext_ptr` type.
451-452
: Simplified implementation using extended batch API.
The code replaces the previous `llava_embd_batch` structure with a more streamlined approach using `llama_batch_ext_ptr`. This change reduces code complexity while maintaining the same functionality, and ensures proper resource cleanup via RAII.
examples/main/main.cpp (2)
551-553
: Appropriate batch API migration.
The code properly transitions from the deprecated `llama_encode` function to the new extended batch API by initializing a `llama_batch_ext_ptr` with input buffer data and calling `llama_decode_ext`.
672-675
: Clean implementation of the new batch API.
The code correctly implements the new batch handling mechanism, creating a batch from text tokens and setting the output flag before passing it to `llama_decode_ext`. This approach is more flexible and consistent with the refactoring pattern used throughout the codebase.
src/llama-context.h (2)
84-86
: Good practice in API deprecation.
Properly marking the old batch methods as deprecated while keeping them functional ensures backward compatibility during the transition period.
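The usual shape of such a soft deprecation, assuming the `DEPRECATED(decl, hint)` helper that llama.h exposes (a sketch, not the PR's exact declarations):

```cpp
#include "llama.h"

// Old entry point keeps compiling but warns; new entry point sits beside it.
DEPRECATED(LLAMA_API int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch),
           "use llama_decode_ext instead");

LLAMA_API int32_t llama_decode_ext(struct llama_context * ctx, struct llama_batch_ext * batch);
```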
88-89
: Clean API extension.
The new extended batch methods follow consistent naming patterns and maintain the same interface style as the deprecated methods, making it intuitive for users to migrate to the new API.
examples/llama-bench/llama-bench.cpp (2)
examples/llama-bench/llama-bench.cpp (2)
1447-1448
: Clean implementation of batch API in test_prompt.
The function correctly adopts the extended batch API by using `llama_batch_ext_ptr` with `llama_batch_ext_init_from_text` and `llama_decode_ext`, maintaining the same functionality with the new interface.
1465-1466
: Proper API migration in test_gen.
Similar to the changes in test_prompt, this code properly transitions to the new batch API, making the implementation consistent across the codebase.
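For readers skimming the diff, the lifecycle these call sites now share looks roughly like this; a sketch assembled from the functions named in this review, with the `llama_batch_ext_init_from_text` parameter order assumed from the calls quoted here and error handling elided:

```cpp
#include "llama.h"

static void generate_sketch(llama_context * ctx, llama_token * prompt, int32_t n_prompt) {
    llama_seq_id seq_id = 0;
    // Evaluate the whole prompt in one extended batch.
    llama_batch_ext * batch = llama_batch_ext_init_from_text(prompt, n_prompt,
                                                             /*pos0=*/0, /*seq_id=*/0);
    llama_batch_ext_set_output_last(batch);
    llama_decode_ext(ctx, batch);

    // Then reuse the same allocation token by token.
    llama_pos n_past = n_prompt;
    llama_token next = 0; // placeholder: a real loop would sample this
    llama_batch_ext_clear(batch);
    llama_batch_ext_add_text(batch, next, n_past++, &seq_id, 1, /*output=*/true);
    llama_decode_ext(ctx, batch);

    llama_batch_ext_free(batch);
}
```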
examples/lookup/lookup.cpp (6)
94-97
: Effective initialization with the extended batch API.
The code properly replaces the previous encoding approach with the new extended batch API. Using smart pointers (`llama_batch_ext_ptr`) for both batch objects ensures proper memory management.
113-113
: Appropriate type change for batch_tgt.
The variable type has been correctly updated from `llama_batch` to `llama_batch_ext*` to align with the new API, which is essential for all subsequent operations on this batch.
199-202
: Proper batch management with the extended API.
The code correctly uses `llama_batch_ext_clear` and `llama_batch_ext_add_text` to manage the batch, maintaining the same functionality with the new interface.
211-211
: Consistent token addition with extended API.
The token addition logic is properly updated to use `llama_batch_ext_add_text`, maintaining consistency with the new API throughout the function.
217-217
: Updated decode function call.
Properly changed to use `llama_decode_ext` to match the new batch type, ensuring API consistency.
249-249
: Proper resource cleanup.
Changed to use `llama_batch_ext_free` for proper deallocation of the new batch type, preventing memory leaks.
examples/infill/infill.cpp (1)
356-357
: Confirm no out-of-bounds for embd.
This approach of creating the extended batch with a segment of the embd array looks correct. Ensure `n_eval` never exceeds the array bounds, and confirm that the offset `i` and chunk size `n_eval` are carefully validated to prevent out-of-range memory reads.
examples/simple-chat/simple-chat.cpp (5)
111-114
: Check for null batch returns.
When initializing the extended batch with `llama_batch_ext_init_from_text()`, ensure you gracefully handle potential allocation or initialization failures if the returned pointer is null or invalid.
120-120
: Looks good for context size checks.
The updated logic using `llama_batch_ext_get_n_tokens(batch)` is valid and should correctly limit the total context usage.
126-126
: Abort path is consistent.
Aborting on decode failure is acceptable here. Just confirm no resources remain unfreed if `llama_decode_ext` returns an error.
150-153
: Batch clearing and position increment.
Clearing the batch before adding the new token is good practice. Updating `n_past` aligns with how the new token is appended to the sequence.
156-156
: Validate usage after free.
After calling `llama_batch_ext_free(batch)`, ensure there's no further use or accidental reference to `batch`. Consider setting it to `nullptr` if used in a broader scope.
common/speculative.cpp (10)
17-17
: Transition to extended batch pointer.
Replacing a raw `llama_batch` with `llama_batch_ext_ptr` is aligned with the new approach. Ensure you handle memory ownership correctly to avoid leaks or double frees.
26-26
: Initialization with llama_batch_ext_init.
Using `llama_n_batch(ctx_dft)` is logical for the draft context capacity. Keep an eye on scenarios where `ctx_dft` might have zero or minimal capacity.
153-153
: Single sequence ID usage.
If the code later needs multi-sequence handling, consider making `seq_id` adjustable. For now, zero is acceptable for a single-sequence scenario.
209-209
: Clearing the batch before prompt additions.
This ensures the new calls to `llama_batch_ext_add_text()` don't mix with leftover tokens, which is good practice.
213-213
: Adding tokens to the batch.
Correctly sets `logits=false` for the prompt tokens. This prevents them from interfering with sampling checks.
220-223
: Conditional decode.
Skipping decode if there are no tokens ensures efficiency. The error path logs and returns, so resource usage is well-handled.
229-230
: Clear then add next token.
This pattern is consistent with maintaining a clean slate for each newly appended token.
236-236
: Check error flows.
Same approach as above: on failure, the code logs and stops. Ensure future expansions handle partial success cases if needed.
269-269
: Maintaining incremental token positions.
Using `n_past + i + 1` appears correct. Validate this indexing if you later change the token insertion logic.
272-272
: Final decode call.
Consistent with the new extended-batch pattern. Ensure concurrency use cases are tested if multiple threads might decode simultaneously.
examples/imatrix/imatrix.cpp (4)
514-514
: Clear the batch carefully.
Calling `llama_batch_ext_clear(batch)` inside this loop is valid, but ensure that no tokens are needed from previous iterations. If tokens from earlier steps must persist, this call could unintentionally discard necessary data.
520-520
: Check return value of llama_decode_ext.
Using `if (llama_decode_ext(ctx, batch))` is good. If an error is returned, you handle it. Confirm that you log or propagate the error upstream if it needs attention.
522-522
: Correctly freeing the batch on error.
Freeing `batch` on the decode error path mitigates memory leaks. This is a good use of cleanup logic.
535-535
: Memory cleanup is sound.
You free the `batch` after use. Confirm there's no use-after-free if any subsequent function tries to reference it.
examples/gritlm/gritlm.cpp (6)
19-19
: Batch clearing logic.
Clearing the batch before token addition is typically correct, but confirm that residual tokens or states aren’t required from a previous iteration.
54-54
: Check llama_decode_ext.
Decoding is initiated without error checks here. If memory or other issues occur, the function will fail silently. Consider verifying the return value.
93-93
: Proper resource deallocation.
Freeing the `batch` resource after finishing the embeddings is correct.
110-110
: Initializing batch with dynamic size.
Using `llama_n_batch(ctx)` ensures automatic alignment with context settings. Looks good.
116-116
: Appropriate batch clearing.
Once more, verify no leftover tokens are needed from earlier calls. Clearing each iteration is usually correct in a stateless loop.
150-150
: Batch freed after usage.
Resources are released. Good practice.
examples/llava/gemma3-cli.cpp (8)
8-8
: New header import.
`#include "llama-cpp.h"` looks consistent with the usage of the extended APIs. Ensure no redundant includes remain.
67-67
: Use of llama_batch_ext_ptr.
Smart pointer management with `llama_batch_ext_ptr` helps avoid manual frees, which is a good pattern for exception safety.
93-93
: Clear batch at start of new text.
Verifies that no stale tokens remain. This is typically correct.
96-97
: Parameter usage in llama_batch_ext_add_text.
Using `seq_id=0` and a final token flag of `false` might be correct if these tokens belong to a single sequence. Confirm the logic if multi-sequence usage is anticipated.
99-99
: Output last token setting.
`llama_batch_ext_set_output_last` is effective if you want to retrieve the final token's output. Ensure this is logically consistent with your decoding approach.
102-102
: Check llama_decode_ext status.
You do a proper `if` check. Good job! Ensure that you handle the error accordingly so the rest of the code remains stable.
152-153
: Initializing batch from embeddings.
`llama_batch_ext_init_from_embd` is a specialized constructor. Confirm the first parameter matches the correct dimension. Also check for out-of-bounds access on `image_embd_v`.
182-182
: Clearing batch in each loop iteration.
Ensure tokens from prior steps are no longer needed. This is typically correct for generating each subsequent token.
examples/speculative-simple/speculative-simple.cpp (5)
116-117
: Ensure non-empty input before calling batch initialization.
If `inp.size()` is 0, subtracting 1 can lead to an out-of-range size. Consider adding a check to avoid potential issues.
136-136
: Initialization of extended batch
Looks correct. Make sure that `llama_n_batch(ctx_tgt)` is sufficient for all tokens in your use case.
155-158
: Clearing batch before re-adding tokens
This approach ensures each step re-initializes the batch. The logic is sound for evaluating just one new token at a time.
167-167
: Validate `draft.size()` vs. context capacity.
When adding tokens with `n_past + i`, confirm that this won't exceed `llama_n_ctx(ctx_tgt)`.
258-258
: Freeing the extended batch
This is good practice to prevent memory leaks. Ensure there is no further usage of `batch_tgt` after this point.
examples/run/run.cpp (4)
908-908
: Refactor to accept extended batch pointer
Switching the function signature to `const llama_batch_ext_ptr & batch` is a good step for extended batch handling.
949-949
: Initialize batch from tokens
Creating a new batch with `llama_batch_ext_init_from_text` is straightforward. Ensure `tokens.size()` is non-zero to avoid passing an empty batch.
953-953
: Decoding with extended API
You're already handling the return value of `llama_decode_ext`. Good job ensuring errors are caught properly.
972-972
: Resetting the batch for a single token
Re-initializing the batch for the next token is a clear approach. Ensure that repeated resets don't cause confusion in logs or debug outputs.
common/common.h (1)
583-642
: Introduction of the `common_batch` struct.
This structure nicely encapsulates batch handling, tokens, and output logic. A few considerations:
- Ensure concurrency safety if accessed from multiple threads.
- Check for performance overhead of inlined methods in multiple translation units.
Overall, this design is a clear improvement for global code clarity.examples/parallel/parallel.cpp (10)
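As a point of reference, a wrapper of this shape typically pairs the raw `llama_batch_ext` pointer with per-token bookkeeping; a simplified sketch, not the struct's actual definition (the real one also records token metadata such as logits flags):

```cpp
#include <vector>
#include "llama.h"

// Simplified sketch: owns the extended batch, remembers which tokens
// requested output, and forwards the common operations.
struct common_batch_sketch {
    llama_batch_ext * batch;
    std::vector<bool> output; // per token: was it added with output=true?

    common_batch_sketch(int32_t n_tokens, int32_t n_seq_max)
        : batch(llama_batch_ext_init(n_tokens, n_seq_max)) {}
    ~common_batch_sketch() { llama_batch_ext_free(batch); }

    void clear() {
        llama_batch_ext_clear(batch);
        output.clear();
    }
    void add_text(llama_token token, llama_pos pos, llama_seq_id seq_id, bool out) {
        llama_batch_ext_add_text(batch, token, pos, &seq_id, 1, out);
        output.push_back(out);
    }
    int32_t get_n_tokens() const { return llama_batch_ext_get_n_tokens(batch); }
};
```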
177-177
: Extended batch initialization
Allocating a new extended batch with `llama_batch_ext_init(n_ctx, 1)` is straightforward. Confirm that `n_ctx` is large enough for the worst-case prompt.
220-220
: Clearing the batch
Resetting the batch via `llama_batch_ext_clear` ensures a fresh state before adding tokens. A good approach for iterative decoding loops.
228-228
: Storing the current batch size
`client.i_batch = llama_batch_ext_get_n_tokens(batch)` is lightweight, but confirm the logic if multiple clients share the same batch.
231-231
: Added text with sequence ID
Using `n_tokens_system + client.n_prompt + client.n_decoded` as the position is logical, but confirm it won't exceed the context.
236-236
: Checking for an empty batch
Resetting the KV cache if no tokens are present is a robust approach to keep context consistent.
248-248
: Conditional batching
`if (cont_batching || llama_batch_ext_get_n_tokens(batch) == 0)` effectively handles new requests. A good pattern for concurrency.
267-268
: Assigning sequence ID per client
Ensures each client's tokens remain isolated. Keep an eye on concurrency if multiple threads modify `seq_id` concurrently.
272-273
: Output logits for last token
Toggling the output once the batch has tokens is an effective approach for partial result analysis. No issues found.
300-311
: Chunked decoding with batch views
Splitting the batch into chunks and calling `llama_decode_ext` on each view is a valid approach for large prompts. Watching out for partial remainders or repeated decode failures is important.
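The chunked pattern under review is roughly the following; a sketch using the view API named here (assuming views are freed with `llama_batch_ext_free`, as the free-function comment later in this review suggests):

```cpp
#include <algorithm>
#include "llama.h"

// Decode a large batch in n_chunk-sized views. Each view aliases the parent
// batch, so only the view object itself is freed here.
static int decode_in_chunks(llama_context * ctx, llama_batch_ext * batch, int32_t n_chunk) {
    const int32_t n_tokens = llama_batch_ext_get_n_tokens(batch);
    for (int32_t i = 0; i < n_tokens; i += n_chunk) {
        const int32_t n_view = std::min(n_chunk, n_tokens - i);
        llama_batch_ext * view = llama_batch_ext_get_view(batch, i, n_view);
        const int ret = llama_decode_ext(ctx, view);
        llama_batch_ext_free(view); // frees the view, not the parent's arrays
        if (ret != 0) {
            return ret; // propagate the failing chunk's status
        }
    }
    return 0;
}
```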
417-417
: Freeing the main batch
A proper cleanup to prevent memory leaks at the end of the main function. Looks good.
examples/lookahead/lookahead.cpp (6)
95-96
: Validate `n_input` to avoid out-of-bounds and confirm arguments.
Ensure that `n_input` is at least 2, otherwise having `n_input - 1` here can cause unintended behavior. Also verify that passing `0` for the offset parameters is correct for your use case in `llama_batch_ext_init_from_text`.
209-209
: Clearing the batch before repopulating looks fine.
This ensures a fresh slate for each iteration. No concerns here.
212-212
: Good usage of multi-seq ID addition.
Passing the entire `seq_id_all` array aligns with the extended batch approach. Nicely done.
235-235
: Verify index usage.
Storing `llama_batch_ext_get_n_tokens(batch)` for each n-gram is logical, but please verify that these indices align with subsequent sampler operations.
237-238
: Confirm extended batch parameters.
Here, you pass `true` for the last argument of `llama_batch_ext_add_text`, indicating a final token in the chunk. Ensure this is the intended usage for each verification n-gram token.
257-258
: Validate partial final token logic.
You pass `j == N - 2` as the final argument, presumably to finalize tokens at the last iteration. Double-check that this matches your intended partial or final token logic.
examples/server/server.cpp (22)
1227-1227
: Use of `common_batch` for batch_spec.
Replacing `llama_batch` with `common_batch` appears consistent with the new extended API.
1799-1799
: Switched global `batch` to `common_batch`.
Ensures alignment with the new batch abstraction. No immediate concerns.
1925-1925
: Initialize `batch_spec` with speculative capacity.
Consider verifying that `params_base.speculative.n_max` is non-negative to avoid unintended zero/negative sizing.
1950-1950
: Moving slot into the `slots` container.
Using move semantics here is efficient for transferring ownership.
1961-1961
: Re-initializing the main `batch`.
No issues seen; this logic properly sets the batch capacity for tokens and sequences.
2052-2052
: Resetting the slot.
Calling `slot.reset()` ensures a clean slate before usage; seems appropriate.
2096-2096
: Checking for a draft context.
Conditionally handling the speculative model context is clear and concise.
2097-2097
: Setting `batch_spec` for speculation.
As above, ensure `slot.params.speculative.n_max` is not negative to avoid invalid sizing.
2857-2857
: Clearing the main batch.
A straightforward reset before assembling new tokens.
2879-2879
: Tracking the batch offset.
Storing `batch.get_n_tokens()` into `slot.i_batch` is a clear way to align slot state with batch indexing.
2881-2881
: Adding sampled token to the batch.
This usage of `batch.add_text(...)` aligns with the new API pattern.
3064-3064
: Ensuring batch capacity isn’t exceeded.
Good safeguard against overfilling the batch.
3085-3085
: Appending prompt token to the batch.
Logic is sound; no immediate concerns.
3104-3104
: Assertion to ensure non-empty batch.
Helps catch error states early.
3115-3115
: Setting logits for the last token.
This API call is straightforward, presumably controlling final logits usage.
3118-3118
: Recording last token index in `slot.i_batch`.
Keeps slot state aligned with the batch's final token.
3150-3150
: Calling `llama_decode_ext`.
The return code is checked below; usage appears consistent with error handling.
3244-3244
: Skipping speculation checks.
Logic is clear: no speculation if the slot isn’t active or the draft model is unavailable.
3284-3284
: Clearing the speculative batch.
Ensures a fresh batch for the speculation pass.
3285-3285
: Adding initial token to the speculative batch.
No issues found; usage matches the new API.
3288-3288
: Populating speculative batch with draft tokens.
Implementation looks correct; just be mindful of potential batch size limits.
3292-3292
: Processing the speculation batch.
This final decode extends the approach to test and accept tokens from the draft.
examples/speculative/speculative.cpp (4)
168-173
: Adoption of extended batch initialization and decoding.
These lines show a transition from older batch processing calls to `llama_batch_ext_init_from_text` plus `llama_decode_ext`. This is consistent with the extended batch mechanism introduced across the codebase. No correctness issues are apparent. Ensure that downstream usage (e.g., sampling logic) expects the new structure and checks decode return codes properly.
205-206
: New batch initialization with `llama_batch_ext_init`.
Switching to `llama_batch_ext_init` for `batch_dft` and `batch_tgt` matches the new extended batch design. This looks correct, assuming the first argument selects the batch size and the second argument the sequence count. Confirm these match your concurrency or parallelism requirements.
447-449
: Refining the draft batch between steps.
- `llama_batch_ext_clear(batch_dft);` helps you reuse the same batch object.
- Setting `seq_id` to 0 each time ensures a consistent, single-sequence approach here.
- `llama_batch_ext_add_text(...)` properly appends the token for decoding on the target side.
- Finally, calling `llama_decode_ext` on `batch_dft` runs the decoding.
Everything is aligned with the extended batch usage. Just verify that the correct sequence ID is used when branching or splitting.
Also applies to: 453-453
478-480
: Re-initializing `batch_tgt` before the speculation loop.
Clearing `batch_tgt` and reassigning a fresh `seq_id` helps isolate each iteration's partial tokens, avoiding contamination from previous steps. The usage of `llama_batch_ext_add_text` with `drafts[0].tokens[0]` also appears correct. This approach should simplify logic, provided the rest of the speculation logic properly tracks the new sequences.
examples/perplexity/perplexity.cpp (34)
366-366
: Refactored batch usage for perplexity calculation.
- Line 366 clears the KV cache with `llama_kv_self_clear(ctx)`, appropriate before processing a new chunk of tokens.
- Line 372 calls `batch.clear()`, ensuring each batch starts clean.
- Line 375 uses `batch.add_text(...)` to add tokens, matching the new extended usage.
- Line 378 calls `llama_decode_ext(...)` for decoding.
All steps look consistent with the revised `common_batch` approach. Verify that error handling for `llama_decode_ext` is consistently applied to detect decode failures.
Also applies to: 372-372, 375-375, 378-378
505-505
: Instantiation of `common_batch` with dynamic size.
Using `std::min(n_batch, n_ctx*n_seq)` ensures the batch won't exceed the model's context or the configured limit. This is a safe approach to avoid potential buffer overruns. Looks good.
555-557
: Batch clearing and token addition within the loop.
- Line 555 calls `batch.clear()` before filling each sub-batch, preventing leftover tokens.
- Lines 557, 568-571 add tokens with `batch.add_text(...)` and handle flags like `output = true`.
This pattern helps ensure each sub-batch is well-defined. Confirm that the correct sequence IDs are used if some lines rely on multi-sequence contexts.
Also applies to: 568-571
652-653
: Replacing or refactoring lines in test or logging logic?
No explicit logic appears in your snippet for lines 652-653 beyond them having `~`. If they involve minor refactoring (e.g., logging or clearing), confirm consistency with surrounding code. Please ensure these modified lines align with updated usage of the batch or reflect a no-op if it's purely a comment/log change.
778-782
: Transition from `llama_decode` to `llama_decode_ext`.
Replacing the older decode method with the extended version ensures consistent usage of the new batch approach. Confirm that all downstream calls to sampling or perplexity computations expect the new structures or call signatures.
865-868
: Ensuring chunk-based loop validity.
Any changes to these lines likely adjust how chunk iteration or indexes are incremented. Double-check off-by-one errors around loop boundaries, especially if prompt sizes can vary.
If the chunk size logic changed, run a test with the largest possible prompt to confirm no out-of-range indexing.
962-962
: Comparing predicted endings.
Here, verifying gold labels vs. predicted tokens likely changed to reflect the new batch structure. Ensure all references to `log_probs` or token arrays remain consistent after the refactor.
Also applies to: 964-965
977-978
: Data structure or function call alignment for multiple-choice tasks.
If these lines adjust parameter passing or function calls to the updated batch system, confirm that the new type or name exactly matches.
982-983
: Tokenizing multiple-choice tasks and storing them.
These lines likely replace direct calls with `common_batch` or updated logic. Watch out for memory usage if multiple tasks are large. The approach seems correct if it matches the new batch design.
Also applies to: 985-987, 989-990, 991-993
1030-1041
: Insertion logic for `_` in the sentence.
Verifying underscore placeholders ensures correct string manipulation. The approach looks fine, but confirm that the replaced code matches the new specs for multi-sequence handling.
1042-1044
: Ensuring model state is consistent for multiple-choice tasks.
Refactoring might require re-initializing the KV cache or sampler. No immediate issues are visible, but watch for concurrency if tasks are processed in parallel.
Also applies to: 1058-1060
1062-1066
: Invocation of Winogrande scoring.
Refactored calls that unify with the extended batch approach might require consistent `n_batch` handling. Confirm that the new logic does not skip final tokens.
1069-1077
: Refreshing random selection or partial usage for Winogrande tasks.
If we're limiting tasks or randomizing them, ensure the changes do not bias the distribution or produce out-of-range selection. The approach of removing tasks after selection looks correct.
To confirm correctness, run repeated tests with different seeds and compare distribution outcomes for coverage.
Also applies to: 1080-1093
1095-1107
: Tokenizing Winogrande tasks with common code.
These lines presumably unify the code with the new `common_tokenize` approach. Ensure no off-by-one bugs if partial tokens are included.
Also applies to: 1114-1116
1130-1134
: Decoding or evaluating tasks in a new batch structure.
Check that subsequent references to the log probabilities or embeddings match the updated indexing if extended-batch logic changed how tokens are stored.
Also applies to: 1138-1138
1146-1146
: Batch handling for Winogrande.
- Clearing the batch and reinitializing it.
- Possibly new arguments for sequence IDs.
Looks consistent if it tracks each test scenario properly.
Also applies to: 1148-1148, 1159-1162
1165-1166
: Refined sequence iteration or partial logic.
Any newly introduced checks or pointer arithmetic must be double-checked for boundary conditions. The changes appear structurally correct if tested with different batch sizes. Encourage a boundary test to ensure no stepping out of range on `seq_tokens`.
Also applies to: 1200-1208, 1211-1212
1219-1223
: Collecting and normalizing token log probabilities.
These lines appear to gather logprobs from `eval_results`. The logic looks consistent, but confirm that newly introduced or replaced lines do not skip an index or double-count tokens.
Also applies to: 1225-1238
1240-1244
: Granular comparison of predicted endings.
This final acceptance check for the correct answer is consistent with the rest of the new logic. No immediate issues seen.
Also applies to: 1247-1247
1254-1262
: Final results and stats.
If these lines finalize the acceptance ratio or set up the scoring, ensure they handle the updated batch size or newly introduced partial chunking. Otherwise, it looks correct.
1307-1311
: `multiple_choice_prepare_one_task` adjustments.
These lines presumably adapt the method to the new batch mechanism. Look for any mismatch with the new sequence offset or token indexing.
Also applies to: 1314-1322, 1324-1338
1342-1346
: Setting up the multiple-choice scoring environment.
Ensuring context size and concurrency are properly enforced is key. The changes appear consistent, but keep an eye on a potential off-by-one if `n_ctx` changed. Run scale tests with different `n_ctx` values to confirm no boundary exceptions occur.
Also applies to: 1482-1483
1535-1537
: Performing the final decode for multiple-choice tasks (batch usage).
Rewriting older calls from `llama_decode` to `llama_decode_ext` can break logic if some metadata is missing. Everything looks aligned so far, but confirm that all tasks are accounted for in the logs.
Also applies to: 1539-1539, 1542-1545
1552-1558
: Clearing KV and decoding all tasks in `[i0, i1)`.
This persistent pattern is consistent with earlier changes. Just ensure you handle partial tasks if the final chunk is smaller than a full batch.
Also applies to: 1560-1560
1581-1584
: Computing logprobs in parallel.
Where concurrency is introduced or updated, confirm that shared data structures (e.g., `eval_pairs`, `eval_results`) are protected from race conditions. The usage with `std::mutex` or an atomic counter looks correct.
Also applies to: 1586-1596, 1600-1608
1612-1612
: Aggregating or normalizing partial results.
Double-check these lines for potential floating-point rounding issues, especially if the partial-sums approach changed. All appears logically consistent if thoroughly tested.
Also applies to: 1614-1614, 1616-1616, 1618-1618, 1620-1620, 1622-1626
1628-1631
: Calculating the final accuracy or partial results.
No major issues, but watch for integer division if any changed lines handle sums vs. floats.
Also applies to: 1634-1634, 1637-1643, 1647-1647
1668-1668
: KL divergence readiness checks and parameter usage.
Any new lines around version compatibility or ensuring the correct vocab size must match the current model. The changes look straightforward, but test any new model with varied vocab sizes.
Also applies to: 1670-1670, 1726-1727
1732-1735
: Reading log-prob file and chunking logic.
Confirm that your indexing into the input file aligns with these new line changes. Possibly, we skip unneeded lines or track chunk sizes differently.
Test with logs from older versions to ensure backward compatibility or fail gracefully if incompatible.
Also applies to: 1738-1742
1750-1751
: Clearing batch and re-inserting tokens (KL analysis).
If we rely on `common_batch` again for re-processing tokens, ensure the correct offsets are used. Looks consistent with earlier refactors.
Also applies to: 1754-1758, 1760-1763
1774-1776
: Aggregating chunk-based results.
Double-check that the final metrics (time per pass, etc.) are unaffected by changes to chunk indexing or partial decoding.
Also applies to: 1793-1801
1802-1826
: Calculating partial perplexities and KL stats.
Large floating-point operations are always a spot to watch for numerics. The refactor is basically hooking into new data structures. No big concerns if tested.
1832-1909
: Collecting distribution statistics.
Here you unify the stats with percentile calculations. No immediate red flags unless the refactor removed a boundary check. The code still looks consistent.
1910-1926
: Final logs and prints.
These lines finalize the results. The changes appear to focus on printing or storing updated stats. Everything seems coherent with the new approach.
examples/embedding/embedding.cpp (5)
29-30
: Adding tokens with `common_batch`.
Using `batch.add_text()` in a loop (lines 29–30, 32) replaces older batch logic. This is consistent with the new approach. No issues found, but confirm proper sequence IDs if multiple sequences are processed at once.
Also applies to: 32-32
36-36
: Refactored embedding decode calls.
- Line 36: Introduces `batch_decode(...)`, referencing the new batch approach.
- Lines 44–45: Logging adjusted to show `llama_batch_ext_get_n_tokens(...)` instead of direct `batch.n_tokens`.
- Line 48: Checking the return code from `llama_encode_ext` / `llama_decode_ext`.
- Line 52: Another check or log statement updated similarly.
These changes consistently adopt the extended API. Recognize that the logic now branches between encoder-only vs. decoder-only. This is technically robust so far.
Also applies to: 44-45, 48-48, 52-52
58-59
: Retrieving embeddings from new functions.
- Lines 58–59 switch from direct usage of `logits` to checking `batch.tokens[i].logits`, aligning with the extended batch struct.
- Lines 73–74 retrieve sequence embeddings with `llama_get_embeddings_seq(...)`, ensuring the correct `seq_id`.
Looks correct if you consistently manage `seq_id`. This is critical for correct embedding retrieval.
Also applies to: 73-74
172-172
: Instantiating `common_batch` for embeddings.
Replacing older code with `common_batch(n_batch, 1)` clarifies that you have one sequence in each batch for embeddings. This is consistent with the single-prompt embedding scenario or repeated single-sequence usage.
Also applies to: 174-174
200-200
: Triggering partial batch decode for multiple lines.
- Line 200 checks if adding new tokens would overflow the batch.
- Lines 202–203 decode the batch, then line 205 increments the pointer for embeddings.
- Line 207 calls `batch.clear()` to prepare for the next chunk.
Ensures repeated usage is stable. A good approach for embedding multiple inputs in a single run.
Also applies to: 202-203, 205-205, 207-207
examples/passkey/passkey.cpp (9)
5-5
: Confirm include path.
Including `"llama-cpp.h"` here looks correct, but ensure that this header is available at compile time.
126-126
: Check for potential null return.
`llama_batch_ext_init` may return `nullptr` on failure. Consider adding a null check for robust error handling.
144-144
: Proper clearing usage.
Calling `llama_batch_ext_clear(batch.get())` here is consistent with the extended batch mechanism.
152-152
: Immediate output setting.
`llama_batch_ext_set_output_last` is used correctly to designate output for the final token in this batch.
155-155
: Return code check is good.
Decoding errors are handled properly by exiting on non-zero return.
179-179
: Batch clearing repeated.
Clearing the batch again is consistent with the iterative chunk processing flow.
182-183
: Duplicate pattern with seq_id.
This repeats the pattern of always using `seq_id = 0`. The same considerations as for lines 147-148 apply.
187-187
: Final output flagging.
Designating this as the last output token in the chunk is consistent with the extended approach.
190-190
: Decoding error handling repeated.
Same robust mechanism as before.
src/llama-batch.h (3)
22-32
: New extended batch structure.
Fields and layout look sensible for extended usage, and they unify token/embedding details.
99-99
: Updated signature.
Adopting `llama_batch_ext` in `from_batch` aligns with the broader refactor.
114-115
: Constructor and destructor introduced.
Check that any allocated resources (e.g. `batch`) are fully freed in the destructor to prevent leaks.
examples/batched/batched.cpp (14)
105-105
: Validate return pointer.
Add a null check for `llama_batch_ext_init` to safeguard against allocation failures.
114-114
: Positioning logic check.
Using `i` as both token index and position might be correct here, but confirm it matches your intended sequence logic.
119-119
: Encoding error handling.
Non-zero return leads to an error. This matches the approach used for decoding.
129-129
: Batch clearing usage.
Resetting the batch before adding decoder tokens is appropriate.
130-130
: Decoder start token position.
Position 0 is reused for all sequences. Confirm if this suits your decoding logic.
134-134
: Last output token set.
`llama_batch_ext_set_output_last` is used correctly, consistent with standard patterns.
136-136
: Decoding error handling.
Same pattern as above—this is good for consistency.
158-158
: Last-token indexing.
Setting `i_batch[i] = llama_batch_ext_get_n_tokens(batch) - 1` could lead to an off-by-one if tokens are added afterwards.
167-167
: Batch reset.
Clearing the batch again works for the iterative generation loop.
196-196
: Updating i_batch.
`i_batch[i] = llama_batch_ext_get_n_tokens(batch)` updates the reference to newly added tokens. Confirm that the index remains in sync across parallel sequences.
199-199
: Setting position to n_cur.
Passing `n_cur` as the token position seems logical.
205-205
: Zero tokens check.
Breaking out when no tokens remain helps avoid extraneous decode calls.
212-212
: Decoding loop.
Consistent error handling for incremental decoding.
237-237
: Memory cleanup.
`llama_batch_ext_free(batch)` ensures no leaks for the extended batch.
examples/simple/simple.cpp (3)
146-146
: Consider checking the return value from `llama_batch_ext_init_from_text`.
If `prompt_tokens.size()` is zero or memory allocation fails, `batch` might be `nullptr`. Checking for a null return can help avoid potential segmentation faults.
185-185
: Validate sequence ID usage
The code always uses sequence ID 0 when adding text. If multi-sequence contexts are ever needed, ensure the correct sequence ID is passed, or add logic to handle multiple sequences.
203-203
: Confirm memory cleanup.
Freeing the `batch` pointer is correct for preventing resource leaks. Ensure no subsequent logic reads from this batch structure after it's freed.
examples/retrieval/retrieval.cpp (5)
77-80
: Add error handling for empty tokens.
If `tokens` is empty, this function silently skips adding anything to the batch. Consider logging or otherwise surfacing the scenario to avoid silent failures.
92-102
: Check model type assumptions
This logic branches for encoder-only vs. decoder-only models. If future support expands to multi-encoder-decoder models, additional checks or error handling might be required.
105-106
: Clarify skip condition.
Skipping tokens where `batch.tokens[i].logits` is null might lead to missing or incomplete processing. Verify whether additional logging or alternative handling is warranted.
253-254
: Verify timing of batch clear.
Calling `batch_decode` immediately followed by `batch.clear()` may cause issues if decoding is asynchronous or if data is accessed post-decode. Confirm decode completion before clearing.
288-288
: Clearing query batch is appropriate.
Erasing the query batch after its use ensures a clean slate. This aligns well with typical usage patterns here.
src/llama-context.cpp (3)
7-7
: Reference new batch implementation.
Including `"llama-batch.h"` enables the extended batch functionality. This is a good step toward the new API.
2738-2742
: New `llama_encode_ext`.
The extended API is a clearer entry point for advanced usage. This approach consistently leverages `llama_batch_ext` throughout.
2744-2747
: New `llama_decode_ext`.
Properly references the extended batch, aligning with the rest of the new APIs. This promotes a unified pattern across the codebase.
examples/save-load-state/save-load-state.cpp (5)
51-51
: Ensure sufficient batch capacity.
`llama_batch_ext_init_from_text()` sets `batch->max_tokens = n_tokens`, so it should typically handle the text size. However, if there's a chance that `tokens.size()` is unexpectedly large, it might be prudent to confirm no overflow or out-of-range scenario occurs inside the internal loop.
82-84
: Logic appears correct.
Clearing the batch each iteration, setting `seq_id = 0`, and adding the next token is a proper pattern for incremental decoding.
86-88
: Good error handling.
You capture the error code from `llama_decode_ext` and free the batch on failure. This prevents resource leaks and undefined behavior.
139-141
: Error handling remains consistent.
Same pattern as before: checking the return code and freeing the batch upon error.
235-235
: Safe cleanup.
Freeing the batch after the final usage is correct. Just ensure that no other code path calls `llama_batch_ext_free(batch)` again, to avoid a double free.
src/llama-batch.cpp (6)
192-274
: Potential nullness assumption for `batch.n_seq_id`.
Inside `from_batch`, you rely on `batch.n_seq_id` to access `batch.n_seq_id[bi]` unconditionally in multiple places. If `batch.n_seq_id` were truly null, this could segfault. The `llama_batch_allocr` constructor does allocate `n_seq_id` if missing, but verifying that all call sites respect that invariant would help avoid unexpected crashes.
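One cheap way to make that invariant explicit is an up-front check; a standalone sketch (the real code would use GGML_ASSERT rather than assert):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: fail loudly in debug builds if the allocator's invariant is broken,
// instead of segfaulting deep inside from_batch.
static void check_seq_id_invariant(const int32_t * n_seq_id, int32_t n_tokens) {
    assert(n_seq_id != nullptr && "llama_batch_allocr should have provided n_seq_id");
    for (int32_t bi = 0; bi < n_tokens; ++bi) {
        assert(n_seq_id[bi] >= 1); // every token belongs to at least one sequence
    }
    (void) n_seq_id; (void) n_tokens; // silence unused warnings in release builds
}
```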
326-329
: Convenience constructor logic is straightforward.
This helper for creating a minimal `llama_batch` with tokens and no embeddings looks fine. Just ensure that the fields remain valid for the expected usage.
388-391
: Initialize with the chosen number of tokens.
`llama_batch_ext_init(n_kv_max, 0)` or `(n_kv_max, 1)` is succinct, but ensure it matches the required sequence usage.
407-409
: Simple accessor method.
`llama_batch_ext_get_n_tokens()` has no issues; it's a straightforward getter.
455-463
: Final-token marking is concise and clear.
`llama_batch_ext_set_output_last` sets the last token for output. This is a direct approach that matches the needs of typical "terminal token" usage.
489-505
: Proper selective freeing.
You skip freeing arrays if the batch is a view. This approach is correct for shared data lifetimes. Just ensure no references remain active once the parent is freed.
examples/batched-bench/batched-bench.cpp (7)
62-62
: Check `n_seq_max` in `llama_batch_ext_init`.
Here, you use `llama_batch_ext_init(n_kv_max, 1)`. If you intend multiple sequences, ensure that using "1" as `n_seq_max` doesn't block multi-sequence usage.
66-72
: Nicely modular decoding step.
Splitting the batch into sub-batches using `llama_batch_ext_get_view` is a clean approach. Just confirm you don't exceed the parent's token count in edge cases.
87-88
: Usage of token=0.
You repeatedly add the token value `0`. Verify that this is a valid placeholder or system token in your domain logic (likely a start-of-sequence or dummy token).
118-118
: Clearing the batch is correct.
This resets `n_tokens`, preparing for the next iteration. No immediate issues.
125-125
: Marking last output token.
`llama_batch_ext_set_output_last(batch)` is consistent, given you want the final token flagged for output.
147-147
: Recurring batch clear.
Again, systematically resetting `n_tokens` is fine. Watch for performance overhead if done too frequently, but this is likely acceptable in a benchmark.
188-188
: Proper final cleanup.
Freeing the extended batch at the end ensures no memory leaks.
include/llama.h (8)
236-238
: Deprecation Notice Acknowledged
This clarifies that the older `llama_batch` struct will be phased out in favor of `llama_batch_ext`.
263-266
: Forward Declaration
No functional issues. Defining `struct llama_batch_ext;` aligns with the plan to deprecate `llama_batch`.
966-980
: Confirm Concurrency
`llama_batch_ext_set_output` and `llama_batch_ext_set_output_last` modify batch data. Confirm whether usage is serialized or whether these functions are intended to be thread-safe.
989-991
: Clear Function
`llama_batch_ext_clear` straightforwardly removes all tokens. The implementation appears sound.
992-994
: Free Function
Appropriately handles both the original batch and sub-views. Looks correct.
1004-1007
: Extended Encode
`llama_encode_ext` is a logical companion to the existing API, enabling the new batch structure.
1014-1015
: Minor Modification
No functional issues observed in this signature adjustment.
1016-1018
: Extended Decode
`llama_decode_ext` follows the same pattern as `llama_encode_ext`, enabling the extended batch approach.
src/llama-batch.cpp
Outdated
```cpp
struct llama_batch_ext * llama_batch_ext_get_view(
        struct llama_batch_ext * batch,
        int32_t offset,
        int32_t n_tokens) {
    if (batch->embd) {
        return nullptr; // not yet supported
    }
    llama_batch_ext * batch_view = new llama_batch_ext{
        /*n_tokens   =*/ n_tokens,
        /*max_tokens =*/ n_tokens,
        /*is_view    =*/ true,
        /*tokens     =*/ batch->token + offset,
        /*embd       =*/ nullptr,
        /*pos        =*/ batch->pos + offset,
        /*n_seq_id   =*/ batch->n_seq_id + offset,
        /*seq_id     =*/ batch->seq_id + offset,
        /*logits     =*/ batch->logits + offset,
    };
    return batch_view;
}
```
Offset bounds are unchecked.
llama_batch_ext_get_view
uses offset
and n_tokens
to compute ranges. If the offset plus size exceeds the parent’s capacity, it’ll create an invalid pointer range. Add boundary checks to avoid silent out-of-bounds usage.
include/llama.h
Outdated
// Same with llama_batch_init, but initializes the batch with the provided raw embeddings
// First token will be at position pos0
// The sequence ID will be fixed to seq_id
// The batch has to be freed with llama_batch_ext_free()
LLAMA_API struct llama_batch_ext * llama_batch_ext_init_from_embd(
        float * embd,
        size_t n_embd,
        int32_t pos0,
        int32_t seq_id);
🛠️ Refactor suggestion
Initialize From Embeddings
Ensure embd
is non-null and n_embd > 0
; otherwise, the function could run into undefined behavior.
+// Example check inside llama_batch_ext_init_from_embd:
+if (!embd || n_embd <= 0) {
+ // handle error or return null
+}
Committable suggestion skipped: line range outside the PR's diff.
@coderabbitai help
@coderabbitai can I add some more context to the review?
Tip: For best results, initiate chat on the files or code changes.
@ngxson Yes, you can absolutely add more context to the review! Additional information about your PR would be very helpful for me to better understand the purpose and implementation of the private batch API changes. Feel free to share details; the more context you provide, the more thorough and accurate my review can be.
ggml-ci
server : avoid common_batch
Actionable comments posted: 9
♻️ Duplicate comments (4)
examples/run/run.cpp (1)
954-957
: ⚠️ Potential issue: Consider checking the return value of check_context_size.
Currently, the return value is ignored at line 998, which could lead to potential errors if the context is exceeded. Apply this diff to handle the error properly:

-check_context_size(llama_data.context, batch);
+if (check_context_size(llama_data.context, batch) != 0) {
+    // context size exceeded, handle error here
+    return 1;
+}

examples/save-load-state/save-load-state.cpp (1)
54-56
: ⚠️ Potential issue: Decode result is ignored.
Check the return code from llama_decode_ext to catch potential failures. Possible patch:

-llama_decode_ext(ctx, batch.get());
+int ret_dec = llama_decode_ext(ctx, batch.get());
+if (ret_dec != 0) {
+    fprintf(stderr, "llama_decode_ext failed with code %d\n", ret_dec);
+    return 1;
+}
 n_past += llama_batch_ext_get_n_tokens(batch.get());

src/llama-batch.cpp (2)
404-427
: ⚠️ Potential issue: Potential out-of-bounds writes if n_seq_ids exceeds the allocated n_seq_max.
Although the code sets batch->n_seq_id[output_id] = n_seq_ids, there is no check that n_seq_ids fits into the allocated number of sequence IDs per token. This can cause out-of-bounds writes in batch->seq_id[batch->n_tokens][j].

int32_t llama_batch_ext_add_text(..., size_t n_seq_ids, ...) {
+    // Hypothetical fix: assume n_seq_max is known (e.g., stored in batch->some_field)
+    if (n_seq_ids > batch->some_n_seq_max) {
+        return -3; // or handle error
+    }
    ...
}
461-480
: ⚠️ Potential issue: Missing boundary checks for offset and n_tokens.
If offset + n_tokens is beyond the parent batch’s capacity, pointer arithmetic can go out of bounds. Consider clarifying or enforcing valid ranges.

if (offset + n_tokens > batch->max_tokens) {
    // handle error
    return nullptr;
}
🧹 Nitpick comments (3)
examples/simple/simple.cpp (1)
184-187
: Consider checking the return value of batch operations.
While the transition to the new batch API is correct, consider checking the return value of llama_batch_ext_add_text, which can return error codes (such as -1 if the batch size is exceeded or -2 if embeddings are already set).

-llama_batch_ext_add_text(batch.get(), new_token_id, n_pos, &seq_id, 1, true);
+int add_result = llama_batch_ext_add_text(batch.get(), new_token_id, n_pos, &seq_id, 1, true);
+if (add_result < 0) {
+    fprintf(stderr, "%s: failed to add token to batch, error code %d\n", __func__, add_result);
+    return 1;
+}

examples/lookahead/lookahead.cpp (1)
120-120
: Consider using a smart pointer for consistent resource management.
This uses a raw llama_batch_ext pointer, unlike other batch initializations in the codebase that use llama_batch_ext_ptr. While this works, a smart pointer would provide more consistent RAII-style resource management.

-llama_batch_ext * batch = llama_batch_ext_init(params.n_ctx, W + G + 1);
+auto batch = llama_batch_ext_ptr(llama_batch_ext_init(params.n_ctx, W + G + 1));

src/llama-batch.cpp (1)
192-274
: Consider extracting sorting logic into a helper for clarity.
This new function is quite large and has multiple branching conditions. Extracting the sorting and sequence grouping logic into helper methods could improve readability and maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- common/common.cpp (2 hunks)
- examples/lookahead/lookahead.cpp (6 hunks)
- examples/lookup/lookup.cpp (5 hunks)
- examples/run/run.cpp (4 hunks)
- examples/save-load-state/save-load-state.cpp (5 hunks)
- examples/simple-chat/simple-chat.cpp (3 hunks)
- examples/simple/simple.cpp (3 hunks)
- examples/speculative/speculative.cpp (9 hunks)
- include/llama-cpp.h (1 hunks)
- include/llama.h (5 hunks)
- src/llama-batch.cpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/simple-chat/simple-chat.cpp
- examples/lookup/lookup.cpp
🧰 Additional context used
🧠 Learnings (2)
examples/lookahead/lookahead.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
src/llama-batch.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: src/llama-batch.cpp:356-386
Timestamp: 2025-03-13T23:30:11.172Z
Learning: Null checks after malloc calls are not required in the llama.cpp codebase, as the project follows a design pattern where memory allocation failures are handled differently or allowed to propagate.
🧬 Code Definitions (7)
examples/run/run.cpp (3)
- include/llama-cpp.h (4): batch (28-28), batch (28-28), tokens (44-57), tokens (44-48)
- src/llama-context.cpp (2): llama_decode_ext (2777-2781), llama_decode_ext (2777-2779)
- src/llama-batch.cpp (2): llama_batch_ext_get_n_tokens (400-402), llama_batch_ext_get_n_tokens (400-400)
include/llama-cpp.h (2)
- include/llama.h (4): llama_batch_ext_init (924-926), llama_batch_ext_add_text (955-961), llama_batch_ext_set_output_last (977-977), llama_batch_ext_init_from_embd (934-939)
- src/llama-batch.cpp (8): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_add_text (404-427), llama_batch_ext_add_text (404-410), llama_batch_ext_set_output_last (448-455), llama_batch_ext_set_output_last (448-448), llama_batch_ext_init_from_embd (376-390), llama_batch_ext_init_from_embd (376-381)
examples/lookahead/lookahead.cpp (1)
- include/llama-cpp.h (2): batch (28-28), batch (28-28)
examples/save-load-state/save-load-state.cpp (4)
- include/llama-cpp.h (4): batch (28-28), batch (28-28), tokens (44-57), tokens (44-48)
- include/llama.h (5): llama_decode_ext (1013-1015), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_clear (987-987), llama_batch_ext_add_text (955-961), llama_batch_ext_free (991-991)
- src/llama-context.cpp (2): llama_decode_ext (2777-2781), llama_decode_ext (2777-2779)
- src/llama-batch.cpp (8): llama_batch_ext_get_n_tokens (400-402), llama_batch_ext_get_n_tokens (400-400), llama_batch_ext_clear (457-459), llama_batch_ext_clear (457-457), llama_batch_ext_add_text (404-427), llama_batch_ext_add_text (404-410), llama_batch_ext_free (482-498), llama_batch_ext_free (482-482)
examples/speculative/speculative.cpp (2)
- include/llama.h (6): llama_decode_ext (1013-1015), llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_batch_ext_add_text (955-961), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_free (991-991)
- src/llama-context.cpp (2): llama_decode_ext (2777-2781), llama_decode_ext (2777-2779)
include/llama.h (3)
- include/llama-cpp.h (4): tokens (44-57), tokens (44-48), batch (28-28), batch (28-28)
- src/llama-batch.cpp (2): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372)
- examples/perplexity/perplexity.cpp (1): batch (1470-1470)
common/common.cpp (2)
- include/llama.h (2): llama_encode_ext (1001-1003), llama_decode_ext (1013-1015)
- src/llama-context.cpp (4): llama_encode_ext (2771-2775), llama_encode_ext (2771-2773), llama_decode_ext (2777-2781), llama_decode_ext (2777-2779)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: android-build
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: windows-2019-cmake-cuda (11.7, cuda)
- GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server-windows
🔇 Additional comments (54)
include/llama-cpp.h (2)
27-29
: Well-designed deleter for batch_ext objects.The
llama_batch_ext_deleter
struct follows the standard C++ deleter pattern and properly calls the C API functionllama_batch_ext_free
to release resources when allama_batch_ext
object goes out of scope.
36-67
: Great RAII-compliant smart pointer wrapper.This
llama_batch_ext_ptr
implementation is a well-designed wrapper that:
- Inherits from
std::unique_ptr
with a custom deleter for automatic resource management- Provides convenient factory methods for common initialization patterns
- Encapsulates batch creation and memory management logic
The static factory methods provide clear, readable interfaces for creating batches from different inputs while handling memory management internally.
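For readers skimming the thread, the pattern reduces to roughly this sketch (the real header adds more factory helpers):

struct llama_batch_ext_deleter {
    void operator()(llama_batch_ext * batch) { llama_batch_ext_free(batch); }
};

struct llama_batch_ext_ptr : std::unique_ptr<llama_batch_ext, llama_batch_ext_deleter> {
    llama_batch_ext_ptr() = default;
    explicit llama_batch_ext_ptr(llama_batch_ext * batch) : unique_ptr(batch) {}
};

// usage: the batch is freed automatically when the pointer leaves scope
auto batch = llama_batch_ext_ptr(llama_batch_ext_init(/*n_tokens_alloc=*/512, /*n_seq_max=*/1));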
common/common.cpp (2)
1019-1021
: Good use of the new batch API.The code correctly transitions from manual batch initialization to using the new
llama_batch_ext_ptr
smart pointer and its initialization methods. This improves memory safety while simplifying the code.
1029-1031
: Proper batch handling for decoder.Consistent implementation of the batch API transition for decoder initialization.
examples/simple/simple.cpp (4)
2-2
: Added required header for the new batch API.The inclusion of "llama-cpp.h" is necessary for accessing the
llama_batch_ext_ptr
class and related functionality.
147-147
: Improved batch initialization with smart pointer.The new batch initialization using
llama_batch_ext_ptr::init_from_text
provides cleaner memory management and more explicit parameter handling than the previous approach.
155-157
: Updated loop condition and decoding method.The code correctly transitions to the new API with
llama_batch_ext_get_n_tokens
andllama_decode_ext
. This change ensures proper interaction with the new batch structure.
162-162
: Proper token position tracking.Correctly updated to use
llama_batch_ext_get_n_tokens
for incrementing position counter.examples/lookahead/lookahead.cpp (7)
95-98
: Good RAII batch management with smart pointers.Using
llama_batch_ext_ptr
for both batches ensures proper resource management, and the initialization parameters are correctly specified.Based on your previous feedback, I understand you prefer to keep this part simple for now and plan to add error handling later.
209-213
: Updated batch API usage correctly.The transition from older batch functions to
llama_batch_ext_clear
andllama_batch_ext_add_text
is implemented correctly.
235-239
: Proper sequence ID handling.The code correctly specifies sequence IDs when adding tokens to the batch, which is important for the context management in the lookahead implementation.
250-252
: Correctly updated batch API for token addition.The batch API is used correctly here for adding the lookahead tokens with proper sequence management.
257-259
: Attention to logits flag details.Good attention to detail by setting the logits flag only for the last level of tokens (
j == N - 2
), which helps optimize performance.
263-266
: Good error handling for decode operation.The code properly checks the return value of
llama_decode_ext
and provides a meaningful error message if decoding fails.
483-483
: Proper resource cleanup.The code correctly frees the batch resource at the end of the program.
examples/run/run.cpp (4)
995-995
: Initialization of extended batch looks good.
No concerns.
999-999
: Decoding call checks return value correctly.
This is a best practice to ensure errors are handled promptly.
1004-1005
: Incrementingn_past
is appropriate.
This correctly tracks how many tokens have been processed so far.
1020-1020
: Single-token batch initialization is clearly written.
Helps produce incremental generation.examples/speculative/speculative.cpp (6)
168-173
: Validate non-empty input length before usingn_input - 1
.
Ifn_input
is 0 or 1 (e.g., empty prompt), subtracting 1 risks negative sizing. Would you like a verification script checking that
inp.size()
is at least 2 before these calls?
204-205
: Extended batch initialization is aligned with the new API.
No immediate concerns here.
340-341
: Potential out-of-bounds when accessingdrafts[i].tokens[i_dft]
.
This concern was raised in earlier reviews for exactly the same lines.
478-485
: Introducingbatch_info
struct and storing token drafting info.
This cleanly keeps token ID, position, and sequence IDs together.
489-489
: Clearing the draft batch is straightforward.
No issues found.
651-652
: Freeing both draft & target batches properly.
Looks correct and prevents memory leaks.examples/save-load-state/save-load-state.cpp (5)
51-51
: Extended batch initialization for prompt looks correct.
Usesinit_from_text
with the new structure.
86-86
: Good job checkingllama_decode_ext
here.
Properly handles any decoding errors. Also applies to: 88-89
139-139
: Return code from decode is properly handled.
Consistent with safe error checking. Also applies to: 141-142
221-221
: Immediately checksllama_decode_ext
result.
This consistent approach is good for robust error handling. Also applies to: 223-224
235-235
: Resource freed successfully.
Minimal risk of memory leaks.src/llama-batch.cpp (12)
276-316
: Ensure no double-free occurs if batch is also freed manually.
This constructor creates and owns a newllama_batch_ext
which is later deleted in the destructor. If user code also callsllama_batch_ext_free(batch)
, a double-free may occur. Would you like a verification script to confirm that no code path calls
llama_batch_ext_free()
on thisbatch
pointer?
318-320
: Destructor ownership may cause confusion.
Deletingbatch
unconditionally here is acceptable as long as usage never callsllama_batch_ext_free(batch)
separately. Otherwise, it leads to undefined behavior.
327-329
: No issues with the updated helper function.
This simple helper is in line with the deprecatedllama_batch
usage.
340-370
: Internal initialization logic looks solid.
Allocations fortoken
,embd
, and other fields match the design approach. Null-checks are intentionally omitted as per codebase convention.
372-374
: Straightforward wrapper function.
It delegates tollama_batch_ext_init_impl
with zero embedding dimension.
376-390
: Embeddings initialization function is consistent.
Copies embeddings into a newly allocated batch. Ifembd
orn_embd
is invalid, behavior is undefined, but that aligns with prior design decisions.
392-398
: Position assignment method is concise.
Returns an error if size mismatches. Otherwise, copies positions. No further concerns.
400-403
: Simple getter for batch size.
No issues here – the function correctly returnsbatch->n_tokens
.
429-446
: Output marking logic is acceptable.
Assigninglogits[i] = true
upon finding the matching sequence and position is clear. Document negativepos
semantics (meaning “any position”) if not already stated.
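A sketch of the matching rule as described, assuming pos < 0 acts as the wildcard:

for (int32_t i = 0; i < batch->n_tokens; ++i) {
    for (int32_t j = 0; j < batch->n_seq_id[i]; ++j) {
        if (batch->seq_id[i][j] == seq_id && (pos < 0 || batch->pos[i] == pos)) {
            batch->logits[i] = true; // mark token i for output
            break;
        }
    }
}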
448-455
: Last-token output marker is straightforward.
No further suggestions: marking the final token as output is correct.
457-459
: Clearing n_tokens is sufficient.
This function resets the batch by settingn_tokens = 0
. No concerns here.
482-498
: Free function is consistent with the dynamic allocations.
The pattern of skipping frees for views is correct; user must ensure not to pass a view to any code path that also expects to own memory.include/llama.h (12)
238-239
: Clear deprecation notice is helpful.
Explicitly warning thatllama_batch
will be removed in future promotes correct migration.
263-266
: Forward declaration of llama_batch_ext is good.
This effectively prevents exposure of internal structure details.
921-927
: Introducing llama_batch_ext_init as a replacement is appropriate.
This matches the transition away fromllama_batch
.
928-934
: Embeddings-based init function parallels old API.
Just ensure callers provide valid non-null embeddings.
941-946
: Pos-setting function documentation is clear.
It warns thatn_pos
must match the batch size.
947-948
: Getter for the token count
Straightforward usage, consistent with the rest of the API.
963-968
: Setting output for a sequence ID
No major issues. Negativepos
is accepted as “match any position,” which is consistent with the source.
969-977
: Last output function
Straightforward approach to mark the last token in the batch as output.
986-988
: Clear function
Resets n_tokens to zero. No other concerns.
989-992
: Free function
Same logic as in the CPP file, acknowledging special handling of views.
1001-1003
: New encode_ext
Allowing the new extended batch structure is in line with the migration strategy.
1011-1011
: Minor signature adjustment
Likely just indentation or doc updates. No new issues here.
examples/speculative/speculative.cpp
Outdated
llama_batch_ext_clear(batch_tgt);
for (int i = 0; i < (int) batch_tgt_data.size(); ++i) {
    const auto & data = batch_tgt_data[i];

    llama_batch_ext_add_text(batch_tgt, data.id, data.pos, data.seq_id.data(), data.seq_id.size(), true);
}
Target-side decoding result is unchecked.
Propose capturing the return value to detect and handle errors.
Also applies to: 608-610
examples/speculative/speculative.cpp
Outdated
drafts[s].i_batch_dft = llama_batch_ext_add_text(batch_dft, id, n_past_cur, &s, 1, true);

-if (batch_tgt.n_tokens > n_draft) {
+if (batch_tgt_data.size() > (size_t) n_draft) {
Check the return code from llama_batch_ext_add_text
.
Errors (-1 / -2) are never examined, which can lead to silent failures.
Proposed fix:
drafts[s].i_batch_dft = llama_batch_ext_add_text(batch_dft, id, n_past_cur, &s, 1, true);
-if (batch_tgt_data.size() > (size_t) n_draft) {
+if (drafts[s].i_batch_dft < 0) {
+ LOG_ERR("llama_batch_ext_add_text failed.\n");
+ drafts[s].drafting = false;
+ break;
+} else if (batch_tgt_data.size() > (size_t) n_draft) {
drafts[s].drafting = false;
...
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
drafts[s].i_batch_dft = llama_batch_ext_add_text(batch_dft, id, n_past_cur, &s, 1, true);
if (drafts[s].i_batch_dft < 0) {
    LOG_ERR("llama_batch_ext_add_text failed.\n");
    drafts[s].drafting = false;
    break;
} else if (batch_tgt_data.size() > (size_t) n_draft) {
    drafts[s].drafting = false;
    ...
}
llama_batch_ext_clear(batch.get());
llama_seq_id seq_id = 0;
llama_batch_ext_add_text(batch.get(), next_token, 0, &seq_id, 1, true);
Verify llama_batch_ext_add_text
return.
A negative result indicates an error (e.g., batch size exceeded).
llama_batch_ext_clear(batch.get());
llama_seq_id seq_id = 0;
llama_batch_ext_add_text(batch.get(), next_token, 0, &seq_id, 1, true);
Again, llama_batch_ext_add_text
return is not used.
Silently ignoring errors can cause unexpected behavior.
llama_batch_ext_clear(batch.get());
llama_seq_id seq_id = 1; // seq 1 instead of 0
llama_batch_ext_add_text(batch.get(), next_token, 0, &seq_id, 1, true);
Adding token for seq 1 ignores error code.
Similar risk of silent failure from llama_batch_ext_add_text
.
// Add text tokens to the batch
// Return values:
// -1 : not enough space in the batch
// -2 : embd is already set, cannot add text tokens
// otherwise, returns the output ID
LLAMA_API int32_t llama_batch_ext_add_text(
        struct llama_batch_ext * batch,
        llama_token token,
        llama_pos pos,
        const llama_seq_id * seq_ids,
        size_t n_seq_ids,
        bool output);
Same n_seq_ids out-of-bounds risk applies.
Please note that n_seq_ids
can exceed the allocated sequence capacity, leading to potential OOB when writing batch->seq_id[i][j]
.
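Echoing the hypothetical fix sketched earlier, a guard of this shape would close the gap (the capacity field is an assumption; the struct may not store it yet):

if (n_seq_ids > (size_t) batch->n_seq_max) { // n_seq_max: assumed per-token capacity
    return -3; // new error code: too many sequence IDs for this batch
}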
// Get a "view" from a number of tokens offset | ||
// Return returned batch must be freed with llama_batch_ext_free() | ||
LLAMA_API struct llama_batch_ext * llama_batch_ext_get_view( | ||
struct llama_batch_ext * batch, | ||
int32_t offset, | ||
int32_t n_tokens); | ||
|
View creation lacks boundary checks
Similar to llama_batch_ext_get_view
in the CPP file, offset + n_tokens could go beyond the allocated memory.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
examples/perplexity/perplexity.cpp (1)
933-933
: 💡 Verification agent
🧩 Analysis chain
Validate 'eval_results' indexing.
Ensureir
does not exceedeval_results.size()
to avoid out-of-bounds writes.
🏁 Script executed:
#!/bin/bash
rg -A 10 "eval_results"

Length of output: 11099
Action: Ensure safe indexing for
eval_results
to prevent out-of-bound errors. After reviewing the occurrences of
eval_results
in the file, note that while the vector is resized based oneval_pairs.size()
and the lambda incompute_logprobs
guards against writing past its end, later loops use a sequential index (ir
) to read from it (e.g., in lines likehs_cur.ending_logprob[s] += eval_results[ir++];
). There isn’t an explicit check to confirm thatir
remains within bounds when these accesses occur. Please verify that the total number of reads exactly matches the size ofeval_results
. In addition, consider one or more of the following improvements:
- Add an assertion or boundary check: Before each increment (e.g., before using
eval_results[ir]
), insert a check such asassert(ir < eval_results.size())
to catch any overrun during development.- Revisit the logic that computes the expected number of indices: Ensure that the loops accumulating log probabilities (both for tasks and multi-choice evaluations) are correctly synchronized with the number of logprob values written by
compute_logprobs
. Addressing these points will help prevent possible out-of-bounds accesses and strengthen the code's robustness.
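For example, a guard of the suggested kind at each sequential read (indices as in the surrounding code; requires <cassert>):

assert(ir < eval_results.size() && "eval_results read past end");
hs_cur.ending_logprob[s] += eval_results[ir++];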
🧹 Nitpick comments (3)
include/llama-cpp.h (2)
40-57
: Consider making the tokens parameter const.
The implementation correctly handles positions incrementally from pos0, which aligns with the migration pattern mentioned in the PR objectives. However, the tokens parameter could be declared as const llama_token * since the function doesn't modify the token values.

-    static llama_batch_ext_ptr init_from_text(llama_token * tokens,
+    static llama_batch_ext_ptr init_from_text(const llama_token * tokens,
         int32_t n_tokens,
         llama_pos pos0,
         llama_seq_id seq_id,
         bool output_last) {
59-66
: Consider making the embd parameter const.
Good implementation that directly wraps the C function. However, the embd parameter could be declared as const float * to match the signature of the underlying llama_batch_ext_init_from_embd function (as shown in the relevant snippets).

-    static llama_batch_ext_ptr init_from_embd(float * embd,
+    static llama_batch_ext_ptr init_from_embd(const float * embd,
         size_t n_tokens,
         size_t n_embd,
         llama_pos pos0,
         llama_seq_id seq_id) {

examples/speculative/speculative.cpp (1)
477-483
: Consider documenting the 'batch_info' struct for clarity.
While the struct is mostly self-explanatory, adding a brief overview of how and why it is used would improve maintainability.
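For instance, a minimal doc-comment sketch; the field meanings are inferred from how the struct is used in this PR:

// groups everything needed to submit one drafted token to the target batch
struct batch_info {
    llama_token               id;     // token to evaluate
    llama_pos                 pos;    // absolute position of the token in the context
    std::vector<llama_seq_id> seq_id; // sequences this token belongs to
};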
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/perplexity/perplexity.cpp (19 hunks)
- examples/speculative/speculative.cpp (8 hunks)
- examples/tts/tts.cpp (6 hunks)
- include/llama-cpp.h (1 hunks)
- src/llama-context.cpp (4 hunks)
🔥 Files not summarized due to errors (1)
- examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
🚧 Files skipped from review as they are similar to previous changes (1)
- src/llama-context.cpp
🧰 Additional context used
🧬 Code Definitions (4)
examples/tts/tts.cpp (1)
- include/llama.h (7): llama_batch_ext_init (924-926), llama_batch_ext_add_text (955-961), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_set_output_last (977-977), llama_decode_ext (1013-1015), llama_batch_ext_clear (987-987), llama_batch_ext_free (991-991)
examples/speculative/speculative.cpp (3)
- include/llama.h (6): llama_decode_ext (1013-1015), llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_batch_ext_add_text (955-961), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_free (991-991)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
- src/llama-batch.cpp (10): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_clear (457-459), llama_batch_ext_clear (457-457), llama_batch_ext_add_text (404-427), llama_batch_ext_add_text (404-410), llama_batch_ext_get_n_tokens (400-402), llama_batch_ext_get_n_tokens (400-400), llama_batch_ext_free (482-498), llama_batch_ext_free (482-482)
include/llama-cpp.h (2)
- src/llama-batch.cpp (8): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_add_text (404-427), llama_batch_ext_add_text (404-410), llama_batch_ext_set_output_last (448-455), llama_batch_ext_set_output_last (448-448), llama_batch_ext_init_from_embd (376-390), llama_batch_ext_init_from_embd (376-381)
- include/llama.h (4): llama_batch_ext_init (924-926), llama_batch_ext_add_text (955-961), llama_batch_ext_set_output_last (977-977), llama_batch_ext_init_from_embd (934-939)
examples/perplexity/perplexity.cpp (4)
- src/llama-batch.h (1): batch (99-99)
- include/llama.h (5): llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_decode_ext (1013-1015), llama_get_logits (1053-1053), llama_batch_ext_set_output_last (977-977)
- src/llama-context.cpp (4): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782), llama_get_logits (2415-2419), llama_get_logits (2415-2415)
- src/llama-context.h (3): i (49-49), i (52-52), ctx (157-161)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
- GitHub Check: windows-2019-cmake-cuda (11.7, cuda)
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: macOS-latest-swift (generic/platform=macOS)
- GitHub Check: windows-latest-cmake (kompute-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML...
- GitHub Check: windows-latest-cmake (vulkan-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_...
- GitHub Check: windows-latest-cmake (avx-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX...
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-windows
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: check-requirements
🔇 Additional comments (25)
include/llama-cpp.h (5)
27-29
: LGTM: Deleter for batch_ext follows established patternThe implementation follows the same pattern as other deleter structures in the file, correctly calling
llama_batch_ext_free
for memory cleanup.
36-38
: Clean smart pointer implementationGood implementation of the smart pointer wrapper following the established pattern in the file. The constructors properly initialize the base class.
68-77
: Proper position handling in add_seq methodGood implementation of the sequence addition method. The incremental position handling (
i + pos0
) correctly follows the migration pattern emphasis on proper position handling mentioned in the PR objectives.
79-85
: LGTM: Appropriate handling of vector of sequence IDsThe method correctly uses
seq_id.data()
andseq_id.size()
to pass the vector's data to the underlying C function, and properly handles the output flag.
87-93
: LGTM: Consistent implementation for single sequence ID overloadThis overload maintains consistent behavior with the multi-sequence version and provides a convenient interface for the common case of a single sequence ID.
examples/speculative/speculative.cpp (6)
204-205
: Initialization looks correct.
Creating extended batches with specified capacities for the draft and target contexts appears correct.
489-489
: Clearing the draft batch is appropriate here.
No concerns.
579-579
: Check llama_decode_ext's return value again.
As previously noted, ignoring the decode return code can hide errors. Checking and handling it ensures robust error handling. Also applies to: 584-584, 588-588
600-605
: Check llama_decode_ext's return value for error handling. Also applies to: 608-608
651-652
: Proper cleanup.
Memory is freed for both the draft and target batches, which looks correct.
565-566
:❓ Verification inconclusive
Ensure consistent indexing between drafts and batch_tgt_data.
Because the code dynamically pushes to both data structures, verify that downstream references to these indexes remain valid.Also applies to: 570-570, 572-572
🏁 Script executed:
#!/bin/bash rg "i_batch_tgt|batch_tgt_data" -A 5Length of output: 6928
Review: Confirm Consistent Indexing for Dynamic Data Structures
The code correctly uses
batch_tgt_data.size()
to determine the index that is pushed intodrafts[s].i_batch_tgt
(as seen at lines 565–566) and similar operations at lines 570 and 572. This approach appears to maintain the invariant that each index stored in a draft reliably corresponds to the element's position inbatch_tgt_data
.However, please double-check that any future modifications to how these vectors are updated—such as clearing, copying (notably in the assignment to
drafts[n_seq_cur].i_batch_tgt
), or additional push/pop operations—do not break this consistency. Also, verify that all downstream accesses (e.g., in logging and sampler function calls) correctly account for these indices.examples/perplexity/perplexity.cpp (14)
366-367
: Decode return code is properly handled.
The approach to initialize, clear, add text, and decode with error checking is a good practice.Also applies to: 374-374, 378-378
504-504
: Adaptive batch capacity initialization.
Usingstd::min(n_batch, n_ctx*n_seq)
for batch size is appropriate.
652-656
: Robust error handling in decode_helper.
It's good that you check the return code fromllama_decode_ext
and skipmemcpy
on failure.Also applies to: 659-659
834-834
: Batch initialization for HellaSwag.
No issues found.
849-849
: Clearing the batch.
No issues found.
899-899
: Proper decode helper usage.
You correctly checkdecode_helper
's result.
1115-1115
: Batch initialization for Winogrande.
No issues found.
1134-1134
: Clearing the batch again.
No immediate concerns.
1175-1175
: Proper decode_helper usage.
Return value is checked; no issues.
1338-1338
: Extended batch with 'max_seq'.
No issues found.
1448-1448
: Clearing the batch.
No issues found.
1546-1546
: decode_helper usage for multiple choice tasks.
You check the return value, which is consistent with good error handling.
1733-1733
: Initializes a new batch.
No issues found.
1747-1747
: Batch usage pattern with decode check.
The pattern of clearing, adding text, and checking decode return code is consistent. Also applies to: 1749-1749, 1752-1752
examples/speculative/speculative.cpp
Outdated
llama_batch_ext_clear(batch_dft);
llama_seq_id seq_id = 0;
llama_batch_ext_add_text(batch_dft, token_id, n_past_dft, &seq_id, 1, true);
🛠️ Refactor suggestion
Check llama_decode_ext's return value.
The llama_decode_ext
call’s return code is ignored, which can obscure any runtime errors. Handle the return code to detect and address potential decoding failures.
Also applies to: 452-452
examples/speculative/speculative.cpp
Outdated
 // eval the prompt with both models
-llama_decode(ctx_tgt, llama_batch_get_one( inp.data(), n_input - 1));
-llama_decode(ctx_tgt, llama_batch_get_one(&inp.back(), 1));
-llama_decode(ctx_dft, llama_batch_get_one( inp.data(), n_input));
+auto batch0 = llama_batch_ext_ptr::init_from_text( inp.data(), n_input - 1, 0, 0, true);
+auto batch1 = llama_batch_ext_ptr::init_from_text(&inp.back(), 1, n_input - 1, 0, true);
+auto batch2 = llama_batch_ext_ptr::init_from_text( inp.data(), n_input , 0, 0, true);
+llama_decode_ext(ctx_tgt, batch0.get());
+llama_decode_ext(ctx_tgt, batch1.get());
💡 Verification agent
🧩 Analysis chain
Check input sizes to avoid out-of-bounds errors.
In lines 167 and 168, you subtract 1 from n_input
and pass &inp.back()
, which can lead to out-of-range or invalid pointer references if inp.size() < 2
. Make sure n_input >= 2
is enforced or handle the edge cases gracefully.
🏁 Script executed:
#!/bin/bash
rg "n_input" -A 5
Length of output: 18081
Enforce Minimum Input Size Check in Batch Initialization
There’s a potential out-of-bounds error in lines 167–168 of examples/speculative/speculative.cpp
. The use of n_input - 1
and &inp.back()
assumes that inp
contains at least 2 elements. If inp.size() < 2
, these calls may lead to invalid pointer accesses. Please add a guard (or precondition) to ensure that n_input >= 2
before these calls, or handle the edge case gracefully.
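A hypothetical precondition, placed before batch0/batch1 are built:

if (inp.size() < 2) {
    LOG_ERR("%s: the prompt must contain at least 2 tokens\n", __func__);
    return 1;
}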
Actionable comments posted: 6
🧹 Nitpick comments (16)
examples/save-load-state/save-load-state.cpp (2)
51-55
: Ensure consistent error checking after batch initialization and decoding.
While you call llama_decode_ext at line 54, the return code is not checked to detect potential decoding failures. You do check return values in subsequent calls (e.g., at lines 85, 137, 218). For consistency, consider verifying the return code here as well.

-llama_decode_ext(ctx, batch.get());
+int ret_init_prompt = llama_decode_ext(ctx, batch.get());
+if (ret_init_prompt != 0) {
+    fprintf(stderr, "\n%s : llama_decode_ext() for the prompt failed, ret = %d\n", __func__, ret_init_prompt);
+    return 1; // the batch is freed automatically by llama_batch_ext_ptr
+}
 n_past += llama_batch_ext_get_n_tokens(batch.get());
82-87
: Handle possible failure in the call tollama_decode_ext
.
You've already added a check at line 85, which is good. However, consider logging a more descriptive error message when you detect a decoding failure, to help with troubleshooting.examples/speculative/speculative.cpp (7)
452-452
: Log integer return value fromllama_decode_ext
.
You are correctly checking if the function returned a non-zero result, but consider logging the return code or corresponding error text for better diagnostics.
477-481
: Encapsulatebatch_info
to enhance maintainability.
Adding a dedicated struct is a good approach for grouping token, position, and sequence IDs. You might also consider adding helper methods (e.g., a constructor orvalidate()
) to ensure each field is populated correctly.
483-483
: Initializebatch_tgt_data
capacity.
If the approximate size ofbatch_tgt_data
can be estimated, consider reserving capacity to avoid repeated allocations when pushing back new elements.
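For example (treating n_draft as the upper bound is an assumption):

batch_tgt_data.reserve(n_draft + 1); // drafted tokens plus the initially sampled one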
578-578
: Handle empty draft batch scenario.
At line 578, ifllama_batch_ext_get_n_tokens(batch_dft.get())
returns 0, you break out of the loop. Confirm that you're not missing partial tokens or incorrectly concluding emptiness.
588-588
: Stop condition correctness.
You break ifbatch_tgt_data.size()
exceedsn_draft
. Ensure the partial iteration doesn't skip essential tokens if you only exceed the limit by 1.
601-602
: Leverage standard iteration idioms for clarity.
Currently, you use a for loop from 0 tobatch_tgt_data.size() - 1
. A range-basedfor (auto &data : batch_tgt_data)
might improve readability.
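i.e., with the loop body from the diff quoted earlier in this thread:

for (const auto & data : batch_tgt_data) {
    llama_batch_ext_add_text(batch_tgt, data.id, data.pos, data.seq_id.data(), data.seq_id.size(), true);
}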
608-608
: Log the return status ofllama_decode_ext
for the target batch.
At line 608, you're callingllama_decode_ext
without checking if the return was non-zero. Logging or handling errors would make troubleshooting easier.examples/passkey/passkey.cpp (7)
126-126
: Use separate function or wrapper for batch creation.
When you dollama_batch_ext_init(params.n_batch, 1)
, consider wrapping this in a function (e.g.,init_batch_for_passkey()
) to streamline repeated usage and ensure consistent parameters.
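A sketch of such a wrapper, using the helper name suggested above:

// keeps capacity and sequence settings consistent across call sites
static llama_batch_ext * init_batch_for_passkey(const common_params & params) {
    return llama_batch_ext_init(params.n_batch, 1);
}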
151-151
: Validate usage ofllama_batch_ext_set_output_last
.
You set the output of the last token. Confirm that it matches your intended scenario — i.e., this token is indeed the final token in the chunk you want to generate output for. Otherwise, decoding might produce unexpected results.
154-154
: Log decode failure with more context.
Whenllama_decode_ext(ctx, batch.get())
returns non-zero, log the return code or specify which chunk of tokens triggered the error for easier debugging.
178-181
: Consider combining repeated batch operations.
You again clear (line 178) and add tokens (line 181). If you frequently do this pattern, factor it into a helper function that receives tokens and positions, then sets the output flag if needed.
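A hypothetical helper for that clear-then-refill pattern:

static void refill_batch(llama_batch_ext * batch, const std::vector<llama_token> & tokens,
                         llama_pos pos0, llama_seq_id seq_id) {
    llama_batch_ext_clear(batch);
    for (size_t i = 0; i < tokens.size(); ++i) {
        llama_batch_ext_add_text(batch, tokens[i], pos0 + (llama_pos) i, &seq_id, 1, false);
    }
}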
185-185
: Verify correctness of final token output flag.
Similar to line 151, setting the output of the last token might be unnecessary if the block only contains partial tokens. Verify you truly want to mark this token for output.
188-188
: Propagate the error message.
You log the error but do not print the numeric return code. Consider including it for debugging.
251-251
: Add error code check for the final decode.
Currently, you log a generic error but do not fetch the negative code fromllama_decode_ext
. Logging the code helps with debugging.
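e.g., a sketch of the check (the error message wording is illustrative):

const int ret = llama_decode_ext(ctx, batch.get());
if (ret != 0) {
    LOG_ERR("%s : llama_decode_ext failed, ret = %d\n", __func__, ret);
    return 1;
}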
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- common/speculative.cpp (4 hunks)
- examples/gritlm/gritlm.cpp (5 hunks)
- examples/llava/gemma3-cli.cpp (6 hunks)
- examples/llava/qwen2vl-cli.cpp (2 hunks)
- examples/lookup/lookup.cpp (5 hunks)
- examples/parallel/parallel.cpp (7 hunks)
- examples/passkey/passkey.cpp (6 hunks)
- examples/save-load-state/save-load-state.cpp (5 hunks)
- examples/server/server.cpp (21 hunks)
- examples/simple-chat/simple-chat.cpp (3 hunks)
- examples/simple/simple.cpp (3 hunks)
- examples/speculative-simple/speculative-simple.cpp (5 hunks)
- examples/speculative/speculative.cpp (8 hunks)
- examples/tts/tts.cpp (5 hunks)
- include/llama-cpp.h (2 hunks)
🔥 Files not summarized due to errors (2)
- examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
- examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/simple/simple.cpp
- examples/lookup/lookup.cpp
- examples/speculative-simple/speculative-simple.cpp
🧰 Additional context used
🧬 Code Definitions (10)
examples/tts/tts.cpp (1)
- include/llama.h (5): llama_batch_ext_init (924-926), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_set_output_last (977-977), llama_decode_ext (1013-1015), llama_batch_ext_clear (987-987)
common/speculative.cpp (2)
- include/llama-cpp.h (4): llama_batch_ext_ptr (38-38), llama_batch_ext_ptr (39-39), batch (29-29), batch (29-29)
- include/llama.h (4): llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_batch_ext_get_n_tokens (948-948), llama_decode_ext (1013-1015)
examples/gritlm/gritlm.cpp (2)
- include/llama-cpp.h (6): batch (29-29), batch (29-29), token (70-76), token (70-70), token (79-85), token (79-79)
- include/llama.h (4): llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_decode_ext (1013-1015), llama_batch_ext_get_n_tokens (948-948)
include/llama-cpp.h (2)
- src/llama-batch.cpp (8): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_add_text (404-427), llama_batch_ext_add_text (404-410), llama_batch_ext_set_output_last (448-455), llama_batch_ext_set_output_last (448-448), llama_batch_ext_init_from_embd (376-390), llama_batch_ext_init_from_embd (376-381)
- include/llama.h (4): llama_batch_ext_init (924-926), llama_batch_ext_add_text (955-961), llama_batch_ext_set_output_last (977-977), llama_batch_ext_init_from_embd (934-939)
examples/speculative/speculative.cpp (3)
- include/llama.h (4): llama_decode_ext (1013-1015), llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_batch_ext_get_n_tokens (948-948)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
- src/llama-batch.cpp (6): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_clear (457-459), llama_batch_ext_clear (457-457), llama_batch_ext_get_n_tokens (400-402), llama_batch_ext_get_n_tokens (400-400)
examples/server/server.cpp (6)
- include/llama.h (7): llama_batch_ext_init (924-926), llama_pooling_type (481-481), llama_get_embeddings_seq (1080-1080), llama_batch_ext_clear (987-987), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_set_output_last (977-977), llama_batch_ext_get_view (981-984)
- src/llama-batch.cpp (10): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_clear (457-459), llama_batch_ext_clear (457-457), llama_batch_ext_get_n_tokens (400-402), llama_batch_ext_get_n_tokens (400-400), llama_batch_ext_set_output_last (448-455), llama_batch_ext_set_output_last (448-448), llama_batch_ext_get_view (461-480), llama_batch_ext_get_view (461-464)
- src/llama-batch.h (1): batch (99-99)
- src/llama-context.cpp (4): llama_pooling_type (2368-2370), llama_pooling_type (2368-2368), llama_get_embeddings_seq (2439-2443), llama_get_embeddings_seq (2439-2439)
- src/llama-context.h (9): seq_id (53-53), seq_id (100-100), seq_id (101-101), seq_id (102-102), seq_id (115-120), seq_id (122-126), ctx (157-161), i (49-49), i (52-52)
- common/common.h (1): common_embd_normalize (636-636)
examples/save-load-state/save-load-state.cpp (3)
- include/llama-cpp.h (4): batch (29-29), batch (29-29), tokens (45-58), tokens (45-49)
- include/llama.h (4): llama_decode_ext (1013-1015), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_clear (987-987), llama_batch_ext_free (991-991)
- src/llama-batch.cpp (6): llama_batch_ext_get_n_tokens (400-402), llama_batch_ext_get_n_tokens (400-400), llama_batch_ext_clear (457-459), llama_batch_ext_clear (457-457), llama_batch_ext_free (482-498), llama_batch_ext_free (482-482)
examples/passkey/passkey.cpp (3)
- include/llama-cpp.h (2): batch (29-29), batch (29-29)
- include/llama.h (6): llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_batch_ext_set_output_last (977-977), llama_decode_ext (1013-1015), llama_batch_ext_get_n_tokens (948-948), llama_batch_ext_add_text (955-961)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
examples/simple-chat/simple-chat.cpp (6)
- include/llama-cpp.h (2): batch (29-29), batch (29-29)
- src/llama-batch.h (1): batch (99-99)
- src/llama-kv-cache.h (1): batch (111-111)
- src/llama-batch.cpp (6): llama_batch_ext_get_n_tokens (400-402), llama_batch_ext_get_n_tokens (400-400), llama_batch_ext_clear (457-459), llama_batch_ext_clear (457-457), llama_batch_ext_free (482-498), llama_batch_ext_free (482-482)
- src/llama-context.cpp (4): n_ctx (412-414), n_ctx (412-412), llama_n_ctx (2340-2342), llama_n_ctx (2340-2340)
- src/llama-context.h (1): ctx (157-161)
examples/llava/gemma3-cli.cpp (3)
- include/llama.h (5): llama_batch_ext_init (924-926), llama_batch_ext_clear (987-987), llama_batch_ext_set_output_last (977-977), llama_decode_ext (1013-1015), llama_batch_ext_init_from_embd (934-939)
- src/llama-batch.cpp (8): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_clear (457-459), llama_batch_ext_clear (457-457), llama_batch_ext_set_output_last (448-455), llama_batch_ext_set_output_last (448-448), llama_batch_ext_init_from_embd (376-390), llama_batch_ext_init_from_embd (376-381)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: macOS-latest-swift (generic/platform=macOS)
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: android-build
- GitHub Check: macOS-latest-swift (generic/platform=iOS)
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server (Release)
- GitHub Check: server-windows
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
🔇 Additional comments (79)
examples/gritlm/gritlm.cpp (7)
4-4
: Added necessary header for batch extension support.The inclusion of
llama-cpp.h
provides access to thellama_batch_ext_ptr
smart pointer implementation required for the updated batch processing mechanism.
17-17
: Updated to use smart pointer for better memory management.The change from raw
llama_batch*
tollama_batch_ext_ptr
enhances memory safety by using RAII principles. The smart pointer will automatically handle resource cleanup.
20-20
: Updated batch clearing to use extended API.The change from
common_batch_clear(batch)
tollama_batch_ext_clear(batch.get())
correctly adapts to the new API, properly accessing the underlying pointer with.get()
.
45-45
: Updated token addition to use the new batch API.The migration from
common_batch_add
tobatch.add_text
correctly implements the new API pattern. The conditional expressionj >= n_inst
appropriately determines when tokens should be considered for embedding output.
54-54
: Updated decoding to use extended API.Changing from
llama_decode
tollama_decode_ext
is part of the batch API migration. Note that error checking is omitted as mentioned in previous review discussions.
108-108
: Updated generation batch initialization.The batch initialization in the
generate
function follows the same smart pointer pattern as in theencode
function.
114-114
: Updated batch handling in generation function.The changes consistently apply the batch API migration pattern:
- Using
llama_batch_ext_clear
for batch clearing- Using
batch.add_text
for token addition with position tracking- Using
llama_decode_ext
for decoding- Using
llama_batch_ext_get_n_tokens
to retrieve token countThis ensures proper sequence handling throughout generation.
Also applies to: 119-119, 124-124, 126-126
common/speculative.cpp (8)
17-17
: Updated batch member to use smart pointer.Changed from raw pointer
struct llama_batch * batch
to smart pointerllama_batch_ext_ptr batch
. This improves memory safety by using RAII principles for automatic resource management.
26-26
: Updated batch initialization in constructor.Correctly initializes the smart pointer with
llama_batch_ext_init
, passing appropriate parameters for token capacity and sequence count.
65-73
: Removed manual batch freeing.The function no longer needs to call
llama_batch_free(spec->batch)
because the smart pointer will automatically handle memory cleanup when it goes out of scope or when the object is deleted.
207-212
: Updated first batch operation sequence.The batch clearing and token addition operations have been properly updated to use the extended API pattern:
llama_batch_ext_clear(batch.get())
for clearingbatch.add_text
for adding tokens with appropriate position trackingThis ensures correct token processing in the prompt evaluation phase.
217-221
: Updated token count check and decoding.The changes properly:
- Use
llama_batch_ext_get_n_tokens(batch.get())
instead of direct access tobatch.n_tokens
- Use
llama_decode_ext
with the correct pointer access via.get()
This maintains proper batch handling while transitioning to the new API.
227-229
: Updated batch handling for token processing.The batch clearing and token addition operations following the last accepted token have been properly updated to use the extended API pattern.
234-234
: Updated decode call for the last accepted token.Correctly uses
llama_decode_ext
with proper pointer access via.get()
.
240-271
: Updated batch operations in the token sampling loop.The changes in the draft token generation loop correctly implement the batch API migration pattern:
- Using
llama_batch_ext_clear
for batch clearing between iterations- Using
batch.add_text
for adding new tokens with proper position tracking- Using
llama_decode_ext
for decoding each new tokenThis ensures proper token generation and evaluation in the speculative process.
examples/server/server.cpp (15)
1227-1227
: Updated batch_spec type to use smart pointerThe type has been changed from
llama_batch batch_spec
tollama_batch_ext_ptr batch_spec
, which is likely a unique_ptr wrapper. This is a good change as it provides automatic memory management.
1799-1799
: Updated batch type to use smart pointerSimilar to the
batch_spec
change,batch
is now allama_batch_ext_ptr
instead of allama_batch
. This improvement helps with automatic resource management.
1925-1926
: Changed batch initialization to use the new APIThe initialization now properly uses
reset()
withllama_batch_ext_init()
instead of the previous default initialization. This is consistent with the migration pattern.
1961-1962
: Updated batch initialization with proper sizingThe initialization of the main batch now uses
reset()
with an appropriate size calculation and explicitly sets the sequence limit to 1, which is clearer than the previous implementation.
2096-2097
: Consistent use of batch_spec initializationThis follows the same pattern seen earlier, using
reset()
with proper parameters to initialize the batch for speculative decoding.
2404-2431
: Refactored embedding retrieval to use sequence-based approachThe
send_embedding
method has been significantly refactored to:
- No longer require a batch parameter
- Use sequence IDs for embedding retrieval
- Properly handle pooling type checks
- Handle normalization consistently
Note: There's a temporary limitation with POOLING_TYPE_NONE which is now explicitly handled.
2454-2474
: Refactored rerank to use sequence-based approachSimilar to the embedding changes,
send_rerank
now:
- Takes no batch parameter
- Uses sequence IDs to retrieve embeddings
- Has improved error handling
This is consistent with the new API pattern.
2853-2854
: Updated batch clearing to use new APINow using
llama_batch_ext_clear()
instead of the previous method. This is consistent with the migration to the new batch API.
2877-2878
: Changed token addition method to use new APINow using
batch.add_text()
instead of directly manipulating batch members. This provides a cleaner interface and better encapsulation.
3084-3085
: Updated token addition in prompt processingThe code now uses
batch.add_text()
for adding prompt tokens to the batch, which is consistent with the new API and helps ensure proper memory management.
3111-3112
: Updated output selection with new APINow using
llama_batch_ext_set_output_last()
to mark the last token as the output token, which is cleaner than the previous approach.
3141-3147
: Updated batch processing loop with view-based approachThe batch processing now:
- Uses
llama_batch_ext_get_n_tokens()
to get the token count- Creates a view with
llama_batch_ext_get_view()
for batch processing- Uses
llama_decode_ext()
instead of the previous functionThis approach is more efficient as it processes chunks of the batch without copying data.
3281-3286
: Updated speculative batch processingThe speculative batch processing now:
- Clears the batch with
llama_batch_ext_clear()
- Adds tokens with
batch_spec.add_text()
- Maintains proper position management
The code is more concise and readable than before.
4147-4150
: Added important validation for pooling typeThis new check aborts with an informative error message when the pooling type is NONE, which is not yet supported. This prevents silent failures and communicates limitations clearly.
3094-3095
: Improved logging clarity with new token count methodThe debug logging now uses
llama_batch_ext_get_n_tokens()
to report progress, which is more accurate than the previous approach.examples/simple-chat/simple-chat.cpp (6)
2-2
: Include of "llama-cpp.h" is appropriate.
No issues here.
112-114
: Initialization and tracking ofn_past
.
The usage ofn_past
to store the starting position and then incrementing byllama_batch_ext_get_n_tokens(batch.get())
helps maintain alignment. Ensure that subsequent generate loops consistently trackn_past
, especially if multiple tokens might be appended in one loop iteration.
121-121
: Consider a boundary check for context size.
You might want to verify whether using>= n_ctx
instead of> n_ctx
is more appropriate to avoid off-by-one overflows at the exact boundary.Would you like to confirm that the code works as intended when
n_ctx_used + llama_batch_ext_get_n_tokens(batch.get())
is exactlyn_ctx
?
127-127
: Decode error handling seems sufficient.
Decoding failure is caught, and the program aborts as a fallback. The logic is concise.
151-153
: Clearing the batch before adding the new token.
This pattern is consistent with single-token generation steps, then incrementingn_past
. The approach looks good for iterative token decoding.
156-156
: Explicitly freeing the extended batch.
Releasing resources here is appropriate, ensuring no memory leak.include/llama-cpp.h (4)
`8-8`: **Including `<vector>` is valid.** This is a minimal addition needed for sequence-handling methods.

`28-30`: **Custom deleter for `llama_batch_ext`.** Wrapping `llama_batch_ext_free` in a functor is a neat approach to automatic resource management.

`37-40`: **Introduction of `llama_batch_ext_ptr`.** Defining a specialized smart pointer for extended batches is a clean design that clarifies ownership and cleanup responsibilities.
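For reference, the general shape of that design is the standard `std::unique_ptr` plus custom-deleter idiom; a simplified sketch (the real wrapper in `include/llama-cpp.h` also adds helper methods such as `add_text`):

```cpp
#include <memory>
#include "llama.h"

struct llama_batch_ext_deleter {
    void operator()(llama_batch_ext * batch) { llama_batch_ext_free(batch); }
};

// Owning pointer: the batch is freed exactly once, when the pointer goes out of scope.
typedef std::unique_ptr<llama_batch_ext, llama_batch_ext_deleter> batch_ext_ptr_sketch;
```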
`60-67`: **Creating a batch from embeddings.** Similarly, consider validating that `llama_batch_ext_init_from_embd(...)` returns a non-null pointer. If it fails (e.g., if memory is insufficient), the function currently does not handle the error. Would you like to confirm that an internal check assures non-null returns from `llama_batch_ext_init_from_embd`?

examples/parallel/parallel.cpp (16)
`9-9`: **New include of "llama-cpp.h".** No issues with adding this header for extended batch usage.

`178-178`: **Initializing the extended batch with `n_ctx`.** This creates a single sequence slot of size `n_ctx`. Ensure that `n_ctx >= 1` is enforced elsewhere to avoid out-of-bounds risk.

`196-196`: **Adding system tokens to the batch.** This alignment sets each system token's position to `i`. Looks correct as a simpler approach to building the system prompt.

`199-199`: **Decoding the system prompt in one step.** Error handling is consistent: logs an error on failure, then returns.

`220-220`: **Clearing the batch.** Resetting the batch for the next wave of tokens is consistent with the new approach.

`228-228`: **Capturing batch offset in `client.i_batch`.** This is used later to know how far into the batch the client's tokens reside. Straightforward.

`230-231`: **Adding sampled token with updated position and distinct sequence ID.** Adding one token at a time is valid. The `true` flag for output can ensure you only get logits for that token.

`236-236`: **Checking for an empty batch.** If no tokens remain, the code resets the KV cache for all clients. The logic is consistent with reusing the system prompt.

`248-248`: **Conditional insertion of new sequences.** Re-initializing requests if no tokens exist or if `cont_batching` is specified. Implementation is sound.

`267-268`: **Generating prompt tokens for a new client.** `seq_id` is derived from `client.id + 1`; tokens are offset by `n_tokens_system`. Good logic for concurrency; a sketch follows.
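A hedged sketch of that multi-sequence setup (`clients`, `prompt_tokens`, and `n_tokens_system` are placeholders for the example's own variables):

```cpp
for (size_t k = 0; k < clients.size(); ++k) {
    const llama_seq_id seq_id = (llama_seq_id) (k + 1); // seq 0 is reserved for the system prompt

    const std::vector<llama_token> & toks = clients[k].prompt_tokens;
    for (size_t i = 0; i < toks.size(); ++i) {
        // positions continue after the shared system prompt
        batch.add_text(toks[i], n_tokens_system + (llama_pos) i, seq_id, false);
    }
}
if (batch.n_tokens() > 0) {
    llama_batch_ext_set_output_last(batch.get()); // logits only for the final token
}
```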
`272-273`: **Marking the last token as output.** Ensures logits are extracted for the final token. Correct usage of `llama_batch_ext_set_output_last`.

`279-279`: **Tracking the offset of the just-added token.** `client.i_batch = <batch size> - 1` points to the newly inserted token's position. Straightforward indexing.

`299-299`: **Retrieving total tokens in batch.** Storing it in `n_tokens_in_batch` is a standard approach for chunked decoding.

`300-300`: **Looping over tokens in steps of `n_batch`.** This chunk-based iteration is consistent with partial decoding to avoid memory or time overhead.

`308-308`: **Calculating effective tokens to process.** `std::min(n_batch, n_tokens_in_batch - i)` ensures we don't overshoot.

`310-312`: **Creating a batch view for partial decoding and freeing it afterward.** Splitting the batch into sub-ranges, decoding, then freeing the temporary view is an appropriate pattern. The check on `ret` is also consistent.

examples/save-load-state/save-load-state.cpp (3)
`134-139`: **Double-check sequence handling when re-clearing and re-decoding.** When you clear and re-use `batch` at lines 134-135, ensure that the changed sequence or context is correctly accounted for during subsequent decodes (line 137). If the context depends on progressive accumulation of tokens, clearing too prematurely might reset needed state tokens. Confirm that this is intended.

`215-220`: **Evaluate positional consistency for multi-sequence decoding.** At lines 215-216, you switch the sequence ID from 0 to 1 when adding the token. Ensure that the new sequence ID is reflected appropriately in your `n_past` logic and subsequent decode calls to avoid mismatched positions between the two sequences.

`232-232`: **Avoid double-free risk.** You invoke `llama_batch_ext_free(batch.get())` at line 232. Verify that this pointer is not freed or reset in any other block (like lines 87, 139, or 220). Multiple frees of the same pointer would cause undefined behavior.
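One way to sidestep that risk, assuming `batch` is the `llama_batch_ext_ptr` smart pointer (this follows standard `std::unique_ptr` semantics and is a sketch, not current PR code):

```cpp
// Instead of: llama_batch_ext_free(batch.get());  // the destructor would free again
batch.reset(); // frees exactly once and nulls the pointer; the later destructor run is a no-op
```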
examples/speculative/speculative.cpp (9)

`6-6`: **Confirm correct include order.** Placing `"llama-cpp.h"` after `"llama.h"` is fine, but ensure there are no forward dependency issues or macros needed from `llama.h` before `llama-cpp.h`.

`169-174`: **Check for off-by-one errors in prompt split.** You split the prompt tokens into `batch0` (`n_input - 1`) and `batch1` (`&inp.back()`). If `inp.size()` is 1, `n_input - 1` becomes 0, which might risk an empty initialization. Confirm there are always at least 2 tokens.

`205-206`: **Favor descriptive naming for batch pointers.** You name them `batch_dft` and `batch_tgt` for the draft and target contexts, which is clear. Make sure that each is used only in the correct context to avoid confusion in multi-model usage.

`447-448`: **Validate token positions when clearing and re-adding.** Clearing `batch_dft` at line 447 resets its existing tokens. Ensure that re-adding a token at line 448 with `pos = n_past_dft` does not conflict with prior tokens if re-adding them was intended.

`485-485`: **Double-check index usage for the first token.** You push the first token from `drafts[0].tokens[0]` at line 485. Ensure `drafts[0].tokens` is not empty. If there's a chance the array could be empty, add a safety check to avoid out-of-bounds errors.

`489-489`: **Clarify the reason for clearing `batch_dft`.** Before you loop over `n_draft` tokens, you clear `batch_dft` again. Ensure you don't lose essential context tokens. If the intent is a fresh start for each iteration, this usage is valid.

`566-567`: **Validate target batch data indexing.** At line 566, you push back a new index into `i_batch_tgt`; at line 567, you push a new struct into `batch_tgt_data`. Confirm that only one new element is appended, and that the index is correct for the newly appended item.

`584-584`: **Validate decoding result in the next loop iteration.** After calling `llama_decode_ext` at line 584, the next iteration of the loop increments `n_past_cur`. If decoding fails or partially processes tokens, confirm you handle it appropriately in subsequent logic.

`600-600`: **Clarify rationale for clearing `batch_tgt`.** Similar to `batch_dft`, you clear `batch_tgt` at line 600. Double-check that any partial tokens aren't lost before you loop through `batch_tgt_data`.
examples/passkey/passkey.cpp (3)

`5-5`: **Include guard check.** Adding `"llama-cpp.h"` after `"llama.h"` is fine, but ensure `"llama-cpp.h"` does not require macros or declarations only found in `llama.h`. If so, reorder or update includes.

`144-147`: **Confirm the loop variable updates for chunked token adds.** Here, you're clearing the batch (line 144) and then re-adding tokens (line 147). Ensure that `n_past++` remains accurate across iterations, especially if partial tokens remain in the batch from prior steps.

`227-227`: **Prevent negative index usage.** You pass `llama_batch_ext_get_n_tokens(batch.get()) - 1` to `llama_sampler_sample`. If no tokens exist in `batch`, this becomes -1, which can cause issues. Ensure there's at least one token; a defensive sketch follows.
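A defensive sketch of that guard (the surrounding control flow is illustrative):

```cpp
const int32_t n_in_batch = llama_batch_ext_get_n_tokens(batch.get());
if (n_in_batch > 0) {
    new_token_id = llama_sampler_sample(smpl, ctx, n_in_batch - 1);
} else {
    break; // nothing was queued this step; avoid sampling at index -1
}
```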
examples/llava/gemma3-cli.cpp (8)

`8-8`: **New dependency added for extended batch API.** This include is necessary for accessing the extended batch implementation and smart pointer definitions.

`67-67`: **Migration to extended batch API with smart pointer.** The type has been updated from `llama_batch` to `llama_batch_ext_ptr`, which is an appropriate change for the migration pattern. The use of a smart pointer will help with automatic memory management.

`77-77`: **Proper initialization of extended batch smart pointer.** The initialization now uses `reset()` with `llama_batch_ext_init()`, correctly setting up the extended batch with the specified batch size and 1 sequence.

`93-96`: **Consistent batch clearing and token addition using extended API.** The code has been properly migrated from using `common_batch_clear` and `common_batch_add` to their extended API equivalents. The sequence ID is consistently maintained as 0.

`98-99`: **Output logits flag correctly set.** The migration from direct logits access to using `llama_batch_ext_set_output_last` is appropriate for the extended batch API.

`101-101`: **Updated decode function for extended batch API.** The function call has been correctly updated from `llama_decode` to `llama_decode_ext`, maintaining the same error checking pattern.

`150-151`: **Image batch initialization migrated to extended API.** The image batch creation has been properly updated to use `llama_batch_ext_init_from_embd`. The position handling with `ctx.n_past` is maintained correctly, and the sequence ID is consistently set to 0.
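Putting the pieces together, a hedged sketch of the image-embedding path with the null check suggested elsewhere in this review (variable names follow gemma3-cli; the check itself is our addition):

```cpp
llama_batch_ext_ptr batch_img(llama_batch_ext_init_from_embd(
        image_embd_v.data(), n_tokens, n_embd, ctx.n_past, /*seq_id*/ 0));
if (!batch_img) {
    LOG_ERR("failed to create embedding batch (n_tokens=%d, n_embd=%d)\n", n_tokens, n_embd);
    return 1;
}
if (llama_decode_ext(ctx.lctx, batch_img.get())) {
    LOG_ERR("failed to decode image embeddings\n");
    return 1;
}
ctx.n_past += n_tokens; // each embedding vector occupies one position
```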
`181-183`: **Response generation updated to use extended batch API.** The token generation logic has been correctly migrated to use the extended batch API, maintaining the same position handling with `ctx.n_past++` and sequence ID of 0.

include/llama-cpp.h (Outdated)

```cpp
// Wrapper to add a single token to the batch (single sequence ID)
int32_t add_text(llama_token token, llama_pos pos, llama_seq_id seq_id, bool output_last) {
    int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
    if (output_last) {
        llama_batch_ext_set_output_last(this->get());
    }
    return output_id;
}
};
```
🛠️ Refactor suggestion
Single sequence ID addition.
The same concern about ignoring negative return values applies.
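A free-standing sketch of a checked variant (our suggestion, not current PR code): propagate the negative error code instead of discarding it.

```cpp
static int32_t add_text_checked(llama_batch_ext * batch, llama_token token, llama_pos pos,
                                llama_seq_id seq_id, bool output_last) {
    const int32_t output_id = llama_batch_ext_add_text(batch, token, pos, &seq_id, 1, false);
    if (output_id < 0) {
        return output_id; // e.g. batch full; the caller decides how to recover
    }
    if (output_last) {
        llama_batch_ext_set_output_last(batch);
    }
    return output_id;
}
```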
```cpp
for (int i = 0; i < (int) batch_tgt_data.size(); ++i) {
    const auto & data = batch_tgt_data[i];

    batch_tgt.add_text(data.id, data.pos, data.seq_id, true);
```
🛠️ Refactor suggestion
**Check each `add_text` call for success.** Similar to prior calls, `batch_tgt.add_text` can fail. Consider capturing and verifying its return value.
```diff
-    drafts[s].i_batch_dft = batch_dft.n_tokens;
-    common_batch_add(batch_dft, id, n_past_cur, { s }, true);
+    drafts[s].i_batch_dft = batch_dft.add_text(id, n_past_cur, s, true);
```
🛠️ Refactor suggestion
**Check return value from `batch_dft.add_text`.** `batch_dft.add_text` may return an error code (negative values). Ignoring this could result in silent failures if the token cannot be added.
```diff
 drafts[s].i_batch_dft = batch_dft.add_text(id, n_past_cur, s, true);
+if (drafts[s].i_batch_dft < 0) {
+    LOG_ERR("%s : batch_dft.add_text failed for seq %d\n", __func__, s);
+    drafts[s].drafting = false;
+    break;
+}
```
examples/passkey/passkey.cpp (Outdated)

```diff
     llama_batch_ext_clear(batch.get());

     // push this new token for next evaluation
-    common_batch_add(batch, new_token_id, n_past++, { 0 }, true);
+    llama_seq_id seq_id = 0;
+    llama_batch_ext_add_text(batch.get(), new_token_id, n_past++, &seq_id, 1, true);
```
🛠️ Refactor suggestion
**Align `seq_id` usage with multi-sequence logic.** At lines 244-245, you create a single `seq_id = 0` and add a token with `output = true`. In multi-sequence scenarios, you might need multiple `seq_id`s. Confirm this usage.
Actionable comments posted: 4

🔭 Outside diff range comments (1)

examples/perplexity/perplexity.cpp (1)

`1734-1765`: 🛠️ Refactor suggestion: **Evaluate reusability of the batch object across multiple loops.** Re-initializing a new `llama_batch_ext_ptr batch` in each loop can be expensive. Review whether you can reuse a single batch, calling `batch.clear()` each time, and only re-init if capacity needs to change.

♻️ Duplicate comments (6)

examples/run/run.cpp (1)

`998-999`: ⚠️ Potential issue: **Handle the return value of `check_context_size`.** Currently, its return value is ignored, which could let decoding proceed even when the context limit is exceeded. Apply this diff to check the context size properly:

```diff
-    check_context_size(llama_data.context, batch);
-    if (llama_decode_ext(llama_data.context.get(), batch.get())) {
+    if (check_context_size(llama_data.context, batch) != 0) {
+        return 1; // or handle the context overflow error
+    }
+    if (llama_decode_ext(llama_data.context.get(), batch.get())) {
```

examples/speculative/speculative.cpp (5)

`169-174`: ⚠️ Potential issue: **Enforce minimum input size check in batch initialization.** There's a potential out-of-bounds error in these lines: the use of `n_input - 1` and `&inp.back()` assumes that `inp` contains at least 2 elements. If `inp.size() < 2`, these calls may lead to invalid pointer accesses. Add a guard to ensure that `n_input >= 2` before these calls, or handle the edge case gracefully.

`570-570`: ⚠️ Potential issue: **Check return value from `batch_dft.add_text`.** `batch_dft.add_text` may return an error code (negative values). Ignoring this could result in silent failures if the token cannot be added.

```diff
 drafts[s].i_batch_dft = batch_dft.add_text(id, n_past_cur, s, true);
+if (drafts[s].i_batch_dft < 0) {
+    LOG_ERR("%s : batch_dft.add_text failed for seq %d\n", __func__, s);
+    drafts[s].drafting = false;
+    break;
+}
```

`584-585`: ⚠️ Potential issue: **Check the return code from `llama_decode_ext`.** The return value of `llama_decode_ext` is ignored, which can lead to silent failures if decoding fails. Add error handling for the decode operation:

```diff
-llama_decode_ext(ctx_dft, batch_dft.get());
+int ret_decode = llama_decode_ext(ctx_dft, batch_dft.get());
+if (ret_decode != 0) {
+    LOG_ERR("%s: llama_decode_ext failed with error code %d\n", __func__, ret_decode);
+    return 1;
+}
```

`604-604`: ⚠️ Potential issue: **Check each `add_text` call for success.** Similar to prior calls, `batch_tgt.add_text` can fail. Consider capturing and verifying its return value.

```diff
-batch_tgt.add_text(data.id, data.pos, data.seq_id, true);
+int ret = batch_tgt.add_text(data.id, data.pos, data.seq_id, true);
+if (ret < 0) {
+    LOG_ERR("%s: batch_tgt.add_text failed with error code %d\n", __func__, ret);
+    // Handle error appropriately
+}
```

`608-608`: ⚠️ Potential issue: **Check the return code from `llama_decode_ext`.** The target-side decoding result is unchecked, which could hide potential errors.

```diff
-llama_decode_ext(ctx_tgt, batch_tgt.get());
+int ret_decode = llama_decode_ext(ctx_tgt, batch_tgt.get());
+if (ret_decode != 0) {
+    LOG_ERR("%s: target llama_decode_ext failed with error code %d\n", __func__, ret_decode);
+    return 1;
+}
```

🧹 Nitpick comments (5)

examples/perplexity/perplexity.cpp (3)

`366-379`: **Promote consistency checks for `batch.add_text` return values.** Although `llama_decode_ext` error handling is present, `batch.add_text` can also fail (e.g., exceeding capacity). Consider checking its return status for additional resilience. You can wrap `batch.add_text(...)` in a small helper that logs or handles negative return codes, like:

```diff
+    int err = batch.add_text(tokens[batch_start + i], j*n_batch + i, 0, true);
+    if (err < 0) {
+        LOG_ERR("batch.add_text failed with code %d\n", err);
+        return ...;
+    }
```

`652-660`: **Provide more context in decode error logging.** Currently, only the failure code `ret` is logged. Add relevant context, e.g., the function name or batch details, to make troubleshooting easier.

```diff
-    LOG_ERR("failed to decode the batch, ret = %d\n", ret);
+    LOG_ERR("decode_helper: llama_decode_ext failed, ret = %d; n_outputs=%zu, n_vocab=%d\n",
+            ret, n_outputs, n_vocab);
```

`833-900`: **Use consistent batch naming to avoid confusion.** You create a new `batch` for image-like data but still call it the same name as in text-based usage. For clarity, consider naming it `batch_img` or similar, especially if you're repeatedly switching between text-based and other-based batch usage.

examples/save-load-state/save-load-state.cpp (2)

`82-87`: **Safeguard batch usage after decode failures.** If `llama_decode_ext` fails, the code immediately frees the batch. Consider setting `batch` to `nullptr` or re-initializing it after this block to avoid potential misuse in subsequent steps.

`232-232`: **Freeing the batch.** You manually free `batch` but keep using `batch.get()` earlier. Double-check you're not calling it again. Consider RAII usage or destruct the pointer to avoid confusion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- common/speculative.cpp (4 hunks)
- examples/gritlm/gritlm.cpp (5 hunks)
- examples/llava/gemma3-cli.cpp (6 hunks)
- examples/lookup/lookup.cpp (5 hunks)
- examples/parallel/parallel.cpp (7 hunks)
- examples/passkey/passkey.cpp (6 hunks)
- examples/perplexity/perplexity.cpp (19 hunks)
- examples/run/run.cpp (4 hunks)
- examples/save-load-state/save-load-state.cpp (5 hunks)
- examples/server/server.cpp (21 hunks)
- examples/simple-chat/simple-chat.cpp (3 hunks)
- examples/simple/simple.cpp (3 hunks)
- examples/speculative-simple/speculative-simple.cpp (5 hunks)
- examples/speculative/speculative.cpp (8 hunks)
- examples/tts/tts.cpp (5 hunks)
- include/llama-cpp.h (2 hunks)
🔥 Files not summarized due to errors (1)
- examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
🚧 Files skipped from review as they are similar to previous changes (4)
- examples/simple/simple.cpp
- examples/lookup/lookup.cpp
- examples/passkey/passkey.cpp
- include/llama-cpp.h
🧰 Additional context used
🧠 Learnings (1)
examples/speculative-simple/speculative-simple.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-21T10:48:55.882Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.
🧬 Code Definitions (11)
common/speculative.cpp (5)
- include/llama-cpp.h (4): llama_batch_ext_ptr (38-38), llama_batch_ext_ptr (39-39), batch (29-29), batch (29-29)
- include/llama.h (2): llama_batch_ext_init (924-926), llama_decode_ext (1013-1015)
- src/llama-batch.cpp (2): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
- src/llama-batch.h (1): batch (99-99)

examples/simple-chat/simple-chat.cpp (4)
- include/llama-cpp.h (2): batch (29-29), batch (29-29)
- src/llama-context.cpp (4): llama_kv_self_used_cells (2522-2524), llama_kv_self_used_cells (2522-2522), llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
- src/llama-context.h (1): ctx (157-161)
- src/llama-batch.cpp (2): llama_batch_ext_free (482-498), llama_batch_ext_free (482-482)

examples/gritlm/gritlm.cpp (5)
- include/llama-cpp.h (6): batch (29-29), batch (29-29), token (70-76), token (70-70), token (79-85), token (79-79)
- src/llama-batch.h (1): batch (99-99)
- include/llama.h (3): llama_batch_ext_init (924-926), llama_n_batch (468-468), llama_decode_ext (1013-1015)
- src/llama-context.cpp (4): llama_n_batch (2344-2346), llama_n_batch (2344-2344), llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
- src/llama-context.h (3): ctx (157-161), i (49-49), i (52-52)

examples/parallel/parallel.cpp (7)
- examples/perplexity/perplexity.cpp (1): batch (1468-1468)
- include/llama-cpp.h (2): batch (29-29), batch (29-29)
- src/llama-batch.h (1): batch (99-99)
- src/llama-context.cpp (6): n_ctx (412-414), n_ctx (412-412), llama_decode_ext (2780-2784), llama_decode_ext (2780-2782), n_batch (420-422), n_batch (420-420)
- src/llama-context.h (3): i (49-49), i (52-52), ctx (157-161)
- src/llama-model.h (1): params (389-392)
- src/llama-vocab.cpp (2): n_tokens (2751-2753), n_tokens (2751-2751)

examples/speculative-simple/speculative-simple.cpp (5)
- include/llama-cpp.h (2): batch (29-29), batch (29-29)
- include/llama.h (3): llama_decode_ext (1013-1015), llama_batch_ext_init (924-926), llama_n_batch (468-468)
- src/llama-context.cpp (4): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782), llama_n_batch (2344-2346), llama_n_batch (2344-2344)
- examples/speculative/speculative.cpp (1): batch_tgt (206-206)
- src/llama-batch.cpp (2): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372)

examples/server/server.cpp (2)
- include/llama.h (3): llama_batch_ext_init (924-926), llama_batch_ext_set_output_last (977-977), llama_batch_ext_get_view (981-984)
- src/llama-batch.cpp (6): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_set_output_last (448-455), llama_batch_ext_set_output_last (448-448), llama_batch_ext_get_view (461-480), llama_batch_ext_get_view (461-464)

examples/perplexity/perplexity.cpp (1)
- include/llama.h (4): llama_batch_ext_init (924-926), llama_decode_ext (1013-1015), llama_get_logits (1053-1053), llama_batch_ext_set_output_last (977-977)

examples/save-load-state/save-load-state.cpp (5)
- include/llama-cpp.h (4): batch (29-29), batch (29-29), tokens (45-58), tokens (45-49)
- src/llama-batch.h (1): batch (99-99)
- include/llama.h (2): llama_decode_ext (1013-1015), llama_batch_ext_free (991-991)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
- src/llama-batch.cpp (2): llama_batch_ext_free (482-498), llama_batch_ext_free (482-482)

examples/run/run.cpp (3)
- include/llama-cpp.h (4): batch (29-29), batch (29-29), tokens (45-58), tokens (45-49)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
- include/llama.h (1): llama_decode_ext (1013-1015)

examples/speculative/speculative.cpp (4)
- include/llama.h (4): llama_decode_ext (1013-1015), llama_batch_ext_init (924-926), llama_n_batch (468-468), llama_kv_self_seq_rm (673-677)
- src/llama-context.cpp (6): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782), llama_n_batch (2344-2346), llama_n_batch (2344-2344), llama_kv_self_seq_rm (2544-2550), llama_kv_self_seq_rm (2544-2548)
- src/llama-batch.cpp (2): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372)
- src/llama-context.h (2): i (49-49), i (52-52)

examples/llava/gemma3-cli.cpp (3)
- include/llama.h (4): llama_batch_ext_init (924-926), llama_batch_ext_set_output_last (977-977), llama_decode_ext (1013-1015), llama_batch_ext_init_from_embd (934-939)
- src/llama-batch.cpp (6): llama_batch_ext_init (372-374), llama_batch_ext_init (372-372), llama_batch_ext_set_output_last (448-455), llama_batch_ext_set_output_last (448-448), llama_batch_ext_init_from_embd (376-390), llama_batch_ext_init_from_embd (376-381)
- src/llama-context.cpp (2): llama_decode_ext (2780-2784), llama_decode_ext (2780-2782)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: android-build
- GitHub Check: windows-2019-cmake-cuda (11.7, cuda)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server (Release)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server-windows
🔇 Additional comments (77)
examples/gritlm/gritlm.cpp (7)
`4-4`: **Appropriate header inclusion for the new batch API.** The addition of `llama-cpp.h` provides the necessary smart pointer wrapper for `llama_batch_ext` structures.

`17-17`: **Good migration to batch extension API.** The change from raw `llama_batch` to smart pointer `llama_batch_ext_ptr` improves memory management. The initialization parameter count differs from the original API (removed the middle parameter), which is part of the intended API change.

`20-20`: **Properly migrated batch operations to object methods.** The migration from C-style function calls (`common_batch_clear`, `common_batch_add`) to C++ object methods (`batch.clear()`, `batch.add_text()`) makes the code more consistent with modern C++ practices. Also applies to: 45-45

`54-54`: **Decoding function update without error handling.** Migration from `llama_decode` to `llama_decode_ext` is correct. Note that as discussed in previous comments, error handling will be implemented in a future update. Also applies to: 124-124

`108-108`: **Consistent batch initialization pattern.** The pattern for initializing the second batch with `llama_batch_ext_ptr` is consistent with the earlier instance, maintaining the same approach throughout the file.

`119-119`: **Position tracking handled correctly in token addition.** The token position is properly incremented with `i_current_token++` when adding tokens to the batch, which is crucial for maintaining correct context in the sequence.

`126-126`: **Method-based access for batch properties.** Changed from direct member access via pointer (`batch->n_tokens`) to method call (`batch.n_tokens()`), which is appropriate for the object-oriented approach of the new API.

common/speculative.cpp (7)
`17-17`: **Good migration to smart pointer batch implementation.** Changed from raw `llama_batch` to `llama_batch_ext_ptr` for improved memory management, consistent with the broader API migration.

`26-26`: **Proper initialization of batch extension structure.** The initialization syntax using `llama_batch_ext_ptr(llama_batch_ext_init(...))` correctly wraps the C-style allocation in a smart pointer for automatic cleanup.

`65-73`: **Removed manual batch cleanup.** The manual call to `llama_batch_free(spec->batch)` has been removed since the smart pointer `llama_batch_ext_ptr` now handles cleanup automatically through its destructor.

`207-207`: **Consistently migrated batch clearing operations.** All instances of `common_batch_clear(batch)` have been properly converted to the object method `batch.clear()`. Also applies to: 227-227, 240-240

`211-211`: **Position tracking correctly maintained in token addition.** The positions used in `batch.add_text()` calls are correctly incremented throughout the code:

- Line 211: uses relative position `i - i_start`
- Line 228: uses `n_past` after processing the prompt
- Line 267: uses `n_past + i + 1` in the draft generation loop, properly incrementing positions

This ensures tokens maintain their correct sequential positions, which is critical for proper context handling. Also applies to: 228-228, 267-267

`217-217`: **Method-based access for token count.** Changed from direct member access (`batch.n_tokens`) to method call (`batch.n_tokens()`), consistent with the object-oriented approach of the new API.

`220-220`: **Consistent decoding function migration.** All instances of `llama_decode` have been properly updated to `llama_decode_ext` with the appropriate `.get()` accessor to obtain the raw pointer from the smart pointer wrapper. Also applies to: 234-234, 270-270

examples/server/server.cpp (21)

`1227-1227`: **Data type migration from `llama_batch` to smart pointer.** The type for the `batch_spec` member has been changed from `llama_batch` to `llama_batch_ext_ptr`, which is a smart pointer to the extended batch structure. This is part of the broader migration to the new batch API.

`1799-1799`: **Data type migration for server context batch.** Similar to the slot change, the `batch` member in `server_context` has been updated to use the new `llama_batch_ext_ptr` type, following the migration to the extended batch API.

`1925-1926`: **Updated batch initialization using extended API.** The initialization approach has changed from direct struct initialization to using the `llama_batch_ext_init` function that returns a managed pointer to the batch structure. This provides better memory management and aligns with the new batch API.

`1961-1962`: **Proper batch initialization with appropriate sizing.** Good implementation of batch initialization using the extended API. The code correctly passes the maximum of `n_batch` and `params_base.n_parallel` to ensure sufficient capacity, along with a value of 1 for `n_seq_max` since only a single sequence ID per token is needed here.

`2096-2097`: **Consistent allocation pattern for speculative decoding.** The speculative batch allocation consistently follows the same pattern as other batch allocations, using `llama_batch_ext_init` with appropriate sizing parameters.

`2404-2414`: **Method signature update for send_embedding.** The method signature has been updated to remove the `batch` parameter since the embedding data is now accessed directly from the batch associated with the slot.

`2454-2469`: **Method signature and implementation update for send_rerank.** Similar to `send_embedding`, this method has been updated to remove the `batch` parameter and use sequence IDs from the slot, which is consistent with the new batch API approach.

`2853-2854`: **Using clear() method instead of deprecated common_batch_clear.** The code now properly uses the batch object's clear method instead of the deprecated `common_batch_clear` function, consistent with the migration to the extended batch API.

`2875-2878`: **Position handling in token processing.** The implementation correctly maintains the token position within the batch and adds tokens with their appropriate positions and sequence IDs.

`2881-2887`: **Proper position management for speculative batch.** The code correctly initializes and adds tokens to the speculative batch with their appropriate positions, maintaining the context integrity needed for proper inference.

`3080-3085`: **Improved token addition with position tracking.** The token addition loop now correctly uses the slot's tracked position and sequence ID when adding tokens to the batch, ensuring proper position handling.

`3094-3096`: **Updated logging message with batch metrics.** The logging message has been updated to reference the batch token count using the new access pattern `batch.n_tokens()` instead of direct field access, consistent with the new API.

`3111-3115`: **Using extended batch API for output handling.** The code now uses `llama_batch_ext_set_output_last` to mark the last token in the batch as output, and correctly tracks the position within the batch, which is essential for proper inference.

`3141-3147`: **Proper batch view management for processing.** The implementation now correctly creates views into the batch using `llama_batch_ext_get_view` for processing chunks of tokens within the decode loop, allowing efficient handling of large batches.

`2894-2894`: **Consistent check for batch processing condition.** The condition now uses `batch.n_tokens()` to check the batch size, which is consistent with the extended batch API. This ensures continuation of batch processing when appropriate.

`3126-3126`: **Batch size check using the new API.** The code correctly uses `batch.n_tokens()` to check if there are tokens to decode, consistent with the new batch API. This ensures processing only happens when there are tokens available.

`3131-3131`: **Updated logging with batch size.** The debugging log now correctly uses `batch.n_tokens()` to report the number of tokens in the batch, which is consistent with the extended batch API.

`3281-3286`: **Speculative batch preparation with proper positioning.** The speculative batch is correctly prepared by clearing the previous batch and adding tokens with their appropriate positions and sequence IDs, which is essential for accurate speculative decoding.

`3288-3288`: **Updated debugging log for speculative batch.** The speculative batch debugging log correctly uses `batch_spec.n_tokens()` to report the size of the batch, consistent with the extended batch API.

`3147-3147`: **Proper batch processing using extended API.** The code now correctly uses `llama_decode_ext` with the batch view, ensuring efficient processing of tokens in manageable chunks.

`3290-3290`: **Direct use of speculative batch with extended API.** The speculative decoding now directly uses `llama_decode_ext` with the speculative batch, which is consistent with the migration to the extended batch API.

examples/simple-chat/simple-chat.cpp (6)

`2-2`: **Include extended batch methods from llama-cpp.h.** This new include header is necessary to access `llama_batch_ext` functionality and related APIs.

`112-114`: **Initialize and track `n_past` properly.** Storing the initial `n_past = 0` and updating it after creating the batch is correct. This ensures the position is tracked consistently throughout the process.

`121-121`: **Check the updated condition with `batch.n_tokens()`.** Switching to `batch.n_tokens()` ensures accurate token count retrieval. This is an appropriate adjustment matching the new API.

`127-127`: **Switch to extended decode function.** Replacing `llama_decode` with `llama_decode_ext` aligns with the new extended batch system. The existing error handling (abort on failure) remains consistent with the current style.

`151-153`: **Clear and add the single sampled token.** Re-initializing the batch with `clear()` and adding only the new token ensures correct incremental token feeding. Incrementing `n_past` each time looks correct.

`156-156`: **Free the extended batch.** Calling `llama_batch_ext_free(batch.get())` finalizes resource cleanup properly. Ensure no subsequent usage of `batch` occurs after this point.

examples/parallel/parallel.cpp (13)

`9-9`: **Include `llama-cpp.h` for extended batch API.** This header is needed for the new `llama_batch_ext` and `llama_decode_ext` functions.

`178-178`: **Initialize extended batch with single sequence capacity.** Creating a batch with `llama_batch_ext_init(n_ctx, 1)` is valid for the system prompt. Verify that `1` is sufficient, given that you replicate the system prompt to multiple client sequences.

`220-220`: **Clear the batch before adding new tokens.** Invoking `batch.clear()` is the correct approach to reset batch state between iterations.

`228-228`: **Store the current batch token count.** Storing `batch.n_tokens()` into `client.i_batch` helps reference the correct position for subsequent decoding logic.

`230-231`: **Add new token to the batch with the correct position.** Using `n_tokens_system + client.n_prompt + client.n_decoded` for the position and setting the sequence ID to `client.id + 1` maintains proper offset and ownership of tokens.

`236-236`: **Handle empty batches by resetting KV cache.** Checking `if (batch.n_tokens() == 0)` to clear the KV cache ensures stale data does not linger.

`248-248`: **Allow new sequences when empty or in continuous batching mode.** This condition ensures new requests can be inserted seamlessly and prevents idle cycles.

`267-268`: **Add initial prompt tokens for each client sequence.** Using `seq_id = client.id + 1` and adding tokens with `batch.add_text` properly differentiates each client's sequence from the system prompt.

`272-272`: **Set output extraction for the last token only if batch not empty.** This helps collect logits for the final token in the batch, preventing unnecessary overhead when the batch is empty.

`279-279`: **Update client batch index.** Storing `batch.n_tokens() - 1` aligns with the zero-based index for the token just added.

`299-300`: **Retrieve batch size and iterate in chunks.** Retrieving `n_tokens_in_batch = batch.n_tokens()` and looping over it with `i += n_batch` correctly processes the tokens in sub-batches.

`308-308`: **Compute the current chunk size safely.** `std::min` ensures we only process up to the remaining tokens in the batch.

`310-312`: **Create and free a batch view for partial decoding.** Obtaining a subrange via `llama_batch_ext_get_view` and then decoding ensures incremental processing. Freeing it immediately avoids memory leaks.

examples/speculative-simple/speculative-simple.cpp (6)

`7-7`: **Include `llama-cpp.h` for extended decoding.** Necessary for `llama_batch_ext_ptr` and the extended decode functions.

`117-118`: **Initialize batch from prompt input, then decode once.** Combining `init_from_text` and an immediate `llama_decode_ext` call is a straightforward approach for processing the initial tokens.

`137-137`: **Create a secondary extended batch for final decoding.** Instantiating `batch_tgt` with `llama_batch_ext_init(llama_n_batch(ctx_tgt), 1)` is consistent with the approach in other files. Verify whether `1` max sequence is enough for your usage.

`156-157`: **Clear and add the last token for target decoding.** Resetting with `clear()` and then adding `id_last` at `n_past++` ensures a consistent continuation point.

`167-167`: **Append draft tokens incrementally.** Appending each draft token at `(n_past + i)` is aligned with the target position tracking.

`172-172`: **Decode the combined target batch.** No error handling here aligns with your decision to keep the example code simpler for now.

examples/run/run.cpp (5)

`643-643`: **Added position tracking with `n_past`.** Good addition of the `n_past` member variable to track token positions across multiple decode operations.

`954-957`: **Updated function signature to use extended batch.** The function has been properly updated to accept the `llama_batch_ext_ptr` type and use its `n_tokens()` method instead of the previous `n_tokens` member.

`995-995`: **Updated batch initialization with position tracking.** Good update to use the extended batch initialization with proper position tracking via `llama_data.n_past`.

`1004-1005`: **Added position update tracking.** Good addition to increment `n_past` by the number of tokens in the batch, which ensures accurate position tracking for subsequent operations.

`1020-1020`: **Updated next token batch initialization.** Correctly using the extended batch initialization with the updated `n_past` value to maintain proper token positions.

examples/speculative/speculative.cpp (6)

`6-6`: **Added include for extended batch functionality.** Good addition of the `llama-cpp.h` header which provides the extended batch functionality.

`205-206`: **Updated batch initialization to use extended API.** Good replacement of the old batch initialization with the extended batch initialization using proper context-specific batch sizes.

`477-481`: **Created well-structured batch info.** Good addition of the `batch_info` structure to organize token ID, position, and sequence ID data, which makes the code more maintainable.

`483-485`: **Initialized batch data with proper position tracking.** Good use of the new structure to properly track token positions with the first token's position set to `n_past_tgt`.

`489-489`: **Updated batch clear operation.** Correctly using the member function `clear()` instead of the previous function call.

`600-605`: **Properly cleared and prepared the target batch.** Good update to use the new batch API for clearing and adding text with full position and sequence ID support.

examples/perplexity/perplexity.cpp (1)

`505-579`: **Clarify position logic and ensure correctness in output toggling.** The code conditionally sets `output = pos >= first`, which increments `n_outputs` in line 572 if `true`. Confirm that `first` is correctly updated prior to each iteration; any off-by-one or stale value could lead to missing or extra outputs in `logits`.

examples/save-load-state/save-load-state.cpp (3)

`51-55`: **Confirm that `n_past` aligns with new tokens.** After calling `batch.n_tokens()`, you increment `n_past` by that value. Ensure that all tokens in `batch` are evaluated, especially if you set `output_last` or skip any positions.

`134-139`: **Check matching contexts for `ctx2` vs. original.** Here, you decode in `ctx2` but still reuse the same `batch`. Make sure the batch contents (like positions) truly match the new context's state.

`215-221`: **Validate sequence ID usage.** You add tokens with `seq_id=1`. Confirm that `ctx3` expects a second sequence in valid range (e.g., if the model or context has 2 parallel sequences).

examples/llava/gemma3-cli.cpp (2)

`8-8`: **Include guards synergy and forward declarations.** Including `"llama-cpp.h"` is correct for usage, but monitor if multiple llama-related headers cause redefinition or macro conflicts. Also, ensure the new `llama_batch_ext_ptr batch` is forward-declared if possible to reduce compile times. Also applies to: 67-67, 77-77

`93-101`: **Re-check empty prompts for `eval_text`.** You always clear the batch, then add tokens. If `input` is empty, confirm no tokens are queued. This might skip setting `output_last` if `tokens` is empty.
```diff
-    for (int i = 0; i < batch.n_tokens; ++i) {
-        if (!batch.logits[i] || batch.seq_id[i][0] != slot.id) {
-            continue;
-        }
-        std::vector<float> embd_res(n_embd, 0.0f);
-
-        const float * embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
-        if (embd == NULL) {
-            embd = llama_get_embeddings_ith(ctx, i);
-        }
+    const llama_seq_id seq_id = slot.id;
+
+    std::vector<float> embd_res(n_embd, 0.0f);
+
+    if (llama_pooling_type(slot.ctx) != LLAMA_POOLING_TYPE_NONE) {
+        const float * embd = llama_get_embeddings_seq(ctx, seq_id);

         if (embd == NULL) {
-            SLT_ERR(slot, "failed to get embeddings, token = %d, seq_id = %d\n", batch.token[i], batch.seq_id[i][0]);
+            SLT_ERR(slot, "failed to get sequence embeddings, seq_id = %d\n", seq_id);

             res->embedding.push_back(std::vector<float>(n_embd, 0.0f));
             continue;
         }

         // normalize only when there is pooling
         // TODO: configurable
-        if (llama_pooling_type(slot.ctx) != LLAMA_POOLING_TYPE_NONE) {
-            common_embd_normalize(embd, embd_res.data(), n_embd, 2);
-            res->embedding.push_back(embd_res);
-        } else {
-            res->embedding.push_back({ embd, embd + n_embd });
-        }
+        common_embd_normalize(embd, embd_res.data(), n_embd, 2);
+        res->embedding.push_back(embd_res);
+    } else {
+        GGML_ABORT("embeddings without pooling is not supported yet");
+        //for (int i = 0; i < batch.n_tokens(); ++i) {
+        //    auto tok = batch.tokens[i];
+        //    if (!tok.logits || tok.seq_id != slot.id) {
+        //        continue;
+        //    }
+
+        //    const float * embd = llama_get_embeddings_ith(ctx, tok.seq_id);
+        //    if (embd == NULL) {
+        //        SLT_ERR(slot, "failed to get embeddings, token = %d, seq_id = %d\n", tok.token, tok.seq_id);
+
+        //        res->embedding.push_back(std::vector<float>(n_embd, 0.0f));
+        //        continue;
+        //    }
+
+        //    res->embedding.push_back({ embd, embd + n_embd });
+        //}
+    }
```
Embeddings implementation with sequence ID
The implementation now correctly uses sequence IDs from the slot instead of iterating through batch tokens. Note that the commented-out code (currently not used) contains a limitation warning that embeddings without pooling are not yet supported.
Consider implementing support for embeddings without pooling to make the feature complete. The commented code shows the intended approach but is not yet functional.
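For reference, a hedged sketch of what non-pooled support might look like, adapted from the commented-out block (the `n_outputs` bound and the output-index mapping are assumptions, untested here):

```cpp
// POOLING_TYPE_NONE: one embedding per output token rather than per sequence.
for (int i = 0; i < n_outputs; ++i) {
    const float * embd = llama_get_embeddings_ith(ctx, i);
    if (embd == NULL) {
        res->embedding.push_back(std::vector<float>(n_embd, 0.0f));
        continue;
    }
    res->embedding.push_back({ embd, embd + n_embd }); // raw, unnormalized
}
```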
examples/llava/gemma3-cli.cpp (Outdated)

```cpp
llama_batch_ext_ptr batch_img(llama_batch_ext_init_from_embd(image_embd_v.data(), n_tokens, n_embd, ctx.n_past, 0));
if (llama_decode_ext(ctx.lctx, batch_img.get())) {
```
**Ensure embedding dimension consistency.** In `llama_batch_ext_init_from_embd(image_embd_v.data(), n_tokens, n_embd, ...)`, validate that `n_embd` matches the model's embedding dimension. A mismatch can lead to undefined behavior.
```cpp
ctx.batch.clear();
ctx.batch.add_text(token_id, ctx.n_past++, 0, true);
if (llama_decode_ext(ctx.lctx, ctx.batch.get())) {
```
🛠️ Refactor suggestion
**Check concurrency for shared `ctx.batch`.** If `generate_response` is called concurrently with other code, ensure `ctx.batch` usage is thread-safe. Currently, multiple updates to the same batch from different threads might cause data races.
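If concurrent calls are possible, one minimal fix is to serialize access to the shared batch; a sketch (the mutex is hypothetical and not part of the current code):

```cpp
#include <mutex>

static std::mutex batch_mutex;
{
    std::lock_guard<std::mutex> lock(batch_mutex); // one writer at a time
    ctx.batch.clear();
    ctx.batch.add_text(token_id, ctx.n_past++, 0, true);
    if (llama_decode_ext(ctx.lctx, ctx.batch.get())) {
        return 1;
    }
}
```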
Actionable comments posted: 1
♻️ Duplicate comments (5)
include/llama-cpp.h (3)
`86-93`: 🛠️ Refactor suggestion: **Check the return value from token addition.** Similar to the `init_from_text` method, this function doesn't handle potential errors returned by `llama_batch_ext_add_text`.

```diff
 int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
+if (output_id < 0) {
+    return output_id;
+}
 if (output_last) {
```

`95-102`: 🛠️ Refactor suggestion: **Add error handling for single sequence ID token addition.** The same error handling issue exists in this overload of the `add_text` method.

```diff
 int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
+if (output_id < 0) {
+    return output_id;
+}
 if (output_last) {
```

`52-55`: 💡 Verification agent (❓ Verification inconclusive): **Verify error handling for `llama_batch_ext_add_text`.** The function doesn't check the return value of `llama_batch_ext_add_text`, which can return negative values to indicate errors. If any token addition fails, this will silently continue with a corrupted batch. Add error checking to handle potential failures:

```diff
 llama_batch_ext * batch = llama_batch_ext_init(ctx);
 for (int32_t i = 0; i < n_tokens; i++) {
-    llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+    int32_t rc = llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+    if (rc < 0) {
+        llama_batch_ext_free(batch);
+        return llama_batch_ext_ptr(nullptr);
+    }
 }
```

ACTION: Implement robust error handling for token addition. The review comment's concern is valid: analysis indicates that `llama_batch_ext_add_text` can indeed return negative values to signal an error, yet its return value is not being checked in the current loop. Failing to check for errors may result in silently propagating a corrupted batch to later processing stages. Suggested changes:

- Add error checking: capture the return code from `llama_batch_ext_add_text` and check whether it indicates an error (i.e., a negative value).
- Clean up on failure: if an error is detected, free the batch using `llama_batch_ext_free(batch)` to prevent memory leaks.
- Propagate the error: return an appropriate error value (here, `llama_batch_ext_ptr(nullptr)`) to signal the failure upstream.

The snippet in `include/llama-cpp.h` (lines 52-55) should be modified as shown above. Please ensure that the chosen error propagation method (`llama_batch_ext_ptr(nullptr)`) aligns with the overall error handling patterns elsewhere in the codebase.

examples/speculative/speculative.cpp (1)

`169-174`: 🛠️ Refactor suggestion: **Check for minimum input size and return codes.** Using `n_input - 1` and borrowing `inp.back()` may lead to out-of-bounds errors if `inp.size()` is less than 2. Additionally, the return codes from `llama_decode_ext()` are not checked, which can silence decoding failures. Consider adding range checks and capturing the return code:

```diff
+if (inp.size() < 2) {
+    LOG_ERR("Not enough tokens (need at least 2).");
+    return 1;
+}
+int ret0 = llama_decode_ext(ctx_tgt, batch0.get());
+if (ret0 != 0) { /* handle error */ }
```

src/llama-batch.cpp (1)

`456-475`: ⚠️ Potential issue: **Add boundary checks for `offset` and `n_tokens`.** If `offset + n_tokens` goes beyond the original batch's capacity, pointers to out-of-range elements will be returned, risking undefined behavior. A simple check can prevent this:

```cpp
int32_t total_tokens = batch->n_tokens;
if (offset < 0 || n_tokens < 0 || offset + n_tokens > total_tokens) {
    return nullptr; // or handle error
}
```
🧹 Nitpick comments (27)
src/llama-model.cpp (2)
`12060-12062`: **Ternary operator for token position calculation.** The function returns 4 positions per token if the architecture is `LLM_ARCH_QWEN2VL`; otherwise it defaults to 1. This inline conditional is concise and works for the current two-case scenario; however, if additional architectures begin to require differentiated handling in the future, a more extensible construct (such as a switch-case, sketched below) might improve clarity and maintainability.
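The switch-based alternative mentioned above, as a behavior-preserving sketch:

```cpp
uint32_t llama_n_pos_per_token(llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_QWEN2VL:
            return 4; // Qwen2-VL uses 4 position values per token
        default:
            return 1; // every other architecture uses a single position
    }
}
```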
`12064-12066`: **Delegation with pointer parameter.** This helper function delegates to the previously defined version by extracting `model->arch`. While this approach is straightforward, consider whether a null check (or an assertion) should be added for `model` to prevent potential NULL pointer dereferences if there's any chance the function might be invoked with a NULL model. Documenting the precondition would also be helpful.

src/llama-model.h (1)

`405-405`: **Function looks good but needs a docstring.** The new `llama_n_pos_per_token` declaration correctly accepts an `llm_arch` parameter and returns a `uint32_t`, matching the implementation seen in the relevant code snippets. Consider adding a docstring to explain its purpose and return value:

```diff
+/**
+ * Returns the number of positions per token for a given architecture.
+ *
+ * @param arch The model architecture
+ * @return 4 for LLM_ARCH_QWEN2VL, 1 for all other architectures
+ */
 uint32_t llama_n_pos_per_token(llm_arch arch);
```

examples/llama.swiftui/llama.cpp.swift/LibLlama.swift (1)

`199-199`: **Remove trailing semicolon.** Swift code typically avoids trailing semicolons.

```diff
-    let t_pp_start = DispatchTime.now().uptimeNanoseconds / 1000;
+    let t_pp_start = DispatchTime.now().uptimeNanoseconds / 1000
```

🧰 Tools
🪛 SwiftLint (0.57.0)
[Warning] 199-199: Lines should not have trailing semicolons (trailing_semicolon)

examples/lookahead/lookahead.cpp (1)

`263-266`: **Partial error handling for decode calls.** Here, you check the result of `llama_decode_ext`. If you decide to unify error checks across all decode invocations (e.g., lines 97-98), I can help with that refactor later.

examples/batched/batched.cpp (1)

`158-158`: **Stream index initialization.** Storing the last token's batch index is a neat approach. Consider verifying that the batch isn't empty (to avoid `-1` if the prompt is missing).

examples/speculative/speculative.cpp (5)
`447-449`: **Return code from `llama_decode_ext` not checked.** The call to `llama_decode_ext(ctx_dft, batch_dft.get())` lacks error checking. Ignoring an error code may lead to silent failures.

```diff
 llama_decode_ext(ctx_dft, batch_dft.get());
+// Optionally capture and analyze the return code
+int decode_ret = llama_decode_ext(ctx_dft, batch_dft.get());
+if (decode_ret != 0) {
+    LOG_ERR("llama_decode_ext failed with code %d\n", decode_ret);
+    // handle or abort
+}
```

`570-572`: **Conditional drafting logic.** After adding the token to `batch_dft`, you conditionally disable drafting if the combined data size exceeds `n_draft`. This is logical for controlling branching. Ensure all sequences are properly cleaned up if truncated.

`584-584`: **Decode return code not captured.** Again, the `llama_decode_ext(ctx_dft, batch_dft.get())` return value is ignored. Consistently handle potential errors.

`600-605`: **Batch population loop.** Clearing `batch_tgt` and then populating it from `batch_tgt_data` is coherent. However, `llama_batch_ext_add_text(...)` may return an error. Consider capturing and logging potential failures.

`608-608`: **Decode return code not captured.** As with previous calls, please consider checking the return value from `llama_decode_ext(ctx_tgt, batch_tgt.get())`.

examples/llama.android/llama/src/main/cpp/llama-android.cpp (4)

`194-195`: **Unverified return.** `llama_batch_ext_add_text(batch, 0, i, &seq_id, 1, false)` may fail if the batch exceeds capacity. Consider checking and logging its return code.

`214-218`: **Prompt for generation.** Clearing the batch and adding tokens for text generation is correct. However, each `llama_batch_ext_add_text(...)` call can return an error code. Check to avoid silent failures.

`356-357`: **Check for `llama_batch_ext_add_text` failures.** Though unlikely, it's safer to check the return code in case of size or parameter issues.

`413-415`: **Adding next token.** Clearing and then adding the newly sampled token is correct. Again, keep a note on error checking if the batch is full.

examples/save-load-state/save-load-state.cpp (1)

`54-55`: **Decode return code and token count.** Line 54 ignores the return of `llama_decode_ext`. Consider checking it. The token count update on line 55 is correct.

src/llama-batch.cpp (4)
`193-275`: **Validate `n_seq_id` boundaries and usage assumptions.** This method sorts and groups tokens by sequence IDs. If `batch.n_seq_id` or `batch.seq_id` have unexpected values (e.g., negative or null), it could break the logic. Consider adding sanity checks (sketched below) or clarifying that upstream code ensures valid ranges.
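A possible sanity check before the sort/group step, as a sketch (field names follow the legacy `llama_batch` layout; whether to reject or clamp malformed input is a design choice):

```cpp
for (int32_t i = 0; i < batch.n_tokens; ++i) {
    if (batch.n_seq_id[i] <= 0 || batch.seq_id[i] == nullptr) {
        return false; // malformed input: refuse to split rather than read out of bounds
    }
    for (int32_t j = 0; j < batch.n_seq_id[i]; ++j) {
        if (batch.seq_id[i][j] < 0) {
            return false;
        }
    }
}
```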
`341-371`: **Consider adding range checks for allocated resources.** The function allocates memory for arrays based on `n_tokens_alloc`. If a caller passes very large or negative values, it could lead to allocation issues. However, it appears consistent with your codebase rules on ignoring null checks.

`399-422`: **Check if `n_seq_ids` exceeds allocation.** The loop writes up to `n_seq_ids` entries into `batch->seq_id[...][j]`. If the caller passes a larger `n_seq_ids` than allocated, it could cause out-of-bounds writes. Consider bounding this or documenting the requirement.

`424-441`: **Linear search approach is acceptable; watch performance on large batches.** This function performs a nested loop to set output flags, which may be slow if `batch->n_tokens` is large. It might be worth optimizing or clarifying performance trade-offs.

examples/batched-bench/batched-bench.cpp (2)

`64-82`: **Helper function uses batch views correctly; watch for offset range.** `llama_batch_ext_get_view` can fail if offsets are out of range. Currently, the loop bounds keep it safe. If usage expands, validate offset usage.

`88-88`: **Return code from `llama_batch_ext_add_text` is ignored.** If the batch is full or has embeddings, the function can fail and return negative codes. Consider checking the return value.

examples/llava/gemma3-cli.cpp (1)

`95-95`: **Return value of `add_text` is not checked.** The function can fail if the batch is at capacity. Handling that return code may prevent silent errors.

include/llama.h (4)

`265-266`: **Duplicate documentation for a deprecated struct.** This doc block may cause confusion since the struct is already marked as deprecated. Consider consolidating or removing it to maintain clarity.

`958-967`: **Consider setting clear boundaries and fallback conditions.** The function's documentation does not specify the behavior for out-of-range `pos` or invalid `seq_id`. A guard or fallback could improve robustness.

`974-980`: **Minor wording improvement.** Correct "Return returned batch" to "The returned batch" for clarity in the documentation.

`1006-1010`: **Document return codes for `llama_decode_ext`.** Since the existing decode function describes possible return values, replicate that detail here to ensure consistency and clarity for users.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (42)
- common/common.cpp (2 hunks)
- common/speculative.cpp (4 hunks)
- examples/batched-bench/batched-bench.cpp (5 hunks)
- examples/batched/batched.cpp (6 hunks)
- examples/cvector-generator/cvector-generator.cpp (1 hunks)
- examples/embedding/embedding.cpp (5 hunks)
- examples/eval-callback/eval-callback.cpp (1 hunks)
- examples/gritlm/gritlm.cpp (5 hunks)
- examples/imatrix/imatrix.cpp (3 hunks)
- examples/infill/infill.cpp (1 hunks)
- examples/llama-bench/llama-bench.cpp (5 hunks)
- examples/llama.android/llama/src/main/cpp/llama-android.cpp (9 hunks)
- examples/llama.android/llama/src/main/java/android/llama/cpp/LLamaAndroid.kt (2 hunks)
- examples/llama.swiftui/llama.cpp.swift/LibLlama.swift (8 hunks)
- examples/llava/gemma3-cli.cpp (6 hunks)
- examples/llava/llava-cli.cpp (1 hunks)
- examples/llava/llava.cpp (2 hunks)
- examples/llava/minicpmv-cli.cpp (1 hunks)
- examples/llava/qwen2vl-cli.cpp (2 hunks)
- examples/lookahead/lookahead.cpp (6 hunks)
- examples/lookup/lookup.cpp (5 hunks)
- examples/main/main.cpp (2 hunks)
- examples/parallel/parallel.cpp (7 hunks)
- examples/passkey/passkey.cpp (6 hunks)
- examples/perplexity/perplexity.cpp (19 hunks)
- examples/retrieval/retrieval.cpp (7 hunks)
- examples/run/run.cpp (4 hunks)
- examples/save-load-state/save-load-state.cpp (5 hunks)
- examples/server/server.cpp (21 hunks)
- examples/simple-chat/simple-chat.cpp (3 hunks)
- examples/simple/simple.cpp (3 hunks)
- examples/speculative-simple/speculative-simple.cpp (5 hunks)
- examples/speculative/speculative.cpp (8 hunks)
- examples/tts/tts.cpp (5 hunks)
- include/llama-cpp.h (2 hunks)
- include/llama.h (6 hunks)
- src/llama-batch.cpp (5 hunks)
- src/llama-context.cpp (4 hunks)
- src/llama-graph.cpp (3 hunks)
- src/llama-graph.h (2 hunks)
- src/llama-model.cpp (1 hunks)
- src/llama-model.h (1 hunks)
🔥 Files not summarized due to errors (5)
- examples/llava/llava-cli.cpp: Error: Disallowed special token found: <|im_end|>
- examples/llava/minicpmv-cli.cpp: Error: Disallowed special token found: <|im_start|>
- examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
- examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
- src/llama-model.cpp: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (13)
- examples/eval-callback/eval-callback.cpp
- examples/cvector-generator/cvector-generator.cpp
- examples/llava/llava.cpp
- examples/infill/infill.cpp
- examples/gritlm/gritlm.cpp
- src/llama-graph.h
- examples/main/main.cpp
- examples/parallel/parallel.cpp
- examples/speculative-simple/speculative-simple.cpp
- examples/passkey/passkey.cpp
- src/llama-context.cpp
- examples/lookup/lookup.cpp
- examples/simple/simple.cpp
🧰 Additional context used
🧠 Learnings (3)
examples/imatrix/imatrix.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/imatrix/imatrix.cpp:500-500
Timestamp: 2025-03-21T10:48:55.882Z
Learning: The function llama_batch_ext_init never returns nullptr in the llama.cpp codebase, so null checks after calling this function are unnecessary.
examples/lookahead/lookahead.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-21T10:48:55.882Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
src/llama-batch.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: src/llama-batch.cpp:356-386
Timestamp: 2025-03-21T10:48:55.882Z
Learning: Null checks after malloc calls are not required in the llama.cpp codebase, as the project follows a design pattern where memory allocation failures are handled differently or allowed to propagate.
🧬 Code Definitions (23)
(Flattened cross-reference listings omitted: per-file symbol/line lookups for the files above, resolving the new batch API symbols such as llama_batch_ext_init, llama_batch_ext_add_text, llama_batch_ext_set_output_last, llama_batch_ext_get_view, and llama_decode_ext in include/llama.h, include/llama-cpp.h, and src/llama-batch.cpp.)
🪛 SwiftLint (0.57.0)
examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
[Warning] 199-199: Lines should not have trailing semicolons
(trailing_semicolon)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: server-windows
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (Release)
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: android-build
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
- GitHub Check: windows-latest-cmake (avx-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX...
- GitHub Check: windows-latest-cmake (avx512-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_...
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: macOS-latest-swift (generic/platform=iOS)
- GitHub Check: macOS-latest-swift (generic/platform=tvOS)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: check-requirements
🔇 Additional comments (172)
src/llama-graph.cpp (4)

5-5: Include correctly added for accessing the new function.
Good addition of the llama-model.h header file to access the newly added llama_n_pos_per_token function.

569-569: Properly migrated from method to direct initialization.
The initialization of n_pos_per_token now uses the new function llama_n_pos_per_token(params.arch) instead of a class method. This change is consistent with the refactoring pattern of moving the logic outside the class.

1004-1004: Updated constructor to use class member.
The llm_graph_input_pos constructor now correctly takes n_pos_per_token as a parameter, matching the migration from method to class member.

1008-1008: Updated tensor size calculation.
The tensor size calculation now correctly uses the class member n_pos_per_token instead of calling a method, consistent with the overall refactoring.
examples/llama.android/llama/src/main/java/android/llama/cpp/LLamaAndroid.kt (2)

48-48: Updated JNI function signature to use context.
The new_batch function signature has been updated to take a context parameter (Long) instead of three integers. This aligns with the migration to llama_batch_ext in the native code.

105-105: Updated function call to match new signature.
The call to new_batch now correctly passes the context parameter instead of the hardcoded values (512, 0, 1), aligning with the updated function signature.
common/speculative.cpp (11)

17-17: Updated struct member type to use the new batch API.
Changed the batch member type from llama_batch to llama_batch_ext_ptr, which is a smart-pointer wrapper around the new extended batch structure. This change is part of the migration to the new batch API and will ensure proper memory management.

26-26: Updated initialization to use the new batch API.
The batch initialization now uses the llama_batch_ext_ptr(ctx_dft) constructor, which internally calls llama_batch_ext_init(ctx). This aligns with the migration pattern described in the PR objectives.

207-207: Updated to use the batch object's clear method.
Changed from the standalone common_batch_clear function to the object's clear() method, which is part of the llama_batch_ext_ptr interface.

211-211: Updated to use the batch object's add_text method.
Changed from common_batch_add to batch.add_text(), keeping the same parameters but using the object's method instead of a standalone function.

217-217: Updated token count check to use a method.
Now using the batch.n_tokens() method instead of directly accessing a field, which is consistent with the encapsulation provided by the new batch API.

220-220: Updated to use the extended decode function.
Changed from llama_decode to llama_decode_ext, passing batch.get() to retrieve the raw pointer from the smart pointer. This correctly uses the new API for decoding.

227-229: Batch operations updated to use new API methods.
These consecutive operations on the batch (clear and add_text) now use the object's methods instead of standalone functions, consistent with the migration to the new batch API.

234-234: Another decode operation updated to use the extended API.
Consistently updated to use llama_decode_ext with batch.get() to retrieve the raw pointer from the smart pointer.

240-240: Consistent use of the batch clear method.
Consistently using the batch object's clear() method throughout the code.

267-267: Consistent use of the batch add_text method.
Consistently using the batch object's add_text() method throughout the code with the same parameter order as before.

270-270: Consistent use of the extended decode function.
Consistently using llama_decode_ext with batch.get() throughout the code.
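These eleven comments all describe one recurring pattern. As a reference point, a minimal sketch of a single draft step, assuming the llama_batch_ext_ptr methods named above (clear, add_text, n_tokens, get); sequence ID 0 and the position bookkeeping are illustrative:

    // One draft iteration: reset the batch, queue the token, decode.
    batch.clear();
    batch.add_text(id_last, n_past, 0, /*output=*/true); // request logits for this token

    if (batch.n_tokens() > 0) {
        llama_decode_ext(ctx_dft, batch.get()); // raw pointer out of the smart pointer
        n_past += batch.n_tokens();
    }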
common/common.cpp (2)

1019-1020: Correct migration to the llama_batch_ext_ptr API.
The warm-up code has been properly migrated from the legacy batch API to the new extended batch API. This implementation correctly uses position 0 for the initial encoding of tokens.

1029-1030: Properly migrated decoder warm-up with batch size limit.
The decoding step of the warm-up process has been correctly migrated to the new extended batch API. Good use of std::min() to ensure the batch size doesn't exceed the configured limit.
examples/server/server.cpp (12)

1232-1232: Structure type updated for better memory management.
The change from llama_batch batch_spec to llama_batch_ext_ptr batch_spec aligns with the migration pattern from llama_batch to the extended batch API. The smart-pointer version helps with proper resource management.

1804-1804: Structure type updated for better memory management.
The change from llama_batch batch to llama_batch_ext_ptr batch follows the same migration pattern as the slot's batch_spec field, ensuring consistency throughout the codebase.

1930-1930: Using clear() method instead of empty initialization.
The code now properly calls clear() instead of initializing with empty braces, which is more explicit and ensures proper cleanup of resources.
Also applies to: 1966-1966, 2858-2858, 3286-3286

2409-2457: Embeddings implementation with sequence ID.
The function signature has been updated to remove the batch parameter. The implementation now correctly uses a sequence ID (the slot ID) to retrieve embeddings directly from the context rather than iterating through batch tokens. This is more efficient and cleaner.
Note that the commented-out code indicates a known limitation: embeddings without pooling are not yet supported, as flagged in the conditional check.

2459-2479: Reranking implementation with sequence ID.
Similar to the embeddings function, the reranking implementation has been updated to use sequence IDs instead of batch tokens, which simplifies the code and makes it more maintainable.

2880-2882: Updated batch token handling.
The code now uses the batch.n_tokens() method instead of directly accessing the n_tokens member, and uses batch.add_text() instead of common_batch_add(), which provides better encapsulation and follows the new API pattern.

3085-3090: Updated batch token handling for prompt processing.
The code now uses batch.add_text() with proper position and sequence ID parameters for adding prompt tokens, replacing the old common_batch_add() function. This ensures consistent use of the new batch API.

3116-3116: Using the new API for setting output tokens.
Replaced the previous implementation with llama_batch_ext_set_output_last(batch.get()) to set the last token as output, which aligns with the new API.

3146-3152: Using batch views for efficient token processing.
The code now processes tokens in chunks by creating a view of the batch using llama_batch_ext_get_view(), which allows for more efficient token processing without copying the entire batch.

3287-3291: Updated batch token handling for speculative decoding.
The code now uses batch_spec.add_text() with proper position and sequence ID parameters for adding tokens to the speculative batch, replacing the old common_batch_add() function.

3295-3295: Updated decoding with the extended batch API.
Changed to llama_decode_ext(ctx, slot.batch_spec.get()) to align with the new API, ensuring that the speculative batch is properly processed.

4152-4155: Warning about unsupported pooling type.
The code correctly checks for and rejects requests with pooling type 'none' since this isn't yet supported, as indicated by the commented-out implementation in send_embedding.
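The chunked processing in 3146-3152 can be summarized in a few lines. A sketch only, assuming the llama_batch_ext_get_view(batch, offset, n_tokens) signature shown in this review; the retry policy of the real server loop is omitted:

    // Decode a large batch in n_batch-sized chunks without copying tokens.
    const int32_t n_total = batch.n_tokens();
    for (int32_t i = 0; i < n_total; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, n_total - i);

        llama_batch_ext * view = llama_batch_ext_get_view(batch.get(), i, n_tokens);
        const int ret = llama_decode_ext(ctx, view);
        llama_batch_ext_free(view); // per the view-aware free logic, this must not touch the parent buffers

        if (ret != 0) {
            break; // real code would retry with a smaller n_batch
        }
    }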
examples/simple-chat/simple-chat.cpp (6)

2-2: Include llama-cpp.h.
This header is necessary for using the extended batch and decoding functions.

112-114: Initialize and track n_past.
Initializing n_past at 0 and then updating it with batch.n_tokens() ensures consistent tracking of token positions across batches.

121-121: Safe boundary check.
Using batch.n_tokens() here is correct for preventing context overflow.

127-127: Extended decoding call.
Switching to llama_decode_ext is a valid upgrade; the error handling via GGML_ABORT is consistent with the rest of the code.

151-153: Prepare batch for the next token.
Clearing the batch, adding the new token with the same sequence ID, and incrementing n_past is a clean approach to building the prompt incrementally (see the sketch after this list).

156-156: Free the batch.
Freeing the batch at the end ensures no leaks. This call is correctly placed outside the loop.
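The generation loop discussed in 112-153 reduces to a small skeleton. A sketch under the same assumptions as above; sampling and the end-of-generation check are placeholders:

    // Incremental generation: decode, sample, then reuse the batch for the
    // single sampled token. n_past tracks the absolute position.
    int n_past = 0;
    while (true) {
        if (llama_decode_ext(ctx, batch.get()) != 0) {
            GGML_ABORT("failed to decode");
        }
        n_past += batch.n_tokens();

        llama_token new_token_id = 0; // placeholder: sample from the last logits here

        batch.clear();
        batch.add_text(new_token_id, n_past, 0, /*output=*/true);
    }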
examples/imatrix/imatrix.cpp (6)

500-500: Init new batch.
Calling llama_batch_ext_init(ctx) is correct. Null checks are unnecessary, as this function never returns nullptr per project learnings.

514-514: Clearing the batch.
Using llama_batch_ext_clear(batch) properly resets the batch before adding new tokens.

516-516: Add tokens with the extended API.
The extended function call correctly includes the sequence ID reference. This aligns with the new batch approach.

520-520: Handle decode failures.
Performing the decode with llama_decode_ext and checking errors ensures the process can abort cleanly on failure.

522-522: Free batch upon failure.
This early return avoids potential leaks or invalid usage if the decode fails.

535-535: Final batch cleanup.
Ensures resources are freed if decode succeeds. The logic avoids double-free issues.
examples/llama.swiftui/llama.cpp.swift/LibLlama.swift (18)

13-13: Opaque pointer for the batch.
Storing the batch as an OpaquePointer aligns with the Swift bridging of C APIs.

20-20: Context length set to 128.
Reducing n_len from a larger value may limit generation length; verify this meets your needs.

29-29: Initialize the extended batch.
Initialization is correct, and no null check is required based on the library's guarantees.

40-40: Free batch in deinit.
Properly releases resources and avoids memory leaks when this actor is deallocated.

98-98: Return extended batch token count.
Delegating the token count to llama_batch_ext_get_n_tokens(batch) is the correct approach.

120-120: Clear batch before adding prompt tokens.
Resets the batch to a known empty state.

124-124: Add tokens with sequence ID.
Providing [llama_seq_id(0)] for each token is suitable for a single-sequence context.

126-126: Mark last token for output.
llama_batch_ext_set_output_last(batch) ensures the final token is treated as output.

128-129: Improve error handling for decode failure.
Only printing an error might allow continued execution with an invalid state. Consider throwing an exception or returning early.

132-132: Track current token count.
Retrieving the new token count from the batch is crucial to keep n_cur in sync.

138-138: Sampler usage.
Sampling based on llama_batch_ext_get_n_tokens(batch) - 1 seems intentional, but confirm it correctly references the last token.

165-166: Prepare batch for appended token.
Clearing the batch and adding the new token with an updated n_cur is consistent with incremental text generation.

171-171: Add more robust error handling.
Similar to lines 128-129, only printing a message can lead to unpredictable flow. Return early or raise an error.

188-188: Clear batch.
Properly resets state prior to adding the next set of tokens.

193-195: Add tokens and mark last output.
Ensures the final token in this batch is flagged for output.

201-202: Refine error handling.
Consider throwing or returning an error code to prevent continuing after a decode failure.

215-218: Batch reuse for text generation.
Clearing and adding tokens in a loop is consistent. Each iteration handles a slice of text with its own sequence ID.

221-222: Consistent decode error strategy.
Again, only printing an error can compromise later logic. A more robust approach would error out completely.
examples/llama-bench/llama-bench.cpp (6)

1430-1454: Ensure n_past remains valid.
This refactoring cleanly incorporates the new n_past parameter in test_prompt(). Just be mindful of any scenarios where n_past might be negative or exceed expected bounds.

1455-1470: Decode error handling not present.
This new test_gen() function also omits error checks on llama_decode_ext. Based on previous feedback, you plan to simplify for now and handle errors later, so this is noted.

1613-1613: Warm-up call looks good.
Calling test_prompt(...) with n_past = 0 aligns with the intended warm-up approach.

1619-1619: Consistent generation warm-up.
Similar to the prompt warm-up, using test_gen(..., 0, ...) is consistent with the new signature.

1632-1632: Additional prompt run.
Repeated invocation of test_prompt(...) appears intentional for benchmarking multiple runs. No concerns here.

1639-1639: Final generation run invocation.
Invoking test_gen(...) with t.n_gen, t.n_prompt is consistent with the revised batch API usage.
examples/lookahead/lookahead.cpp (9)

95-96: Partitioning prompt tokens into two batches.
Splitting the initial tokens (n_input - 1) and the last token separately is a neat approach. Be cautious if n_input is ever ≤ 1, but the prior checks should cover that.

97-98: No error checks for decode calls (revisited).
As previously noted, you've chosen not to handle llama_decode_ext return codes here for simplicity.

120-120: Batch initialization.
Allocating a dedicated llama_batch_ext for the lookahead logic appears correct and is properly paired with a free at line 483.

209-209: Clearing batch state.
Good use of llama_batch_ext_clear before repopulating tokens.

212-212: Adding the current token with output enabled.
Marking this token as true for output is consistent, ensuring it contributes to decoded logits.

235-239: Tracking token indices and adding verification tokens.
Storing the batch token position via llama_batch_ext_get_n_tokens right before calling llama_batch_ext_add_text helps manage multi-sequence verification.

250-251: Filling the next tokens at this level.
Appending lookahead tokens in a single call is succinct and leverages the false output flag for intermediate tokens.

257-258: Handling sub-level tokens.
Using i + 1 as the sequence ID and conditionally setting output to j == N - 2 ensures tokens become logits only at the final stage.

483-483: Proper resource cleanup.
Freeing the extended batch pointer prevents leaks and finalizes the lookahead logic neatly.
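For readers unfamiliar with the multi-sequence calls, the level-filling step described in 250-258 looks roughly like this. A sketch only: W, N, n_past, and tokens[i][j] are illustrative names, while the seq_id and output-flag scheme follows the comments above:

    // Each lookahead branch i gets its own sequence ID (i + 1); only tokens
    // at the last level (j == N - 2) request logits.
    for (int i = 0; i < W; i++) {
        llama_seq_id seq_id = i + 1;
        for (int j = 0; j < N - 1; j++) {
            const bool output = (j == N - 2);
            llama_batch_ext_add_text(batch, tokens[i][j], n_past + j, &seq_id, 1, output);
        }
    }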
examples/run/run.cpp (6)

643-643: Initialize past-tokens counter.
Adding a dedicated n_past field is a solid approach for tracking the number of tokens processed so far.

954-964: New context size check function.
This helper cleanly encapsulates the logic for verifying that the batch does not exceed the context limit. Consider ensuring that its return value is consistently used or handled at the call site to avoid decoding when the limit is exceeded. (A sketch of such a guard follows this list.)

995-995: Initialize batch from text with current position.
Using llama_data.n_past as the pos0 argument is a clean way to maintain the correct offset. Just verify that other function calls also pass updated positions to prevent misalignment.

998-999: Revisit return-value handling.
Previously flagged in older commits: the return value of check_context_size is ignored, which can let a context overflow go undetected until after decoding.

1004-1005: Incrementing n_past.
Updating llama_data.n_past by the batch size is correct for preserving generation context.

1020-1020: Clearing batch mid-generation.
Clearing the batch to prepare for the next token set is appropriately placed. Ensure no logic relies on the old batch contents.
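A guard of the kind described in 954-964 could look like this. A sketch under stated assumptions: the helper name and return convention are taken from this review, and llama_n_ctx / llama_kv_self_used_cells are the context APIs referenced elsewhere in it:

    // Refuse to decode when the KV cache plus the pending batch would
    // overflow the context window.
    static int check_context_size(llama_context * ctx, llama_batch_ext * batch) {
        const int n_ctx      = llama_n_ctx(ctx);
        const int n_ctx_used = llama_kv_self_used_cells(ctx);
        if (n_ctx_used + llama_batch_ext_get_n_tokens(batch) > n_ctx) {
            printf("context size exceeded\n");
            return 1; // caller should stop instead of calling llama_decode_ext
        }
        return 0;
    }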
examples/embedding/embedding.cpp (9)

29-34: New helper for adding tokens.
batch_add_seq correctly transforms each token into a call to llama_batch_ext_add_text. Be mindful that using true for the output parameter produces embeddings for every token; confirm that all tokens need embedding.

36-43: Dedicated batch-decoding function.
This new function centralizes the batch decoding logic for embeddings. Parameter choices (e.g., embd_norm) appear consistent with the code's usage.

45-45: Informational log.
Logging the token count and sequence number is helpful for debugging.

49-50: Encoder invocation.
Switching to llama_encode_ext aligns with the extended batch approach. The error-handling block looks good.

54-56: Decoder invocation.
Similarly, calling llama_decode_ext ensures consistency with extended API usage.

59-75: Conditional pooling logic.
Splitting the logic by pooling type is clear. Iterating token-by-token vs. sequence-by-sequence is properly separated.

170-170: Extended batch creation.
Replacing legacy batch initialization with llama_batch_ext_init properly adopts the new mechanism.

210-210: Final decode call.
Using batch_decode one last time for any remaining tokens completes the embedding process.

316-316: Freeing the extended batch.
Releasing batch resources at the end ensures no memory leaks remain.
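The helper flagged in 29-34 is essentially a thin loop over add_text. A sketch that makes the all-tokens-output behavior explicit; the function shape follows the description above:

    // Queue every token of one input under the given sequence ID. output=true
    // on all tokens because non-pooled embeddings need per-token outputs.
    static void batch_add_seq(llama_batch_ext * batch, const std::vector<llama_token> & tokens, llama_seq_id seq_id) {
        for (size_t i = 0; i < tokens.size(); i++) {
            llama_batch_ext_add_text(batch, tokens[i], (llama_pos) i, &seq_id, 1, true);
        }
    }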
examples/batched/batched.cpp (14)

105-105: Extended batch initialization.
Creating the batch with llama_batch_ext_init(ctx) is the correct replacement for the legacy batch struct.

114-114: Prompt token addition.
Appropriately adds prompt tokens for decoding. Confirm that the final false for output matches the desired logits usage.

116-116: Assert on prompt size.
Verifying that the batch token count matches the prompt size ensures correct prompt loading.

119-120: Hybrid model check.
Entering the encoder path to invoke llama_encode_ext is correct. The immediate error check is a good practice.

129-130: Reset and add decoder start.
Clearing the batch, then adding a decoder_start_token_id, sets the stage for the next decode phase.

134-134: Output for last token.
Calling llama_batch_ext_set_output_last ensures the final prompt token's logits are available for sampling.

136-136: Check decode result.
The nonzero return from llama_decode_ext is handled via an error condition, which is good for ensuring robust flow.

160-160: Initialize n_cur.
Keeping track of the total tokens processed so far with llama_batch_ext_get_n_tokens helps produce correct offsets.

167-167: Clearing batch per iteration.
Clearing between iterations is essential for the next set of tokens.

197-197: Update stream index after sampling.
Reassigning i_batch[i] to the new last-token index ensures correct sampling references.

199-199: Add new token for output.
Passing true for the output argument logs the token's logits. This is consistent with the decoding loop's flow.

205-205: Detect completion.
Short-circuiting the loop if there are no new tokens to decode makes sense to avoid unnecessary processing.

212-212: Error-handling decode.
Ensuring decode success on each iteration is critical for reliability.

237-237: Clean up extended batch.
Properly freeing the batch finalizes the memory usage.
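The encode-then-decode handoff in 119-136 follows this shape. A sketch only: error handling is illustrative, and the decoder-start-token retrieval is assumed to match the upstream llama.h helper of that name:

    // Hybrid (encoder-decoder) models: run the encoder on the prompt, then
    // restart the batch with the decoder start token before decoding.
    if (llama_model_has_encoder(model)) {
        if (llama_encode_ext(ctx, batch) != 0) {
            return 1; // encoder failed
        }
        llama_token decoder_start_token_id = llama_model_decoder_start_token(model);

        llama_batch_ext_clear(batch);
        llama_seq_id seq_id = 0;
        llama_batch_ext_add_text(batch, decoder_start_token_id, 0, &seq_id, 1, false);
    }

    llama_batch_ext_set_output_last(batch); // sample from the last prompt token
    if (llama_decode_ext(ctx, batch) != 0) {
        return 1;
    }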
examples/perplexity/perplexity.cpp (3)

652-661: Good update to the decode_helper function.
The function has been properly updated to work with the new batch-extension API. The error handling is implemented correctly, checking the return value from llama_decode_ext.

366-375: Batch initialization and token addition are properly updated.
The code now uses the new llama_batch_ext_ptr API correctly. The batch is initialized, cleared, and tokens are added using the appropriate methods.

567-573: Position tracking is handled correctly.
The updated code properly tracks positions for tokens, which is critical for maintaining context. The conversion from the old batch API to the new extended batch API preserves this important aspect.
examples/retrieval/retrieval.cpp (4)

77-82: Proper update to the batch_add_seq function.
The function signature and implementation have been updated to work with the new llama_batch_ext API, using llama_batch_ext_add_text for token addition.

84-110: Improved model type handling in batch_decode.
The updated function now properly handles both encoder-only and decoder-only models, selecting the appropriate decoding method based on the model type. The addition of the embd_norm parameter also provides more flexibility for embedding normalization.
The assertion checking for non-null embeddings is a good defensive-programming practice.

232-239: Batch management is correctly updated.
The code properly uses llama_batch_ext_get_n_tokens to check the batch size and llama_batch_ext_clear to reset the batch. This maintains the same logic flow as the original code under the new API.

299-301: Proper memory cleanup.
The code correctly frees the allocated batch memory using llama_batch_ext_free, ensuring there are no memory leaks.
examples/speculative/speculative.cpp (7)

6-6: Include usage check.
Including "llama-cpp.h" here is valid, but ensure it does not conflict with the existing "llama.h" and that all required definitions are uniquely provided without redefinition.

205-206: No immediate issues.
Instantiating both batch_dft and batch_tgt from the correct contexts looks appropriate for parallel usage.

452-452: Repeated omission of error checking.
This second call to llama_decode_ext also does not verify success or failure. Maintain consistency by handling its return code.

477-484: Struct introduction and vector usage.
Defining batch_info and using batch_tgt_data clarifies structured token storage. The approach is readable. No issues found.

489-489: No issues with clearing.
Calling batch_dft.clear() is straightforward and should reset the batch accurately.

579-580: Early break on empty batch.
Breaking if batch_dft.n_tokens() == 0 is a neat safeguard against aimless decoding. No problem found.

588-588: Draft data size check.
Breaking if batch_tgt_data.size() > n_draft is fine for preventing excessive tokens. Ensure the user is informed of partial drafting.
examples/llama.android/llama/src/main/cpp/llama-android.cpp (14)

178-178: Correct casting of batch pointer.
Casting the batch pointer to llama_batch_ext * is consistent with the new extended batch API. No immediate issues.

189-189: Batch clear.
llama_batch_ext_clear(batch); effectively resets the batch. Looks good.

197-197: Setting output last.
Marking the last token as output is valid for capturing logits from the prompt. No concerns.

201-202: Good error checking.
Here, the failure state from llama_decode_ext(...) is properly logged. This consistent pattern of checking is recommended everywhere.

221-222: Decode failure handling.
Again, robust error handling is displayed here for text generation. Good job.

274-278: Initializing a new batch.
Creating a new llama_batch_ext using llama_batch_ext_init(context) aligns with the extended batch approach. Looks correct.

286-286: Freeing the batch.
Calling llama_batch_ext_free ensures the associated memory is released. No issues.

334-334: Consistent reinterpret cast.
Using reinterpret_cast<llama_batch_ext *>(batch_pointer) maintains consistency with your new API calls.

352-352: Batch reset before filling.
Clearing the batch before adding the prompt tokens is sensible. No concerns.

362-362: Designating last token as output.
No issues. This is needed if you want to collect logits for the final token.

364-364: Error handling.
You properly handle decode errors. Good consistency.

369-369: Returning the batch token count.
This is a helpful interface for the Java layer. Looks fine.

383-383: Batch pointer reinterpretation.
Same pattern of casting. No immediate issues.

419-420: Logging decode failures.
As before, capturing decode errors helps in debugging. This is a good pattern.
examples/save-load-state/save-load-state.cpp (5)

51-51: Batch initialization.
Using llama_batch_ext_ptr::init_from_text to create and populate the batch from your prompt tokens is a clean approach.

82-85: Decoding and freeing on failure.
This block properly checks llama_decode_ext and frees the batch upon failure. Good error handling.

134-139: Consistent error handling.
You again handle decode failures and free resources. Keeping the cleanup consistent is commendable.

218-220: Error handling for the third context.
The pattern remains consistent, returning if decode fails and freeing the batch. This is good practice.

232-232: Memory cleanup.
The final call to free the batch ensures resources are released. Looks correct.
src/llama-batch.cpp (11)

2-2: Header inclusion looks fine.
No issues noted with adding #include "llama-graph.h"; it aligns with the rest of the include directives.

277-317: Confirm intent for forcing the last token's logits to true.
By default, line 314 sets the final token's logits value to true. Ensure this behavior is intentional with regard to how output tokens are flagged.

319-321: Constructor/destructor alignment.
The destructor deletes the batch pointer. Past review comments mentioned double-free concerns. Since they are addressed, no further action is needed.

328-330: Helper function is concise and correct.
No obvious issues. This function simply returns a valid llama_batch struct without extra checks, which is consistent with the project style.

373-375: Simple initialization logic.
Constructing the batch with the specified parameters is straightforward and consistent with the rest of the file.

377-393: Embedding dimension validation.
If n_embd mismatches the actual model dimension, it can lead to undefined behavior. This was flagged in a previous review.

395-397: Trivial accessor is correct.
Returning batch->n_tokens meets the function's purpose.

443-450: Marking the last token as output is straightforward.
No issues with the logic; it is a minimal function that sets logits on the final token.

452-454: Method clears n_tokens only.
Resetting n_tokens is enough for reusing the batch structure. This appears intentional and consistent with the code style.

477-493: View-aware free logic is correct.
Memory is freed only if !is_view, preventing a double free of shared pointers. This aligns with the introduced view mechanism.

495-495: No additional comment needed.
This line simply documents the deprecation.
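The ownership rule in 477-493 can be illustrated as follows. A sketch only: the field names (is_view, token, embd, pos) follow this review's description of the struct, not a verified listing of the implementation:

    void llama_batch_ext_free(struct llama_batch_ext * batch) {
        // a view shares the parent's buffers, so free only the struct itself
        if (!batch->is_view) {
            free(batch->token);
            free(batch->embd);
            free(batch->pos);
            // ... plus the per-token seq_id arrays and the logits flags
        }
        free(batch);
    }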
examples/batched-bench/batched-bench.cpp (7)

62-62: Batch initialization appears standard.
The call to llama_batch_ext_init(ctx) is straightforward.

118-118: Clearing the batch is consistent.
No issues: resetting n_tokens each iteration is valid.

122-122: Again ignoring error codes.
Same concern regarding ignoring the return code from llama_batch_ext_add_text.

125-125: Mark last token as output.
Straightforward usage of llama_batch_ext_set_output_last.

147-147: Batch clear repeated.
Nothing unusual; consistent usage with earlier calls.

150-150: Ignoring error codes once more.
As with earlier calls to llama_batch_ext_add_text, evaluating the return could improve reliability.

188-188: Freeing the extended batch is correct.
Deallocating at the end follows the introduced pattern.
examples/llava/gemma3-cli.cpp (10)

8-8: New include file.
Including "llama-cpp.h" for extended batch features is consistent with the migration.

67-67: Migration to llama_batch_ext_ptr.
Switching from a raw struct to a smart pointer should aid safe memory management.

77-78: Batch reset with llama_batch_ext_init.
Properly assigning a fresh batch instance. No concerns.

93-93: Clearing the batch.
Ensures subsequent calls start with a clean slate.

98-98: Setting the last token as output.
No issues observed, consistent with usage.

101-101: Decoding call with error check.
You handle the return value, ensuring any decode failure is logged.

150-152: Embedding-based batch initialization.
Be sure n_embd matches the model dimension (already noted).

183-183: Clearing batch for next token.
Logical step before reusing or adding new tokens.

184-184: Ignoring the add_text return code.
Same concern that ignoring a failure could lead to hidden errors.

188-188: Decoding token with proper error handling.
Checks the llama_decode_ext return value and logs on failure.
include/llama.h (6)

238-240: Deprecation notice is clear.
The documentation effectively warns about the upcoming removal of this struct.

928-941: Reiterate the need for pointer and dimension checks.
As noted in previous reviews, explicitly validate that embd is non-null and n_tokens, n_embd > 0 to avoid undefined behavior.

942-943: Function for retrieving the token count looks good.
No immediate issues found.

945-957: Clarify usage of the seq_ids parameter.
The documentation does not specify how to handle a null pointer or zero-length array in seq_ids. Please provide details or consider adding checks to avoid potential runtime errors.

968-973: Straightforward logic for setting output on the last token.
Implementation looks clear and consistent.

981-986: Clear guidance on memory management.
These functions provide straightforward handling of batch clearing and freeing.
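Taken together, the functions documented in this header form one simple lifecycle. A minimal end-to-end sketch for a single sequence, using only the calls listed above; prompt_tokens, n_prompt, and the error handling are illustrative:

    llama_batch_ext * batch = llama_batch_ext_init(ctx);   // sized from the context

    llama_seq_id seq_id = 0;
    for (int i = 0; i < n_prompt; i++) {                    // queue the prompt tokens
        llama_batch_ext_add_text(batch, prompt_tokens[i], i, &seq_id, 1, false);
    }
    llama_batch_ext_set_output_last(batch);                 // logits for the last token only

    if (llama_decode_ext(ctx, batch) != 0) {
        fprintf(stderr, "decode failed\n");
    }

    llama_batch_ext_clear(batch);                           // reuse for the next step
    llama_batch_ext_free(batch);                            // release when done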
examples/embedding/embedding.cpp
Outdated

    if (llama_batch_ext_get_n_tokens(batch) + n_toks > n_batch) {
        batch_decode(ctx, batch, emb + e * n_embd, s, n_embd, params.embd_normalize);
        llama_batch_ext_clear(batch);
        e += pooling_type == LLAMA_POOLING_TYPE_NONE ? llama_batch_ext_get_n_tokens(batch) : s;

Potential off-by-one error after clearing
llama_batch_ext_clear(batch) resets the token count, yet line 201 uses llama_batch_ext_get_n_tokens(batch) to update e. This will yield zero instead of the tokens just decoded.
Below is a suggested fix to store the old count before clearing:

      int old_tokens = llama_batch_ext_get_n_tokens(batch);
      batch_decode(ctx, batch, emb + e * n_embd, s, n_embd, params.embd_normalize);
      llama_batch_ext_clear(batch);
    - e += pooling_type == LLAMA_POOLING_TYPE_NONE ? llama_batch_ext_get_n_tokens(batch) : s;
    + e += pooling_type == LLAMA_POOLING_TYPE_NONE ? old_tokens : s;

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    if (llama_batch_ext_get_n_tokens(batch) + n_toks > n_batch) {
        int old_tokens = llama_batch_ext_get_n_tokens(batch);
        batch_decode(ctx, batch, emb + e * n_embd, s, n_embd, params.embd_normalize);
        llama_batch_ext_clear(batch);
        e += pooling_type == LLAMA_POOLING_TYPE_NONE ? old_tokens : s;
Actionable comments posted: 1
🧹 Nitpick comments (9)
src/llama-model.cpp (2)

6078-6082: Clarify and verify the transposition logic for input positions.
The new transposition sequence (reshape, transpose, then flatten) appears intended to convert a layout from 0000111122223333 to 0123012301230123. This approach is clear given the comment, but please verify that:
- The intermediate reshapes and transpose indeed produce the expected layout for all n_tokens and n_pos_per_token values.
- There is no unintended overhead or memory-management issue (e.g., resource leaks if ggml_cont allocates new tensors).
As a follow-up, consider encapsulating this common transformation in a helper function and adding unit tests to ensure correctness.

12065-12072: Consistent overloads for the position-per-token calculation.
The overloads of llama_n_pos_per_token that differentiate based on the architecture ensure that four positions per token are returned for LLM_ARCH_QWEN2VL, and one otherwise. This refactoring is concise and improves clarity by abstracting the logic behind position calculation; a sketch of the core rule follows below.
For added clarity, a brief comment explaining why the QWEN2VL architecture requires 4 positions (if not documented elsewhere) would be beneficial.
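As a reference point, the behavior described in 12065-12072 amounts to a one-line switch. A sketch only; the enum and type names are assumptions based on the symbols cited in this review:

    uint32_t llama_n_pos_per_token(llm_arch arch) {
        // Qwen2-VL's multimodal rope carries several position components per token
        return arch == LLM_ARCH_QWEN2VL ? 4 : 1;
    }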
examples/server/server.cpp (1)

2403-2447: Embeddings implementation with sequence ID.
The send_embedding method has been refactored to use sequence IDs instead of directly processing batch tokens. This aligns with the sequence-centric design of the extended batch API. However, as noted in the commented-out code, embeddings without pooling are not currently supported.
Consider implementing support for embeddings without pooling in a future update to make this feature complete. The commented-out code shows the intended approach but is not yet functional.
src/llama-batch.h (1)

8-33: Extended batch documentation.
These added comments detail the usage of llama_batch_ext (tokens, embeddings, positions, etc.). Good clarity. Consider clarifying multi-position behavior if more than one position per token is eventually used.
src/llama-batch.cpp (1)

431-448: Selective output marking.
llama_batch_ext_set_output marks tokens as output if they match the given seq_id (and, optionally, the position). This is a flexible approach. Just ensure that the linear search doesn't cause performance issues with large batches.
include/llama.h (4)

264-267: Consider adding a dedicated documentation block.
While the forward declaration of llama_batch_ext is helpful, consider adding a short docstring describing usage, memory ownership, and any constraints for users of this new extended batch structure.

     // Input data for llama_decode / llama_encode
     // It can contain text tokens and embeddings for one or many sequences
     struct llama_batch_ext;
    +// Example usage:
    +// 1) Allocate with llama_batch_ext_init(ctx).
    +// 2) Add tokens or embeddings.
    +// 3) Pass to llama_encode_ext / llama_decode_ext.
    +// 4) Free with llama_batch_ext_free().

485-486: Add clarifying comments for these new model functions.
The new functions for rope type and position retrieval appear useful, but adding brief details about their purpose (e.g., how the rope type influences positional encoding) would make the API more self-documenting.

     LLAMA_API enum llama_rope_type llama_model_rope_type(const struct llama_model * model);
     LLAMA_API uint32_t llama_n_pos_per_token(const struct llama_model * model);
    +/// @brief Returns the number of positions per token for the given model.
    +/// Typically used to configure internal data structures for positional encodings.

997-999: Document the extended encoding workflow.
The newly introduced llama_encode_ext function lacks a descriptive comment explaining its input, output, and error codes. Consider replicating or adapting the documentation from llama_encode, with emphasis on how extended batches differ.

     LLAMA_API int32_t llama_encode_ext(
             struct llama_context * ctx,
             struct llama_batch_ext * batch);
    +/// @brief Encode a batch of tokens or embeddings using the new llama_batch_ext API.
    +/// @return 0 on success, < 0 on error, as with llama_encode.
    +/// @note This function updates internal KV cache state.

1007-1011: Improve documentation and consistency with llama_decode.
Similarly, llama_decode_ext would benefit from usage notes and parameter details that mirror the original llama_decode doc, especially around batch sizing, concurrency, and return codes.

             struct llama_batch batch);
     LLAMA_API int32_t llama_decode_ext(
             struct llama_context * ctx,
             struct llama_batch_ext * batch);
    +/// @brief Decode a batch of extended tokens or embeddings.
    +/// @return 0 on success, < 0 on error.
    +/// @note Validate the batch size and sequence IDs before calling this function.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
examples/llava/qwen2vl-cli.cpp (2 hunks)
examples/parallel/parallel.cpp (7 hunks)
examples/server/server.cpp (20 hunks)
include/llama.h (6 hunks)
src/llama-batch.cpp (5 hunks)
src/llama-batch.h (4 hunks)
src/llama-model.cpp (2 hunks)
🔥 Files not summarized due to errors (2)
- examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
- src/llama-model.cpp: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🧬 Code Definitions (6)
(Flattened cross-reference listings omitted: per-file symbol/line lookups for the seven files above, resolving symbols such as llama_batch_ext_set_output_last, llama_batch_ext_get_view, llama_decode_ext, llama_n_pos_per_token, and llama_batch_allocr in include/llama.h, include/llama-cpp.h, src/llama-batch.h, and src/llama-batch.cpp.)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: macOS-latest-swift (generic/platform=macOS)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: editorconfig
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-windows
🔇 Additional comments (41)
examples/server/server.cpp (10)
1232-1232
: Type change fromllama_batch
to smart pointerllama_batch_ext_ptr
This is part of the migration from the legacy batch API to the extended batch API. The change to a smart pointer should improve memory management and resource cleanup.
1804-1804
: Consistent type change tollama_batch_ext_ptr
in server_contextThis change is consistent with the migration pattern from
llama_batch
tollama_batch_ext_ptr
seen in the slot struct, ensuring uniform adoption of the extended batch API throughout the codebase.
1930-1930
: Replaced initialization with clear() methodThe code now uses the
clear()
method on the batch objects instead of reinitializing them. This is a more explicit and potentially more efficient way to reset the batch state.Also applies to: 1962-1962, 2095-2095
2453-2468
: Reranking implementation with sequence IDSimilar to the embeddings implementation, the
send_rerank
method now uses sequence IDs instead of directly processing batch tokens. This provides a more consistent approach across different functionalities.
2874-2877
: Updated token addition API in update_slotsThe code now uses
batch.add_text()
to add tokens to the batch, which is part of the new extended batch API. This provides a more direct and potentially more efficient way to build batches.
3079-3084
: Consistent use of add_text for prompt tokensThe code for processing prompt tokens now uses the same
batch.add_text()
method, ensuring consistency in how tokens are added to batches throughout the codebase.
3110-3110
: Using extended batch API to set output tokenThe code now uses
llama_batch_ext_set_output_last(batch.get())
to set the last token as the output, which is part of the new API. This provides a clearer way to specify which tokens should produce output logits.
2893-2893
: Updated batch size checkThe condition now uses
batch.n_tokens()
instead of likely a direct field access in the previous implementation. This is consistent with the object-oriented approach of the new batch API.
3140-3146
: Improved batch processing with viewsThe code now creates batch views for processing, which allows processing batches in chunks without copying data. This is a more efficient approach that aligns well with the extended batch architecture.
3280-3285
: Updated speculative decoding batch handlingSpeculative decoding now uses the extended batch API consistently, maintaining compatibility with the rest of the code. The clear and add_text pattern is applied consistently here as well.
examples/parallel/parallel.cpp (15)
9-9
: Include for new extended batch API
Including"llama-cpp.h"
here aligns with the refactor to the new extended batch API. Good addition.
176-176
: Instantiating an extended batch pointer
Creatingllama_batch_ext_ptr batch(ctx);
ensures the batch is tied to the context. Make surectx
is valid and remains in scope for this usage.
194-194
: Adding system tokens to the new batch
Usingbatch.add_text(tokens_system[i], i, 0, false);
shows the updated approach to add tokens. Looks correct for the incremental indexing ofi
.
197-197
: Error handling for extended decode
The conditional checks the return value ofllama_decode_ext
and logs an error before returning. The handling is consistent with the new decoding API.
218-218
: Clearing the extended batch
batch.clear();
correctly resets the batch for reuse. Straightforward usage of the new method.
226-226
: Tracking client batch indices
Assigningclient.i_batch = batch.n_tokens();
associates the client’s position with the current end of the batch tokens. This is logically sound.
229-229
: Adding sampled token with position offset
batch.add_text(client.sampled, n_tokens_system + client.n_prompt + client.n_decoded, seq_id, true);
properly offsets the position. Ensure total position count doesn’t exceed the model’s limit.
`247-247`: Conditional insertion of new sequences

`if (cont_batching || batch.n_tokens() == 0) { ... }` ensures new sequences are only appended when continuous batching is allowed or the batch is empty. No immediate concerns.

`265-266`: Appending prompt tokens

These lines compute a sequence ID and add tokens with `batch.add_text(...)`. The logic appears correct for binding tokens to the correct client-driven sequence.

`271-272`: Mark last token for output

Conditionally sets the last token as an output token. This is a clean way to retrieve logits/embeddings only for the final token if needed.

`277-277`: Index alignment for newly added tokens

`client.i_batch = batch.n_tokens() - 1;` updates the client's reference to the last token position. Straightforward usage.

`290-290`: Early exit if batch is empty

`if (batch.n_tokens() == 0) { break; }` stops processing when there are no tokens to decode. Approved logic.

`297-298`: Looping over batch tokens

Storing `n_tokens_in_batch = batch.n_tokens()` and iterating in chunks is consistent with the new extended batch usage.

`306-306`: Chunk size determination

`const int32_t n_tokens = std::min(n_batch, (int32_t) (n_tokens_in_batch - i));` neatly caps each iteration's size. Looks correct.
`308-310`: Batch view usage without explicit boundary checks

`llama_batch_ext_get_view(batch.get(), i, n_tokens)` assumes `i + n_tokens <= batch->max_tokens`. If that does not hold, this could go out of bounds. Verify that the combination of offset + n_tokens never exceeds the allocated range.
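A defensive version of the chunking loop could assert that invariant explicitly; a minimal sketch, mirroring the surrounding parallel.cpp loop (the view-free semantics follow the `is_view` behavior described later in this review):

```cpp
#include <algorithm>
#include <cassert>

const int32_t n_total = batch.n_tokens();
for (int32_t i = 0; i < n_total; i += n_batch) {
    const int32_t n_tokens = std::min(n_batch, n_total - i);
    assert(i + n_tokens <= n_total); // view must stay inside the parent batch

    llama_batch_ext * view = llama_batch_ext_get_view(batch.get(), i, n_tokens);
    const int ret = llama_decode_ext(ctx, view);
    llama_batch_ext_free(view); // per this PR, a view's shared buffers are not freed
    if (ret != 0) {
        // handle or report the decode failure for this chunk
        break;
    }
}
```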
src/llama-batch.h (5)

`77-77`: Pointer to extended batch

`const llama_batch_ext * batch = nullptr;` suggests read-only pointer usage. Ensure the batch is valid for the entire lifetime of this struct.

`100-100`: `from_batch` updated to accept `llama_batch_ext`

Switching to the extended type is consistent with the overall refactor. The function properly handles tokens, pos, seq IDs, etc.

`105-106`: Pointer member for the allocated batch

`struct llama_batch_ext * batch;` in `llama_batch_allocr` shows a new approach for dynamic ownership. Confirm that usage patterns avoid double frees.

`113-114`: New constructor transition

The constructor now receives `struct llama_batch & in_batch` to create an extended batch. This is a straightforward bridging approach from the old batch type.

`116-117`: Ensuring proper resource teardown

The destructor calls `delete batch;`, preventing leaks. This finalizes the one-to-one ownership pattern.
src/llama-batch.cpp (10)

`2-2`: Including llama-graph

Adding `#include "llama-graph.h"` may be necessary if extended batch APIs or graph structures cross-reference. This is presumably correct.

`342-373`: `llama_batch_ext_init_impl` allocation strategy

The function allocates dynamic buffers for tokens, embeddings, positions, etc., and sets default fields. Notably, there's no explicit out-of-memory check, consistent with codebase learnings. Ensure the caller will handle or assume success.

`375-379`: Initialize extended batch from context

`llama_batch_ext_init` references the model's batch/seq maxima. The code sets up a batch with zero tokens. Looks correct for a typical initialization.

`380-396`: Initialize extended batch from embeddings

Creates a batch for `embd`, copying positions and assigning the single seq_id. This is consistent with an embedding-based input pipeline.

`398-400`: Getter for token count

`llama_batch_ext_get_n_tokens` simply returns `batch->n_tokens`. Straightforward accessor.
`402-429`: Adding tokens to extended batch

`llama_batch_ext_add_text` appends one token, copying seq_ids and setting logits flags. This method returns negative codes on error. See if upstream code checks these error codes thoroughly.
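Given the error codes cited elsewhere in this review (-1 when the batch is full, -2 when the batch already carries embeddings), a caller-side check might look like this sketch:

```cpp
// rc < 0 signals failure: -1 = capacity exceeded, -2 = embeddings already set.
const int32_t rc = llama_batch_ext_add_text(batch, token, pos, &seq_id, 1, /*output=*/false);
if (rc == -1) {
    // batch is full: decode what is queued, clear the batch, then retry
} else if (rc == -2) {
    // this batch was initialized for embeddings and cannot take text tokens
}
```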
`450-457`: Set the last token as output

`llama_batch_ext_set_output_last` sets the `logits` flag on the last token if `batch->n_tokens > 0`. Appropriately returns the output index.

`459-462`: Clearing the extended batch

`llama_batch_ext_clear(batch)` sets `n_tokens = 0`. Straightforward approach to reuse the allocated buffers.

`463-484`: Creating a batch view without buffer copies

`llama_batch_ext_get_view` returns a new `llama_batch_ext` object referencing a sub-range (offset, n_tokens). It's flagged as `is_view` so its memory isn't freed. Beware of offset boundary checks.
`485-501`: Memory release for extended batch

`llama_batch_ext_free` frees buffers only if it's not a view. This upholds the `is_view` semantics. Potential for double-free is mitigated if usage is consistent.
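Under that reading of the `is_view` flag, a parent batch and its views can be released independently; a sketch of the intended lifetime rules (assuming freeing a view releases only the view object itself, never the shared buffers):

```cpp
llama_batch_ext * parent = llama_batch_ext_init(ctx);
llama_batch_ext * view   = llama_batch_ext_get_view(parent, /*offset=*/0, /*n_tokens=*/8);

llama_batch_ext_free(view);   // is_view is set: shared token/pos buffers stay alive
llama_batch_ext_free(parent); // the underlying buffers are released exactly once, here
```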
include/llama.h (1)

`238-240`: Clear deprecation note for the legacy struct

The addition of a prominent warning helps guide developers away from legacy usage.
```cpp
if (llama_pooling_type(ctx_server.ctx) == LLAMA_POOLING_TYPE_NONE) {
    res_error(res, format_error_response("Pooling type 'none' is not yet supported. Please use a different pooling type", ERROR_TYPE_INVALID_REQUEST));
    return;
}
```
Warning about unsupported pooling type

The code correctly checks for and rejects requests with pooling type 'none' since this isn't yet supported, as indicated by the commented-out implementation in `send_embedding`.

Consider implementing support for the 'none' pooling type to make the feature complete, or provide more detailed error messages to help users understand the limitation.
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/server/server.cpp (1)
`1256-1257`: TODO comment should be addressed.

There's a TODO comment about removing `i_batch` and using only sequence-based sampling, which would be a more modern approach. Consider implementing this change now as part of the batch API migration, or create a follow-up task to track this improvement.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/server/server.cpp (20 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
examples/server/server.cpp (6)

- src/llama-batch.h (1): `batch` (100-100)
- src/llama-context.h (9): `ctx` (157-161), `seq_id` (53-53), `seq_id` (100-100), `seq_id` (101-101), `seq_id` (102-102), `seq_id` (115-120), `seq_id` (122-126), `i` (49-49), `i` (52-52)
- include/llama.h (4): `llama_pooling_type` (482-482), `llama_get_embeddings_seq` (1076-1076), `llama_batch_ext_set_output_last` (973-973), `llama_batch_ext_get_view` (977-980)
- src/llama-context.cpp (6): `llama_pooling_type` (2385-2387), `llama_pooling_type` (2385-2385), `llama_get_embeddings_seq` (2456-2460), `llama_get_embeddings_seq` (2456-2456), `n_batch` (437-439), `n_batch` (437-437)
- common/common.cpp (2): `common_embd_normalize` (1795-1829), `common_embd_normalize` (1795-1795)
- src/llama-batch.cpp (4): `llama_batch_ext_set_output_last` (450-457), `llama_batch_ext_set_output_last` (450-450), `llama_batch_ext_get_view` (463-483), `llama_batch_ext_get_view` (463-466)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-windows
- GitHub Check: server (Release)
🔇 Additional comments (21)
examples/server/server.cpp (21)
`1232-1232`: Data structure update to use extended batch.

The server_slot struct now uses `llama_batch_ext_ptr`, which is a smart pointer wrapper for the new extended batch structure, replacing the previous `llama_batch` type. This is part of the migration to the new batch API.

`1804-1804`: Data structure update in server context.

The `server_context` struct now uses `llama_batch_ext_ptr` instead of `llama_batch`, consistent with the migration to the extended batch API.

`1930-1930`: Updated batch_spec clearing.

Now using the `clear()` method of the smart pointer wrapper instead of initializing with empty braces. This is the correct approach for the new API.

`2095-2095`: Consistent use of clear() method.

Consistently using the `clear()` method on batch_spec throughout the codebase, which is appropriate for the new batch pointer type.
`2403-2447`: Embeddings implementation with sequence ID

The implementation now correctly uses sequence IDs from the slot instead of iterating through batch tokens. Note that the commented-out code (currently not used) contains a limitation warning that embeddings without pooling are not yet supported.

Consider implementing support for embeddings without pooling to make the feature complete. The commented code shows the intended approach but is not yet functional.
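For reference, the sequence-ID path boils down to something like this sketch (`slot.id` and `n_embd` stand in for the server's actual fields; `llama_get_embeddings_seq` and `common_embd_normalize` are the helpers listed in the cross-references above):

```cpp
// Pull the pooled embedding for the slot's sequence, then L2-normalize it.
const float * embd = llama_get_embeddings_seq(ctx, slot.id);
if (embd == nullptr) {
    // no pooled embedding available for this sequence; report an error
} else {
    std::vector<float> out(n_embd);
    common_embd_normalize(embd, out.data(), n_embd, 2); // 2 = Euclidean norm
}
```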
`2453-2468`: Improved reranking implementation with sequence ID.

The `send_rerank` method has been updated to work with the new batch API by using sequence IDs rather than passing a batch parameter. This is consistent with the similar changes to the `send_embedding` method and aligns with the broader batch API migration.

`2852-2852`: Properly clearing the batch.

Using `batch.clear()` instead of resetting with empty braces, which is the correct approach for the new batch API.

`2874-2877`: Correct use of new batch API methods.

Using the batch's accessor methods (`n_tokens()`) and builder methods (`add_text()`) instead of directly accessing properties, which is consistent with the new API design.

`2893-2893`: Using proper batch checking.

The condition now properly uses `batch.n_tokens()` to check if the batch is empty, which is consistent with the new API.

`3059-3060`: Proper batch capacity checking.

Using `batch.n_tokens() + slot.n_prompt_tokens > n_batch` to check if the prompt will fit in the batch, which is the correct approach with the new API.

`3083-3084`: Properly adding token to batch.

Using `batch.add_text()` to add tokens to the batch, which is the proper method for the new batch API.

`3093-3095`: Consistent batch size reporting.

Using `batch.n_tokens()` in the logging statement to report the current batch size, maintaining consistency with the new API.
`3100-3100`: Proper batch assertion.

Using `batch.n_tokens() > 0` in the assertion, which is the correct way to check for a non-empty batch with the new API.

`3110-3110`: Using new batch API for output management.

Calling `llama_batch_ext_set_output_last(batch.get())` correctly sets up the batch for output processing with the new API.

`3113-3115`: Properly setting batch index.

Setting `slot.i_batch = batch.n_tokens() - 1` to correctly identify the position of the last token in the batch, which is important for retrieving outputs.

`3119-3125`: Proper batch size checking.

Using `batch.n_tokens() >= n_batch` and `batch.n_tokens() == 0` for batch processing control flow, which is consistent with the new API.

`3130-3130`: Improved logging with batch size.

Using `batch.n_tokens()` in the debug log message to properly report the batch size being processed.

`3140-3145`: Proper batch processing with views.

Using the new batch view functionality (`llama_batch_ext_get_view`) to process subsets of the batch efficiently. This is an important feature of the new batch API that allows for better batch management.

`4146-4149`: Warning about unsupported pooling type

The code correctly checks for and rejects requests with pooling type 'none' since this isn't yet supported, as indicated by the commented-out implementation in `send_embedding`. Consider implementing support for the 'none' pooling type to make the feature complete, or provide more detailed error messages to help users understand the limitation.

`3280-3284`: Speculative batch initialization using new API.

The code properly initializes and adds tokens to the speculative batch using the new batch API methods like `clear()` and `add_text()`. This maintains consistency with the rest of the codebase.

`3287-3288`: Improved debug logging for batch processing.

The debug message now correctly reports the size of the speculative batch using `batch_spec.n_tokens()`, which is consistent with the new API.
Actionable comments posted: 2
♻️ Duplicate comments (1)
include/llama-cpp.h (1)
`37-60`: 🛠️ Refactor suggestion — Consider checking for negative return values.

Within `init_from_text`, no error checks are performed on `llama_batch_ext_add_text`. A negative return indicates a failure to add the token, which is silently ignored.

```diff
 for (int32_t i = 0; i < n_tokens; i++) {
-    llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+    int32_t ret = llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+    if (ret < 0) {
+        // Handle error or propagate it, e.g.:
+        return llama_batch_ext_ptr(nullptr);
+    }
 }
```
🧹 Nitpick comments (4)
include/llama-cpp.h (2)
`62-84`: Validate initialization result.

In `init_from_embd`, if `llama_batch_ext_init_from_embd` fails, it might return a null pointer or an invalid batch. Consider adding error checks or fallback handling.

`104-114`: Warn about ignoring the last returned output_id in add_seq.

Inside the loop, you overwrite `output_id` without checking for error codes. If any call fails, it remains undetected.

examples/embedding/embedding.cpp (1)
`29-70`: Consider early return on encode/decode failure.

Currently, if `llama_encode_ext` or `llama_decode_ext` fails, the error is logged, but the function proceeds with embedding extraction. This can lead to unexpected results.

```diff
 if (llama_encode_ext(ctx, batch) < 0) {
     LOG_ERR("%s : failed to encode\n", __func__);
+    return; // or handle gracefully
 }
```
examples/retrieval/retrieval.cpp (1)
`77-105`: Handle encode/decode failures more robustly.

As in the embedding example, logging an error may not suffice. Consider returning or skipping remaining logic to avoid invalid state use.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/embedding/embedding.cpp (3 hunks)
- examples/retrieval/retrieval.cpp (5 hunks)
- include/llama-cpp.h (2 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
examples/retrieval/retrieval.cpp (5)

- include/llama.h (4): `llama_get_model` (480-480), `llama_batch_ext_get_n_tokens` (944-944), `llama_model_has_encoder` (533-533), `llama_model_has_decoder` (536-536)
- src/llama-context.cpp (2): `llama_get_model` (2373-2375), `llama_get_model` (2373-2373)
- src/llama-batch.cpp (2): `llama_batch_ext_get_n_tokens` (398-400), `llama_batch_ext_get_n_tokens` (398-398)
- src/llama-model.cpp (4): `llama_model_has_encoder` (12139-12145), `llama_model_has_encoder` (12139-12139), `llama_model_has_decoder` (12147-12152), `llama_model_has_decoder` (12147-12147)
- common/common.h (1): `common_embd_normalize` (636-636)
examples/embedding/embedding.cpp (4)

- src/llama-context.cpp (4): `pooling_type` (823-825), `pooling_type` (823-823), `llama_pooling_type` (2385-2387), `llama_pooling_type` (2385-2385)
- src/llama-batch.cpp (2): `llama_batch_ext_get_n_tokens` (398-400), `llama_batch_ext_get_n_tokens` (398-398)
- src/llama-model.cpp (4): `llama_model_has_encoder` (12139-12145), `llama_model_has_encoder` (12139-12139), `llama_model_has_decoder` (12147-12152), `llama_model_has_decoder` (12147-12147)
- common/common.cpp (2): `common_embd_normalize` (1795-1829), `common_embd_normalize` (1795-1795)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: windows-latest-cmake (msvc-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-win...
- GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server (Release)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server-windows
🔇 Additional comments (13)
include/llama-cpp.h (4)
`8-8`: No issues with the new include.

The addition of `<vector>` is appropriate for managing dynamic arrays.

`28-30`: Proper unique_ptr deleter setup.

The `llama_batch_ext_deleter` correctly invokes `llama_batch_ext_free` and ensures clean resource deallocation.

`95-102`: Single-sequence add_text: same concern about error handling.

Similarly, consider handling a potentially negative return value from `llama_batch_ext_add_text`.

`116-123`: Clear and n_tokens usage looks good.

Both methods wrap their underlying C API nicely. No immediate issues found.

examples/embedding/embedding.cpp (4)
`163-163`: Initialization of batch with context seems fine.

This change follows the new extended batch structure design.

`190-195`: Potential off-by-zero.

After calling `batch_decode` and then `batch.clear()`, `batch.n_tokens()` is guaranteed to be zero, so line 194 always increments `e` by 0 for the NONE pooling path. Verify if you intended to use the token count before clearing.
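One way to keep the NONE pooling path counting correctly is to read the token count before clearing; a rough sketch (`e`, `s`, `emb`, and `n_embd` refer to the surrounding variables in embedding.cpp):

```cpp
// Read the token count BEFORE clear() resets it to zero.
const int32_t n_in_batch = batch.n_tokens();
batch_decode(ctx, batch.get(), emb + e * n_embd, s, n_embd);
e += pooling_type == LLAMA_POOLING_TYPE_NONE ? n_in_batch : s;
s = 0;
batch.clear();
```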
`199-199`: Sequence addition appears correct.

You set the sequence ID as `s` to distinguish each prompt in the batch.

`204-204`: Final decode invocation aligns with new API usage.

Decoding after adding remaining tokens ensures embeddings are produced.

examples/retrieval/retrieval.cpp (5)

`209-209`: Good: adopting new extended batch for retrieval.

No problems spotted.

`225-227`: Flow for capacity check is correct.

When the batch is full, you decode first and then clear. This is valid.

`235-235`: Sequence addition looks straightforward.

Token list is added with a unique sequence ID.

`240-240`: Decode at final batch boundary.

Decoding here ensures the last chunk is processed before storing embeddings.

`249-249`: Separate query batch.

Initializing a new batch for queries makes sense for the retrieval process.
include/llama-cpp.h (outdated)

```cpp
// Wrapper to add a single token to the batch, support multiple sequence IDs
int32_t add_text(llama_token token, llama_pos pos, const std::vector<llama_seq_id> & seq_id, bool output_last) {
    int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
    if (output_last) {
        llama_batch_ext_set_output_last(this->get());
    }
    return output_id;
}
```
🛠️ Refactor suggestion — Check for add_text failure.

Although you store `output_id`, you do not verify if the addition was successful. A negative return indicates an error.
```diff
 batch.add_seq(query_tokens, 0, 0, true);

 std::vector<float> query_emb(n_embd, 0);
-batch_decode(ctx, query_batch, query_emb.data(), 1, n_embd);
+batch_decode(ctx, query_batch.get(), query_emb.data(), 1, n_embd);

-common_batch_clear(query_batch);
+query_batch.clear();
```
Critical mismatch between “batch” vs. “query_batch”.

You add query tokens to `batch` (line 258) but then decode `query_batch` (line 261), which remains empty. This results in incorrect or empty embeddings.

```diff
-batch.add_seq(query_tokens, 0, 0, true);
+query_batch.add_seq(query_tokens, 0, 0, true);
 batch_decode(ctx, query_batch.get(), query_emb.data(), 1, n_embd);
```
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
query_batch.add_seq(query_tokens, 0, 0, true);

std::vector<float> query_emb(n_embd, 0);
batch_decode(ctx, query_batch.get(), query_emb.data(), 1, n_embd);

query_batch.clear();
```
Actionable comments posted: 0
♻️ Duplicate comments (4)
include/llama-cpp.h (4)
`42-61`: ⚠️ Potential issue — Missing error handling for batch token addition

The implementation doesn't check the return value of `llama_batch_ext_add_text()`, which can return negative values to indicate errors (-1 if batch size is exceeded, -2 if embeddings are already set).

```diff
 static llama_batch_ext_ptr init_from_text(struct llama_context * ctx, llama_token * tokens, int32_t n_tokens, llama_pos pos0, llama_seq_id seq_id, bool output_last) {
     llama_batch_ext * batch = llama_batch_ext_init(ctx);
     for (int32_t i = 0; i < n_tokens; i++) {
-        llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+        int32_t rc = llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+        if (rc < 0) {
+            // Handle error: free batch and return empty pointer or throw exception
+            llama_batch_ext_free(batch);
+            return llama_batch_ext_ptr(nullptr);
+        }
     }
     if (output_last) {
         // TODO: somehow return the output ID
         llama_batch_ext_set_output_last(batch);
     }
     return llama_batch_ext_ptr(batch);
 }
```
`87-95`: ⚠️ Potential issue — Error handling missing in add_text with multiple sequence IDs

The method doesn't check the return value from `llama_batch_ext_add_text()`, which could lead to silent failures.

```diff
 int32_t add_text(llama_token token, llama_pos pos, const std::vector<llama_seq_id> & seq_id, bool output_last) {
     int32_t output_id = -1;
-    llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
+    int32_t rc = llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
+    if (rc < 0) {
+        return rc; // Return error code
+    }
     if (output_last) {
         output_id = llama_batch_ext_set_output_last(this->get());
     }
     return output_id;
 }
```
`97-105`: ⚠️ Potential issue — Error handling missing in add_text with single sequence ID

Similar to the multiple sequence ID version, this method doesn't check the return value from `llama_batch_ext_add_text()`.

```diff
 int32_t add_text(llama_token token, llama_pos pos, llama_seq_id seq_id, bool output_last) {
     int32_t output_id = -1;
-    llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
+    int32_t rc = llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
+    if (rc < 0) {
+        return rc; // Return error code
+    }
     if (output_last) {
         output_id = llama_batch_ext_set_output_last(this->get());
     }
     return output_id;
 }
```
`107-117`: ⚠️ Potential issue — Error handling missing in add_seq method

The method doesn't check the return value from `llama_batch_ext_add_text()` calls, which could lead to silent failures.

```diff
 int32_t add_seq(std::vector<llama_token> & tokens, llama_pos pos0, llama_seq_id seq_id, bool output_last) {
     int32_t output_id = -1;
     for (size_t i = 0; i < tokens.size(); i++) {
-        llama_batch_ext_add_text(this->get(), tokens[i], pos0 + i, &seq_id, 1, false);
+        int32_t rc = llama_batch_ext_add_text(this->get(), tokens[i], pos0 + i, &seq_id, 1, false);
+        if (rc < 0) {
+            return rc; // Return error code and stop processing
+        }
     }
     if (output_last) {
         output_id = llama_batch_ext_set_output_last(this->get());
     }
     return output_id;
 }
```
🧹 Nitpick comments (1)
include/llama-cpp.h (1)
`57-57`: Address TODO comment about returning output ID

The TODO comment indicates an incomplete implementation. The `output_id` from `llama_batch_ext_set_output_last()` should be captured and returned to the caller.

```diff
 static llama_batch_ext_ptr init_from_text(struct llama_context * ctx, llama_token * tokens, int32_t n_tokens, llama_pos pos0, llama_seq_id seq_id,
-        bool output_last) {
+        bool output_last,
+        int32_t * p_output_id = nullptr) {
     llama_batch_ext * batch = llama_batch_ext_init(ctx);
     for (int32_t i = 0; i < n_tokens; i++) {
         llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
     }
     if (output_last) {
-        // TODO: somehow return the output ID
-        llama_batch_ext_set_output_last(batch);
+        int32_t output_id = llama_batch_ext_set_output_last(batch);
+        if (p_output_id != nullptr) {
+            *p_output_id = output_id;
+        }
     }
     return llama_batch_ext_ptr(batch);
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
include/llama-cpp.h (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
include/llama-cpp.h (1)

- include/llama.h (6): `llama_batch_ext_init` (926-926), `llama_batch_ext_add_text` (951-957), `llama_batch_ext_set_output_last` (973-973), `llama_batch_ext_init_from_embd` (935-941), `llama_batch_ext_clear` (983-983), `llama_batch_ext_get_n_tokens` (944-944)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: android-build
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-22-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: server-windows
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
🔇 Additional comments (5)
include/llama-cpp.h (5)
`28-30`: LGTM: Clear consistent deleter implementation

The deleter implementation follows the established pattern in the file, consistent with other resource management classes.

`37-41`: Constructor implementations look good

The constructors properly initialize the base unique_ptr class with appropriate parameters, following C++ best practices.

`63-71`: LGTM: Clean wrapper for embeddings initialization

The implementation properly wraps the C API function and handles memory management through the smart pointer.

`73-85`: Good position handling implementation

The method correctly generates sequential positions starting from `pos0`, which aligns with the migration pattern requirements.
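Concretely, "sequential positions starting from `pos0`" means the wrapper fills a position array along these lines (an illustrative sketch, not the literal implementation):

```cpp
// Token i is placed at absolute position pos0 + i.
std::vector<llama_pos> pos(n_tokens);
for (int32_t i = 0; i < n_tokens; ++i) {
    pos[i] = pos0 + i;
}
```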
`119-125`: LGTM: Simple utility wrappers

The `clear()` and `n_tokens()` methods are clean wrappers around their C API counterparts.
Summary by CodeRabbit

New Features

- Introduced an extended batch API (`llama_batch_ext`) for improved handling of tokens and sequences.

Refactor

- Deprecated the legacy `llama_batch` structure and associated functions in favor of the new `llama_batch_ext` methods.