Private batch API (for AI review) #14


Open

wants to merge 61 commits into base: master

Conversation

@ngxson ngxson (Owner) commented Mar 13, 2025

Summary by CodeRabbit

  • New Features

    • Upgraded batch processing for text generation, yielding faster and more reliable responses.
    • Introduced a new batch management structure (llama_batch_ext) for improved handling of tokens and sequences.
    • Added permissions for external storage access in the Android application to facilitate file handling.
  • Refactor

    • Replaced legacy token handling methods with more efficient mechanisms for better resource management and stability.
    • Updated existing functions to utilize the new batch structure, improving clarity and performance in batch processing.
    • Deprecated the old llama_batch structure and associated functions in favor of the new llama_batch_ext methods.
    • Improved error handling and initialization processes for batch operations across various components.


coderabbitai bot commented Mar 13, 2025

Walkthrough

This pull request refactors the batch processing system throughout the project. The legacy API using llama_batch (and related functions like common_batch_add/clear and string_from) has been removed or marked as deprecated. In its place, a new extended API is introduced using structures such as llama_batch_ext (and its unique pointer alias) and common_batch, with updated functions for initialization, token addition, output handling, decoding, and memory management. Both core library files and numerous example modules have been updated to consistently adopt the extended batch mechanism.
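
As a rough illustration of that migration, the sketch below decodes a single token with the legacy API and then with the extended one. The llama_batch_ext_* argument order is inferred from the call sites discussed in this review and may not match the final header exactly.

#include "llama.h"   // core C API (this PR adds the llama_batch_ext_* declarations)
#include "common.h"  // legacy common_batch_add helper, deprecated by this PR

// Minimal sketch: decode one token with the legacy API vs. the extended API.
static void decode_one_token(llama_context * ctx, llama_token token, llama_pos pos) {
    // legacy (deprecated) path
    llama_batch batch = llama_batch_init(/*n_tokens*/ 1, /*embd*/ 0, /*n_seq_max*/ 1);
    common_batch_add(batch, token, pos, { 0 }, /*logits*/ true);
    llama_decode(ctx, batch);
    llama_batch_free(batch);

    // extended path introduced by this PR (argument order assumed from call sites)
    llama_batch_ext * ebatch = llama_batch_ext_init(/*n_tokens*/ 1, /*n_seq_max*/ 1);
    llama_seq_id seq_id = 0;
    llama_batch_ext_add_text(ebatch, token, pos, &seq_id, /*n_seq_ids*/ 1, /*output*/ true);
    llama_decode_ext(ctx, ebatch);
    llama_batch_ext_free(ebatch);
}

The llama_batch_ext_ptr alias added in include/llama-cpp.h can take ownership of the extended batch and drop the explicit llama_batch_ext_free call.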

Changes

File(s) and change summary:

  • common/common.cpp, common/common.h: Removed the string_from function; deprecated common_batch_clear and common_batch_add; introduced the new common_batch structure with methods like clear(), add_text(), and get_view().
  • common/speculative.cpp: Updated the batch member type from llama_batch to llama_batch_ext_ptr and replaced legacy batch functions with extended counterparts (llama_batch_ext_init_from_text, llama_decode_ext, etc.).
  • examples/batched-bench/*, examples/batched/*: Replaced basic batch initialization (e.g. llama_batch_init) and functions (common_batch_add/clear, llama_decode) with extended API calls (llama_batch_ext_init, llama_batch_ext_add_text, llama_decode_ext, etc.), and modified token count retrieval.
  • examples/... (cvector-generator, embedding, eval-callback, gritlm, imatrix, infill, llama-bench, lookahead, lookup, main, parallel, passkey, perplexity, retrieval, run, save-load-state, server, simple-chat, simple, speculative-simple, speculative): Consistently transitioned all example code to use the extended batch API by updating function signatures, initialization calls, token addition routines, decoding, and memory cleanup.
  • include/llama-cpp.h, include/llama.h: Introduced new declarations for llama_batch_ext and its unique pointer alias (llama_batch_ext_ptr); marked the legacy llama_batch API as deprecated; added new functions for extended batch management.
  • src/llama-batch.cpp, src/llama-batch.h: Refactored batch management to implement llama_batch_ext; added functions for initializing from text/embeddings, adding tokens, setting outputs, clearing, getting views, and freeing resources.
  • src/llama-context.cpp, src/llama-context.h: Modified context encoding/decoding to accept llama_batch_ext instead of llama_batch; added overloads for backwards compatibility while deprecating the old methods; removed temporary memory allocation logic.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant BatchAPI
    participant Context
    Caller->>BatchAPI: Initialize batch (llama_batch_ext_init_from_text)
    BatchAPI-->>Caller: Return new batch pointer
    Caller->>BatchAPI: Add tokens (llama_batch_ext_add_text)
    Caller->>BatchAPI: Clear batch if needed (llama_batch_ext_clear)
    Caller->>Context: Decode tokens (llama_decode_ext(ctx, batch))
    Context-->>Caller: Return decoded output

Poem

I’m a rabbit in the codeland, hopping with delight,
Leaping from old paths to new with every byte.
Extended batches now lead my joyful run,
Clean and clear, a fresh new dawn begun.
In fields of tokens and logic so spry,
I cheer—hop on, and watch our code fly!
🐇💻🌟


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

🔭 Outside diff range comments (1)
src/llama-batch.cpp (1)

356-386: ⚠️ Potential issue

Potential out-of-bounds write when n_embd != 0 but n_tokens_alloc == 0.
If llama_batch_ext_init_from_embd is called with n_tokens_alloc=0 and n_embd>0, the code copies data (memcpy(...)) into a buffer of zero length. This is a critical bug.

-static struct llama_batch_ext * llama_batch_ext_init_impl(int32_t n_tokens_alloc, int32_t embd, int32_t n_seq_max) {
-    ...
-    if (embd) {
-        batch->embd = (float *) malloc(sizeof(float) * n_tokens_alloc * embd);
-    }
-    ...
-    memcpy(batch->embd, embd, n_embd * sizeof(float)); // potential OOB if n_tokens_alloc = 0
+    // Fix by using `n_tokens_alloc = n_embd` if we treat each float as a separate token:
+    // or re-check the logic to handle embeddings properly
}
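
One way to make that path defensive, sketched as a standalone helper using the parameter names from the diff above (hypothetical, not the PR's actual code):

#include <cstdlib>
#include <cstring>

// Hypothetical guard: refuse to copy embedding data when the batch was
// allocated with zero token capacity, avoiding a memcpy into a zero-length buffer.
static float * copy_embd_checked(const float * embd, int32_t n_tokens_alloc, int32_t n_embd) {
    if (embd == nullptr || n_tokens_alloc <= 0 || n_embd <= 0) {
        return nullptr; // nothing to copy
    }
    const size_t n_floats = (size_t) n_tokens_alloc * (size_t) n_embd;
    float * dst = (float *) malloc(n_floats * sizeof(float));
    if (dst != nullptr) {
        memcpy(dst, embd, n_floats * sizeof(float));
    }
    return dst;
}
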
🧹 Nitpick comments (46)
common/common.cpp (2)

1580-1584: Deprecated function: common_batch_clear.
Since this is now marked DEPRECATED, verify if it is still referenced elsewhere. Removing dead code could reduce maintenance overhead.


1585-1604: Deprecated function: common_batch_add.
Similar to common_batch_clear, confirm whether any call sites remain. If unused, consider removing to keep the codebase clean.

common/speculative.cpp (1)

242-242: Repeated clearing.
Be mindful of performance overhead from frequent batch clears. If performance is a concern, consider re-adding tokens incrementally in a single batch.

examples/imatrix/imatrix.cpp (1)

516-517: Parameter check for llama_batch_ext_add_text.
You're using a single sequence ID (seq_id = 0) and passing true for the last parameter on each iteration. Verify whether that semantic (indicating the end of the sequence or the final token) is intended at this stage of the code.

examples/gritlm/gritlm.cpp (4)

16-16: Check return from llama_batch_ext_init.
Ensure you handle possible allocation failures or consider adding a quick null check.
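
A minimal sketch of such a check, assuming the function returns nullptr on allocation failure and that n_batch stands in for whatever capacity the surrounding code already passes:

llama_batch_ext * batch = llama_batch_ext_init(/*n_tokens*/ n_batch, /*n_seq_max*/ 1);
if (batch == nullptr) {
    fprintf(stderr, "%s: failed to allocate llama_batch_ext\n", __func__);
    return 1; // or otherwise abort before the batch is used
}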


44-45: Ensure correct usage of sequence ID and final token flag.
Using the last parameter j >= n_inst sets is_end_of_token conditionally. Validate if that logic aligns with your intended design (i.e., ignoring instruction tokens in embeddings).


121-122: Sequence ID usage.
Similarly, passing a zeroed seq_id each time is valid if you have a single sequence context. If you plan to support multiple sequences, consider a dynamic approach.


129-129: Sampling from last token.
llama_batch_ext_get_n_tokens(bat) - 1 is correct for indexing the last token. Ensure there are no off-by-one scenarios when the batch is empty.

examples/llava/gemma3-cli.cpp (3)

77-77: Initialize batch safely.
You might want to throw/log an error if batch allocation fails to avoid null pointer usage.


184-185: Sequence ID usage for next token.
Passing ctx.n_past++ ensures correct offset. Using the last parameter as true for final token might be by design. Double-check if you want the model to consider each token final.


186-186: Decoding each new token.
Immediately decoding the newly added token is a valid approach for token-by-token generation. Confirm it matches your performance and concurrency requirements if you have advanced usage.

examples/run/run.cpp (1)

911-911: Validate the batch pointer
Before calling llama_batch_ext_get_n_tokens(), consider checking that batch.get() is valid to avoid unexpected null pointer issues.

common/common.h (1)

572-576: Deprecated functions
Marking common_batch_clear and common_batch_add as deprecated is fine. If they're fully replaced by the new extended batch mechanism, consider removing them in a future cleanup.

examples/parallel/parallel.cpp (1)

278-278: Index handling
client.i_batch = llama_batch_ext_get_n_tokens(batch) - 1 is correct for referencing the last token’s index, but ensure it cannot go negative when the batch is empty.

examples/lookahead/lookahead.cpp (3)

120-120: Use a smart pointer for consistency.
You already use llama_batch_ext_ptr in lines 95 and 96. For consistency, consider returning a llama_batch_ext_ptr here as well, which will simplify memory management.

-llama_batch_ext * batch = llama_batch_ext_init(params.n_ctx, W + G + 1);
+llama_batch_ext_ptr batch(llama_batch_ext_init(params.n_ctx, W + G + 1));

250-251: Consider improving readability by avoiding line breaks in function parameters.
The parameter list is split across lines. Keeping the function call on one line, or using a local struct/object, can enhance clarity.


483-483: Consider unified resource management.
Here, you call llama_batch_ext_free(batch) directly, whereas lines 95 and 96 rely on llama_batch_ext_ptr. For uniformity and safer resource handling, consider adopting the same pattern.

examples/speculative/speculative.cpp (1)

48-648: Disabling a large block of logic with #if 0.

You have effectively deactivated a significant portion of the logic by wrapping lines 49–648 in #if 0/#endif. This is a convenient way to temporarily disable code paths, but ensure this block of code is either intentionally deprecated or will be re-enabled after the experiments/tests. If it’s no longer needed, consider removing it to keep the codebase clean.

examples/perplexity/perplexity.cpp (3)

904-910: Randomization or partial evaluation logic changes.

If these lines handle random selection or partial usage of tasks, confirm that the distribution logic remains correct, especially for concurrency.

Consider extracting the sampling logic into a helper function for clearer test coverage and potential reusability.


999-1011: Parsing CSV or user input for tasks.

Be sure input length checks are robust. If quoting logic changed, watch for edge cases where quotes or commas appear unescaped.

Include validation steps or fallback for malformed lines instead of just logging an error.

Also applies to: 1014-1014


1264-1271: Loading or deserializing tasks.

Changes likely revolve around ensuring data from the new format is read consistently. Make sure the new logic handles large tasks or malformed input gracefully.

Consider adding an extra check for string lengths and re-check stream errors to reduce potential parsing failures.

Also applies to: 1274-1287, 1290-1291, 1293-1294

examples/passkey/passkey.cpp (1)

147-148: Consider varying seq_id.
Currently, seq_id is always 0. If you anticipate handling multiple sequences in the future, consider incrementing or specifying distinct IDs.

src/llama-batch.h (3)

8-21: Doc comments are helpful.
These clarifications are valuable. Consider addressing the TODO to rename logits to output now to avoid confusion.


76-76: Pointer ownership.
Storing const llama_batch_ext * here may obscure ownership. If shared ownership is intended, consider a reference-counted or RAII interface to prevent accidental misuse.


104-104: Potential naming conflict.
Within llama_batch_allocr, using the field name batch can be confusing given other usage. Rename to something like ext_batch for clarity.

examples/batched/batched.cpp (2)

116-116: Assertion clarity.
GGML_ASSERT may halt execution abruptly. Consider providing a user-friendly error message for better diagnostics.
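
One way to keep the check while producing a friendlier diagnostic, sketched with a placeholder condition since the asserted expression is not quoted in this excerpt:

// Instead of: GGML_ASSERT(condition);
if (!condition) {
    fprintf(stderr, "%s: invalid batch state (n_tokens = %d)\n",
            __func__, llama_batch_ext_get_n_tokens(batch));
    return 1; // or propagate an error code instead of aborting
}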


160-160: Naming.
n_cur might be interpreted differently. A more self-explanatory name could improve readability.

examples/simple/simple.cpp (1)

183-183: Check usage sequence of llama_batch_ext_clear
Clearing the batch right before repopulating or freeing it might be unnecessary. Evaluate if the call can be removed for cleaner, more efficient code.

examples/retrieval/retrieval.cpp (4)

84-86: Improve naming consistency
batch_decode handles both encoder-only and decoder-only paths, yet the name suggests decoder logic. Consider renaming or commenting to reduce confusion for future maintainers.


129-129: Document the single-sequence initialization
common_batch(n_batch, 1) implies exactly one sequence. If you're planning multi-sequence usage, expand or comment on how to adapt the second parameter.


150-150: Log or debug large input handling
This capacity check helps avoid overfilling the batch. Logging when the threshold is reached may aid in diagnosing errors.


274-274: Encourage helper function reuse
Code that creates common_batch(n_batch, 1) reappears across the file. Extracting a small helper can improve maintainability and consistency.

src/llama-context.cpp (2)

2724-2729: Deprecated llama_encode
Maintaining the old method enhances backward compatibility. If deprecation is planned, ensure existing references migrate to llama_encode_ext before removal.


2731-2736: Retaining old llama_decode
Similar to llama_encode, this preserves a legacy function. Confirm that new code paths adopt llama_decode_ext, and plan for eventual phase-out.

examples/save-load-state/save-load-state.cpp (3)

135-137: Repeated code pattern.
Lines 135–137 duplicate the logic used in lines 82–84 (and later 217–219) for clearing and adding a token before decoding. Consider refactoring these steps into a small helper function to avoid duplication.
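
A small helper along these lines could absorb the repeated steps (a sketch only; the llama_batch_ext_add_text argument order follows the call sites in this PR):

// Hypothetical helper: clear the batch, append one token on sequence 0, and decode it.
static int decode_single_token(llama_context * ctx, llama_batch_ext * batch,
                               llama_token token, llama_pos pos) {
    llama_batch_ext_clear(batch);
    llama_seq_id seq_id = 0;
    llama_batch_ext_add_text(batch, token, pos, &seq_id, /*n_seq_ids*/ 1, /*output*/ true);
    return llama_decode_ext(ctx, batch);
}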


217-219: Repeated code pattern.
Again, identical logic to lines 82–84, 135–137. Factor out common logic for maintainability.


221-223: Proper error check but repeated code.
As with the earlier blocks, consider DRY (“Don’t Repeat Yourself”) to keep the code maintainable.

src/llama-batch.cpp (3)

392-406: Consider clarifying the “one float per token” approach.
llama_batch_ext_init_from_embd duplicates logic to treat each float index as a position. Make sure this design is correct for your embeddings scenario. If you intend multi-float embeddings per token, you need a different dimensioning strategy.


436-453: Potential performance bottleneck.
llama_batch_ext_set_output does a nested loop to find the matching seq_id. If usage grows large, consider a more efficient lookup structure (e.g., hashing).
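
If that becomes a hot path, one option is to build a transient index after the batch is filled and answer repeated queries in O(1). A rough sketch; the member names used here (n_tokens, pos, n_seq_id, seq_id) mirror the legacy llama_batch layout and are assumptions about the new struct:

#include <cstdint>
#include <unordered_map>

// Map (seq_id, pos) -> token index so repeated set-output lookups avoid the nested scan.
static uint64_t output_key(llama_seq_id seq_id, llama_pos pos) {
    return ((uint64_t) (uint32_t) seq_id << 32) | (uint32_t) pos;
}

static std::unordered_map<uint64_t, int32_t> build_output_index(const llama_batch_ext * batch) {
    std::unordered_map<uint64_t, int32_t> index;
    for (int32_t i = 0; i < batch->n_tokens; ++i) {
        for (int32_t s = 0; s < batch->n_seq_id[i]; ++s) {
            index.emplace(output_key(batch->seq_id[i][s], batch->pos[i]), i);
        }
    }
    return index;
}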


464-466: llama_batch_ext_clear() resets only n_tokens.
While intentional, remain aware that leftover data (token arrays, embeddings, etc.) is not zeroed out. In contexts requiring data scrubbing (e.g., security or privacy), you might also want to wipe memory.

examples/batched-bench/batched-bench.cpp (2)

122-122: Looping over “pos” with the same token.
Similarly, you pass token=0, pos=i, seq_ids=&j. If multiple sequences or advanced usage arises, confirm it aligns with your text vs. token logic.


150-150: Adding tokens for each sequence.
Here, token=0 plus pos = pp + i suggests a minimal placeholder approach. Confirm that rolling pos is correct if you expect to track larger conversation history.

include/llama.h (4)

921-927: Validate Parameters
Consider checking n_tokens and n_seq_max to guard against negative or zero, which may cause allocation issues.


928-939: Initialization From Text
Usage and documentation around output_last seem clear. However, clarifying edge cases (e.g., n_tokens <= 0) would improve robustness.


950-952: Return Type Consideration
llama_batch_ext_get_n_tokens returns int32_t. If a large batch is expected, consider using a wider type like size_t.


953-965: Error Code Enumeration
llama_batch_ext_add_text returns negative values as error codes. Defining named constants or an enum for these codes would aid clarity.
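
A hypothetical enum for those return values; the names and specific failure cases below are illustrative, not taken from the header:

// Illustrative only: gives the negative return values of llama_batch_ext_add_text symbolic names.
enum llama_batch_ext_add_result : int32_t {
    LLAMA_BATCH_EXT_ADD_OK            =  0,
    LLAMA_BATCH_EXT_ADD_ERR_FULL      = -1, // no room left for another token
    LLAMA_BATCH_EXT_ADD_ERR_N_SEQ_MAX = -2, // seq_id outside the allocated n_seq_max
};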

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84d5475 and 04f8641.

📒 Files selected for processing (38)
  • common/common.cpp (3 hunks)
  • common/common.h (1 hunks)
  • common/speculative.cpp (5 hunks)
  • examples/batched-bench/batched-bench.cpp (5 hunks)
  • examples/batched/batched.cpp (6 hunks)
  • examples/cvector-generator/cvector-generator.cpp (1 hunks)
  • examples/embedding/embedding.cpp (4 hunks)
  • examples/eval-callback/eval-callback.cpp (1 hunks)
  • examples/gritlm/gritlm.cpp (6 hunks)
  • examples/imatrix/imatrix.cpp (3 hunks)
  • examples/infill/infill.cpp (1 hunks)
  • examples/llama-bench/llama-bench.cpp (2 hunks)
  • examples/llava/gemma3-cli.cpp (6 hunks)
  • examples/llava/llava-cli.cpp (1 hunks)
  • examples/llava/llava.cpp (2 hunks)
  • examples/llava/minicpmv-cli.cpp (1 hunks)
  • examples/llava/qwen2vl-cli.cpp (2 hunks)
  • examples/lookahead/lookahead.cpp (6 hunks)
  • examples/lookup/lookup.cpp (5 hunks)
  • examples/main/main.cpp (2 hunks)
  • examples/parallel/parallel.cpp (7 hunks)
  • examples/passkey/passkey.cpp (6 hunks)
  • examples/perplexity/perplexity.cpp (16 hunks)
  • examples/retrieval/retrieval.cpp (5 hunks)
  • examples/run/run.cpp (3 hunks)
  • examples/save-load-state/save-load-state.cpp (5 hunks)
  • examples/server/server.cpp (18 hunks)
  • examples/simple-chat/simple-chat.cpp (2 hunks)
  • examples/simple/simple.cpp (3 hunks)
  • examples/speculative-simple/speculative-simple.cpp (5 hunks)
  • examples/speculative/speculative.cpp (6 hunks)
  • examples/tts/tts.cpp (6 hunks)
  • include/llama-cpp.h (1 hunks)
  • include/llama.h (5 hunks)
  • src/llama-batch.cpp (4 hunks)
  • src/llama-batch.h (4 hunks)
  • src/llama-context.cpp (4 hunks)
  • src/llama-context.h (1 hunks)
🔥 Files not summarized due to errors (4)
  • examples/llava/llava-cli.cpp: Error: Disallowed special token found: <|im_end|>
  • examples/llava/minicpmv-cli.cpp: Error: Disallowed special token found: <|im_start|>
  • examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
  • examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: android-build
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: server-windows
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ios-xcode-build
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-rpc
🔇 Additional comments (206)
examples/cvector-generator/cvector-generator.cpp (1)

346-347: Refactoring to use extended batch API improves resource management

The code has been updated to use the new llama_batch_ext_ptr with RAII semantics that automatically manages memory. This replaces the legacy batch API and follows the broader transition in the codebase.

examples/eval-callback/eval-callback.cpp (1)

137-138: Good transition to extended batch API

The code correctly implements the new extended batch API which offers improved resource management through RAII. The batch is properly initialized with tokens data and automatically cleaned up when it goes out of scope.

include/llama-cpp.h (2)

27-29: Well-designed deleter for batch extension resources

The implementation follows the established pattern in the codebase for resource management. The deleter ensures proper cleanup of llama_batch_ext resources.


35-35: Good addition of smart pointer typedef for batch extension

This typedef enables automatic resource management for batch extensions, consistent with the existing pattern for models, contexts, and other resources in the codebase.

examples/llava/llava.cpp (2)

5-5: Appropriate header inclusion

Adding the llama-cpp.h header is necessary to access the newly defined llama_batch_ext_ptr type.


451-452: Simplified implementation using extended batch API

The code replaces the previous llava_embd_batch structure with a more streamlined approach using llama_batch_ext_ptr. This change reduces code complexity while maintaining the same functionality, and ensures proper resource cleanup via RAII.

examples/main/main.cpp (2)

551-553: Appropriate batch API migration

The code properly transitions from using the deprecated llama_encode function to using the new extended batch API by initializing a llama_batch_ext_ptr with input buffer data and calling llama_decode_ext.


672-675: Clean implementation of the new batch API

The code correctly implements the new batch handling mechanism, creating a batch from text tokens and setting the output flag before passing to llama_decode_ext. This approach is more flexible and consistent with the refactoring pattern used throughout the codebase.

src/llama-context.h (2)

84-86: Good practice in API deprecation

Properly marking the old batch methods as deprecated while keeping them functional ensures backward compatibility during the transition period.


88-89: Clean API extension

The new extended batch methods follow consistent naming patterns and maintain the same interface style as the deprecated methods, making it intuitive for users to migrate to the new API.

examples/llama-bench/llama-bench.cpp (2)

1447-1448: Clean implementation of batch API in test_prompt

The function correctly adopts the extended batch API by using llama_batch_ext_ptr with llama_batch_ext_init_from_text and llama_decode_ext, maintaining the same functionality with the new interface.


1465-1466: Proper API migration in test_gen

Similar to the changes in test_prompt, this code properly transitions to the new batch API, making the implementation consistent across the codebase.

examples/lookup/lookup.cpp (6)

94-97: Effective initialization with the extended batch API

The code properly replaces the previous encoding approach with the new extended batch API. Using smart pointers (llama_batch_ext_ptr) for both batch objects ensures proper memory management.


113-113: Appropriate type change for batch_tgt

The variable type has been correctly updated from llama_batch to llama_batch_ext* to align with the new API, which is essential for all subsequent operations on this batch.


199-202: Proper batch management with the extended API

The code correctly uses llama_batch_ext_clear and llama_batch_ext_add_text to manage the batch, maintaining the same functionality with the new interface.


211-211: Consistent token addition with extended API

The token addition logic is properly updated to use llama_batch_ext_add_text, maintaining consistency with the new API throughout the function.


217-217: Updated decode function call

Properly changed to use llama_decode_ext to match the new batch type, ensuring API consistency.


249-249: Proper resource cleanup

Changed to use llama_batch_ext_free for proper deallocation of the new batch type, preventing memory leaks.

examples/infill/infill.cpp (1)

356-357: Confirm no out-of-bounds for embd.
This approach of creating the extended batch with a segment of the embd array looks correct. Ensure n_eval never exceeds the array bounds, and confirm that the offset i and chunk size n_eval are carefully validated to prevent out-of-range memory reads.

examples/simple-chat/simple-chat.cpp (5)

111-114: Check for null batch returns.
When initializing the extended batch with llama_batch_ext_init_from_text(), ensure you gracefully handle potential allocation or initialization failures if the returned pointer is null or invalid.


120-120: Looks good for context size checks.
The updated logic using llama_batch_ext_get_n_tokens(batch) is valid and should correctly limit the total context usage.


126-126: Abort path is consistent.
Aborting on decode failure is acceptable here. Just confirm no resources remain unfreed if llama_decode_ext returns an error.


150-153: Batch clearing and position increment.
Clearing the batch before adding the new token is a good practice. Updating n_past aligns with how the new token is appended to the sequence.


156-156: Validate usage after free.
After calling llama_batch_ext_free(batch), ensure there's no further use or accidental references to batch. Consider setting it to nullptr if used in a broader scope.

common/speculative.cpp (10)

17-17: Transition to extended batch pointer.
Replacing a raw llama_batch with llama_batch_ext_ptr is aligned with the new approach. Ensure you handle memory ownership correctly to avoid leaks or double frees.


26-26: Initialization with llama_batch_ext_init.
Using llama_n_batch(ctx_dft) is logical for the draft context capacity. Keep an eye on scenarios where ctx_dft might have zero or minimal capacity.


153-153: Single sequence ID usage.
If the code later needs multi-sequence handling, consider making seq_id adjustable. For now, zero is acceptable for a single sequence scenario.


209-209: Clearing the batch before prompt additions.
This ensures the new calls to llama_batch_ext_add_text() don’t mix with leftover tokens, which is a good initial practice.


213-213: Adding tokens to the batch.
Correctly sets logits=false for the prompt tokens. This prevents them from interfering with sampling checks.


220-223: Conditional decode.
Skipping decode if there are no tokens ensures efficiency. The error path logs and returns, so resource usage is well-handled.


229-230: Clear then add next token.
This pattern is consistent with maintaining a clean slate for each newly appended token.


236-236: Check error flows.
Same approach as above: on failure, the code logs and stops. Ensure future expansions handle partial success cases if needed.


269-269: Maintaining incremental token positions.
Using n_past + i + 1 appears correct. Validate this indexing if you later change token insertion logic.


272-272: Final decode call.
Consistent with the new extended-batch pattern. Ensure concurrency use-cases are tested if multiple threads might decode simultaneously.

examples/imatrix/imatrix.cpp (4)

514-514: Clear the batch carefully.
Calling llama_batch_ext_clear(batch) inside this loop is valid, but ensure that no tokens are needed from previous iterations. If tokens from earlier steps must persist, this call could unintentionally discard necessary data.


520-520: Check return value of llama_decode_ext.
Using if (llama_decode_ext(ctx, batch)) is good. If an error is returned, you handle it. Confirm that you log or propagate the error upstream if it needs attention.


522-522: Correctly freeing the batch on error.
Freeing batch on the decode error path mitigates memory leaks. This is a good use of cleanup logic.


535-535: Memory cleanup is sound.
You free the batch after use. Confirm there's no use-after-free if any subsequent function tries to reference it.

examples/gritlm/gritlm.cpp (6)

19-19: Batch clearing logic.
Clearing the batch before token addition is typically correct, but confirm that residual tokens or states aren’t required from a previous iteration.


54-54: Check llama_decode_ext.
Decoding is initiated without error checks here. If memory or other issues occur, the function will fail silently. Consider verifying the return value.


93-93: Proper resource deallocation.
Freeing the batch resource after finishing the embeddings is correct.


110-110: Initializing batch with dynamic size.
Using llama_n_batch(ctx) ensures automatic alignment with context settings. Looks good.


116-116: Appropriate batch clearing.
Once more, verify no leftover tokens are needed from earlier calls. Clearing each iteration is usually correct in a stateless loop.


150-150: Batch freed after usage.
Resources are released. Good practice.

examples/llava/gemma3-cli.cpp (8)

8-8: New header import.
#include "llama-cpp.h" looks consistent with the usage of extended APIs. Ensure no redundant includes remain.


67-67: Use of llama_batch_ext_ptr.
Smart pointer management with llama_batch_ext_ptr helps avoid manual frees, which is a good pattern for exception safety.


93-93: Clear batch at start of new text.
Verifies that no stale tokens remain. This is typically correct.


96-97: Parameter usage in llama_batch_ext_add_text.
Using seq_id=0 and final token flag = false might be correct if these tokens belong to a single sequence. Confirm logic if multi-sequence usage is anticipated.


99-99: Output last token setting.
llama_batch_ext_set_output_last is effective if you want to retrieve the final token’s output. Ensure this is logically consistent with your decoding approach.


102-102: Check llama_decode_ext status.
You do a proper if check. Good job! Ensure that you handle the error accordingly so the rest of the code remains stable.


152-153: Initializing batch from embeddings.
llama_batch_ext_init_from_embd is a specialized constructor. Confirm first param matches the correct dimension. Also check for out-of-bounds on image_embd_v.


182-182: Clearing batch in each loop iteration.
Ensure tokens from prior steps are no longer needed. This is typically correct for generating each subsequent token.

examples/speculative-simple/speculative-simple.cpp (5)

116-117: Ensure non-empty input before calling batch initialization.
If inp.size() is 0, subtracting 1 can lead to an out-of-range size. Consider adding a check to avoid potential issues.


136-136: Initialization of extended batch
Looks correct. Make sure that llama_n_batch(ctx_tgt) is sufficient for all tokens in your use case.


155-158: Clearing batch before re-adding tokens
This approach ensures each step re-initializes the batch. The logic is sound for evaluating just one new token at a time.


167-167: Validate draft.size() vs. context capacity.
When adding tokens with n_past + i, confirm that this won't exceed llama_n_ctx(ctx_tgt).


258-258: Freeing the extended batch
This is good practice to prevent memory leaks. Ensure no further usage of batch_tgt after this point.

examples/run/run.cpp (4)

908-908: Refactor to accept extended batch pointer
Switching the function signature to const llama_batch_ext_ptr & batch is a good step for extended batch handling.


949-949: Initialize batch from tokens
Creating a new batch with llama_batch_ext_init_from_text is straightforward. Ensure tokens.size() is non-zero to avoid an empty batch pass.


953-953: Decoding with extended API
You’re already handling the return value of llama_decode_ext. Good job ensuring errors are caught properly.


972-972: Resetting the batch for a single token
Re-initializing the batch for the next token is a clear approach. Ensure that repeated resets don’t cause confusion in logs or debug outputs.

common/common.h (1)

583-642: Introduction of common_batch struct
This structure nicely encapsulates batch handling, tokens, and output logic. A few considerations:

  • Ensure concurrency safety if accessed from multiple threads (see the sketch below).
  • Check for the performance overhead of inlined methods across multiple translation units.

Overall, this design is a clear improvement in code clarity.
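
A minimal sketch of the concurrency guard mentioned above, assuming the common_batch method names from this PR's summary (the add_text parameter list is assumed):

#include <mutex>
#include "common.h"

// Hedged sketch: serialize access to a common_batch shared between threads.
struct guarded_batch {
    common_batch batch;
    std::mutex   mtx;

    guarded_batch(int32_t n_batch, int32_t n_seq) : batch(n_batch, n_seq) {}

    void add(llama_token token, llama_pos pos, llama_seq_id seq_id, bool output) {
        std::lock_guard<std::mutex> lock(mtx);
        batch.add_text(token, pos, seq_id, output); // argument order assumed
    }

    void clear() {
        std::lock_guard<std::mutex> lock(mtx);
        batch.clear();
    }
};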
examples/parallel/parallel.cpp (10)

177-177: Extended batch initialization
Allocating a new extended batch with llama_batch_ext_init(n_ctx, 1) is straightforward. Confirm that n_ctx is large enough for the worst-case prompt.


220-220: Clearing the batch
Resetting the batch via llama_batch_ext_clear ensures a fresh state before adding tokens. Good approach for iterative decoding loops.


228-228: Storing the current batch size
client.i_batch = llama_batch_ext_get_n_tokens(batch) is lightweight but confirm the logic if multiple clients share the same batch.


231-231: Added text with sequence ID
Using n_tokens_system + client.n_prompt + client.n_decoded as the position is logical, but confirm it won’t exceed context.


236-236: Checking for an empty batch
Resetting the KV cache if no tokens are present is a robust approach to keep context consistent.


248-248: Conditional batching
if (cont_batching || llama_batch_ext_get_n_tokens(batch) == 0) effectively handles new requests. Good pattern for concurrency.


267-268: Assigning sequence ID per client
Ensures each client’s tokens remain isolated. Keep an eye on concurrency if multiple threads modify seq_id concurrently.


272-273: Output logits for last token
Toggling the output once the batch has tokens is an effective approach for partial result analysis. No issues found.


300-311: Chunked decoding with batch views
Splitting the batch into chunks and calling llama_decode_ext on each view is a valid approach for large prompts. Watching out for partial remainders or repeated decode failures is important.
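
A rough sketch of that chunked pattern; the view-taking call (llama_batch_ext_get_view) and its parameters are assumptions based on the "get views" functionality this PR adds, and ownership/cleanup of a view is not specified in this excerpt:

#include <algorithm>
#include <cstdint>

// Hedged sketch: decode a large batch in n_chunk-sized views.
static int decode_in_chunks(llama_context * ctx, llama_batch_ext * batch, int32_t n_chunk) {
    for (int32_t i = 0; i < llama_batch_ext_get_n_tokens(batch); i += n_chunk) {
        const int32_t n_view = std::min(n_chunk, llama_batch_ext_get_n_tokens(batch) - i);
        llama_batch_ext * view = llama_batch_ext_get_view(batch, i, n_view); // assumed name/signature
        if (llama_decode_ext(ctx, view) != 0) {
            return 1; // handle the failure, e.g. retry with a smaller n_view
        }
    }
    return 0;
}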


417-417: Freeing the main batch
A proper cleanup to prevent memory leaks at the end of the main function. Looks good.

examples/lookahead/lookahead.cpp (6)

95-96: Validate n_input to avoid out-of-bounds and confirm arguments.
Ensure that n_input is at least 2, otherwise having n_input - 1 here can cause unintended behavior. Also verify that passing 0 for the offset parameters is correct for your use case in llama_batch_ext_init_from_text.


209-209: Clearing the batch before repopulating looks fine.
This ensures a fresh slate for each iteration. No concerns here.


212-212: Good usage of multi-seq ID addition.
Passing the entire seq_id_all array aligns with the extended batch approach. Nicely done.


235-235: Verify index usage.
Storing llama_batch_ext_get_n_tokens(batch) for each n-gram is logical, but please verify that these indices align with subsequent sampler operations.


237-238: Confirm extended batch parameters.
Here, you pass true for the last argument of llama_batch_ext_add_text, indicating a final token in the chunk. Ensure this is the intended usage for each verification n-gram token.


257-258: Validate partial final token logic.
You pass j == N - 2 as the final argument, presumably to finalize tokens at the last iteration. Double-check that this matches your intended partial or final token logic.

examples/server/server.cpp (22)

1227-1227: Use of common_batch for batch_spec.
Replacing llama_batch with common_batch appears consistent with the new extended API.


1799-1799: Switched global batch to common_batch.
Ensures alignment with the new batch abstraction. No immediate concerns.


1925-1925: Initialize batch_spec with speculative capacity.
Consider verifying that params_base.speculative.n_max is non-negative to avoid unintended zero/negative sizing.


1950-1950: Moving slot into slots container.
Using move semantics here is efficient for transferring ownership.


1961-1961: Re-initializing the main batch.
No issues seen; this logic properly sets the batch capacity for tokens and sequences.


2052-2052: Resetting the slot.
Calling slot.reset() ensures a clean slate before usage; seems appropriate.


2096-2096: Checking for a draft context.
Conditionally handling the speculative model context is clear and concise.


2097-2097: Setting batch_spec for speculation.
As above, ensure slot.params.speculative.n_max is not negative to avoid invalid sizing.


2857-2857: Clearing the main batch.
A straightforward reset before assembling new tokens.


2879-2879: Tracking the batch offset.
Storing batch.get_n_tokens() into slot.i_batch is a clear way to align slot state with batch indexing.


2881-2881: Adding sampled token to the batch.
This usage of batch.add_text(...) aligns with the new API pattern.


3064-3064: Ensuring batch capacity isn’t exceeded.
Good safeguard against overfilling the batch.


3085-3085: Appending prompt token to the batch.
Logic is sound; no immediate concerns.


3104-3104: Assertion to ensure non-empty batch.
Helps catch error states early.


3115-3115: Setting logits for the last token.
This API call is straightforward, presumably controlling final logits usage.


3118-3118: Recording last token index in slot.i_batch.
Keeps slot state aligned with the batch's final token.


3150-3150: Calling llama_decode_ext.
Return code is checked below; usage appears consistent with error handling.


3244-3244: Skipping speculation checks.
Logic is clear: no speculation if the slot isn’t active or the draft model is unavailable.


3284-3284: Clearing the speculative batch.
Ensures a fresh batch for the speculation pass.


3285-3285: Adding initial token to the speculative batch.
No issues found; usage matches the new API.


3288-3288: Populating speculative batch with draft tokens.
Implementation looks correct; just be mindful of potential batch size limits.


3292-3292: Processing the speculation batch.
This final decode extends the approach to test and accept tokens from the draft.

examples/speculative/speculative.cpp (4)

168-173: Adoption of extended batch initialization and decoding.

These lines show a transition from older batch processing calls to llama_batch_ext_init_from_text plus llama_decode_ext. This is consistent with the extended batch mechanism introduced across the codebase. No correctness issues are apparent. Ensure that downstream usage (e.g., sampling logic) expects the new structure and checks decode return codes properly.


205-206: New batch initialization with llama_batch_ext_init.

Switching to llama_batch_ext_init for batch_dft and batch_tgt matches the new extended batch design. This looks correct, assuming the first argument selects the batch size and the second argument the sequence count. Confirm these match your concurrency or parallelism requirements.


447-449: Refining the draft batch between steps.

  1. llama_batch_ext_clear(batch_dft); helps you reuse the same batch object.
  2. Setting seq_id to 0 each time ensures a consistent, single-sequence approach here.
  3. llama_batch_ext_add_text(...) properly appends the token for decoding on the target side.
  4. Finally, calling llama_decode_ext on batch_dft runs the decoding.

Everything is aligned with the extended batch usage. Just verify that the correct sequence ID is used when branching or splitting.

Also applies to: 453-453


478-480: Re-initializing batch_tgt before the speculation loop.

Clearing batch_tgt and reassigning a fresh seq_id helps isolate each iteration’s partial tokens, avoiding contamination from previous steps. The usage of llama_batch_ext_add_text with drafts[0].tokens[0] also appears correct. This approach should simplify logic, provided the rest of the speculation logic properly tracks the new sequences.

examples/perplexity/perplexity.cpp (34)

366-366: Refactored batch usage for perplexity calculation.

  • Line 366 clears the KV cache with llama_kv_self_clear(ctx)—appropriate before processing a new chunk of tokens.
  • Line 372 calls batch.clear(), ensuring each batch starts clean.
  • Line 375 uses batch.add_text(...) to add tokens, matching the new extended usage.
  • Line 378 calls llama_decode_ext(...) for decoding.

All steps look consistent with the revised common_batch approach. Verify that error handling for llama_decode_ext is consistently applied to detect decode failures.

Also applies to: 372-372, 375-375, 378-378


505-505: Instantiation of common_batch with dynamic size.

Using std::min(n_batch, n_ctx*n_seq) ensures the batch won’t exceed the model’s context or the configured limit. This is a safe approach to avoid potential buffer overruns. Looks good.


555-557: Batch clearing and token addition within the loop.

  • Line 555 calls batch.clear() before filling each sub-batch, preventing leftover tokens.
  • Lines 557, 568-571 add tokens with batch.add_text(...) and handle flags like output = true.

This pattern helps ensure each sub-batch is well-defined. Confirm that the correct sequence IDs are used if some lines rely on multi-sequence contexts.

Also applies to: 568-571


652-653: Replacing or refactoring lines in test or logging logic?

No explicit logic change is visible for lines 652-653 beyond the diff marking them as modified (~). If they involve minor refactoring (e.g., logging or clearing), confirm consistency with the surrounding code.

Please ensure these modified lines align with updated usage of the batch or reflect a no-op if it’s purely a comment/log change.


778-782: Transition from llama_decode to llama_decode_ext.

Replacing the older decode method with the extended version ensures consistent usage of the new batch approach. Confirm that all downstream calls to sampling or perplexity computations expect the new structures or call signatures.


865-868: Ensuring chunk-based loop validity.

Any changes to these lines likely adjust how chunk iteration or indexes are incremented. Double-check off-by-one errors around loop boundaries, especially if prompt sizes can vary.

If the chunk size logic changed, run a test with the largest possible prompt to confirm no out-of-range indexing.


962-962: Comparing predicted endings.

Here, verifying gold labels vs. predicted tokens likely changed to reflect the new batch structure. Ensure all references to log_probs or token arrays remain consistent after the refactor.

Also applies to: 964-965


977-978: Data structure or function call alignment for multiple-choice tasks.

If these lines adjust parameter passing or function calls to the updated batch system, confirm that the new type or name exactly matches.


982-983: Tokenizing multiple-choice tasks and storing them.

These lines likely replace direct calls with common_batch or updated logic. Watch out for memory usage if multiple tasks are large. The approach seems correct if it matches the new batch design.

Also applies to: 985-987, 989-990, 991-993


1030-1041: Insertion logic for _ in the sentence.

Verifying underscore placeholders ensures correct string manipulation. The approach looks fine, but confirm that replaced code matches new specs for multi-sequence handling.


1042-1044: Ensuring model state is consistent for multiple-choice tasks.

Refactoring might require re-initializing the KV cache or sampler. No immediate issues are visible, but watch for concurrency if tasks are processed in parallel.

Also applies to: 1058-1060


1062-1066: Invocation of Winogrande scoring.

Refactored calls that unify with the extended batch approach might require consistent n_batch handling. Confirm that the new logic does not skip final tokens.


1069-1077: Refreshing random selection or partial usage for Winogrande tasks.

If we’re limiting tasks or randomizing them, ensure the changes do not bias the distribution or produce out-of-range selection. The approach of removing tasks after selection looks correct.

To confirm correctness, run repeated tests with different seeds and compare distribution outcomes for coverage.

Also applies to: 1080-1093


1095-1107: Tokenizing Winogrande tasks with common code.

These lines presumably unify the code with the new common_tokenize approach. Ensure no off-by-one bugs if partial tokens are included.

Also applies to: 1114-1116


1130-1134: Decoding or evaluating tasks in a new batch structure.

Check that subsequent references to the log probabilities or embeddings match the updated indexing if extended-batch logic changed how tokens are stored.

Also applies to: 1138-1138


1146-1146: Batch handling for Winogrande.

  • Clearing the batch and reinitializing it.
  • Possibly new arguments for sequence IDs.
    Looks consistent if it tracks each test scenario properly.

Also applies to: 1148-1148, 1159-1162


1165-1166: Refined sequence iteration or partial logic.

Any newly introduced checks or pointer arithmetic must be double-checked for boundary conditions. The changes appear structurally correct if tested with different batch sizes.

Encourage a boundary test to ensure no stepping out of range on seq_tokens.

Also applies to: 1200-1208, 1211-1212


1219-1223: Collecting and normalizing token log probabilities.

These lines appear to gather logprobs from eval_results. The logic looks consistent, but confirm that newly introduced or replaced lines do not skip an index or double-count tokens.

Also applies to: 1225-1238


1240-1244: Granular comparison of predicted endings.

This final acceptance check for the correct answer is consistent with the rest of the new logic. No immediate issues seen.

Also applies to: 1247-1247


1254-1262: Final results and stats.

If these lines finalize the acceptance ratio or set up the scoring, ensure they handle the updated batch size or newly introduced partial chunking. Otherwise, it looks correct.


1307-1311: multiple_choice_prepare_one_task adjustments.

These lines presumably adapt the method to the new batch mechanism. Look for any mismatch with the new sequence offset or token indexing.

Also applies to: 1314-1322, 1324-1338


1342-1346: Setting up the multiple-choice scoring environment.

Ensuring context size and concurrency are properly enforced is key. The changes appear consistent but keep an eye on potential off-by-one if n_ctx changed.

Run scale tests with different n_ctx values to confirm no boundary exceptions occur.

Also applies to: 1482-1483


1535-1537: Performing the final decode for multiple-choice tasks (batch usage).

Deleting or rewriting older calls with llama_decode to llama_decode_ext can break logic if some metadata is missing. Everything looks aligned so far but confirm that all tasks are accounted for in logs.

Also applies to: 1539-1539, 1542-1545


1552-1558: Clearing KV and decoding all tasks in [i0, i1).

This persistent pattern is consistent with earlier changes. Just ensure you handle partial tasks if the final chunk is smaller than a full batch.

Also applies to: 1560-1560


1581-1584: Computing logprobs in parallel.

Where concurrency is introduced or updated, confirm that shared data structures (e.g., eval_pairs, eval_results) are protected from race conditions. The usage with std::mutex or an atomic counter looks correct.

Also applies to: 1586-1596, 1600-1608


1612-1612: Aggregating or normalizing partial results.

Double-check these lines for potential floating-point rounding issues, especially if the partial sums approach changed. All appears logically consistent if thoroughly tested.

Also applies to: 1614-1614, 1616-1616, 1618-1618, 1620-1620, 1622-1626


1628-1631: Calculating the final accuracy or partial results.

No major issues, but watch for integer division if any changed lines handle sums vs. floats.

Also applies to: 1634-1634, 1637-1643, 1647-1647


1668-1668: KL divergence readiness checks and parameter usage.

Any new lines around version compatibility or ensuring the correct vocab size must match the current model. The changes look straightforward, but test any new model with varied vocab sizes.

Also applies to: 1670-1670, 1726-1727


1732-1735: Reading log-prob file and chunking logic.

Confirm that your indexing into the input file aligns with these new line changes. Possibly, we skip unneeded lines or track chunk sizes differently.

Test with logs from older versions to ensure backward compatibility or fail gracefully if incompatible.

Also applies to: 1738-1742


1750-1751: Clearing batch and re-inserting tokens (KL analysis).

If we rely on common_batch again for re-processing tokens, ensure the correct offsets are used. Looks consistent with earlier refactors.

Also applies to: 1754-1758, 1760-1763


1774-1776: Aggregating chunk-based results.

Double-check that the final metrics (time per pass, etc.) are unaffected by changes to chunk indexing or partial decoding.

Also applies to: 1793-1801


1802-1826: Calculating partial perplexities and KL stats.

Large floating-point operations are always a spot to watch for numerics. The refactor is basically hooking to new data structures. No big concerns if tested.


1832-1909: Collecting distribution statistics.

Here you unify the stats with percentile calculations. No immediate red flags unless the refactor removed a boundary check. Code still looks consistent.


1910-1926: Final logs and prints.

These lines finalize the results. The changes appear to focus on printing or storing updated stats. Everything seems coherent with the new approach.

examples/embedding/embedding.cpp (5)

29-30: Adding tokens with common_batch.

Using batch.add_text() in a loop (lines 29–30, 32) replaces older batch logic. This is consistent with the new approach. No issues found, but confirm proper sequence IDs if multiple sequences are processed at once.

Also applies to: 32-32


36-36: Refactored embedding decode calls.

  • Line 36: Introduces batch_decode(...) referencing the new batch approach.
  • Lines 44–45: Logging adjusted to show llama_batch_ext_get_n_tokens(...) instead of direct batch.n_tokens.
  • Line 48: Checking return code from llama_encode_ext/llama_decode_ext.
  • Line 52: Another check or log statement updated similarly.

These changes consistently adopt the extended API. Recognize that the logic now branches between encoder-only vs. decoder-only. This is technically robust so far.

Also applies to: 44-45, 48-48, 52-52


58-59: Retrieving embeddings from new functions.

  • Lines 58–59 switch from direct usage of logits to checking batch.tokens[i].logits, aligning with the extended batch struct.
  • Lines 73–74 retrieve sequence embeddings with llama_get_embeddings_seq(...), ensuring the correct seq_id.

Looks correct if you consistently manage seq_id. This is critical for correct embedding retrieval.

Also applies to: 73-74


172-172: Instantiating common_batch for embeddings.

Replacing older code with common_batch(n_batch, 1) clarifies that you have one sequence in each batch for embeddings. This is consistent with the single-prompt embedding scenario or repeated single-sequence usage.

Also applies to: 174-174


200-200: Triggering partial batch decode for multiple lines.

  • Line 200 checks if adding new tokens would overflow the batch.
  • Lines 202–203 decode the batch, then line 205 increments the pointer for embeddings.
  • Line 207 calls batch.clear() to prepare for the next chunk.

Ensures repeated usage is stable. Good approach for embedding multiple inputs in a single run.

Also applies to: 202-203, 205-205, 207-207

examples/passkey/passkey.cpp (9)

5-5: Confirm include path.
Including "llama-cpp.h" here looks correct, but ensure that this header is available at compile time.


126-126: Check for potential null return.
llama_batch_ext_init may return nullptr on failure. Consider adding a null check for robust error handling.


144-144: Proper clearing usage.
Calling llama_batch_ext_clear(batch.get()) here is consistent with the extended batch mechanism.


152-152: Immediate output setting.
llama_batch_ext_set_output_last is used correctly to designate output for the final token in this batch.


155-155: Return code check is good.
Decoding errors are handled properly by exiting on non-zero return.


179-179: Batch clearing repeated.
Clearing the batch again is consistent with the iterative chunk processing flow.


182-183: Duplicate pattern with seq_id.
This repeats the pattern of always using seq_id = 0. Same considerations as lines 147-148 apply.


187-187: Final output flagging.
Designating this as the last output token in the chunk is consistent with the extended approach.


190-190: Decoding error handling repeated.
Same robust mechanism as before.

src/llama-batch.h (3)

22-32: New extended batch structure.
Fields and layout look sensible for extended usage, and they unify token/embedding details.


99-99: Updated signature.
Adopting llama_batch_ext in from_batch aligns with the broader refactor.


114-115: Constructor and destructor introduced.
Check that any allocated resources (e.g. batch) are fully freed in the destructor to prevent leaks.

examples/batched/batched.cpp (14)

105-105: Validate return pointer.
Add a null check for llama_batch_ext_init to safeguard against allocation failures.


114-114: Positioning logic check.
Using i as both token index and position might be correct here, but confirm it matches your intended sequence logic.


119-119: Encoding error handling.
Non-zero return leads to an error. This matches the approach used for decoding.


129-129: Batch clearing usage.
Resetting the batch before adding decoder tokens is appropriate.


130-130: Decoder start token position.
Position 0 is reused for all sequences. Confirm if this suits your decoding logic.


134-134: Last output token set.
llama_batch_ext_set_output_last is used correctly, consistent with standard patterns.


136-136: Decoding error handling.
Same pattern as above—this is good for consistency.


158-158: Last-token indexing.
Setting i_batch[i] = llama_batch_ext_get_n_tokens(batch) - 1 could lead to an off-by-one if tokens are added after.


167-167: Batch reset.
Clearing the batch again works for the iterative generation loop.


196-196: Updating i_batch.
i_batch[i] = llama_batch_ext_get_n_tokens(batch) updates the reference to newly added tokens. Confirm that the index remains in sync across parallel sequences.


199-199: Setting position to n_cur.
Passing n_cur as the token position seems logical.


205-205: Zero tokens check.
Breaking out when no tokens remain helps avoid extraneous decode calls.


212-212: Decoding loop.
Consistent error handling for incremental decoding.


237-237: Memory cleanup.
llama_batch_ext_free(batch) ensures no leaks for the extended batch.

examples/simple/simple.cpp (3)

146-146: Consider checking return value from llama_batch_ext_init_from_text
If prompt_tokens.size() is zero or memory allocation fails, batch might be nullptr. Checking for a null return can help avoid potential segmentation faults.


185-185: Validate sequence ID usage
The code always uses sequence ID 0 when adding text. If multi-sequence contexts are ever needed, ensure the correct sequence ID is passed, or add logic to handle multiple sequences.


203-203: Confirm memory cleanup
Freeing the batch pointer is correct for preventing resource leaks. Ensure no subsequent logic reads from this batch structure after it's freed.

examples/retrieval/retrieval.cpp (5)

77-80: Add error handling for empty tokens
If tokens is empty, this function silently skips adding anything to the batch. Consider logging or otherwise surfacing the scenario to avoid silent failures.
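
A hedged sketch of such a guard at the top of the helper; the helper name and the LOG_WRN macro are assumptions based on the surrounding example code, not the exact source:

static void batch_add_seq(common_batch & batch, const std::vector<int32_t> & tokens, llama_seq_id seq_id) {
    if (tokens.empty()) {
        // surface the empty-input case instead of silently adding nothing
        LOG_WRN("%s: no tokens to add for seq_id %d\n", __func__, (int) seq_id);
        return;
    }
    // ... existing loop that adds each token to the batch ...
}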


92-102: Check model type assumptions
This logic branches for encoder-only vs. decoder-only models. If future support expands to multi-encoder-decoder models, additional checks or error handling might be required.


105-106: Clarify skip condition
Skipping tokens where batch.tokens[i].logits is null might lead to missing or incomplete processing. Verify if additional logging or alternative handling is warranted.


253-254: Verify timing of batch clear
Calling batch_decode immediately followed by batch.clear() may cause issues if decoding is asynchronous or if data is accessed post-decode. Confirm decode completion before clearing.


288-288: Clearing query batch is appropriate
Erasing the query batch after its use ensures a clean slate. This aligns well with typical usage patterns here.

src/llama-context.cpp (3)

7-7: Reference new batch implementation
Including "llama-batch.h" enables the extended batch functionality. This is a good step toward the new API.


2738-2742: New llama_encode_ext
The extended API is a clearer entry point for advanced usage. This approach consistently leverages llama_batch_ext throughout.


2744-2747: New llama_decode_ext
Properly references the extended batch, aligning with the rest of the new APIs. This promotes a unified pattern across the codebase.

examples/save-load-state/save-load-state.cpp (5)

51-51: Ensure sufficient batch capacity.
llama_batch_ext_init_from_text() sets batch->max_tokens = n_tokens, so it should typically handle the text size. However, if there's a chance that tokens.size() is unexpectedly large, it might be prudent to confirm no overflow or out-of-range scenario occurs inside the internal loop.


82-84: Logic appears correct.
Clearing the batch each iteration, setting seq_id = 0, and adding the next token is a proper pattern for incremental decoding.


86-88: Good error handling.
You capture the error code from llama_decode_ext and free the batch on failure. This prevents resource leaks and undefined behavior.


139-141: Error handling remains consistent.
Same pattern as before: checking the return code and freeing the batch upon error.


235-235: Safe cleanup.
Freeing the batch after the final usage is correct. Just ensure that no other code path calls llama_batch_ext_free(batch) again to avoid double free.

src/llama-batch.cpp (6)

192-274: Potential nullness assumption for batch.n_seq_id.
Inside from_batch, batch.n_seq_id[bi] is dereferenced unconditionally in multiple places. If batch.n_seq_id were actually null, this would segfault. The llama_batch_allocr constructor does allocate n_seq_id when it is missing, but verifying that all call sites respect that invariant would help avoid unexpected crashes.


326-329: Convenience constructor logic is straightforward.
This helper for creating a minimal llama_batch with tokens and no embeddings looks fine. Just ensure that the fields remain valid for the expected usage.


388-391: Initialize with the chosen number of tokens.
llama_batch_ext_init(n_kv_max, 0) or (n_kv_max, 1) is succinct, but ensure it matches the required sequence usage.


407-409: Simple accessor method.
llama_batch_ext_get_n_tokens() has no issues; it’s a straightforward getter.


455-463: Final-token marking is concise and clear.
llama_batch_ext_set_output_last sets the last token for output. This is a direct approach that matches the needs for typical “terminal token” usage.


489-505: Proper selective freeing.
You skip freeing arrays if the batch is a view. This approach is correct for shared data lifetimes. Just ensure no references remain active once the parent is freed.
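
A minimal sketch of the intended lifetime ordering, with illustrative variable names (free every view before its parent):

llama_batch_ext * parent = llama_batch_ext_init(n_kv_max, 1);
// ... add tokens to parent ...
llama_batch_ext * view = llama_batch_ext_get_view(parent, /*offset=*/0, /*n_tokens=*/32);
// ... decode using the view ...
llama_batch_ext_free(view);   // a view releases only its own wrapper, not the shared arrays
llama_batch_ext_free(parent); // the parent frees the underlying arrays last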

examples/batched-bench/batched-bench.cpp (7)

62-62: Check n_seq_max in llama_batch_ext_init.
Here, you use llama_batch_ext_init(n_kv_max, 1). If you intend multiple sequences, ensure that using “1” as n_seq_max doesn’t block multi-sequence usage.


66-72: Nicely modular decoding step.
Splitting the batch into sub-batches using llama_batch_ext_get_view is a clean approach. Just confirm you don’t exceed the parent’s token count in edge cases.


87-88: Usage of token=0.
You repeatedly add the token value 0. Verify that this is a valid placeholder or system token in your domain logic (likely a start-of-sequence or dummy token).


118-118: Clearing the batch is correct.
This resets n_tokens, preparing for the next iteration. No immediate issues.


125-125: Marking last output token.
llama_batch_ext_set_output_last(batch) is consistent, given you want the final token flagged for output.


147-147: Recurring batch clear.
Again, systematically resetting n_tokens is fine. Watch for performance overhead if done too frequently, but likely acceptable in a benchmark.


188-188: Proper final cleanup.
Freeing the extended batch at the end ensures no memory leaks.

include/llama.h (8)

236-238: Deprecation Notice Acknowledged
This clarifies that the older llama_batch struct will be phased out in favor of llama_batch_ext.


263-266: Forward Declaration
No functional issues. Defining struct llama_batch_ext; aligns with the plan to deprecate llama_batch.


966-980: Confirm Concurrency
llama_batch_ext_set_output and llama_batch_ext_set_output_last modify batch data. Confirm if usage is serialized or if these functions are intended to be thread-safe.
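
If these calls can race with other batch mutations, the simplest remedy is to serialize them at the call site. A generic sketch; the mutex lives on the caller's side and is an assumption, not part of the API:

#include <mutex>

static std::mutex batch_mutex;

void mark_last_output(llama_batch_ext * batch) {
    std::lock_guard<std::mutex> lock(batch_mutex); // serialize batch mutations
    llama_batch_ext_set_output_last(batch);
}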


989-991: Clear Function
llama_batch_ext_clear straightforwardly removes all tokens. Implementation appears sound.


992-994: Free Function
Appropriately handles both the original batch and sub-views. Looks correct.


1004-1007: Extended Encode
llama_encode_ext is a logical companion to the existing API, enabling the new batch structure.


1014-1015: Minor Modification
No functional issues observed in this signature adjustment.


1016-1018: Extended Decode
llama_decode_ext follows the same pattern as llama_encode_ext, enabling the extended batch approach.

Comment on lines 468 to 487
struct llama_batch_ext * llama_batch_ext_get_view(
struct llama_batch_ext * batch,
int32_t offset,
int32_t n_tokens) {
if (batch->embd) {
return nullptr; // not yet supported
}
llama_batch_ext * batch_view = new llama_batch_ext{
/*n_tokens =*/ n_tokens,
/*max_tokens =*/ n_tokens,
/*is_view =*/ true,
/*tokens =*/ batch->token + offset,
/*embd =*/ nullptr,
/*pos =*/ batch->pos + offset,
/*n_seq_id =*/ batch->n_seq_id + offset,
/*seq_id =*/ batch->seq_id + offset,
/*logits =*/ batch->logits + offset,
};
return batch_view;
}

⚠️ Potential issue

Offset bounds are unchecked.
llama_batch_ext_get_view uses offset and n_tokens to compute ranges. If the offset plus size exceeds the parent’s capacity, it’ll create an invalid pointer range. Add boundary checks to avoid silent out-of-bounds usage.

include/llama.h Outdated
Comment on lines 940 to 949
// Same with llama_batch_init, but initializes the batch with the provided raw embeddings
// First token will be at position pos0
// The sequence ID will be fixed to seq_id
// The batch has to be freed with llama_batch_ext_free()
LLAMA_API struct llama_batch_ext * llama_batch_ext_init_from_embd(
float * embd,
size_t n_embd,
int32_t pos0,
int32_t seq_id);


🛠️ Refactor suggestion

Initialize From Embeddings
Ensure embd is non-null and n_embd > 0; otherwise, the function could run into undefined behavior.

+// Example check inside llama_batch_ext_init_from_embd:
+if (!embd || n_embd <= 0) {
+    // handle error or return null
+}

Committable suggestion skipped: line range outside the PR's diff.

@ngxson
Owner Author

ngxson commented Mar 13, 2025

@coderabbitai help


coderabbitai bot commented Mar 13, 2025

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

@ngxson
Owner Author

ngxson commented Mar 13, 2025

@coderabbitai can I add some more context to the review?


coderabbitai bot commented Mar 13, 2025

Tip

For best results, initiate chat on the files or code changes.

@ngxson Yes, you can absolutely add more context to the review! Additional information about your PR would be very helpful for me to better understand the purpose and implementation of the private batch API changes.

Feel free to share details about:

  • The motivation behind these changes
  • Key design decisions you made
  • Any specific areas you'd like me to focus on
  • Any challenges or considerations I should be aware of

The more context you provide, the more thorough and accurate my review can be.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

♻️ Duplicate comments (4)
examples/run/run.cpp (1)

954-957: ⚠️ Potential issue

Consider checking the return value of check_context_size.
Currently, the return value is ignored at line 998, which could lead to potential errors if the context is exceeded.

Apply this diff to handle the error properly:

-    check_context_size(llama_data.context, batch);
+    if (check_context_size(llama_data.context, batch) != 0) {
+        // context size exceeded, handle error here
+        return 1;
+    }
examples/save-load-state/save-load-state.cpp (1)

54-56: ⚠️ Potential issue

Decode result is ignored.
Check the return code from llama_decode_ext to catch potential failures.

Possible patch:

- llama_decode_ext(ctx, batch.get());
+ int ret_dec = llama_decode_ext(ctx, batch.get());
+ if (ret_dec != 0) {
+     fprintf(stderr, "llama_decode_ext failed with code %d\n", ret_dec);
+     return 1;
+ }
  n_past += llama_batch_ext_get_n_tokens(batch.get());
src/llama-batch.cpp (2)

404-427: ⚠️ Potential issue

Potential out-of-bounds if n_seq_ids exceeds allocated n_seq_max.
Although the code sets batch->n_seq_id[output_id] = n_seq_ids, there is no check that n_seq_ids fits into the allocated number of sequence IDs per token. This can cause out-of-bounds writes in batch->seq_id[batch->n_tokens][j].

 int32_t llama_batch_ext_add_text(..., size_t n_seq_ids, ...) {
+    // Hypothetical fix: assume n_seq_max is known (e.g., stored in batch->some_field)
+    if (n_seq_ids > batch->some_n_seq_max) {
+        return -3; // or handle error
+    }

     ...
 }

461-480: ⚠️ Potential issue

Missing boundary checks for offset and n_tokens.
If offset + n_tokens is beyond the parent batch’s capacity, pointer arithmetic can go out of bounds. Consider clarifying or enforcing valid ranges.

if (offset + n_tokens > batch->max_tokens) {
    // handle error
    return nullptr;
}
🧹 Nitpick comments (3)
examples/simple/simple.cpp (1)

184-187: Consider checking return value of batch operations.

While the transition to the new batch API is correct, consider checking the return value of llama_batch_ext_add_text, which can return error codes (such as -1 if batch size is exceeded or -2 if embeddings are already set).

-llama_batch_ext_add_text(batch.get(), new_token_id, n_pos, &seq_id, 1, true);
+int add_result = llama_batch_ext_add_text(batch.get(), new_token_id, n_pos, &seq_id, 1, true);
+if (add_result < 0) {
+    fprintf(stderr, "%s: failed to add token to batch, error code %d\n", __func__, add_result);
+    return 1;
+}
examples/lookahead/lookahead.cpp (1)

120-120: Consider using smart pointer for consistent resource management.

Using raw llama_batch_ext pointer here, unlike other batch initializations in the codebase that use llama_batch_ext_ptr. While this works, a smart pointer would provide more consistent RAII-style resource management.

-llama_batch_ext * batch = llama_batch_ext_init(params.n_ctx, W + G + 1);
+auto batch = llama_batch_ext_ptr(llama_batch_ext_init(params.n_ctx, W + G + 1));
src/llama-batch.cpp (1)

192-274: Consider extracting sorting logic into a helper for clarity.
This new function is quite large and has multiple branching conditions. Extracting the sorting and sequence grouping logic into helper methods could improve readability and maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd51d63 and 30f1db9.

📒 Files selected for processing (11)
  • common/common.cpp (2 hunks)
  • examples/lookahead/lookahead.cpp (6 hunks)
  • examples/lookup/lookup.cpp (5 hunks)
  • examples/run/run.cpp (4 hunks)
  • examples/save-load-state/save-load-state.cpp (5 hunks)
  • examples/simple-chat/simple-chat.cpp (3 hunks)
  • examples/simple/simple.cpp (3 hunks)
  • examples/speculative/speculative.cpp (9 hunks)
  • include/llama-cpp.h (1 hunks)
  • include/llama.h (5 hunks)
  • src/llama-batch.cpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/simple-chat/simple-chat.cpp
  • examples/lookup/lookup.cpp
🧰 Additional context used
🧠 Learnings (2)
examples/lookahead/lookahead.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
src/llama-batch.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: src/llama-batch.cpp:356-386
Timestamp: 2025-03-13T23:30:11.172Z
Learning: Null checks after malloc calls are not required in the llama.cpp codebase, as the project follows a design pattern where memory allocation failures are handled differently or allowed to propagate.
🧬 Code Definitions (7)
examples/run/run.cpp (3)
include/llama-cpp.h (4) (4)
  • batch (28-28)
  • batch (28-28)
  • tokens (44-57)
  • tokens (44-48)
src/llama-context.cpp (2) (2)
  • llama_decode_ext (2777-2781)
  • llama_decode_ext (2777-2779)
src/llama-batch.cpp (2) (2)
  • llama_batch_ext_get_n_tokens (400-402)
  • llama_batch_ext_get_n_tokens (400-400)
include/llama-cpp.h (2)
include/llama.h (4) (4)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_add_text (955-961)
  • llama_batch_ext_set_output_last (977-977)
  • llama_batch_ext_init_from_embd (934-939)
src/llama-batch.cpp (8) (8)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_add_text (404-427)
  • llama_batch_ext_add_text (404-410)
  • llama_batch_ext_set_output_last (448-455)
  • llama_batch_ext_set_output_last (448-448)
  • llama_batch_ext_init_from_embd (376-390)
  • llama_batch_ext_init_from_embd (376-381)
examples/lookahead/lookahead.cpp (1)
include/llama-cpp.h (2) (2)
  • batch (28-28)
  • batch (28-28)
examples/save-load-state/save-load-state.cpp (4)
include/llama-cpp.h (4) (4)
  • batch (28-28)
  • batch (28-28)
  • tokens (44-57)
  • tokens (44-48)
include/llama.h (5) (5)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_add_text (955-961)
  • llama_batch_ext_free (991-991)
src/llama-context.cpp (2) (2)
  • llama_decode_ext (2777-2781)
  • llama_decode_ext (2777-2779)
src/llama-batch.cpp (8) (8)
  • llama_batch_ext_get_n_tokens (400-402)
  • llama_batch_ext_get_n_tokens (400-400)
  • llama_batch_ext_clear (457-459)
  • llama_batch_ext_clear (457-457)
  • llama_batch_ext_add_text (404-427)
  • llama_batch_ext_add_text (404-410)
  • llama_batch_ext_free (482-498)
  • llama_batch_ext_free (482-482)
examples/speculative/speculative.cpp (2)
include/llama.h (6) (6)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_add_text (955-961)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_free (991-991)
src/llama-context.cpp (2) (2)
  • llama_decode_ext (2777-2781)
  • llama_decode_ext (2777-2779)
include/llama.h (3)
include/llama-cpp.h (4) (4)
  • tokens (44-57)
  • tokens (44-48)
  • batch (28-28)
  • batch (28-28)
src/llama-batch.cpp (2) (2)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
examples/perplexity/perplexity.cpp (1) (1)
  • batch (1470-1470)
common/common.cpp (2)
include/llama.h (2) (2)
  • llama_encode_ext (1001-1003)
  • llama_decode_ext (1013-1015)
src/llama-context.cpp (4) (4)
  • llama_encode_ext (2771-2775)
  • llama_encode_ext (2771-2773)
  • llama_decode_ext (2777-2781)
  • llama_decode_ext (2777-2779)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: android-build
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: windows-2019-cmake-cuda (11.7, cuda)
  • GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-windows
🔇 Additional comments (54)
include/llama-cpp.h (2)

27-29: Well-designed deleter for batch_ext objects.

The llama_batch_ext_deleter struct follows the standard C++ deleter pattern and properly calls the C API function llama_batch_ext_free to release resources when a llama_batch_ext object goes out of scope.


36-67: Great RAII-compliant smart pointer wrapper.

This llama_batch_ext_ptr implementation is a well-designed wrapper that:

  1. Inherits from std::unique_ptr with a custom deleter for automatic resource management
  2. Provides convenient factory methods for common initialization patterns
  3. Encapsulates batch creation and memory management logic

The static factory methods provide clear, readable interfaces for creating batches from different inputs while handling memory management internally.
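
For illustration, a typical call site under this wrapper could look like the sketch below; prompt_tokens and ctx stand in for whatever the caller already has:

// build a batch for the whole prompt, marking the last token for output
auto batch = llama_batch_ext_ptr::init_from_text(prompt_tokens.data(),
                                                 (int32_t) prompt_tokens.size(),
                                                 /*pos0=*/0, /*seq_id=*/0,
                                                 /*output_last=*/true);
if (llama_decode_ext(ctx, batch.get()) != 0) {
    fprintf(stderr, "decode failed\n");
}
// no explicit llama_batch_ext_free(): the deleter runs when batch leaves scope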

common/common.cpp (2)

1019-1021: Good use of the new batch API.

The code correctly transitions from manual batch initialization to using the new llama_batch_ext_ptr smart pointer and its initialization methods. This improves memory safety while simplifying the code.


1029-1031: Proper batch handling for decoder.

Consistent implementation of the batch API transition for decoder initialization.

examples/simple/simple.cpp (4)

2-2: Added required header for the new batch API.

The inclusion of "llama-cpp.h" is necessary for accessing the llama_batch_ext_ptr class and related functionality.


147-147: Improved batch initialization with smart pointer.

The new batch initialization using llama_batch_ext_ptr::init_from_text provides cleaner memory management and more explicit parameter handling than the previous approach.


155-157: Updated loop condition and decoding method.

The code correctly transitions to the new API with llama_batch_ext_get_n_tokens and llama_decode_ext. This change ensures proper interaction with the new batch structure.


162-162: Proper token position tracking.

Correctly updated to use llama_batch_ext_get_n_tokens for incrementing position counter.

examples/lookahead/lookahead.cpp (7)

95-98: Good RAII batch management with smart pointers.

Using llama_batch_ext_ptr for both batches ensures proper resource management, and the initialization parameters are correctly specified.

Based on your previous feedback, I understand you prefer to keep this part simple for now and plan to add error handling later.


209-213: Updated batch API usage correctly.

The transition from older batch functions to llama_batch_ext_clear and llama_batch_ext_add_text is implemented correctly.


235-239: Proper sequence ID handling.

The code correctly specifies sequence IDs when adding tokens to the batch, which is important for the context management in the lookahead implementation.


250-252: Correctly updated batch API for token addition.

The batch API is used correctly here for adding the lookahead tokens with proper sequence management.


257-259: Attention to logits flag details.

Good attention to detail by setting the logits flag only for the last level of tokens (j == N - 2), which helps optimize performance.


263-266: Good error handling for decode operation.

The code properly checks the return value of llama_decode_ext and provides a meaningful error message if decoding fails.


483-483: Proper resource cleanup.

The code correctly frees the batch resource at the end of the program.

examples/run/run.cpp (4)

995-995: Initialization of extended batch looks good.
No concerns.


999-999: Decoding call checks return value correctly.
This is a best practice to ensure errors are handled promptly.


1004-1005: Incrementing n_past is appropriate.
This correctly tracks how many tokens have been processed so far.


1020-1020: Single-token batch initialization is clearly written.
Helps produce incremental generation.

examples/speculative/speculative.cpp (6)

168-173: Validate non-empty input length before using n_input - 1.
If n_input is 0 or 1 (e.g., empty prompt), subtracting 1 risks negative sizing.

Would you like a verification script checking that inp.size() is at least 2 before these calls?


204-205: Extended batch initialization is aligned with the new API.
No immediate concerns here.


340-341: Potential out-of-bounds when accessing drafts[i].tokens[i_dft].
This concern was raised in earlier reviews for exactly the same lines.


478-485: Introducing batch_info struct and storing token drafting info.
This cleanly keeps token ID, position, and sequence IDs together.


489-489: Clearing the draft batch is straightforward.
No issues found.


651-652: Freeing both draft & target batches properly.
Looks correct and prevents memory leaks.

examples/save-load-state/save-load-state.cpp (5)

51-51: Extended batch initialization for prompt looks correct.
Uses init_from_text with the new structure.


86-86: Good job checking llama_decode_ext here.
Properly handles any decoding errors.

Also applies to: 88-89


139-139: Return code from decode is properly handled.
Consistent with safe error checking.

Also applies to: 141-142


221-221: Immediately checks llama_decode_ext result.
This consistent approach is good for robust error handling.

Also applies to: 223-224


235-235: Resource freed successfully.
Minimal risk of memory leaks.

src/llama-batch.cpp (12)

276-316: Ensure no double-free occurs if batch is also freed manually.
This constructor creates and owns a new llama_batch_ext which is later deleted in the destructor. If user code also calls llama_batch_ext_free(batch), a double-free may occur.

Would you like a verification script to confirm that no code path calls llama_batch_ext_free() on this batch pointer?


318-320: Destructor ownership may cause confusion.
Deleting batch unconditionally here is acceptable as long as usage never calls llama_batch_ext_free(batch) separately. Otherwise, it leads to undefined behavior.


327-329: No issues with the updated helper function.
This simple helper is in line with the deprecated llama_batch usage.


340-370: Internal initialization logic looks solid.
Allocations for token, embd, and other fields match the design approach. Null-checks are intentionally omitted as per codebase convention.


372-374: Straightforward wrapper function.
It delegates to llama_batch_ext_init_impl with zero embedding dimension.


376-390: Embeddings initialization function is consistent.
Copies embeddings into a newly allocated batch. If embd or n_embd is invalid, behavior is undefined, but that aligns with prior design decisions.


392-398: Position assignment method is concise.
Returns an error if size mismatches. Otherwise, copies positions. No further concerns.


400-403: Simple getter for batch size.
No issues here – the function correctly returns batch->n_tokens.


429-446: Output marking logic is acceptable.
Assigning logits[i] = true upon finding the matching sequence and position is clear. Document negative pos semantics (meaning “any position”) if not already stated.


448-455: Last-token output marker is straightforward.
No further suggestions: marking the final token as output is correct.


457-459: Clearing n_tokens is sufficient.
This function resets the batch by setting n_tokens = 0. No concerns here.


482-498: Free function is consistent with the dynamic allocations.
The pattern of skipping frees for views is correct; user must ensure not to pass a view to any code path that also expects to own memory.

include/llama.h (12)

238-239: Clear deprecation notice is helpful.
Explicitly warning that llama_batch will be removed in future promotes correct migration.


263-266: Forward declaration of llama_batch_ext is good.
This effectively prevents exposure of internal structure details.


921-927: Introducing llama_batch_ext_init as a replacement is appropriate.
This matches the transition away from llama_batch.


928-934: Embeddings-based init function parallels old API.
Just ensure callers provide valid non-null embeddings.


941-946: Pos-setting function documentation is clear.
It warns that n_pos must match the batch size.


947-948: Getter for the token count
Straightforward usage, consistent with the rest of the API.


963-968: Setting output for a sequence ID
No major issues. Negative pos is accepted as “match any position,” which is consistent with the source.


969-977: Last output function
Straightforward approach to mark the last token in the batch as output.


986-988: Clear function
Resets n_tokens to zero. No other concerns.


989-992: Free function
Same logic as in the CPP file, acknowledging special handling of views.


1001-1003: New encode_ext
Allowing the new extended batch structure is in line with the migration strategy.


1011-1011: Minor signature adjustment
Likely just indentation or doc updates. No new issues here.

Comment on lines 600 to 605
llama_batch_ext_clear(batch_tgt);
for (int i = 0; i < (int) batch_tgt_data.size(); ++i) {
const auto & data = batch_tgt_data[i];

llama_batch_ext_add_text(batch_tgt, data.id, data.pos, data.seq_id.data(), data.seq_id.size(), true);
}

⚠️ Potential issue

Target-side decoding result is unchecked.
Consider capturing the return value to detect and handle errors.

Also applies to: 608-610

Comment on lines 570 to 572
drafts[s].i_batch_dft = llama_batch_ext_add_text(batch_dft, id, n_past_cur, &s, 1, true);

if (batch_tgt.n_tokens > n_draft) {
if (batch_tgt_data.size() > (size_t) n_draft) {

⚠️ Potential issue

Check the return code from llama_batch_ext_add_text.
Errors (-1 / -2) are never examined, which can lead to silent failures.

Proposed fix:

drafts[s].i_batch_dft = llama_batch_ext_add_text(batch_dft, id, n_past_cur, &s, 1, true);
-if (batch_tgt_data.size() > (size_t) n_draft) {
+if (drafts[s].i_batch_dft < 0) {
+    LOG_ERR("llama_batch_ext_add_text failed.\n");
+    drafts[s].drafting = false;
+    break;
+} else if (batch_tgt_data.size() > (size_t) n_draft) {
     drafts[s].drafting = false;
     ...
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
drafts[s].i_batch_dft = llama_batch_ext_add_text(batch_dft, id, n_past_cur, &s, 1, true);
if (batch_tgt.n_tokens > n_draft) {
if (batch_tgt_data.size() > (size_t) n_draft) {
drafts[s].i_batch_dft = llama_batch_ext_add_text(batch_dft, id, n_past_cur, &s, 1, true);
if (drafts[s].i_batch_dft < 0) {
LOG_ERR("llama_batch_ext_add_text failed.\n");
drafts[s].drafting = false;
break;
} else if (batch_tgt_data.size() > (size_t) n_draft) {
drafts[s].drafting = false;
...
}

Comment on lines 82 to 85
llama_batch_ext_clear(batch.get());
llama_seq_id seq_id = 0;
llama_batch_ext_add_text(batch.get(), next_token, 0, &seq_id, 1, true);


⚠️ Potential issue

Verify llama_batch_ext_add_text return.
A negative result indicates an error (e.g., batch size exceeded).

Comment on lines 135 to 138
llama_batch_ext_clear(batch.get());
llama_seq_id seq_id = 0;
llama_batch_ext_add_text(batch.get(), next_token, 0, &seq_id, 1, true);


⚠️ Potential issue

Again, the return value of llama_batch_ext_add_text is not used.
Silently ignoring errors can cause unexpected behavior.

Comment on lines 217 to 220
llama_batch_ext_clear(batch.get());
llama_seq_id seq_id = 1; // seq 1 instead of 0
llama_batch_ext_add_text(batch.get(), next_token, 0, &seq_id, 1, true);


⚠️ Potential issue

Adding token for seq 1 ignores error code.
Similar risk of silent failure from llama_batch_ext_add_text.

Comment on lines +950 to +962
// Add text tokens to the batch
// Return values:
// -1 : not enough space in the batch
// -2 : embd is already set, cannot add text tokens
// otherwise, returns the output ID
LLAMA_API int32_t llama_batch_ext_add_text(
struct llama_batch_ext * batch,
llama_token token,
llama_pos pos,
const llama_seq_id * seq_ids,
size_t n_seq_ids,
bool output);


⚠️ Potential issue

Same n_seq_ids out-of-bounds risk applies.
Please note that n_seq_ids can exceed the allocated sequence capacity, leading to potential OOB when writing batch->seq_id[i][j].

Comment on lines +979 to +985
// Get a "view" from a number of tokens offset
// The returned batch must be freed with llama_batch_ext_free()
LLAMA_API struct llama_batch_ext * llama_batch_ext_get_view(
struct llama_batch_ext * batch,
int32_t offset,
int32_t n_tokens);


⚠️ Potential issue

View creation lacks boundary checks
Similar to llama_batch_ext_get_view in the CPP file, offset + n_tokens could go beyond the allocated memory.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
examples/perplexity/perplexity.cpp (1)

933-933: 💡 Verification agent

🧩 Analysis chain

Validate 'eval_results' indexing.
Ensure ir does not exceed eval_results.size() to avoid out-of-bounds writes.


🏁 Script executed:

#!/bin/bash
rg -A 10 "eval_results"

Length of output: 11099


Action: Ensure safe indexing for eval_results to prevent out-of-bound errors.

After reviewing the occurrences of eval_results in the file, note that while the vector is resized based on eval_pairs.size() and the lambda in compute_logprobs guards against writing past its end, later loops use a sequential index (ir) to read from it (e.g., in lines like hs_cur.ending_logprob[s] += eval_results[ir++];). There isn’t an explicit check to confirm that ir remains within bounds when these accesses occur. Please verify that the total number of reads exactly matches the size of eval_results. In addition, consider one or more of the following improvements:

  • Add an assertion or boundary check: Before each increment (e.g., before using eval_results[ir]), insert a check such as assert(ir < eval_results.size()) to catch any overrun during development.
  • Revisit the logic that computes the expected number of indices: Ensure that the loops accumulating log probabilities (both for tasks and multi-choice evaluations) are correctly synchronized with the number of logprob values written by compute_logprobs.

Addressing these points will help prevent possible out‐of‐bounds accesses and strengthen the code’s robustness.
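
A minimal sketch of the suggested assertion around the sequential reads (placement is illustrative and requires <cassert>):

// guard every sequential read before consuming the next logprob value
assert(ir < eval_results.size());
hs_cur.ending_logprob[s] += eval_results[ir++];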

🧹 Nitpick comments (3)
include/llama-cpp.h (2)

40-57: Consider making tokens parameter const

The implementation correctly handles positions incrementally from pos0, which aligns with the migration pattern mentioned in PR objectives. However, the tokens parameter could be declared as const llama_token* since the function doesn't modify the token values.

-    static llama_batch_ext_ptr init_from_text(llama_token * tokens,
+    static llama_batch_ext_ptr init_from_text(const llama_token * tokens,
                                              int32_t   n_tokens,
                                            llama_pos   pos0,
                                         llama_seq_id   seq_id,
                                                 bool   output_last) {

59-66: Consider making embd parameter const

Good implementation that directly wraps the C function. However, the embd parameter could be declared as const float* to match the signature of the underlying llama_batch_ext_init_from_embd function (as shown in the relevant snippets).

-    static llama_batch_ext_ptr init_from_embd(float * embd,
+    static llama_batch_ext_ptr init_from_embd(const float * embd,
                                             size_t   n_tokens,
                                             size_t   n_embd,
                                          llama_pos   pos0,
                                       llama_seq_id   seq_id) {
examples/speculative/speculative.cpp (1)

477-483: Consider documenting the 'batch_info' struct for clarity.
While the struct is mostly self-explanatory, adding a brief overview of how and why it is used would improve maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30f1db9 and 2134cab.

📒 Files selected for processing (5)
  • examples/perplexity/perplexity.cpp (19 hunks)
  • examples/speculative/speculative.cpp (8 hunks)
  • examples/tts/tts.cpp (6 hunks)
  • include/llama-cpp.h (1 hunks)
  • src/llama-context.cpp (4 hunks)
🔥 Files not summarized due to errors (1)
  • examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/llama-context.cpp
🧰 Additional context used
🧬 Code Definitions (4)
examples/tts/tts.cpp (1)
include/llama.h (7) (7)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_add_text (955-961)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_set_output_last (977-977)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_free (991-991)
examples/speculative/speculative.cpp (3)
include/llama.h (6) (6)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_add_text (955-961)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_free (991-991)
src/llama-context.cpp (2) (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
src/llama-batch.cpp (10) (10)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_clear (457-459)
  • llama_batch_ext_clear (457-457)
  • llama_batch_ext_add_text (404-427)
  • llama_batch_ext_add_text (404-410)
  • llama_batch_ext_get_n_tokens (400-402)
  • llama_batch_ext_get_n_tokens (400-400)
  • llama_batch_ext_free (482-498)
  • llama_batch_ext_free (482-482)
include/llama-cpp.h (2)
src/llama-batch.cpp (8) (8)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_add_text (404-427)
  • llama_batch_ext_add_text (404-410)
  • llama_batch_ext_set_output_last (448-455)
  • llama_batch_ext_set_output_last (448-448)
  • llama_batch_ext_init_from_embd (376-390)
  • llama_batch_ext_init_from_embd (376-381)
include/llama.h (4) (4)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_add_text (955-961)
  • llama_batch_ext_set_output_last (977-977)
  • llama_batch_ext_init_from_embd (934-939)
examples/perplexity/perplexity.cpp (4)
src/llama-batch.h (1) (1)
  • batch (99-99)
include/llama.h (5) (5)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_decode_ext (1013-1015)
  • llama_get_logits (1053-1053)
  • llama_batch_ext_set_output_last (977-977)
src/llama-context.cpp (4) (4)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
  • llama_get_logits (2415-2419)
  • llama_get_logits (2415-2415)
src/llama-context.h (3) (3)
  • i (49-49)
  • i (52-52)
  • ctx (157-161)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
  • GitHub Check: windows-2019-cmake-cuda (11.7, cuda)
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: macOS-latest-swift (generic/platform=macOS)
  • GitHub Check: windows-latest-cmake (kompute-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML...
  • GitHub Check: windows-latest-cmake (vulkan-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_...
  • GitHub Check: windows-latest-cmake (avx-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX...
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-windows
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: check-requirements
🔇 Additional comments (25)
include/llama-cpp.h (5)

27-29: LGTM: Deleter for batch_ext follows established pattern

The implementation follows the same pattern as other deleter structures in the file, correctly calling llama_batch_ext_free for memory cleanup.


36-38: Clean smart pointer implementation

Good implementation of the smart pointer wrapper following the established pattern in the file. The constructors properly initialize the base class.


68-77: Proper position handling in add_seq method

Good implementation of the sequence addition method. The incremental position handling (i + pos0) correctly follows the migration pattern's emphasis on proper position handling mentioned in the PR objectives.


79-85: LGTM: Appropriate handling of vector of sequence IDs

The method correctly uses seq_id.data() and seq_id.size() to pass the vector's data to the underlying C function, and properly handles the output flag.


87-93: LGTM: Consistent implementation for single sequence ID overload

This overload maintains consistent behavior with the multi-sequence version and provides a convenient interface for the common case of a single sequence ID.

examples/speculative/speculative.cpp (6)

204-205: Initialization looks correct.
Creating extended batches with specified capacities for the draft and target contexts appears correct.


489-489: Clearing the draft batch is appropriate here.
No concerns.


579-579:
Check llama_decode_ext's return value again.
As previously noted, ignoring the decode return code can hide errors. Checking and handling it ensures robust error handling.

Also applies to: 584-584, 588-588


600-605:
Check llama_decode_ext's return value for error handling.

Also applies to: 608-608


651-652: Proper cleanup.
Memory is freed for both the draft and target batches, which looks correct.


565-566:

❓ Verification inconclusive

Ensure consistent indexing between drafts and batch_tgt_data.
Because the code dynamically pushes to both data structures, verify that downstream references to these indexes remain valid.

Also applies to: 570-570, 572-572


🏁 Script executed:

#!/bin/bash
rg "i_batch_tgt|batch_tgt_data" -A 5

Length of output: 6928


Review: Confirm Consistent Indexing for Dynamic Data Structures

The code correctly uses batch_tgt_data.size() to determine the index that is pushed into drafts[s].i_batch_tgt (as seen at lines 565–566) and similar operations at lines 570 and 572. This approach appears to maintain the invariant that each index stored in a draft reliably corresponds to the element's position in batch_tgt_data.

However, please double-check that any future modifications to how these vectors are updated—such as clearing, copying (notably in the assignment to drafts[n_seq_cur].i_batch_tgt), or additional push/pop operations—do not break this consistency. Also, verify that all downstream accesses (e.g., in logging and sampler function calls) correctly account for these indices.

examples/perplexity/perplexity.cpp (14)

366-367: Decode return code is properly handled.
The approach to initialize, clear, add text, and decode with error checking is a good practice.

Also applies to: 374-374, 378-378


504-504: Adaptive batch capacity initialization.
Using std::min(n_batch, n_ctx*n_seq) for batch size is appropriate.


652-656: Robust error handling in decode_helper.
It's good that you check the return code from llama_decode_ext and skip memcpy on failure.

Also applies to: 659-659


834-834: Batch initialization for HellaSwag.
No issues found.


849-849: Clearing the batch.
No issues found.


899-899: Proper decode helper usage.
You correctly check decode_helper's result.


1115-1115: Batch initialization for Winogrande.
No issues found.


1134-1134: Clearing the batch again.
No immediate concerns.


1175-1175: Proper decode_helper usage.
Return value is checked; no issues.


1338-1338: Extended batch with 'max_seq'.
No issues found.


1448-1448: Clearing the batch.
No issues found.


1546-1546: decode_helper usage for multiple choice tasks.
You check the return value, which is consistent with good error handling.


1733-1733: Initializes a new batch.
No issues found.


1747-1747: Batch usage pattern with decode check.
The pattern of clearing, adding text, and checking decode return code is consistent.

Also applies to: 1749-1749, 1752-1752

Comment on lines 446 to 448
llama_batch_ext_clear(batch_dft);
llama_seq_id seq_id = 0;
llama_batch_ext_add_text(batch_dft, token_id, n_past_dft, &seq_id, 1, true);

🛠️ Refactor suggestion

Check llama_decode_ext's return value.
The llama_decode_ext call’s return code is ignored, which can obscure any runtime errors. Handle the return code to detect and address potential decoding failures.

Also applies to: 452-452

Comment on lines 167 to 172
// eval the prompt with both models
llama_decode(ctx_tgt, llama_batch_get_one( inp.data(), n_input - 1));
llama_decode(ctx_tgt, llama_batch_get_one(&inp.back(), 1));
llama_decode(ctx_dft, llama_batch_get_one( inp.data(), n_input));
auto batch0 = llama_batch_ext_ptr::init_from_text( inp.data(), n_input - 1, 0, 0, true);
auto batch1 = llama_batch_ext_ptr::init_from_text(&inp.back(), 1, n_input - 1, 0, true);
auto batch2 = llama_batch_ext_ptr::init_from_text( inp.data(), n_input , 0, 0, true);
llama_decode_ext(ctx_tgt, batch0.get());
llama_decode_ext(ctx_tgt, batch1.get());

💡 Verification agent

🧩 Analysis chain

Check input sizes to avoid out-of-bounds errors.
In lines 167 and 168, you subtract 1 from n_input and pass &inp.back(), which can lead to out-of-range or invalid pointer references if inp.size() < 2. Make sure n_input >= 2 is enforced or handle the edge cases gracefully.


🏁 Script executed:

#!/bin/bash
rg "n_input" -A 5

Length of output: 18081


Enforce Minimum Input Size Check in Batch Initialization

There’s a potential out-of-bounds error in lines 167–168 of examples/speculative/speculative.cpp. The use of n_input - 1 and &inp.back() assumes that inp contains at least 2 elements. If inp.size() < 2, these calls may lead to invalid pointer accesses. Please add a guard (or precondition) to ensure that n_input >= 2 before these calls, or handle the edge case gracefully.
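
A hedged sketch of such a precondition, placed before the batch initializations:

if (n_input < 2) {
    LOG_ERR("%s: prompt must contain at least 2 tokens, got %d\n", __func__, n_input);
    return 1;
}
auto batch0 = llama_batch_ext_ptr::init_from_text( inp.data(), n_input - 1, 0, 0, true);
auto batch1 = llama_batch_ext_ptr::init_from_text(&inp.back(), 1, n_input - 1, 0, true);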

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (16)
examples/save-load-state/save-load-state.cpp (2)

51-55: Ensure consistent error checking after batch initialization and decoding.
While you call llama_decode_ext at line 54, the return code is not checked to detect potential decoding failures. You do check return values in subsequent calls (e.g., at lines 85, 137, 218). For consistency, consider verifying the return code here as well.

-llama_decode_ext(ctx, batch.get());
+int ret_init_prompt = llama_decode_ext(ctx, batch.get());
+if (ret_init_prompt != 0) {
+    fprintf(stderr, "\n%s : llama_decode_ext() for the prompt failed, ret = %d\n", __func__, ret_init_prompt);
+    llama_batch_ext_free(batch.get());
+    return 1;
+}
 n_past += llama_batch_ext_get_n_tokens(batch.get());

82-87: Handle possible failure in the call to llama_decode_ext.
You've already added a check at line 85, which is good. However, consider logging a more descriptive error message when you detect a decoding failure, to help with troubleshooting.

examples/speculative/speculative.cpp (7)

452-452: Log integer return value from llama_decode_ext.
You are correctly checking if the function returned a non-zero result, but consider logging the return code or corresponding error text for better diagnostics.


477-481: Encapsulate batch_info to enhance maintainability.
Adding a dedicated struct is a good approach for grouping token, position, and sequence IDs. You might also consider adding helper methods (e.g., a constructor or validate()) to ensure each field is populated correctly.
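
A hedged sketch of what such helpers might look like; the field names mirror how the struct is used later in the file (data.id, data.pos, data.seq_id), while validate() itself is purely illustrative:

struct batch_info {
    llama_token               id;
    llama_pos                 pos;
    std::vector<llama_seq_id> seq_id;

    batch_info(llama_token id, llama_pos pos, std::vector<llama_seq_id> seq_id)
        : id(id), pos(pos), seq_id(std::move(seq_id)) {}

    // a token entry is only meaningful if it targets at least one sequence
    bool validate() const { return !seq_id.empty(); }
};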


483-483: Initialize batch_tgt_data capacity.
If the approximate size of batch_tgt_data can be estimated, consider reserving capacity to avoid repeated allocations when pushing back new elements.


578-578: Handle empty draft batch scenario.
At line 578, if llama_batch_ext_get_n_tokens(batch_dft.get()) returns 0, you break out of the loop. Confirm that you're not missing partial tokens or incorrectly concluding emptiness.


588-588: Stop condition correctness.
You break if batch_tgt_data.size() exceeds n_draft. Ensure the partial iteration doesn't skip essential tokens if you only exceed the limit by 1.


601-602: Leverage standard iteration idioms for clarity.
Currently, you use a for loop from 0 to batch_tgt_data.size() - 1. A range-based for (auto &data : batch_tgt_data) might improve readability.
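
The equivalent range-based loop for the quoted snippet would be roughly:

llama_batch_ext_clear(batch_tgt);
for (const auto & data : batch_tgt_data) {
    llama_batch_ext_add_text(batch_tgt, data.id, data.pos, data.seq_id.data(), data.seq_id.size(), true);
}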


608-608: Log the return status of llama_decode_ext for the target batch.
At line 608, you're calling llama_decode_ext without checking if the return was non-zero. Logging or handling errors would make troubleshooting easier.

examples/passkey/passkey.cpp (7)

126-126: Use separate function or wrapper for batch creation.
When you do llama_batch_ext_init(params.n_batch, 1), consider wrapping this in a function (e.g., init_batch_for_passkey()) to streamline repeated usage and ensure consistent parameters.
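
A sketch of the suggested wrapper; init_batch_for_passkey is the hypothetical helper name from this comment, and the parameter type is assumed to match the example's existing params object:

// one place to decide batch capacity and sequence count for this example
static llama_batch_ext * init_batch_for_passkey(const common_params & params) {
    return llama_batch_ext_init(params.n_batch, /*n_seq_max=*/1);
}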


151-151: Validate usage of llama_batch_ext_set_output_last.
You set the output of the last token. Confirm that it matches your intended scenario — i.e., this token is indeed the final token in the chunk you want to generate output for. Otherwise, decoding might produce unexpected results.


154-154: Log decode failure with more context.
When llama_decode_ext(ctx, batch.get()) returns non-zero, log the return code or specify which chunk of tokens triggered the error for easier debugging.


178-181: Consider combining repeated batch operations.
You again clear (line 178) and add tokens (line 181). If you frequently do this pattern, factor it into a helper function that receives tokens and positions, then sets the output flag if needed.


185-185: Verify correctness of final token output flag.
Similar to line 151, setting the output of the last token might be unnecessary if the block only contains partial tokens. Verify you truly want to mark this token for output.


188-188: Propagate the error message.
You log the error but do not print the numeric return code. Consider including it for debugging.


251-251: Add error code check for the final decode.
Currently, you log a generic error but do not fetch the negative code from llama_decode_ext. Logging the code helps with debugging.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2134cab and 2cec1cf.

📒 Files selected for processing (15)
  • common/speculative.cpp (4 hunks)
  • examples/gritlm/gritlm.cpp (5 hunks)
  • examples/llava/gemma3-cli.cpp (6 hunks)
  • examples/llava/qwen2vl-cli.cpp (2 hunks)
  • examples/lookup/lookup.cpp (5 hunks)
  • examples/parallel/parallel.cpp (7 hunks)
  • examples/passkey/passkey.cpp (6 hunks)
  • examples/save-load-state/save-load-state.cpp (5 hunks)
  • examples/server/server.cpp (21 hunks)
  • examples/simple-chat/simple-chat.cpp (3 hunks)
  • examples/simple/simple.cpp (3 hunks)
  • examples/speculative-simple/speculative-simple.cpp (5 hunks)
  • examples/speculative/speculative.cpp (8 hunks)
  • examples/tts/tts.cpp (5 hunks)
  • include/llama-cpp.h (2 hunks)
🔥 Files not summarized due to errors (2)
  • examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
  • examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/simple/simple.cpp
  • examples/lookup/lookup.cpp
  • examples/speculative-simple/speculative-simple.cpp
🧰 Additional context used
🧬 Code Definitions (10)
examples/tts/tts.cpp (1)
include/llama.h (5) (5)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_set_output_last (977-977)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_clear (987-987)
common/speculative.cpp (2)
include/llama-cpp.h (4) (4)
  • llama_batch_ext_ptr (38-38)
  • llama_batch_ext_ptr (39-39)
  • batch (29-29)
  • batch (29-29)
include/llama.h (4) (4)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_decode_ext (1013-1015)
examples/gritlm/gritlm.cpp (2)
include/llama-cpp.h (6) (6)
  • batch (29-29)
  • batch (29-29)
  • token (70-76)
  • token (70-70)
  • token (79-85)
  • token (79-79)
include/llama.h (4) (4)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_get_n_tokens (948-948)
include/llama-cpp.h (2)
src/llama-batch.cpp (8) (8)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_add_text (404-427)
  • llama_batch_ext_add_text (404-410)
  • llama_batch_ext_set_output_last (448-455)
  • llama_batch_ext_set_output_last (448-448)
  • llama_batch_ext_init_from_embd (376-390)
  • llama_batch_ext_init_from_embd (376-381)
include/llama.h (4) (4)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_add_text (955-961)
  • llama_batch_ext_set_output_last (977-977)
  • llama_batch_ext_init_from_embd (934-939)
examples/speculative/speculative.cpp (3)
include/llama.h (4) (4)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_get_n_tokens (948-948)
src/llama-context.cpp (2) (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
src/llama-batch.cpp (6) (6)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_clear (457-459)
  • llama_batch_ext_clear (457-457)
  • llama_batch_ext_get_n_tokens (400-402)
  • llama_batch_ext_get_n_tokens (400-400)
examples/server/server.cpp (6)
include/llama.h (7) (7)
  • llama_batch_ext_init (924-926)
  • llama_pooling_type (481-481)
  • llama_get_embeddings_seq (1080-1080)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_set_output_last (977-977)
  • llama_batch_ext_get_view (981-984)
src/llama-batch.cpp (10) (10)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_clear (457-459)
  • llama_batch_ext_clear (457-457)
  • llama_batch_ext_get_n_tokens (400-402)
  • llama_batch_ext_get_n_tokens (400-400)
  • llama_batch_ext_set_output_last (448-455)
  • llama_batch_ext_set_output_last (448-448)
  • llama_batch_ext_get_view (461-480)
  • llama_batch_ext_get_view (461-464)
src/llama-batch.h (1) (1)
  • batch (99-99)
src/llama-context.cpp (4) (4)
  • llama_pooling_type (2368-2370)
  • llama_pooling_type (2368-2368)
  • llama_get_embeddings_seq (2439-2443)
  • llama_get_embeddings_seq (2439-2439)
src/llama-context.h (9) (9)
  • seq_id (53-53)
  • seq_id (100-100)
  • seq_id (101-101)
  • seq_id (102-102)
  • seq_id (115-120)
  • seq_id (122-126)
  • ctx (157-161)
  • i (49-49)
  • i (52-52)
common/common.h (1) (1)
  • common_embd_normalize (636-636)
examples/save-load-state/save-load-state.cpp (3)
include/llama-cpp.h (4) (4)
  • batch (29-29)
  • batch (29-29)
  • tokens (45-58)
  • tokens (45-49)
include/llama.h (4) (4)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_free (991-991)
src/llama-batch.cpp (6) (6)
  • llama_batch_ext_get_n_tokens (400-402)
  • llama_batch_ext_get_n_tokens (400-400)
  • llama_batch_ext_clear (457-459)
  • llama_batch_ext_clear (457-457)
  • llama_batch_ext_free (482-498)
  • llama_batch_ext_free (482-482)
examples/passkey/passkey.cpp (3)
include/llama-cpp.h (2) (2)
  • batch (29-29)
  • batch (29-29)
include/llama.h (6) (6)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_set_output_last (977-977)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_get_n_tokens (948-948)
  • llama_batch_ext_add_text (955-961)
src/llama-context.cpp (2) (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
examples/simple-chat/simple-chat.cpp (6)
include/llama-cpp.h (2)
  • batch (29-29)
  • batch (29-29)
src/llama-batch.h (1)
  • batch (99-99)
src/llama-kv-cache.h (1)
  • batch (111-111)
src/llama-batch.cpp (6)
  • llama_batch_ext_get_n_tokens (400-402)
  • llama_batch_ext_get_n_tokens (400-400)
  • llama_batch_ext_clear (457-459)
  • llama_batch_ext_clear (457-457)
  • llama_batch_ext_free (482-498)
  • llama_batch_ext_free (482-482)
src/llama-context.cpp (4)
  • n_ctx (412-414)
  • n_ctx (412-412)
  • llama_n_ctx (2340-2342)
  • llama_n_ctx (2340-2340)
src/llama-context.h (1)
  • ctx (157-161)
examples/llava/gemma3-cli.cpp (3)
include/llama.h (5)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_clear (987-987)
  • llama_batch_ext_set_output_last (977-977)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_init_from_embd (934-939)
src/llama-batch.cpp (8)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_clear (457-459)
  • llama_batch_ext_clear (457-457)
  • llama_batch_ext_set_output_last (448-455)
  • llama_batch_ext_set_output_last (448-448)
  • llama_batch_ext_init_from_embd (376-390)
  • llama_batch_ext_init_from_embd (376-381)
src/llama-context.cpp (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: macOS-latest-swift (generic/platform=macOS)
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: android-build
  • GitHub Check: macOS-latest-swift (generic/platform=iOS)
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server (Release)
  • GitHub Check: server-windows
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
🔇 Additional comments (79)
examples/gritlm/gritlm.cpp (7)

4-4: Added necessary header for batch extension support.

The inclusion of llama-cpp.h provides access to the llama_batch_ext_ptr smart pointer implementation required for the updated batch processing mechanism.


17-17: Updated to use smart pointer for better memory management.

The change from raw llama_batch* to llama_batch_ext_ptr enhances memory safety by using RAII principles. The smart pointer will automatically handle resource cleanup.


20-20: Updated batch clearing to use extended API.

The change from common_batch_clear(batch) to llama_batch_ext_clear(batch.get()) correctly adapts to the new API, properly accessing the underlying pointer with .get().


45-45: Updated token addition to use the new batch API.

The migration from common_batch_add to batch.add_text correctly implements the new API pattern. The conditional expression j >= n_inst appropriately determines when tokens should be considered for embedding output.


54-54: Updated decoding to use extended API.

Changing from llama_decode to llama_decode_ext is part of the batch API migration. Note that error checking is omitted as mentioned in previous review discussions.


108-108: Updated generation batch initialization.

The batch initialization in the generate function follows the same smart pointer pattern as in the encode function.


114-114: Updated batch handling in generation function.

The changes consistently apply the batch API migration pattern:

  1. Using llama_batch_ext_clear for batch clearing
  2. Using batch.add_text for token addition with position tracking
  3. Using llama_decode_ext for decoding
  4. Using llama_batch_ext_get_n_tokens to retrieve token count

This ensures proper sequence handling throughout generation.

Also applies to: 119-119, 124-124, 126-126
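
To make the pattern concrete, here is a minimal sketch of one generation step using only the calls listed above (llama_batch_ext_clear, llama_batch_ext_add_text, llama_decode_ext, llama_batch_ext_get_n_tokens). It is illustrative, not the literal gritlm.cpp code: ctx, new_token and n_past are placeholder names, and the two-argument llama_batch_ext_init(n_tokens_max, n_seq_max) form follows the snippets quoted elsewhere in this review.

// created once, e.g.: llama_batch_ext_ptr batch(llama_batch_ext_init(n_batch, 1));
static void decode_one_token(llama_context * ctx, llama_batch_ext_ptr & batch, llama_token new_token, llama_pos & n_past) {
    llama_seq_id seq_id = 0;
    llama_batch_ext_clear(batch.get());                                            // 1. reset the batch
    llama_batch_ext_add_text(batch.get(), new_token, n_past++, &seq_id, 1, true);  // 2. add one token, request logits
    if (llama_decode_ext(ctx, batch.get()) != 0) {                                 // 3. decode
        // error handling is intentionally omitted in the example code, per the note above
    }
    const int32_t n_tokens = llama_batch_ext_get_n_tokens(batch.get());            // 4. query the token count
    (void) n_tokens;
}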

common/speculative.cpp (8)

17-17: Updated batch member to use smart pointer.

Changed from raw pointer struct llama_batch * batch to smart pointer llama_batch_ext_ptr batch. This improves memory safety by using RAII principles for automatic resource management.


26-26: Updated batch initialization in constructor.

Correctly initializes the smart pointer with llama_batch_ext_init, passing appropriate parameters for token capacity and sequence count.


65-73: Removed manual batch freeing.

The function no longer needs to call llama_batch_free(spec->batch) because the smart pointer will automatically handle memory cleanup when it goes out of scope or when the object is deleted.


207-212: Updated first batch operation sequence.

The batch clearing and token addition operations have been properly updated to use the extended API pattern:

  1. llama_batch_ext_clear(batch.get()) for clearing
  2. batch.add_text for adding tokens with appropriate position tracking

This ensures correct token processing in the prompt evaluation phase.


217-221: Updated token count check and decoding.

The changes properly:

  1. Use llama_batch_ext_get_n_tokens(batch.get()) instead of direct access to batch.n_tokens
  2. Use llama_decode_ext with the correct pointer access via .get()

This maintains proper batch handling while transitioning to the new API.


227-229: Updated batch handling for token processing.

The batch clearing and token addition operations following the last accepted token have been properly updated to use the extended API pattern.


234-234: Updated decode call for the last accepted token.

Correctly uses llama_decode_ext with proper pointer access via .get().


240-271: Updated batch operations in the token sampling loop.

The changes in the draft token generation loop correctly implement the batch API migration pattern:

  1. Using llama_batch_ext_clear for batch clearing between iterations
  2. Using batch.add_text for adding new tokens with proper position tracking
  3. Using llama_decode_ext for decoding each new token

This ensures proper token generation and evaluation in the speculative process.

examples/server/server.cpp (15)

1227-1227: Updated batch_spec type to use smart pointer

The type has been changed from llama_batch batch_spec to llama_batch_ext_ptr batch_spec, which is likely a unique_ptr wrapper. This is a good change as it provides automatic memory management.


1799-1799: Updated batch type to use smart pointer

Similar to the batch_spec change, batch is now a llama_batch_ext_ptr instead of a llama_batch. This improvement helps with automatic resource management.


1925-1926: Changed batch initialization to use the new API

The initialization now properly uses reset() with llama_batch_ext_init() instead of the previous default initialization. This is consistent with the migration pattern.


1961-1962: Updated batch initialization with proper sizing

The initialization of the main batch now uses reset() with an appropriate size calculation and explicitly sets the sequence limit to 1, which is clearer than the previous implementation.


2096-2097: Consistent use of batch_spec initialization

This follows the same pattern seen earlier, using reset() with proper parameters to initialize the batch for speculative decoding.


2404-2431: Refactored embedding retrieval to use sequence-based approach

The send_embedding method has been significantly refactored to:

  1. No longer require a batch parameter
  2. Use sequence IDs for embedding retrieval
  3. Properly handle pooling type checks
  4. Handle normalization consistently

Note: There's a temporary limitation with POOLING_TYPE_NONE which is now explicitly handled.


2454-2474: Refactored rerank to use sequence-based approach

Similar to the embedding changes, send_rerank now:

  1. Takes no batch parameter
  2. Uses sequence IDs to retrieve embeddings
  3. Has improved error handling

This is consistent with the new API pattern.


2853-2854: Updated batch clearing to use new API

Now using llama_batch_ext_clear() instead of the previous method. This is consistent with the migration to the new batch API.


2877-2878: Changed token addition method to use new API

Now using batch.add_text() instead of directly manipulating batch members. This provides a cleaner interface and better encapsulation.


3084-3085: Updated token addition in prompt processing

The code now uses batch.add_text() for adding prompt tokens to the batch, which is consistent with the new API and helps ensure proper memory management.


3111-3112: Updated output selection with new API

Now using llama_batch_ext_set_output_last() to mark the last token as the output token, which is cleaner than the previous approach.


3141-3147: Updated batch processing loop with view-based approach

The batch processing now:

  1. Uses llama_batch_ext_get_n_tokens() to get the token count
  2. Creates a view with llama_batch_ext_get_view() for batch processing
  3. Uses llama_decode_ext() instead of the previous function

This approach is more efficient as it processes chunks of the batch without copying data.
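
As a rough sketch of that loop (ctx, batch and n_batch are placeholders; it assumes llama_batch_ext_get_view takes the source batch, a token offset and a chunk length, that views are released with llama_batch_ext_free as the references above suggest, and that <algorithm> is included for std::min):

const int32_t n_tokens_in_batch = llama_batch_ext_get_n_tokens(batch.get());
for (int32_t i = 0; i < n_tokens_in_batch; i += n_batch) {
    const int32_t n_eval = std::min(n_batch, n_tokens_in_batch - i);           // don't overshoot the batch
    llama_batch_ext * view = llama_batch_ext_get_view(batch.get(), i, n_eval); // a view, not a copy
    const int ret = llama_decode_ext(ctx, view);
    llama_batch_ext_free(view);                                                // release the temporary view
    if (ret != 0) {
        // handle the failed chunk
        break;
    }
}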


3281-3286: Updated speculative batch processing

The speculative batch processing now:

  1. Clears the batch with llama_batch_ext_clear()
  2. Adds tokens with batch_spec.add_text()
  3. Maintains proper position management

The code is more concise and readable than before.


4147-4150: Added important validation for pooling type

This new check aborts with an informative error message when the pooling type is NONE, which is not yet supported. This prevents silent failures and communicates limitations clearly.


3094-3095: Improved logging clarity with new token count method

The debug logging now uses llama_batch_ext_get_n_tokens() to report progress, which is more accurate than the previous approach.

examples/simple-chat/simple-chat.cpp (6)

2-2: Include of "llama-cpp.h" is appropriate.
No issues here.


112-114: Initialization and tracking of n_past.
The usage of n_past to store the starting position and then incrementing by llama_batch_ext_get_n_tokens(batch.get()) helps maintain alignment. Ensure that subsequent generate loops consistently track n_past, especially if multiple tokens might be appended in one loop iteration.


121-121: Consider a boundary check for context size.
You might want to verify whether using >= n_ctx instead of > n_ctx is more appropriate to avoid off-by-one overflows at the exact boundary.

Would you like to confirm that the code works as intended when n_ctx_used + llama_batch_ext_get_n_tokens(batch.get()) is exactly n_ctx?
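
For reference, a minimal version of the check under discussion (using llama_n_ctx and llama_kv_self_used_cells as referenced elsewhere in this review; whether > or >= is right depends on whether exactly filling the context should still be allowed):

const int n_ctx      = llama_n_ctx(ctx);
const int n_ctx_used = llama_kv_self_used_cells(ctx);
if (n_ctx_used + llama_batch_ext_get_n_tokens(batch.get()) > n_ctx) {
    // context size exceeded -- stop generating (consider >= to also reject an exact fill)
}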


127-127: Decode error handling seems sufficient.
Decoding failure is caught, and the program aborts as a fallback. The logic is concise.


151-153: Clearing the batch before adding the new token.
This pattern is consistent with single-token generation steps, then incrementing n_past. The approach looks good for iterative token decoding.


156-156: Explicitly freeing the extended batch.
Releasing resources here is appropriate, ensuring no memory leak.

include/llama-cpp.h (4)

8-8: Including <vector> is valid.
This is a minimal addition needed for sequence-handling methods.


28-30: Custom deleter for llama_batch_ext.
Wrapping llama_batch_ext_free in a functor is a neat approach to automatic resource management.


37-40: Introduction of llama_batch_ext_ptr.
Defining a specialized smart pointer for extended batches is a clean design. It clarifies ownership and cleanup responsibilities.


60-67: Creating a batch from embeddings.
Similarly, consider validating that llama_batch_ext_init_from_embd(...) returns a non-null pointer. If it fails (e.g., if memory is insufficient), the function currently does not handle the error.

Would you like to confirm that an internal check assures non-null returns from llama_batch_ext_init_from_embd?
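
For readers following along, the deleter/alias pattern described above boils down to something like the simplified sketch below. It is illustrative rather than the shipped definition (the real llama_batch_ext_ptr in this PR also adds helpers such as add_text, clear and the init_from_* wrappers), and it assumes <memory> is included; it also shows the null check suggested for llama_batch_ext_init_from_embd.

struct llama_batch_ext_deleter {
    void operator()(llama_batch_ext * batch) { llama_batch_ext_free(batch); }
};

// ownership is explicit and cleanup is automatic
using llama_batch_ext_ptr_sketch = std::unique_ptr<llama_batch_ext, llama_batch_ext_deleter>;

llama_batch_ext_ptr_sketch batch(llama_batch_ext_init_from_embd(embd, n_tokens, n_embd, pos0, seq_id));
if (!batch) {
    // init_from_embd returned null (e.g. allocation failure) -- handle it here
}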

examples/parallel/parallel.cpp (16)

9-9: New include of "llama-cpp.h".
No issues with adding this header for extended batch usage.


178-178: Initializing the extended batch with n_ctx.
This creates a single sequence slot of size n_ctx. Ensure that n_ctx >= 1 is enforced elsewhere to avoid out-of-bounds risk.


196-196: Adding system tokens to the batch.
This alignment sets each system token's position to i. Looks correct as a simpler approach to building the system prompt.


199-199: Decoding the system prompt in one step.
Error handling is consistent: logs an error on failure, then returns.


220-220: Clearing the batch.
Resetting the batch for the next wave of tokens is consistent with the new approach.


228-228: Capturing batch offset in client.i_batch.
This is used later to know how far into the batch the client’s tokens reside. Straightforward.


230-231: Adding sampled token with updated position and distinct sequence ID.
Adding one token at a time is valid. Passing true for the output flag ensures logits are produced only for that token.


236-236: Checking for an empty batch.
If no tokens remain, the code resets the KV cache for all clients. The logic is consistent with reusing the system prompt.


248-248: Conditional insertion of new sequences.
Re-initializing requests if no tokens exist or if cont_batching is specified. Implementation is sound.


267-268: Generating prompt tokens for a new client.
seq_id is derived from client.id + 1; tokens are offset by n_tokens_system. Good logic for concurrency.


272-273: Marking the last token as output.
Ensures logits are extracted for the final token. Correct usage of llama_batch_ext_set_output_last.
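
Sketch of the ingestion pattern around these lines (variable names are illustrative; the seq_id offset and the empty-batch guard follow the surrounding review notes, with sequence 0 reserved for the system prompt):

llama_seq_id seq_id = client_id + 1;  // sequence 0 holds the shared system prompt
for (size_t i = 0; i < prompt_tokens.size(); ++i) {
    // no per-token logits while ingesting the prompt
    llama_batch_ext_add_text(batch.get(), prompt_tokens[i], n_tokens_system + (llama_pos) i, &seq_id, 1, false);
}
if (llama_batch_ext_get_n_tokens(batch.get()) > 0) {
    llama_batch_ext_set_output_last(batch.get());  // logits only for the final prompt token
}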


279-279: Tracking the offset of the just-added token.
client.i_batch = <batch size> - 1 points to the newly inserted token's position. Straightforward indexing.


299-299: Retrieving total tokens in batch.
Storing it in n_tokens_in_batch is a standard approach for chunked decoding.


300-300: Looping over tokens in steps of n_batch.
This chunk-based iteration is consistent with partial decoding to avoid memory or time overhead.


308-308: Calculating effective tokens to process.
std::min(n_batch, n_tokens_in_batch - i) ensures we don't overshoot.


310-312: Creating a batch view for partial decoding and freeing it afterward.
Splitting the batch into sub-ranges, decoding, then freeing the temporary view is an appropriate pattern. The check on ret is also consistent.

examples/save-load-state/save-load-state.cpp (3)

134-139: Double-check sequence handling when re-clearing and re-decoding.
When you clear and reuse batch at lines 134–135, ensure that the changed sequence or context is correctly accounted for in the subsequent decode (line 137). If the context depends on progressively accumulated tokens, clearing prematurely might discard state that is still needed. Confirm that this is intended.


215-220: Evaluate positional consistency for multi-sequence decoding.
At lines 215–216, you switch the sequence ID from 0 to 1 when adding the token. Ensure that the new sequence ID is reflected appropriately in your n_past logic and subsequent decode calls to avoid mismatched positions between the two sequences.


232-232: Avoid double-free risk.
You invoke llama_batch_ext_free(batch.get()) at line 232. Verify that this pointer is not freed or reset in any other block (like lines 87, 139, or 220). Multiple frees of the same pointer would cause undefined behavior.
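
If the explicit free is kept, one way to rule out a double free is to detach ownership from the smart pointer first; otherwise the simplest fix is to drop the manual call and let the deleter run. Illustrative only:

// Option A: rely on RAII -- no explicit llama_batch_ext_free at all.
// Option B: keep an explicit free, but release ownership so the deleter cannot free it again.
llama_batch_ext_free(batch.release());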

examples/speculative/speculative.cpp (9)

6-6: Confirm correct include order.
Placing "llama-cpp.h" after "llama.h" is fine, but ensure there are no forward dependency issues or macros needed from llama.h before llama-cpp.h.


169-174: Check for off-by-one errors in prompt split.
You split the prompt tokens into batch0 (n_input - 1) and batch1 (&inp.back()). If inp.size() is 1, n_input - 1 becomes 0, which risks an empty initialization. Confirm there are always at least two tokens.


205-206: Favor descriptive naming for batch pointers.
You name them batch_dft and batch_tgt for the draft and target contexts, which is clear. Make sure that each is used only in the correct context to avoid confusion in multi-model usage.


447-448: Validate token positions when clearing and re-adding.
Clearing batch_dft at line 447 resets its existing tokens. Ensure that re-adding a token at line 448 with pos = n_past_dft does not conflict with prior tokens if re-adding them was intended.


485-485: Double-check index usage for the first token.
You push the first token from drafts[0].tokens[0] at line 485. Ensure drafts[0].tokens is not empty. If there's a chance the array could be empty, add a safety check to avoid out-of-bounds errors.


489-489: Clarify the reason for clearing batch_dft.
Before you loop over n_draft tokens, you clear batch_dft again. Ensure you don't lose essential context tokens. If the intent is a fresh start for each iteration, this usage is valid.


566-567: Validate target batch data indexing.
At line 566, you push back a new index into i_batch_tgt; at line 567, you push a new struct into batch_tgt_data. Confirm that only one new element is appended, and that the index is correct for the newly appended item.


584-584: Validate decoding result in the next loop iteration.
After calling llama_decode_ext at line 584, the next iteration of the loop increments n_past_cur. If decoding fails or partially processes tokens, confirm you handle it appropriately in subsequent logic.


600-600: Clarify rationale for clearing batch_tgt.
Similar to batch_dft, you clear batch_tgt at line 600. Double-check that any partial tokens aren’t lost before you loop through batch_tgt_data.

examples/passkey/passkey.cpp (3)

5-5: Include guard check.
Adding "llama-cpp.h" after "llama.h" is fine, but ensure "llama-cpp.h" does not require macros or declarations only found in llama.h. If so, reorder or update includes.


144-147: Confirm the loop variable updates for chunked token adds.
Here, you're clearing the batch (line 144) and then re-adding tokens (line 147). Ensure that n_past++ remains accurate across iterations, especially if partial tokens remain in the batch from prior steps.


227-227: Prevent negative index usage.
You pass llama_batch_ext_get_n_tokens(batch.get()) - 1 to llama_sampler_sample. If no tokens exist in batch, this becomes -1, which can cause issues. Ensure there's at least one token.
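
A minimal guard along the lines suggested (smpl, ctx and batch are the local names assumed here):

const int32_t n_tok = llama_batch_ext_get_n_tokens(batch.get());
if (n_tok > 0) {
    const llama_token id = llama_sampler_sample(smpl, ctx, n_tok - 1);  // sample from the last output
    // ... use id ...
}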

examples/llava/gemma3-cli.cpp (8)

8-8: New dependency added for extended batch API

This include is necessary for accessing the extended batch implementation and smart pointer definitions.


67-67: Migration to extended batch API with smart pointer

The type has been updated from llama_batch to llama_batch_ext_ptr, which is an appropriate change for the migration pattern. The use of a smart pointer will help with automatic memory management.


77-77: Proper initialization of extended batch smart pointer

The initialization now uses reset() with llama_batch_ext_init(), correctly setting up the extended batch with the specified batch size and 1 sequence.


93-96: Consistent batch clearing and token addition using extended API

The code has been properly migrated from using common_batch_clear and common_batch_add to their extended API equivalents. The sequence ID is consistently maintained as 0.


98-99: Output logits flag correctly set

The migration from direct logits access to using llama_batch_ext_set_output_last is appropriate for the extended batch API.


101-101: Updated decode function for extended batch API

The function call has been correctly updated from llama_decode to llama_decode_ext, maintaining the same error checking pattern.


150-151: Image batch initialization migrated to extended API

The image batch creation has been properly updated to use llama_batch_ext_init_from_embd. The position handling with ctx.n_past is maintained correctly, and the sequence ID is consistently set to 0.


181-183: Response generation updated to use extended batch API

The token generation logic has been correctly migrated to use the extended batch API, maintaining the same position handling with ctx.n_past++ and sequence ID of 0.

Comment on lines 78 to 86
// Wrapper to add a single token to the batch (single sequence ID)
int32_t add_text(llama_token token, llama_pos pos, llama_seq_id seq_id, bool output_last) {
    int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
    if (output_last) {
        llama_batch_ext_set_output_last(this->get());
    }
    return output_id;
}
};

🛠️ Refactor suggestion

Single sequence ID addition.
The same concern about ignoring negative return values applies.

for (int i = 0; i < (int) batch_tgt_data.size(); ++i) {
const auto & data = batch_tgt_data[i];

batch_tgt.add_text(data.id, data.pos, data.seq_id, true);

🛠️ Refactor suggestion

Check each add_text call for success.
Similar to prior calls, batch_tgt.add_text can fail. Consider capturing and verifying its return value.

drafts[s].i_batch_dft = batch_dft.n_tokens;

common_batch_add(batch_dft, id, n_past_cur, { s }, true);
drafts[s].i_batch_dft = batch_dft.add_text(id, n_past_cur, s, true);

🛠️ Refactor suggestion

Check return value from batch_dft.add_text.
batch_dft.add_text may return an error code (negative values). Ignoring this could result in silent failures if the token cannot be added.

drafts[s].i_batch_dft = batch_dft.add_text(id, n_past_cur, s, true);
+if (drafts[s].i_batch_dft < 0) {
+    LOG_ERR("%s : batch_dft.add_text failed for seq %d\n", __func__, s);
+    drafts[s].drafting = false;
+    break;
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
drafts[s].i_batch_dft = batch_dft.add_text(id, n_past_cur, s, true);
if (drafts[s].i_batch_dft < 0) {
    LOG_ERR("%s : batch_dft.add_text failed for seq %d\n", __func__, s);
    drafts[s].drafting = false;
    break;
}

Comment on lines 241 to 245
llama_batch_ext_clear(batch.get());

// push this new token for next evaluation
common_batch_add(batch, new_token_id, n_past++, { 0 }, true);
llama_seq_id seq_id = 0;
llama_batch_ext_add_text(batch.get(), new_token_id, n_past++, &seq_id, 1, true);

🛠️ Refactor suggestion

Align seq_id usage with multi-sequence logic.
At lines 244–245, you create a single seq_id = 0 and add a token with output = true. In multi-sequence scenarios, you might need multiple seq_ids. Confirm this usage.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🔭 Outside diff range comments (1)
examples/perplexity/perplexity.cpp (1)

1734-1765: 🛠️ Refactor suggestion

Evaluate reusability of the batch object across multiple loops.
Re-initializing a new llama_batch_ext_ptr batch in each loop can be expensive. Review if you can reuse a single batch, calling batch.clear() each time, and only re-init if capacity needs to change.
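
A sketch of the reuse pattern suggested here (capacity and loop bounds are placeholders; clear() and add_text() are the llama_batch_ext_ptr helpers discussed elsewhere in this review):

// allocate once with enough capacity, then reuse it for every chunk
llama_batch_ext_ptr batch(llama_batch_ext_init(n_batch, 1));
for (int chunk = 0; chunk < n_chunks; ++chunk) {
    batch.clear();
    // ... batch.add_text(token, pos, /*seq_id=*/0, /*output=*/...) for this chunk ...
    if (llama_decode_ext(ctx, batch.get()) != 0) {
        // handle the failure for this chunk
    }
}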

♻️ Duplicate comments (6)
examples/run/run.cpp (1)

998-999: ⚠️ Potential issue

Handle the return value of check_context_size.

Currently, its return value is ignored, which could let decoding proceed even when the context limit is exceeded.

Apply this diff to check the context size properly:

-    check_context_size(llama_data.context, batch);
-    if (llama_decode_ext(llama_data.context.get(), batch.get())) {
+    if (check_context_size(llama_data.context, batch) != 0) {
+        return 1; // or handle the context overflow error
+    }
+    if (llama_decode_ext(llama_data.context.get(), batch.get())) {
examples/speculative/speculative.cpp (5)

169-174: ⚠️ Potential issue

Enforce minimum input size check in batch initialization.

There's a potential out-of-bounds error in these lines. The use of n_input - 1 and &inp.back() assumes that inp contains at least 2 elements.

If inp.size() < 2, these calls may lead to invalid pointer accesses. Add a guard to ensure that n_input >= 2 before these calls, or handle the edge case gracefully.


570-570: ⚠️ Potential issue

Check return value from batch_dft.add_text.

batch_dft.add_text may return an error code (negative values). Ignoring this could result in silent failures if the token cannot be added.

drafts[s].i_batch_dft = batch_dft.add_text(id, n_past_cur, s, true);
+if (drafts[s].i_batch_dft < 0) {
+    LOG_ERR("%s : batch_dft.add_text failed for seq %d\n", __func__, s);
+    drafts[s].drafting = false;
+    break;
+}

584-585: ⚠️ Potential issue

Check the return code from llama_decode_ext.

The return value of llama_decode_ext is ignored, which can lead to silent failures if decoding fails.

Add error handling for the decode operation:

-llama_decode_ext(ctx_dft, batch_dft.get());
+int ret_decode = llama_decode_ext(ctx_dft, batch_dft.get());
+if (ret_decode != 0) {
+    LOG_ERR("%s: llama_decode_ext failed with error code %d\n", __func__, ret_decode);
+    return 1;
+}

604-604: ⚠️ Potential issue

Check each add_text call for success.

Similar to prior calls, batch_tgt.add_text can fail. Consider capturing and verifying its return value.

-batch_tgt.add_text(data.id, data.pos, data.seq_id, true);
+int ret = batch_tgt.add_text(data.id, data.pos, data.seq_id, true);
+if (ret < 0) {
+    LOG_ERR("%s: batch_tgt.add_text failed with error code %d\n", __func__, ret);
+    // Handle error appropriately
+}

608-608: ⚠️ Potential issue

Check the return code from llama_decode_ext.

The target-side decoding result is unchecked, which could hide potential errors.

-llama_decode_ext(ctx_tgt, batch_tgt.get());
+int ret_decode = llama_decode_ext(ctx_tgt, batch_tgt.get());
+if (ret_decode != 0) {
+    LOG_ERR("%s: target llama_decode_ext failed with error code %d\n", __func__, ret_decode);
+    return 1;
+}
🧹 Nitpick comments (5)
examples/perplexity/perplexity.cpp (3)

366-379: Promote consistency checks for batch.add_text return values.
Although llama_decode_ext error handling is present, batch.add_text can also fail (e.g., exceeding capacity). Consider checking its return status for additional resilience.

You can wrap batch.add_text(...) in a small helper that logs or handles negative return codes, like:

+ int err = batch.add_text(tokens[batch_start + i], j*n_batch + i, 0, true);
+ if (err < 0) {
+   LOG_ERR("batch.add_text failed with code %d\n", err);
+   return ...;
+ }

652-660: Provide more context in decode error logging.
Currently, only the failure code ret is logged. Add relevant context, e.g., the function name or batch details, to make troubleshooting easier.

- LOG_ERR("failed to decode the batch, ret = %d\n", ret);
+ LOG_ERR("decode_helper: llama_decode_ext failed, ret = %d; n_outputs=%zu, n_vocab=%d\n",
+         ret, n_outputs, n_vocab);

833-900: Use consistent batch naming to avoid confusion.
You create a new batch for image-like data but give it the same name used for text batches. For clarity, consider naming it batch_img or similar, especially if the code repeatedly switches between text-based batches and other batch types.

examples/save-load-state/save-load-state.cpp (2)

82-87: Safeguard batch usage after decode failures.
If llama_decode_ext fails, the code immediately frees the batch. Consider setting batch to nullptr or re-initializing it after this block to avoid potential misuse in subsequent steps.


232-232: Freeing the batch.
You manually free batch even though batch.get() is used earlier. Double-check that the same pointer is not freed again elsewhere. Consider relying on RAII (letting the smart pointer's deleter run) instead of an explicit free to avoid confusion.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cec1cf and 3802ff2.

📒 Files selected for processing (16)
  • common/speculative.cpp (4 hunks)
  • examples/gritlm/gritlm.cpp (5 hunks)
  • examples/llava/gemma3-cli.cpp (6 hunks)
  • examples/lookup/lookup.cpp (5 hunks)
  • examples/parallel/parallel.cpp (7 hunks)
  • examples/passkey/passkey.cpp (6 hunks)
  • examples/perplexity/perplexity.cpp (19 hunks)
  • examples/run/run.cpp (4 hunks)
  • examples/save-load-state/save-load-state.cpp (5 hunks)
  • examples/server/server.cpp (21 hunks)
  • examples/simple-chat/simple-chat.cpp (3 hunks)
  • examples/simple/simple.cpp (3 hunks)
  • examples/speculative-simple/speculative-simple.cpp (5 hunks)
  • examples/speculative/speculative.cpp (8 hunks)
  • examples/tts/tts.cpp (5 hunks)
  • include/llama-cpp.h (2 hunks)
🔥 Files not summarized due to errors (1)
  • examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
🚧 Files skipped from review as they are similar to previous changes (4)
  • examples/simple/simple.cpp
  • examples/lookup/lookup.cpp
  • examples/passkey/passkey.cpp
  • include/llama-cpp.h
🧰 Additional context used
🧠 Learnings (1)
examples/speculative-simple/speculative-simple.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-21T10:48:55.882Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.
🧬 Code Definitions (11)
common/speculative.cpp (5)
include/llama-cpp.h (4)
  • llama_batch_ext_ptr (38-38)
  • llama_batch_ext_ptr (39-39)
  • batch (29-29)
  • batch (29-29)
include/llama.h (2)
  • llama_batch_ext_init (924-926)
  • llama_decode_ext (1013-1015)
src/llama-batch.cpp (2)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
src/llama-context.cpp (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
src/llama-batch.h (1)
  • batch (99-99)
examples/simple-chat/simple-chat.cpp (4)
include/llama-cpp.h (2)
  • batch (29-29)
  • batch (29-29)
src/llama-context.cpp (4)
  • llama_kv_self_used_cells (2522-2524)
  • llama_kv_self_used_cells (2522-2522)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
src/llama-context.h (1)
  • ctx (157-161)
src/llama-batch.cpp (2)
  • llama_batch_ext_free (482-498)
  • llama_batch_ext_free (482-482)
examples/gritlm/gritlm.cpp (5)
include/llama-cpp.h (6)
  • batch (29-29)
  • batch (29-29)
  • token (70-76)
  • token (70-70)
  • token (79-85)
  • token (79-79)
src/llama-batch.h (1)
  • batch (99-99)
include/llama.h (3)
  • llama_batch_ext_init (924-926)
  • llama_n_batch (468-468)
  • llama_decode_ext (1013-1015)
src/llama-context.cpp (4)
  • llama_n_batch (2344-2346)
  • llama_n_batch (2344-2344)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
src/llama-context.h (3)
  • ctx (157-161)
  • i (49-49)
  • i (52-52)
examples/parallel/parallel.cpp (7)
examples/perplexity/perplexity.cpp (1)
  • batch (1468-1468)
include/llama-cpp.h (2)
  • batch (29-29)
  • batch (29-29)
src/llama-batch.h (1)
  • batch (99-99)
src/llama-context.cpp (6)
  • n_ctx (412-414)
  • n_ctx (412-412)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
  • n_batch (420-422)
  • n_batch (420-420)
src/llama-context.h (3)
  • i (49-49)
  • i (52-52)
  • ctx (157-161)
src/llama-model.h (1)
  • params (389-392)
src/llama-vocab.cpp (2)
  • n_tokens (2751-2753)
  • n_tokens (2751-2751)
examples/speculative-simple/speculative-simple.cpp (5)
include/llama-cpp.h (2)
  • batch (29-29)
  • batch (29-29)
include/llama.h (3)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_init (924-926)
  • llama_n_batch (468-468)
src/llama-context.cpp (4)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
  • llama_n_batch (2344-2346)
  • llama_n_batch (2344-2344)
examples/speculative/speculative.cpp (1)
  • batch_tgt (206-206)
src/llama-batch.cpp (2)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
examples/server/server.cpp (2)
include/llama.h (3)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_set_output_last (977-977)
  • llama_batch_ext_get_view (981-984)
src/llama-batch.cpp (6)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_set_output_last (448-455)
  • llama_batch_ext_set_output_last (448-448)
  • llama_batch_ext_get_view (461-480)
  • llama_batch_ext_get_view (461-464)
examples/perplexity/perplexity.cpp (1)
include/llama.h (4)
  • llama_batch_ext_init (924-926)
  • llama_decode_ext (1013-1015)
  • llama_get_logits (1053-1053)
  • llama_batch_ext_set_output_last (977-977)
examples/save-load-state/save-load-state.cpp (5)
include/llama-cpp.h (4)
  • batch (29-29)
  • batch (29-29)
  • tokens (45-58)
  • tokens (45-49)
src/llama-batch.h (1)
  • batch (99-99)
include/llama.h (2)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_free (991-991)
src/llama-context.cpp (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
src/llama-batch.cpp (2)
  • llama_batch_ext_free (482-498)
  • llama_batch_ext_free (482-482)
examples/run/run.cpp (3)
include/llama-cpp.h (4)
  • batch (29-29)
  • batch (29-29)
  • tokens (45-58)
  • tokens (45-49)
src/llama-context.cpp (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
include/llama.h (1)
  • llama_decode_ext (1013-1015)
examples/speculative/speculative.cpp (4)
include/llama.h (4)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_init (924-926)
  • llama_n_batch (468-468)
  • llama_kv_self_seq_rm (673-677)
src/llama-context.cpp (6)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
  • llama_n_batch (2344-2346)
  • llama_n_batch (2344-2344)
  • llama_kv_self_seq_rm (2544-2550)
  • llama_kv_self_seq_rm (2544-2548)
src/llama-batch.cpp (2)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
src/llama-context.h (2)
  • i (49-49)
  • i (52-52)
examples/llava/gemma3-cli.cpp (3)
include/llama.h (4)
  • llama_batch_ext_init (924-926)
  • llama_batch_ext_set_output_last (977-977)
  • llama_decode_ext (1013-1015)
  • llama_batch_ext_init_from_embd (934-939)
src/llama-batch.cpp (6)
  • llama_batch_ext_init (372-374)
  • llama_batch_ext_init (372-372)
  • llama_batch_ext_set_output_last (448-455)
  • llama_batch_ext_set_output_last (448-448)
  • llama_batch_ext_init_from_embd (376-390)
  • llama_batch_ext_init_from_embd (376-381)
src/llama-context.cpp (2)
  • llama_decode_ext (2780-2784)
  • llama_decode_ext (2780-2782)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: android-build
  • GitHub Check: windows-2019-cmake-cuda (11.7, cuda)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server (Release)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-windows
🔇 Additional comments (77)
examples/gritlm/gritlm.cpp (7)

4-4: Appropriate header inclusion for the new batch API.

The addition of llama-cpp.h provides the necessary smart pointer wrapper for llama_batch_ext structures.


17-17: Good migration to batch extension API.

The change from raw llama_batch to smart pointer llama_batch_ext_ptr improves memory management. The initialization parameter count differs from the original API (removed the middle parameter), which is part of the intended API change.


20-20: Properly migrated batch operations to object methods.

The migration from C-style function calls (common_batch_clear, common_batch_add) to C++ object methods (batch.clear(), batch.add_text()) makes the code more consistent with modern C++ practices.

Also applies to: 45-45


54-54: Decoding function update without error handling.

Migration from llama_decode to llama_decode_ext is correct. Note that as discussed in previous comments, error handling will be implemented in a future update.

Also applies to: 124-124


108-108: Consistent batch initialization pattern.

The pattern for initializing the second batch with llama_batch_ext_ptr is consistent with the earlier instance, maintaining the same approach throughout the file.


119-119: Position tracking handled correctly in token addition.

The token position is properly incremented with i_current_token++ when adding tokens to the batch, which is crucial for maintaining correct context in the sequence.


126-126: Method-based access for batch properties.

Changed from direct member access via pointer (batch->n_tokens) to method call (batch.n_tokens()), which is appropriate for the object-oriented approach of the new API.

common/speculative.cpp (7)

17-17: Good migration to smart pointer batch implementation.

Changed from raw llama_batch to llama_batch_ext_ptr for improved memory management, consistent with the broader API migration.


26-26: Proper initialization of batch extension structure.

The initialization syntax using llama_batch_ext_ptr(llama_batch_ext_init(...)) correctly wraps the C-style allocation in a smart pointer for automatic cleanup.


65-73: Removed manual batch cleanup.

The manual call to llama_batch_free(spec->batch) has been removed since the smart pointer llama_batch_ext_ptr now handles cleanup automatically through its destructor.


207-207: Consistently migrated batch clearing operations.

All instances of common_batch_clear(batch) have been properly converted to the object method batch.clear().

Also applies to: 227-227, 240-240


211-211: Position tracking correctly maintained in token addition.

The positions used in batch.add_text() calls are correctly incremented throughout the code:

  • Line 211: Uses relative position i - i_start
  • Line 228: Uses n_past after processing the prompt
  • Line 267: Uses n_past + i + 1 in the draft generation loop, properly incrementing positions

This ensures tokens maintain their correct sequential positions, which is critical for proper context handling.

Also applies to: 228-228, 267-267


217-217: Method-based access for token count.

Changed from direct member access (batch.n_tokens) to method call (batch.n_tokens()), consistent with the object-oriented approach of the new API.


220-220: Consistent decoding function migration.

All instances of llama_decode have been properly updated to llama_decode_ext with the appropriate .get() accessor to obtain the raw pointer from the smart pointer wrapper.

Also applies to: 234-234, 270-270

examples/server/server.cpp (21)

1227-1227: Data type migration from llama_batch to smart pointer

The type for batch_spec member has been changed from llama_batch to llama_batch_ext_ptr, which is a smart pointer to the extended batch structure. This is part of the broader migration to the new batch API.


1799-1799: Data type migration for server context batch

Similar to the slot change, the batch member in server_context has been updated to use the new llama_batch_ext_ptr type, following the migration to the extended batch API.


1925-1926: Updated batch initialization using extended API

The initialization approach has changed from direct struct initialization to using the llama_batch_ext_init function that returns a managed pointer to the batch structure. This provides better memory management and aligns with the new batch API.


1961-1962: Proper batch initialization with appropriate sizing

Good implementation of batch initialization using the extended API. The code correctly passes the maximum of n_batch and params_base.n_parallel to ensure sufficient capacity, along with a value of 1 for n_seq_max since only a single sequence ID per token is needed here.


2096-2097: Consistent allocation pattern for speculative decoding

The speculative batch allocation consistently follows the same pattern as other batch allocations, using llama_batch_ext_init with appropriate sizing parameters.


2404-2414: Method signature update for send_embedding

The method signature has been updated to remove the batch parameter since the embedding data is now accessed directly from the batch associated with the slot.


2454-2469: Method signature and implementation update for send_rerank

Similar to send_embedding, this method has been updated to remove the batch parameter and use sequence IDs from the slot, which is consistent with the new batch API approach.


2853-2854: Using clear() method instead of deprecated common_batch_clear

The code now properly uses the batch object's clear method instead of the deprecated common_batch_clear function, consistent with the migration to the extended batch API.


2875-2878: Position handling in token processing

The implementation correctly maintains the token position within the batch and adds tokens with their appropriate positions and sequence IDs.


2881-2887: Proper position management for speculative batch

The code correctly initializes and adds tokens to the speculative batch with their appropriate positions, maintaining the context integrity needed for proper inference.


3080-3085: Improved token addition with position tracking

The token addition loop now correctly uses the slot's tracked position and sequence ID when adding tokens to the batch, ensuring proper position handling.


3094-3096: Updated logging message with batch metrics

The logging message has been updated to reference the batch token count using the new access pattern batch.n_tokens() instead of direct field access, consistent with the new API.


3111-3115: Using extended batch API for output handling

The code now uses llama_batch_ext_set_output_last to mark the last token in the batch as output, and correctly tracks the position within the batch, which is essential for proper inference.


3141-3147: Proper batch view management for processing

The implementation now correctly creates views into the batch using llama_batch_ext_get_view for processing chunks of tokens within the decode loop, allowing efficient handling of large batches.


2894-2894: Consistent check for batch processing condition

The condition now uses batch.n_tokens() to check the batch size, which is consistent with the extended batch API. This ensures continuation of batch processing when appropriate.


3126-3126: Batch size check using the new API

The code correctly uses batch.n_tokens() to check if there are tokens to decode, consistent with the new batch API. This ensures processing only happens when there are tokens available.


3131-3131: Updated logging with batch size

The debugging log now correctly uses batch.n_tokens() to report the number of tokens in the batch, which is consistent with the extended batch API.


3281-3286: Speculative batch preparation with proper positioning

The speculative batch is correctly prepared by clearing the previous batch and adding tokens with their appropriate positions and sequence IDs, which is essential for accurate speculative decoding.


3288-3288: Updated debugging log for speculative batch

The speculative batch debugging log correctly uses batch_spec.n_tokens() to report the size of the batch, consistent with the extended batch API.


3147-3147: Proper batch processing using extended API

The code now correctly uses llama_decode_ext with the batch view, ensuring efficient processing of tokens in manageable chunks.


3290-3290: Direct use of speculative batch with extended API

The speculative decoding now directly uses llama_decode_ext with the speculative batch, which is consistent with the migration to the extended batch API.

examples/simple-chat/simple-chat.cpp (6)

2-2: Include extended batch methods from llama-cpp.h.

This new include header is necessary to access llama_batch_ext functionality and related APIs.


112-114: Initialize and track n_past properly.

Storing the initial n_past = 0 and updating it after creating the batch is correct. This ensures the position is tracked consistently throughout the process.


121-121: Check the updated condition with batch.n_tokens().

Switching to batch.n_tokens() ensures accurate token count retrieval. This is an appropriate adjustment matching the new API.


127-127: Switch to extended decode function.

Replacing llama_decode with llama_decode_ext aligns with the new extended batch system. The existing error handling (abort on failure) remains consistent with the current style.


151-153: Clear and add the single sampled token.

Re-initializing the batch with clear() and adding only the new token ensures correct incremental token feeding. Incrementing n_past each time looks correct.


156-156: Free the extended batch.

Calling llama_batch_ext_free(batch.get()) finalizes resource cleanup properly. Ensure no subsequent usage of batch occurs after this point.

examples/parallel/parallel.cpp (13)

9-9: Include llama-cpp.h for extended batch API.

This header is needed for the new llama_batch_ext and llama_decode_ext functions.


178-178: Initialize extended batch with single sequence capacity.

Creating a batch with llama_batch_ext_init(n_ctx, 1) is valid for the system prompt. Verify that 1 is sufficient, given that you replicate the system prompt to multiple client sequences.


220-220: Clear the batch before adding new tokens.

Invoking batch.clear() is the correct approach to reset batch state between iterations.


228-228: Store the current batch token count.

Storing batch.n_tokens() into client.i_batch helps reference the correct position for subsequent decoding logic.


230-231: Add new token to the batch with the correct position.

Using n_tokens_system + client.n_prompt + client.n_decoded for the position and setting the sequence ID to client.id + 1 maintains proper offset and ownership of tokens.


236-236: Handle empty batches by resetting KV cache.

Checking if (batch.n_tokens() == 0) to clear the KV cache ensures stale data does not linger.


248-248: Allow new sequences when empty or in continuous batching mode.

This condition ensures new requests can be inserted seamlessly and prevents idle cycles.


267-268: Add initial prompt tokens for each client sequence.

Using seq_id = client.id + 1 and adding tokens with batch.add_text properly differentiates each client's sequence from the system prompt.


272-272: Set output extraction for the last token only if batch not empty.

This helps collect logits for the final token in the batch, preventing unnecessary overhead when the batch is empty.


279-279: Update client batch index.

Storing batch.n_tokens() - 1 aligns with the zero-based index for the token just added.


299-300: Retrieve batch size and iterate in chunks.

Retrieving n_tokens_in_batch = batch.n_tokens() and looping over it with i += n_batch correctly processes the tokens in sub-batches.


308-308: Compute the current chunk size safely.

std::min ensures we only process up to the remaining tokens in the batch.


310-312: Create and free a batch view for partial decoding.

Obtaining a subrange via llama_batch_ext_get_view and then decoding ensures incremental processing. Freeing it immediately avoids memory leaks.

examples/speculative-simple/speculative-simple.cpp (6)

7-7: Include llama-cpp.h for extended decoding.

Necessary for llama_batch_ext_ptr and the extended decode functions.


117-118: Initialize batch from prompt input, then decode once.

Combining init_from_text and an immediate llama_decode_ext call is a straightforward approach for processing the initial tokens.


137-137: Create a secondary extended batch for final decoding.

Instantiating batch_tgt with llama_batch_ext_init(llama_n_batch(ctx_tgt), 1) is consistent with the approach in other files. Verify if 1 max sequence is enough for your usage.


156-157: Clear and add the last token for target decoding.

Resetting with clear() and then adding id_last at n_past++ ensures a consistent continuation point.


167-167: Append draft tokens incrementally.

Appending each draft token at (n_past + i) is aligned with the target position tracking.


172-172: Decode the combined target batch.

No error handling here aligns with your decision to keep the example code simpler for now.

examples/run/run.cpp (5)

643-643: Added position tracking with n_past.

Good addition of the n_past member variable to track token positions across multiple decode operations.


954-957: Updated function signature to use extended batch.

The function has been properly updated to accept the llama_batch_ext_ptr type and use its n_tokens() method instead of the previous n_tokens member.


995-995: Updated batch initialization with position tracking.

Good update to use the extended batch initialization with proper position tracking via llama_data.n_past.


1004-1005: Added position update tracking.

Good addition to increment n_past by the number of tokens in the batch, which ensures accurate position tracking for subsequent operations.


1020-1020: Updated next token batch initialization.

Correctly using the extended batch initialization with the updated n_past value to maintain proper token positions.

examples/speculative/speculative.cpp (6)

6-6: Added include for extended batch functionality.

Good addition of the llama-cpp.h header which provides the extended batch functionality.


205-206: Updated batch initialization to use extended API.

Good replacement of the old batch initialization with the extended batch initialization using proper context-specific batch sizes.


477-481: Created well-structured batch info.

Good addition of the batch_info structure to organize token ID, position, and sequence ID data, which makes the code more maintainable.


483-485: Initialized batch data with proper position tracking.

Good use of the new structure to properly track token positions with the first token's position set to n_past_tgt.


489-489: Updated batch clear operation.

Correctly using the member function clear() instead of the previous function call.


600-605: Properly cleared and prepared the target batch.

Good update to use the new batch API for clearing and adding text with full position and sequence ID support.
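
Putting the pieces from this file together, the bookkeeping-plus-replay pattern looks roughly like the sketch below. Field names (id, pos, seq_id) and the loop shape follow the hunks quoted in this review; the member types and the surrounding variables (id_last, n_past_tgt, batch_tgt as a llama_batch_ext_ptr) are assumed for illustration, and <vector> is required.

struct batch_info {
    llama_token  id;
    llama_pos    pos;
    llama_seq_id seq_id;
};

std::vector<batch_info> batch_tgt_data;
batch_tgt_data.push_back({ id_last, n_past_tgt, 0 });  // first target token
// ... push additional draft tokens with their positions and sequence ids ...

batch_tgt.clear();
for (const auto & data : batch_tgt_data) {
    // the return value should be checked, per the comments above
    batch_tgt.add_text(data.id, data.pos, data.seq_id, true);
}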

examples/perplexity/perplexity.cpp (1)

505-579: Clarify position logic and ensure correctness in output toggling.
The code conditionally sets output = pos >= first, which increments n_outputs in line 572 if true. Confirm that first is correctly updated prior to each iteration—any off-by-one or stale value could lead to missing or extra outputs in logits.

examples/save-load-state/save-load-state.cpp (3)

51-55: Confirm that n_past aligns with new tokens.
After calling batch.n_tokens(), you increment n_past by that value. Ensure that all tokens in batch are evaluated, especially if you set output_last or skip any positions.


134-139: Check matching contexts for ctx2 vs. original.
Here, you decode in ctx2 but still reuse the same batch. Make sure the batch contents (like positions) truly match the new context’s state.


215-221: Validate sequence ID usage.
You add tokens with seq_id = 1. Confirm that ctx3 accepts a second sequence ID in its valid range (e.g., that the context is configured for at least two parallel sequences).

examples/llava/gemma3-cli.cpp (2)

8-8: Include guards synergy and forward declarations.
Including "llama-cpp.h" is correct for usage, but monitor if multiple llama-related headers cause redefinition or macro conflicts. Also, ensure the new llama_batch_ext_ptr batch is forward-declared if possible to reduce compile times.

Also applies to: 67-67, 77-77


93-101: Re-check empty prompts for eval_text.
The batch is always cleared before tokens are added. If the input is empty, no tokens are queued and output_last is never set; confirm that this case is handled downstream.

Comment on lines +2413 to 2448
const llama_seq_id seq_id = slot.id;

for (int i = 0; i < batch.n_tokens; ++i) {
if (!batch.logits[i] || batch.seq_id[i][0] != slot.id) {
continue;
}
std::vector<float> embd_res(n_embd, 0.0f);

const float * embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
if (embd == NULL) {
embd = llama_get_embeddings_ith(ctx, i);
}
if (llama_pooling_type(slot.ctx) != LLAMA_POOLING_TYPE_NONE) {
const float * embd = llama_get_embeddings_seq(ctx, seq_id);

if (embd == NULL) {
SLT_ERR(slot, "failed to get embeddings, token = %d, seq_id = %d\n", batch.token[i], batch.seq_id[i][0]);
SLT_ERR(slot, "failed to get sequence embeddings, seq_id = %d\n", seq_id);

res->embedding.push_back(std::vector<float>(n_embd, 0.0f));
continue;
}

// normalize only when there is pooling
// TODO: configurable
if (llama_pooling_type(slot.ctx) != LLAMA_POOLING_TYPE_NONE) {
common_embd_normalize(embd, embd_res.data(), n_embd, 2);
res->embedding.push_back(embd_res);
} else {
res->embedding.push_back({ embd, embd + n_embd });
}
common_embd_normalize(embd, embd_res.data(), n_embd, 2);
res->embedding.push_back(embd_res);
} else {
GGML_ABORT("embeddings without pooling is not supported yet");
//for (int i = 0; i < batch.n_tokens(); ++i) {
// auto tok = batch.tokens[i];
// if (!tok.logits || tok.seq_id != slot.id) {
// continue;
// }

// const float * embd = llama_get_embeddings_ith(ctx, tok.seq_id);
// if (embd == NULL) {
// SLT_ERR(slot, "failed to get embeddings, token = %d, seq_id = %d\n", tok.token, tok.seq_id);

// res->embedding.push_back(std::vector<float>(n_embd, 0.0f));
// continue;
// }

// res->embedding.push_back({ embd, embd + n_embd });
//}
}


⚠️ Potential issue

Embeddings implementation with sequence ID

The implementation now correctly uses sequence IDs from the slot instead of iterating through batch tokens. Note that the commented-out code (currently not used) contains a limitation warning that embeddings without pooling are not yet supported.

Consider implementing support for embeddings without pooling to make the feature complete. The commented code shows the intended approach but is not yet functional.
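
Stripped of the diff noise, the pooled path in the hunk above amounts to roughly the following (slot, ctx, n_embd and res are the surrounding server.cpp variables; normalization mode 2 is taken from the snippet):

const llama_seq_id seq_id = slot.id;

if (llama_pooling_type(slot.ctx) != LLAMA_POOLING_TYPE_NONE) {
    const float * embd = llama_get_embeddings_seq(ctx, seq_id);
    if (embd == NULL) {
        SLT_ERR(slot, "failed to get sequence embeddings, seq_id = %d\n", seq_id);
        res->embedding.push_back(std::vector<float>(n_embd, 0.0f));
    } else {
        std::vector<float> embd_res(n_embd, 0.0f);
        common_embd_normalize(embd, embd_res.data(), n_embd, 2);  // normalize only when pooled
        res->embedding.push_back(embd_res);
    }
} else {
    GGML_ABORT("embeddings without pooling is not supported yet");
}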

Comment on lines 150 to 151
llama_batch_ext_ptr batch_img(llama_batch_ext_init_from_embd(image_embd_v.data(), n_tokens, n_embd, ctx.n_past, 0));
if (llama_decode_ext(ctx.lctx, batch_img.get())) {

⚠️ Potential issue

Ensure embedding dimension consistency.
llama_batch_ext_init_from_embd(image_embd_v.data(), n_tokens, n_embd, ...)—validate that n_embd matches the model’s dimension. A mismatch can lead to undefined behavior.
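
A hedged sketch of such a check; llama_n_embd(model) is assumed to be the accessor for the model's embedding dimension, so adjust to whatever accessor the project actually exposes:

const int n_embd_model = llama_n_embd(model);  // assumption: model-dimension accessor
GGML_ASSERT(n_embd == n_embd_model && "image embedding dimension must match the model");
llama_batch_ext_ptr batch_img(llama_batch_ext_init_from_embd(image_embd_v.data(), n_tokens, n_embd, ctx.n_past, 0));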

Comment on lines +181 to +183
ctx.batch.clear();
ctx.batch.add_text(token_id, ctx.n_past++, 0, true);
if (llama_decode_ext(ctx.lctx, ctx.batch.get())) {

🛠️ Refactor suggestion

Check concurrency for shared ctx.batch.
If generate_response is called concurrently with other code, ensure ctx.batch usage is thread-safe. Currently, multiple updates to the same batch from different threads might cause data races.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
include/llama-cpp.h (3)

86-93: 🛠️ Refactor suggestion

Check the return value from token addition.

Similar to the init_from_text method, this function doesn't handle potential errors returned by llama_batch_ext_add_text.

-        int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
-        if (output_last) {
+        int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
+        if (output_id < 0) {
+            return output_id;
+        }
+        if (output_last) {

95-102: 🛠️ Refactor suggestion

Add error handling for single sequence ID token addition.

The same error handling issue exists in this overload of the add_text method.

-        int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
-        if (output_last) {
+        int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
+        if (output_id < 0) {
+            return output_id;
+        }
+        if (output_last) {

52-55: 💡 Verification agent

❓ Verification inconclusive

Verify error handling for llama_batch_ext_add_text.

The function doesn't check the return value of llama_batch_ext_add_text, which can return negative values to indicate errors. If any token addition fails, this will silently continue with a corrupted batch.

Add error checking to handle potential failures:

-    llama_batch_ext * batch = llama_batch_ext_init(ctx);
-    for (int32_t i = 0; i < n_tokens; i++) {
-        llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
-    }
+    llama_batch_ext * batch = llama_batch_ext_init(ctx);
+    for (int32_t i = 0; i < n_tokens; i++) {
+        int32_t rc = llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+        if (rc < 0) {
+            llama_batch_ext_free(batch);
+            return llama_batch_ext_ptr(nullptr);
+        }
+    }

ACTION: Implement Robust Error Handling for Token Addition

The review comment’s concern is valid. Analysis indicates that the function llama_batch_ext_add_text can indeed return negative values to signal an error, yet its return value is not being checked in the current loop. Failing to check for errors may result in silently propagating a corrupted batch to later processing stages.

Suggested changes:

  • Add Error Checking: Capture the return code from llama_batch_ext_add_text and check if it indicates an error (i.e. a negative value).
  • Clean Up on Failure: If an error is detected, free the batch using llama_batch_ext_free(batch) to prevent memory leaks.
  • Propagate Error: Return an appropriate error (here, using llama_batch_ext_ptr(nullptr)) to signal the failure upstream.

The refactored code snippet in include/llama-cpp.h (lines 52-55) should be modified as follows:

-    llama_batch_ext * batch = llama_batch_ext_init(ctx);
-    for (int32_t i = 0; i < n_tokens; i++) {
-        llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
-    }
+    llama_batch_ext * batch = llama_batch_ext_init(ctx);
+    for (int32_t i = 0; i < n_tokens; i++) {
+        int32_t rc = llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+        if (rc < 0) {
+            llama_batch_ext_free(batch);
+            return llama_batch_ext_ptr(nullptr);
+        }
+    }

Please ensure that the chosen error propagation method (llama_batch_ext_ptr(nullptr)) aligns with overall error handling patterns elsewhere in the codebase.

examples/speculative/speculative.cpp (1)

169-174: 🛠️ Refactor suggestion

Check for Minimum Input Size & Return Codes
Using n_input - 1 and borrowing inp.back() may lead to out-of-bounds errors if inp.size() is less than 2. Additionally, the return codes from llama_decode_ext() are not checked, which can silence decoding failures. Consider adding range checks and capturing the return code.

 if (inp.size() < 2) {
+    LOG_ERR("Not enough tokens (need at least 2).");
+    return 1;
 }

 int ret0 = llama_decode_ext(ctx_tgt, batch0.get());
+if (ret0 != 0) { /* handle error */ }
src/llama-batch.cpp (1)

456-475: ⚠️ Potential issue

Add boundary checks for offset and n_tokens.
If offset + n_tokens goes beyond the original batch’s capacity, pointers to out-of-range elements will be returned, risking undefined behavior. A simple check can prevent this.

Apply this diff to enforce a boundary check:

+int32_t total_tokens = batch->n_tokens;
+if (offset < 0 || n_tokens < 0 || offset + n_tokens > total_tokens) {
+    return nullptr; // or handle error
+}
🧹 Nitpick comments (27)
src/llama-model.cpp (2)

12060-12062: Ternary Operator for Token Position Calculation

The function returns 4 positions per token if the architecture is LLM_ARCH_QWEN2VL, otherwise it defaults to 1. This inline conditional is concise and works for the current two-case scenario; however, if additional architectures begin to require differentiated handling in the future, a more extensible construct (such as a switch-case) might improve clarity and maintainability.
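For illustration, a switch-based version that keeps the current two-case behavior might look like this:

uint32_t llama_n_pos_per_token(llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_QWEN2VL:
            return 4; // this architecture encodes 4 position values per token
        default:
            return 1;
    }
}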


12064-12066: Delegation with Pointer Parameter

This helper function delegates to the previously defined version by extracting model->arch. While this approach is straightforward, consider whether a null-check (or an assertion) should be added for model to prevent potential NULL pointer dereferences if there’s any chance the function might be invoked with a NULL model. Documenting the precondition would also be helpful.

src/llama-model.h (1)

405-405: Function looks good but needs a docstring

The new function llama_n_pos_per_token declaration correctly accepts an llm_arch parameter and returns a uint32_t, matching the implementation seen in the relevant code snippets. Consider adding a docstring to explain its purpose and return value.

+/**
+ * Returns the number of positions per token for a given architecture.
+ * 
+ * @param arch The model architecture
+ * @return 4 for LLM_ARCH_QWEN2VL, 1 for all other architectures
+ */
 uint32_t llama_n_pos_per_token(llm_arch arch);
examples/llama.swiftui/llama.cpp.swift/LibLlama.swift (1)

199-199: Remove trailing semicolon
Swift code typically avoids trailing semicolons.

- let t_pp_start = DispatchTime.now().uptimeNanoseconds / 1000;
+ let t_pp_start = DispatchTime.now().uptimeNanoseconds / 1000
🧰 Tools
🪛 SwiftLint (0.57.0)

[Warning] 199-199: Lines should not have trailing semicolons

(trailing_semicolon)

examples/lookahead/lookahead.cpp (1)

263-266: Partial error handling for decode calls.
Here, you check the result of llama_decode_ext. If you decide to unify error checks across all decode invocations (e.g., lines 97-98), I can help with that refactor later.

examples/batched/batched.cpp (1)

158-158: Stream index initialization

Storing the last token's batch index is a neat approach. Consider verifying that the batch isn’t empty (to avoid -1 if the prompt is missing).
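A hedged sketch of such a guard, assuming an assertion is an acceptable failure mode at this point in the example:

// fail fast on an empty prompt so the stored index can never be -1
GGML_ASSERT(llama_batch_ext_get_n_tokens(batch) > 0 && "prompt must not be empty");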

examples/speculative/speculative.cpp (5)

447-449: Return Code from llama_decode_ext Not Checked
The call to llama_decode_ext(ctx_dft, batch_dft.get()) lacks error checking. Ignoring an error code may lead to silent failures.

-llama_decode_ext(ctx_dft, batch_dft.get());
+// Optionally capture and analyze return code
+int decode_ret = llama_decode_ext(ctx_dft, batch_dft.get());
+if (decode_ret != 0) {
+    LOG_ERR("llama_decode_ext failed with code %d\n", decode_ret);
+    // handle or abort
+}

570-572: Conditional Drafting Logic
After adding the token to batch_dft, you conditionally disable drafting if the combined data size exceeds n_draft. This is logical for controlling branching. Ensure all sequences are properly cleaned up if truncated.


584-584: Decode Return Code Not Captured
Again, llama_decode_ext(ctx_dft, batch_dft.get()) return value is ignored. Consistently handle potential errors.


600-605: Batch Population Loop
Clearing batch_tgt and then populating it from batch_tgt_data is coherent. However, llama_batch_ext_add_text(...) may return an error. Consider capturing and logging potential failures.


608-608: Decode Return Code Not Captured
As with previous calls, please consider checking the return value from llama_decode_ext(ctx_tgt, batch_tgt.get()).

examples/llama.android/llama/src/main/cpp/llama-android.cpp (4)

194-195: Unverified Return
llama_batch_ext_add_text(batch, 0, i, &seq_id, 1, false) may fail if the batch exceeds capacity. Consider checking and logging its return code.


214-218: Prompt for Generation
Clearing the batch and adding tokens for text generation is correct. However, each llama_batch_ext_add_text(...) call can return an error code. Check to avoid silent failures.


356-357: Check for llama_batch_ext_add_text Failures
Though unlikely, it’s safer to check the return code in case of size or parameter issues.


413-415: Adding Next Token
Clearing and then adding the newly sampled token is correct. Again, keep a note on error checking if batch is full.

examples/save-load-state/save-load-state.cpp (1)

54-55: Decode Return Code & Token Count
Line 54 ignores the return of llama_decode_ext. Consider checking it. The token count update on line 55 is correct.

src/llama-batch.cpp (4)

193-275: Validate n_seq_id boundaries and usage assumptions.
This method sorts and groups tokens by sequence IDs. If batch.n_seq_id or batch.seq_id have unexpected values (e.g., negative or null), it could break the logic. Consider adding sanity checks or clarifying that upstream code ensures valid ranges.
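If checks are preferred over relying on upstream validation, a minimal sketch (assuming assertion-style failure is acceptable in this file) could be:

for (int32_t i = 0; i < batch.n_tokens; ++i) {
    GGML_ASSERT(batch.n_seq_id[i] > 0        && "every token needs at least one sequence id");
    GGML_ASSERT(batch.seq_id[i]   != nullptr && "seq_id pointers must be set when n_seq_id > 0");
}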


341-371: Consider adding range checks for allocated resources.
The function allocates memory for arrays based on n_tokens_alloc. If a caller passes very large or negative values, it could lead to allocation issues. However, it appears consistent with your codebase rules on ignoring null checks.


399-422: Check if n_seq_ids exceeds allocation.
The loop writes up to n_seq_ids entries into batch->seq_id[...][j]. If the caller passes a larger n_seq_ids than allocated, it could cause out-of-bounds writes. Consider bounding this or documenting the requirement.


424-441: Linear search approach is acceptable; watch performance on large batches.
This function performs a nested loop to set output flags, which may be slow if batch->n_tokens is large. It might be worth optimizing or clarifying performance trade-offs.

examples/batched-bench/batched-bench.cpp (2)

64-82: Helper function uses batch views correctly; watch for offset range.
llama_batch_ext_get_view can fail if offsets are out of range. Currently, the loop bounds keep it safe. If usage expands, validate offset usage.


88-88: Return code from llama_batch_ext_add_text is ignored.
If the batch is full or has embeddings, the function can fail and return negative codes. Consider checking the return value.

examples/llava/gemma3-cli.cpp (1)

95-95: Return value of add_text is not checked.
The function can fail if the batch is at capacity. Handling that return code may prevent silent errors.

include/llama.h (4)

265-266: Duplicate documentation for a deprecated struct.
This doc block may cause confusion since the struct is already marked as deprecated. Consider consolidating or removing it to maintain clarity.


958-967: Consider setting clear boundaries and fallback conditions.
The function’s documentation does not specify the behavior for out-of-range pos or invalid seq_id. A guard or fallback could improve robustness.


974-980: Minor wording improvement.
Correct “Return returned batch” to “The returned batch” for clarity in the documentation.


1006-1010: Document return codes for llama_decode_ext.
Since the existing decode function describes possible return values, replicate that detail here to ensure consistency and clarity for users.
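For example, the header comment could mirror the return-code list documented for the existing decode function; the exact wording below is a suggestion, not the current text:

//   0 - success
//   1 - could not find a KV slot for the batch (try reducing the batch size or increasing the context)
// < 0 - error
// a positive return value is a warning rather than a fatal error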

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9efdbb and d18a79e.

📒 Files selected for processing (42)
  • common/common.cpp (2 hunks)
  • common/speculative.cpp (4 hunks)
  • examples/batched-bench/batched-bench.cpp (5 hunks)
  • examples/batched/batched.cpp (6 hunks)
  • examples/cvector-generator/cvector-generator.cpp (1 hunks)
  • examples/embedding/embedding.cpp (5 hunks)
  • examples/eval-callback/eval-callback.cpp (1 hunks)
  • examples/gritlm/gritlm.cpp (5 hunks)
  • examples/imatrix/imatrix.cpp (3 hunks)
  • examples/infill/infill.cpp (1 hunks)
  • examples/llama-bench/llama-bench.cpp (5 hunks)
  • examples/llama.android/llama/src/main/cpp/llama-android.cpp (9 hunks)
  • examples/llama.android/llama/src/main/java/android/llama/cpp/LLamaAndroid.kt (2 hunks)
  • examples/llama.swiftui/llama.cpp.swift/LibLlama.swift (8 hunks)
  • examples/llava/gemma3-cli.cpp (6 hunks)
  • examples/llava/llava-cli.cpp (1 hunks)
  • examples/llava/llava.cpp (2 hunks)
  • examples/llava/minicpmv-cli.cpp (1 hunks)
  • examples/llava/qwen2vl-cli.cpp (2 hunks)
  • examples/lookahead/lookahead.cpp (6 hunks)
  • examples/lookup/lookup.cpp (5 hunks)
  • examples/main/main.cpp (2 hunks)
  • examples/parallel/parallel.cpp (7 hunks)
  • examples/passkey/passkey.cpp (6 hunks)
  • examples/perplexity/perplexity.cpp (19 hunks)
  • examples/retrieval/retrieval.cpp (7 hunks)
  • examples/run/run.cpp (4 hunks)
  • examples/save-load-state/save-load-state.cpp (5 hunks)
  • examples/server/server.cpp (21 hunks)
  • examples/simple-chat/simple-chat.cpp (3 hunks)
  • examples/simple/simple.cpp (3 hunks)
  • examples/speculative-simple/speculative-simple.cpp (5 hunks)
  • examples/speculative/speculative.cpp (8 hunks)
  • examples/tts/tts.cpp (5 hunks)
  • include/llama-cpp.h (2 hunks)
  • include/llama.h (6 hunks)
  • src/llama-batch.cpp (5 hunks)
  • src/llama-context.cpp (4 hunks)
  • src/llama-graph.cpp (3 hunks)
  • src/llama-graph.h (2 hunks)
  • src/llama-model.cpp (1 hunks)
  • src/llama-model.h (1 hunks)
🔥 Files not summarized due to errors (5)
  • examples/llava/llava-cli.cpp: Error: Disallowed special token found: <|im_end|>
  • examples/llava/minicpmv-cli.cpp: Error: Disallowed special token found: <|im_start|>
  • examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
  • examples/tts/tts.cpp: Error: Disallowed special token found: <|im_start|>
  • src/llama-model.cpp: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (13)
  • examples/eval-callback/eval-callback.cpp
  • examples/cvector-generator/cvector-generator.cpp
  • examples/llava/llava.cpp
  • examples/infill/infill.cpp
  • examples/gritlm/gritlm.cpp
  • src/llama-graph.h
  • examples/main/main.cpp
  • examples/parallel/parallel.cpp
  • examples/speculative-simple/speculative-simple.cpp
  • examples/passkey/passkey.cpp
  • src/llama-context.cpp
  • examples/lookup/lookup.cpp
  • examples/simple/simple.cpp
🧰 Additional context used
🧠 Learnings (3)
examples/imatrix/imatrix.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/imatrix/imatrix.cpp:500-500
Timestamp: 2025-03-21T10:48:55.882Z
Learning: The function llama_batch_ext_init never returns nullptr in the llama.cpp codebase, so null checks after calling this function are unnecessary.
examples/lookahead/lookahead.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-21T10:48:55.882Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
src/llama-batch.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#14
File: src/llama-batch.cpp:356-386
Timestamp: 2025-03-21T10:48:55.882Z
Learning: Null checks after malloc calls are not required in the llama.cpp codebase, as the project follows a design pattern where memory allocation failures are handled differently or allowed to propagate.
🧬 Code Definitions (23)
examples/llava/qwen2vl-cli.cpp (3)
include/llama-cpp.h (7)
  • pos (79-79)
  • batch (29-29)
  • batch (29-29)
  • token (87-93)
  • token (87-87)
  • token (96-102)
  • token (96-96)
include/llama.h (2)
  • llama_decode_ext (1008-1010)
  • llama_batch_ext_set_output_last (972-972)
src/llama-batch.cpp (2)
  • llama_batch_ext_set_output_last (443-450)
  • llama_batch_ext_set_output_last (443-443)
examples/tts/tts.cpp (3)
include/llama-cpp.h (2)
  • batch (29-29)
  • batch (29-29)
include/llama.h (2)
  • llama_batch_ext_set_output_last (972-972)
  • llama_decode_ext (1008-1010)
src/llama-batch.cpp (2)
  • llama_batch_ext_set_output_last (443-450)
  • llama_batch_ext_set_output_last (443-443)
src/llama-model.cpp (2)
include/llama.h (1)
  • llama_n_pos_per_token (486-486)
src/llama-model.h (1)
  • llama_n_pos_per_token (405-405)
src/llama-graph.cpp (2)
src/llama-model.cpp (4)
  • llama_n_pos_per_token (12060-12062)
  • llama_n_pos_per_token (12060-12060)
  • llama_n_pos_per_token (12064-12066)
  • llama_n_pos_per_token (12064-12064)
src/llama-model.h (1)
  • llama_n_pos_per_token (405-405)
common/speculative.cpp (7)
include/llama-cpp.h (11)
  • llama_batch_ext_ptr (38-38)
  • llama_batch_ext_ptr (39-39)
  • llama_batch_ext_ptr (40-40)
  • batch (29-29)
  • batch (29-29)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
examples/server/server.cpp (10)
  • ctx_dft (1349-1351)
  • ctx (218-469)
  • ctx (218-221)
  • ctx (2044-2054)
  • ctx (2044-2044)
  • params (1840-1914)
  • params (1840-1840)
  • id (1433-1453)
  • id (1972-1980)
  • id (1972-1972)
src/llama-batch.h (1)
  • batch (99-99)
common/sampling.cpp (6)
  • i (64-69)
  • i (64-64)
  • ctx (114-129)
  • ctx (114-114)
  • common_sampler_reset (310-314)
  • common_sampler_reset (310-310)
include/llama.h (1)
  • llama_decode_ext (1008-1010)
src/llama-context.cpp (2)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
src/llama-kv-cache.h (2)
  • id (36-38)
  • id (36-36)
examples/batched/batched.cpp (1)
include/llama.h (6)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_add_text (950-956)
  • llama_batch_ext_get_n_tokens (943-943)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_set_output_last (972-972)
  • llama_batch_ext_free (986-986)
examples/simple-chat/simple-chat.cpp (4)
include/llama-cpp.h (8)
  • batch (29-29)
  • batch (29-29)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
src/llama-batch.h (1)
  • batch (99-99)
src/llama-context.cpp (8)
  • n_ctx (429-431)
  • n_ctx (429-429)
  • llama_n_ctx (2357-2359)
  • llama_n_ctx (2357-2357)
  • llama_kv_self_used_cells (2539-2541)
  • llama_kv_self_used_cells (2539-2539)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
src/llama-batch.cpp (2)
  • llama_batch_ext_free (477-493)
  • llama_batch_ext_free (477-477)
examples/llama.android/llama/src/main/cpp/llama-android.cpp (3)
include/llama.h (6)
  • llama_init_from_model (447-449)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_add_text (950-956)
  • llama_batch_ext_set_output_last (972-972)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_free (986-986)
src/llama-context.cpp (2)
  • llama_init_from_model (2303-2344)
  • llama_init_from_model (2303-2305)
src/llama-batch.cpp (10)
  • llama_batch_ext_clear (452-454)
  • llama_batch_ext_clear (452-452)
  • llama_batch_ext_add_text (399-422)
  • llama_batch_ext_add_text (399-405)
  • llama_batch_ext_set_output_last (443-450)
  • llama_batch_ext_set_output_last (443-443)
  • llama_batch_ext_init (373-375)
  • llama_batch_ext_init (373-373)
  • llama_batch_ext_free (477-493)
  • llama_batch_ext_free (477-477)
examples/save-load-state/save-load-state.cpp (5)
include/llama-cpp.h (8)
  • batch (29-29)
  • batch (29-29)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
src/llama-batch.h (1)
  • batch (99-99)
include/llama.h (2)
  • llama_decode_ext (1008-1010)
  • llama_batch_ext_free (986-986)
src/llama-context.cpp (2)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
src/llama-batch.cpp (2)
  • llama_batch_ext_free (477-493)
  • llama_batch_ext_free (477-477)
examples/retrieval/retrieval.cpp (1)
src/llama-batch.h (1)
  • batch (99-99)
include/llama-cpp.h (1)
include/llama.h (6)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_add_text (950-956)
  • llama_batch_ext_set_output_last (972-972)
  • llama_batch_ext_init_from_embd (934-940)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_get_n_tokens (943-943)
examples/imatrix/imatrix.cpp (1)
include/llama.h (5)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_add_text (950-956)
  • llama_decode_ext (1008-1010)
  • llama_batch_ext_free (986-986)
examples/run/run.cpp (4)
include/llama-cpp.h (8)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
  • batch (29-29)
  • batch (29-29)
src/llama-batch.h (1)
  • batch (99-99)
src/llama-context.cpp (4)
  • n_ctx (429-431)
  • n_ctx (429-429)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
include/llama.h (1)
  • llama_decode_ext (1008-1010)
examples/perplexity/perplexity.cpp (5)
src/llama-context.cpp (4)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
  • llama_get_logits (2432-2436)
  • llama_get_logits (2432-2432)
src/llama-batch.h (1)
  • batch (99-99)
src/llama-context.h (4)
  • i (49-49)
  • i (52-52)
  • ctx (157-161)
  • data (78-83)
include/llama.h (3)
  • llama_decode_ext (1008-1010)
  • llama_get_logits (1048-1048)
  • llama_batch_ext_set_output_last (972-972)
src/llama-batch.cpp (2)
  • llama_batch_ext_set_output_last (443-450)
  • llama_batch_ext_set_output_last (443-443)
examples/embedding/embedding.cpp (3)
include/llama.h (9)
  • llama_batch_ext_add_text (950-956)
  • llama_batch_ext_get_n_tokens (943-943)
  • llama_model_has_encoder (533-533)
  • llama_model_has_decoder (536-536)
  • llama_encode_ext (996-998)
  • llama_decode_ext (1008-1010)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_free (986-986)
src/llama-context.cpp (4)
  • llama_encode_ext (2791-2795)
  • llama_encode_ext (2791-2793)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
src/llama-model.cpp (4)
  • llama_model_has_encoder (12134-12140)
  • llama_model_has_encoder (12134-12134)
  • llama_model_has_decoder (12142-12147)
  • llama_model_has_decoder (12142-12142)
examples/speculative/speculative.cpp (4)
include/llama.h (1)
  • llama_decode_ext (1008-1010)
src/llama-context.cpp (2)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
src/llama-context.h (2)
  • i (49-49)
  • i (52-52)
src/llama-kv-cache.h (2)
  • id (36-38)
  • id (36-36)
examples/lookahead/lookahead.cpp (1)
include/llama.h (5)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_add_text (950-956)
  • llama_batch_ext_get_n_tokens (943-943)
  • llama_batch_ext_free (986-986)
src/llama-batch.cpp (6)
src/llama-batch.h (3)
  • batch (99-99)
  • llama_batch_allocr (113-113)
  • llama_batch_allocr (115-115)
include/llama-cpp.h (9)
  • batch (29-29)
  • batch (29-29)
  • pos (79-79)
  • model (13-13)
  • model (13-13)
  • token (87-93)
  • token (87-87)
  • token (96-102)
  • token (96-96)
examples/llama-bench/llama-bench.cpp (2)
  • tokens (1437-1437)
  • n_tokens (946-952)
examples/run/run.cpp (12)
  • model (682-693)
  • model (682-682)
  • model (709-736)
  • model (709-709)
  • model (738-762)
  • model (738-738)
  • model (764-787)
  • model (764-764)
  • model (789-820)
  • model (789-789)
  • model (891-898)
  • model (891-891)
src/llama-model.h (1)
  • llama_n_pos_per_token (405-405)
src/llama-model.cpp (4)
  • llama_n_pos_per_token (12060-12062)
  • llama_n_pos_per_token (12060-12060)
  • llama_n_pos_per_token (12064-12066)
  • llama_n_pos_per_token (12064-12064)
examples/batched-bench/batched-bench.cpp (2)
include/llama.h (8)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_get_n_tokens (943-943)
  • llama_batch_ext_get_view (976-979)
  • llama_decode_ext (1008-1010)
  • llama_batch_ext_add_text (950-956)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_set_output_last (972-972)
  • llama_batch_ext_free (986-986)
src/llama-batch.cpp (14)
  • llama_batch_ext_init (373-375)
  • llama_batch_ext_init (373-373)
  • llama_batch_ext_get_n_tokens (395-397)
  • llama_batch_ext_get_n_tokens (395-395)
  • llama_batch_ext_get_view (456-475)
  • llama_batch_ext_get_view (456-459)
  • llama_batch_ext_add_text (399-422)
  • llama_batch_ext_add_text (399-405)
  • llama_batch_ext_clear (452-454)
  • llama_batch_ext_clear (452-452)
  • llama_batch_ext_set_output_last (443-450)
  • llama_batch_ext_set_output_last (443-443)
  • llama_batch_ext_free (477-493)
  • llama_batch_ext_free (477-477)
examples/llama-bench/llama-bench.cpp (4)
include/llama-cpp.h (12)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
  • batch (29-29)
  • batch (29-29)
  • token (87-93)
  • token (87-87)
  • token (96-102)
  • token (96-96)
src/llama-context.h (4)
  • ctx (157-161)
  • n_threads (61-61)
  • i (49-49)
  • i (52-52)
src/llama-context.cpp (6)
  • n_threads (449-451)
  • n_threads (449-449)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
  • llama_synchronize (2428-2430)
  • llama_synchronize (2428-2428)
include/llama.h (2)
  • llama_decode_ext (1008-1010)
  • llama_synchronize (1041-1041)
include/llama.h (6)
include/llama-cpp.h (11)
  • model (13-13)
  • model (13-13)
  • batch (29-29)
  • batch (29-29)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
  • pos (79-79)
examples/run/run.cpp (12)
  • model (682-693)
  • model (682-682)
  • model (709-736)
  • model (709-709)
  • model (738-762)
  • model (738-738)
  • model (764-787)
  • model (764-764)
  • model (789-820)
  • model (789-789)
  • model (891-898)
  • model (891-891)
src/llama-context.cpp (2)
  • n_seq_max (445-447)
  • n_seq_max (445-445)
src/llama-batch.cpp (2)
  • llama_batch_ext_init (373-375)
  • llama_batch_ext_init (373-373)
src/llama-batch.h (1)
  • batch (99-99)
src/llama-context.h (1)
  • ctx (157-161)
examples/server/server.cpp (4)
src/llama-batch.h (1)
  • batch (99-99)
src/llama-context.h (9)
  • seq_id (53-53)
  • seq_id (100-100)
  • seq_id (101-101)
  • seq_id (102-102)
  • seq_id (115-120)
  • seq_id (122-126)
  • ctx (157-161)
  • i (49-49)
  • i (52-52)
src/llama-context.cpp (6)
  • llama_pooling_type (2385-2387)
  • llama_pooling_type (2385-2385)
  • llama_get_embeddings_seq (2456-2460)
  • llama_get_embeddings_seq (2456-2456)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
src/llama-batch.cpp (4)
  • llama_batch_ext_set_output_last (443-450)
  • llama_batch_ext_set_output_last (443-443)
  • llama_batch_ext_get_view (456-475)
  • llama_batch_ext_get_view (456-459)
examples/llama.swiftui/llama.cpp.swift/LibLlama.swift (1)
include/llama.h (6)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_free (986-986)
  • llama_batch_ext_get_n_tokens (943-943)
  • llama_batch_ext_clear (982-982)
  • llama_batch_ext_add_text (950-956)
  • llama_batch_ext_set_output_last (972-972)
🪛 SwiftLint (0.57.0)
examples/llama.swiftui/llama.cpp.swift/LibLlama.swift

[Warning] 199-199: Lines should not have trailing semicolons

(trailing_semicolon)

⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: server-windows
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (Release)
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: android-build
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
  • GitHub Check: windows-latest-cmake (avx-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX...
  • GitHub Check: windows-latest-cmake (avx512-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_...
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: macOS-latest-swift (generic/platform=iOS)
  • GitHub Check: macOS-latest-swift (generic/platform=tvOS)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: check-requirements
🔇 Additional comments (172)
src/llama-graph.cpp (4)

5-5: Include correctly added for accessing the new function

Good addition of the llama-model.h header file to access the newly added llama_n_pos_per_token function.


569-569: Properly migrated from method to direct initialization

The initialization of n_pos_per_token now uses the new function llama_n_pos_per_token(params.arch) instead of a class method. This change is consistent with the refactoring pattern to move the logic outside the class.


1004-1004: Updated constructor to use class member

The llm_graph_input_pos constructor now correctly takes n_pos_per_token as a parameter, matching the migration from method to class member.


1008-1008: Updated tensor size calculation

Tensor size calculation now correctly uses the class member n_pos_per_token instead of calling a method, consistent with the overall refactoring.

examples/llama.android/llama/src/main/java/android/llama/cpp/LLamaAndroid.kt (2)

48-48: Updated JNI function signature to use context

The new_batch function signature has been updated to take a context parameter (Long) instead of three integers. This aligns with the migration to llama_batch_ext in the native code.


105-105: Updated function call to match new signature

The call to new_batch now correctly passes the context parameter instead of the hardcoded values (512, 0, 1), aligning with the updated function signature.

common/speculative.cpp (11)

17-17: Updated struct member type to use the new batch API

Changed the batch member type from llama_batch to llama_batch_ext_ptr, which is a smart pointer wrapper around the new extended batch structure. This change is part of the migration to the new batch API and will ensure proper memory management.


26-26: Updated initialization to use the new batch API

The batch initialization now uses llama_batch_ext_ptr(ctx_dft) constructor, which internally calls llama_batch_ext_init(ctx). This aligns with the migration pattern described in the PR objectives.


207-207: Updated to use the batch object's clear method

Changed from using the standalone common_batch_clear function to using the object's clear() method, which is part of the llama_batch_ext_ptr interface.


211-211: Updated to use the batch object's add_text method

Changed from using common_batch_add to batch.add_text(), maintaining the same parameters but using the object's method instead of a standalone function.


217-217: Updated token count check to use method

Now using batch.n_tokens() method instead of directly accessing a field, which is consistent with the encapsulation provided by the new batch API.


220-220: Updated to use the extended decode function

Changed from using llama_decode to llama_decode_ext and passing batch.get() to retrieve the raw pointer from the smart pointer. This correctly uses the new API for decoding.


227-229: Batch operations updated to use new API methods

These consecutive operations on the batch (clear and add_text) now use the object's methods instead of standalone functions, consistent with the migration to the new batch API.


234-234: Another decode operation updated to use extended API

Consistently updated to use llama_decode_ext with batch.get() to retrieve the raw pointer from the smart pointer.


240-240: Consistent use of batch clear method

Consistently using the batch object's clear() method throughout the code.


267-267: Consistent use of batch add_text method

Consistently using the batch object's add_text() method throughout the code with the same parameter order as before.


270-270: Consistent use of extended decode function

Consistently using llama_decode_ext with batch.get() throughout the code.

common/common.cpp (2)

1019-1020: Correct migration to llama_batch_ext_ptr API

The warm-up code has been properly migrated from the legacy batch API to the new extended batch API. This implementation correctly uses position 0 for the initial encoding of tokens.


1029-1030: Properly migrated decoder warm-up with batch size limit

The decoding step of the warm-up process has been correctly migrated to the new extended batch API. Good use of std::min() to ensure the batch size doesn't exceed the configured limit.

examples/server/server.cpp (12)

1232-1232: Structure type updated for better memory management

The change from llama_batch batch_spec to llama_batch_ext_ptr batch_spec aligns with the migration pattern from llama_batch to the extended batch API. The smart pointer version helps with proper resource management.


1804-1804: Structure type updated for better memory management

The change from llama_batch batch to llama_batch_ext_ptr batch follows the same migration pattern as the slot's batch_spec field, ensuring consistency throughout the codebase.


1930-1930: Using clear() method instead of empty initialization

The code now properly calls clear() instead of initializing with empty braces, which is more explicit and ensures proper cleanup of resources.

Also applies to: 1966-1966, 2858-2858, 3286-3286


2409-2457: Embeddings implementation with sequence ID

The function signature has been updated to remove the batch parameter. The implementation now correctly uses a sequence ID (the slot ID) to retrieve embeddings directly from the context rather than iterating through batch tokens. This is more efficient and cleaner.

Note that the commented-out code indicates a known limitation: embeddings without pooling are not yet supported, as flagged in the conditional check.


2459-2479: Reranking implementation with sequence ID

Similar to the embeddings function, the reranking implementation has been updated to use sequence IDs instead of batch tokens, which simplifies the code and makes it more maintainable.


2880-2882: Updated batch token handling

The code now uses the batch.n_tokens() method instead of directly accessing the n_tokens member, and uses batch.add_text() instead of common_batch_add(), which provides better encapsulation and follows the new API pattern.


3085-3090: Updated batch token handling for prompt processing

The code now uses batch.add_text() with proper position and sequence ID parameters for adding prompt tokens, replacing the old common_batch_add() function. This ensures consistent use of the new batch API.


3116-3116: Using new API for setting output tokens

Replaced the previous implementation with llama_batch_ext_set_output_last(batch.get()) to set the last token as output, which aligns with the new API.


3146-3152: Using batch view for efficient token processing

The code now processes tokens in chunks by creating a view of the batch using llama_batch_ext_get_view(), which allows for more efficient token processing without copying the entire batch.


3287-3291: Updated batch token handling for speculative decoding

The code now uses batch_spec.add_text() with proper position and sequence ID parameters for adding tokens to the speculative batch, replacing the old common_batch_add() function.


3295-3295: Updated decoding with extended batch API

Changed to use llama_decode_ext(ctx, slot.batch_spec.get()) to align with the new API, ensuring that the speculative batch is properly processed.


4152-4155: Warning about unsupported pooling type

The code correctly checks for and rejects requests with pooling type 'none' since this isn't yet supported, as indicated by the commented-out implementation in send_embedding.

examples/simple-chat/simple-chat.cpp (6)

2-2: Include llama-cpp.h
This header is necessary for using the extended batch and decoding functions.


112-114: Initialize and track n_past
Initializing n_past at 0 and then updating it with batch.n_tokens() ensures consistent tracking of token positions across batches.


121-121: Safe boundary check
Using batch.n_tokens() here is correct for preventing context overflow.


127-127: Extended decoding call
Switching to llama_decode_ext is a valid upgrade; the error handling via GGML_ABORT is consistent with the rest of the code.


151-153: Prepare batch for next token
Clearing the batch, adding the new token with the same sequence ID, and incrementing n_past is a clean approach to building the prompt incrementally.


156-156: Free the batch
Freeing the batch at the end ensures no leaks. This call is correctly placed outside the loop.

examples/imatrix/imatrix.cpp (6)

500-500: Init new batch
Calling llama_batch_ext_init(ctx) is correct. Null checks are unnecessary as this function never returns nullptr per project learnings.


514-514: Clearing the batch
Using llama_batch_ext_clear(batch) properly resets the batch before adding new tokens.


516-516: Add tokens with extended API
The extended function call correctly includes the sequence ID reference. This aligns with the new batch approach.


520-520: Handle decode failures
Performing the decode with llama_decode_ext and checking errors ensures the process can abort cleanly on failure.


522-522: Free batch upon failure
This early return avoids potential leaks or invalid usage if the decode fails.


535-535: Final batch cleanup
Ensures resources are freed if decode succeeds. The logic avoids double-free issues.

examples/llama.swiftui/llama.cpp.swift/LibLlama.swift (18)

13-13: Opaque pointer for the batch
Storing the batch as an OpaquePointer aligns with the Swift bridging of C APIs.


20-20: Context length set to 128
Reducing n_len from a larger value may limit generation length; verify if this meets your needs.


29-29: Initialize extended batch
Initialization is correct, and no null check is required based on the library’s guarantees.


40-40: Free batch in deinit
Properly releases resources and avoids memory leaks when this actor is deallocated.


98-98: Return extended batch token count
Delegating token count to llama_batch_ext_get_n_tokens(batch) is the correct approach.


120-120: Clear batch before adding prompt tokens
Resets the batch to a known empty state.


124-124: Add tokens with sequence ID
Providing [llama_seq_id(0)] for each token is suitable for a single-sequence context.


126-126: Mark last token for output
llama_batch_ext_set_output_last(batch) ensures the final token is treated as output.


128-129: Improve error handling for decode failure
Only printing an error might allow continued execution with an invalid state. Consider throwing an exception or returning early.


132-132: Track current token count
Retrieving the new token count from the batch is crucial to keep n_cur in sync.


138-138: Sampler usage
Sampling based on llama_batch_ext_get_n_tokens(batch) - 1 seems intentional, but confirm it correctly references the last token.


165-166: Prepare batch for appended token
Clearing the batch and adding the new token with updated n_cur is consistent with incremental text generation.


171-171: Add more robust error handling
Similar to lines 128-129, only printing a message can lead to unpredictable flow. Return early or raise an error.


188-188: Clear batch
Properly resets prior to adding the next set of tokens.


193-195: Add tokens and mark last output
Ensures the final token in this batch is flagged for output.


201-202: Refine error handling
Consider throwing or returning an error code to prevent continuing after a decode failure.


215-218: Batch reuse for text generation
Clearing and adding tokens in a loop is consistent. Each iteration handles a slice of text with its own sequence ID.


221-222: Consistent decode error strategy
Again, only printing an error may risk future logic. A robust approach could error-out completely.

examples/llama-bench/llama-bench.cpp (6)

1430-1454: Ensure n_past remains valid.
This refactoring cleanly incorporates the new n_past parameter in test_prompt(). Just be mindful of any scenarios where n_past might be negative or exceed expected bounds.


1455-1470: Decode error handling not present.
This new test_gen() function also omits error checks on llama_decode_ext. Based on previous feedback, you plan to simplify for now and handle errors later, so this is noted.


1613-1613: Warm-up call looks good.
Calling test_prompt(...) with n_past = 0 aligns with the intended warm-up approach.


1619-1619: Consistent generation warm-up.
Similar to the prompt warm-up, using test_gen(..., 0, ...) is consistent with the new signature.


1632-1632: Additional prompt run.
Repeated invocation of test_prompt(...) seems intentional for benchmarking multiple runs. No concerns here.


1639-1639: Final generation run invocation.
Invoking test_gen(...) with t.n_gen, t.n_prompt is consistent with the revised batch API usage.

examples/lookahead/lookahead.cpp (9)

95-96: Partitioning prompt tokens into two batches.
Splitting the initial tokens (n_input - 1) and the last token separately is a neat approach. Be cautious if n_input is ever ≤ 1, but the prior checks should cover that.


97-98: No error checks for decode calls (revisited).
As previously noted, you’ve chosen not to handle llama_decode_ext return codes here for simplicity.


120-120: Batch initialization.
Allocating a dedicated llama_batch_ext for the lookahead logic appears correct and is properly paired with a free at line 483.


209-209: Clearing batch state.
Good use of llama_batch_ext_clear before repopulating tokens.


212-212: Adding the current token with output enabled.
Marking this token as true for output is consistent, ensuring it contributes to decoded logits.


235-239: Tracking token indices and adding verification tokens.
Storing the batch token position via llama_batch_ext_get_n_tokens right before calling llama_batch_ext_add_text helps manage multi-sequence verification.


250-251: Filling the next tokens at this level.
Appending lookahead tokens in a single call is succinct and leverages the false output flag for intermediate tokens.


257-258: Handling sub-level tokens.
Using i + 1 as the sequence ID and conditionally setting output to j == N - 2 ensures tokens become logits only at the final stage.


483-483: Proper resource cleanup.
Freeing the extended batch pointer prevents leaks and finalizes the lookahead logic neatly.

examples/run/run.cpp (6)

643-643: Initialize past-tokens counter

Adding a dedicated n_past field is a solid approach for tracking the number of tokens processed so far.


954-964: New context size check function

This helper cleanly encapsulates the logic for verifying that the batch does not exceed the context limit. Consider ensuring that its return value is consistently used or handled at the call site to avoid decoding when the limit is exceeded.


995-995: Initialize batch from text with current position

Using llama_data.n_past as the pos0 argument is a clean way to maintain correct offset. Just verify that other function calls also pass updated positions to prevent misalignment.


998-999: Revisit return-value handling
Previously flagged in older commits: the return value of check_context_size is ignored, which can let a context overflow go undetected until after decoding.
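A hedged sketch of the call-site handling, assuming check_context_size returns non-zero when the batch would exceed the context window (names follow the surrounding run.cpp code but are not verified here):

if (check_context_size(llama_data.context, batch) != 0) {
    return 1; // stop before decoding past the context limit
}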


1004-1005: Incrementing n_past

Updating llama_data.n_past by the batch size is correct for preserving generation context.


1020-1020: Clearing batch mid-generation

Clearing the batch to prepare for the next token set is appropriately placed. Ensure no logic relies on the old batch contents.

examples/embedding/embedding.cpp (9)

29-34: New helper for adding tokens

batch_add_seq correctly transforms each token into a call to llama_batch_ext_add_text. Be mindful that using true for the output parameter produces embeddings for every token; confirm that all tokens need embedding.


36-43: Dedicated batch-decoding function

This new function centralizes the batch decoding logic for embeddings. Parameter choices (e.g., embd_norm) appear consistent with the code’s usage.


45-45: Informational log

Logging the token count and sequence number is helpful for debugging.


49-50: Encoder invocation

Switching to llama_encode_ext aligns with the extended batch approach. The error-handling block looks good.


54-56: Decoder invocation

Similarly, calling llama_decode_ext ensures consistency with extended API usage.


59-75: Conditional pooling logic

Splitting the logic by pooling type is clear. Iterating token-by-token vs. sequence-by-sequence is properly separated.


170-170: Extended batch creation

Replacing legacy batch initialization with llama_batch_ext_init properly adopts the new mechanism.


210-210: Final decode call

Using batch_decode one last time for any remaining tokens completes the embedding process.


316-316: Freeing the extended batch

Releasing batch resources at the end ensures no memory leaks remain.

examples/batched/batched.cpp (14)

105-105: Extended batch initialization

Creating the batch with llama_batch_ext_init(ctx) is the correct replacement for the legacy batch struct.


114-114: Prompt token addition

Appropriately adds prompt tokens for decoding. Confirm that the final false for output matches the desired logits usage.


116-116: Assert on prompt size

Verifying that the batch token count matches the prompt size ensures correct prompt loading.


119-120: Hybrid model check

Entering the encoder path to invoke llama_encode_ext is correct. The immediate error check is a good practice.


129-130: Reset and add decoder start

Clearing the batch, then adding a decoder_start_token_id sets the stage for the next decode phase.


134-134: Output for last token

Calling llama_batch_ext_set_output_last ensures the final prompt token's logits are available for sampling.


136-136: Check decode result

The nonzero return from llama_decode_ext is handled via an error condition, which is good for ensuring robust flow.


160-160: Initialize n_cur

Keeping track of the total tokens processed so far with llama_batch_ext_get_n_tokens helps produce correct offsets.


167-167: Clearing batch per iteration

Clearing between iterations is essential for the next set of tokens.


197-197: Update stream index after sampling

Reassigning i_batch[i] to the new last token index ensures correct sampling references.


199-199: Add new token for output

Passing true for the output argument logs the token’s logits. This is consistent with the decoding loop’s flow.


205-205: Detect completion

Short-circuiting the loop if there are no new tokens to decode makes sense to avoid unnecessary processing.


212-212: Error-handling decode

Ensuring decode success each iteration is critical for reliability.


237-237: Clean up extended batch

Properly freeing the batch finalizes the memory usage.

examples/perplexity/perplexity.cpp (3)

652-661: Good update to the decode_helper function.

The function has been properly updated to work with the new batch extension API. The error handling is implemented correctly, checking the return value from llama_decode_ext.


366-375: Batch initialization and token addition are properly updated.

The code now uses the new llama_batch_ext_ptr API correctly. The batch is initialized, cleared, and tokens are added using the appropriate methods.


567-573: Position tracking is handled correctly.

The updated code properly tracks positions for tokens, which is critical for maintaining context. The conversion from the old batch API to the new extended batch API preserves this important aspect.

examples/retrieval/retrieval.cpp (4)

77-82: Proper update to batch_add_seq function.

The function signature and implementation have been updated to work with the new llama_batch_ext API, using llama_batch_ext_add_text for token addition.


84-110: Improved model type handling in batch_decode.

The updated function now properly handles both encoder-only and decoder-only models, selecting the appropriate decoding method based on the model type. The addition of the embd_norm parameter also provides more flexibility for embedding normalization.

The assertion checking for non-null embeddings is a good defensive programming practice.


232-239: Batch management is correctly updated.

The code properly uses llama_batch_ext_get_n_tokens to check the batch size and llama_batch_ext_clear to reset the batch. This maintains the same logic flow as the original code with the new API.


299-301: Proper memory cleanup.

The code correctly frees the allocated batch memory using llama_batch_ext_free, ensuring there are no memory leaks.

examples/speculative/speculative.cpp (7)

6-6: Include Usage Check
Including "llama-cpp.h" here is valid, but ensure it does not conflict with the existing "llama.h" and that all required definitions are uniquely provided without redefinition.


205-206: No Immediate Issues
Instantiating both batch_dft and batch_tgt from the correct contexts looks appropriate for parallel usage.


452-452: Repeated Omission of Error Checking
Similarly, this second call to llama_decode_ext also does not verify success or failure. Maintain consistency by handling its return code.


477-484: Struct Introduction & Vector Usage
Defining batch_info and using batch_tgt_data clarifies structured token storage. The approach is readable. No issues found.


489-489: No Issues with Clearing
Calling batch_dft.clear() is straightforward and should reset the batch accurately.


579-580: Early Break on Empty Batch
Breaking if batch_dft.n_tokens() == 0 is a neat safeguard against aimless decoding. No problem found.


588-588: Draft Data Size Check
Breaking if batch_tgt_data.size() > n_draft is fine for preventing excessive tokens. Ensure the user is informed of partial drafting.
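To keep the user informed, a small logging sketch could be added at the break point (LOG_DBG and the variable names are taken from the surrounding discussion and are assumptions, not verified code):

if (batch_tgt_data.size() > (size_t) n_draft) {
    LOG_DBG("draft limit reached (%zu > %d), stopping drafting early\n", batch_tgt_data.size(), n_draft);
    break;
}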

examples/llama.android/llama/src/main/cpp/llama-android.cpp (14)

178-178: Correct Casting of Batch Pointer
Casting the batch pointer to llama_batch_ext * is consistent with the new extended batch API. No immediate issues.


189-189: Batch Clear
llama_batch_ext_clear(batch); effectively resets the batch. Looks good.


197-197: Setting Output Last
Marking the last token as output is valid for capturing logits from the prompt. No concerns.


201-202: Good Error Checking
Here, the failure state from llama_decode_ext(...) is properly logged. This consistent pattern of checking is recommended everywhere.


221-222: Decode Failure Handling
Again, robust error handling is displayed here for text generation. Good job.


274-278: Initializing a New Batch
Creating a new llama_batch_ext using llama_batch_ext_init(context) aligns with the extended batch approach. Looks correct.


286-286: Freeing the Batch
Calling llama_batch_ext_free ensures we release associated memory. No issues.


334-334: Consistent Reinterpret Cast
Using reinterpret_cast<llama_batch_ext *>(batch_pointer) maintains consistency with your new API calls.


352-352: Batch Reset Before Filling
Clearing the batch before adding the prompt tokens is sensible. No concerns.


362-362: Designating Last Token as Output
No issues. This is needed if you want to collect logits for the final token.


364-364: Error Handling
You properly handle decode errors. Good consistency.


369-369: Returning the Batch Token Count
This is a helpful interface for the Java layer. Looks fine.


383-383: Batch Pointer Reinterpretation
Same pattern of casting. No immediate issues.


419-420: Logging Decode Failures
As before, capturing decode errors helps in debugging. This is a good pattern.

examples/save-load-state/save-load-state.cpp (5)

51-51: Batch Initialization
Using llama_batch_ext_ptr::init_from_text to create and populate the batch from your prompt tokens is a clean approach.


82-85: Decoding & Freeing on Failure
This block properly checks llama_decode_ext and frees the batch upon failure. Good error handling.


134-139: Consistent Error Handling
You again handle decode failures and free resources. Keeping the cleanup consistent is commendable.


218-220: Error Handling for Third Context
The pattern remains consistent, returning if decode fails and freeing the batch. This is good practice.


232-232: Memory Cleanup
Final call to free the batch ensures resources are released. Looks correct.

src/llama-batch.cpp (11)

2-2: Header inclusion looks fine.
No issues noted with adding #include "llama-graph.h"; it aligns with the rest of the include directives.


277-317: Confirm intent for forcing the last token’s logits to true.
By default, line 314 sets the final token's logits value to true. Ensure this behavior is intentional regarding how output tokens are flagged.


319-321: Constructor/Destructor alignment.
The destructor deletes the batch pointer. Past review comments mentioned double-free concerns. Since they are addressed, no further action is needed.


328-330: Helper function is concise and correct.
No obvious issues. This function simply returns a valid llama_batch struct without extra checks, which is consistent with the project style.


373-375: Simple initialization logic.
Constructing the batch with the specified parameters is straightforward and consistent with the rest of the file.


377-393: Embedding dimension validation.
If n_embd mismatches the actual model dimension, it can lead to undefined behavior. This was flagged in a previous review.


395-397: Trivial accessor is correct.
Returning batch->n_tokens meets the function’s purpose.


443-450: Marking the last token as output is straightforward.
No issues with the logic; it is a minimal function that sets logits on the final token.


452-454: Method clears n_tokens only.
Resetting n_tokens is enough for reusing the batch structure. This appears intentional and consistent with the code style.


477-493: View-aware free logic is correct.
Memory is freed only if !is_view, preventing double-free of shared pointers. This aligns with the introduced view mechanism.


495-495: No additional comment needed.
This line simply documents deprecation.

examples/batched-bench/batched-bench.cpp (7)

62-62: Batch initialization appears standard.
The call to llama_batch_ext_init(ctx) is straightforward.


118-118: Clearing the batch is consistent.
No issues: resetting n_tokens each iteration is valid.


122-122: Again ignoring error codes.
Same concern regarding ignoring the return code from llama_batch_ext_add_text.


125-125: Mark last token as output.
Straightforward usage of llama_batch_ext_set_output_last.


147-147: Batch clear repeated.
Nothing unusual; consistent usage with earlier calls.


150-150: Ignoring error codes once more.
As with earlier calls to llama_batch_ext_add_text, evaluating the return could improve reliability.


188-188: Freeing the extended batch is correct.
Deallocating at the end follows the introduced pattern.

examples/llava/gemma3-cli.cpp (10)

8-8: New include file.
Including "llama-cpp.h" for extended batch features is consistent with the migration.


67-67: Migration to llama_batch_ext_ptr.
Switching from a raw struct to a smart pointer should aid safe memory management.


77-78: Batch reset with llama_batch_ext_init.
Properly assigning a fresh batch instance. No concerns.


93-93: Clearing the batch.
Ensures subsequent calls start with a clean slate.


98-98: Setting the last token as output.
No issues observed, consistent with usage.


101-101: Decoding call with error check.
You handle the return value, ensuring any decode failure is logged.


150-152: Embedding-based batch initialization.
Be sure n_embd matches the model dimension (already noted).


183-183: Clearing batch for next token.
Logical step before reusing or adding new tokens.


184-184: Ignoring add_text return code.
Same concern that ignoring a failure could lead to hidden errors.


188-188: Decoding token with proper error handling.
Checks llama_decode_ext return value and logs on failure.

include/llama.h (6)

238-240: Deprecation notice is clear.
The documentation effectively warns about the upcoming removal of this struct.


928-941: Reiterate the need for pointer and dimension checks.
As noted in previous reviews, explicitly validate that embd is non-null and n_tokens, n_embd > 0 to avoid undefined behavior.


942-943: Function for retrieving the token count looks good.
No immediate issues found.


945-957: Clarify usage of seq_ids parameter.
The documentation does not specify how to handle a null pointer or zero-length array in seq_ids. Please provide details or consider adding checks to avoid potential runtime errors.
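One possible convention, sketched here as an assumption rather than current behavior, is an explicit guard at the top of llama_batch_ext_add_text:

if (seq_ids == NULL || n_seq_ids == 0) {
    return -1; // the exact error code should match the documented negative returns
}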


968-973: Straightforward logic for setting output on the last token.
Implementation looks clear and consistent.


981-986: Clear guidance on memory management.
These functions provide straightforward handling of batch clearing and freeing.

Comment on lines 197 to 201
if (llama_batch_ext_get_n_tokens(batch) + n_toks > n_batch) {
    batch_decode(ctx, batch, emb + e * n_embd, s, n_embd, params.embd_normalize);
    llama_batch_ext_clear(batch);

    e += pooling_type == LLAMA_POOLING_TYPE_NONE ? llama_batch_ext_get_n_tokens(batch) : s;

⚠️ Potential issue

Potential off-by-one error after clearing

llama_batch_ext_clear(batch) resets the token count, yet line 201 uses llama_batch_ext_get_n_tokens(batch) to update e. This will yield zero instead of the tokens just decoded.

Below is a suggested fix to store the old count before clearing:

    int old_tokens = llama_batch_ext_get_n_tokens(batch);
    batch_decode(ctx, batch, emb + e * n_embd, s, n_embd, params.embd_normalize);
    llama_batch_ext_clear(batch);

-   e += pooling_type == LLAMA_POOLING_TYPE_NONE ? llama_batch_ext_get_n_tokens(batch) : s;
+   e += pooling_type == LLAMA_POOLING_TYPE_NONE ? old_tokens : s;

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (9)
src/llama-model.cpp (2)

6078-6082: Clarify and Verify Transposition Logic for Input Positions

The new transposition sequence—from reshaping to transposing then flattening—appears intended to convert a layout from 0000111122223333 to 0123012301230123. This approach is clear given the comment, but please verify that:

  • The intermediate reshapes and transpose indeed produce the expected layout for all n_tokens and n_pos_per_token values.
  • There is no unintended overhead or memory management issue (e.g., resource leaks if ggml_cont allocates new tensors).

As a follow-up, consider encapsulating this common transformation in a helper function and adding unit tests to ensure correctness.
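
One possible shape for such a helper, assuming the positions tensor is a 1-D, component-major buffer of length n_tokens * n_pos_per_token (names are illustrative; the real code builds this inside the model graph):

#include "ggml.h"

// Reorders 0000111122223333 into 0123012301230123 with reshape/transpose/cont.
static ggml_tensor * interleave_positions(ggml_context * ctx, ggml_tensor * pos,
                                          int64_t n_tokens, int64_t n_pos_per_token) {
    // ne0 = token index, ne1 = position component (component-major input)
    ggml_tensor * t = ggml_reshape_2d(ctx, pos, n_tokens, n_pos_per_token);
    // transposing and making the result contiguous puts the components of each
    // token next to each other
    t = ggml_cont(ctx, ggml_transpose(ctx, t));
    return ggml_reshape_1d(ctx, t, n_tokens * n_pos_per_token);
}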


12065-12072: Consistent Overloads for Position-Per-Token Calculation

The overloads of llama_n_pos_per_token that differentiate based on the architecture ensure that for LLM_ARCH_QWEN2VL four positions per token are returned, and one otherwise. This refactoring is concise and improves clarity by abstracting the logic behind position calculation.

For added clarity, a brief comment explaining why the QWEN2VL architecture requires 4 positions (if not documented elsewhere) would be beneficial.

examples/server/server.cpp (1)

2403-2447: Embeddings implementation with sequence ID

The send_embedding method has been refactored to use sequence IDs instead of directly processing batch tokens. This aligns with the sequence-centric design of the extended batch API. However, as noted in the commented-out code, embeddings without pooling are not currently supported.

Consider implementing support for embeddings without pooling in a future update to make this feature complete. The commented-out code shows the intended approach but is not yet functional.

src/llama-batch.h (1)

8-33: Extended batch documentation
These added comments detail usage of llama_batch_ext (tokens, embeddings, positions, etc.). Good clarity. Consider clarifying multi-position behavior if more than one position per token is eventually used.

src/llama-batch.cpp (1)

431-448: Selective output marking
llama_batch_ext_set_output marks tokens as output when their seq_id (and, optionally, position) matches. This is a flexible approach; just make sure the linear scan over the tokens does not become a performance issue for large batches.

include/llama.h (4)

264-267: Consider adding a dedicated documentation block
While the forward declaration of llama_batch_ext is helpful, consider adding a short docstring describing usage, memory ownership, and any constraints for users of this new extended batch structure.

 // Input data for llama_decode / llama_encode
 // It can contain text tokens and embeddings for one or many sequences
 struct llama_batch_ext;
+// Example usage:
+//    1) Allocate with llama_batch_ext_init(ctx).
+//    2) Add tokens or embeddings.
+//    3) Pass to llama_encode_ext / llama_decode_ext.
+//    4) Free with llama_batch_ext_free().

485-486: Add clarifying comments for these new model functions
The new functions for rope type and position retrieval appear useful, but adding brief details about their purpose (e.g., how the rope type influences positional encoding) would make the API more self-documenting.

 LLAMA_API enum llama_rope_type       llama_model_rope_type(const struct llama_model * model);
 LLAMA_API uint32_t                   llama_n_pos_per_token(const struct llama_model * model);
+/// @brief Returns the number of positions per token for the given model.
+///        Typically used to configure internal data structures for positional encodings.

997-999: Document the extended encoding workflow
The newly introduced llama_encode_ext function lacks a descriptive comment explaining its input, output, and error codes. Consider replicating or adapting the documentation from llama_encode with emphasis on how extended batches differ.

 LLAMA_API int32_t llama_encode_ext(
     struct llama_context * ctx,
     struct llama_batch_ext * batch);
+/// @brief Encode a batch of tokens or embeddings using the new llama_batch_ext API.
+/// @return 0 on success, < 0 on error, as with llama_encode.
+/// @note This function updates internal KV cache state.

1007-1011: Improve documentation and consistency with llama_decode
Similarly, llama_decode_ext would benefit from usage notes and parameter details that mirror the original llama_decode doc—especially around batch sizing, concurrency, and return codes.

 struct llama_batch batch);

 LLAMA_API int32_t llama_decode_ext(
     struct llama_context * ctx,
     struct llama_batch_ext * batch);
+/// @brief Decode a batch of extended tokens or embeddings.
+/// @return 0 on success, < 0 on error.
+/// @note Validate the batch size and sequence IDs before calling this function.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d18a79e and 42062cc.

📒 Files selected for processing (7)
  • examples/llava/qwen2vl-cli.cpp (2 hunks)
  • examples/parallel/parallel.cpp (7 hunks)
  • examples/server/server.cpp (20 hunks)
  • include/llama.h (6 hunks)
  • src/llama-batch.cpp (5 hunks)
  • src/llama-batch.h (4 hunks)
  • src/llama-model.cpp (2 hunks)
🔥 Files not summarized due to errors (2)
  • examples/llava/qwen2vl-cli.cpp: Error: Disallowed special token found: <|im_start|>
  • src/llama-model.cpp: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🧬 Code Definitions (6)
examples/llava/qwen2vl-cli.cpp (4)
include/llama-cpp.h (7)
  • pos (79-79)
  • batch (29-29)
  • batch (29-29)
  • token (87-93)
  • token (87-87)
  • token (96-102)
  • token (96-96)
src/llama-batch.h (1)
  • batch (100-100)
include/llama.h (2)
  • llama_decode_ext (1009-1011)
  • llama_batch_ext_set_output_last (973-973)
src/llama-batch.cpp (2)
  • llama_batch_ext_set_output_last (450-457)
  • llama_batch_ext_set_output_last (450-450)
examples/parallel/parallel.cpp (5)
include/llama-cpp.h (8)
  • batch (29-29)
  • batch (29-29)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
src/llama-batch.h (1)
  • batch (100-100)
src/llama-context.h (9)
  • i (49-49)
  • i (52-52)
  • ctx (157-161)
  • seq_id (53-53)
  • seq_id (100-100)
  • seq_id (101-101)
  • seq_id (102-102)
  • seq_id (115-120)
  • seq_id (122-126)
src/llama-context.cpp (4)
  • llama_decode_ext (2797-2801)
  • llama_decode_ext (2797-2799)
  • n_batch (437-439)
  • n_batch (437-437)
src/llama-batch.cpp (4)
  • llama_batch_ext_set_output_last (450-457)
  • llama_batch_ext_set_output_last (450-450)
  • llama_batch_ext_get_view (463-483)
  • llama_batch_ext_get_view (463-466)
include/llama.h (3)
include/llama-cpp.h (15)
  • model (13-13)
  • model (13-13)
  • batch (29-29)
  • batch (29-29)
  • ctx (46-60)
  • ctx (46-51)
  • ctx (63-70)
  • ctx (63-68)
  • ctx (73-84)
  • ctx (73-78)
  • pos (79-79)
  • token (87-93)
  • token (87-87)
  • token (96-102)
  • token (96-96)
src/llama-context.cpp (2)
  • n_seq_max (445-447)
  • n_seq_max (445-445)
src/llama-batch.cpp (2)
  • llama_batch_ext_init (375-378)
  • llama_batch_ext_init (375-375)
src/llama-batch.h (3)
include/llama-cpp.h (2)
  • batch (29-29)
  • batch (29-29)
src/llama-kv-cache.h (1)
  • batch (111-111)
src/llama-batch.cpp (2)
  • llama_batch_allocr (277-318)
  • llama_batch_allocr (320-322)
src/llama-batch.cpp (3)
src/llama-batch.h (3)
  • batch (100-100)
  • llama_batch_allocr (114-114)
  • llama_batch_allocr (116-116)
include/llama.h (3)
  • llama_n_pos_per_token (486-486)
  • llama_n_batch (469-469)
  • llama_n_seq_max (471-471)
src/llama-model.h (1)
  • llama_n_pos_per_token (405-405)
examples/server/server.cpp (4)
src/llama-batch.h (1)
  • batch (100-100)
src/llama-context.h (9)
  • seq_id (53-53)
  • seq_id (100-100)
  • seq_id (101-101)
  • seq_id (102-102)
  • seq_id (115-120)
  • seq_id (122-126)
  • ctx (157-161)
  • i (49-49)
  • i (52-52)
src/llama-context.cpp (4)
  • llama_pooling_type (2385-2387)
  • llama_pooling_type (2385-2385)
  • llama_get_embeddings_seq (2456-2460)
  • llama_get_embeddings_seq (2456-2456)
src/llama-batch.cpp (2)
  • llama_batch_ext_set_output_last (450-457)
  • llama_batch_ext_set_output_last (450-450)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: macOS-latest-swift (generic/platform=macOS)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: editorconfig
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-windows
🔇 Additional comments (41)
examples/server/server.cpp (10)

1232-1232: Type change from llama_batch to smart pointer llama_batch_ext_ptr

This is part of the migration from the legacy batch API to the extended batch API. The change to a smart pointer should improve memory management and resource cleanup.


1804-1804: Consistent type change to llama_batch_ext_ptr in server_context

This change is consistent with the migration pattern from llama_batch to llama_batch_ext_ptr seen in the slot struct, ensuring uniform adoption of the extended batch API throughout the codebase.


1930-1930: Replaced initialization with clear() method

The code now uses the clear() method on the batch objects instead of reinitializing them. This is a more explicit and potentially more efficient way to reset the batch state.

Also applies to: 1962-1962, 2095-2095


2453-2468: Reranking implementation with sequence ID

Similar to the embeddings implementation, the send_rerank method now uses sequence IDs instead of directly processing batch tokens. This provides a more consistent approach across different functionalities.


2874-2877: Updated token addition API in update_slots

The code now uses batch.add_text() to add tokens to the batch, which is part of the new extended batch API. This provides a more direct and potentially more efficient way to build batches.


3079-3084: Consistent use of add_text for prompt tokens

The code for processing prompt tokens now uses the same batch.add_text() method, ensuring consistency in how tokens are added to batches throughout the codebase.


3110-3110: Using extended batch API to set output token

The code now uses llama_batch_ext_set_output_last(batch.get()) to set the last token as the output, which is part of the new API. This provides a clearer way to specify which tokens should produce output logits.


2893-2893: Updated batch size check

The condition now uses batch.n_tokens() instead of likely a direct field access in the previous implementation. This is consistent with the object-oriented approach of the new batch API.


3140-3146: Improved batch processing with views

The code now creates batch views for processing, which allows processing batches in chunks without copying data. This is a more efficient approach that aligns well with the extended batch architecture.


3280-3285: Updated speculative decoding batch handling

Speculative decoding now uses the extended batch API consistently, maintaining compatibility with the rest of the code. The clear and add_text pattern is applied consistently here as well.

examples/parallel/parallel.cpp (15)

9-9: Include for new extended batch API
Including "llama-cpp.h" here aligns with the refactor to the new extended batch API. Good addition.


176-176: Instantiating an extended batch pointer
Creating llama_batch_ext_ptr batch(ctx); ensures the batch is tied to the context. Make sure ctx is valid and remains in scope for this usage.


194-194: Adding system tokens to the new batch
Using batch.add_text(tokens_system[i], i, 0, false); shows the updated approach to add tokens. Looks correct for the incremental indexing of i.


197-197: Error handling for extended decode
The conditional checks the return value of llama_decode_ext and logs an error before returning. The handling is consistent with the new decoding API.


218-218: Clearing the extended batch
batch.clear(); correctly resets the batch for reuse. Straightforward usage of the new method.


226-226: Tracking client batch indices
Assigning client.i_batch = batch.n_tokens(); associates the client’s position with the current end of the batch tokens. This is logically sound.


229-229: Adding sampled token with position offset
batch.add_text(client.sampled, n_tokens_system + client.n_prompt + client.n_decoded, seq_id, true); properly offsets the position. Ensure total position count doesn’t exceed the model’s limit.
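
A small sketch of the suggested bound check, assuming llama_n_ctx(ctx) is the relevant per-sequence limit here (the effective limit may also depend on how the KV cache is partitioned across clients):

#include "llama.h"

// Returns true if a token at position `pos` still fits inside the context window.
static bool pos_within_ctx(const llama_context * ctx, llama_pos pos) {
    return pos >= 0 && (uint32_t) pos < llama_n_ctx(ctx);
}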


247-247: Conditional insertion of new sequences
if (cont_batching || batch.n_tokens() == 0) { ... } ensures new sequences are only appended when continuous batching is allowed or the batch is empty. No immediate concerns.


265-266: Appending prompt tokens
These lines compute a sequence ID and add tokens with batch.add_text(...). The logic appears correct for binding tokens to the correct client-driven sequence.


271-272: Mark last token for output
Conditionally sets the last token as an output token. This is a clean way to retrieve logits/embeddings only for the final token if needed.


277-277: Index alignment for newly added tokens
client.i_batch = batch.n_tokens() - 1; updates the client’s reference to the last token position. Straightforward usage.


290-290: Early exit if batch is empty
if (batch.n_tokens() == 0) { break; } stops processing when there are no tokens to decode. Approved logic.


297-298: Looping over batch tokens
Storing n_tokens_in_batch = batch.n_tokens() and iterating in chunks is consistent with the new extended batch usage.


306-306: Chunk size determination
const int32_t n_tokens = std::min(n_batch, (int32_t) (n_tokens_in_batch - i)); neatly caps each iteration’s size. Looks correct.


308-310: Batch view usage without explicit boundary checks
llama_batch_ext_get_view(batch.get(), i, n_tokens) assumes i + n_tokens <= batch->max_tokens; if that invariant is violated, the view can reference memory out of bounds. Verify that offset + n_tokens never exceeds the allocated range.
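
A bounded version of this loop could look like the following sketch (the chunk size is clamped so that offset + n_tokens never exceeds the batch's token count; the error handling is illustrative):

#include "llama.h"
#include <algorithm>

// Decode an extended batch in chunks of at most n_batch tokens using views.
static int decode_in_chunks(llama_context * ctx, llama_batch_ext * batch, int32_t n_batch) {
    const int32_t n_total = llama_batch_ext_get_n_tokens(batch);
    for (int32_t i = 0; i < n_total; i += n_batch) {
        const int32_t n_chunk = std::min(n_batch, n_total - i);
        llama_batch_ext * view = llama_batch_ext_get_view(batch, i, n_chunk);
        const int ret = llama_decode_ext(ctx, view);
        llama_batch_ext_free(view);   // frees the view wrapper, not the parent's buffers
        if (ret != 0) {
            return ret;
        }
    }
    return 0;
}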

src/llama-batch.h (5)

77-77: Pointer to extended batch
const llama_batch_ext * batch = nullptr; suggests a read-only pointer usage. Ensure the batch is valid for the entire lifetime of this struct.


100-100: from_batch updated to accept llama_batch_ext
Switching to the extended type is consistent with the overall refactor. The function properly handles tokens, pos, seq IDs, etc.


105-106: Pointer member for the allocated batch
struct llama_batch_ext * batch; in llama_batch_allocr shows a new approach for dynamic ownership. Confirm that usage patterns avoid double frees.


113-114: New constructor transition
The constructor now receives struct llama_batch & in_batch to create an extended batch. This is a straightforward bridging approach from the old batch type.


116-117: Ensuring proper resource teardown
The destructor calls delete batch; preventing leaks. This finalizes the one-to-one ownership pattern.

src/llama-batch.cpp (10)

2-2: Including llama-graph
Adding #include "llama-graph.h" may be necessary if extended batch APIs or graph structures cross-reference. This is presumably correct.


342-373: llama_batch_ext_init_impl allocation strategy
The function allocates dynamic buffers for tokens, embeddings, positions, etc., and sets default fields. Notably, there's no explicit out-of-memory check, consistent with codebase learnings. Ensure the caller will handle or assume success.


375-379: Initialize extended batch from context
llama_batch_ext_init references the model’s batch/seq maxima. The code sets up a batch with zero tokens. Looks correct for a typical initialization.


380-396: Initialize extended batch from embeddings
Creates a batch for embd, copying positions and assigning the single seq_id. This is consistent with an embedding-based input pipeline.


398-400: Getter for token count
llama_batch_ext_get_n_tokens simply returns batch->n_tokens. Straightforward accessor.


402-429: Adding tokens to extended batch
llama_batch_ext_add_text appends one token, copying seq_ids and setting logits flags. This method returns negative codes on error. See if upstream code checks these error codes thoroughly.


450-457: Set the last token as output
llama_batch_ext_set_output_last sets the logits flag on the last token if batch->n_tokens > 0. Appropriately returns the output index.


459-462: Clearing the extended batch
llama_batch_ext_clear(batch) sets n_tokens = 0. Straightforward approach to reuse the allocated buffers.


463-484: Creating a batch view without buffer copies
llama_batch_ext_get_view returns a new llama_batch_ext object referencing a sub-range (offset, n_tokens). It’s flagged as is_view so memory isn’t freed. Beware of offset boundary checks.


485-501: Memory release for extended batch
llama_batch_ext_free frees only if it’s not a view. This upholds the is_view semantics. Potential for double-free is mitigated if usage is consistent.

include/llama.h (1)

238-240: Clear deprecation note for the legacy struct
The addition of a prominent warning helps guide developers away from legacy usage.

Comment on lines +4146 to +4149
if (llama_pooling_type(ctx_server.ctx) == LLAMA_POOLING_TYPE_NONE) {
    res_error(res, format_error_response("Pooling type 'none' is not yet supported. Please use a different pooling type", ERROR_TYPE_INVALID_REQUEST));
    return;
}

⚠️ Potential issue

Warning about unsupported pooling type

The code correctly checks for and rejects requests with pooling type 'none' since this isn't yet supported, as indicated by the commented-out implementation in send_embedding.

Consider implementing support for the 'none' pooling type to make the feature complete, or provide more detailed error messages to help users understand the limitation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/server/server.cpp (1)

1256-1257: TODO comment should be addressed.

There's a TODO comment about removing i_batch and using only sequence-based sampling, which would be a more modern approach. Consider implementing this change now as part of the batch API migration, or create a follow-up task to track this improvement.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 42062cc and 56e82d0.

📒 Files selected for processing (1)
  • examples/server/server.cpp (20 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
examples/server/server.cpp (6)
src/llama-batch.h (1)
  • batch (100-100)
src/llama-context.h (9)
  • ctx (157-161)
  • seq_id (53-53)
  • seq_id (100-100)
  • seq_id (101-101)
  • seq_id (102-102)
  • seq_id (115-120)
  • seq_id (122-126)
  • i (49-49)
  • i (52-52)
include/llama.h (4)
  • llama_pooling_type (482-482)
  • llama_get_embeddings_seq (1076-1076)
  • llama_batch_ext_set_output_last (973-973)
  • llama_batch_ext_get_view (977-980)
src/llama-context.cpp (6)
  • llama_pooling_type (2385-2387)
  • llama_pooling_type (2385-2385)
  • llama_get_embeddings_seq (2456-2460)
  • llama_get_embeddings_seq (2456-2456)
  • n_batch (437-439)
  • n_batch (437-437)
common/common.cpp (2)
  • common_embd_normalize (1795-1829)
  • common_embd_normalize (1795-1795)
src/llama-batch.cpp (4)
  • llama_batch_ext_set_output_last (450-457)
  • llama_batch_ext_set_output_last (450-450)
  • llama_batch_ext_get_view (463-483)
  • llama_batch_ext_get_view (463-466)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-windows
  • GitHub Check: server (Release)
🔇 Additional comments (21)
examples/server/server.cpp (21)

1232-1232: Data structure update to use extended batch.

The server_slot struct now uses llama_batch_ext_ptr which is a smart pointer wrapper for the new extended batch structure, replacing the previous llama_batch type. This is part of the migration to the new batch API.


1804-1804: Data structure update in server context.

The server_context struct now uses llama_batch_ext_ptr instead of llama_batch, consistent with the migration to the extended batch API.


1930-1930: Updated batch_spec clearing.

Now using the clear() method of the smart pointer wrapper instead of initializing with empty braces. This is the correct approach for the new API.


2095-2095: Consistent use of clear() method.

Consistently using the clear() method on batch_spec throughout the codebase, which is appropriate for the new batch pointer type.


2403-2447: Embeddings implementation with sequence ID

The implementation now correctly uses sequence IDs from the slot instead of iterating through batch tokens. Note that the commented-out code (currently not used) contains a limitation warning that embeddings without pooling are not yet supported.

Consider implementing support for embeddings without pooling to make the feature complete. The commented code shows the intended approach but is not yet functional.
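
For reference, the sequence-ID based retrieval this relies on can be sketched as follows (valid when the context uses a pooling type other than NONE, so llama_get_embeddings_seq returns one pooled vector per sequence):

#include "llama.h"
#include <vector>

// Copy the pooled embedding of one sequence into an owned buffer.
static std::vector<float> pooled_embedding(llama_context * ctx, llama_seq_id seq_id, int32_t n_embd) {
    std::vector<float> out((size_t) n_embd, 0.0f);
    const float * e = llama_get_embeddings_seq(ctx, seq_id);
    if (e != nullptr) {
        out.assign(e, e + n_embd);
    }
    return out;
}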


2453-2468: Improved reranking implementation with sequence ID.

The send_rerank method has been updated to work with the new batch API by using sequence IDs rather than passing a batch parameter. This is consistent with the similar changes to the send_embedding method and aligns with the broader batch API migration.


2852-2852: Properly clearing the batch.

Using batch.clear() instead of resetting with empty braces, which is the correct approach for the new batch API.


2874-2877: Correct use of new batch API methods.

Using the batch's accessor methods (n_tokens()) and builder methods (add_text()) instead of directly accessing properties, which is consistent with the new API design.


2893-2893: Using proper batch checking.

The condition now properly uses batch.n_tokens() to check if the batch is empty, which is consistent with the new API.


3059-3060: Proper batch capacity checking.

Using batch.n_tokens() + slot.n_prompt_tokens > n_batch to check if the prompt will fit in the batch, which is the correct approach with the new API.


3083-3084: Properly adding token to batch.

Using batch.add_text() to add tokens to the batch, which is the proper method for the new batch API.


3093-3095: Consistent batch size reporting.

Using batch.n_tokens() in logging statement to report the current batch size, maintaining consistency with the new API.


3100-3100: Proper batch assertion.

Using batch.n_tokens() > 0 in the assertion, which is the correct way to check for a non-empty batch with the new API.


3110-3110: Using new batch API for output management.

Calling llama_batch_ext_set_output_last(batch.get()) correctly sets up the batch for output processing with the new API.


3113-3115: Properly setting batch index.

Setting slot.i_batch = batch.n_tokens() - 1 to correctly identify the position of the last token in the batch, which is important for retrieving outputs.


3119-3125: Proper batch size checking.

Using batch.n_tokens() >= n_batch and batch.n_tokens() == 0 for batch processing control flow, which is consistent with the new API.


3130-3130: Improved logging with batch size.

Using batch.n_tokens() in the debug log message to properly report the batch size being processed.


3140-3145: Proper batch processing with views.

Using the new batch view functionality (llama_batch_ext_get_view) to process subsets of the batch efficiently. This is an important feature of the new batch API that allows for better batch management.


4146-4149: Warning about unsupported pooling type

The code correctly checks for and rejects requests with pooling type 'none' since this isn't yet supported, as indicated by the commented-out implementation in send_embedding.

Consider implementing support for the 'none' pooling type to make the feature complete, or provide more detailed error messages to help users understand the limitation.


3280-3284: Speculative batch initialization using new API.

The code properly initializes and adds tokens to the speculative batch using the new batch API methods like clear() and add_text(). This maintains consistency with the rest of the codebase.


3287-3288: Improved debug logging for batch processing.

The debug message now correctly reports the size of the speculative batch using batch_spec.n_tokens() which is consistent with the new API.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
include/llama-cpp.h (1)

37-60: 🛠️ Refactor suggestion

Consider checking for negative return values.
Within init_from_text, no error checks are performed on llama_batch_ext_add_text. A negative return indicates a failure to add the token, which is silently ignored.

 for (int32_t i = 0; i < n_tokens; i++) {
-    llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+    int32_t ret = llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+    if (ret < 0) {
+        // Handle error or propagate it, e.g.:
+        return llama_batch_ext_ptr(nullptr);
+    }
 }
🧹 Nitpick comments (4)
include/llama-cpp.h (2)

62-84: Validate initialization result.
In init_from_embd, if llama_batch_ext_init_from_embd fails, it might return a null pointer or an invalid batch. Consider adding error checks or fallback handling.


104-114: Warn about ignoring the last returned output_id in add_seq.
Inside the loop, you overwrite output_id without checking for error codes. If any call fails, it remains undetected.

examples/embedding/embedding.cpp (1)

29-70: Consider early return on encode/decode failure.
Currently, if llama_encode_ext or llama_decode_ext fails, the error is logged, but the function proceeds with embedding extraction. This can lead to unexpected results.

 if (llama_encode_ext(ctx, batch) < 0) {
     LOG_ERR("%s : failed to encode\n", __func__);
+    return; // or handle gracefully
 }
examples/retrieval/retrieval.cpp (1)

77-105: Handle encode/decode failures more robustly.
As in the embedding example, logging an error may not suffice. Consider returning or skipping remaining logic to avoid invalid state use.
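
A sketch of the early-return pattern suggested here (the encoder/decoder selection mirrors the llama_model_has_encoder / llama_model_has_decoder checks referenced above; the logging is illustrative):

#include "llama.h"
#include <cstdio>

// Stop before reading embeddings when the model call fails, instead of
// continuing with an invalid state.
static bool run_batch_checked(llama_context * ctx, llama_batch_ext * batch, bool use_encoder) {
    const int ret = use_encoder ? llama_encode_ext(ctx, batch)
                                : llama_decode_ext(ctx, batch);
    if (ret < 0) {
        fprintf(stderr, "%s failed with code %d\n",
                use_encoder ? "llama_encode_ext" : "llama_decode_ext", ret);
        return false;
    }
    return true;
}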

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50fb396 and 8ec0ff9.

📒 Files selected for processing (3)
  • examples/embedding/embedding.cpp (3 hunks)
  • examples/retrieval/retrieval.cpp (5 hunks)
  • include/llama-cpp.h (2 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
examples/retrieval/retrieval.cpp (5)
include/llama.h (4)
  • llama_get_model (480-480)
  • llama_batch_ext_get_n_tokens (944-944)
  • llama_model_has_encoder (533-533)
  • llama_model_has_decoder (536-536)
src/llama-context.cpp (2)
  • llama_get_model (2373-2375)
  • llama_get_model (2373-2373)
src/llama-batch.cpp (2)
  • llama_batch_ext_get_n_tokens (398-400)
  • llama_batch_ext_get_n_tokens (398-398)
src/llama-model.cpp (4)
  • llama_model_has_encoder (12139-12145)
  • llama_model_has_encoder (12139-12139)
  • llama_model_has_decoder (12147-12152)
  • llama_model_has_decoder (12147-12147)
common/common.h (1)
  • common_embd_normalize (636-636)
examples/embedding/embedding.cpp (4)
src/llama-context.cpp (4)
  • pooling_type (823-825)
  • pooling_type (823-823)
  • llama_pooling_type (2385-2387)
  • llama_pooling_type (2385-2385)
src/llama-batch.cpp (2)
  • llama_batch_ext_get_n_tokens (398-400)
  • llama_batch_ext_get_n_tokens (398-398)
src/llama-model.cpp (4)
  • llama_model_has_encoder (12139-12145)
  • llama_model_has_encoder (12139-12139)
  • llama_model_has_decoder (12147-12152)
  • llama_model_has_decoder (12147-12147)
common/common.cpp (2)
  • common_embd_normalize (1795-1829)
  • common_embd_normalize (1795-1795)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: windows-latest-cmake (msvc-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-win...
  • GitHub Check: windows-2019-cmake-cuda (12.4, cuda)
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server (Release)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server-windows
🔇 Additional comments (13)
include/llama-cpp.h (4)

8-8: No issues with the new include.
The addition of <vector> is appropriate for managing dynamic arrays.


28-30: Proper unique_ptr deleter setup.
The llama_batch_ext_deleter correctly invokes llama_batch_ext_free and ensures clean resource deallocation.


95-102: Single-sequence add_text: same concern about error handling.
Similarly, consider handling a potentially negative return value from llama_batch_ext_add_text.


116-123: Clear and n_tokens usage looks good.
Both methods wrap their underlying C API nicely. No immediate issues found.

examples/embedding/embedding.cpp (4)

163-163: Initialization of batch with context seems fine.
This change follows the new extended batch structure design.


190-195: Counter is always incremented by zero after clearing.
After calling batch_decode and then batch.clear(), batch.n_tokens() is guaranteed to be zero, so line 194 always increments e by 0 on the NONE pooling path. Capture the token count before clearing if that count was meant to advance e.


199-199: Sequence addition appears correct.
You set the sequence ID as s to distinguish each prompt in the batch.


204-204: Final decode invocation aligns with new API usage.
Decoding after adding remaining tokens ensures embeddings are produced.

examples/retrieval/retrieval.cpp (5)

209-209: Good: adopting new extended batch for retrieval.
No problems spotted.


225-227: Flow for capacity check is correct.
When the batch is full, you decode first and then clear. This is valid.


235-235: Sequence addition looks straightforward.
Token list is added with a unique sequence ID.


240-240: Decode at final batch boundary.
Decoding here ensures the last chunk is processed before storing embeddings.


249-249: Separate query batch.
Initializing a new batch for queries makes sense for the retrieval process.

Comment on lines 86 to 93
// Wrapper to add a single token to the batch, support multiple sequence IDs
int32_t add_text(llama_token token, llama_pos pos, const std::vector<llama_seq_id> & seq_id, bool output_last) {
    int32_t output_id = llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
    if (output_last) {
        llama_batch_ext_set_output_last(this->get());
    }
    return output_id;
}

🛠️ Refactor suggestion

Check for add_text failure.
Although you store output_id, you do not verify if the addition was successful. A negative return indicates an error.

Comment on lines +258 to +263
    batch.add_seq(query_tokens, 0, 0, true);

    std::vector<float> query_emb(n_embd, 0);
-   batch_decode(ctx, query_batch, query_emb.data(), 1, n_embd);
+   batch_decode(ctx, query_batch.get(), query_emb.data(), 1, n_embd);

-   common_batch_clear(query_batch);
+   query_batch.clear();

⚠️ Potential issue

Critical mismatch between “batch” vs. “query_batch”.
You add query tokens to batch (line 258) but then decode query_batch (line 261), which remains empty. This results in incorrect or empty embeddings.

- batch.add_seq(query_tokens, 0, 0, true);
+ query_batch.add_seq(query_tokens, 0, 0, true);

 batch_decode(ctx, query_batch.get(), query_emb.data(), 1, n_embd);

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
include/llama-cpp.h (4)

42-61: ⚠️ Potential issue

Missing error handling for batch token addition

The implementation doesn't check the return value of llama_batch_ext_add_text(), which can return negative values to indicate errors (-1 if batch size is exceeded, -2 if embeddings are already set).

 static llama_batch_ext_ptr init_from_text(struct llama_context * ctx,
                                                   llama_token * tokens,
                                                       int32_t   n_tokens,
                                                     llama_pos   pos0,
                                                  llama_seq_id   seq_id,
                                                          bool   output_last) {
     llama_batch_ext * batch = llama_batch_ext_init(ctx);
     for (int32_t i = 0; i < n_tokens; i++) {
-        llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+        int32_t rc = llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
+        if (rc < 0) {
+            // Handle error: free batch and return empty pointer or throw exception
+            llama_batch_ext_free(batch);
+            return llama_batch_ext_ptr(nullptr);
+        }
     }
     if (output_last) {
         // TODO: somehow return the output ID
         llama_batch_ext_set_output_last(batch);
     }
     return llama_batch_ext_ptr(batch);
 }

87-95: ⚠️ Potential issue

Error handling missing in add_text with multiple sequence IDs

The method doesn't check the return value from llama_batch_ext_add_text(), which could lead to silent failures.

int32_t add_text(llama_token token, llama_pos pos, const std::vector<llama_seq_id> & seq_id, bool output_last) {
    int32_t output_id = -1;
-    llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
+    int32_t rc = llama_batch_ext_add_text(this->get(), token, pos, seq_id.data(), seq_id.size(), false);
+    if (rc < 0) {
+        return rc; // Return error code
+    }
    if (output_last) {
        output_id = llama_batch_ext_set_output_last(this->get());
    }
    return output_id;
}

97-105: ⚠️ Potential issue

Error handling missing in add_text with single sequence ID

Similar to the multiple sequence ID version, this method doesn't check the return value from llama_batch_ext_add_text().

int32_t add_text(llama_token token, llama_pos pos, llama_seq_id seq_id, bool output_last) {
    int32_t output_id = -1;
-    llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
+    int32_t rc = llama_batch_ext_add_text(this->get(), token, pos, &seq_id, 1, false);
+    if (rc < 0) {
+        return rc; // Return error code
+    }
    if (output_last) {
        output_id = llama_batch_ext_set_output_last(this->get());
    }
    return output_id;
}

107-117: ⚠️ Potential issue

Error handling missing in add_seq method

The method doesn't check the return value from llama_batch_ext_add_text() calls, which could lead to silent failures.

int32_t add_seq(std::vector<llama_token> & tokens, llama_pos pos0, llama_seq_id seq_id, bool output_last) {
    int32_t output_id = -1;
    for (size_t i = 0; i < tokens.size(); i++) {
-        llama_batch_ext_add_text(this->get(), tokens[i], pos0 + i, &seq_id, 1, false);
+        int32_t rc = llama_batch_ext_add_text(this->get(), tokens[i], pos0 + i, &seq_id, 1, false);
+        if (rc < 0) {
+            return rc; // Return error code and stop processing
+        }
    }
    if (output_last) {
        output_id = llama_batch_ext_set_output_last(this->get());
    }
    return output_id;
}
🧹 Nitpick comments (1)
include/llama-cpp.h (1)

57-57: Address TODO comment about returning output ID

The TODO comment indicates an incomplete implementation. The output_id from llama_batch_ext_set_output_last() should be captured and returned to the caller.

 static llama_batch_ext_ptr init_from_text(struct llama_context * ctx,
                                                   llama_token * tokens,
                                                       int32_t   n_tokens,
                                                     llama_pos   pos0,
                                                  llama_seq_id   seq_id,
-                                                          bool   output_last) {
+                                                          bool   output_last,
+                                                       int32_t * p_output_id = nullptr) {
     llama_batch_ext * batch = llama_batch_ext_init(ctx);
     for (int32_t i = 0; i < n_tokens; i++) {
         llama_batch_ext_add_text(batch, tokens[i], pos0 + i, &seq_id, 1, false);
     }
     if (output_last) {
-        // TODO: somehow return the output ID
-        llama_batch_ext_set_output_last(batch);
+        int32_t output_id = llama_batch_ext_set_output_last(batch);
+        if (p_output_id != nullptr) {
+            *p_output_id = output_id;
+        }
     }
     return llama_batch_ext_ptr(batch);
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ec0ff9 and c1f4a78.

📒 Files selected for processing (1)
  • include/llama-cpp.h (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
include/llama-cpp.h (1)
include/llama.h (6)
  • llama_batch_ext_init (926-926)
  • llama_batch_ext_add_text (951-957)
  • llama_batch_ext_set_output_last (973-973)
  • llama_batch_ext_init_from_embd (935-941)
  • llama_batch_ext_clear (983-983)
  • llama_batch_ext_get_n_tokens (944-944)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: android-build
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: server-windows
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
🔇 Additional comments (5)
include/llama-cpp.h (5)

28-30: LGTM: Clear consistent deleter implementation

The deleter implementation follows the established pattern in the file, consistent with other resource management classes.


37-41: Constructor implementations look good

The constructors properly initialize the base unique_ptr class with appropriate parameters, following C++ best practices.


63-71: LGTM: Clean wrapper for embeddings initialization

The implementation properly wraps the C API function and handles memory management through the smart pointer.


73-85: Good position handling implementation

The method correctly generates sequential positions starting from pos0, which aligns with the migration pattern requirements.


119-125: LGTM: Simple utility wrappers

The clear() and n_tokens() methods are clean wrappers around their C API counterparts.
