Replies: 1 comment 1 reply
You can use:

```sh
# before
llama-server -c 1024 ...

# after
llama-server -c 8192 -np 8 ...
```

`-np 8` creates 8 parallel slots, and the context is split evenly between them, so `-c 8192` keeps 1024 tokens of context per slot. The server will automatically assign each request to the slot whose cached prefix matches it best. However, generation speed is currently impacted negatively due to some implementation limitations.
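As a rough illustration of the client side, here is a minimal sketch that rotates through a few prompt prefixes against that server, assuming the default host and port and the standard `/completion` endpoint; the prefix and suffix strings are placeholders:

```sh
# Hypothetical example: rotate through a few prompt prefixes, each followed by
# a different suffix. With -np 8 the server keeps several independent KV-cache
# slots and routes each request to the slot whose cached prefix matches best.
for prefix in "PREFIX_A" "PREFIX_B" "PREFIX_C"; do
  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
          "prompt": "'"$prefix"' <your always-different suffix here>",
          "n_predict": 64,
          "cache_prompt": true
        }'
done
```

Because each request sets `cache_prompt`, the slot that served a given prefix keeps it in its KV cache, so later requests with the same prefix should, in principle, only need to evaluate the new suffix tokens.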
Hey!
Is it possible to have multiple caches of prompt prefixes?
In my case, I will have about 8 prompt prefixes that rotate all the time, which makes `cache_prompt` mostly useless. Is there a way to cache 8 variations of the prompt prefixes, while still allowing me to inject suffixes that will always be different and are not expected to be cached?
Many thanks!