Name and Version
llama.cpp Version: b5218 (latest as of April 29, 2025)
Model: Qwen3-32B (4-bit quantized, GGUF format, Q4_K_M)
Hardware: Dual NVIDIA A100 (40 GB VRAM each), using a single GPU with -ngl 99
OS: Ubuntu 22.04
CUDA: enabled and detected (compute capability 8.0)
Server Command:
bash
./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
API Client: Python with the requests library, calling /v1/chat/completions
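For reference, the chat template the server actually loaded can be checked for the enable_thinking logic; the sketch below assumes this llama-server build exposes a GET /props endpoint that reports the active template (field names may differ between builds):
python
import requests

# Sketch: ask the running server which chat template it loaded.
# Assumes GET /props exists in this build and includes a "chat_template" field;
# adjust the URL/field name if it is reported differently.
props = requests.get("http://localhost:7901/props", timeout=10).json()
template = props.get("chat_template", "")
print("enable_thinking referenced in template:", "enable_thinking" in template)
print(template[:300])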
Operating systems
Linux
GGML backends
CUDA
Hardware
A100x2
Models
Qwen3-32B (4-bit GGUF, Q4_K_M)
Problem description & steps to reproduce
When running Qwen3-32B (4-bit GGUF, Q4_K_M) with llama.cpp, the model output consistently includes <think>...</think> tags, despite setting "extra_body": {"enable_thinking": False} in the API payload and using --reasoning-format none in the server command. According to the ModelScope documentation for Qwen3-32B (link), setting enable_thinking: False should disable the <think> tags, aligning the behavior with Qwen2.5-Instruct models, but this does not work in llama.cpp.
Steps to Reproduce
Compile llama.cpp (b5215) with CUDA support:
bash
cmake -B build -DGGML_CUDA=ON
cmake --build build -j $(nproc)
Convert Qwen3-32B to GGUF and quantize to 4-bit (Q4_K_M):
bash
python convert_hf_to_gguf.py /home/models/Qwen3-32B --outfile /home/models/qwen3-32b-f16.gguf
./build/bin/quantize /home/models/qwen3-32b-f16.gguf /home/models/qwen3-32b-q4_k_m.gguf Q4_K_M
Start the llama.cpp server:
bash
./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
Call the API using Python:
python
import requests
import re
import json
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

messages = [
    {"role": "system", "content": "Answer directly without thinking process or tags like <think>."},
    {"role": "user", "content": "Give me a short introduction to large language model."}
]
payload = {
    "messages": messages,
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
    "stream": True,
    "repeat_penalty": 1.5,
    "extra_body": {"enable_thinking": False}
}

response = requests.post(LLAMA_API_URL, json=payload, stream=True)
role_response = ""
for line in response.iter_lines():
    if line and line.startswith(b"data: "):
        data = line[6:].decode("utf-8")
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        logger.debug(f"Raw delta: {delta}")
        # Strip tag markers such as <think> and </think> from each delta.
        delta_cleaned = re.sub(r"<[^>]*>", "", delta)
        role_response += delta_cleaned
print(role_response.strip())
Observe the raw output containing <think> tags (e.g., <think>Analyzing...</think>Answer...). A non-streaming variant of the request, useful for inspecting the full raw output, is sketched after these steps.
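For debugging, a non-streaming variant of the same request makes it easier to see the full raw output (including any <think> block) in one piece; nothing beyond the OpenAI-style /v1/chat/completions usage already shown above is assumed:
python
import requests

LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

payload = {
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language model."}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": False,  # return the whole completion at once instead of SSE chunks
    "extra_body": {"enable_thinking": False},
}
resp = requests.post(LLAMA_API_URL, json=payload, timeout=120)
print(repr(resp.json()["choices"][0]["message"]["content"]))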
Expected Behavior
With "extra_body": {"enable_thinking": False} and --reasoning-format none, the model output should not include tags, as specified in the Qwen3-32B documentation.
The model should behave similarly to Qwen2.5-Instruct, producing direct responses without thinking blocks.
Actual Behavior
The model output includes <think> tags in the raw response (e.g., <think>Reasoning...</think>Answer...), even with enable_thinking: False and --reasoning-format none.
Client-side regex (re.sub(r'<[^>]*>', '', delta)) removes the tag markers (the reasoning text between them still comes through, as the debug log shows), but the goal is to prevent their generation on the server side; a stop-gap filter that drops the whole block is sketched below.
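As a stop-gap until the server-side behavior is clarified, the sketch below strips the entire <think>...</think> block (tags and the reasoning text between them) from the accumulated response instead of only the tag markers. strip_think_blocks is a hypothetical helper name; it assumes the tags arrive as literal text in the content, and it does not stop the thinking tokens from being generated or counted against max_tokens:
python
import re

def strip_think_blocks(text: str) -> str:
    """Remove <think>...</think> blocks, including their contents, from an
    accumulated response. Stop-gap only; the tokens are still generated."""
    # Drop complete blocks first, then any unterminated trailing block.
    text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
    return re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL).strip()

# Usage with the client above: accumulate the raw deltas into role_response,
# then print(strip_think_blocks(role_response)) instead of cleaning per delta.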
Additional Notes
The --jinja flag was tested and then removed, but it had no impact on <think> tag generation.
A system prompt ("Answer directly without thinking process or tags") was added, but it did not prevent the tags (a /no_think soft-switch variant is sketched after these notes).
The GGUF model was converted using convert_hf_to_gguf.py and quantized to Q4_K_M, with no issues during conversion.
The same issue persists when testing with llama-cli:
bash
./build/bin/llama-cli -m /home/models/qwen3-32b-q4_k_m.gguf -p "Give me a short introduction to large language model." -ngl 99
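One workaround worth trying is Qwen3's documented /no_think soft switch appended to the user message; whether this llama.cpp build applies the chat template in a way that honors it is an assumption, so the payload below is a sketch rather than a confirmed fix:
python
import requests

LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

# Sketch: Qwen3's soft switch puts /no_think at the end of the user turn.
# Assumes the server renders these messages through the model's chat template.
payload_no_think = {
    "messages": [
        {"role": "user",
         "content": "Give me a short introduction to large language model. /no_think"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": False,
}
resp = requests.post(LLAMA_API_URL, json=payload_no_think, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])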
Questions
Does llama.cpp fully support Qwen3-32B's enable_thinking parameter in the API payload?
Is --reasoning-format none sufficient to disable <think> tags for Qwen3-32B, or are additional parameters required?
Could the GGUF conversion process or Qwen3-32B's training data cause persistent <think> tag generation?
Are there known workarounds to prevent Qwen3-32B from generating <think> tags in llama.cpp? (One candidate, based on how the HF chat template handles enable_thinking, is sketched below.)
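Regarding workarounds: the Hugging Face Qwen3 chat template reportedly implements enable_thinking: False by starting the assistant turn with an empty <think>\n\n</think>\n\n block. The sketch below imitates that by building the ChatML prompt by hand and sending it to llama-server's raw /completion endpoint, bypassing the chat template; the exact prompt layout is my assumption and should be checked against the template embedded in the GGUF:
python
import requests

COMPLETION_URL = "http://localhost:7901/completion"  # llama-server raw completion endpoint

# Hand-built ChatML prompt ending with the empty think block that the HF
# template reportedly inserts when enable_thinking=False (assumed layout).
prompt = (
    "<|im_start|>user\n"
    "Give me a short introduction to large language model.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "<think>\n\n</think>\n\n"
)
resp = requests.post(COMPLETION_URL, json={
    "prompt": prompt,
    "n_predict": 100,
    "temperature": 0.7,
    "stop": ["<|im_end|>"],
}, timeout=120)
print(resp.json()["content"])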
Logs
Client debug log (example from a successful run):
text
Raw delta: '<think>Analyzing request...</think>'
Cleaned delta: 'Analyzing request...'
Final response: 'A large language model is a neural network trained on vast text data to generate human-like text.'
Server log (example from a successful run):
text
[INFO] Server listening on http://0.0.0.0:7901
[INFO] Loading model '/home/models/qwen3-32b-q4_k_m.gguf'
Please provide guidance on how to disable <think> tag generation server-side for Qwen3-32B in llama.cpp. Thank you!
First Bad Commit
No response
Relevant log output
No response