
Eval bug: Persistent <think> Tags in Qwen3-32B Output Despite enable_thinking: False and --reasoning-format none in llama.cpp #13189


Open
shyn01 opened this issue Apr 29, 2025 · 1 comment


shyn01 commented Apr 29, 2025

Name and Version

llama.cpp Version: b5218 (latest as of April 29, 2025)
Model: Qwen3-32B (4-bit quantized, GGUF format, Q4_K_M)
Hardware: Dual NVIDIA A100 (40GB VRAM each), using single GPU with -ngl 99
OS: Ubuntu 22.04
CUDA: Enabled, detected (compute capability 8.0)
Server Command:
bash

./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
API Client: Python with requests library, calling /v1/chat/completions

Operating systems

Linux

GGML backends

CUDA

Hardware

A100x2

Models

Qwen3-32B (4-bit GGUF, Q4_K_M)

Problem description & steps to reproduce

When running Qwen3-32B (4-bit GGUF, Q4_K_M) with llama.cpp, the model output consistently includes <think> tags, despite setting "extra_body": {"enable_thinking": False} in the API payload and using --reasoning-format none in the server command. According to the ModelScope documentation for Qwen3-32B (link), setting enable_thinking: False should disable the <think> block, aligning the behavior with Qwen2.5-Instruct models, but this does not work in llama.cpp.
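For reference, the switch the ModelScope/Hugging Face documentation describes is a chat-template argument rather than a sampling parameter. A minimal sketch using the transformers tokenizer (illustration only, outside my llama.cpp setup) shows what the rendered prompt is supposed to look like when thinking is disabled:
python

# Illustration only: how the Qwen3 docs apply enable_thinking via the chat template.
# Assumes the transformers package and the original (pre-conversion) model directory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/home/models/Qwen3-32B")
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # documented switch; should render an empty <think></think> block
)
print(prompt)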

Steps to Reproduce
Compile llama.cpp (b5215) with CUDA support:
bash

cmake -B build -DGGML_CUDA=ON
cmake --build build -j $(nproc)
Convert Qwen3-32B to GGUF and quantize to 4-bit (Q4_K_M):
bash

python convert_hf_to_gguf.py /home/models/Qwen3-32B --outfile /home/models/qwen3-32b-f16.gguf
./build/bin/quantize /home/models/qwen3-32b-f16.gguf /home/models/qwen3-32b-q4_k_m.gguf Q4_K_M
Start the llama.cpp server:
bash

./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
Call the API using Python:
python

import requests
import re
import json
import logging

logger = logging.getLogger(__name__)
LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

messages = [
    {"role": "system", "content": "Answer directly without thinking process or tags like <think>."},
    {"role": "user", "content": "Give me a short introduction to large language model."}
]
payload = {
    "messages": messages,
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
    "stream": True,
    "repeat_penalty": 1.5,
    "extra_body": {"enable_thinking": False}
}
response = requests.post(LLAMA_API_URL, json=payload, stream=True)
role_response = ""
for line in response.iter_lines():
    if line and line.startswith(b"data: "):
        data = line[6:].decode('utf-8')
        if data == "[DONE]": break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        logger.debug(f"Raw delta: {delta}")
        delta_cleaned = re.sub(r'<[^>]*>', '', delta)
        role_response += delta_cleaned
print(role_response.strip())
Observe the raw output containing <think> tags (e.g., <think>Analyzing...</think>Answer...).
Expected Behavior
With "extra_body": {"enable_thinking": False} and --reasoning-format none, the model output should not include tags, as specified in the Qwen3-32B documentation.
The model should behave similarly to Qwen2.5-Instruct, producing direct responses without thinking blocks.
Actual Behavior
The model output includes <think> tags in the raw response (e.g., <think>Reasoning...</think>Answer...), even with enable_thinking: False and --reasoning-format none.
Client-side regex (re.sub(r'<[^>]*>', '', delta)) successfully removes the tags, but the goal is to prevent their generation on the server side; a more robust client-side filter is sketched below for completeness.
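For reference, this is the kind of workaround I mean; strip_think is a hypothetical helper that removes the reasoning text itself and works on the accumulated response, so tags split across streamed chunks are still caught. It is only a mitigation, not the server-side fix being requested.
python

import re

# Hypothetical helper: drop <think>...</think> blocks (including their content);
# an unclosed trailing block is also removed while the stream is still in progress.
def strip_think(text: str) -> str:
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)
    return text.strip()

# Example with the kind of raw output observed above:
raw = "<think>Reasoning...</think>A large language model is a neural network trained on text."
print(strip_think(raw))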
Additional Notes
The --jinja parameter was tested and removed, but it had no impact on tag generation.
A system prompt ("Answer directly without thinking process or tags") was added, but it did not prevent tags.
The GGUF model was converted using convert_hf_to_gguf.py and quantized to Q4_K_M, with no issues during conversion (a quick check of the embedded chat-template metadata is sketched after this list).
The same issue persists when testing with llama-cli:
bash

./build/bin/llama-cli -m /home/models/qwen3-32b-q4_k_m.gguf -p "Give me a short introduction to large language model." -ngl 99
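As a sanity check on the conversion step (sketch only; this assumes llama.cpp's gguf-py Python package is installed, and the string-field layout may differ between gguf versions), the chat template embedded in the GGUF metadata can be inspected:
python

# Sketch: confirm the chat template survived conversion into the GGUF metadata.
from gguf import GGUFReader  # from llama.cpp's gguf-py package (assumed installed)

reader = GGUFReader("/home/models/qwen3-32b-q4_k_m.gguf")
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("no chat template embedded in the GGUF")
else:
    # For string fields, the last part holds the raw UTF-8 bytes of the value;
    # exact indexing may vary by gguf version.
    print(bytes(field.parts[-1]).decode("utf-8")[:500])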
Questions
Does llama.cpp fully support Qwen3-32B's enable_thinking parameter in the API payload?
Is --reasoning-format none sufficient to disable <think> tags for Qwen3-32B, or are additional parameters required?
Could the GGUF conversion process or Qwen3-32B's training data cause persistent <think> tag generation?
Are there known workarounds to prevent Qwen3-32B from generating <think> tags in llama.cpp?
Logs
Client debug log and server log excerpts (examples from a successful run) are included under "Relevant log output" below.

Please provide guidance on how to disable <think> tag generation server-side for Qwen3-32B in llama.cpp. Thank you!

First Bad Commit

No response

Relevant log output

Raw delta: '<think>Analyzing request...</think>'
Cleaned delta: 'Analyzing request...'
Final response: 'A large language model is a neural network trained on vast text data to generate human-like text.'
Server log (example from successful run):
text

[INFO] Server listening on http://0.0.0.0:7901
[INFO] Loading model '/home/models/qwen3-32b-q4_k_m.gguf'

celsowm commented Apr 29, 2025

take a look: #13160
