How to disable thinking mode?
I deployed the model with llama.cpp (image ghcr.io/ggml-org/llama.cpp:server-cuda13).
I tried to disable thinking mode, but without success.
Request body:
{
  "model": "qwopus-27b-q4km",
  "messages": [
    {
      "role": "user",
      "content": "Explain in 2 short paragraphs what a reverse proxy is."
    }
  ],
  "chat_template_kwargs": {
    "enable_thinking": false
  },
  "stream": false
}
Response:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A reverse proxy is a server that sits between clients and one or more backend servers, acting as an intermediary that forwards client requests to the appropriate backend server. When a user makes a request to a website or application, the reverse proxy receives it first, determines which backend server should handle it, and then forwards the request. The backend server processes the request and sends the response back to the reverse proxy, which then delivers it to the client. From the client's perspective, they only interact with the reverse proxy, remaining unaware of the actual backend servers.\n\nReverse proxies provide several key benefits including load balancing by distributing traffic across multiple servers, caching frequently accessed content to improve performance, SSL/TLS termination to handle encryption at one point, and security by hiding the internal network structure from external clients. They also enable features like request filtering, compression, and the ability to present a unified interface for multiple backend services running on different servers or ports.",
        "reasoning_content": "Let me think about how to approach this coding problem.\n\nProblem: Explain in 2 short paragraphs what a reverse proxy is.\n\nApproach: 1\n\nLet me implement this solution.\n"
      }
    }
  ],
  "created": 1775039984,
  "model": "qwopus-27b-q4km",
  "system_fingerprint": "b8606-825eb91a6",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 230,
    "prompt_tokens": 23,
    "total_tokens": 253,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  },
  "id": "chatcmpl-52nFPA7e9RnxTXvE78NjiVBWRJZbdGec",
  "timings": {
    "cache_n": 0,
    "prompt_n": 23,
    "prompt_ms": 109.052,
    "prompt_per_token_ms": 4.741391304347826,
    "prompt_per_second": 210.90855738546747,
    "predicted_n": 230,
    "predicted_ms": 5760.538,
    "predicted_per_token_ms": 25.045817391304347,
    "predicted_per_second": 39.92682627907324
  }
}
What am I doing wrong?
My deployed version is Q4_K_M, btw.
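In case anyone wants to reproduce this, here is a minimal Python sketch that sends the same request body with enable_thinking set to true and then false, and reports how much reasoning_content comes back each time. It only uses the standard library. The base URL (http://localhost:8080) and the helper names (build_payload, reasoning_of, post_chat, compare_modes) are my own assumptions for illustration, not anything from llama.cpp itself; only the /v1/chat/completions path and the request fields are taken from the post above.

```python
import json
import urllib.request

# Assumption: the llama.cpp server is reachable here; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def build_payload(prompt, enable_thinking=False, model="qwopus-27b-q4km"):
    """Build the same request body as in the post above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "stream": False,
    }


def reasoning_of(response):
    """Return reasoning_content of the first choice, or '' if it is absent."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or ""


def post_chat(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to the chat completions endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


def compare_modes(prompt, base_url=BASE_URL):
    """Send the prompt with thinking on and off and print what changes."""
    for flag in (True, False):
        response = post_chat(build_payload(prompt, flag), base_url)
        tokens = response["usage"]["completion_tokens"]
        chars = len(reasoning_of(response))
        print(f"enable_thinking={flag}: completion_tokens={tokens}, reasoning chars={chars}")
```

Calling compare_modes("Explain in 2 short paragraphs what a reverse proxy is.") against a running server should show whether the flag changes anything. If both runs report similar token counts and a non-empty reasoning_content, the kwarg is probably not reaching the chat template, or the template ignores it.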
Why would you disable it? Thinking is smarter.
I agree. Why would you want to disable thinking mode? Your reasons might give some insight.
I need to disable thinking because my GPU is an old GTX 1070.
Gosh, what a loser who doesn't have at least 4 RTX 5090s in the big massive 2026, lil bro talking about some 1070 like son ✌️