Quick answer: DeepSeek Thinking Mode is an API mode where the model can return reasoning output before the final answer. As of April 28, 2026, in the DeepSeek V4 Preview API, use deepseek-v4-flash or deepseek-v4-pro, and control thinking explicitly with extra_body={"thinking": {"type": "enabled"}} or extra_body={"thinking": {"type": "disabled"}} when using the OpenAI Python SDK.
The final user-facing answer is returned in content. Reasoning output is returned separately in reasoning_content, at the same level as content. For most production apps, reasoning_content should be handled carefully, stored separately if needed, and not displayed to end users by default unless your product has a clear policy for exposing reasoning traces.
Independent disclosure: Chat-Deep.ai is an independent DeepSeek-focused guide and browser access site. Chat-Deep.ai is not affiliated with DeepSeek, DeepSeek.com, Hangzhou DeepSeek Artificial Intelligence Co., Ltd., the official DeepSeek app, the official DeepSeek API platform, OpenAI, or the OpenAI Python SDK.
This guide is written for developers who want to understand DeepSeek Thinking Mode in practical API workflows. Always verify production behavior against the official DeepSeek documentation before deploying reasoning, tool-call, or structured-output workflows.
DeepSeek API snapshot — last verified April 28, 2026
- Current API model IDs: deepseek-v4-flash and deepseek-v4-pro
- Base URL: https://api.deepseek.com
- API format: OpenAI-compatible Chat Completions
- Current API generation: DeepSeek V4 Preview
- Context length: 1M tokens
- Max output: 384K tokens
- Thinking mode: supported
- Non-thinking mode: supported
- JSON Output: supported
- Tool Calls: supported
- FIM Completion: non-thinking mode only
- Thinking default: enabled
- reasoning_effort: high or max
- Legacy aliases: deepseek-chat and deepseek-reasoner are scheduled to be fully retired and inaccessible after July 24, 2026, 15:59 UTC.
Who this guide is for
This guide is for developers building DeepSeek API workflows that need stronger reasoning, complex coding support, long-context analysis, multi-step planning, math-like reasoning, or tool planning.
It is also for teams that need to decide when to enable Thinking Mode, when to disable it, how to keep reasoning_content separate from normal UI output, and how to avoid mistakes in streaming, JSON Output, and Tool Calls.
If you only need basic Python setup, read the DeepSeek Python SDK guide. If you need a wider overview of API keys, base URLs, and model IDs, read the DeepSeek API guide.
What is DeepSeek Thinking Mode?
DeepSeek Thinking Mode is a reasoning mode where the model can generate intermediate reasoning output before producing the final answer. In the API response, that reasoning output is exposed through the reasoning_content field, while the final answer appears in content.
Thinking Mode is useful when a task benefits from deliberate reasoning: hard coding tasks, complex debugging, long-context document analysis, planning across multiple steps, tool-use decisions, and questions where a short direct answer is likely to miss important constraints.
Thinking Mode should not be treated as a universal default for every route. For simple extraction, classification, formatting, short summaries, and latency-sensitive chat, non-thinking mode is often simpler and easier to operate.
Thinking Mode vs non-thinking mode
The practical difference is how much reasoning behavior you want the model to use before it returns the final answer.
- Thinking Mode: best for complex reasoning, hard coding tasks, long-context analysis, multi-step planning, math-like reasoning, and tool planning.
- Non-thinking mode: best for simple chat, extraction, classification, formatting, short summaries, rewriting, routing, and latency-sensitive endpoints.
For production systems, set thinking explicitly instead of relying only on defaults. That makes each route easier to reason about, test, monitor, and update.
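One way to make the thinking setting explicit per route is a small lookup table that produces the request keyword arguments. The sketch below is only an illustration: the route names, the `ROUTES` table, and `build_request_kwargs` are hypothetical and not part of the DeepSeek API or the OpenAI SDK. Only the model IDs and the `thinking` object come from this guide.

```python
from typing import Any, Dict, List

# Hypothetical route table: each application route pins its model and thinking mode.
ROUTES: Dict[str, Dict[str, str]] = {
    "classify_ticket": {"model": "deepseek-v4-flash", "thinking": "disabled"},
    "debug_code": {"model": "deepseek-v4-pro", "thinking": "enabled"},
}


def build_request_kwargs(route: str, messages: List[Dict[str, str]]) -> Dict[str, Any]:
    """Return keyword arguments intended for client.chat.completions.create()."""
    config = ROUTES[route]  # A KeyError for an unknown route is intentional.
    return {
        "model": config["model"],
        "messages": messages,
        # The thinking toggle is always sent, never left to the API default.
        "extra_body": {"thinking": {"type": config["thinking"]}},
    }


kwargs = build_request_kwargs(
    "classify_ticket",
    [{"role": "user", "content": "I cannot reset my password."}],
)
print(kwargs["model"], kwargs["extra_body"])
```

Keeping the table in one place means a route's model or thinking setting can be changed and reviewed without touching the call sites.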
Which DeepSeek models support Thinking Mode?
For new DeepSeek API integrations, use the current V4 model IDs:
- deepseek-v4-flash for fast everyday workloads, lightweight reasoning, summaries, extraction, classification, and high-volume applications.
- deepseek-v4-pro for harder reasoning, complex coding, long-context analysis, multi-step planning, and higher-value production tasks.
Both current V4 models support thinking and non-thinking modes, JSON Output, and Tool Calls.
Do not use deepseek-chat or deepseek-reasoner as primary model IDs in new code. They are legacy compatibility aliases scheduled for discontinuation on 2026-07-24. During the compatibility period, deepseek-chat corresponds to DeepSeek V4 Flash non-thinking mode, while deepseek-reasoner corresponds to DeepSeek V4 Flash thinking mode.
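When migrating existing code, the alias behavior described above can be written down as a small translation step. This is a sketch under the assumption that the compatibility mapping stated in this guide still holds; `LEGACY_ALIASES` and `migrate_model_id` are hypothetical names, not part of any SDK.

```python
from typing import Optional, Tuple

# Mapping taken from the compatibility description in this guide.
# Verify it against the official documentation before relying on it.
LEGACY_ALIASES = {
    "deepseek-chat": ("deepseek-v4-flash", "disabled"),
    "deepseek-reasoner": ("deepseek-v4-flash", "enabled"),
}


def migrate_model_id(model: str) -> Tuple[str, Optional[str]]:
    """Return (model_id, thinking_type); thinking_type is None for non-legacy IDs."""
    if model in LEGACY_ALIASES:
        return LEGACY_ALIASES[model]
    return model, None


for name in ("deepseek-chat", "deepseek-reasoner", "deepseek-v4-pro"):
    print(name, "->", migrate_model_id(name))
```

A returned thinking type of None signals that the caller still has to choose the thinking setting for that route explicitly.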
How to enable or disable Thinking Mode
In the OpenAI-compatible DeepSeek API, Thinking Mode is controlled with a thinking object. When using the OpenAI Python SDK, pass that object through extra_body.
Install the Python package
    pip install openai
Set your DeepSeek API key
Use environment variables. Do not hard-code API keys in source code.
macOS or Linux
    export DEEPSEEK_API_KEY="your_api_key_here"
Windows PowerShell
    [Environment]::SetEnvironmentVariable("DEEPSEEK_API_KEY", "your_api_key_here", "User")
Basic client setup
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )
Enable thinking
    extra_body={"thinking": {"type": "enabled"}}
Disable thinking
    extra_body={"thinking": {"type": "disabled"}}
Thinking is enabled by default, but production routes should usually set it explicitly so that behavior stays predictable.
reasoning_effort explained
reasoning_effort controls how much reasoning effort the model should apply in Thinking Mode. The supported values are high and max.
- Use high for normal reasoning tasks, coding help, analysis, and tool planning.
- Use max for especially complex tasks where additional reasoning may improve the result.
In Thinking Mode, the default effort is high for regular requests. For some complex agent requests, effort may automatically be set to max. For compatibility, low and medium may map to high, while xhigh may map to max.
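If your application accepts effort values from configuration or from code written for other providers, it can help to normalize them before sending the request so that logs show the value actually intended. The sketch below mirrors the compatibility mapping described above on the client side. `EFFORT_ALIASES` and `normalize_reasoning_effort` are hypothetical helpers, and the server-side mapping itself should be confirmed in the official documentation.

```python
# Client-side mirror of the compatibility mapping described in this guide.
EFFORT_ALIASES = {
    "low": "high",
    "medium": "high",
    "high": "high",
    "xhigh": "max",
    "max": "max",
}


def normalize_reasoning_effort(value: str) -> str:
    """Map an incoming effort label to one of the two supported values."""
    key = value.strip().lower()
    if key not in EFFORT_ALIASES:
        raise ValueError(f"Unsupported reasoning_effort: {value!r}")
    return EFFORT_ALIASES[key]


print(normalize_reasoning_effort("medium"))
print(normalize_reasoning_effort("xhigh"))
```

Rejecting unknown labels early keeps a typo in configuration from silently reaching the API.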
reasoning_content vs content
In Thinking Mode, the model can return two different kinds of output:
- reasoning_content: reasoning output returned separately by the API.
- content: the final answer that should normally be shown to the end user.
For normal user-facing applications, display content, not reasoning_content. Treat reasoning_content as an API field for continuity, debugging, evaluation, or internal handling only when your product has a clear policy for it.
Minimal Python example: Thinking Mode
This example enables Thinking Mode explicitly with deepseek-v4-pro. It reads the final answer from content and keeps reasoning_content separate.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            {
                "role": "user",
                "content": "Compare two API retry strategies and explain the tradeoffs.",
            }
        ],
        reasoning_effort="high",
        max_tokens=2000,
        extra_body={"thinking": {"type": "enabled"}},
    )

    message = response.choices[0].message
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning:
        # Keep reasoning separate from normal end-user output.
        # Store, inspect, or discard it according to your product policy.
        pass

    print(message.content)
Use this pattern for tasks where the quality benefit of reasoning matters more than keeping the route as short and simple as possible.
Minimal Python example: non-thinking mode
Non-thinking mode is often better for simple chat, extraction, short summaries, classification, rewriting, and latency-sensitive routes.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": "You are a concise classification assistant."},
            {"role": "user", "content": "Classify this ticket as billing, technical, or account: I cannot reset my password."},
        ],
        max_tokens=300,
        extra_body={"thinking": {"type": "disabled"}},
    )

    print(response.choices[0].message.content)
Use non-thinking mode when the task is direct and does not need deeper multi-step reasoning.
Streaming Thinking Mode responses
In streaming Thinking Mode responses, reasoning and final answer text can arrive separately. A good streaming parser should accumulate delta.reasoning_content and delta.content in separate buffers.
For normal user interfaces, display only the final answer content by default. Do not stream reasoning text directly to users unless your product has a clear policy for doing so.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    stream = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            {
                "role": "user",
                "content": "Explain the tradeoffs between caching and retries in an API client.",
            }
        ],
        reasoning_effort="high",
        stream=True,
        stream_options={"include_usage": True},
        extra_body={"thinking": {"type": "enabled"}},
    )

    reasoning_buffer = []
    answer_buffer = []
    usage = None

    for chunk in stream:
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        reasoning_piece = getattr(delta, "reasoning_content", None)
        if reasoning_piece:
            reasoning_buffer.append(reasoning_piece)
            continue
        content_piece = getattr(delta, "content", None)
        if content_piece:
            answer_buffer.append(content_piece)
            print(content_piece, end="", flush=True)

    final_answer = "".join(answer_buffer)
    reasoning_text = "".join(reasoning_buffer)

    # Do not show reasoning_text to end users by default.
    # Use it only according to your product policy.
    if usage:
        print("\n\nUsage object received.")
The final usage-bearing chunk may have an empty choices array, so streaming code should handle that case safely.
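A streaming parser like this is easier to trust when it can be exercised without a network call. The sketch below factors the accumulation into a pure function and feeds it stand-in chunks built with `SimpleNamespace`. The names `split_stream` and `fake_chunk` are hypothetical, and the fake chunks only imitate the attributes this guide relies on (`choices`, `delta.reasoning_content`, `delta.content`), not the full SDK objects.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def split_stream(chunks: Iterable[Any]) -> Tuple[str, str]:
    """Accumulate reasoning text and answer text from streamed chunks separately."""
    reasoning_parts: List[str] = []
    answer_parts: List[str] = []
    for chunk in chunks:
        if not chunk.choices:  # For example, a usage-only final chunk.
            continue
        delta = chunk.choices[0].delta
        reasoning_piece = getattr(delta, "reasoning_content", None)
        if reasoning_piece:
            reasoning_parts.append(reasoning_piece)
        content_piece = getattr(delta, "content", None)
        if content_piece:
            answer_parts.append(content_piece)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**delta_fields: Any) -> SimpleNamespace:
    """Build a stand-in for a streamed chunk (test helper, not an SDK type)."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [
    fake_chunk(reasoning_content="Compare ", content=None),
    fake_chunk(reasoning_content="both options.", content=None),
    fake_chunk(reasoning_content=None, content="Use caching "),
    fake_chunk(reasoning_content=None, content="for reads."),
    SimpleNamespace(choices=[], usage=SimpleNamespace(total_tokens=42)),
]
reasoning_text, final_answer = split_stream(chunks)
print(final_answer)
```

The last stand-in chunk has an empty choices list, which checks the usage-only case without calling the API.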
Multi-turn conversations without tool calls
For ordinary multi-turn conversations without Tool Calls, old reasoning_content does not need to participate in the next turn’s context. Your application can keep the final assistant content and send the next user message.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    messages = [
        {"role": "user", "content": "Explain why API retries need backoff."}
    ]

    first_response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=messages,
        reasoning_effort="high",
        extra_body={"thinking": {"type": "enabled"}},
    )

    first_message = first_response.choices[0].message
    print(first_message.content)

    messages.append(
        {
            "role": "assistant",
            "content": first_message.content or "",
        }
    )
    messages.append(
        {
            "role": "user",
            "content": "Now give me a short production checklist.",
        }
    )

    second_response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=messages,
        reasoning_effort="high",
        extra_body={"thinking": {"type": "enabled"}},
    )

    print(second_response.choices[0].message.content)
This is different from Thinking Mode Tool Calls. When a thinking-mode assistant turn includes tool calls, preserve the full assistant message internally.
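The two history rules can be captured in one helper so that call sites do not have to remember which case applies. This is a minimal sketch that assumes the assistant message has already been converted to a plain dict (for example with `model_dump(exclude_none=True)`). `to_history_entry` is a hypothetical name.

```python
from typing import Any, Dict


def to_history_entry(assistant_message: Dict[str, Any]) -> Dict[str, Any]:
    """Decide what to carry forward from an assistant message into the next request."""
    if assistant_message.get("tool_calls"):
        # Tool-call turn: keep the whole message, including reasoning_content.
        return dict(assistant_message)
    # Ordinary turn: only the final answer needs to go into the next request.
    return {"role": "assistant", "content": assistant_message.get("content") or ""}


plain_turn = {
    "role": "assistant",
    "content": "Backoff spreads retries out over time.",
    "reasoning_content": "internal reasoning text",
}
tool_turn = {
    "role": "assistant",
    "content": "",
    "reasoning_content": "internal reasoning text",
    "tool_calls": [{"id": "call_1", "type": "function"}],
}
print(to_history_entry(plain_turn))
print(sorted(to_history_entry(tool_turn)))
```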
Tool Calls in Thinking Mode
DeepSeek Tool Calls can be used in Thinking Mode. This is useful when the model needs to reason, request external information, continue reasoning, and then produce a final answer.
The important rule is stricter than normal multi-turn chat: if a thinking-mode turn includes tool calls, preserve and pass back the full assistant message, including reasoning_content, content, and tool_calls where present. Stripping required reasoning_content in a thinking-mode tool-call loop can cause a 400-level error.
    import json
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    def lookup_order_status(order_id: str) -> dict:
        if not isinstance(order_id, str) or not order_id.startswith("ORD-"):
            raise ValueError("Invalid order_id.")
        return {
            "order_id": order_id,
            "status": "processing",
            "estimated_ship_date": "tomorrow",
        }

    tools = [
        {
            "type": "function",
            "function": {
                "name": "lookup_order_status",
                "description": "Look up the current status of a customer order by order ID.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {
                            "type": "string",
                            "description": "The order ID, for example ORD-12345.",
                        }
                    },
                    "required": ["order_id"],
                },
            },
        }
    ]

    messages = [
        {
            "role": "user",
            "content": "Check order ORD-12345 and explain what the customer should expect next.",
        }
    ]

    for _ in range(4):
        response = client.chat.completions.create(
            model="deepseek-v4-pro",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            reasoning_effort="high",
            extra_body={"thinking": {"type": "enabled"}},
        )
        assistant_message = response.choices[0].message
        # Preserve the full assistant message internally.
        # In thinking-mode tool-call loops, this may include reasoning_content,
        # content, and tool_calls.
        messages.append(assistant_message.model_dump(exclude_none=True))
        if not assistant_message.tool_calls:
            print(assistant_message.content)
            break
        for call in assistant_message.tool_calls:
            if call.function.name != "lookup_order_status":
                raise ValueError(f"Unsupported tool requested: {call.function.name}")
            try:
                arguments = json.loads(call.function.arguments)
            except json.JSONDecodeError as exc:
                raise ValueError("Tool arguments were not valid JSON.") from exc
            order_id = arguments.get("order_id")
            result = lookup_order_status(order_id)
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": json.dumps(result),
                }
            )
    else:
        raise RuntimeError("Tool loop reached the maximum number of rounds.")
The model requests a tool call, but it does not execute the function. Your application validates the arguments, runs the function, appends the tool result with the correct tool_call_id, and sends the next request.
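The validation step can be kept separate from the loop so it is testable on its own. The sketch below is one possible shape, not a required pattern. `parse_order_arguments` and `run_tool_safely` are hypothetical helpers, the `ORD-` plus digits format is an assumption based on the example ID in this guide, and returning validation errors to the model as a tool result (instead of raising) is a design choice you may not want in every application.

```python
import json
import re

# Assumed ID format, inferred from the example "ORD-12345" used in this guide.
ORDER_ID_PATTERN = re.compile(r"ORD-\d+")


def parse_order_arguments(raw_arguments: str) -> str:
    """Validate model-supplied tool arguments and return the order ID."""
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError("Tool arguments were not valid JSON.") from exc
    if not isinstance(arguments, dict):
        raise ValueError("Tool arguments must be a JSON object.")
    order_id = arguments.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        raise ValueError("order_id must look like ORD-12345.")
    return order_id


def run_tool_safely(raw_arguments: str) -> str:
    """Return the JSON string for the tool message, reporting validation errors."""
    try:
        order_id = parse_order_arguments(raw_arguments)
    except ValueError as exc:
        return json.dumps({"error": str(exc)})
    # Placeholder result; a real application would query its order system here.
    return json.dumps({"order_id": order_id, "status": "processing"})


print(run_tool_safely('{"order_id": "ORD-12345"}'))
print(run_tool_safely('{"order_id": 12345}'))
```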
JSON Output in Thinking Mode
DeepSeek JSON Output uses response_format={"type": "json_object"}. The prompt should explicitly mention json, provide an example JSON shape, and set max_tokens reasonably to reduce truncation risk.
For simple extraction, non-thinking JSON Output may be simpler. Use Thinking Mode with JSON Output only when the structured result depends on harder reasoning.
    import json
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    system_prompt = """
    You analyze technical support tickets and return json.
    Return only a valid json object with this shape:
    {
      "category": "billing | technical | account | other",
      "priority": "low | medium | high",
      "summary": "short summary"
    }
    """

    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": "The dashboard export fails every time I try to download a weekly report.",
            },
        ],
        response_format={"type": "json_object"},
        reasoning_effort="high",
        max_tokens=800,
        extra_body={"thinking": {"type": "enabled"}},
    )

    choice = response.choices[0]
    content = choice.message.content or ""

    if choice.finish_reason == "length":
        raise RuntimeError("The JSON response may have been truncated. Increase max_tokens or shorten the prompt.")
    if not content.strip():
        raise RuntimeError("The model returned empty content. Make the json instruction more explicit.")

    data = json.loads(content)
    required_keys = {"category", "priority", "summary"}
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"Missing required keys: {missing}")

    print(data)
JSON Output improves parseability, but your application should still validate required keys, allowed values, and field types before trusting the result.
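A fuller check of allowed values and field types might look like the following. This is a sketch for the ticket shape used in the example prompt only. `ALLOWED_VALUES` and `validate_ticket` are hypothetical names, and a schema library could replace this hand-written check.

```python
import json
from typing import Any, Dict

# Allowed values copied from the example JSON shape in the system prompt.
ALLOWED_VALUES = {
    "category": {"billing", "technical", "account", "other"},
    "priority": {"low", "medium", "high"},
}


def validate_ticket(data: Any) -> Dict[str, str]:
    """Check keys, types, and allowed values; return a cleaned dict."""
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object.")
    cleaned: Dict[str, str] = {}
    for key, allowed in ALLOWED_VALUES.items():
        value = data.get(key)
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}, got {value!r}.")
        cleaned[key] = value
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("summary must be a non-empty string.")
    cleaned["summary"] = summary.strip()
    return cleaned


sample = json.loads('{"category": "technical", "priority": "high", "summary": " Export fails "}')
print(validate_ticket(sample))
```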
Parameters that do not affect Thinking Mode
In Thinking Mode, these parameters do not affect output even if they are passed for compatibility:
- temperature
- top_p
- presence_penalty
- frequency_penalty
Do not tune Thinking Mode behavior by changing these parameters. Use the thinking toggle, reasoning_effort, prompt design, model selection, and route-level product logic instead.
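To keep request logs honest about which settings matter, a route can drop these parameters before sending a thinking-mode request. The sketch below is one way to do that. `drop_ignored_params` is a hypothetical helper, and it treats a missing thinking toggle as thinking mode because this guide states that thinking is enabled by default.

```python
from typing import Any, Dict, List, Tuple

IGNORED_IN_THINKING_MODE = ("temperature", "top_p", "presence_penalty", "frequency_penalty")


def drop_ignored_params(request_kwargs: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (cleaned kwargs, names of dropped parameters)."""
    extra_body = request_kwargs.get("extra_body") or {}
    thinking_type = (extra_body.get("thinking") or {}).get("type")
    if thinking_type == "disabled":
        # Non-thinking mode: leave the sampling parameters untouched.
        return dict(request_kwargs), []
    dropped = [name for name in IGNORED_IN_THINKING_MODE if name in request_kwargs]
    cleaned = {key: value for key, value in request_kwargs.items() if key not in dropped}
    return cleaned, dropped


cleaned, dropped = drop_ignored_params(
    {
        "model": "deepseek-v4-pro",
        "temperature": 0.2,
        "top_p": 0.9,
        "extra_body": {"thinking": {"type": "enabled"}},
    }
)
print(dropped)
```

Logging the dropped names makes it visible when someone tries to tune a thinking route with settings that have no effect.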
FIM Completion and Thinking Mode
FIM Completion is documented as non-thinking mode only. If you are building code completion or fill-in-the-middle workflows, treat that route separately from Thinking Mode routes.
For coding assistants, this means you may use Thinking Mode for harder code reasoning, debugging, and architecture questions, while using non-thinking mode for FIM-style completion workflows where supported.
Token usage, context caching, and cost control without prices
Thinking Mode can increase output length and token usage because the model may produce reasoning output as well as the final answer. Tool-call loops can also add extra assistant and tool messages to the conversation.
    usage = response.usage
    if usage:
        print("Prompt tokens:", getattr(usage, "prompt_tokens", None))
        print("Completion tokens:", getattr(usage, "completion_tokens", None))
        print("Total tokens:", getattr(usage, "total_tokens", None))
        print("Prompt cache hit tokens:", getattr(usage, "prompt_cache_hit_tokens", None))
        print("Prompt cache miss tokens:", getattr(usage, "prompt_cache_miss_tokens", None))
        completion_details = getattr(usage, "completion_tokens_details", None)
        if completion_details:
            print("Reasoning tokens:", getattr(completion_details, "reasoning_tokens", None))
Context Caching is enabled by default and does not require a code change. It can help repeated-prefix workloads such as stable system prompts, repeated tool definitions, long shared instructions, and repeated document context.
Because DeepSeek API pricing can change, this guide does not copy token prices. Check the official DeepSeek pricing page and Chat-Deep.ai’s pricing guide before making billing decisions.
Cost-control habits without copying prices
- Disable thinking for simple classification, formatting, and extraction routes.
- Use deepseek-v4-flash where speed and volume matter more than deeper reasoning.
- Use deepseek-v4-pro for harder reasoning routes where quality matters more.
- Set route-specific max_tokens values.
- Keep tool results compact.
- Trim or summarize old conversation turns.
- Log token usage by route, model, thinking setting, and feature flag.
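The last habit in the list above can start as a small in-memory aggregator before it becomes real metrics. This is a minimal sketch: `usage_by_route` and `record_usage` are hypothetical names, the usage object is imitated with `SimpleNamespace`, and a production system would send these counts to its own logging or metrics backend.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any

# Totals keyed by (route, model, thinking setting).
usage_by_route = defaultdict(
    lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0}
)


def record_usage(route: str, model: str, thinking: str, usage: Any) -> None:
    """Add one response's token counts to the running totals for its route."""
    bucket = usage_by_route[(route, model, thinking)]
    bucket["requests"] += 1
    bucket["prompt_tokens"] += getattr(usage, "prompt_tokens", 0) or 0
    bucket["completion_tokens"] += getattr(usage, "completion_tokens", 0) or 0


# Stand-in for response.usage so the example runs without an API call.
fake_usage = SimpleNamespace(prompt_tokens=120, completion_tokens=30)
record_usage("classify_ticket", "deepseek-v4-flash", "disabled", fake_usage)
record_usage("classify_ticket", "deepseek-v4-flash", "disabled", fake_usage)
print(dict(usage_by_route))
```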
Error handling and debugging
Thinking Mode workflows can fail because of invalid request formatting, missing API keys, invalid parameters, rate limits, server errors, malformed JSON Output instructions, or stripped reasoning_content in tool-call loops.
Retry temporary issues carefully. Do not blindly retry invalid requests, authentication failures, account-balance problems, or invalid parameters without fixing the underlying issue.
    import os
    import time
    from openai import APIStatusError, APITimeoutError, OpenAI

    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
        timeout=60,
        max_retries=0,
    )

    RETRY_STATUS_CODES = {429, 500, 503}

    def create_completion_with_retry(payload: dict, max_attempts: int = 3):
        delay_seconds = 2
        for attempt in range(1, max_attempts + 1):
            try:
                return client.chat.completions.create(**payload)
            except APITimeoutError:
                if attempt == max_attempts:
                    raise
            except APIStatusError as exc:
                if exc.status_code not in RETRY_STATUS_CODES:
                    raise
                if attempt == max_attempts:
                    raise
            time.sleep(delay_seconds)
            delay_seconds *= 2
        raise RuntimeError("Request failed after retries.")

    payload = {
        "model": "deepseek-v4-pro",
        "messages": [
            {"role": "user", "content": "Explain how retry backoff works."}
        ],
        "reasoning_effort": "high",
        "extra_body": {"thinking": {"type": "enabled"}},
    }

    response = create_completion_with_retry(payload)
    print(response.choices[0].message.content)
Thinking Mode debugging checklist
- Missing thinking toggle: set extra_body={"thinking": {"type": "enabled"}} or extra_body={"thinking": {"type": "disabled"}} explicitly.
- Wrong model names: use deepseek-v4-flash or deepseek-v4-pro for new code.
- Temperature tuning has no effect: temperature and top_p do not affect Thinking Mode output.
- Streaming parser misses reasoning: handle delta.reasoning_content separately from delta.content.
- UI displays reasoning unexpectedly: show content by default and keep reasoning_content separate.
- Tool-call loop fails: preserve the full assistant message, including reasoning_content, in thinking-mode tool-call loops.
- FIM route fails: FIM Completion is non-thinking mode only.
- JSON Output fails: include the word json, provide an example JSON shape, and set max_tokens reasonably.
Security, privacy, and UI guidance for reasoning_content
reasoning_content is not the same as the final answer. It can be useful for API continuity, controlled debugging, evaluation, and advanced internal workflows, but it should be handled carefully.
- Display content as the normal user-facing answer.
- Do not show reasoning_content to end users by default.
- Do not log sensitive user data unnecessarily.
- Apply your normal data-retention policy to reasoning traces if you store them.
- Keep reasoning traces separate from public UI output.
- Preserve reasoning_content internally when required for thinking-mode tool-call loops.
- Do not treat reasoning text as a source of truth; validate important claims and tool arguments separately.
For most applications, the safest default is simple: use reasoning_content only where the API workflow requires it, and show users the final content.
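In code, that default can be enforced by building the user-facing payload and the internal record through separate functions, so reasoning text cannot reach the UI by accident. The sketch below assumes a message object with `content` and an optional `reasoning_content` attribute. `to_public_payload`, `to_internal_record`, and the `store_reasoning` flag are hypothetical and stand in for whatever policy your product defines.

```python
from types import SimpleNamespace
from typing import Any, Dict


def to_public_payload(message: Any) -> Dict[str, str]:
    """Build what the UI receives: the final answer only."""
    return {"answer": getattr(message, "content", None) or ""}


def to_internal_record(message: Any, store_reasoning: bool = False) -> Dict[str, Any]:
    """Build an internal record; reasoning text is kept only when policy allows."""
    reasoning = getattr(message, "reasoning_content", None)
    record: Dict[str, Any] = {
        "answer": getattr(message, "content", None) or "",
        "had_reasoning": bool(reasoning),
    }
    if store_reasoning and reasoning:
        record["reasoning"] = reasoning
    return record


# Stand-in message so the example runs without an API call.
fake_message = SimpleNamespace(content="Use backoff.", reasoning_content="internal reasoning text")
print(to_public_payload(fake_message))
print(to_internal_record(fake_message))
```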
Common mistakes
- Using legacy aliases as primary model IDs: use deepseek-v4-flash or deepseek-v4-pro in new integrations.
- Relying on defaults in production: set thinking explicitly per route.
- Displaying reasoning by default: show content, not reasoning_content.
- Expecting temperature to tune thinking output: temperature and top_p do not affect Thinking Mode output.
- Ignoring streaming reasoning chunks: parse delta.reasoning_content and delta.content separately.
- Stripping reasoning in tool-call loops: preserve the full assistant message internally when Tool Calls are involved.
- Using Thinking Mode for every task: disable it for simple extraction, classification, and formatting routes where deeper reasoning is unnecessary.
- Using FIM while thinking is enabled: FIM Completion is non-thinking mode only.
- Copying prices into evergreen documentation: link to the pricing pages instead of hard-coding values.
When this guide is not the right page
This page focuses on DeepSeek Thinking Mode. Use a more specific page if your goal is different:
- For basic Python setup, read the DeepSeek Python SDK guide.
- For API keys, base URLs, and model overview, read the DeepSeek API guide.
- For current V4 model details, read the DeepSeek V4 guide.
- For function calling, read the DeepSeek Tool Calls guide.
- For structured responses, read the DeepSeek JSON Output guide.
- For cache behavior, read the DeepSeek Context Caching guide.
- For token accounting, read the DeepSeek Token Usage guide.
- For troubleshooting, read the DeepSeek Error Codes guide.
- For migration from OpenAI-style code, read the OpenAI SDK to DeepSeek guide.
- For JavaScript and TypeScript, read the DeepSeek Node.js TypeScript guide.
FAQ
What is DeepSeek Thinking Mode?
DeepSeek Thinking Mode is a reasoning mode where the model can produce reasoning output before the final answer. The reasoning output is returned in reasoning_content, while the final answer is returned in content.
Is DeepSeek Thinking Mode enabled by default?
Yes. Thinking Mode defaults to enabled, but production applications should set it explicitly with the thinking parameter so each route behaves predictably.
How do I disable Thinking Mode?
With the OpenAI Python SDK, pass extra_body={"thinking": {"type": "disabled"}} in the Chat Completions request.
Which DeepSeek models support Thinking Mode?
The current DeepSeek V4 API models deepseek-v4-flash and deepseek-v4-pro support Thinking Mode and non-thinking mode.
Should I use deepseek-v4-flash or deepseek-v4-pro for thinking?
Use deepseek-v4-flash for faster everyday reasoning and high-volume workloads. Use deepseek-v4-pro for harder reasoning, complex coding, long-context analysis, and multi-step planning.
What is reasoning_content?
reasoning_content is the API field that contains reasoning output in Thinking Mode. It is separate from content, which contains the final answer.
Should I show reasoning_content to users?
Usually no. For normal user-facing apps, show content and keep reasoning_content separate unless your product has a clear and safe policy for exposing reasoning traces.
What does reasoning_effort do?
reasoning_effort controls the reasoning effort used in Thinking Mode. Supported values are high and max.
Do temperature and top_p work in Thinking Mode?
No. In Thinking Mode, temperature, top_p, presence_penalty, and frequency_penalty do not affect output even if they are passed.
Can I stream reasoning_content?
Yes. Streaming responses can include delta.reasoning_content and delta.content separately. Your parser should keep them in separate buffers.
Can I use Tool Calls in Thinking Mode?
Yes. Tool Calls are supported in Thinking Mode. The model can request function calls, but your application must validate arguments, execute the function, and return the tool result.
Do I need to pass reasoning_content back in tool-call loops?
Yes. In thinking-mode tool-call loops, preserve and pass back the full assistant message internally, including reasoning_content where present. Stripping it can cause a 400-level error.
Can I use JSON Output in Thinking Mode?
Yes. Use response_format={"type": "json_object"}, include the word json in the prompt, provide an example shape, set max_tokens reasonably, and validate the parsed result.
Is FIM Completion supported in Thinking Mode?
No. FIM Completion is documented as non-thinking mode only.
Should I still use deepseek-chat or deepseek-reasoner?
For new code, use deepseek-v4-flash or deepseek-v4-pro. deepseek-chat and deepseek-reasoner are legacy compatibility aliases scheduled for discontinuation on 2026-07-24.
Where can I check DeepSeek API pricing?
Because DeepSeek API pricing can change, this guide does not copy token prices. Check the official DeepSeek pricing page and Chat-Deep.ai’s pricing guide before making billing decisions.
