DeepSeek API Guide - Chat-Deep.ai

This guide explains the current official DeepSeek API surface for developers. As of April 26, 2026, the official API exposes two V4 model IDs: deepseek-v4-flash and deepseek-v4-pro. It supports an OpenAI-compatible chat-completions format at https://api.deepseek.com and an Anthropic-compatible format at https://api.deepseek.com/anthropic.

Last verified against official DeepSeek public documentation: April 26, 2026.

V4 update: DeepSeek-V4 Preview is now live and available through DeepSeek Chat and the API. DeepSeek states that deepseek-chat and deepseek-reasoner are legacy compatibility names that currently route to deepseek-v4-flash non-thinking and thinking modes, and will be retired after July 24, 2026, 15:59 UTC. New integrations should use deepseek-v4-flash or deepseek-v4-pro.

Quickstart (5–10 Minutes)

The DeepSeek API is compatible with both the OpenAI and Anthropic SDK ecosystems. For OpenAI-style chat completions, set the base URL to https://api.deepseek.com. For Anthropic-style integrations, use https://api.deepseek.com/anthropic.

Important: Use the official DeepSeek API docs as the source of truth for current behavior. Beta features such as Chat Prefix Completion and strict tool schemas may require https://api.deepseek.com/beta instead of the default OpenAI-format base URL.

Current Official API Snapshot

OpenAI-format base URL: https://api.deepseek.com
Anthropic-format base URL: https://api.deepseek.com/anthropic
Current API model IDs: deepseek-v4-flash and deepseek-v4-pro
Legacy compatibility names: deepseek-chat and deepseek-reasoner, scheduled for retirement after July 24, 2026
Current model family: DeepSeek-V4 Preview
Context length: 1M tokens
Maximum output: 384K tokens
Thinking mode: Both V4 API models support thinking and non-thinking modes; thinking mode is enabled by default unless disabled.
Core features: JSON Output, Tool Calls, Chat Prefix Completion (Beta), and FIM Completion (Beta). FIM is available in non-thinking mode only.
Official pricing source: Use the official DeepSeek Models & Pricing page for the latest public API prices.

To begin, create an API key on the official DeepSeek Platform and store it securely in an environment variable. DeepSeek bills usage against account balance, and the API also offers GET /models and GET /user/balance to sanity-check your integration before shipping.

For current token rates, cache-hit/cache-miss billing, promotions, and deduction rules, use the official DeepSeek Models & Pricing page. DeepSeek states that product prices may vary, so the official pricing page should be treated as the source of truth.

Basic steps:

Set the base URL: Use https://api.deepseek.com for normal OpenAI-format production requests.
Authenticate: Send your API key as a Bearer token in the Authorization header.
Call the main endpoint: For chat interactions, use POST /chat/completions.
Provide the required body: At minimum, send a valid model and messages array.
Choose the mode: Use thinking: {"type": "enabled"} for reasoning or thinking: {"type": "disabled"} for non-thinking responses.
Add optional controls only when needed: Common additions include stream, reasoning_effort, response_format, tools, and tool_choice.

Minimal cURL Request

export DEEPSEEK_API_KEY="sk-YourDeepSeekAPIKey"

curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${DEEPSEEK_API_KEY}" \
  -d '{
        "model": "deepseek-v4-flash",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello!"}
        ],
        "thinking": {"type": "disabled"},
        "stream": false
      }'

This example uses deepseek-v4-flash in non-thinking mode for a fast everyday chat response. For more difficult reasoning, coding, or agentic workflows, switch to deepseek-v4-pro and enable thinking mode.

Minimal Python Request

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the DeepSeek API in one paragraph."},
    ],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    stream=False,
)

print(response.choices[0].message.content)

Tip: When using the OpenAI SDK, pass DeepSeek-specific body fields such as thinking through extra_body. The official quickstart shows deepseek-v4-pro with reasoning_effort="high" and thinking enabled for reasoning-oriented examples.

Optional Sanity Checks

# List currently available model IDs
curl https://api.deepseek.com/models \
  -H "Authorization: Bearer ${DEEPSEEK_API_KEY}"

# Check whether your balance is available for API calls
curl https://api.deepseek.com/user/balance \
  -H "Authorization: Bearer ${DEEPSEEK_API_KEY}"

The official /models response currently lists deepseek-v4-flash and deepseek-v4-pro. Before you scale usage, estimate token volume, monitor cache-hit and cache-miss usage, and verify current rates on the official DeepSeek Models & Pricing page.

Authentication

Every DeepSeek API request must include an Authorization: Bearer YOUR_API_KEY header. Replace YOUR_API_KEY with the secret key created in your DeepSeek Platform dashboard.

Authorization: Bearer YOUR_API_KEY

DeepSeek’s API reference describes the authentication scheme as HTTP Bearer Auth. The quick-start error list separates authentication problems from billing problems: a wrong or missing key leads to 401, while exhausted balance leads to 402.

Common Authentication and Billing Issues

401 Authentication Fails: The key is missing, malformed, incorrect, or no longer valid.
402 Insufficient Balance: Your account balance is not sufficient for the request.
Rotated or replaced key: After generating a new key, make sure your app actually uses the updated secret.
Client-side exposure: Never expose a live API key in browser JavaScript, mobile binaries, public Git repositories, or screenshots.

If requests suddenly stop working, verify the exact Bearer header first, then confirm your platform balance. Those two checks solve many first-run integration issues.
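
The header format and the 401/402 split described above can be sketched as two small helpers. The function names build_auth_headers and diagnose_auth_status are hypothetical and exist only for illustration; the Bearer scheme and the two status codes are the only parts taken from the documentation summarized here.

```python
import os


def build_auth_headers(api_key=None):
    """Return request headers carrying the key as a Bearer token.

    Falls back to the DEEPSEEK_API_KEY environment variable.
    """
    key = (api_key or os.environ.get("DEEPSEEK_API_KEY", "")).strip()
    if not key:
        raise ValueError("DEEPSEEK_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }


def diagnose_auth_status(status_code):
    """Map the two first-run failure codes to a next step (hypothetical helper)."""
    if status_code == 401:
        return "check the API key and the exact Bearer header"
    if status_code == 402:
        return "check or top up the account balance"
    return "not an authentication or billing error"
```

Stripping whitespace from the key guards against a common copy-paste mistake, where a trailing newline in the secret produces a malformed header.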

Chat Completions

The primary DeepSeek endpoint for conversational generation is POST https://api.deepseek.com/chat/completions. This endpoint accepts a list of messages and returns the model’s next assistant response.

Endpoint and Required Fields

URL: POST https://api.deepseek.com/chat/completions
Required field 1 — model: Use deepseek-v4-flash or deepseek-v4-pro.
Required field 2 — messages: A non-empty array describing the conversation so far.

Supported message roles:

system: Global instructions for behavior, style, policy, or output format.
user: The user request.
assistant: Prior model replies that you want to keep in context.
tool: Tool results returned by your application after the model requests a tool call.

The assistant message schema also supports advanced fields such as prefix and reasoning_content for Beta prefix-completion and thinking-mode workflows. Those fields are not required for normal chat.

Core Parameters

model: Must be deepseek-v4-flash or deepseek-v4-pro for new integrations.
messages: The conversation array.
thinking: Controls thinking mode with {"type": "enabled"} or {"type": "disabled"}. The default is enabled.
reasoning_effort: In thinking mode, use high or max. DeepSeek maps low and medium to high, and xhigh to max for compatibility.
max_tokens: Maximum number of output tokens. Input plus output remains bounded by the model context length.
temperature: Default 1.0. DeepSeek’s guidance suggests 0.0 for coding/math, 1.0 for data cleaning/analysis, 1.3 for general conversation and translation, and 1.5 for creative writing.
top_p: Nucleus sampling alternative to temperature. Use one or the other in most cases, not both.
presence_penalty / frequency_penalty: Repetition-control parameters in the range -2.0 to 2.0.
stop: Up to 16 stop sequences.
stream: Enables Server-Sent Events streaming.
stream_options: Supports include_usage when stream=true.
response_format: Set {"type": "json_object"} to enable JSON Output.
tools / tool_choice: Enables Tool Calls. DeepSeek currently supports function tools and up to 128 functions.
logprobs / top_logprobs: Optional token-probability outputs for supported non-thinking requests.

Thinking mode caveat: In thinking mode, temperature, top_p, presence_penalty, and frequency_penalty are accepted for compatibility but have no effect. Treat sampling controls as non-thinking-mode controls.
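
One way to apply the caveat above is to assemble the request body in a single place and leave sampling controls out whenever thinking is enabled. The sketch below is an assumption about how an application might do that: build_chat_payload, SAMPLING_KEYS, and EFFORT_ALIASES are illustrative names, and the effort mapping simply mirrors the compatibility behavior described in this guide rather than anything the client is required to do.

```python
SAMPLING_KEYS = ("temperature", "top_p", "presence_penalty", "frequency_penalty")

# Compatibility mapping as described in this guide: low/medium -> high, xhigh -> max.
EFFORT_ALIASES = {
    "low": "high",
    "medium": "high",
    "high": "high",
    "xhigh": "max",
    "max": "max",
}


def build_chat_payload(model, messages, thinking=True, reasoning_effort=None, **sampling):
    """Assemble a /chat/completions body (illustrative helper, not an official API)."""
    payload = {
        "model": model,
        "messages": messages,
        "thinking": {"type": "enabled" if thinking else "disabled"},
    }
    if thinking:
        # Sampling controls have no effect in thinking mode, so they are left out.
        if reasoning_effort is not None:
            payload["reasoning_effort"] = EFFORT_ALIASES[reasoning_effort]
    else:
        # Only the known sampling keys are copied; other keyword arguments are ignored.
        for key in SAMPLING_KEYS:
            if key in sampling:
                payload[key] = sampling[key]
    return payload
```

With the OpenAI SDK, the thinking entry of the resulting dict would still need to travel through extra_body, as in the quickstart example.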

Model Selection — V4-Flash vs V4-Pro

DeepSeek currently offers two V4 API model IDs. Both support a 1M context window, thinking and non-thinking modes, JSON Output, Tool Calls, and Chat Prefix Completion (Beta). The practical difference is that deepseek-v4-flash is faster and more economical for everyday usage, while deepseek-v4-pro offers stronger capability for harder work. For current API prices, always use the official pricing source named in the table.

Attribute	deepseek-v4-flash	deepseek-v4-pro
Best for	Everyday chat, low-latency tasks, cost-sensitive apps, routine coding, extraction, summarization	Hard reasoning, agentic coding, complex analysis, long-context workflows, high-stakes production tasks
Model version	DeepSeek-V4-Flash	DeepSeek-V4-Pro
Parameter note from release	284B total / 13B active parameters	1.6T total / 49B active parameters
Context length	1M	1M
Maximum output	384K	384K
Thinking mode	Supported; enabled by default unless disabled	Supported; enabled by default unless disabled
JSON Output	Yes	Yes
Tool Calls	Yes	Yes
Chat Prefix Completion (Beta)	Yes	Yes
FIM Completion (Beta)	Non-thinking mode only	Non-thinking mode only
Pricing source	Official DeepSeek Models & Pricing	Official DeepSeek Models & Pricing

Legacy Names and Migration

DeepSeek keeps deepseek-chat and deepseek-reasoner as legacy compatibility aliases during the V4 transition. These aliases currently route to DeepSeek V4 models but are not the primary model IDs for current API integrations.

Legacy name	Current behavior	Current equivalent
deepseek-chat	Routes to deepseek-v4-flash in non-thinking mode	deepseek-v4-flash with thinking: {"type": "disabled"}
deepseek-reasoner	Routes to deepseek-v4-flash in thinking mode	deepseek-v4-flash or deepseek-v4-pro with thinking: {"type": "enabled"}

To use the current DeepSeek API models, update the model value to deepseek-v4-flash or deepseek-v4-pro while keeping the same API base URL and endpoint. Use the thinking parameter when you need to explicitly control reasoning behavior.
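
For codebases with many stored request bodies, the migration can be sketched as a small rewrite step. The helper below is hypothetical (migrate_payload and LEGACY_ALIASES are not part of any SDK) and encodes only the alias behavior listed in the table above; whether deepseek-v4-pro is a better target for former deepseek-reasoner traffic remains a per-application decision.

```python
# Alias behavior as summarized in the migration table of this guide.
LEGACY_ALIASES = {
    "deepseek-chat": ("deepseek-v4-flash", "disabled"),
    "deepseek-reasoner": ("deepseek-v4-flash", "enabled"),
}


def migrate_payload(payload):
    """Return a copy of a request body with any legacy model name replaced."""
    migrated = dict(payload)
    alias = LEGACY_ALIASES.get(migrated.get("model"))
    if alias is not None:
        model, thinking_type = alias
        migrated["model"] = model
        # Keep an explicit thinking setting if the caller already provided one.
        migrated.setdefault("thinking", {"type": thinking_type})
    return migrated
```

Setting thinking explicitly matters here because the V4 default is thinking enabled, so a former deepseek-chat request would otherwise change behavior after the rename.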

Thinking vs Non-Thinking Examples

# Non-thinking mode for fast everyday responses
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize this in three bullets."}],
    extra_body={"thinking": {"type": "disabled"}},
)

# Thinking mode for harder reasoning
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Analyze this architecture tradeoff."}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)

Using the Response Correctly

In standard requests, your main answer is usually response.choices[0].message.content. With thinking mode enabled, the same message may also include a reasoning_content field that holds the model's reasoning separately from the final answer in content.

Important implementation rule: In ordinary multi-turn chat without tool calls, you can keep the assistant’s final content in history and do not need to carry old reasoning_content forward. In a thinking-mode tool-call loop for the same question, DeepSeek’s current docs require you to pass the assistant message, including reasoning_content, back to the API so the model can continue reasoning across sub-turns.
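
The rule above can be expressed as a single function that decides what to append to the conversation history. This is a minimal sketch under two assumptions: the assistant message has already been converted to a plain dict, and the caller tracks whether it is still inside a tool-call loop for the current question. The name assistant_history_entry is illustrative.

```python
def assistant_history_entry(message, in_tool_loop):
    """Build the history entry for an assistant message given as a plain dict.

    in_tool_loop: True while the same question is still being resolved
    through tool calls in thinking mode.
    """
    entry = {"role": "assistant", "content": message.get("content")}
    if message.get("tool_calls"):
        entry["tool_calls"] = message["tool_calls"]
    if in_tool_loop and message.get("reasoning_content") is not None:
        # Passed back so the model can continue reasoning across sub-turns.
        entry["reasoning_content"] = message["reasoning_content"]
    return entry
```

Outside a tool loop the reasoning text is dropped, which keeps old reasoning out of later turns of an ordinary conversation.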

Example Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1718345013,
  "model": "deepseek-v4-pro",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "prompt_cache_hit_tokens": 0,
    "prompt_cache_miss_tokens": 17,
    "completion_tokens": 9,
    "completion_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 26
  }
}

DeepSeek’s official schema includes prompt_cache_hit_tokens and prompt_cache_miss_tokens so you can track caching benefits, and completion_tokens_details.reasoning_tokens so thinking-heavy generations can be inspected more precisely.

Official finish_reason values currently include stop, length, content_filter, tool_calls, and insufficient_system_resource.
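
A caller usually wants to branch on finish_reason before trusting the content. The sketch below shows one possible mapping from each listed value to a next step. The function name and the suggested actions are assumptions made for illustration, not official guidance, and the input is assumed to be the response parsed into a dict shaped like the example above.

```python
def interpret_finish_reason(response):
    """Return a short next-step label for a parsed chat.completion dict."""
    reason = response["choices"][0]["finish_reason"]
    if reason == "stop":
        return "complete"
    if reason == "length":
        return "truncated: raise max_tokens or shorten the prompt"
    if reason == "tool_calls":
        return "run the requested tools and send back tool messages"
    if reason == "content_filter":
        return "output was filtered"
    if reason == "insufficient_system_resource":
        return "interrupted by the server: retry later"
    return f"unrecognized finish_reason: {reason}"
```

Keeping a fallback branch means a value added to the API later surfaces as a visible label instead of being silently treated as success.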

JSON Output

To request structured JSON, set response_format={"type": "json_object"}. DeepSeek’s official JSON Output guide adds practical rules: include the word “json” in the prompt, show the model the schema or example you want, and set max_tokens high enough to avoid truncation.

import json

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {
            "role": "system",
            "content": "Return the answer in json with keys: answer and confidence."
        },
        {"role": "user", "content": "What is the capital of Egypt?"}
    ],
    response_format={"type": "json_object"},
    extra_body={"thinking": {"type": "disabled"}},
)

print(json.loads(response.choices[0].message.content))

Without a clear JSON instruction in the prompt, the API can appear stuck because the model may continue emitting whitespace until it reaches the token limit. DeepSeek also notes that JSON Output may occasionally return empty content, so production code should validate and retry safely.
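
Because empty or truncated content is possible, it helps to separate parsing from the retry decision. The helper below is a minimal sketch: parse_json_content is a hypothetical name, and returning None as the retry signal is a design choice made for this example, not something the API prescribes.

```python
import json


def parse_json_content(content):
    """Return the parsed object, or None when the reply should be retried."""
    if content is None or not content.strip():
        return None  # Empty content, which the guide notes can occasionally happen
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None  # For example, output truncated because max_tokens was too low
    # json_object mode is expected to yield an object, so anything else is rejected.
    return parsed if isinstance(parsed, dict) else None
```

A caller can loop a small, bounded number of times while the result is None, ideally with the backoff pattern from the retry section.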

Tool Calls

DeepSeek uses the term Tool Calls for structured function invocation. The model can decide whether to call a tool, return natural language, or continue a multi-step tool loop. The model proposes the tool call, but your application executes the function and sends the result back as a tool message.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "How is the weather in Hangzhou?"}],
    tools=tools,
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)
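
After the model proposes a call, the application has to run it and reply with a tool message. The following sketch shows that step under stated assumptions: tool calls are plain dicts (with the OpenAI SDK they are objects, so they would need converting first), get_weather is a stand-in that returns canned text, and TOOL_REGISTRY and run_tool_calls are hypothetical names.

```python
import json


def get_weather(location):
    """Stand-in implementation; a real app would call a weather service here."""
    return f"Sunny, 24 degrees in {location}"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute tool calls (given as plain dicts) and return tool messages."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            args = json.loads(call["function"]["arguments"])
            if name not in TOOL_REGISTRY or not isinstance(args, dict):
                raise ValueError(f"unsupported tool call: {name}")
            content = TOOL_REGISTRY[name](**args)
        except (ValueError, TypeError) as exc:
            # Malformed JSON, an unknown tool, or wrong argument names end up here.
            content = f"tool error: {exc}"
        results.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    return results
```

Returning the error text as the tool result, rather than raising, lets the model see what went wrong and propose a corrected call on the next sub-turn.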

If you need strict schema compliance, DeepSeek documents a strict mode (Beta). To use it, switch to base_url="https://api.deepseek.com/beta", set strict: true on each function, and follow DeepSeek’s supported JSON Schema subset. In strict mode, every object property must be listed in required, and additionalProperties should be false.
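
The two strict-mode rules named above (every property listed in required, additionalProperties set to false) are mechanical enough to apply with a helper. The sketch below is an assumption about how that might be automated; to_strict_tool is a hypothetical name, and it does not check the rest of DeepSeek's supported JSON Schema subset.

```python
import copy


def to_strict_tool(tool):
    """Return a copy of a function tool adjusted for the strict-mode rules above."""
    strict_tool = copy.deepcopy(tool)
    function = strict_tool["function"]
    function["strict"] = True
    _tighten(function.get("parameters", {}))
    return strict_tool


def _tighten(schema):
    """Recursively mark every object property required and forbid extras."""
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        schema["required"] = list(properties)
        schema["additionalProperties"] = False
        for child in properties.values():
            _tighten(child)
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        _tighten(schema["items"])
```

Note that this turns previously optional properties into required ones, so tool descriptions may need to say how the model should fill a value it does not know.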

Context Caching

DeepSeek's Context Caching on Disk is enabled by default for all users. If a later request begins with the same prefix as an earlier one, the repeated prefix can count as a cache hit. Cache-hit and cache-miss input tokens are billed differently, so verify the current rates on the official DeepSeek Models & Pricing page.

What can hit the cache: repeated prefixes such as the same system prompt, the same long document prefix, or repeated few-shot examples.
How to inspect it: check usage.prompt_cache_hit_tokens and usage.prompt_cache_miss_tokens in the response.
Important limit: the cache works on a best-effort basis and does not guarantee a 100% hit rate.
Practical pattern: put stable instructions and reusable context at the beginning of the message sequence so repeated prefixes are easier to reuse.
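
The inspection and ordering advice above can be sketched in a few lines. Both helpers are illustrative assumptions: cache_hit_ratio reads the two usage fields named in this section from a plain dict, and build_messages shows one possible way to keep stable content ahead of the part that changes between requests.

```python
def cache_hit_ratio(usage):
    """Return the fraction of prompt tokens served from cache, or None if unknown."""
    hit = usage.get("prompt_cache_hit_tokens", 0)
    miss = usage.get("prompt_cache_miss_tokens", 0)
    total = hit + miss
    return hit / total if total else None


def build_messages(stable_system_prompt, stable_context, user_question):
    """Place reusable content first so consecutive requests share a prefix."""
    return [
        {"role": "system", "content": stable_system_prompt},
        # The long, unchanging context precedes the question that varies per request.
        {"role": "user", "content": f"{stable_context}\n\n{user_question}"},
    ]
```

Logging the ratio per request makes it easy to notice when a prompt change, such as a timestamp placed at the start of the system prompt, has stopped prefixes from matching.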

Streaming

Set "stream": true to receive data-only Server-Sent Events (SSE) as the model generates output. A streaming response ends with data: [DONE].

If you set "stream_options": {"include_usage": true}, DeepSeek sends one extra chunk before [DONE] where choices is empty and usage contains totals for the full request. Parsers that index choices[0] must skip or handle that chunk.

Custom Parser Example

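import json
import os
import requests

# Request setup; the prompt text is only a placeholder
url = "https://api.deepseek.com/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"}
payload = {
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

# timeout is (connect, read) in seconds
resp = requests.post(url, headers=headers, json=payload, stream=True, timeout=(10, 300))
resp.raise_for_status()
# Decode as UTF-8 even if the response header declares no charset
resp.encoding = "utf-8"

for raw_line in resp.iter_lines(decode_unicode=True):
    if not raw_line:
        continue

    if raw_line.startswith(":"):
        # Ignore SSE keep-alive comments
        continue

    if raw_line.startswith("data: "):
        data = raw_line[len("data: "):].strip()
        if data == "[DONE]":
            break

        chunk = json.loads(data)
        if not chunk.get("choices"):
            # The include_usage chunk arrives with an empty choices list
            continue

        delta = chunk["choices"][0].get("delta", {})
        if delta.get("reasoning_content"):
            print(delta["reasoning_content"], end="", flush=True)
        elif delta.get("content"):
            print(delta["content"], end="", flush=True)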

In thinking mode, streamed chunks may contain delta.reasoning_content before final delta.content. Parse them separately if you need to inspect reasoning output distinctly from the user-facing answer.
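
To test stream handling without a live connection, the parsing logic can be written as a pure function over already-decoded lines. The sketch below is one possible shape: collect_stream is a hypothetical name, the chunk layout follows the description in this section, and the sample lines in the usage note are made up for illustration rather than captured from the API.

```python
import json


def collect_stream(lines):
    """Fold decoded SSE lines into (reasoning, answer, usage)."""
    reasoning, answer, usage = [], [], None
    for line in lines:
        if not line or line.startswith(":"):
            continue  # Blank separators and keep-alive comments
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]
        # Iterating (rather than indexing) tolerates the empty-choices usage chunk.
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            if delta.get("content"):
                answer.append(delta["content"])
    return "".join(reasoning), "".join(answer), usage
```

Feeding it a list such as [": keep-alive", 'data: {"choices": [{"delta": {"content": "Hi"}}]}', "data: [DONE]"] returns an empty reasoning string, the answer "Hi", and None for usage.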

Keep-Alive Behavior and Timeouts

DeepSeek’s rate-limit documentation states that under scheduling pressure:

Non-streaming requests may return empty lines while waiting.
Streaming requests may return : keep-alive comments while waiting.
If inference has not started after 10 minutes, the server closes the connection.

Use explicit connect/read timeouts in production, and make sure your reverse proxies, serverless runtime, or gateway layer do not kill long-running streamed responses too early.

Rate Limits & Retries

DeepSeek currently describes API rate limiting as a dynamic concurrency limit based on server load. When you reach the concurrency limit, the API immediately returns HTTP 429. The FAQ also says the exposed limit on each account is adjusted dynamically according to real-time traffic pressure and short-term historical usage.

In practice, moderate usage usually works without manual tuning, but aggressive bursts can still produce 429 responses and long waits during busy periods. DeepSeek also says it does not currently raise the dynamic limit for individual accounts and does not offer tiered plans that unlock a higher fixed cap.

Recommended Retry Pattern

Retry: 429, 500, and 503
Do not blindly retry unchanged: 400, 401, 402, and 422
Use exponential backoff with jitter: for example 1s, 2s, 4s, 8s with a small random component

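import random
import time

def call_with_retries(call_deepseek, max_attempts=5):
    # RetryableError is an application-defined exception for 429, 500, and 503
    for attempt in range(1, max_attempts + 1):
        try:
            return call_deepseek()
        except RetryableError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Exponential backoff capped at 16s, plus up to 1s of jitter
            time.sleep(min(2 ** (attempt - 1), 16) + random.uniform(0, 1))
    # Non-retryable errors (400, 401, 402, 422) propagate unchanged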

If failures look widespread rather than request-specific, check the official DeepSeek Service Status page before you keep retrying.

Error Codes & Troubleshooting

The current official DeepSeek quick-start error list includes the following API-facing codes:

Code	Official meaning	What to do
400	Invalid Format	Fix the request body according to the error message and official schema.
401	Authentication Fails	Check the API key and Bearer header.
402	Insufficient Balance	Top up the account or verify available balance.
422	Invalid Parameters	Correct unsupported or malformed parameter values.
429	Rate Limit Reached	Slow down, back off, and retry later.
500	Server Error	Retry after a brief wait.
503	Server Overloaded	Retry after a brief wait and check status if it persists.

The table above deliberately mirrors the current official error list. During day-to-day development you may still see generic HTTP errors caused by proxy, DNS, or URL path mistakes, but those are not part of DeepSeek’s current documented quick-start error table.
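
The error table and the retry guidance from the previous section can be combined into one lookup. This is a minimal sketch: classify_status and the two set names are hypothetical, and the grouping follows this guide's recommendations rather than any rule enforced by the API.

```python
RETRYABLE_STATUS = {429, 500, 503}
FATAL_STATUS = {400, 401, 402, 422}


def classify_status(status_code):
    """Return 'ok', 'retry', 'fix_request', or 'unknown' for an HTTP status code."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in RETRYABLE_STATUS:
        return "retry"
    if status_code in FATAL_STATUS:
        return "fix_request"
    # Codes outside the documented list, such as those produced by a proxy.
    return "unknown"
```

An HTTP wrapper can use the result to decide whether to raise a retryable or a fatal exception, which keeps the backoff loop itself free of status-code details.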

Anthropic API Format

DeepSeek also supports the Anthropic API ecosystem through https://api.deepseek.com/anthropic. This is useful for tools and coding agents that expect Anthropic-style messages and environment variables.

export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_API_KEY=${DEEPSEEK_API_KEY}

import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="deepseek-v4-pro",
    max_tokens=1000,
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": [{"type": "text", "text": "Hi, how are you?"}],
        }
    ],
)

print(message.content)

DeepSeek’s Anthropic API compatibility page notes that unsupported model names in the Anthropic API backend are automatically mapped to deepseek-v4-flash. For predictable production behavior, set deepseek-v4-flash or deepseek-v4-pro explicitly.

Security & Production Notes

Keep API keys server-side: Do not expose a live DeepSeek key in browser JavaScript or untrusted mobile code.
Minimize sensitive data: Send only the user content you actually need for the task, and redact personal or regulated data where possible.
Validate tool-call arguments: The model may output malformed or unsafe arguments. Validate before executing any function.
Use explicit timeouts and retries: DeepSeek requests can remain open while the platform waits for inference scheduling.
Watch balance and usage: The platform supports billing checks and usage exports by API key according to the current FAQ.
Separate hosted API from self-hosting: DeepSeek V4 is open-sourced, but self-hosting is a different deployment path from the official hosted API documented here.
Avoid legacy model IDs: Update examples, dashboards, calculators, and SDK wrappers from deepseek-chat/deepseek-reasoner to deepseek-v4-flash/deepseek-v4-pro.

Note: This guide is provided by Chat-Deep.ai as an independent reference. It summarizes the official DeepSeek API documentation, but it is not the official DeepSeek documentation itself. For production decisions, verify model names, pricing, limits, and endpoint behavior against official DeepSeek sources.