DeepSeek Coder

DeepSeek-Coder is an open-source series of large language models (LLMs) specialized for programming and code-related tasks. Developed by DeepSeek AI, these models are trained on a massive code-centric corpus to excel at code generation, completion, and understanding across many programming languages. DeepSeek-Coder models range from billions to hundreds of billions of parameters and report some of the strongest results among open models on standard coding benchmarks.

They support long context windows (enabling entire project files as input) and include instruction-tuned variants for improved usability. All DeepSeek-Coder models are released under a permissive license that allows free research and commercial use. The models can be accessed via the DeepSeek platform (API and web interface) – see the DeepSeek API documentation for integration details.

Architecture and Training Corpus

DeepSeek-Coder uses a decoder-only Transformer architecture (similar to GPT-style models) configured for enhanced long-context handling and code infilling. The first-generation DeepSeek-Coder models (1.3B, 6.7B, and 33B parameters) were trained from scratch on an extremely large code-focused dataset (~2 trillion tokens). About 87% of the pre-training corpus is source code from 87 programming languages, with the remainder being natural language (10% English technical content from sources like GitHub README/StackExchange and 3% Chinese).

Notably, the training data was collected at the repository level with deduplication and filtering to ensure high-quality, multi-file code context. During pre-training, the model learns through standard next-token prediction and a Fill-in-the-Middle (FIM) objective (approximately 50% of the time) to improve its ability to generate code in the middle of files and handle code insertion tasks. The base models support a context window of up to 16,384 tokens (16K) thanks to RoPE positional encoding and other optimizations, allowing them to consider long code files or multiple files as input.
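For illustration, here is a minimal sketch of how a FIM prompt is assembled. The sentinel strings below follow the format shown on the DeepSeek-Coder model cards, but they can vary between releases, so verify them against the tokenizer's special tokens for the exact checkpoint you use:

```python
# Sketch of assembling a Fill-in-the-Middle (FIM) prompt for an
# FIM-trained code model. Sentinel strings are as shown on the
# DeepSeek-Coder model cards; check your checkpoint's tokenizer.
FIM_BEGIN = "<｜fim▁begin｜>"
FIM_HOLE = "<｜fim▁hole｜>"
FIM_END = "<｜fim▁end｜>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix/suffix so the model generates the missing middle."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

# Example: ask the model to fill in the partitioning step of quicksort.
prefix = "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n"
suffix = "\n    return quicksort(left) + mid + quicksort(right)\n"
prompt = build_fim_prompt(prefix, suffix)
```

Generating from this prompt with a base, FIM-trained checkpoint yields the missing middle, which can then be spliced back between the prefix and suffix.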

After pre-training the base models, DeepSeek AI fine-tuned them on ~2 billion tokens of instructional data to create DeepSeek-Coder-Instruct variants. These instruction-tuned models are optimized to follow natural language prompts and provide helpful code solutions, making them more suitable for interactive use (similar to “chat” or assistant behavior). They were tuned using supervised instruction fine-tuning on high-quality Q&A, code instructions, and explanations. Users can prompt the instruct models in natural language or with code comments to get code completions, explanations, or debugging help.

Second-Generation (V2) Updates: In mid-2024, DeepSeek introduced DeepSeek-Coder-V2, which pushes the architecture further with a Mixture-of-Experts (MoE) design for greater scale. DeepSeek-Coder-V2 is initialized from a DeepSeek-V2 general LLM checkpoint and then further pre-trained on an additional 6 trillion tokens of code and math data, substantially improving coding performance (especially on math-heavy coding tasks) while maintaining strong general language capability. The V2 models expand support from 87 up to 338 programming languages and extend the context length from 16K to 128K tokens for handling even larger codebases.

Two MoE model sizes are offered in V2: a “Lite” 16B model (with 2.4B active parameters per token) and a full 236B model (21B active params), each available in Base and Instruct forms. Despite the enormous total parameter count, the MoE architecture means only a fraction of the experts are active at a time, keeping inference tractable. These V2 models demonstrate strong performance on standard code benchmarks, with results reported alongside advanced closed models in the official evaluations. (For details, see the full DeepSeek-Coder benchmark report and our [model benchmark page] which comprehensively document architecture and results.)
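To make the active-versus-total distinction concrete, here is a toy top-k gating sketch in plain Python. The expert count and k below are illustrative only and do not reflect DeepSeek's actual router configuration:

```python
import math

# Toy top-k expert routing: per-token compute in an MoE layer tracks the
# *active* experts, not the total parameter count. Numbers are illustrative.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(gate_logits, k=2):
    """Pick the top-k experts for a token and renormalize their weights."""
    probs = softmax(gate_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in topk)
    return {i: probs[i] / total for i in topk}

# 8 hypothetical experts; only 2 are consulted for this token,
# so only their parameters participate in the forward pass.
weights = route([0.1, 2.3, -1.0, 0.7, 1.9, -0.5, 0.2, 0.0], k=2)
active_fraction = len(weights) / 8
```

The same principle is why the 236B model only needs compute proportional to its ~21B active parameters per token, even though all experts must still fit in memory.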

Model Variants and Sizes

DeepSeek-Coder is available in multiple model configurations to suit different needs. Below is a summary of the available variants, model sizes, and context lengths:

DeepSeek-Coder-Base (v1 series): Pre-trained foundation models for code. Available in 1.3B, 6.7B, and 33B parameter versions. These support up to 16K tokens context. They are trained on diverse code (87 languages) and provide strong coding capabilities out-of-the-box, but they respond literally to prompts (not instruction-tuned). Used for raw code completion, generation given some prefix, or as base models for further fine-tuning.

DeepSeek-Coder-Instruct (v1 series): Instruction-tuned versions of the above base models (1.3B, 6.7B, 33B). These have the same architecture and context length (16K) but have been further fine-tuned on instruction-following data. They are better at understanding natural language prompts, following user instructions, and producing helpful responses (e.g. explaining code, following a spec) in a zero-shot setting. For example, DeepSeek-Coder-Instruct-33B was shown to outperform OpenAI’s GPT-3.5 Turbo on code generation tasks like HumanEval after this fine-tuning.

DeepSeek-Coder-v1.5 (7B Instruct): (Experimental) An intermediate model in the v1 family that integrates general LLM training for improved breadth. This ~6.9B model was initialized from a DeepSeek-LLM 7B checkpoint and then further pre-trained on a mixture of natural language, code, and math (reported as roughly 2 trillion additional tokens). The goal was to enhance the base 6.7B model’s understanding of natural language without sacrificing coding ability. DeepSeek-Coder-v1.5 retains coding performance similar to the original 6.7B model but with better general comprehension. (This model is also instruction-tuned and has a 16K context, but it is less commonly used now that v2 is available.)

DeepSeek-Coder-V2-Lite (16B MoE): Second-generation code model using a Mixture-of-Experts transformer architecture. It has 16 billion total parameters with ~2.4B active per token (effectively comparable to a 2.4B dense model per inference step). Despite the “Lite” name, it delivers very strong performance due to continued training on 6T tokens. Context length is 128K tokens, far exceeding most open models, which enables reading large code repositories or lengthy notebooks in one go. Released in both Base and Instruct versions. The Instruct variant is recommended for most use cases as it is tuned to follow prompts. This 16B-Lite model is relatively resource-friendly (it can run on a single high-end GPU in bfloat16 precision) while still matching or outperforming much larger dense models on code tasks.

DeepSeek-Coder-V2 (236B MoE): A flagship large MoE model with 236 billion total parameters (with ~21B active parameters on each token prediction). It also supports a 128K context window. This model represents one of the most powerful open-source code-focused LLMs available. In evaluations, DeepSeek-Coder-V2 has demonstrated performance on par with or exceeding top closed-source models in several coding and mathematical reasoning benchmarks.

However, it requires very high computational resources to run (the official recommendation is 8×80GB GPUs for full inference). For most users, the 16B-Lite model may offer a better balance of cost and performance, but the 236B model is available for research and applications that demand maximum accuracy. Both Base and Instruct variants of the 236B are provided. (In practice, the instruct-tuned 236B model currently represents the peak accuracy for code generation in the DeepSeek suite.)

All DeepSeek-Coder models use transformer decoder networks with improved attention mechanisms (FlashAttention, multi-query attention in some variants) and Rotary positional embeddings for long context. They are designed to handle code completion, code infilling (thanks to FIM training, one can insert code into a placeholder in the middle of a file), multi-language coding (covering mainstream languages like Python, Java, C++, JS, as well as less common ones), and even some mathematical reasoning within code. The weights for these models are openly available (e.g. on [Hugging Face] and via the DeepSeek website), and developers can also use them through DeepSeek’s cloud API or self-host them. (For a hands-on example, see the quick-start code below using the Hugging Face Transformers library.)

Performance and Benchmarks

DeepSeek-Coder has been extensively evaluated on standard code generation benchmarks, with strong results reported across multiple tasks and model sizes in the official evaluations.

Notably, the largest DeepSeek-Coder-Base models significantly outperform previous open code LLMs like CodeLlama and StarCoder, and even approach or exceed certain closed-source models on code tasks. Key benchmark results include:

HumanEval (Python) – a benchmark measuring functional correctness on 164 Python coding problems using pass@1 accuracy. In the reported evaluations, DeepSeek-Coder-Base-33B achieves results in the mid-50% range on HumanEval, outperforming several other open code models under the same evaluation setup. Smaller variants are also competitive relative to their size, and the instruction-tuned versions further improve HumanEval performance.

MBPP (Mostly Basic Python Problems) – a benchmark of 500 simple Python tasks. DeepSeek-Coder-Base-33B reaches 66% pass@1, versus ~55% for CodeLlama-34B. The 6.7B base model gets 60.6%, again outperforming CodeLlama-34B. The 33B instruct model scores ~70%, comparable to GPT-3.5 (which is ~71% on MBPP).

Multi-Language Code Generation: DeepSeek-Coder’s strength is not limited to Python. The team evaluated the models on a multilingual HumanEval suite (problems translated into C++, Java, JavaScript, Go, etc.). The 33B model averaged 50.3% accuracy across 8 languages, outperforming CodeLlama-34B’s 41.0% average. This demonstrates robust multi-language coding ability, likely due to the diverse training in 80+ languages. Instruct tuning further improved results in several languages.

DS-1000 – DeepSeek introduced a new benchmark of 1,000 data science coding tasks (across libraries like NumPy, Pandas, PyTorch, etc.) to test practical code generation in realistic scenarios. On this DS-1000 benchmark, DeepSeek-Coder-Base-33B achieved ~40.2% overall pass@1, significantly outperforming CodeLlama (34B scored ~34.3%). The model showed competence across all tested libraries, indicating an ability to generate code that uses popular frameworks correctly.

Competition-Level Coding (LeetCode) – On a set of competitive programming problems drawn from recent LeetCode-style contests, DeepSeek-Coder-Instruct-33B achieves strong pass@1 results in the official evaluation, placing it among the leading open-source code models on challenging algorithmic problems. For reference, GPT-3.5 (without chain-of-thought prompting) scored 23.3% and GPT-4 roughly 40% on the same test. This suggests DeepSeek-Coder can tackle complex algorithmic problems better than open models of similar size, though a gap to GPT-4 remains on the hardest tasks.

To summarize, DeepSeek-Coder demonstrates strong accuracy on coding tasks in standard evaluations, with results reported alongside other leading open models. Larger variants show competitive performance relative to their parameter size, reflecting the impact of large-scale training and long-context support. Instruction-tuned versions further improve performance on code generation benchmarks, as reported in the official evaluations.
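The pass@1 metric used throughout these benchmarks is typically computed with the unbiased pass@k estimator introduced alongside HumanEval; a minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval benchmark:
    n = samples generated per problem, c = samples passing the unit tests."""
    if n - c < k:
        return 1.0  # too few failures left for a draw of k to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the raw pass rate.
single_sample = pass_at_k(n=1, c=1, k=1)
```

The per-problem estimates are then averaged over the benchmark's problem set to get the reported percentage.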

(For detailed benchmark numbers and comparisons, refer to the [DeepSeek-Coder technical report] and the [full benchmark report] for all metrics.)

In general, users can expect DeepSeek-Coder to produce correct and efficient code for many tasks, often rivaling the output of top-tier closed models in quality. Moreover, the long context window allows it to incorporate extensive code context (e.g. multiple files or long functions) when generating solutions, which many other models cannot do as effectively. For use cases focused on programming – such as generating code from specifications, completing a partially written program, translating code between languages, or providing fix suggestions – DeepSeek-Coder provides cutting-edge performance among open-source options.

Known Limitations

While DeepSeek-Coder is a powerful coding assistant, it has several limitations to be aware of:

Possible Inaccuracies and Bugs: Like all large language models, DeepSeek-Coder can sometimes produce incomplete or incorrect code, or make logic mistakes that a human programmer wouldn’t. It does not guarantee that the generated code is bug-free or optimal. The model may miss edge cases or use deprecated methods if such patterns appeared in training data. Human review and testing of generated code is essential, especially for critical software. Users should carefully verify outputs before deploying them.

Non-local Error Handling: The model may struggle to detect or reason about errors that span multiple files or distant parts of a codebase. For instance, it might not catch a subtle bug that arises from the interaction of code in different modules. Training on repository-level context helps, but it is not infallible. The model can also be overly confident (producing false positives) when given very short prompts, or conversely miss relevant details when dealing with extremely long inputs.

Long-Context Performance: Although the model supports very long inputs (16K in v1, 128K in v2), its effective use of extremely long context may degrade for very lengthy prompts. The attention mechanisms make it capable of handling long code, but in practice ultra-long contexts (tens of thousands of tokens) can introduce performance bottlenecks and some drop in accuracy or consistency. Prompting strategies (like focusing the model on the most relevant parts of the code) may be needed to get the best results on huge projects.
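As a concrete example of such a prompting strategy, the sketch below ranks files by crude keyword overlap with the task and keeps only those that fit a token budget. The file contents, the scoring heuristic, and the 4-characters-per-token estimate are all illustrative assumptions, not DeepSeek tooling:

```python
import re

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token for code); use the model's
    # real tokenizer when you need an exact budget.
    return max(1, len(text) // 4)

def select_context(task: str, files: dict, budget: int = 12000) -> list:
    """Rank files by keyword overlap with the task description and keep
    the most relevant ones that still fit within the token budget."""
    keywords = set(re.findall(r"\w+", task.lower()))
    def score(body: str) -> int:
        return len(keywords & set(re.findall(r"\w+", body.lower())))
    ranked = sorted(files.items(), key=lambda kv: score(kv[1]), reverse=True)
    chosen, used = [], 0
    for name, body in ranked:
        cost = approx_tokens(body)
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen

# Hypothetical repo: the auth module is most relevant to the task.
files = {
    "ui.py": "render the button layout and theme colors",
    "auth.py": "def verify_password(user, password): check the stored password hash",
}
picked = select_context("fix the password verification bug", files)
```

The selected files can then be concatenated (most relevant first) into the prompt, keeping the model focused on the parts of the project that matter.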

General Knowledge and Reasoning: DeepSeek-Coder is primarily trained on code and related technical text. It has some general natural language understanding, but it may be less knowledgeable about non-coding topics compared to DeepSeek’s general models or OpenAI’s models. For example, it might not perform as well on open-ended Q&A about history or creative writing. The instruct variants have improved language skills, but this specialization means trade-offs. DeepSeek’s own analysis suggests that combining a strong general base (DeepSeek-LLM) with code training leads to the best results – which is the approach they took for v1.5 and v2 – but pure code models can still occasionally misinterpret instructions that fall outside the coding domain. Users whose tasks mix coding with extensive non-coding dialogue might consider hybrid approaches (or using DeepSeek’s general model alongside Coder).

Lack of Reinforcement Fine-Tuning for Safety: The open release of DeepSeek-Coder did not involve RLHF on human feedback for alignment (beyond the supervised instruction fine-tune). As a result, the model may not have comprehensive safety guardrails. It might produce insecure coding suggestions (e.g. using outdated cryptography, if such patterns were in training data) or follow instructions too literally even if they could be harmful. Users should therefore apply their own safety and security checks, for instance when asking the model to generate code that could affect security or when integrating it into end-user applications. DeepSeek-Coder can also reflect biases present in its training data: although it is primarily trained on code, comments and documentation may carry stereotypical assumptions that could surface in outputs. These are common issues with LLMs.

Resource Requirements: The larger DeepSeek-Coder models (e.g. 33B, and especially the MoE 236B) are resource-intensive. Running the 33B model for inference typically requires a GPU with ~40GB memory (or 2×24GB GPUs with sharding), and the 236B model requires multi-GPU server setups. Even the 6.7B and 16B models, while smaller, benefit from GPU acceleration for reasonable latency. This might limit deployment on edge devices or small-scale environments. For many users, the solution is to use DeepSeek’s hosted API or to use quantization techniques to run models on lower-end hardware.
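A back-of-the-envelope way to reason about these requirements is weights-only memory: parameter count times bytes per parameter. The sketch below compares bf16 with 4-bit quantization for the 33B model; note that real deployments need additional headroom for activations, the KV cache, and framework overhead:

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone. Activations,
    KV cache, and framework overhead all add more on top."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Illustrative: the 33B model in bf16 vs. 4-bit quantization.
bf16_gb = weight_memory_gb(33.0, 16)  # weights alone already need ~66 GB
int4_gb = weight_memory_gb(33.0, 4)   # 4-bit quantization cuts this to ~16.5 GB
```

This is why the 33B model calls for a ~40GB-class GPU even with sharding, and why quantization is the usual route to running it on consumer hardware.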

Despite these limitations, DeepSeek-Coder remains an extremely useful tool. Many of its weaknesses (e.g. occasional errors, need for human oversight) are shared by all current code-generation models. The DeepSeek team and community are actively researching improvements, such as better handling of multi-file projects and integrating static analysis tools to catch errors in the model’s output. Future versions may incorporate semantic error feedback or other techniques to mitigate these issues. Users of DeepSeek-Coder should keep the model’s constraints in mind and use it as an assistant that augments human developers, rather than a fully autonomous coder.

Licensing

DeepSeek-Coder is released under a permissive open-source license, which allows unrestricted commercial use of the model. The code repository for DeepSeek-Coder is under an MIT License, and the model weights are provided under the DeepSeek Model License which explicitly permits usage, distribution, and modification for both research and commercial purposes. In other words, individuals and organizations can freely integrate DeepSeek-Coder into their products or services without needing a special agreement, as long as they comply with the basic open-source license terms (e.g. attribution, if applicable, and not holding the authors liable for outcomes). This is a notable distinction from many proprietary code models and means companies can leverage DeepSeek-Coder for coding assistance, automated code generation, etc., with no licensing fees. (Always review the exact license text – DeepSeek License – for details, but it has been designed to be enterprise-friendly. DeepSeek AI’s intent is to foster broad adoption by making the model as accessible as possible.)

When to Choose DeepSeek-Coder

DeepSeek-Coder vs other DeepSeek models: DeepSeek AI has a portfolio of models optimized for different purposes – choosing the right one depends on your use case. DeepSeek-Coder should be your go-to model if your application is primarily about programming or software development tasks. This includes use cases like generating code from specs or comments, completing code in an IDE, answering programming questions, debugging and fixing code, or even writing technical documentation for code. DeepSeek-Coder is specialized for programming, with its training skewed heavily toward code, making it excel at those tasks. It understands programming languages and can produce syntactically correct, plausibly efficient solutions in a range of languages. It also supports long contexts, which is especially useful if you need to feed in large codebases or multiple files for the model to analyze.

On the other hand, if your needs involve general reasoning, complex problem-solving in natural language, or multi-modal tasks, you might consider other models in the DeepSeek family instead of (or alongside) Coder. For example, DeepSeek-V3 (and the refined R1 series) are “reasoning-first” models built for agents and step-by-step logical analysis. They are trained to handle a broad range of tasks with advanced chain-of-thought reasoning. If you require an AI to, say, solve a non-coding analytical problem, provide a detailed explanation or proof, or interact with tools in an agentic workflow, the DeepSeek-V3/R1 models would likely perform better. Those models tend to show their work in reasoning tasks, breaking down problems methodically (they even excel at tasks like math proofs and logic puzzles, where showing reasoning is key). While DeepSeek-Coder can also follow logical steps (especially the instruct version, if prompted to explain its code), it doesn’t naturally produce long reasoning traces unless asked, and its knowledge outside of programming/dev ops is not as comprehensive.

Comparison example: If a user asks a very general question or something requiring world knowledge (“Explain the causes of a historical event” or “Draft a marketing email”), a general-purpose DeepSeek model (like V3.2) would be more appropriate. If the user asks for a Python function to parse JSON or help with a segmentation fault in C code, DeepSeek-Coder will produce a more direct and relevant answer. Similarly, DeepSeek has other specialized models – e.g. DeepSeek-Math focused on mathematical problem solving, DeepSeek-VL for vision-and-language tasks, etc. – and those should be chosen for their respective domains. DeepSeek-Coder specifically shines in software development contexts. It has been reported to handle code generation and debugging like an expert developer, even explaining its reasoning in code where needed, but it may not have the conversational finesse of the chat-oriented models on non-coding queries.

In summary, choose DeepSeek-Coder when your project involves writing or understanding code, automating programming tasks, or building developer assistants (it will provide the best code completion and generation quality). Choose DeepSeek’s general or reasoning models for tasks that require broad knowledge, extensive natural language interaction, or complex reasoning outside of strictly coding domains. In many cases, these models can also be combined – for instance, using a general model to interpret an ambiguous user query and then calling DeepSeek-Coder to produce a code snippet as the answer. DeepSeek’s platform and API support all these models, so you can integrate the one that fits your needs (see the [Code Generation use case page] for examples and best practices on using DeepSeek-Coder in applications).
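As a toy illustration of such a combined setup, the sketch below routes a query to a code-specialized or general model using a naive keyword heuristic. The model identifiers are placeholders (check the DeepSeek API docs for the names your account exposes), and a production router would use a classifier rather than a regex:

```python
import re

# Naive dispatcher: send coding-flavored queries to the code model and
# everything else to a general model. The hint list and model ids are
# illustrative placeholders, not an official routing scheme.
CODE_HINTS = re.compile(
    r"\b(def|class|function|bug|compile|segfault|regex|"
    r"python|java|javascript|sql|json)\b",
    re.IGNORECASE,
)

def pick_model(query: str) -> str:
    if CODE_HINTS.search(query) or "```" in query:
        return "deepseek-coder"  # placeholder id for the code model
    return "deepseek-chat"       # placeholder id for the general model
```

The chosen identifier would then be passed as the model name in the API request, so each query is answered by the model best suited to it.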

Quick Start Example (Python)

If you want to experiment with DeepSeek-Coder locally, you can load an open-source checkpoint using Hugging Face Transformers. For example, here’s how to load the 6.7B instruction-tuned model and generate a simple code completion:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"  # Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.to("cuda")  # move to GPU for reasonable inference speed

prompt = "# Task: Write a Python function to check if a number is prime.\ndef is_prime(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)

Running the above should output a Python function implementation for is_prime. In practice, you can swap model_name with any available DeepSeek-Coder variant on Hugging Face or use DeepSeek’s own API. Refer to the DeepSeek API docs for more on deploying these models in production. For additional examples of coding tasks (like code infilling, debugging assistance, etc.), see our [Code Generation use case guide].

DeepSeek-Coder provides developers a powerful AI coding assistant under an open license, combining high performance on code tasks with the flexibility of open-source deployment. With proper integration and oversight, it can greatly accelerate software development workflows by letting “the code write itself.”