Kubernetes Deployment for DeepSeek: Production-Ready Guide

Last updated: June 1, 2026

A Kubernetes Deployment for DeepSeek is not just a normal web-service deployment with a larger container image. DeepSeek models can range from small distilled checkpoints that are practical for development environments to full-size Mixture-of-Experts models that require serious GPU, memory, networking, and scheduling planning. DeepSeek-R1 is available with full-size and distilled variants, and the DeepSeek-R1 model card notes that the R1 repository and model weights are MIT licensed while some distilled variants inherit considerations from their Qwen or Llama base models.

In this guide, you will build a Kubernetes-based DeepSeek inference endpoint using three deployment paths: a production-oriented vLLM deployment, a simpler Ollama deployment for demos and smaller models, and an advanced Ray Serve + vLLM option for distributed serving. vLLM provides an OpenAI-compatible HTTP server for endpoints such as /v1/chat/completions, while Ray Serve LLM provides a production framework for distributed LLM serving with OpenAI API compatibility.

The examples use a smaller placeholder model such as deepseek-ai/DeepSeek-R1-Distill-Qwen-7B so the Kubernetes YAML is realistic for testing. Replace it with your chosen DeepSeek model only after validating GPU memory, precision, context length, concurrency, and license requirements.

Quick Answer: How Do You Deploy DeepSeek on Kubernetes?

To deploy DeepSeek on Kubernetes, prepare GPU-enabled nodes, install the NVIDIA GPU Operator or device plugin, create a namespace, store your Hugging Face token in a Kubernetes Secret, mount a PersistentVolumeClaim for model cache, and deploy a model-serving container such as vLLM. Expose the Pod with a Kubernetes Service, then test the OpenAI-compatible endpoint using curl or the OpenAI Python client. Use Ollama for quick demos, vLLM for production inference, and Ray Serve + vLLM for large or multi-node DeepSeek deployments. Kubernetes schedules NVIDIA GPUs through the nvidia.com/gpu resource after the relevant device plugin is installed.

Why Kubernetes Deployment for DeepSeek Requires Careful GPU Planning

DeepSeek is a family of models, not a single fixed runtime profile. The DeepSeek-R1 repository lists full DeepSeek-R1 and R1-Zero as 671B total parameter models with 37B activated parameters and a 128K context length, while also listing distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen and Llama series models.

This matters because a small distilled model can be deployed as a single Kubernetes Pod on one GPU, while full-size DeepSeek-R1, V3, V3.1, or V3.2-class models require multi-GPU or multi-node planning. DeepSeek-V3 is described as a 671B total parameter MoE model with 37B activated parameters per token, and DeepSeek-V3.1 lists 671B total parameters, 37B activated parameters, and 128K context length.

Do not treat any single GPU recommendation as universal. Hardware depends on the model variant, precision, quantization, context length, batch size, KV cache, tensor parallelism, pipeline parallelism, and traffic pattern. Google Cloud’s DeepSeek-V3.1-Base tutorial, for example, uses vLLM on a GKE Autopilot cluster with an A4 VM containing 8 B200 GPUs; that is a specific reference architecture, not a general minimum for every DeepSeek deployment.

Architecture Overview

A typical DeepSeek model serving architecture on Kubernetes looks like this:

Client / App
    |
    v
Ingress / API Gateway / Service Mesh
    |
    v
Kubernetes Service
    |
    v
vLLM, Ollama, or Ray Serve Pod(s)
    |
    v
GPU Node Pool
    |
    +--> PersistentVolumeClaim / model cache
    |
    +--> Monitoring: Prometheus, Grafana, DCGM Exporter, vLLM metrics

The main production goals are to keep model weights cached, route traffic only through authenticated internal or gateway-controlled endpoints, and monitor both application-level latency and GPU-level saturation. NVIDIA’s GPU Operator can automate several NVIDIA software components needed for Kubernetes GPU nodes, including drivers, the NVIDIA Kubernetes device plugin, NVIDIA Container Toolkit, GPU Feature Discovery, and DCGM-based monitoring.

Choosing the Right DeepSeek Deployment Option

Option	Best use case	Complexity	GPU support	Production readiness	Multi-node support	OpenAI-compatible API	Recommended model size
Ollama	Quick start, local-style demos, internal experiments	Low	Yes, with suitable runtime/device access	Limited for high-concurrency production	Not the main design goal	Partial OpenAI compatibility	Small/distilled models
vLLM	Production inference endpoint, batching, higher throughput	Medium	Yes	Strong production foundation	Single-node tensor parallelism and integrations	Yes	Small to mid-size models; larger with careful parallelism
Ray Serve + vLLM	Distributed serving, large models, advanced scaling	High	Yes	Strong for distributed production serving	Yes	Yes	Large DeepSeek models and multi-node workloads

Use Ollama when the team needs a fast demo or a private endpoint for smaller distilled DeepSeek models. Ollama provides an API for programmatic model interaction and documents OpenAI compatibility for parts of the OpenAI API.

Use vLLM when you need a production DeepSeek inference server with an OpenAI-compatible API. vLLM’s documentation describes Kubernetes deployment options and an OpenAI-compatible server that implements Completions and Chat APIs.

Use Ray Serve + vLLM when the model or workload exceeds a simple single-Pod deployment. Ray Serve LLM supports multi-node inference patterns, autoscaling, load balancing, OpenAI-compatible APIs, and distributed deployment capabilities.

Prerequisites

You need a working Kubernetes cluster and permissions to create Deployments, Services, Secrets, PVCs, and optionally Ingress or RayService resources.

Minimum platform requirements:

Requirement	Notes
Kubernetes cluster	Managed Kubernetes or self-managed cluster
`kubectl`	Configured for the target cluster
Helm	Useful for NVIDIA GPU Operator, KubeRay, Prometheus, Grafana
GPU node pool	Required for practical LLM serving
NVIDIA GPU Operator or NVIDIA device plugin	Required so Kubernetes can expose `nvidia.com/gpu`
StorageClass and PVC support	Needed for model cache
Container registry access	Use pinned images, preferably from a trusted registry
Hugging Face token	Required for gated/private models and useful for authenticated pulls
Basic Kubernetes knowledge	Deployments, Services, Secrets, PVCs, probes, requests, limits

Kubernetes GPU support is based on device plugins. After the GPU vendor plugin is installed, the cluster exposes resources such as nvidia.com/gpu, and Pods can request GPUs in their container resource limits. Kubernetes documents that GPUs should be specified in limits; if requests are also specified, requests and limits must be equal.
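
The limits rule above can be checked before a manifest reaches the cluster. The following is a minimal sketch, not part of any Kubernetes client library: `check_gpu_resources` is a hypothetical helper that inspects a container's `resources` mapping (for example, one loaded from a parsed manifest) and reports GPU settings that the scheduler would reject.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def check_gpu_resources(resources: dict) -> list[str]:
    """Return a list of problems found in the GPU settings of a container's
    `resources` mapping. An empty list means the settings look consistent."""
    problems = []
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})

    gpu_limit = limits.get(GPU_RESOURCE)
    gpu_request = requests.get(GPU_RESOURCE)

    # GPUs cannot be requested without a matching limit.
    if gpu_limit is None and gpu_request is not None:
        problems.append("GPU is requested but missing from limits")
    # When both are present they must be identical.
    if gpu_limit is not None and gpu_request is not None:
        if str(gpu_limit) != str(gpu_request):
            problems.append("GPU request and limit must be equal")
    # Extended resources such as GPUs are allocated in whole units.
    if gpu_limit is not None and not str(gpu_limit).isdigit():
        problems.append("GPU limit must be a whole number")
    return problems


print(check_gpu_resources({"limits": {"nvidia.com/gpu": "1"}}))
```

A check like this fits naturally into a CI step that lints manifests before they are applied.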

Hardware and Model Sizing

There is no universal “one GPU fits DeepSeek” answer. Sizing depends on:

Factor	Why it matters
Model variant	Distilled models are much smaller than full DeepSeek-R1/V3-class models
Precision	BF16, FP8, and quantized formats change memory and performance behavior
Context length	Longer context increases KV cache memory pressure
Concurrency	More simultaneous requests require more memory and scheduling headroom
Batch size	Higher batch sizes can improve throughput but increase latency and memory use
KV cache	Often the limiting factor for long-context serving
Tensor parallelism	Splits model computation across multiple GPUs
Pipeline parallelism	Splits layers/stages across devices or nodes
Storage speed	Slow model downloads can cause long startup times
Interconnect	Multi-node workloads can be limited by network/NCCL/RDMA configuration

Practical categories:

Category	Typical use
Small distilled model	Development, demos, CI validation, lightweight internal tools
Mid-size model	Team workloads, internal assistants, batch-like low-concurrency inference
Full-size DeepSeek-R1/V3/V3.1/V3.2 class	Multi-GPU or multi-node architecture with careful serving stack validation

Ray’s official DeepSeek R1 on Kubernetes example states that its full DeepSeek model guide requires two nodes, each with 8 H100 80GB GPUs. Treat that as a reference architecture for that specific example, not as a universal rule for every DeepSeek model or quantized variant.
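
To make the sizing factors concrete, here is a rough back-of-the-envelope sketch. It assumes a dense model with standard grouped-query attention, and every number in it (7 billion parameters, 28 layers, 4 KV heads, head dimension 128, 2 bytes per value) is an illustrative assumption; read the real values from the `config.json` of the model you plan to serve. It ignores activations, runtime overhead, and architectures with compressed KV caches, so treat the result as a lower bound rather than a capacity plan.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed, illustrative values for a 7B-class distilled model in BF16.
weights = weight_bytes(7e9, 2)
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4,
                                     head_dim=128, bytes_per_value=2)
per_request = per_token * 8192  # one request using a full 8192-token context

print(f"weights:              {weights / GIB:.2f} GiB")
print(f"KV cache per token:   {per_token / 1024:.0f} KiB")
print(f"KV cache per request: {per_request / MIB:.0f} MiB")
```

Multiplying the per-request figure by the expected number of concurrent long-context requests shows quickly why KV cache, not weights, is often the limiting factor.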

Prepare GPU Support in Kubernetes

First, confirm that your GPU nodes are visible:

kubectl get nodes -o wide
kubectl describe nodes | grep -i -A5 "nvidia.com/gpu"

A GPU node should report allocatable GPU resources similar to:

Allocatable:
  cpu:                32
  memory:             250Gi
  nvidia.com/gpu:     1

To schedule a DeepSeek inference Pod on GPU nodes, request the GPU resource in the container limits:

resources:
  limits:
    nvidia.com/gpu: "1"

If your GPU nodes are labeled, add a nodeSelector:

nodeSelector:
  accelerator: nvidia-gpu

If GPU nodes are tainted, add tolerations:

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

Kubernetes also supports node labels and node selectors for clusters with different GPU types, which is important when you have mixed A10, L4, A100, H100, B200, or other GPU pools.
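
The visibility check at the start of this section can also be scripted. The sketch below assumes you have captured the output of `kubectl get nodes -o json`; `gpu_nodes` is a hypothetical helper, and the embedded sample is a trimmed, made-up node list used only to show the shape of the data.

```python
import json


def gpu_nodes(node_list: dict) -> dict[str, int]:
    """Map node name to allocatable nvidia.com/gpu count, GPU nodes only."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Trimmed, hypothetical sample of `kubectl get nodes -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "cpu-node-1"},
     "status": {"allocatable": {"cpu": "16", "memory": "64Gi"}}},
    {"metadata": {"name": "gpu-node-1"},
     "status": {"allocatable": {"cpu": "32", "memory": "250Gi",
                                "nvidia.com/gpu": "1"}}}
  ]
}
""")

print(gpu_nodes(sample))
```

An empty result usually means the device plugin or GPU Operator is not running on the GPU nodes yet.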

Create Namespace, Secret, and PVC

Create a namespace, Secret, and model cache PVC. The Secret stores your Hugging Face token and an API key for the vLLM endpoint. Kubernetes Secrets are designed for small sensitive values such as passwords, tokens, and keys, so do not bake tokens into images or commit them to Git.

apiVersion: v1
kind: Namespace
metadata:
  name: deepseek
---
apiVersion: v1
kind: Secret
metadata:
  name: deepseek-secrets
  namespace: deepseek
type: Opaque
stringData:
  HF_TOKEN: "<YOUR_HUGGING_FACE_TOKEN>"
  VLLM_API_KEY: "<YOUR_INTERNAL_API_KEY>"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-model-cache
  namespace: deepseek
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "<YOUR_STORAGE_CLASS>"
  resources:
    requests:
      storage: 200Gi

Save the manifests above as deepseek-base.yaml, then apply them:

kubectl apply -f deepseek-base.yaml

For GitOps, use an external secret manager or sealed secret workflow instead of committing plaintext Secret manifests.

Production secret warning: Kubernetes Secrets are not a complete secret-management solution by themselves. Enable encryption at rest for Secrets, restrict access with least-privilege RBAC, avoid exposing Secrets to unnecessary Pods, and prefer an external secret manager or Secrets Store CSI workflow for production environments.
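
One reason for the warning above: the `stringData` field is a write-time convenience, and the values end up in the Secret's `data` field as base64 text. Base64 is an encoding, not encryption, so anyone who can read the Secret can recover the token. The sketch below illustrates this with two hypothetical helpers and a made-up token value.

```python
import base64


def to_secret_data(string_data: dict[str, str]) -> dict[str, str]:
    """Encode plain values the way they appear under a Secret's `data` field."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in string_data.items()
    }


def from_secret_data(data: dict[str, str]) -> dict[str, str]:
    """Decode `data` values back to plain text. No key or password is needed."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


encoded = to_secret_data({"HF_TOKEN": "hf_example_token"})  # made-up token
print(encoded)
print(from_secret_data(encoded))
```

This is why encryption at rest and tight RBAC on Secret reads matter even when tokens are never committed to Git.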

Primary Deployment Path: DeepSeek with vLLM on Kubernetes

This is the recommended path for most production-oriented DeepSeek deployments on Kubernetes. vLLM’s Kubernetes documentation shows native Kubernetes deployment patterns, and its OpenAI-compatible server lets existing OpenAI-style clients call self-hosted models with minimal client changes.

The following manifest deploys a single-replica DeepSeek inference server using vLLM. It uses a smaller distilled model placeholder for safety. Replace the model only after testing memory and throughput.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-vllm
  namespace: deepseek
  labels:
    app.kubernetes.io/name: deepseek-vllm
    app.kubernetes.io/component: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: deepseek-vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: deepseek-vllm
        app.kubernetes.io/component: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      terminationGracePeriodSeconds: 120
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      securityContext:
        fsGroup: 1000
      containers:
        - name: vllm
          image: "vllm/vllm-openai:<PINNED_VERSION_TAG>"
          imagePullPolicy: IfNotPresent
          command: ["vllm", "serve"]
          args:
            - "$(MODEL_ID)"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--served-model-name"
            - "deepseek-r1-distill"
            - "--dtype"
            - "auto"
            - "--api-key"
            - "$(VLLM_API_KEY)"
            # Tune these after load testing:
            # - "--max-model-len"
            # - "8192"
            # - "--gpu-memory-utilization"
            # - "0.90"
          env:
            - name: MODEL_ID
              value: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
            - name: HF_HOME
              value: "/models/huggingface"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: deepseek-secrets
                  key: HF_TOKEN
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: deepseek-secrets
                  key: HF_TOKEN
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: deepseek-secrets
                  key: VLLM_API_KEY
          ports:
            - name: http
              containerPort: 8000
          resources:
            requests:
              cpu: "4"
              memory: "32Gi"
            limits:
              cpu: "8"
              memory: "64Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /models/huggingface
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: http
            failureThreshold: 120
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 6
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 30
            failureThreshold: 5
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: deepseek-model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "8Gi"
            # Increase shm size for larger models, tensor parallelism, or high-concurrency workloads after load testing.
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-vllm
  namespace: deepseek
  labels:
    app.kubernetes.io/name: deepseek-vllm
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: deepseek-vllm
  ports:
    - name: http
      port: 8000
      targetPort: http


Important parts of the manifest:

Field	Purpose
`MODEL_ID`	Hugging Face model identifier. Start with a distilled model.
`HF_HOME`	Places model cache on the mounted PVC.
`HF_TOKEN`	Authenticates model downloads where required.
`VLLM_API_KEY`	Enables basic API-key protection at the vLLM layer.
`nvidia.com/gpu`	Requests a GPU from Kubernetes.
`startupProbe`	Gives large model downloads enough time before Kubernetes restarts the container.
`readinessProbe`	Prevents traffic before the model server is ready.
`prometheus.io/*` annotations	Allows Prometheus scraping in clusters that use annotation discovery.
`ClusterIP` Service	Keeps the model endpoint internal by default.
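
The startup probe in the manifest tolerates `failureThreshold` × `periodSeconds` of failed checks before the container is restarted. The sketch below works through that arithmetic and compares it with a rough download estimate. The 15 GB weight size and 50 MB/s throughput are assumptions for illustration only; measure your own registry and storage path.

```python
def startup_window_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Longest time a container may take to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


def download_seconds(size_gb: float, throughput_mb_per_s: float) -> float:
    """Rough download time, using 1 GB = 1000 MB."""
    return size_gb * 1000 / throughput_mb_per_s


window = startup_window_seconds(failure_threshold=120, period_seconds=10)
estimate = download_seconds(size_gb=15, throughput_mb_per_s=50)  # assumed values

print(f"startup window:    {window} s ({window / 60:.0f} min)")
print(f"download estimate: {estimate:.0f} s")
print("fits" if estimate < window else "increase failureThreshold")
```

Leave headroom beyond the download itself, because weight loading and engine warm-up also happen before `/health` responds.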

For DeepSeek-R1-family models, follow the model card’s usage recommendations. The DeepSeek-R1 model card recommends avoiding a system prompt and putting instructions in the user prompt for expected behavior.
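
If existing client code already sends a system message, one way to follow this recommendation is to fold it into the first user message before calling the endpoint. This is a minimal sketch; `fold_system_into_user` is a hypothetical helper, not part of any client library, and joining with a blank line is just one reasonable choice.

```python
def fold_system_into_user(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with system content merged into the
    first user message, so no system role is sent to the model."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    others = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return others

    prefix = "\n\n".join(system_parts)
    for message in others:
        if message["role"] == "user":
            message["content"] = f"{prefix}\n\n{message['content']}"
            return others
    # No user message exists yet, so the instructions become one.
    return [{"role": "user", "content": prefix}] + others


chat = [
    {"role": "system", "content": "Answer in three bullet points."},
    {"role": "user", "content": "How do GPU taints work in Kubernetes?"},
]
print(fold_system_into_user(chat))
```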

Test the DeepSeek Endpoint

Save the vLLM Deployment and Service above as deepseek-vllm.yaml, then apply them:

kubectl apply -f deepseek-vllm.yaml

Check status:

kubectl -n deepseek get pods
kubectl -n deepseek describe pod -l app.kubernetes.io/name=deepseek-vllm
kubectl -n deepseek logs -l app.kubernetes.io/name=deepseek-vllm -f

Port-forward the Service:

kubectl -n deepseek port-forward svc/deepseek-vllm 8000:8000

Test the OpenAI-compatible Chat Completions endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_INTERNAL_API_KEY>" \
  -d '{
    "model": "deepseek-r1-distill",
    "messages": [
      {
        "role": "user",
        "content": "Explain the steps to deploy a GPU workload on Kubernetes. Keep the answer concise."
      }
    ],
    "temperature": 0.6,
    "max_tokens": 512
  }'

You can also call the endpoint with the OpenAI Python client because vLLM exposes an OpenAI-compatible server.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="<YOUR_INTERNAL_API_KEY>",
)

response = client.chat.completions.create(
    model="deepseek-r1-distill",
    messages=[
        {
            "role": "user",
            "content": "Give me a Kubernetes checklist for serving DeepSeek with GPUs."
        }
    ],
    temperature=0.6,
    max_tokens=512,
)

print(response.choices[0].message.content)
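
R1-family models typically emit their reasoning between `<think>` and `</think>` before the final answer, so applications often want to separate the two. The sketch below is one hedged way to do that on the client side. `split_reasoning` is a hypothetical helper, and it also handles the case where only the closing tag appears in the returned text. If your serving stack already separates reasoning server-side, this step is unnecessary.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer).

    If no closing </think> tag is present, the whole text is treated
    as the answer and the reasoning is empty.
    """
    head, separator, tail = text.partition("</think>")
    if not separator:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample_output = "<think>\nCheck the taints first.\n</think>\n\nAdd a toleration."
reasoning, answer = split_reasoning(sample_output)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the reasoning out of user-facing responses and logs also reduces how much prompt-derived text ends up stored.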

Alternative Path: DeepSeek with Ollama on Kubernetes

Deploying DeepSeek with Ollama on Kubernetes is useful for demos, local-style workflows, and smaller distilled models. Ollama documents programmatic model interaction through its API, a Docker workflow, and OpenAI compatibility for parts of the OpenAI API.

Use Ollama when the goal is fast experimentation, not maximum production concurrency. For high-throughput production inference, vLLM is usually the better default.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-ollama
  namespace: deepseek
  labels:
    app.kubernetes.io/name: deepseek-ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: deepseek-ollama
  template:
    metadata:
      labels:
        app.kubernetes.io/name: deepseek-ollama
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: ollama
          image: "ollama/ollama:<PINNED_VERSION_TAG>"
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: "16Gi"
            limits:
              cpu: "6"
              memory: "48Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-model-cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-model-cache
  namespace: deepseek
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "<YOUR_STORAGE_CLASS>"
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-ollama
  namespace: deepseek
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: deepseek-ollama
  ports:
    - name: http
      port: 11434
      targetPort: http

Save the manifests above as deepseek-ollama.yaml, then apply them:

kubectl apply -f deepseek-ollama.yaml

Pull a DeepSeek model into the Ollama PVC:

kubectl -n deepseek exec deploy/deepseek-ollama -- ollama pull deepseek-r1:8b

Test through port-forwarding:

kubectl -n deepseek port-forward svc/deepseek-ollama 11434:11434

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:8b",
    "messages": [
      {
        "role": "user",
        "content": "Write a short Kubernetes GPU readiness checklist."
      }
    ],
    "stream": false
  }'

Ollama is simpler, but for production concurrency you should validate request queuing, memory pressure, model loading time, API authentication, and autoscaling behavior before exposing it to internal users.
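
The same `/api/chat` call shown with curl can be made from Python using only the standard library. This is a sketch under the assumption that the port-forward above is active; `build_ollama_chat_request` and `send_chat` are hypothetical helper names, and the 300-second timeout is an arbitrary value chosen to tolerate a cold model load.

```python
import json
import urllib.request


def build_ollama_chat_request(base_url: str, model: str,
                              prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 300) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]


request = build_ollama_chat_request(
    "http://localhost:11434", "deepseek-r1:8b",
    "Write a short Kubernetes GPU readiness checklist.",
)
print(request.full_url)
# Call send_chat(request) once the port-forward is running.
```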

Advanced Path: Ray Serve + vLLM for Large DeepSeek Models

Use Ray Serve with vLLM on Kubernetes when a single Pod is no longer enough. Ray Serve LLM specializes Ray Serve for distributed LLM serving and includes production features such as autoscaling, load balancing, multi-node deployments, OpenAI-compatible APIs, metrics, and Grafana dashboards.

Ray’s Kubernetes DeepSeek example uses KubeRay, Ray Serve, and vLLM to deploy deepseek-ai/DeepSeek-R1 on Kubernetes and expose an efficient OpenAI-compatible LLM service.

Below is a conceptual RayService manifest. Treat it as a starting template, not a drop-in production manifest. Exact values depend on GPU type, node count, model variant, vLLM version, Ray version, storage, and network topology.

Image note: For Ray Serve + vLLM, use a pinned custom image that includes a Ray version compatible with KubeRay, Ray Serve LLM dependencies, vLLM, CUDA/NCCL libraries, and any model-specific runtime packages. A plain base Ray image may not include everything required for production LLM serving.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: deepseek-rayserve
  namespace: deepseek
spec:
  serveConfigV2: |
    applications:
      - name: deepseek
        import_path: ray.serve.llm:build_openai_app
        route_prefix: "/"
        args:
          llm_configs:
            - model_loading_config:
                model_id: deepseek-r1
                model_source: deepseek-ai/DeepSeek-R1
              engine_kwargs:
                dtype: bfloat16
                max_model_len: 8192
                tensor_parallel_size: <GPUS_PER_REPLICA>
                pipeline_parallel_size: <PIPELINE_PARALLEL_SIZE>
                gpu_memory_utilization: 0.90
              deployment_config:
                autoscaling_config:
                  min_replicas: 1
                  max_replicas: 2
                  target_ongoing_requests: 64
                max_ongoing_requests: 128
  rayClusterConfig:
    rayVersion: "<PINNED_RAY_VERSION>"
    headGroupSpec:
      serviceType: ClusterIP
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: "rayproject/ray:<PINNED_RAY_IMAGE_TAG>"
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 8000
                  name: serve
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: deepseek-secrets
                      key: HF_TOKEN
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                limits:
                  cpu: "8"
                  memory: "32Gi"
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: <GPU_WORKER_REPLICAS>
        minReplicas: <GPU_WORKER_MIN_REPLICAS>
        maxReplicas: <GPU_WORKER_MAX_REPLICAS>
        rayStartParams: {}
        template:
          spec:
            nodeSelector:
              accelerator: nvidia-gpu
            tolerations:
              - key: "nvidia.com/gpu"
                operator: "Exists"
                effect: "NoSchedule"
            containers:
              - name: ray-worker
                image: "rayproject/ray:<PINNED_RAY_IMAGE_TAG>"
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: deepseek-secrets
                        key: HF_TOKEN
                resources:
                  requests:
                    cpu: "16"
                    memory: "128Gi"
                  limits:
                    cpu: "32"
                    memory: "256Gi"
                    nvidia.com/gpu: "<GPUS_PER_WORKER_POD>"

Ray’s documented serveConfigV2 format uses ray.serve.llm:build_openai_app, llm_configs, model_loading_config, engine_kwargs, and deployment_config to configure an OpenAI-compatible LLM application.

For large DeepSeek deployments, validate NCCL, RDMA or high-performance networking, topology-aware scheduling, image size, shared cache, and rollout strategy before production. Multi-node LLM serving failures are often infrastructure failures, not model-code failures.
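
When filling in the placeholders in the RayService template, it helps to check the GPU arithmetic first: each model replica needs `tensor_parallel_size` × `pipeline_parallel_size` GPUs. The sketch below uses hypothetical helper names and an assumed layout (tensor parallelism 8, pipeline parallelism 2, 8 GPUs per worker Pod) that is consistent with the two-node, 8-GPU-per-node reference mentioned earlier. The Pod count it returns is a lower bound that ignores placement constraints.

```python
import math


def gpus_per_replica(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """GPUs consumed by one model replica."""
    return tensor_parallel_size * pipeline_parallel_size


def worker_pods_needed(tensor_parallel_size: int, pipeline_parallel_size: int,
                       gpus_per_pod: int, replicas: int = 1) -> int:
    """Minimum number of GPU worker Pods for the requested replicas."""
    total_gpus = gpus_per_replica(tensor_parallel_size, pipeline_parallel_size) * replicas
    return math.ceil(total_gpus / gpus_per_pod)


# Assumed layout: TP=8 inside each node, PP=2 across two nodes.
print(gpus_per_replica(8, 2))                    # 16
print(worker_pods_needed(8, 2, gpus_per_pod=8))  # 2
```

A common starting point is to keep tensor parallelism within a node, where GPU interconnect is fastest, and use pipeline parallelism across nodes.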

Expose the Service Safely

Do not expose an unauthenticated DeepSeek inference endpoint directly to the public internet.

Recommended exposure patterns:

Pattern	Use case
`ClusterIP`	Internal services inside the cluster
Private Ingress	Internal platform users or private VPC
API Gateway	Auth, quotas, rate limits, request logging
Service mesh	mTLS, policy, internal traffic control
Public Ingress	Only behind strong auth, TLS, and rate limiting

A minimal internal Ingress may look like this:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-vllm
  namespace: deepseek
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - deepseek.internal.example.com
      secretName: deepseek-tls
  rules:
    - host: deepseek.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: deepseek-vllm
                port:
                  number: 8000

Add authentication before the request reaches the model server. For production, enforce API keys or an identity-aware proxy at the gateway layer, not just in application code.

Scaling DeepSeek on Kubernetes

Scaling LLMs is different from scaling stateless REST APIs. A normal CPU-based HPA can be misleading because the bottleneck is often GPU memory, KV cache pressure, waiting queue depth, tokens per second, time-to-first-token, or time per output token.

Kubernetes HPA autoscaling/v2 supports memory and custom metrics, which is important for LLM inference workloads where CPU is not the main saturation signal.

Example HPA using a custom metric exposed through Prometheus Adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-vllm
  namespace: deepseek
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "8"

This assumes a Prometheus Adapter rule or recording rule that publishes a per-Pod metric named vllm_num_requests_waiting through the custom metrics API. vLLM’s own gauge is named vllm:num_requests_waiting, so the rule usually renames it. vLLM’s metrics documentation notes that autoscaling and load balancing are common use cases for vLLM metrics, but also warns that identifying saturation for model serving is non-trivial.

For smaller models, replica scaling may work well if each replica owns a GPU. For large models using tensor or pipeline parallelism, scaling often means adding a full replica group, not simply adding one Pod.
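
To reason about how the HPA above would react, it helps to apply the scaling formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric value / target value), clamped to the configured minimum and maximum. The sketch below is a simplified model of that calculation. It deliberately ignores the controller's tolerance band, stabilization windows, and Pod readiness handling.

```python
import math


def desired_replicas(current_replicas: int, current_average: float,
                     target_average: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA calculation for an AverageValue Pods metric."""
    raw = math.ceil(current_replicas * current_average / target_average)
    return max(min_replicas, min(max_replicas, raw))


# Two replicas, each averaging 20 waiting requests, target of 8 per Pod.
print(desired_replicas(2, 20, 8, min_replicas=1, max_replicas=4))  # 4
```

With an uncapped result of 5, the example hits `maxReplicas`, which is a signal that either the cap or the per-replica capacity needs revisiting.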

Monitoring and Observability

Monitor the model server, the GPU, the Kubernetes workload, and the user experience.

Key metrics:

Layer	Metrics
GPU	utilization, memory used, memory temperature, power, ECC errors
vLLM	request latency, queue depth, tokens/sec, time-to-first-token, time per output token
Kubernetes	Pod restarts, Pending Pods, probe failures, OOMKilled events
API	request rate, error rate, timeout rate, status code distribution
Storage	model download duration, PVC latency, cache hit behavior
Ray	Serve deployment status, actor health, object store memory, worker failures

NVIDIA DCGM Exporter exposes GPU metrics at an HTTP /metrics endpoint for monitoring systems such as Prometheus and can run as a DaemonSet on GPU nodes.

For vLLM, scrape the model server metrics endpoint and create dashboards for queue length, request latency, throughput, and GPU utilization. For Ray Serve, also monitor the Ray dashboard and Ray Serve deployment status.
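
Before building dashboards, it can be useful to confirm what the metrics endpoint actually returns. The sketch below parses Prometheus text exposition format with a small hypothetical helper. The sample text is illustrative rather than captured output, and the parser assumes samples carry no trailing timestamps. In practice you would fetch the text from the server's `/metrics` path through the port-forward.

```python
def metric_values(exposition_text: str, metric_name: str) -> list[float]:
    """Return every sample value for `metric_name`, across all label sets."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP, and TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            values.append(float(value))
    return values


# Illustrative sample in Prometheus text format.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="deepseek-r1-distill"} 3.0
vllm:num_requests_running{model_name="deepseek-r1-distill"} 5.0
"""

print(metric_values(SAMPLE, "vllm:num_requests_waiting"))
```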

Security Best Practices

A production DeepSeek inference endpoint can process sensitive prompts, internal code, customer data, or operational logs. Treat it as a sensitive service.

Security checklist:

Area	Recommendation
Secrets	Store tokens in Kubernetes Secrets or external secret managers
RBAC	Grant only the permissions required for deployment and runtime
NetworkPolicy	Restrict ingress to API gateways and trusted namespaces
TLS	Use TLS at Ingress or gateway
Authentication	Require API keys, OIDC, mTLS, or gateway-based auth
Rate limiting	Prevent runaway spend and GPU saturation
Image security	Pin images, scan images, avoid `latest` tags
Registry	Prefer private or trusted registries
Prompt logging	Define retention, redaction, and privacy rules
Data privacy	Avoid logging raw prompts unless explicitly approved
License review	Review each model variant and base model license
Supply chain	Scan manifests, images, and dependencies

Kubernetes RBAC guidance emphasizes least privilege and warns against overly permissive wildcard permissions.

NetworkPolicies control what network traffic is allowed for selected Pods, including ingress and egress behavior, so use them to limit which workloads can call your DeepSeek inference server.

The example NetworkPolicy below allows ingress only from an API gateway namespace. It uses Kubernetes’ built-in namespace label kubernetes.io/metadata.name. If your cluster uses custom namespace labels, replace the selector with your approved label strategy.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-deepseek
  namespace: deepseek
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: deepseek-vllm
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: api-gateway
      ports:
        - protocol: TCP
          port: 8000

Performance Optimization

Start with correctness, then tune performance.

High-impact tuning areas:

Tuning area	Practical guidance
Model choice	Use the smallest model that satisfies quality requirements
Serving engine	Use vLLM for production batching and OpenAI-compatible serving
Precision	Test BF16, FP8, or quantized formats depending on model support
Context length	Lower `max_model_len` if you do not need long context
KV cache	Plan memory for concurrency and context length
Tensor parallelism	Use when a model needs multiple GPUs in one replica
Pipeline parallelism	Consider for larger distributed deployments
Model cache	Use PVC, node-local SSD, or fast storage to avoid repeated downloads
Startup	Use startup probes long enough for model loading
Concurrency	Increase gradually and measure TTFT/TPOT
Scheduling	Pin workloads to compatible GPU node pools
Rollouts	Avoid evicting all model replicas at once

Do not jump directly to maximum context length and high concurrency. Long context increases KV cache pressure, and the best values are usually discovered through load testing with prompts that look like your real workload.
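
When load testing, time-to-first-token (TTFT) and time per output token (TPOT) can be derived from timestamps recorded on the client while streaming a response. The sketch below shows one common way to compute them. `ttft_and_tpot` is a hypothetical helper, and the timestamps are made-up values in seconds.

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Compute (TTFT, TPOT) in seconds from client-side timestamps.

    TTFT is the delay until the first token arrives. TPOT is the average
    gap between consecutive tokens after the first one.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Made-up timestamps: request sent at t=0, four tokens received.
print(ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25]))  # (0.5, 0.25)
```

Tracking both numbers per concurrency level makes it easier to see when added load starts to hurt interactive latency rather than improve throughput.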

Troubleshooting

Problem	Likely cause	How to fix
Pod stuck `Pending`	No GPU node available, wrong node selector, insufficient resources	Check `kubectl describe pod`, GPU node labels, taints, and allocatable `nvidia.com/gpu`
GPU not visible	NVIDIA device plugin or GPU Operator not installed correctly	Verify node drivers, GPU Operator status, and `kubectl describe node` GPU resources
CUDA mismatch	Image CUDA stack incompatible with host driver/runtime	Use a compatible pinned image and validate NVIDIA runtime configuration
Model download fails	Missing token, network egress blocked, wrong model ID	Check `HF_TOKEN`, egress policy, DNS, and Hugging Face model name
`401` from Hugging Face	Invalid token or insufficient access	Recreate Secret and confirm token permissions
CUDA out of memory	Model too large, context too long, concurrency too high	Use a smaller model, quantization, a lower `--max-model-len`, fewer concurrent sequences (`--max-num-seqs`), an adjusted `--gpu-memory-utilization`, or more GPUs
Slow startup	Large model download or slow PVC	Pre-warm the cache, use faster storage, or stage approved weights in internal artifact storage
Service not reachable	Wrong Service selector, port mismatch, NetworkPolicy	Check endpoints with `kubectl get endpoints -n deepseek`
Readiness probe fails	Model still loading or `/health` unavailable	Increase startup probe window and inspect logs
Poor throughput	GPU underutilization, queue bottleneck, small batch, slow tokenization	Check vLLM metrics, tune concurrency, and benchmark realistic traffic
Multi-node NCCL/RDMA issues	Network, driver, or topology misconfiguration	Validate NCCL tests, node networking, firewall rules, and GPU topology
Model too large for GPU memory	Full model used on insufficient hardware	Use distilled variant, quantization, tensor parallelism, or Ray Serve + vLLM
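
For the "Readiness probe fails" row, it helps to work out how long the startup probe actually waits. Kubernetes gives a container roughly `failureThreshold × periodSeconds` (plus any `initialDelaySeconds`) to pass its startup probe before restarting it. The sketch below turns a measured model load time into a `failureThreshold`. The function names and the 1.5 safety factor are assumptions made for illustration, and the load time is something you measure from your own Pod logs.

```python
import math


def startup_window_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container has to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


def failure_threshold_for(load_seconds: float, period_seconds: int,
                          safety_factor: float = 1.5,
                          initial_delay_seconds: int = 0) -> int:
    """Smallest failureThreshold that covers the measured load time plus headroom.

    safety_factor is an assumed margin for slow pulls or cold caches; tune it.
    """
    budget = load_seconds * safety_factor - initial_delay_seconds
    return max(1, math.ceil(budget / period_seconds))


if __name__ == "__main__":
    # Example: the model took about 10 minutes to load in a test run.
    threshold = failure_threshold_for(load_seconds=600, period_seconds=10)
    print("failureThreshold:", threshold)
    print("window (s):", startup_window_seconds(threshold, 10))
```

If the Pod restarts in a loop while the logs still show weights loading, the computed window is shorter than the real load time and the threshold needs to go up.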

Cleanup

Delete the vLLM deployment:

kubectl -n deepseek delete deployment deepseek-vllm
kubectl -n deepseek delete service deepseek-vllm

Delete the Ollama deployment:

kubectl -n deepseek delete deployment deepseek-ollama
kubectl -n deepseek delete service deepseek-ollama
kubectl -n deepseek delete pvc ollama-model-cache

For Ray Serve, delete the RayService:

kubectl -n deepseek delete rayservice deepseek-rayserve

Delete shared resources last. Deleting the namespace also removes any namespaced resources still left in it:

kubectl -n deepseek delete pvc deepseek-model-cache
kubectl delete namespace deepseek

FAQ

Can I deploy DeepSeek on Kubernetes?

Yes. You can deploy DeepSeek on Kubernetes using vLLM, Ollama, Ray Serve, SGLang, or other serving stacks. For production, vLLM and Ray Serve + vLLM are strong options because they expose OpenAI-compatible APIs and support batching, metrics, and distributed serving.

What is the best way to deploy DeepSeek on Kubernetes?

For most production use cases, the best starting point is vLLM on GPU-enabled Kubernetes nodes. Use Ollama for quick demos and Ray Serve + vLLM for larger distributed deployments.

Should I use Ollama or vLLM for DeepSeek?

Use Ollama for simple experiments, small distilled models, and low-complexity internal demos. Use vLLM when you need a more production-oriented DeepSeek inference server with OpenAI-compatible endpoints, batching, metrics, and health checks that fit Kubernetes probes.

Does DeepSeek need GPUs on Kubernetes?

For practical LLM serving, yes. Small models may run on CPU for testing, but production DeepSeek inference should use GPUs. Kubernetes supports GPU scheduling through vendor device plugins and exposes resources such as nvidia.com/gpu after the plugin is installed.

Can I run DeepSeek-R1 on a single GPU?

A small distilled DeepSeek-R1 model may run on a single suitable GPU, depending on precision, context length, and concurrency. Full DeepSeek-R1-class models do not fit on a single small GPU and require multi-GPU or multi-node planning.

How do I expose DeepSeek as an OpenAI-compatible API?

Deploy a serving stack such as vLLM or Ray Serve LLM and expose the Kubernetes Service internally or through an authenticated gateway. vLLM supports OpenAI-compatible endpoints such as /v1/chat/completions, and Ray Serve LLM aligns with vLLM’s OpenAI-compatible API.
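
As a minimal sketch of calling such an endpoint from inside the cluster, the following uses only the Python standard library. The Service DNS name and port are assumptions based on the `deepseek-vllm` Service in the `deepseek` namespace and port 8000; adjust them to your own Service. The helper names `build_chat_request` and `send_chat` are hypothetical, and the optional bearer token only matters if your server or gateway enforces one.

```python
import json
import urllib.request

# Assumed in-cluster address; change to match your Service name, namespace, and port.
BASE_URL = "http://deepseek-vllm.deepseek.svc.cluster.local:8000"
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"


def build_chat_request(base_url, model, prompt, max_tokens=256, api_key=None):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_chat(request, timeout=120.0):
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

From a Pod that is allowed to reach the Service, `send_chat(build_chat_request(BASE_URL, MODEL, "Say hello"))` should return the model's reply. Building and sending are kept separate so the payload can be inspected or logged before any traffic leaves the client.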

How do I scale DeepSeek on Kubernetes?

For small models, run multiple replicas where each replica has its own GPU. For larger models, use tensor parallelism, pipeline parallelism, or Ray Serve + vLLM. Use custom metrics such as queue depth, tokens/sec, latency, and GPU utilization instead of relying only on CPU-based HPA.
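
To illustrate how a custom metric drives scaling, the sketch below applies the proportional rule that the Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The choice of metric (average queued requests per replica), the target value, and the replica bounds are assumptions for illustration. In a real cluster the HPA controller does this calculation, fed by a metrics adapter.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 8,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling rule in the style of the Kubernetes HPA.

    current_metric is the per-replica average, e.g. queued requests per replica.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two replicas, 12 queued requests per replica on average, target of 4.
    print(desired_replicas(current_replicas=2, current_metric=12, target_metric=4))
```

Because each new replica needs its own GPU and a full model load, it is worth pairing a rule like this with conservative scale-down settings so replicas are not removed and reloaded on short traffic dips.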

Is self-hosting DeepSeek better than using the hosted API?

Self-hosting gives you more control over data locality, network boundaries, model variants, and infrastructure. The hosted API is usually simpler operationally. Choose self-hosting when compliance, customization, private networking, or cost control at scale justifies the operational burden.

Conclusion

A successful Kubernetes Deployment for DeepSeek starts with the right model choice. Use Ollama for quick starts and small distilled models. Use vLLM for production Kubernetes inference with an OpenAI-compatible DeepSeek API. Use Ray Serve + vLLM when the model or traffic pattern requires distributed multi-node serving.

The most important production decisions are GPU planning, model-size selection, cache strategy, authentication, monitoring, autoscaling signals, and security controls. Full-size DeepSeek models require careful multi-GPU or multi-node architecture, while distilled models are much more practical for demos and smaller internal workloads.