DeepSeek is a family of open-weight large language models released by DeepSeek-AI, with official checkpoints published publicly (for example, on Hugging Face). This open-weights approach is a major reason developers look to DeepSeek in the first place: it lets you download the exact model weights and run them on your own hardware—locally or on private servers—instead of relying only on hosted chat services.
In practice, many DeepSeek variants are large enough that GPU acceleration is the difference between a usable local setup and an unusably slow one. The challenge is that most local LLM tooling has historically been optimized around NVIDIA CUDA, where kernels, runtimes, and community testing tend to concentrate. When you run DeepSeek outside CUDA—on AMD GPUs through ROCm or on Apple Silicon through Metal-enabled runtimes—you often encounter platform-specific differences: which runtimes work reliably, which model formats are practical, what precision/quantization you can use, and what failures show up first (driver detection, unsupported ops, memory pressure, or “GPU not used” symptoms).
That’s why this guide focuses specifically on running DeepSeek on AMD ROCm and on Mac Metal: a quick decision path, a compatibility matrix that maps your hardware to the most realistic runtime + model format, and a troubleshooting section for the common DeepSeek deployment errors you’re most likely to hit in these non-CUDA environments.
Section 1: Quick Decision Guide
If you’re using an AMD GPU (ROCm): Use an AMD-supported PyTorch stack and consider frameworks like vLLM or Hugging Face Transformers that have been validated on ROCm. Install the ROCm-enabled PyTorch (e.g. via AMD’s Docker or pip wheel) and load a DeepSeek model in half precision (FP16) or a suitable quantized format. Use a recent ROCm release officially supported by your specific GPU model. If you encounter CUDA-specific code, replace or patch it with HIP/ROCm equivalents. (See Section 3 for detailed fixes.)
If you’re on Apple Silicon (Mac): Leverage llama.cpp with the Metal backend or an app like Ollama for a user-friendly experience. Convert DeepSeek model weights to the GGUF format optimized for llama.cpp.
Apple’s unified memory can make it easier to load larger models than a typical discrete-VRAM GPU, but you’ll still usually want 4-bit or 5-bit quantization for larger DeepSeek variants to keep memory pressure manageable.
Compile or install llama.cpp with Metal support, and ensure you enable GPU offloading (e.g. set n_gpu_layers). (Troubleshooting tips in Section 4.)
If you must run on CPU-only: This is a fallback for when no supported GPU is available or when compatibility issues persist. Use the smallest DeepSeek variants or heavily quantized models (e.g. a 7B distilled model in 4-bit) to get any usable speed. Running a large model like DeepSeek on CPU will be very slow – expect responses in minutes. Consult the DeepSeek Quantization Guide for ways to reduce model size, or use this option only for development/testing. Whenever possible, prefer GPU (even the integrated Apple GPU) for inference.
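As a rough sizing aid, weight memory scales with parameter count times bytes per weight. The sketch below (our own helper, not part of any DeepSeek tooling) estimates the weights-only footprint and deliberately ignores KV cache and runtime overhead:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weights-only memory estimate in GB (ignores KV cache/overhead)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model: ~14 GB in FP16, but only ~3.5 GB at 4-bit.
for bits, label in [(16, "FP16"), (8, "8-bit"), (4, "4-bit")]:
    print(f"7B @ {label}: ~{weight_memory_gb(7, bits):.1f} GB")
```

This is why a 4-bit quant is often the difference between "fits in RAM" and "unusable" on CPU-only machines.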
Section 2: DeepSeek Compatibility Matrix
Below is a compatibility matrix summarizing recommended runtimes and formats for DeepSeek on different platforms, plus typical issues and where to find fixes in this guide:
| Platform & Stack | Recommended Runtime | Supported Formats | Best Use Case | Common Issues | Fix Section |
|---|---|---|---|---|---|
| AMD ROCm + vLLM | vLLM (ROCm-enabled) | HF format (FP16); Experimental: GPTQ/AWQ 4-bit | High-throughput serving on AMD Instinct or multi-GPU setups. Efficient for large DeepSeek (MoE) models with multiple GPUs. | “No HIP GPUs” error; ROCm driver mismatches; MoE quantization patches needed. | See Section 3 (ROCm) |
| AMD ROCm + Transformers | PyTorch (ROCm) with HF Transformers/Accelerate | HF format (FP16/BF16); 8-bit (bitsandbytes-rocm); 4-bit AWQ/GPTQ (with plugins) | Single-GPU local development or fine-tuning. Leverages familiar HF API on AMD. | Package import errors looking for CUDA; out-of-memory on large models; unsupported ops. | See Section 3 (ROCm) |
| Mac Metal + llama.cpp | llama.cpp CLI / library | GGUF format (prefer 4-bit or 5-bit quantized) | Running DeepSeek locally on M1/M2 GPUs with maximum performance. Good for chat interfaces and experiments on Mac. | GPU not utilized (slow inference); “Metal not enabled” build issues; memory swapping with long prompts. | See Section 4 (Mac) |
| Mac Metal + Ollama | Ollama app / CLI | GGUF format (quantized) via Modelfile | Easiest deployment on Mac – managed API and UI for local LLMs. Great for end-users who want a chat UI with DeepSeek. | Model import difficulties; high memory use causing slowdowns; limited model size on 16GB RAM Macs. | See Section 4 (Mac) |
| Mac (CPU-only fallback) | llama.cpp (CPU mode) or HF on CPU | GGUF (4-bit highly recommended) | Only for small models or testing. Use if no Metal GPU support (e.g. Intel Macs or very large model beyond GPU memory). | Extremely slow generation; potential out-of-memory on large models even quantized. | See Section 4 (Mac) |
(HF = Hugging Face; FP16 = half-precision floating point)
Section 3: Running DeepSeek on AMD ROCm
Running DeepSeek on AMD GPUs requires the ROCm software stack and often minor tweaks to avoid CUDA-only assumptions. AMD’s ROCm enables GPU acceleration via the HIP platform, which is largely compatible with PyTorch and Transformer models. Below we outline recommended runtime options and common issues with their fixes.
Recommended AMD Runtime Stacks: If you have a suitable AMD GPU (one with ROCm support – e.g., Radeon RX 6000/7000 series or Instinct series), you have two primary ways to load DeepSeek:
- vLLM on ROCm: vLLM is a high-performance inference engine originally built for VRAM-efficient serving, and it includes experimental, community-driven support for ROCm/HIP execution. Using vLLM on AMD can yield excellent multi-GPU scaling and fast token generation for DeepSeek, especially for the massive MoE (Mixture-of-Experts) models. Make sure you use a ROCm-enabled build of vLLM, and launch it with `VLLM_USE_V1=0` or other appropriate flags if needed (some DeepSeek-specific optimizations may require V1 mode off). If using quantized models (GPTQ/AWQ), be aware that vLLM might need patching for MoE layers; for example, a custom `gptq_marlin.py` was provided to handle per-expert quantization. Always verify compatibility and the latest AMD support in the official vLLM documentation.
- Transformers (Hugging Face) on ROCm: This is the “traditional” approach: load the model with `AutoModelForCausalLM.from_pretrained`. Make sure to install the PyTorch build for ROCm (e.g., `pip install torch==2.x.y+rocm`, matching your ROCm version) and a compatible Transformers version. On PyTorch for ROCm, the device string is often still exposed as “cuda” even though execution is via HIP/ROCm; use the ROCm-enabled PyTorch build and verify that the GPU is detected before moving the model to the accelerator device. Transformers supports half precision (set `torch.set_default_dtype(torch.float16)` or call `model.half()` if needed) and even some 8-bit or 4-bit flows. Important: the popular 8-bit library bitsandbytes now has a ROCm-compatible fork; if you want to use `load_in_8bit=True`, install the ROCm-enabled bitsandbytes (see AMD’s instructions for version 0.44+ with HIP support). Alternatively, consider AMD’s own Quark tool to quantize models (Quark can output models in native PyTorch or Hugging Face AWQ format). For 4-bit, you might use the `AutoGPTQ` library or Hugging Face’s `AutoAWQ` to load an AWQ/GPTQ model on AMD. These let you run smaller, quantized DeepSeek models on consumer-grade VRAM (e.g., 20GB) at some accuracy cost.
Now let’s address common issues you might encounter on AMD and how to fix them, each with symptoms, causes, and solutions:
“No HIP GPUs are available” error (on AMD)
Symptom: When launching DeepSeek (e.g., through vLLM or PyTorch), the process fails with an error like RuntimeError: No HIP GPUs are available. In some cases it might simply report that no GPU/accelerator is found.
Likely Cause: This usually indicates that the AMD GPU isn’t accessible to the program. Common causes are missing permissions or drivers. On bare-metal Linux, your user might not have access to the AMD GPU device files (/dev/kfd and /dev/dri), especially if not in the “video” group. In containerized environments, it could mean the container wasn’t launched with the required flags to pass through the GPU. It could also happen if ROCm isn’t installed or initialized correctly, but if you’ve gotten this far, it’s likely a permission or environment issue rather than a code bug.
Fix Steps:
Verify ROCm installation: Run rocminfo or rocm-smi on the host to ensure your GPU is recognized. If these don’t detect a GPU, you may have a driver issue (install the correct ROCm version for your OS and GPU model).
Check user permissions: On a bare-metal setup, add your user to the video group which grants access to AMD GPU devices. For example: sudo usermod -aG video $USER (then log out/in). This is recommended by AMD to allow PyTorch (HIP) to see the GPU without root.
Container setup (if applicable): If you’re using Docker, launch the container with the proper flags. At minimum, include --device=/dev/kfd --device=/dev/dri --group-add=video and security options to allow GPU use. AMD provides ROCm-enabled container images (e.g., rocm/pytorch), which you should run with those flags.
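Putting those flags together, a typical launch of AMD's PyTorch container with the GPU passed through might look like the following sketch (the image tag and extra security options vary by ROCm release, so check AMD's container documentation for your version):

```shell
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add=video \
  --security-opt seccomp=unconfined \
  rocm/pytorch:latest
```

If `rocm-smi` works inside the container but not on the host (or vice versa), that localizes the problem to driver installation versus container configuration.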
Test in Python: In a Python REPL or script inside your environment, run:

```python
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
```

This uses PyTorch’s CUDA interface (which maps to HIP on AMD) to check GPU availability. It should print `True` and a device count of at least 1. If it returns `False` or 0, the GPU still isn’t accessible – re-check the steps above (and ensure you installed the ROCm-enabled PyTorch, not a CPU-only build).
Fallback Option: If after troubleshooting the GPU remains unusable (e.g., due to unsupported GPU model or drivers), you can only run DeepSeek on CPU as a fallback. This will be very slow and only feasible for small models or short prompts. As an alternative, you might try using a cloud instance with an AMD GPU to run your model. But ideally, resolve the environment issues – AMD GPUs can run DeepSeek effectively once recognized.
ROCm version mismatch or “invalid device function”
Symptom: DeepSeek fails to run, possibly with errors deep in AMD’s stack (e.g., hipErrorInvalidDeviceFunction or mysterious segfaults). This can happen immediately at model load or during the first inference. Another clue is if you built a custom extension and see errors about GFX code or unsupported ISA.
Likely Cause: These errors often arise from a mismatch between your installed ROCm stack (and GPU architecture) and the software expecting a certain support level. For example, using a newer GPU with an older ROCm can yield invalid device function errors. Or using a precompiled PyTorch that doesn’t include your GPU’s code generation. Essentially, the kernel (compiled GPU code) can’t execute on your GPU.
Fix Steps:
- Align ROCm version with hardware: Check AMD’s documentation for which ROCm versions support your GPU model. Upgrade or downgrade ROCm to match that. For instance, if you have an RX 7800 XT (Navi 3x), ensure you’re on ROCm 5.6+ where that architecture is supported.
- Use official Docker images: AMD’s ROCm Docker containers often bundle the correct drivers and libraries. Running DeepSeek inside a `rocm/pytorch:<version>` container can avoid host mismatches.
- Reinstall/compile PyTorch for your GPU: If using an uncommon or very new GPU, you might need to compile PyTorch from source with ROCm, specifying your GPU’s architecture in the build. This ensures the JIT or precompiled kernels cover your device.
- Check env variables: Occasionally, setting `HSA_OVERRIDE_GFX_VERSION` (for very new GPUs not officially supported) can help, but use this with caution and AMD’s guidance.
- After changes, re-run a simple test (as above) to confirm `torch.cuda.is_available()` is `True` and no `invalid device function` errors occur.
Fallback Option: If you cannot resolve a device function error due to an unsupported GPU, your options are limited: either switch to a supported GPU (NVIDIA or an older AMD that works with ROCm), or use CPU. AMD’s ROCm primarily supports GCN/RDNA-based discrete GPUs; some consumer GPUs (especially mobile or older GCN1.0 cards) are not supported. In those cases, CPU or cloud might be the only route.
“CUDA-only” package errors (e.g., NCCL or bitsandbytes issues)
Symptom: You encounter errors when importing or running DeepSeek code that reference CUDA libraries or NVIDIA-specific symbols – for example: RuntimeError: CUDA error: no kernel image is available for execution, or an immediate crash stating that libcudart.so or libnccl.so cannot be found. With bitsandbytes 8-bit loading, you might see a message that it was compiled without GPU support. Essentially, something is trying to use CUDA on an AMD setup.
Likely Cause: Some libraries in the LLM ecosystem are NVIDIA-centric. For instance, NCCL (Nvidia’s collective communication lib) may be pulled in by Transformers or Accelerate for multi-GPU support, and it won’t work on AMD (ROCm uses RCCL as an equivalent). Bitsandbytes by default is CUDA-only and needs the ROCm fork. If these packages are in your environment, they might throw errors or just fail silently and run on CPU. Another example is if a DeepSeek utility script calls torch.cuda.some_function() that isn’t implemented for HIP.
Fix Steps:
Avoid NCCL-only paths when possible: If you’re not doing distributed multi-GPU runs, avoid configurations that initialize NCCL. On ROCm, use ROCm-compatible distributed backends where applicable (e.g., RCCL-backed workflows) and follow the framework’s ROCm guidance. If you do need multi-GPU, ensure the required ROCm communication components are installed and matched to your ROCm/PyTorch stack.
Use ROCm forks of key libraries: For bitsandbytes, install the ROCm-enabled version as described earlier. AMD documentation and community resources describe installing ROCm-compatible bitsandbytes builds for 8-bit inference (either via a special wheel or building the rocm_enabled branch). Once properly installed, import bitsandbytes should no longer complain about missing GPU support.
Patch code expecting CUDA: If the DeepSeek pipeline code explicitly calls torch.cuda.set_device or torch.cuda.is_available() in ways that break under HIP, you might have to modify it. In most cases, PyTorch’s CUDA API is compatible with HIP (aliased to ROCm), but if something checks for NVIDIA GPU count or a specific compute capability, adjust that logic. For example, if code checks whether torch.cuda.get_device_properties(0).name contains “NVIDIA”, that will obviously not match AMD GPUs. Such checks can usually be safely skipped or altered.
Install AMD-compatible Transformers: The latest Transformers library is generally GPU-agnostic, but ensure you have a version that supports Accelerate on CPU or AMD. Upgrading to a newer version can help if the error was due to a bug (for instance, older Transformers versions tried to load custom CUDA kernels for certain architectures, such as BLOOM, which failed on AMD).
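As a minimal illustration of such a patch (the helper name is ours, not from any DeepSeek codebase), you can branch on PyTorch's build metadata instead of looking for an NVIDIA device name: on ROCm builds, `torch.version.hip` is set while the `torch.cuda` API still works.

```python
import importlib.util

def detect_gpu_backend() -> str:
    """Report which GPU backend the installed PyTorch targets.
    Hypothetical helper: use this instead of matching device names on 'NVIDIA'."""
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch
    if getattr(torch.version, "hip", None):   # ROCm build: torch.cuda.* maps to HIP
        return "rocm"
    if getattr(torch.version, "cuda", None):  # CUDA build (NVIDIA)
        return "cuda"
    return "cpu-only"

print(detect_gpu_backend())
```

A check like this degrades gracefully on CPU-only machines instead of crashing on a vendor-name mismatch.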
Fallback Option: After addressing the above, most “CUDA-only” issues can be resolved. If not, an interim fallback is to run in CPU mode by disabling GPU usage in code (e.g., pass device_map={'': 'cpu'} to from_pretrained to force CPU). This avoids the error but at significant performance cost. Long-term, contributing a fix or using updated forks for AMD is the better solution.
Kernel launch failures or out-of-memory (OOM) errors
Symptom: The model loads, but during inference you get a runtime error like RuntimeError: HIP error: invalid kernel launch or a message about “out of memory” or “Memory access fault” on the GPU. In a multi-GPU setting, you might see one GPU hang or fail while others continue. These often happen when generating long outputs or using long context with DeepSeek on AMD.
Likely Cause: This can happen either because you exhausted VRAM or hit a limitation/bug in the kernel implementations. DeepSeek models are huge; if the memory allocator can’t find contiguous space for an operation (like attention on a long sequence), ROCm may throw an OOM even if total VRAM isn’t fully used. Alternatively, there may be specific kernels that aren’t as robust on ROCm and fail under certain conditions (e.g., very large matrix ops might not have an optimized HIP path and could crash). Another cause is if using multiple GPUs without proper tensor parallel partitioning – one GPU might try to handle more than it can.
Fix Steps:
Reduce memory load: Try lowering the batch size or prompt length. For example, if you were running with a very long context (say 32k tokens), test with 8k to see if it passes. Reducing max_new_tokens for generation also limits memory used for the KV cache.
Use fp16/bf16: Ensure you didn’t accidentally load the model in full precision (fp32). FP16 uses half the memory. If your GPU and ROCm/PyTorch build support BF16, it can be a good option alongside FP16. It’s also memory-efficient and may provide better numeric range in some workloads.
Try ROCm-friendly optimizations (when supported): In some workflows, graph/compile-based approaches and alternative runtimes can improve stability or reduce overhead on AMD. For example, ONNX Runtime with the ROCm execution provider may help for models and operators that export cleanly to ONNX, and some inference stacks provide ROCm-compatible optimization paths. Support varies by model architecture, operators, and library versions—so treat this as an optional path and validate it with a small test workload before relying on it for critical workloads.
Update to latest ROCm: Newer ROCm releases (and PyTorch versions) include improved memory management and kernel performance. For instance, ROCm 5 to 6 brought many fixes for large AI models. Upgrading might resolve certain kernel launch failures as bugs get fixed.
Explicit garbage collection: In a long-running loop, occasionally call torch.cuda.empty_cache() (which on AMD clears HIP cache) between iterations to free cached blocks. This can help avoid fragmentation in long sessions.
Fallback Option: If you consistently hit OOMs even after trying smaller loads, you might need to offload some weights to CPU. The Hugging Face Accelerate device_map can split the model across GPU and CPU (or multiple GPUs). This slows things down but can get the model running. Another fallback is to use a quantized model – a 4-bit DeepSeek uses only 1/4 the memory of FP16, which can be the difference between crashing and succeeding.
Section 4: Running DeepSeek on Mac Metal
Apple’s M1/M2 chips (Apple Silicon) offer a capable environment for running LLMs like DeepSeek, thanks to high memory bandwidth and unified memory. However, you must use software that targets the Apple GPU (Metal API) rather than CUDA. llama.cpp with Metal support enables GPU acceleration on Apple Silicon via the Metal API. Another option is Ollama, which internally uses llama.cpp/Metal but provides a nicer interface.
Recommended Mac Workflow: Use a DeepSeek model in GGUF format and run it with llama.cpp (or a llama.cpp-based app such as Ollama). GGUF is widely used for local inference because it supports efficient loading and native quantization. You can either download a community-converted DeepSeek GGUF build from a reputable model hub (and verify the base model, license, and conversion notes before using it), or convert the original Hugging Face checkpoints to GGUF yourself using the official llama.cpp conversion tools. Once you have a .gguf file, you can run it via the llama.cpp CLI for maximum control, or use Ollama for a simpler, managed local experience.
llama.cpp (Metal): Compile llama.cpp on your Mac with Metal support enabled. The exact build flags and binary names may vary depending on the version of llama.cpp you are using, so always refer to the official repository for current instructions. In most cases, Metal support is enabled through a CMake build option that activates the Metal backend.
If using the Python bindings (llama-cpp-python), ensure that Metal support is enabled during installation according to the project’s documentation. Once installed, you can run the model with GPU offloading enabled (often exposed as an “n-gpu-layers” style flag in the CLI). Even partial GPU offloading can significantly improve performance compared to CPU-only execution.
Example (command-line flags may vary by llama.cpp version):

```shell
./main -m DeepSeek.gguf --n-gpu-layers <N>
```

The exact binary name and GPU offloading flags may differ depending on the llama.cpp version; refer to the project’s documentation for the current syntax. The `--n-gpu-layers` flag tells llama.cpp how many layers of the model to offload to the GPU. Setting a non-zero value is crucial – if you omit it, the model will run on CPU only. A practical approach is to start with a small number of GPU layers to confirm Metal offloading is active, then increase gradually until you hit memory pressure or instability. An M2 Pro/Max with 32GB can offload much more (and unified memory means it can spill to RAM if needed, albeit slowly).
Ollama: Install Ollama (which is a Mac application/CLI) and use it to run DeepSeek. If an official DeepSeek model is not in their registry, you can create a custom Modelfile. The Modelfile will simply reference your downloaded GGUF file (e.g., FROM ./DeepSeek.gguf in the file). Then run ollama run your-model-name or load it in the Ollama app. Ollama automatically uses Metal for supported models; just ensure you have Apple Silicon (it won’t work on Intel Macs). It provides nice features like caching and a GUI.
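A minimal Modelfile for a locally downloaded GGUF might look like this (the file name is a placeholder; see Ollama's Modelfile documentation for template and parameter directives):

```text
# Modelfile
FROM ./DeepSeek.gguf
```

Register and run it with `ollama create deepseek-local -f Modelfile` followed by `ollama run deepseek-local` (the name `deepseek-local` is our placeholder; choose your own).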
Now, for Mac-specific issues and their fixes:
Metal GPU not being used (DeepSeek is very slow)
Symptom: You launch a DeepSeek inference on Mac and get extremely slow generation and high CPU utilization. In llama.cpp, you might notice it says “using CPU” or doesn’t mention Metal at startup. In Ollama, you observe all CPU cores maxed out and the GPU idle. Essentially, the model is running on CPU instead of the Apple GPU.
Likely Cause: The Metal backend isn’t enabled. This could be because llama.cpp wasn’t compiled with Metal support, or you didn’t instruct it to use the GPU. By default, llama.cpp will use CPU unless compiled otherwise and given the n_gpu_layers parameter. Another possibility is running on an unsupported Mac (Intel Macs have no Metal inference for LLMs). If you did compile with Metal but forgot to offload layers, it will still use CPU for all computations.
Fix Steps:
- Compile with Metal: Ensure your build supports Metal. If you use llama-cpp-python, install a recent build compiled with Metal enabled (set the required CMake flag during installation) and verify that GPU offloading works. If you built llama.cpp manually, check the startup logs/help output to confirm Metal is enabled, then test with a small model.
- Use `--n-gpu-layers`: When running, always specify a positive number of layers to offload. Even `--n-gpu-layers 1` will confirm the Metal backend is active (and offload at least the first layer). For optimal speed, you’d offload as many layers as fit in memory, but even a few layers will drastically improve throughput by utilizing the GPU for those parts.
- Monitor usage: Use macOS Activity Monitor or Xcode’s Metal profiling tools to see whether the GPU is getting load. If not, double-check the steps above.
- Ollama specifics: Ollama should handle Metal automatically. If it’s slow, ensure you’re on an Apple Silicon Mac; running under Rosetta or an Intel binary by mistake could cause CPU execution, so download the correct Ollama version for ARM64. For custom models, ensure your Modelfile is correct and that the `.gguf` is a Metal-supported quant.
Fallback Option: If for some reason you cannot get Metal working (e.g., on an older Mac or due to configuration issues), you’ll be stuck with CPU inference. In that case, use the smallest possible DeepSeek model (maybe a 7B or a distilled variant) to make it somewhat tolerable. Alternatively, consider running the model on an external server and using an API to interface from your Mac.
Out-of-memory or swapping slowdowns on Mac
Symptom: The Mac starts using a lot of RAM and things slow down, or the process crashes when trying a long prompt or a larger model. You might not see a neat error message – instead the generation becomes extremely sluggish after a point, or macOS might kill the process. Sometimes, you might get a log from llama.cpp about failing to allocate memory for KV cache or context.
Likely Cause: Apple Silicon uses unified memory (shared between the CPU and GPU), which can make local deployment convenient—but it still has practical limits. Only part of system memory is effectively usable for GPU workloads, and when a model (or its KV cache for long prompts) pushes memory pressure too high, macOS may start swapping. Once swapping begins, performance can drop sharply and the process may become unstable or get terminated under heavy memory load.
Long context windows also increase memory usage significantly because attention and KV cache requirements grow with the amount of text you keep in context. As a result, a model that runs fine with short prompts may slow down dramatically or fail when you use very long prompts, large batch sizes, or aggressive generation limits.
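To see why context length matters, a back-of-the-envelope KV-cache estimate helps. The formula below assumes a standard attention layout (one K and one V tensor per layer, FP16 values) with illustrative layer and dimension counts, not DeepSeek's exact architecture:

```python
def kv_cache_gb(context_tokens: int, n_layers: int, hidden_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size: K and V stored per layer, per token."""
    total = 2 * n_layers * context_tokens * hidden_dim * bytes_per_value
    return total / 1e9

# Illustrative 7B-class shape: 32 layers, hidden size 4096, FP16 cache.
for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx, 32, 4096):.1f} GB")
```

The cache grows linearly with context, so a prompt 16x longer needs roughly 16x the cache memory on top of the model weights.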
Fix Steps:
- Quantize smaller: Use a lower precision GGUF. If you tried Q4_K or Q5_0 and it’s OOM, drop to Q4_0 (the simplest 4-bit) which is smaller. A 5-bit model can be ~25% larger than a 4-bit one, so that difference matters at the edge of memory.
- Reduce context length: Don’t use an extreme context size unless needed. Try running with 2048 or 4096 tokens context to see if it stays within memory. If you need long context for DeepSeek, consider that very long prompts will slow down dramatically on Mac anyway – you might be better off with a server-class GPU in those cases.
- Monitor memory: Keep an eye on the macOS memory pressure graph in Activity Monitor. If it turns red, you’re swapping. Also remember that any layers not offloaded to the GPU run on the CPU, which is slower; this can manifest as the first part of generation (prompt ingestion) being slow, then speeding up. To confirm, test with a shorter prompt and see if the initial latency improves.
- Use smaller model variants: DeepSeek has variants (e.g. a 32B distilled model or a 7B chat model). On a Mac with 16GB RAM, even a 32B 4-bit model is likely too much. Opt for the 7B or 13B versions if available.
Fallback Option: If you continue to hit memory issues, you may have to run on CPU with paging or look into offloading some layers back to CPU. In llama.cpp, you control this with --n-gpu-layers. For example, if 50 layers was too many, try --n-gpu-layers 30 so that the remaining layers stay on CPU. This hybrid mode uses CPU for the rest, which is slower but prevents outright failures. It’s similar to how one might distribute layers across devices. In the worst case, use an even smaller quant (like 3-bit if an experimental one exists) or a different machine with more RAM.
Build and installation issues (Metal or Xcode)
Symptom: You had trouble getting llama.cpp or its Python wrapper installed with Metal enabled. Perhaps pip install llama-cpp-python failed with a cryptic error, or CMake complained about missing Metal frameworks. This stops you from even running the model.
Likely Cause: Building for Metal on Mac requires Xcode Command Line Tools and the correct CMake flags. If those aren’t in place, the build fails. For instance, if you didn’t have Xcode CLI tools, you might see errors about metal library not found or similar. The pip installer needs a working C++ compiler environment (Apple Clang) with developer tools. Missing those will cause an install failure.
Fix Steps:
- Install Xcode Command Line Tools: Open Terminal and run `xcode-select --install`. This installs the compilers and Apple development frameworks needed. Verify with `xcode-select -p`; it should point to an Xcode developer path.
- Update Homebrew libs if needed: If CMake is having trouble, ensure you have the latest CMake (`brew install cmake`) and possibly pkg-config. For Metal, however, it mostly depends on Xcode.
- Specify flags properly: As noted, use `-DGGML_METAL=ON` with the llama.cpp CMake build. If invoking CMake manually, also ensure the Metal framework is being linked. The official llama.cpp GitHub lists “Metal (Apple Silicon)” as a supported backend, so follow their instructions. In some cases you may need to target a minimum macOS version; set the `MACOSX_DEPLOYMENT_TARGET` env var to 12 or 13 if needed to avoid symbol issues.
- Use pre-built alternatives: If building is too troublesome, you can use a conda package that is pre-compiled with Metal support. The llama-cpp-python docs suggest using a Conda environment with a known-good build; Conda Forge often has a compiled version for Mac, so you don’t need to compile from source.
- Ollama note: Installing Ollama is straightforward (Homebrew or .dmg), but if you encounter an error, make sure you’re on macOS 12+ (required for M1 support).
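Putting the build steps together, a typical from-source build might look like the sketch below (flags and target names change between llama.cpp versions, so confirm against the repository's current README before running):

```shell
xcode-select --install            # once, if the CLI tools are missing
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON    # configure with the Metal backend enabled
cmake --build build --config Release
```

After a successful build, run a small model first and check the startup log for Metal initialization before loading a large DeepSeek variant.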
Fallback Option: Should building still fail, a last resort is running a non-Metal version (CPU-only). But before that, consider using an alternate library: for example, Core ML conversions (Apple has a Core ML conversion for Llama 2, and perhaps DeepSeek could be converted similarly). Core ML might not support DeepSeek out-of-the-box, but it’s an area to watch. For now, ensure Xcode is set up – that solves most build issues on Mac.
Section 5: Format Guidance by Platform
Different platforms benefit from different model formats and quantization strategies. Here’s how to choose the right format for DeepSeek on your hardware:
- Apple Silicon (Mac) – use GGUF: The GGUF format (successor to GGML) is the de facto standard for running LLMs with llama.cpp and compatible apps on Mac. Always convert your DeepSeek model to GGUF for Metal inference; you can do this with llama.cpp’s conversion tools. Some third-party toolchains may also export GGUF-compatible weights, but always verify the output format against the latest llama.cpp documentation. Why GGUF? It’s optimized for fast CPU/GPU memory mapping and supports quantization natively. Specifically, use 4-bit quantization if the model is large – e.g., DeepSeek 30B in Q4_0.gguf – which drastically reduces memory usage so the model can fit in Mac RAM. See Run DeepSeek with GGUF for detailed steps on converting and running models with llama.cpp. Choose a quantized GGUF build when memory is limited, and quantize as much as you can tolerate (Q4 or Q5 usually). FP16 GGUF models are possible but generally too large for the Mac GPU and will fall back to CPU.
- AMD ROCm (AMD GPUs) – FP16 or AWQ/GPTQ: On AMD, you have more flexibility with formats since you’re using PyTorch or similar. If your AMD GPU has plenty of VRAM (e.g. 48GB+ Instinct MI210 or MI300X), you can run DeepSeek in half precision (FP16 or BF16), which offers the best accuracy. However, most users will consider quantized formats to reduce memory. AWQ (Activation-aware Weight Quantization) is a 4-bit weight quant method that AMD has embraced: it’s integrated in tools like Hugging Face `autoawq`, and AMD’s Quark produces AWQ models. AWQ is a good choice on AMD because it doesn’t require custom CUDA kernels; the quantized model runs with standard operations (some in int8). GPTQ, another popular 4-bit method, can offer even faster inference in some experimental setups thanks to more compact weight packing and a single INT4 matrix-multiply kernel, but actual performance depends heavily on the runtime implementation, HIP/ROCm kernel support, and your hardware. So which to choose? If using Transformers or Accelerate on ROCm, AWQ might be easier: you can quantize with Quark or `AutoAWQ` and load the model with built-in support. GPTQ on ROCm might require the `AutoGPTQ` library with `trust_remote_code=True`, which should work but may not be as optimized (possibly running some ops in FP16 under the hood). If you use an environment like vLLM or text-generation-inference that supports GPTQ with Triton kernels, those kernels might not yet support HIP – check their documentation. It’s also worth noting that some projects (like exllama) that speed up GPTQ are CUDA-specific and won’t run on AMD without modification. Therefore, AWQ is currently the more plug-and-play 4-bit solution on ROCm, whereas GPTQ might yield better speed if you put in the effort to get it working. In either case, quantization can drastically lower VRAM requirements.
- Mixed-precision and others: DeepSeek’s Mixture-of-Experts architecture particularly benefits from mixed precision. If you quantize, consider keeping the most critical expert layers at higher precision. The DeepSeek community found that an Int4 + Int8 mix (with sensitive layers in 8-bit) preserved quality much better than pure 4-bit.
This strategy can be applied on AMD by selectively quantizing layers – some tools allow you to exclude certain layers from 4-bit quant. On Mac, this level of control isn’t readily available (llama.cpp quantizes everything uniformly), so you get either full 4-bit or not. On AMD, you can load a mixed model (as done in the QuantTrio GPTQ Compact model) if your framework supports it. Keep this in mind if you notice quality issues with an aggressively quantized model: you might need a “hybrid” quant or just use a smaller model with higher precision.
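To see why hybrid quantization is attractive, here's a back-of-the-envelope weight-memory estimate. This is a sketch: the 16B parameter count and the 20% "sensitive layers" split are illustrative assumptions, and it ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits: float) -> float:
    """Weight-only footprint: parameters x bits per weight / 8, in GB.
    Ignores KV cache, activations, and runtime overhead."""
    return n_params * bits / 8 / 1e9

# Hypothetical 16B-parameter checkpoint (size chosen for illustration).
n = 16e9
fp16 = weight_memory_gb(n, 16)   # 32.0 GB - needs a large-VRAM GPU
int4 = weight_memory_gb(n, 4)    # 8.0 GB - fits consumer cards
# Hybrid: keep 20% of weights (sensitive expert layers) at 8-bit.
hybrid = weight_memory_gb(n * 0.8, 4) + weight_memory_gb(n * 0.2, 8)
print(fp16, int4, hybrid)        # hybrid = 9.6 GB
```

The hybrid variant costs only a modest amount of extra memory over pure Int4 while keeping the quality-sensitive layers at 8-bit, which is the trade-off the community mix exploits.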
Finally, remember that quantization is version-sensitive. Always use the quant format that your runtime expects. For example, GGUF has versioning – use the latest llama.cpp for the newest format. For AWQ/GPTQ, ensure the libraries (Transformers, vLLM, etc.) are up to date to handle those quantized weights. Refer to the DeepSeek Quantization Guide for deeper discussion on quantization methods and their trade-offs.
Section 6: Troubleshooting Index
Below is a quick index of common DeepSeek deployment problems and where to find solutions in this guide:
- “No HIP GPUs are available” (AMD ROCm) – This error means the AMD GPU isn’t being detected. See Running DeepSeek on AMD ROCm under the No HIP GPUs symptom for fixes (Section 3). Likely a permissions or setup issue.
- DeepSeek running on CPU instead of GPU (Mac Metal) – If token generation is extremely slow on Mac, the model may not be using the Metal GPU. See Running DeepSeek on Mac Metal, the first issue in Section 4, about the Metal GPU not being used. Solution: compile with Metal and set --n_gpu_layers.
- Out of memory / GPU memory errors – On both AMD and Mac you may hit memory limits. For AMD, see Section 3 under kernel launch failures or OOM. For Mac, see Section 4 under out-of-memory or swapping slowdowns. Suggestions include quantization and reducing context length.
- “invalid device function” or similar (AMD) – Indicates a GPU/ROCm version mismatch. Refer to the ROCm version mismatch portion in Section 3 for how to align your software with your GPU.
- Build/installation failure (Mac) – If llama.cpp or Ollama installation fails on Mac, see the build issues item in Section 4. Ensure Xcode CLI tools are installed and the proper flags are used.
- Incorrect outputs after quantization – If DeepSeek’s answers are gibberish or low-quality after using a 4-bit model, you may have over-quantized. See Section 5 (Format Guidance) regarding mixed precision; pure Int4 can degrade accuracy. Try a hybrid format or higher precision for critical layers.
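For the "running on CPU" class of symptoms above, a small probe can tell you which accelerator a PyTorch-based runtime actually sees. This is a sketch that assumes a PyTorch stack (llama.cpp-based tools like Ollama report Metal usage in their own logs instead); note that on ROCm builds, AMD GPUs appear under the familiar "cuda" device name because HIP aliases the CUDA API.

```python
def pick_device() -> str:
    """Return the best accelerator visible to PyTorch: 'cuda' covers both
    NVIDIA CUDA and AMD ROCm/HIP builds, 'mps' is Apple Metal."""
    try:
        import torch
        if torch.cuda.is_available():               # NVIDIA CUDA or AMD ROCm
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():  # Apple Metal (MPS)
            return "mps"
        return "cpu"
    except ImportError:
        return "cpu"  # torch not installed at all

print(pick_device())
```

If this prints "cpu" on a machine with a supported GPU, you are most likely running a CPU-only build or hitting one of the detection issues covered in Sections 3 and 4.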
By following this guide and its references to fixes, you should be able to deploy DeepSeek on AMD ROCm GPUs or Apple Metal (Mac) with relative ease. The key is to use the right toolchain for your hardware and not to panic when you hit an error: in most cases it's a known issue with a documented solution.
Hardware compatibility and performance can change depending on driver updates, ROCm releases, and llama.cpp improvements. Always consult official documentation before deploying production systems.
Conclusion:
Running DeepSeek outside of the NVIDIA ecosystem is technically feasible with the appropriate runtime configuration: AMD ROCm offers a powerful platform for those with Radeon/Instinct GPUs, and Apple’s M-series Macs can handle reasonably sized models with the Metal-accelerated llama.cpp. Always ensure your software versions (drivers, PyTorch, etc.) align with the tips above, and don’t hesitate to use quantization to get the model to a workable size. DeepSeek’s performance and compatibility ultimately depend on matching the model to your hardware capabilities. For an overview of official DeepSeek models and usage options, visit our main DeepSeek guide. Understanding model variants and deployment formats will help you choose the correct configuration for AMD ROCm or Apple Metal environments.
Frequently Asked Questions (FAQ)
Can DeepSeek run on AMD without CUDA?
Yes. DeepSeek models can run on AMD GPUs using the ROCm software stack instead of NVIDIA CUDA. ROCm provides GPU acceleration through the HIP platform, which is compatible with PyTorch and several inference frameworks. To run DeepSeek on AMD hardware, you must install a ROCm-enabled PyTorch build and use a runtime that supports HIP execution, such as Hugging Face Transformers or vLLM with ROCm support.
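A quick way to confirm you actually have a ROCm-enabled PyTorch build (rather than a CUDA or CPU-only wheel) is to inspect the build metadata: ROCm wheels set torch.version.hip, while CUDA wheels set torch.version.cuda. A minimal, hedged probe:

```python
# Probe which GPU backend this PyTorch install was compiled against.
# ROCm builds set torch.version.hip; CUDA builds set torch.version.cuda.
try:
    import torch
    if getattr(torch.version, "hip", None):
        backend = "rocm"
    elif torch.version.cuda:
        backend = "cuda"
    else:
        backend = "cpu-only"
except ImportError:
    backend = "torch not installed"

print(backend)
```

If this reports "cpu-only" on an AMD box, you installed the default wheel; reinstall from the ROCm index per the official PyTorch instructions before retrying DeepSeek.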
Does DeepSeek support Apple Silicon natively?
DeepSeek models do not include a native macOS or Metal-specific runtime. However, they can be deployed on Apple Silicon Macs using compatible inference engines such as llama.cpp with Metal support or applications like Ollama. These tools enable GPU acceleration through Apple’s Metal API when models are converted into supported formats.
What format is required for Mac deployment?
For Apple Silicon Macs, the recommended format is GGUF. GGUF is optimized for use with llama.cpp and other Metal-compatible runtimes. In most local deployment scenarios, a quantized GGUF model (such as 4-bit or 5-bit) is preferred to reduce memory usage and improve performance on unified memory systems.
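Because Apple Silicon shares one unified memory pool between the OS, the model weights, and the KV cache (and Metal caps GPU-visible memory below total RAM), it helps to sanity-check whether a given GGUF file will fit before downloading it. The helper below is a rough heuristic, not a spec: the 8 GB headroom default is an illustrative assumption.

```python
def fits_on_mac(model_file_gb: float, total_ram_gb: float,
                headroom_gb: float = 8.0) -> bool:
    """Heuristic fit check for unified memory: the GGUF weights, the KV
    cache, and macOS itself all share one RAM pool, so leave generous
    headroom. The 8 GB default is an illustrative assumption."""
    return model_file_gb + headroom_gb <= total_ram_gb

# A ~4 GB 4-bit GGUF of a 7B-class model fits comfortably on a 16 GB Mac,
# while a ~40 GB file clearly does not:
print(fits_on_mac(4.1, 16.0))   # True
print(fits_on_mac(40.0, 16.0))  # False
```

When the check fails, drop to a smaller DeepSeek variant or a more aggressive quantization level rather than relying on swap, which makes generation unusably slow.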