LLM VRAM Calculator
Calculate GPU VRAM requirements for running local LLMs. Check if your GPU can run any model.
VRAM Estimation
This model does not fit. You need 156 GB more VRAM.
Recommendation
This model is too large even at the lowest quantization for your current GPU setup. Consider using more GPUs or a GPU with more VRAM.
Quantization Comparison
| Quantization | Model Size | + KV Cache | Total VRAM | Fits? |
|---|---|---|---|---|
| FP16 (16-bit) | 130 GB | 143 GB | 274 GB | No |
| Q8_0 (8-bit) | 72 GB | 143 GB | 216 GB | No |
| Q6_K (6-bit) | 54 GB | 143 GB | 198 GB | No |
| Q5_K_M (5-bit) | 45 GB | 143 GB | 189 GB | No |
| Q4_K_M (4-bit) | 37 GB | 143 GB | 180 GB | No |
| Q3_K_M (3-bit) | 29 GB | 143 GB | 173 GB | No |
| Q2_K (2-bit) | 22 GB | 143 GB | 166 GB | No |
| GPTQ 4-bit | 37 GB | 143 GB | 180 GB | No |
| AWQ 4-bit | 37 GB | 143 GB | 180 GB | No |
| GGUF IQ2_XS (2-bit) | 20 GB | 143 GB | 164 GB | No |
Note: These are estimates. Actual VRAM usage varies based on model architecture, inference engine (llama.cpp, vLLM, etc.), batch size, and system configuration. KV cache uses a simplified GQA estimation. For MoE models, all expert weights must reside in VRAM even though only a subset is active per token.
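The weights-only column of the table above can be approximated with a simple bytes-per-parameter lookup. This is a minimal sketch, not the calculator's exact code: the bytes-per-param values are approximations inferred from the table (for an assumed ~65B-parameter dense model), and GGUF K-quants carry per-block scale metadata, which is why e.g. Q4_K_M works out to roughly 4.5 bits per weight rather than exactly 4.0.

```python
# Sketch of the weights-only size estimate behind the comparison table.
# Bytes-per-parameter values are approximations inferred from the table,
# not official figures.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.11,
    "Q6_K": 0.83,
    "Q5_K_M": 0.69,
    "Q4_K_M": 0.57,
    "Q3_K_M": 0.45,
    "Q2_K": 0.34,
}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[quant] / 1e9

# Hypothetical ~65B dense model, roughly matching the table above:
for quant in BYTES_PER_PARAM:
    print(f"{quant}: {model_size_gb(65, quant):.0f} GB")
```

With these values, a 65B model lands at ~130 GB in FP16 and ~37 GB at Q4_K_M, in line with the table.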
How to Use This Tool
- Select your GPU from the dropdown — it includes NVIDIA consumer cards (RTX 30/40 series), data center GPUs (A100, H100), and Apple Silicon Macs.
- Choose the number of GPUs if you're using a multi-GPU setup (1-8 GPUs supported).
- Select the LLM model you want to run, or choose 'Custom' to enter parameter count manually.
- Pick a quantization level — Q4_K_M is recommended for most users as a balance of quality and VRAM savings.
- Set your desired context length — longer contexts require more VRAM for the KV cache.
- Review the results: the VRAM bar shows usage vs available memory, with a detailed breakdown and recommendation.
Running LLMs Locally: VRAM Requirements Explained
Running large language models locally on your own GPU gives you complete privacy, zero API costs, and offline access. However, the biggest constraint is VRAM (Video RAM) — the memory on your graphics card. Every model parameter must fit in VRAM, along with the KV cache for your conversation context.
Model size in VRAM depends primarily on two factors: the number of parameters and the quantization level. A 70B parameter model in FP16 (full precision) requires ~140GB of VRAM, far more than any single consumer GPU. But with Q4_K_M quantization (4-bit), the same model fits in ~39GB — achievable with a pair of RTX 3090s or 4090s, or an M-series Mac with 64GB+ unified memory.
Quantization reduces model precision to save VRAM at the cost of some quality loss. The most popular quantization methods are GGUF (via llama.cpp) and GPTQ/AWQ (via vLLM, ExLlamaV2). For GGUF, Q4_K_M and Q5_K_M are the community sweet spots — minimal quality loss with significant VRAM savings. Below Q3_K_M, quality degrades noticeably.
Mixture of Experts (MoE) models like DeepSeek R1 (671B parameters) are special: all expert parameters must be loaded into VRAM even though only ~37B are active per token. This means the full 671B model needs to fit in memory. MoE models are generally more cost-effective than dense models of similar quality but require substantially more VRAM.
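The MoE point above comes down to one line of arithmetic: VRAM for weights is driven by *total* parameters, not active ones. A quick illustrative sketch (the ~0.57 bytes/param figure for Q4_K_M is an assumption, not an official number):

```python
# MoE sketch: all experts must be resident in VRAM, so weight size is driven
# by total parameters, not the ~37B active per token. Figures are illustrative.
def weights_gb(total_params_b: float, bytes_per_param: float) -> float:
    return total_params_b * 1e9 * bytes_per_param / 1e9

total_b, active_b = 671, 37   # DeepSeek R1-style MoE: total vs. active params
q4 = 0.57                     # assumed approx bytes/param for Q4_K_M

print(f"VRAM for weights: {weights_gb(total_b, q4):.0f} GB")      # sized by 671B
print(f"If only active params mattered: {weights_gb(active_b, q4):.0f} GB")
```

Even at 4-bit, the full 671B of weights needs on the order of ~380 GB, versus ~21 GB if only the active parameters had to be loaded — which is why MoE models demand so much more VRAM than their per-token compute suggests.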
The KV cache grows with context length and can consume significant VRAM for long conversations. At 128K context with a 70B model, a full-precision KV cache alone can consume tens of gigabytes. This calculator accounts for KV cache in its estimates. If VRAM is tight, consider reducing context length.
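A simplified GQA KV-cache estimate can be written in a few lines. This is a sketch under stated assumptions, not the calculator's exact formula: an FP16 cache and a Llama-3-70B-like shape (80 layers, 8 KV heads, head dimension 128) are used purely for illustration.

```python
# Simplified GQA KV-cache estimate. Assumptions: FP16 cache (2 bytes/element)
# and a Llama-3-70B-like shape (80 layers, 8 KV heads, head_dim 128).
def kv_cache_gb(ctx_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    # K and V each store n_kv_heads * head_dim values per layer per token
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_len * per_token_bytes / 1e9

# 128K context with the assumed 70B-class shape:
print(f"{kv_cache_gb(131072, 80, 8, 128):.1f} GB")
```

Under these assumptions the cache comes to roughly 43 GB at 128K context — halving the context length (or quantizing the cache to 8-bit, where the engine supports it) halves that figure.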
Last updated: February 2026
FAQ
How is VRAM calculated?
VRAM = Model Weights (parameters × bytes per param based on quantization) + KV Cache (scales with context length) + Runtime Overhead (~500MB). For MoE models like DeepSeek R1, all expert parameters must be loaded even though only a fraction are active per token.
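The formula above can be sketched directly in code. All constants here (bytes/param for Q4_K_M, the KV cache size, the 0.5 GB overhead) are illustrative assumptions, not the calculator's exact values:

```python
# Total VRAM = model weights + KV cache + runtime overhead (FAQ formula).
# Constants are illustrative assumptions, not the calculator's exact values.
def total_vram_gb(params_b: float, bytes_per_param: float,
                  kv_gb: float, overhead_gb: float = 0.5) -> float:
    weights_gb = params_b * 1e9 * bytes_per_param / 1e9
    return weights_gb + kv_gb + overhead_gb

def fits(total_gb: float, gpu_gb: float, n_gpus: int = 1) -> bool:
    return total_gb <= gpu_gb * n_gpus

# Hypothetical 70B model at Q4_K_M (~0.57 bytes/param) with a 10 GB KV cache:
need = total_vram_gb(70, 0.57, kv_gb=10)
print(f"{need:.1f} GB needed; fits on 2x 24GB? {fits(need, 24, n_gpus=2)}")
```

Note how the KV cache tips the balance: the weights alone (~40 GB) would fit on two 24 GB cards, but the total (~50 GB) does not.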
What quantization should I use?
Q4_K_M is the sweet spot for most users — it offers a good balance of quality and VRAM savings. Q5_K_M is better quality with slightly more VRAM. Q3_K_M and below start to noticeably affect output quality. Use Q8_0 or FP16 if you have enough VRAM.
Can I run models across multiple GPUs?
Yes! Select the GPU count to see how much total VRAM you have. Models can be split across GPUs using frameworks like llama.cpp, vLLM, or Ollama. However, multi-GPU setups have some overhead for inter-GPU communication.
Can Apple Silicon Macs run large LLMs?
Yes! Apple Silicon Macs use unified memory shared between CPU and GPU, giving them an advantage for large models. An M4 Max with 128GB can run 70B models in Q4_K_M quantization comfortably. An M3 Pro with 36GB can handle 14B models. Performance is lower than dedicated NVIDIA GPUs but the larger memory pool is a unique advantage.
What's the difference between GGUF, GPTQ, and AWQ?
GGUF (GPT-Generated Unified Format) is the standard for llama.cpp and Ollama — best for CPU/GPU hybrid inference. GPTQ and AWQ are GPU-only quantization methods used by vLLM and ExLlamaV2 — typically faster on dedicated GPUs. All three achieve similar quality at the same bit width. GGUF is the most user-friendly choice for most setups.
Why does my model use more VRAM than calculated?
This calculator estimates the minimum VRAM for the model weights, KV cache, and runtime overhead. Actual usage can be higher due to: CUDA context (~200-500MB), activation memory during inference, and framework-specific overhead. If the calculator shows a tight fit, you may need a smaller quantization or context length in practice.