DevTk.AI

LLM VRAM Calculator

Calculate GPU VRAM requirements for running local LLMs. Check if your GPU can run any model.

Total: 24 GB VRAM

0.56 bytes/param

VRAM Estimation

VRAM Usage: 180 GB / 24 GB (751.5%)
EXCEEDS VRAM by 156 GB

Model Weights: 37 GB (70B × 0.56 B/param)
KV Cache: 143 GB (4,096 ctx length)
Overhead: 0.50 GB (runtime + CUDA kernels)

Fits: NO

This model does not fit. You need 156 GB more VRAM.

Recommendation

This model is too large even at the lowest quantization for your current GPU setup. Consider using more GPUs or a GPU with more VRAM.

Quantization Comparison

| Quantization | Model Size | + KV Cache | Total VRAM | Fits? |
|---|---|---|---|---|
| FP16 (16-bit) | 130 GB | 143 GB | 274 GB | No |
| Q8_0 (8-bit) | 72 GB | 143 GB | 216 GB | No |
| Q6_K (6-bit) | 54 GB | 143 GB | 198 GB | No |
| Q5_K_M (5-bit) | 45 GB | 143 GB | 189 GB | No |
| Q4_K_M (4-bit) | 37 GB | 143 GB | 180 GB | No |
| Q3_K_M (3-bit) | 29 GB | 143 GB | 173 GB | No |
| Q2_K (2-bit) | 22 GB | 143 GB | 166 GB | No |
| GPTQ 4-bit | 37 GB | 143 GB | 180 GB | No |
| AWQ 4-bit | 37 GB | 143 GB | 180 GB | No |
| GGUF IQ2_XS (2-bit) | 20 GB | 143 GB | 164 GB | No |
Settings: GPU: RTX 4090 (24 GB) | Model: Llama 3.3 70B | Quant: Q4_K_M (4-bit) | Context: 4,096 tokens

Note: These are estimates. Actual VRAM usage varies based on model architecture, inference engine (llama.cpp, vLLM, etc.), batch size, and system configuration. KV cache uses a simplified GQA estimation. For MoE models, all expert weights must reside in VRAM even though only a subset is active per token.

How to Use This Tool

  1. Select your GPU from the dropdown — it includes NVIDIA consumer cards (RTX 30/40 series), data center GPUs (A100, H100), and Apple Silicon Macs.
  2. Choose the number of GPUs if you're using a multi-GPU setup (1-8 GPUs supported).
  3. Select the LLM model you want to run, or choose 'Custom' to enter parameter count manually.
  4. Pick a quantization level — Q4_K_M is recommended for most users as a balance of quality and VRAM savings.
  5. Set your desired context length — longer contexts require more VRAM for the KV cache.
  6. Review the results: the VRAM bar shows usage vs available memory, with a detailed breakdown and recommendation.

Running LLMs Locally: VRAM Requirements Explained

Running large language models locally on your own GPU gives you complete privacy, zero API costs, and offline access. However, the biggest constraint is VRAM (Video RAM) — the memory on your graphics card. Every model parameter must fit in VRAM, along with the KV cache for your conversation context.

Model size in VRAM depends primarily on two factors: the number of parameters and the quantization level. A 70B-parameter model in FP16 (16-bit) requires ~140GB of VRAM, far more than any single consumer GPU offers. With Q4_K_M quantization (~4.5 bits per parameter), the same model shrinks to ~37-39GB, achievable with a pair of RTX 3090/4090 cards (2 × 24 GB) or an M-series Mac with 64GB+ unified memory.
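The weights calculation above can be sketched as follows. The bits-per-parameter figures are approximations (real GGUF K-quants mix precisions); the Q4_K_M value of ~4.5 bits matches this calculator's stated 0.56 bytes/param.

```python
# Rough model-weight footprint: parameters x bits per parameter / 8.
# Bits-per-param figures are approximate, not exact format constants.
BITS_PER_PARAM = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.5,  # ~0.56 bytes/param, matching this calculator
}

def weight_gib(params_billions: float, quant: str) -> float:
    """Model weight size in GiB for a given parameter count and quant."""
    total_bytes = params_billions * 1e9 * BITS_PER_PARAM[quant] / 8
    return total_bytes / 1024**3

print(f"70B @ FP16:   {weight_gib(70, 'FP16'):.0f} GiB")    # ~130 GiB
print(f"70B @ Q4_K_M: {weight_gib(70, 'Q4_K_M'):.0f} GiB")  # ~37 GiB
```

Note that sizes are reported in GiB (1024³ bytes), which is why 140 × 10⁹ bytes of FP16 weights appear as ~130 "GB" in the table above.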

Quantization reduces model precision to save VRAM at the cost of some quality loss. The most popular quantization methods are GGUF (via llama.cpp) and GPTQ/AWQ (via vLLM, ExLlamaV2). For GGUF, Q4_K_M and Q5_K_M are the community sweet spots — minimal quality loss with significant VRAM savings. Below Q3_K_M, quality degrades noticeably.

Mixture of Experts (MoE) models like DeepSeek R1 (671B parameters) are special: all expert parameters must be loaded into VRAM even though only ~37B are active per token. This means the full 671B model needs to fit in memory. MoE models are generally more cost-effective than dense models of similar quality but require substantially more VRAM.
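A quick back-of-envelope, reusing the calculator's Q4_K_M figure of 0.56 bytes/param as an assumption, shows why MoE memory residency is set by total rather than active parameters:

```python
# MoE: ALL expert weights stay resident in VRAM; the active-parameter
# count affects speed per token, not the memory footprint.
total_params = 671e9    # DeepSeek R1, total parameters
active_params = 37e9    # active per token
bytes_per_param = 0.56  # this calculator's Q4_K_M figure (assumption)

resident_gib = total_params * bytes_per_param / 1024**3
active_gib = active_params * bytes_per_param / 1024**3
print(f"Resident weights: ~{resident_gib:.0f} GiB "
      f"(only ~{active_gib:.0f} GiB of that is active per token)")
```

So even at 4-bit, the full model needs roughly 350 GiB of memory, while the per-token compute touches only about a twentieth of it.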

The KV cache grows linearly with context length and can consume significant VRAM in long conversations. At 128K context with a 70B model, the FP16 KV cache alone can run to tens of gigabytes. This calculator includes the KV cache in its estimates; if VRAM is tight, reducing context length is the easiest lever.
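For a concrete per-architecture estimate (distinct from the calculator's simplified GQA estimation), the usual formula is 2 (K and V) × layers × KV heads × head dim × bytes per element × context tokens. The layer and head counts below are Llama-3-70B-style assumptions:

```python
def kv_cache_gib(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 KV-cache size in GiB for a GQA transformer (assumed shapes)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return ctx_tokens * per_token / 1024**3

print(f"4,096 ctx:   {kv_cache_gib(4096):.2f} GiB")    # 1.25 GiB
print(f"131,072 ctx: {kv_cache_gib(131072):.1f} GiB")  # 40.0 GiB
```

With grouped-query attention (8 KV heads instead of 64 query heads) the cache is 8× smaller than full multi-head attention would give; an estimator that assumes more KV heads will report much larger figures.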

Last updated: February 2026

FAQ

How is VRAM calculated?

VRAM = Model Weights (parameters × bytes per param based on quantization) + KV Cache (scales with context length) + Runtime Overhead (~500MB). For MoE models like DeepSeek R1, all expert parameters must be loaded even though only a fraction are active per token.
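The formula above, plugged with this page's example inputs (a 24 GB RTX 4090, 70B parameters at Q4_K_M's 0.56 bytes/param, and the calculator's 143 GB KV-cache figure), can be sketched as:

```python
def vram_estimate(params_b: float, bytes_per_param: float,
                  kv_cache_gib: float, overhead_gib: float = 0.5) -> float:
    """Total VRAM in GiB: model weights + KV cache + runtime overhead."""
    weights_gib = params_b * 1e9 * bytes_per_param / 1024**3
    return weights_gib + kv_cache_gib + overhead_gib

total = vram_estimate(70, 0.56, 143)
print(f"Needs ~{total:.0f} GiB; fits in 24 GiB: {total <= 24}")
# Needs ~180 GiB; fits in 24 GiB: False
```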

What quantization should I use?

Q4_K_M is the sweet spot for most users — it offers a good balance of quality and VRAM savings. Q5_K_M is better quality with slightly more VRAM. Q3_K_M and below start to noticeably affect output quality. Use Q8_0 or FP16 if you have enough VRAM.

Can I run models across multiple GPUs?

Yes! Select the GPU count to see how much total VRAM you have. Models can be split across GPUs using frameworks like llama.cpp, vLLM, or Ollama. However, multi-GPU setups have some overhead for inter-GPU communication.

Can Apple Silicon Macs run large LLMs?

Yes! Apple Silicon Macs use unified memory shared between CPU and GPU, giving them an advantage for large models. An M4 Max with 128GB can run 70B models in Q4_K_M quantization comfortably. An M3 Pro with 36GB can handle 14B models. Performance is lower than dedicated NVIDIA GPUs but the larger memory pool is a unique advantage.

What's the difference between GGUF, GPTQ, and AWQ?

GGUF (GPT-Generated Unified Format) is the standard for llama.cpp and Ollama — best for CPU/GPU hybrid inference. GPTQ and AWQ are GPU-only quantization methods used by vLLM and ExLlamaV2 — typically faster on dedicated GPUs. All three achieve similar quality at the same bit width. GGUF is the most user-friendly choice for most setups.

Why does my model use more VRAM than calculated?

This calculator estimates the minimum VRAM for the model weights, KV cache, and runtime overhead. Actual usage can be higher due to: CUDA context (~200-500MB), activation memory during inference, and framework-specific overhead. If the calculator shows a tight fit, you may need a smaller quantization or context length in practice.

Related Tools