LLM RAM Calculator

Estimate Memory Requirements for Running LLMs Locally

Select a model, quantization level, and context length to calculate RAM/VRAM needs.

Memory Estimate

The calculator reports three figures: model weights, KV cache, and total RAM / VRAM. The key-value (KV) cache stores intermediate attention values for the context window and scales linearly with context length.

Side-by-Side Model Comparison

RAM requirements at Q4_K_M quantization with 4K context, sorted by memory.

Understanding LLM RAM Requirements

Running Large Language Models locally requires significant memory resources. Here's what you need to know:

Key Factors Affecting Memory Usage

  • Model Size: More parameters = more memory. Each billion parameters at FP16 ≈ 2 GB.
  • Quantization levels:
      • FP32: Full precision. Best quality but 2× memory vs FP16. Rarely needed for inference.
      • FP16/BF16: Half precision. Standard for GPU inference. Near-lossless quality.
      • INT8/Q8: 8-bit integer. Minimal quality loss, ~50% memory vs FP16. Supported by llama.cpp, ExLlama.
      • Q4_K_M ★: 4-bit with K-quant mixing. Best quality-per-byte ratio. The de facto standard for local inference.
      • Q3/Q2: Very aggressive quantization. Noticeable quality degradation. Only for memory-constrained scenarios.
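  • Context Length: The KV cache grows linearly with context length. A 128K context can add anywhere from several GB to tens of GB, depending on the model's layer count, number of KV heads, and cache precision.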
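
The factors above can be combined into a rough estimate. The following is a minimal sketch, not the calculator's actual formula: the bits-per-weight figures for the quantized formats are approximations (block formats store extra scale data), the function names are hypothetical, and the layer and head counts in the example are assumed values for a Llama 3.1 8B-style model.

```python
# Approximate bits per weight. FP32/FP16 are exact; the quantized
# figures are assumptions and vary slightly by model and tool.
BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}

GB = 1e9  # decimal gigabytes, matching "1B parameters at FP16 ≈ 2 GB"


def weights_gb(params_billion: float, quant: str) -> float:
    """Memory needed for the model weights alone."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / GB


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_len / GB


def total_gb(params_billion: float, quant: str, n_layers: int,
             n_kv_heads: int, head_dim: int, context_len: int) -> float:
    """Weights plus KV cache; runtime buffers are not included."""
    return (weights_gb(params_billion, quant)
            + kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len))


if __name__ == "__main__":
    # Assumed architecture: 32 layers, 8 KV heads, head dimension 128.
    for ctx in (4096, 131072):
        total = total_gb(8, "Q4_K_M", 32, 8, 128, ctx)
        print(f"8B @ Q4_K_M, {ctx} tokens: {total:.1f} GB")
```

Under these assumptions, an 8B model at Q4_K_M with a 4K context comes to roughly 5.4 GB, while at 128K context the FP16 KV cache alone grows to about 17 GB. Real runtimes also need some extra memory for activations and buffers, which this sketch leaves out.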

Recommended Hardware (2026)

Entry (8–16 GB)
1–4B models at 4-bit: Llama 3.2 1B/3B, Phi-4 Mini, Gemma 3 4B
Mid-range (16–32 GB)
7–27B models at Q4: Llama 3.1 8B, Phi-4 14B, Gemma 3 12B/27B
High-end (48–64 GB)
32–70B models at Q4: Qwen 3 32B, Llama 3.3 70B (e.g. dual RTX 4090/5090)
Server (128 GB+)
70B+ models at FP16, or 200B+ models: A100/H100 GPUs, multi-node clusters
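
As a rough illustration, the tiers above can be turned into a lookup that maps a memory estimate to the smallest tier that fits. This is only a sketch: `suggest_tier` is a hypothetical helper, estimates that fall between two listed ranges are rounded up to the next tier, and no headroom for the operating system or other applications is reserved.

```python
# Upper bound in GB for each tier, taken from the list above.
TIERS = [
    ("Entry", 16),
    ("Mid-range", 32),
    ("High-end", 64),
]


def suggest_tier(total_gb: float) -> str:
    """Return the smallest hardware tier whose upper bound fits total_gb."""
    for name, max_gb in TIERS:
        if total_gb <= max_gb:
            return name
    # Anything above the High-end range falls to the Server tier.
    return "Server"


if __name__ == "__main__":
    for estimate in (5.4, 20, 40, 140):
        print(f"{estimate} GB -> {suggest_tier(estimate)}")
```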

Frequently Asked Questions