How It Works
llmrun checks whether your GPU or device is powerful enough to run a given LLM locally — and predicts how fast it will be. This page explains every number and grade you see on the site.
Parameters
A model's parameter count (measured in billions) is the number of learned weights it contains. Larger models tend to be more capable but require more memory and generate text more slowly.
Size → capability trade-off
↑ Smarter & more capable · ↓ Faster & less VRAM
Quantization
Quantization reduces the numerical precision of a model's weights, making the file smaller and inference faster at the cost of some quality. The format name tells you roughly how many bits each weight uses.
For example, a full-precision (FP16) 7 B model weighs ~13 GB. At Q4_K_M it shrinks to ~4 GB — small enough for an 8 GB GPU.
Quality retention vs file size (7 B model)
★ Best balance of quality and file size — the most popular choice for local inference.
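The size arithmetic behind those figures can be sketched in a few lines of Python. Two caveats: the ~4.85 bits-per-weight average for Q4_K_M is an approximation (K-quants mix bit widths), and real "7B" models typically have ~6.7 B parameters, which is why FP16 lands at ~13 GB rather than a nominal 14 GB.

```python
def model_file_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate on-disk size: parameters × bits-per-weight ÷ 8, in GB."""
    return params_billions * bits_per_weight / 8

print(model_file_size_gb(6.7, 16))              # FP16:   ~13.4 GB
print(round(model_file_size_gb(6.7, 4.85), 1))  # Q4_K_M: ~4.1 GB
```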
GGUF Format
GGUF is the standard file format for quantized models, used by llama.cpp, Ollama, LM Studio, and other local inference tools. It bundles the model weights, tokenizer vocabulary, and metadata into a single file that runs on CPU, GPU, or both.
Single file
A single .gguf file contains everything needed to run the model, with no separate config or tokenizer downloads.
Ready to run
Point llama.cpp, Ollama, or LM Studio at the file and start generating; no conversion step is required.
CPU + GPU splits
The same file supports offloading some layers to the GPU while keeping the rest on the CPU, so models larger than your VRAM can still run, just more slowly.
VRAM & Estimation
VRAM (Video RAM) is the dedicated memory on your GPU. To run a model entirely on GPU, the quantized weights need to fit in VRAM — if they don't, inference falls back to the CPU, which is dramatically slower.
llmrun estimates the VRAM footprint by breaking it into three components:
VRAM = Model Weights + KV Cache + ~0.3 GB
Model Weights = parameters × bitsPerWeight ÷ 8
KV Cache = 2 × kv_heads × head_dim × layers × context_tokens × 2 bytes
The ~0.3 GB accounts for framework overhead (llama.cpp, Ollama, etc.).
When we have the model's architecture details (number of KV heads, hidden size, layers), we compute the KV cache precisely. Models using Grouped-Query Attention (GQA) — like Llama 3, Qwen 2.5, and Mistral — have fewer KV heads than attention heads, significantly reducing their KV cache VRAM. When architecture data is unavailable, we fall back to a simple 10% overhead estimate.
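Putting the three components together, a minimal sketch of the estimator looks like this. The architecture numbers in the example are Llama-3-8B-style values (8 KV heads via GQA, 128 head dimension, 32 layers) used purely for illustration:

```python
def estimate_vram_gb(params_billions, bits_per_weight,
                     kv_heads, head_dim, layers,
                     context_tokens=2048, overhead_gb=0.3):
    # Model weights: parameters (billions) × bits-per-weight ÷ 8 → GB
    weights_gb = params_billions * bits_per_weight / 8
    # KV cache: K and V tensors per layer, 2 bytes (FP16) per value
    kv_cache_gb = 2 * kv_heads * head_dim * layers * context_tokens * 2 / 1e9
    return weights_gb + kv_cache_gb + overhead_gb

# 8 B model at Q4_K_M (~4.85 bits/weight) with a GQA architecture
print(round(estimate_vram_gb(8, 4.85, kv_heads=8, head_dim=128, layers=32), 2))  # ≈ 5.42 GB
```

Note how little the KV cache adds at a 2K context (~0.27 GB here); it becomes significant only at long contexts.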
Measured values win
When Ollama or llama.cpp provide a measured VRAM figure for a model, that measurement overrides our formula-based estimate.
Headroom matters
A model that barely fits leaves no room for the KV cache, which grows with conversation length; aim for 1–2 GB of VRAM beyond the weight size.
Tokens per Second
During text generation (the "decode" phase), the model reads its entire weight tensor once per output token. This makes inference memory-bandwidth bound — speed is dictated by how fast data can stream from VRAM, not by raw compute power.
tok/s = (bandwidth GB/s ÷ modelSize GB) × efficiency
The efficiency factor accounts for memory contention, software stack overhead, and batch-size-1 conditions. It varies by platform because each vendor's inference stack has different levels of optimisation.
Efficiency factor by platform
These factors are calibrated against community benchmarks from llama.cpp and Ollama. They may change as software stacks improve.
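The formula and per-platform factors can be sketched as follows (the efficiency values match the figures quoted in the FAQ on this page; the RTX 4090's ~1008 GB/s is its published memory bandwidth):

```python
EFFICIENCY = {"cuda": 0.65, "rocm": 0.55, "oneapi": 0.50, "metal": 0.70}

def estimated_tok_s(bandwidth_gb_s: float, model_size_gb: float, platform: str) -> float:
    """Theoretical decode speed: bandwidth ÷ model size, scaled by platform efficiency."""
    return bandwidth_gb_s / model_size_gb * EFFICIENCY[platform]

# RTX 4090 (~1008 GB/s) running a ~4.3 GB quantized 7 B model
print(round(estimated_tok_s(1008, 4.3, "cuda")))  # ≈ 152 tok/s
```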
How fast does it feel?
As a rough guide, under 5 tok/s feels sluggish, around 10 tok/s keeps pace with reading speed, and 30+ tok/s feels effectively instant.
Memory Bandwidth
Memory bandwidth (GB/s) is the throughput between a GPU's processor and its VRAM. Because LLM decode reads the full model on every token, bandwidth is the primary bottleneck for generation speed — more bandwidth means more tokens per second for the same model.
Bandwidth comparison (GB/s)
Higher bandwidth = faster tok/s at the same model size. This is why Apple Silicon Macs with unified high-bandwidth memory can outperform discrete GPUs that have more VRAM but lower bandwidth.
Explore bandwidth comparisons on any hardware listing page or VRAM tier page.
Dense vs Mixture of Experts (MoE)
Most LLMs are dense: every parameter is used on every token. A Mixture of Experts (MoE) model splits its parameters into groups called experts and activates only a subset per token. The result is quality close to that of the total parameter count at the speed of the much smaller active count, though the entire model must still fit in memory.
Dense Architecture
All parameters active on every token. VRAM = total params. Speed scales with total params.
Example: Llama 3 70B — 70 B total, 70 B active per token.
MoE Architecture
Only a subset of experts (for example, 2 of 8) is active per token. VRAM = total params. Speed ≈ active params only.
Example: Mixtral 8×7B — 46.7 B total, ~12.9 B active per token.
On llmrun, MoE models show both the total parameter count (for VRAM sizing) and the active count (for speed intuition).
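A rough sketch of why the distinction matters for speed, using Mixtral 8×7B's counts from above. The ~4.85 bits-per-weight figure is an approximate Q4_K_M average, and treating decode speed as bandwidth ÷ bytes-read-per-token is a simplification:

```python
def decode_tok_s(bandwidth_gb_s, bytes_read_gb, efficiency=0.65):
    # Decode speed tracks the bytes actually streamed per output token
    return bandwidth_gb_s / bytes_read_gb * efficiency

bits = 4.85                   # approximate Q4_K_M average bits per weight
total_gb  = 46.7 * bits / 8   # ~28.3 GB: what must fit in memory
active_gb = 12.9 * bits / 8   # ~7.8 GB: what is read per token

# On a ~1008 GB/s GPU: reading all params vs only the active ones
print(round(decode_tok_s(1008, total_gb)))   # ≈ 23 tok/s if it were dense
print(round(decode_tok_s(1008, active_gb)))  # ≈ 84 tok/s with MoE routing
```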
Context Length
Context length defines the maximum number of tokens (input + output) the model can handle in a single conversation. A "128K context" model can process roughly 100,000 words at once — enough for entire codebases or long documents.
The catch: longer contexts consume additional VRAM through the KV cache, which grows with every token in the conversation.
Default (2K–4K)
Extended (16K+)
Max context
Our compatibility grades assume a 2K-token context. If you need long-context inference, look for extra VRAM headroom or check the "+Context" VRAM on model pages.
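To see how quickly the cache grows, here is the KV cache formula from earlier evaluated at two context sizes for a Llama-3-8B-style GQA architecture (8 KV heads, 128 head dimension, 32 layers; FP16 cache assumed):

```python
def kv_cache_gb(kv_heads, head_dim, layers, context_tokens):
    # K and V per layer, 2 bytes (FP16) per value
    return 2 * kv_heads * head_dim * layers * context_tokens * 2 / 1e9

print(round(kv_cache_gb(8, 128, 32, 2_048), 2))    # 2K context:   ≈ 0.27 GB
print(round(kv_cache_gb(8, 128, 32, 131_072), 1))  # 128K context: ≈ 17.2 GB
```

The cache scales linearly with context, so a 64× longer context costs 64× the VRAM.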
Compatibility Grades
Every hardware–model combination on llmrun receives a letter grade from S to F that encodes VRAM headroom and estimated generation speed into a single at-a-glance signal.
The score is a weighted combination of VRAM fit (headroom as a percentage of model size) and speed (estimated tok/s normalised against a target of 60 tok/s). VRAM fit is the dominant factor — a model that won't physically fit always receives F regardless of bandwidth.
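The exact weights and letter cutoffs aren't published on this page, so the following is only an illustrative sketch of such a scoring scheme; the coefficients, the 50%-headroom ceiling, and the thresholds are assumptions, not llmrun's real values:

```python
def grade(headroom_pct: float, tok_s: float,
          speed_target: float = 60, w_fit: float = 0.6, w_speed: float = 0.4):
    if headroom_pct < 0:
        return "F"  # doesn't fit in VRAM: automatic F regardless of bandwidth
    fit = min(headroom_pct / 50, 1.0)       # assumed: 50% headroom earns full marks
    speed = min(tok_s / speed_target, 1.0)  # normalised against the 60 tok/s target
    score = w_fit * fit + w_speed * speed   # VRAM fit dominates via the larger weight
    for cutoff, letter in [(0.9, "S"), (0.75, "A"), (0.6, "B"), (0.45, "C"), (0.3, "D")]:
        if score >= cutoff:
            return letter
    return "E"  # fits, but very tight and slow (illustrative cutoffs)

print(grade(-5, 120))  # F: model doesn't fit
print(grade(60, 80))   # S: ample headroom, above-target speed
```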
Frequently Asked Questions
- How do you calculate VRAM requirements?
- We compute VRAM as Model Weights + KV Cache + Framework Overhead. Model weights = parameters × bits-per-weight ÷ 8. KV cache depends on the model's attention architecture (number of KV heads, head dimension, and layers). When architecture data is available, we compute the precise KV cache size; otherwise we fall back to a 10% overhead estimate. Measured values from Ollama or llama.cpp always take priority.
- How accurate are the tok/s estimates?
- They're based on the theoretical bandwidth formula: tok/s ≈ (bandwidth GB/s ÷ model size GB) × efficiency. The efficiency factor varies by platform: 65% for NVIDIA (CUDA), 55% for AMD (ROCm), 50% for Intel (oneAPI), and 70% for Apple Silicon (Metal). In practice, results typically land within ±20% of these figures.
- What does the S–F grade mean?
- It summarises speed and VRAM headroom into one letter. S means the model runs fast with plenty of room to spare. F means your hardware can't load the model at all.
- Why does VRAM headroom matter?
- A model that barely fits in VRAM leaves nothing for the KV cache, which grows with conversation length. Models with Grouped-Query Attention (GQA) are more memory-efficient for long contexts. We recommend at least 1–2 GB above the model weight size for comfortable use.
- What is the GGUF format?
- GGUF is the standard file format for quantized models used by llama.cpp, Ollama, and LM Studio. It packages weights, tokenizer, and metadata into a single ready-to-run file.
Data Sources
Hardware specs come from manufacturer datasheets and are cross-referenced with community benchmarks. Model VRAM figures are collected from:
- Ollama — model library and runtime measurements
- llama.cpp — community benchmarks and perplexity data
- Hugging Face — model cards and architecture metadata
Spot an error? Open an issue — community corrections are always welcome.