How It Works
llmrun checks whether your GPU or device is powerful enough to run a given LLM locally — and predicts how fast it will be. This page explains every number and grade you see on the site.
Parameters
A model's parameter count (measured in billions) is the number of learned weights it contains. Larger models tend to be more capable but require more memory and generate text more slowly.
Size → capability trade-off
↑ Smarter & more capable · ↓ Faster & less VRAM
Quantization
Quantization reduces the numerical precision of a model's weights, making the file smaller and inference faster at the cost of some quality. The format name tells you roughly how many bits each weight uses.
For example, a full-precision (FP16) 7 B model weighs ~13 GB. At Q4_K_M it shrinks to ~4 GB — small enough for an 8 GB GPU.
Quality retention vs file size (7 B model)
★ Best balance of quality and file size — the most popular choice for local inference.
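The size arithmetic behind those figures can be sketched in a few lines of Python. Two caveats: the ~4.85 bits-per-weight average for Q4_K_M is an approximation (K-quants mix bit widths), and real "7B" models typically have ~6.7 B parameters, which is why FP16 lands at ~13 GB rather than a nominal 14 GB.

```python
def model_file_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate on-disk size: parameters × bits-per-weight ÷ 8, in GB."""
    return params_billions * bits_per_weight / 8

print(model_file_size_gb(6.7, 16))              # FP16:   ~13.4 GB
print(round(model_file_size_gb(6.7, 4.85), 1))  # Q4_K_M: ~4.1 GB
```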
GGUF Format
GGUF is the standard file format for quantized models, used by llama.cpp, Ollama, LM Studio, and other local inference tools. It bundles the model weights, tokenizer vocabulary, and metadata into a single file that runs on CPU, GPU, or both.
Single file
A single .gguf file contains everything needed to run the model, with no separate config or tokenizer downloads.
Ready to run
Point llama.cpp, Ollama, or LM Studio at the file and start generating; no conversion step is required.
CPU + GPU splits
The same file supports offloading some layers to the GPU while keeping the rest on the CPU, so models larger than your VRAM can still run, just more slowly.
VRAM & Estimation
VRAM (Video RAM) is the dedicated memory on your GPU. To run a model entirely on GPU, the quantized weights need to fit in VRAM — if they don't, inference falls back to the CPU, which is dramatically slower.
llmrun estimates the VRAM footprint by breaking it into three components:
VRAM = Model Weights + KV Cache + ~0.3 GB
Model Weights = parameters × bitsPerWeight ÷ 8
KV Cache = 2 × kv_heads × head_dim × layers × context_tokens × 2 bytes
The ~0.3 GB accounts for framework overhead (llama.cpp, Ollama, etc.).
When we have the model's architecture details (number of KV heads, hidden size, layers), we compute the KV cache precisely. Models using Grouped-Query Attention (GQA) — like Llama 3, Qwen 2.5, and Mistral — have fewer KV heads than attention heads, significantly reducing their KV cache VRAM. When architecture data is unavailable, we fall back to a simple 10% overhead estimate.
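Putting the three components together, a minimal sketch of the estimator looks like this. The architecture numbers in the example are Llama-3-8B-style values (8 KV heads via GQA, 128 head dimension, 32 layers) used purely for illustration:

```python
def estimate_vram_gb(params_billions, bits_per_weight,
                     kv_heads, head_dim, layers,
                     context_tokens=2048, overhead_gb=0.3):
    # Model weights: parameters (billions) × bits-per-weight ÷ 8 → GB
    weights_gb = params_billions * bits_per_weight / 8
    # KV cache: K and V tensors per layer, 2 bytes (FP16) per value
    kv_cache_gb = 2 * kv_heads * head_dim * layers * context_tokens * 2 / 1e9
    return weights_gb + kv_cache_gb + overhead_gb

# 8 B model at Q4_K_M (~4.85 bits/weight) with a GQA architecture
print(round(estimate_vram_gb(8, 4.85, kv_heads=8, head_dim=128, layers=32), 2))  # ≈ 5.42 GB
```

Note how little the KV cache adds at a 2K context (~0.27 GB here); it becomes significant only at long contexts.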
Measured values win
When Ollama or llama.cpp provide a measured VRAM figure for a model, that measurement overrides our formula-based estimate.
Headroom matters
A model that barely fits leaves no room for the KV cache, which grows with conversation length; aim for 1–2 GB of VRAM beyond the weight size.
Tokens per Second
During text generation (the "decode" phase), the model reads its entire weight tensor once per output token. This makes inference memory-bandwidth bound — speed is dictated by how fast data can stream from VRAM, not by raw compute power.
tok/s = (bandwidth GB/s ÷ modelSize GB) × efficiency
The efficiency factor accounts for memory contention, software stack overhead, and batch-size-1 conditions. It varies by platform because each vendor's inference stack has different levels of optimisation.
Efficiency factor by platform
These factors are calibrated against community benchmarks from llama.cpp and Ollama. They may change as software stacks improve.
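The formula and per-platform factors can be sketched as follows (the efficiency values match the figures quoted in the FAQ on this page; the RTX 4090's ~1008 GB/s is its published memory bandwidth):

```python
EFFICIENCY = {"cuda": 0.65, "rocm": 0.55, "oneapi": 0.50, "metal": 0.70}

def estimated_tok_s(bandwidth_gb_s: float, model_size_gb: float, platform: str) -> float:
    """Theoretical decode speed: bandwidth ÷ model size, scaled by platform efficiency."""
    return bandwidth_gb_s / model_size_gb * EFFICIENCY[platform]

# RTX 4090 (~1008 GB/s) running a ~4.3 GB quantized 7 B model
print(round(estimated_tok_s(1008, 4.3, "cuda")))  # ≈ 152 tok/s
```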
How fast does it feel?
As a rough guide, under 5 tok/s feels sluggish, around 10 tok/s keeps pace with reading speed, and 30+ tok/s feels effectively instant.
Memory Bandwidth
Memory bandwidth (GB/s) is the throughput between a GPU's processor and its VRAM. Because LLM decode reads the full model on every token, bandwidth is the primary bottleneck for generation speed — more bandwidth means more tokens per second for the same model.
Bandwidth comparison (GB/s)
Higher bandwidth = faster tok/s at the same model size. This is why Apple Silicon Macs with unified high-bandwidth memory can outperform discrete GPUs that have more VRAM but lower bandwidth.
Explore bandwidth comparisons on any hardware listing page or VRAM tier page.
Dense vs Mixture of Experts (MoE)
Most LLMs are dense: every parameter is used on every token. A Mixture of Experts (MoE) model splits its parameters into groups called experts and activates only a subset per token. The result is quality close to that of the total parameter count at the speed of the much smaller active count, though the entire model must still fit in memory.
Dense Architecture
All parameters active on every token. VRAM = total params. Speed scales with total params.
Example: Llama 3 70B — 70 B total, 70 B active per token.
MoE Architecture
Only a subset of experts (for example, 2 of 8) is active per token. VRAM = total params. Speed ≈ active params only.
Example: Mixtral 8×7B — 46.7 B total, ~12.9 B active per token.
On llmrun, MoE models show both the total parameter count (for VRAM sizing) and the active count (for speed intuition).
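A rough sketch of why the distinction matters for speed, using Mixtral 8×7B's counts from above. The ~4.85 bits-per-weight figure is an approximate Q4_K_M average, and treating decode speed as bandwidth ÷ bytes-read-per-token is a simplification:

```python
def decode_tok_s(bandwidth_gb_s, bytes_read_gb, efficiency=0.65):
    # Decode speed tracks the bytes actually streamed per output token
    return bandwidth_gb_s / bytes_read_gb * efficiency

bits = 4.85                   # approximate Q4_K_M average bits per weight
total_gb  = 46.7 * bits / 8   # ~28.3 GB: what must fit in memory
active_gb = 12.9 * bits / 8   # ~7.8 GB: what is read per token

# On a ~1008 GB/s GPU: reading all params vs only the active ones
print(round(decode_tok_s(1008, total_gb)))   # ≈ 23 tok/s if it were dense
print(round(decode_tok_s(1008, active_gb)))  # ≈ 84 tok/s with MoE routing
```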
Context Length
Context length defines the maximum number of tokens (input + output) the model can handle in a single conversation. A "128K context" model can process roughly 100,000 words at once — enough for entire codebases or long documents.
The catch: longer contexts consume additional VRAM through the KV cache, which grows with every token in the conversation.
Default (2K–4K)
Extended (16K+)
Max context
Our compatibility grades assume a 2K-token context. If you need long-context inference, look for extra VRAM headroom or check the "+Context" VRAM on model pages.
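To see how quickly the cache grows, here is the KV cache formula from earlier evaluated at two context sizes for a Llama-3-8B-style GQA architecture (8 KV heads, 128 head dimension, 32 layers; FP16 cache assumed):

```python
def kv_cache_gb(kv_heads, head_dim, layers, context_tokens):
    # K and V per layer, 2 bytes (FP16) per value
    return 2 * kv_heads * head_dim * layers * context_tokens * 2 / 1e9

print(round(kv_cache_gb(8, 128, 32, 2_048), 2))    # 2K context:   ≈ 0.27 GB
print(round(kv_cache_gb(8, 128, 32, 131_072), 1))  # 128K context: ≈ 17.2 GB
```

The cache scales linearly with context, so a 64× longer context costs 64× the VRAM.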
Compatibility Grades
Every hardware–model combination on llmrun receives a letter grade from S to F that encodes VRAM headroom and estimated generation speed into a single at-a-glance signal.
The score is a weighted combination of VRAM fit (headroom as a percentage of model size) and speed (estimated tok/s normalised against a target of 60 tok/s). VRAM fit is the dominant factor — a model that won't physically fit always receives F regardless of bandwidth.
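The exact weights and letter cutoffs aren't published on this page, so the following is only an illustrative sketch of such a scoring scheme; the coefficients, the 50%-headroom ceiling, and the thresholds are assumptions, not llmrun's real values:

```python
def grade(headroom_pct: float, tok_s: float,
          speed_target: float = 60, w_fit: float = 0.6, w_speed: float = 0.4):
    if headroom_pct < 0:
        return "F"  # doesn't fit in VRAM: automatic F regardless of bandwidth
    fit = min(headroom_pct / 50, 1.0)       # assumed: 50% headroom earns full marks
    speed = min(tok_s / speed_target, 1.0)  # normalised against the 60 tok/s target
    score = w_fit * fit + w_speed * speed   # VRAM fit dominates via the larger weight
    for cutoff, letter in [(0.9, "S"), (0.75, "A"), (0.6, "B"), (0.45, "C"), (0.3, "D")]:
        if score >= cutoff:
            return letter
    return "E"  # fits, but very tight and slow (illustrative cutoffs)

print(grade(-5, 120))  # F: model doesn't fit
print(grade(60, 80))   # S: ample headroom, above-target speed
```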
Frequently Asked Questions
- How do you calculate VRAM requirements?
- We compute VRAM as Model Weights + KV Cache + Framework Overhead. Model weights = parameters × bits-per-weight ÷ 8. KV cache depends on the model's attention architecture (number of KV heads, head dimension, and layers). When architecture data is available, we compute the precise KV cache size; otherwise we fall back to a 10% overhead estimate. Measured values from Ollama or llama.cpp always take priority.
- How accurate are the tok/s estimates?
- They're based on the theoretical bandwidth formula: tok/s ≈ (bandwidth GB/s ÷ model size GB) × efficiency. The efficiency factor varies by platform: 65% for NVIDIA (CUDA), 55% for AMD (ROCm), 50% for Intel (oneAPI), and 70% for Apple Silicon (Metal). In practice, results typically land within ±20% of these figures.
- What does the S–F grade mean?
- It summarises speed and VRAM headroom into one letter. S means the model runs fast with plenty of room to spare. F means your hardware can't load the model at all.
- Why does VRAM headroom matter?
- A model that barely fits in VRAM leaves nothing for the KV cache, which grows with conversation length. Models with Grouped-Query Attention (GQA) are more memory-efficient for long contexts. We recommend at least 1–2 GB above the model weight size for comfortable use.
- What is the GGUF format?
- GGUF is the standard file format for quantized models used by llama.cpp, Ollama, and LM Studio. It packages weights, tokenizer, and metadata into a single ready-to-run file.
Data Sources
Hardware specs come from manufacturer datasheets and are cross-referenced with community benchmarks. Model VRAM figures are collected from:
- Ollama — model library and runtime measurements
- llama.cpp — community benchmarks and perplexity data
- Hugging Face — model cards and architecture metadata
Spot an error? Open an issue — community corrections are always welcome.