Best LLMs for 8 GB VRAM
Entry-level for LLMs (RTX 4060, RX 7600, Apple M-series base) — 7B models at Q4, small models at Q8
8 GB is an entry-level tier for local AI. You can run 7B-class models at 4-bit or lower quantization, which is great for experimenting but comes with quality and speed trade-offs.
With 8 GB, you're limited to smaller models and lower quantization levels, but it's still enough for a meaningful local AI experience. Phi 3 Mini (3.8B) and similar compact models run well at Q4_K_M. 7B–8B models like Mistral 7B and Llama 3 8B fit at Q4_K_M with a short context (the table below lists Q4_K_M entries at roughly 5 to 6 GB), but headroom is thin; dropping to Q3_K_M or Q2_K frees memory for longer contexts at the cost of output quality. Think of this tier as ideal for learning and experimentation rather than production workloads.
Runs Well
- 3B–4B models at Q4–Q5 quality
- 7B models at Q2–Q3 (usable but reduced quality)
- Quick experiments and learning
Challenging
- 7B models at Q4+ with long contexts (VRAM too tight)
- Models well above the 7B–9B class
- Long context windows even with small models
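The size tiers above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below illustrates that estimate. The bits-per-weight figures are approximate assumptions for common GGUF quantization types, not values from this page, and the helper name `weight_size_gb` is hypothetical. The result covers weights only; KV cache and runtime overhead come on top.

```python
# Approximate average bits per weight for common GGUF quantization types.
# These are rough, assumed figures; real files vary with model architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.35,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def weight_size_gb(params_billions: float, quant: str) -> float:
    """Estimate quantized weight size in GB (10^9 bytes), weights only."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits / 8


if __name__ == "__main__":
    for label, params in [("3.8B", 3.8), ("7B", 7.0)]:
        for quant in APPROX_BITS_PER_WEIGHT:
            size = weight_size_gb(params, quant)
            print(f"{label:5s} {quant:7s} ~{size:.1f} GB of weights")
```

Under these assumptions a 7B model at Q8_0 needs about 7.4 GB for weights alone, which leaves almost nothing of an 8 GB card for context.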
GPUs with ~8.0 GB VRAM
NVIDIA GeForce RTX 5060 Ti 8GB
NVIDIA · Blackwell
NVIDIA GeForce RTX 5060
NVIDIA · Blackwell
NVIDIA GeForce RTX 3060 8GB
NVIDIA · Ampere
NVIDIA GeForce RTX 3050 8GB
NVIDIA · Ampere
AMD Radeon RX 7600
AMD · RDNA 3
Intel Arc A750
Intel · Alchemist
Models That Fit in 8 GB VRAM
Speed estimated for NVIDIA GeForce RTX 3080
66 models · 6 excellent · 19 good
| Model | Quant | VRAM | Speed | Context | Status | Grade |
|---|---|---|---|---|---|---|
| (name missing) | Q4_K_M | 5.5 GB | 89.5 t/s | 41K | GREAT FIT | S85 |
| (name missing) | Q4_K_M | 6.1 GB | 81.0 t/s | 8K | GREAT FIT | S89 |
| (name missing) | Q4_K_M | 6.0 GB | 82.2 t/s | 33K | GREAT FIT | S90 |
| (name missing) | Q4_K_M | 5.3 GB | 92.9 t/s | 131K | GOOD FIT | A84 |
| (name missing) | Q4_K_M | 5.3 GB | 93.2 t/s | 131K | GOOD FIT | A83 |
| (name missing) | Q4_K_M | 5.8 GB | 85.9 t/s | 66K | GREAT FIT | S88 |
| (name missing) | Q4_K_M | 5.5 GB | 89.5 t/s | 131K | GREAT FIT | S85 |
| (name missing) | Q4_K_M | 5.4 GB | 91.7 t/s | 131K | GOOD FIT | A84 |
| (name missing) | Q4_K_M | 5.4 GB | 91.7 t/s | 131K | GOOD FIT | A84 |
| (name missing) | Q4_K_M | 5.0 GB | 99.0 t/s | 33K | GOOD FIT | A78 |
| (name missing) | Q4_K_M | 5.4 GB | 91.2 t/s | 16K | GOOD FIT | A84 |
| (name missing) | Q4_K_M | 5.8 GB | 85.2 t/s | 4K | GREAT FIT | S88 |
| (name missing) | Q4_K_M | 4.9 GB | 100.4 t/s | 33K | GOOD FIT | A78 |
| (name missing) | Q4_K_M | 5.0 GB | 99.0 t/s | 131K | GOOD FIT | A78 |
| (name missing) | Q4_K_M | 5.2 GB | 95.4 t/s | n/a | GOOD FIT | A82 |
| (name missing) | Q4_K_M | 5.1 GB | 96.9 t/s | 33K | GOOD FIT | A81 |
Frequently Asked Questions
- What models can I run with 8.0 GB VRAM?
With 8.0 GB VRAM, you can run 967 LLMs at various quantization levels. Popular models that fit well include Qwen3 8B, Gemma 2 9B IT, and Qwen1.5 7B. 88 models achieve excellent performance at this VRAM level. While limited, this tier is enough to get started with local AI and see what small models can do.
- Is 8.0 GB enough for local AI?
8.0 GB is a basic tier for local AI. 967 models are compatible, mostly smaller models and heavily quantized 7B models. It's limited but still useful for learning, experimentation, and lightweight chat tasks.
- What GPU should I get for 8.0 GB VRAM?
Popular GPUs with ~8.0 GB include the NVIDIA GeForce RTX 5060 Ti 8GB, NVIDIA GeForce RTX 5060, and NVIDIA GeForce RTX 3060 8GB. The NVIDIA GeForce RTX 3080 leads in memory bandwidth at 760.3 GB/s, which translates directly to faster token generation. When choosing a GPU for AI, memory bandwidth matters as much as VRAM capacity: generating each token requires reading the model's weights from VRAM, so bandwidth sets how fast the model can generate text. A GPU with the same VRAM but higher bandwidth will produce tokens faster, roughly in proportion to the bandwidth difference.
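The bandwidth argument can be sketched as a back-of-the-envelope estimate: if each generated token reads the whole model from VRAM once, decode speed is bounded by bandwidth divided by model size. The function name `estimate_tokens_per_second` and the `efficiency` factor of 0.65 are assumptions for illustration. That factor was picked because it roughly reproduces the speed figures in the table on this page; it is not a measured or documented constant.

```python
def estimate_tokens_per_second(bandwidth_gb_s: float,
                               model_gb: float,
                               efficiency: float = 0.65) -> float:
    """Rough decode-speed estimate for a memory-bound model.

    Assumes every generated token streams the full model (model_gb)
    from VRAM once, and that only a fraction (efficiency, an assumed
    value) of the peak bandwidth is actually achieved.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return efficiency * bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 5.5  # a Q4_K_M footprint taken from the table
    for bandwidth in (760.3, 608.3, 512.0, 448.0):
        speed = estimate_tokens_per_second(bandwidth, model_gb)
        print(f"{bandwidth:6.1f} GB/s -> ~{speed:.0f} tokens/s")
```

With 760.3 GB/s and a 5.5 GB model this gives about 90 tokens/s, close to the 89.5 t/s the table lists for that footprint. Real throughput also depends on the runtime, batch size, and context length.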
Higher memory bandwidth = faster token generation. All these GPUs have approximately 8 GB VRAM, but speed varies significantly by bandwidth.
Memory bandwidth comparison
760.3 GB/s, 608.3 GB/s, 512 GB/s, 448 GB/s, 448 GB/s
- How to choose the right model size for 8.0 GB?
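The key rule: your model must fit in VRAM including KV cache overhead. The KV cache stores attention keys and values for every token in the context, so it grows with context length. With 8.0 GB, here's a practical guide: 3B–4B models at Q4_K_M give the best experience. 7B–8B models fit at Q4_K_M with short contexts; Q2–Q3 leaves more room for context, but expect noticeable quality loss. Start with smaller models and see what quality level is acceptable for your use case.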
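To make the "fit in VRAM including KV cache" rule concrete, here is a minimal sketch of the usual KV cache estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The function names, the 0.5 GB runtime overhead, and the 4.9 GB weights figure are illustrative assumptions. The layer and head counts match the published Llama 3 8B configuration (32 layers, 8 KV heads, head dimension 128).

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB (10^9 bytes); bytes_per_value=2 means fp16."""
    # Factor 2 covers keys and values.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


def fits_in_vram(weights_gb: float, kv_gb: float,
                 vram_gb: float = 8.0, overhead_gb: float = 0.5) -> bool:
    """Check weights + KV cache + an assumed runtime overhead against VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


if __name__ == "__main__":
    weights_gb = 4.9  # assumed Q4_K_M weight size for an 8B-class model
    for context in (4096, 8192, 32768, 65536):
        kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                         context_tokens=context)
        verdict = "fits" if fits_in_vram(weights_gb, kv) else "does not fit"
        print(f"context {context:6d}: KV cache ~{kv:.2f} GB, {verdict} in 8 GB")
```

Under these assumptions the cache costs about 1.07 GB per 8,192 tokens, which is why long context windows are hard on this tier even when the weights fit comfortably.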
- Should I get 8.0 GB or 12.0 GB for AI?
Upgrading from 8.0 GB to 12.0 GB gives you significantly more flexibility. At 8.0 GB you can run 967 models; with 12.0 GB you'll unlock larger models and higher-quality quantizations. If budget allows, the extra VRAM is usually worth it for AI workloads, since you can't add VRAM to a GPU later.