
Llama 2 7B HF — Hardware Requirements & GPU Compatibility


Meta Llama 2 7B is a 6.7-billion-parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K-token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.

848.3K downloads · 2.3K likes · Apr 2024

Specifications

Publisher
Meta
Family
Llama 2
Parameters
6.7B
Release Date
2024-04-17
License
Llama 2 Community


How Much VRAM Does Llama 2 7B HF Need?


| Quantization | Bits/weight | VRAM |
|---|---|---|
| Q2_K | 3.40 | 3.1 GB |
| Q3_K_S | 3.50 | 3.2 GB |
| Q3_K_M | 3.90 | 3.6 GB |
| Q3_K_L | 4.10 | 3.8 GB |
| IQ4_XS | 4.30 | 4.0 GB |
| Q4_K_S | 4.50 | 4.2 GB |
| Q4_K_M | 4.80 | 4.5 GB |
| Q5_K_S | 5.50 | 5.1 GB |
| Q5_K_M | 5.70 | 5.3 GB |
| Q6_K | 6.60 | 6.1 GB |
| Q8_0 | 8.00 | 7.4 GB |

Which GPUs Can Run Llama 2 7B HF?

Q4_K_M · 4.5 GB

Llama 2 7B HF (Q4_K_M) requires 4.5 GB of VRAM to load the model weights. For comfortable inference with headroom for KV cache and system overhead, 6+ GB is recommended. 35 GPUs can run it, including the NVIDIA GeForce RTX 5090 and the RTX 3090 Ti.
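That fit check can be sketched as a few lines of Python. The function name and return labels are illustrative; the 4.5 GB and 6 GB thresholds come from the figures above:

```python
def q4_fit(vram_gb, weights_gb=4.5, comfortable_gb=6.0):
    """Classify whether a GPU's VRAM can hold Llama 2 7B HF at Q4_K_M.

    weights_gb and comfortable_gb follow the page's figures; the helper
    itself is an illustrative sketch, not part of any library.
    """
    if vram_gb < weights_gb:
        return "won't fit"
    return "comfortable" if vram_gb >= comfortable_gb else "tight"

print(q4_fit(24.0))  # a 24 GB card such as an RTX 3090 → "comfortable"
print(q4_fit(5.0))   # weights fit, but little KV-cache headroom → "tight"
print(q4_fit(4.0))   # below the 4.5 GB weight size → "won't fit"
```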

Which Devices Can Run Llama 2 7B HF?

Q4_K_M · 4.5 GB

33 devices with unified memory can run Llama 2 7B HF, including the NVIDIA DGX H100 and the DGX A100 640GB.


Frequently Asked Questions

How much VRAM does Llama 2 7B HF need?

Llama 2 7B HF requires 4.5 GB of VRAM at Q4_K_M, or 7.4 GB at Q8_0.

VRAM = Weights + KV Cache + Overhead

Weights = 6.7B × 4.8 bits ÷ 8 ≈ 4.0 GB

KV Cache + Overhead ≈ 0.5 GB (≈0.2 GB KV cache at 2K context + ≈0.3 GB framework overhead)

Total ≈ 4.0 GB + 0.5 GB = 4.5 GB

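The rule of thumb above can be written as a tiny calculator. This sketch assumes the 6.7B parameter count and the effective bits-per-weight values from the quantization table:

```python
def estimate_vram_gb(params_b, bits_per_weight, kv_plus_overhead_gb=0.5):
    """VRAM ≈ weights + KV cache + framework overhead (rule of thumb above).

    params_b: parameter count in billions.
    bits_per_weight: effective bits per weight for the chosen quantization.
    kv_plus_overhead_gb: the page's ~0.5 GB allowance at 2K context.
    """
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + kv_plus_overhead_gb

# Llama 2 7B at Q4_K_M: 6.7 × 4.8 ÷ 8 + 0.5 ≈ 4.5 GB
print(round(estimate_vram_gb(6.7, 4.8), 1))  # → 4.5
```

Longer contexts grow the KV-cache term, so treat 0.5 GB as a floor rather than a constant.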

What's the best quantization for Llama 2 7B HF?

For Llama 2 7B HF, Q4_K_M (4.5 GB) offers the best balance of quality and VRAM usage. Q5_K_S (5.1 GB) provides better quality if you have the VRAM. The smallest option is Q2_K at 3.1 GB.

VRAM requirement by quantization

| Quantization | VRAM |
|---|---|
| Q2_K | 3.1 GB |
| Q3_K_L | 3.8 GB |
| Q4_K_S | 4.2 GB |
| Q4_K_M ★ | 4.5 GB |
| Q5_K_M | 5.3 GB |
| Q8_0 | 7.4 GB |

★ Recommended — best balance of quality and VRAM usage.


Can I run Llama 2 7B HF on a Mac?

Yes. Llama 2 7B HF needs only 3.1 GB of unified memory at Q2_K and 4.5 GB at Q4_K_M, which fits comfortably on any Apple Silicon Mac with 8 GB or more of unified memory. Tools such as llama.cpp and Ollama run it with Metal GPU acceleration.

Can I run Llama 2 7B HF locally?

Yes — Llama 2 7B HF can run locally on consumer hardware. At Q4_K_M quantization it needs 4.5 GB of VRAM. Popular tools include Ollama, LM Studio, and llama.cpp.

How fast is Llama 2 7B HF?

At Q4_K_M, Llama 2 7B HF can reach ~655 tok/s on an AMD Instinct MI300X, or ~147 tok/s on an NVIDIA GeForce RTX 4090. Speed depends mainly on GPU memory bandwidth, since decoding is bound by how fast the weights can be streamed from memory.

tok/s ≈ (memory bandwidth in GB/s ÷ model size in GB) × efficiency

Example (AMD Instinct MI300X): 5300 ÷ 4.5 × 0.55 ≈ 655 tok/s

Estimated speed at Q4_K_M (4.5 GB): ~655 tok/s (AMD Instinct MI300X) · ~147 tok/s (NVIDIA GeForce RTX 4090)

Real-world results typically within ±20%. Speed depends on batch size, quantization kernel, and software stack.

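The estimate is easy to reproduce. This sketch assumes the 0.55 efficiency factor used above; note the exact product is ≈648 tok/s, which the page rounds to ~655, well inside its stated ±20% band:

```python
def estimate_tok_s(bandwidth_gbs, model_gb, efficiency=0.55):
    # tok/s ≈ (memory bandwidth ÷ model size) × efficiency:
    # each generated token requires streaming the full weights once,
    # so bandwidth divided by model size bounds the decode rate.
    return bandwidth_gbs / model_gb * efficiency

# AMD Instinct MI300X (5300 GB/s) on the 4.5 GB Q4_K_M weights
print(round(estimate_tok_s(5300, 4.5)))  # → 648
```

Plugging in a different GPU's memory bandwidth gives a first-order estimate before any benchmarking.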

What's the download size of Llama 2 7B HF?

At Q4_K_M, the download is about 4.04 GB. The highest-quality quantization listed, Q8_0, is 6.74 GB, and the smallest option (Q2_K) is 2.86 GB.
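These sizes follow directly from parameters × bits per weight. Using the exact 6.74B parameter count of Llama 2 7B reproduces the figures above:

```python
def download_size_gb(params_b, bits_per_weight):
    # Quantized file size ≈ parameters (billions) × effective bits per weight ÷ 8.
    # Bits-per-weight values are taken from the quantization table above.
    return params_b * bits_per_weight / 8

for name, bits in [("Q2_K", 3.40), ("Q4_K_M", 4.80), ("Q8_0", 8.00)]:
    print(f"{name}: {download_size_gb(6.74, bits):.2f} GB")
# prints 2.86 GB, 4.04 GB, and 6.74 GB respectively
```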