All LLM Models

Browse 40 large language models (LLMs) with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and the quantization level. Quantization reduces the precision of the model weights, trading a small loss in quality for significantly lower VRAM usage. For example, a 7B-parameter model needs about 14 GB for its weights at FP16 (2 bytes per parameter) but only about 4 GB at Q4_K_M quantization.
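The arithmetic behind that example can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not the method this page uses to compute its figures: the function name `estimate_weight_gb` is hypothetical, the bits-per-weight values for the quantized formats are approximations, and the result covers weights only (no KV cache, activations, or runtime overhead).

```python
# Approximate bits stored per weight for each format.
# FP16 is exact (16 bits). The quantized values are rough assumptions
# for illustration and are not taken from this page.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # assumed approximate value
    "Q4_K_M": 4.85,  # assumed approximate value
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Actual requirements are higher than this estimate, because the context window (KV cache) and the inference runtime need memory on top of the weights.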

Model List

Llama 3.1 Nemotron 51B Instruct

NVIDIA · 51B · runs from 112.2 GB

570 209

Llama 3.1 Nemotron 51B Instruct is a 51B-parameter open language model from NVIDIA in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Kimi K2.6 Eagle3

NVIDIA · 1.8B · runs from 1.1 GB

381 7

Kimi K2.6 Eagle3 is a 1.8B-parameter open language model from NVIDIA in the Kimi K2 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron H 47B Reasoning 128K

NVIDIA · 46.8B · runs from 102.9 GB

372 21

Nemotron H 47B Reasoning 128K is a 46.8B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Nemotron Terminal 14B

NVIDIA · 14.8B · runs from 6.9 GB

336 8

Nemotron Terminal 14B is a 14.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 Nemotron 235B A22B GenRM 2603

NVIDIA · 235.1B · runs from 100.4 GB

283 29

Qwen3 Nemotron 235B A22B GenRM 2603 is a 235.1B-parameter open language model from NVIDIA in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

NVIDIA Nemotron 3 Ultra 550B A55B GenRM

NVIDIA · 560.5B · runs from 262.1 GB

164 9

NVIDIA Nemotron 3 Ultra 550B A55B GenRM is a 560.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Flash 3B

NVIDIA · 2.7B · runs from 6.0 GB

157 17

Nemotron Flash 3B is a 2.7B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 29,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenCodeReasoning Nemotron 1.1 32B

NVIDIA · 32.8B · runs from 14.8 GB

136 48

OpenCodeReasoning Nemotron 1.1 32B is a 32.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code · Reasoning

Riva Translate 4B Instruct

NVIDIA · 4.2B · runs from 2.3 GB

131 18

Riva Translate 4B Instruct is a 4.2B-parameter open language model from NVIDIA. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.3 Nemotron 70B Reward

NVIDIA · 70.6B · runs from 31.0 GB

112 3

Llama 3.3 Nemotron 70B Reward is a 70.6B-parameter open language model from NVIDIA in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat