All LLM Models

Browse 17 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
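The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate under stated assumptions, not the method this site uses: the function name `estimate_weight_gb` is hypothetical, and the Q4_K_M value of about 4.8 bits per weight is an assumed average, since that format mixes block types. It covers model weights only; the KV cache and runtime overhead add to the total.

```python
# Approximate bits stored per weight for each format.
# FP16 is exact. The Q4_K_M figure is an ASSUMED average, not a value
# taken from this page; real files vary by model architecture.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.8,
}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Return the approximate size of the weights in GB (10^9 bytes).

    params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9,
    simplifies to params_billion * bits / 8.
    """
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8


if __name__ == "__main__":
    for quant in ("FP16", "Q4_K_M"):
        size = estimate_weight_gb(7, quant)
        print(f"7B at {quant}: ~{size:.1f} GB")
```

For a 7B model this gives roughly 14 GB at FP16 and a little over 4 GB at Q4_K_M, in line with the example above. Treat the result as a lower bound when choosing hardware, since longer context windows need additional memory.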

Model List

NVIDIA Nemotron Nano 9B v2 Japanese

NVIDIA · 8.9B · runs from 4.4 GB

281.4K 124

NVIDIA Nemotron Nano 9B v2 Japanese is a specialized variant of the Nemotron Nano 9B v2, fine-tuned for Japanese language understanding and generation. At 8.9 billion parameters, it maintains the same hardware-friendly footprint as the English version while delivering natural Japanese conversational ability. For users looking to run a Japanese-language assistant locally, this model offers a rare combination of compact size and dedicated language optimization from a major hardware vendor. It handles Japanese text with the fluency you'd expect from a purpose-built model rather than a multilingual afterthought.

Chat

Nemotron Mini 4B Instruct

NVIDIA · 4B · runs from 1.8 GB

473.9K 182

Nemotron Mini 4B Instruct is a 4B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

NVIDIA Nemotron Nano 9B v2

NVIDIA · 8.9B · runs from 4.5 GB

308.0K 482

NVIDIA Nemotron Nano 9B v2 is a compact yet capable chat model from NVIDIA, packing 8.9 billion parameters into a size that runs comfortably on a wide range of consumer GPUs. Built on NVIDIA's Nemotron architecture, it delivers strong instruction-following and conversational performance while keeping VRAM requirements modest. This second-generation Nano model reflects NVIDIA's push to make high-quality language models accessible on local hardware. It's an excellent starting point for users who want a responsive, general-purpose assistant without needing top-tier GPU memory.

Chat

NVIDIA Nemotron 3 Nano 4B BF16

NVIDIA · 4.0B · runs from 2.2 GB

342.4K 90

NVIDIA Nemotron 3 Nano 4B BF16 is a 4.0B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.1 Nemotron Nano 8B V1

NVIDIA · 8B · runs from 2.8 GB

308.6K 219

Llama 3.1 Nemotron Nano 8B is an 8-billion-parameter chat model by NVIDIA, a compact entry in the Nemotron family derived from Meta's Llama 3.1 architecture. It applies NVIDIA's alignment and fine-tuning techniques to deliver improved response quality over the base Llama 3.1 8B Instruct model at the same parameter count. The model runs on consumer GPUs with 8 GB or more of VRAM and supports a 128K-token context window. Its small footprint and NVIDIA-tuned quality make it a practical option for local inference on mainstream hardware.

Chat

Nemotron Cascade 8B

NVIDIA · 8B · runs from 4 GB

31.7K 65

Nemotron Cascade 8B is an 8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Nemotron Labs Diffusion 14B

NVIDIA · 13.5B · runs from 6.5 GB

7.1K 143

Nemotron Labs Diffusion 14B is a 13.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Orchestrator 8B

NVIDIA · 8.2B · runs from 4.1 GB

3.8K 580

Nemotron Orchestrator 8B is an 8.2B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenMath Nemotron 1.5B

NVIDIA · 1.5B · runs from 1.0 GB

3.5K 29

OpenMath Nemotron 1.5B is a 1.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math

Nemotron Research Reasoning Qwen 1.5B

NVIDIA · 1.8B · runs from 1.1 GB

2.6K 243

Nemotron Research Reasoning Qwen 1.5B is a 1.8B-parameter open language model from NVIDIA in the Qwen family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Nemotron Terminal 8B

NVIDIA · 8.2B · runs from 4.1 GB

2.3K 26

Nemotron Terminal 8B is an 8.2B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Content Safety Reasoning 4B

NVIDIA · 4.3B · runs from 2.5 GB

2.3K 19

Nemotron Content Safety Reasoning 4B is a 4.3B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Llama 3.1 Nemotron Safety Guard 8B v3

NVIDIA · 8.0B · runs from 4.0 GB

1.7K 13

Llama 3.1 Nemotron Safety Guard 8B v3 is an 8.0B-parameter open language model from NVIDIA in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Kimi K2.6 Eagle3

NVIDIA · 1.8B · runs from 1.1 GB

381 7

Kimi K2.6 Eagle3 is a 1.8B-parameter open language model from NVIDIA in the Kimi K2 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Terminal 14B

NVIDIA · 14.8B · runs from 6.9 GB

336 8

Nemotron Terminal 14B is a 14.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Flash 3B

NVIDIA · 2.7B · runs from 6.0 GB

157 17

Nemotron Flash 3B is a 2.7B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 29,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Riva Translate 4B Instruct

NVIDIA · 4.2B · runs from 2.3 GB

131 18

Riva Translate 4B Instruct is a 4.2B-parameter open language model from NVIDIA. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat