All LLM Models
Browse 30 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
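The arithmetic behind those two figures can be sketched in a few lines. This is a rough estimate for the weights only. The function name `weight_memory_gb` is made up for illustration, and the 4.5 bits per weight used for Q4_K_M is an assumed average, since the real figure varies by model and file.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

fp16_gb = weight_memory_gb(7, 16)    # 16 bits = 2 bytes per weight
q4_gb = weight_memory_gb(7, 4.5)     # assumed average bits per weight for Q4_K_M

print(f"FP16:   ~{fp16_gb:.1f} GB")
print(f"Q4_K_M: ~{q4_gb:.1f} GB")
```

Actual VRAM use is somewhat higher than the weight size, because the KV cache and runtime buffers also live in GPU memory.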
Model List
NVIDIA Nemotron 3 Nano 30B A3B BF16
NVIDIA · 31.6B · runs from 9.1 GB
NVIDIA Nemotron 3 Nano 30B A3B is a mixture-of-experts model with 31.6 billion total parameters but only around 3 billion active per token, so it pairs the capacity of a much larger model with per-token compute closer to that of a small one. This BF16 version keeps the weights at 16-bit precision for maximum output quality. The MoE architecture makes this model especially interesting for local deployment. You get reasoning and instruction-following capabilities that punch well above what a traditional 3B model can deliver, while inference stays fast because only a fraction of the network fires for each token.
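As a rough illustration of that trade-off, the sketch below separates what has to sit in memory (all parameters) from what is used per token (the active ones). The parameter counts come from the description above, with "around 3 billion" rounded to 3.0. The helper name is hypothetical.

```python
TOTAL_PARAMS_B = 31.6    # total parameters, from the listing
ACTIVE_PARAMS_B = 3.0    # "around 3 billion" active per token, rounded

def bf16_weight_gb(params_billion):
    """Weight size at BF16: 2 bytes per parameter, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 2

# Every expert must be loaded, so memory follows the total count.
resident_gb = bf16_weight_gb(TOTAL_PARAMS_B)

# Per-token compute follows the active count instead.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"BF16 weights to hold in memory: ~{resident_gb:.1f} GB")
print(f"Share of parameters used per token: ~{active_fraction:.1%}")
```

The much lower "runs from" figure in the listing presumably refers to a lower-bit quantization rather than to the BF16 weights.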
Nemotron 3 Nano Omni 30B A3B Reasoning BF16
NVIDIA · 33.0B · runs from 10.0 GB
Nemotron 3 Nano Omni 30B A3B Reasoning BF16 is a 33.0B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
NVIDIA Nemotron Nano 9B v2 Japanese
NVIDIA · 8.9B · runs from 4.4 GB
NVIDIA Nemotron Nano 9B v2 Japanese is a specialized variant of the Nemotron Nano 9B v2, fine-tuned for Japanese language understanding and generation. At 8.9 billion parameters, it maintains the same hardware-friendly footprint as the English version while delivering natural Japanese conversational ability. For users looking to run a Japanese-language assistant locally, this model offers a rare combination of compact size and dedicated language optimization from a major hardware vendor. It handles Japanese text with the fluency you'd expect from a purpose-built model rather than a multilingual afterthought.
Llama 3.1 Nemotron 70B Instruct HF
NVIDIA · 70.6B · runs from 20.4 GB
Llama 3.1 Nemotron 70B Instruct is a 70-billion-parameter chat model by NVIDIA, created by applying reinforcement learning from human feedback (RLHF) to Meta's Llama 3.1 70B base model. NVIDIA's Nemotron training pipeline focuses on improving helpfulness, accuracy, and response quality beyond the standard Llama instruction tuning. The model requires substantial VRAM for local inference, typically needing multi-GPU setups or high-end professional GPUs. In quantized formats it becomes accessible on workstation-class hardware. It is available in Hugging Face Transformers format and is supported by popular inference engines.
Nemotron Mini 4B Instruct
NVIDIA · 4B · runs from 1.8 GB
Nemotron Mini 4B Instruct is a 4B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
NVIDIA Nemotron Nano 9B v2
NVIDIA · 8.9B · runs from 4.5 GB
NVIDIA Nemotron Nano 9B v2 is a compact yet capable chat model from NVIDIA, packing 8.9 billion parameters into a size that runs comfortably on a wide range of consumer GPUs. Built on NVIDIA's Nemotron architecture, it delivers strong instruction-following and conversational performance while keeping VRAM requirements modest. This second-generation Nano model reflects NVIDIA's push to make high-quality language models accessible on local hardware. It's an excellent starting point for users who want a responsive, general-purpose assistant without needing top-tier GPU memory.
Llama 3.3 Nemotron Super 49B V1.5
NVIDIA · 49.9B · runs from 15.1 GB
Llama 3.3 Nemotron Super 49B is a 49.9-billion-parameter chat model by NVIDIA, built on a modified Llama 3.3 architecture. It occupies a unique size point between the common 70B and 8B tiers, offering strong reasoning and conversational ability while requiring less VRAM than full 70B models. NVIDIA's Nemotron Super training pipeline applies extensive alignment tuning to optimize helpfulness and factual accuracy. At the smallest listed quantization it runs from 15.1 GB, which fits on a 24 GB consumer GPU such as the RTX 4090; configurations that need 32 GB or more of VRAM call for professional workstation cards or multiple GPUs.
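A fit check of this kind can be sketched as a simple comparison between a memory requirement and each card's VRAM. The card selection, the function name `cards_that_fit`, and the 2 GB of headroom for the KV cache and runtime buffers are assumptions made for illustration, not figures from this page.

```python
# VRAM capacities in GB for a few example cards (illustrative selection).
GPU_VRAM_GB = {
    "RTX 4090": 24,
    "RTX 5090": 32,
    "RTX A6000": 48,
}

def cards_that_fit(required_gb, headroom_gb=2.0):
    """Return the cards whose VRAM covers the requirement plus assumed headroom."""
    return [name for name, vram in GPU_VRAM_GB.items()
            if vram >= required_gb + headroom_gb]

smallest_quant = cards_that_fit(15.1)  # the "runs from" figure in the listing
larger_config = cards_that_fit(32.0)   # the 32 GB figure mentioned above

print("15.1 GB fits on:", smallest_quant)
print("32.0 GB fits on:", larger_config)
```

With these assumptions, the 15.1 GB configuration fits on all three cards, while a 32 GB requirement only fits on the 48 GB card.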
NVIDIA Nemotron 3 Nano 4B BF16
NVIDIA · 4.0B · runs from 2.2 GB
NVIDIA Nemotron 3 Nano 4B BF16 is a 4.0B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Llama 3.1 Nemotron Nano 8B V1
NVIDIA · 8B · runs from 2.8 GB
Llama 3.1 Nemotron Nano 8B is an 8-billion-parameter chat model by NVIDIA, a compact entry in the Nemotron family derived from Meta's Llama 3.1 architecture. It applies NVIDIA's alignment and fine-tuning techniques to deliver improved response quality over the base Llama 3.1 8B Instruct model at the same parameter count. The model runs on consumer GPUs with 8 GB or more of VRAM and supports a 128K-token context window. Its small footprint and NVIDIA-tuned quality make it a practical option for local inference on mainstream hardware.
Nemotron Labs Diffusion 8B
NVIDIA · 8.5B · runs from 17.6 GB
Nemotron Labs Diffusion 8B is an 8.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Cascade 2 30B A3B
NVIDIA · 31.6B · runs from 13.8 GB
Nemotron Cascade 2 30B A3B is a 31.6B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Cascade 8B
NVIDIA · 8B · runs from 4 GB
Nemotron Cascade 8B is an 8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Labs Diffusion 3B
NVIDIA · 3.8B · runs from 8.1 GB
Nemotron Labs Diffusion 3B is a 3.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
NVIDIA Nemotron Nano 12B v2
NVIDIA · 12B · runs from 26.4 GB
NVIDIA Nemotron Nano 12B v2 is a 12B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Labs Diffusion 14B
NVIDIA · 13.5B · runs from 6.5 GB
Nemotron Labs Diffusion 14B is a 13.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Orchestrator 8B
NVIDIA · 8.2B · runs from 4.1 GB
Nemotron Orchestrator 8B is an 8.2B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
OpenMath Nemotron 1.5B
NVIDIA · 1.5B · runs from 1.0 GB
OpenMath Nemotron 1.5B is a 1.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Research Reasoning Qwen 1.5B
NVIDIA · 1.8B · runs from 1.1 GB
Nemotron Research Reasoning Qwen 1.5B is a 1.8B-parameter open language model from NVIDIA in the Qwen family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Terminal 8B
NVIDIA · 8.2B · runs from 4.1 GB
Nemotron Terminal 8B is an 8.2B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Content Safety Reasoning 4B
NVIDIA · 4.3B · runs from 2.5 GB
Nemotron Content Safety Reasoning 4B is a 4.3B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Llama 3.1 Nemotron Safety Guard 8B v3
NVIDIA · 8.0B · runs from 4.0 GB
Llama 3.1 Nemotron Safety Guard 8B v3 is an 8.0B-parameter open language model from NVIDIA in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Terminal 32B
NVIDIA · 32.8B · runs from 14.6 GB
Nemotron Terminal 32B is a 32.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
OpenReasoning Nemotron 32B
NVIDIA · 32.8B · runs from 14.8 GB
OpenReasoning Nemotron 32B is a 32.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron H 8B Reasoning 128K
NVIDIA · 8.1B · runs from 17.8 GB
Nemotron H 8B Reasoning 128K is an 8.1B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Kimi K2.6 Eagle3
NVIDIA · 1.8B · runs from 1.1 GB
Kimi K2.6 Eagle3 is a 1.8B-parameter open language model from NVIDIA in the Kimi K2 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Terminal 14B
NVIDIA · 14.8B · runs from 6.9 GB
Nemotron Terminal 14B is a 14.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Flash 3B
NVIDIA · 2.7B · runs from 6.0 GB
Nemotron Flash 3B is a 2.7B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 29,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
OpenCodeReasoning Nemotron 1.1 32B
NVIDIA · 32.8B · runs from 14.8 GB
OpenCodeReasoning Nemotron 1.1 32B is a 32.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Riva Translate 4B Instruct
NVIDIA · 4.2B · runs from 2.3 GB
Riva Translate 4B Instruct is a 4.2B-parameter open language model from NVIDIA. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Llama 3.3 Nemotron 70B Reward
NVIDIA · 70.6B · runs from 31.0 GB
Llama 3.3 Nemotron 70B Reward is a 70.6B-parameter open language model from NVIDIA in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.