All LLM Models
Browse 16 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
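As a rule of thumb, the weight footprint is just parameters times bits per weight divided by eight. The sketch below illustrates the arithmetic behind the 7B example; exact figures vary by format, and the KV cache and activations need headroom on top of the weights.

```python
# Back-of-the-envelope weight footprint: parameters x bits per weight / 8.
# Weights only -- the KV cache and activations need extra headroom.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes."""
    return params_billion * bits_per_weight / 8

print(f"7B @ FP16 (16-bit):     ~{weights_gb(7, 16):.1f} GB")   # ~14 GB
print(f"7B @ Q4_K_M (~4.8-bit): ~{weights_gb(7, 4.8):.1f} GB")  # ~4.2 GB
```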
Model List
Llama 3.1 Nemotron 70B Instruct HF
NVIDIA · 70B
Llama 3.1 Nemotron 70B Instruct is a 70-billion parameter chat model by NVIDIA, created by applying reinforcement learning from human feedback (RLHF) to Meta's Llama 3.1 70B base model. NVIDIA's Nemotron training pipeline focuses on improving helpfulness, accuracy, and response quality beyond the standard Llama instruction tuning. The model requires substantial VRAM for local inference, typically needing multi-GPU setups or high-end professional GPUs. In quantized formats it becomes accessible on workstation-class hardware. It is available in Hugging Face Transformers format and is supported by popular inference engines.
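As a hedged sketch, loading it through Hugging Face Transformers looks roughly like the following. Verify the repo id against the published checkpoint before use; device_map="auto" (via the accelerate library) shards the ~140 GB of BF16 weights across whatever GPUs are available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # check the exact repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~140 GB of weights at 2 bytes/param
    device_map="auto",           # shard layers across available GPUs
)

messages = [{"role": "user", "content": "Summarize RLHF in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```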
NVIDIA Nemotron 3 Nano 30B A3B BF16
NVIDIA · 31.6B
NVIDIA Nemotron 3 Nano 30B A3B is a mixture-of-experts model with 31.6 billion total parameters but only around 3 billion active per token, giving it the intelligence of a much larger model with the speed of a small one. This BF16 version preserves full precision for maximum output quality. The MoE architecture makes this model especially interesting for local deployment. You get reasoning and instruction-following capabilities that punch well above what a traditional 3B model can deliver, while inference stays fast because only a fraction of the network fires for each token.
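To make the "only ~3B active per token" point concrete, here is a toy top-k routed MoE layer in PyTorch. The dimensions and expert count are invented for illustration and are not Nemotron's real configuration; the point is simply that each token's forward pass only runs the experts the router selects.

```python
# Toy MoE layer: a router scores all experts, but only the top-k run per
# token, so per-token compute scales with active (not total) parameters.
import torch
import torch.nn as nn

d_model, n_experts, top_k = 64, 8, 2  # made-up sizes for illustration
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
router = nn.Linear(d_model, n_experts)

def moe_forward(x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
    weights, picked = router(x).softmax(-1).topk(top_k, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(top_k):          # only 2 of the 8 experts fire per token
        for e in range(n_experts):
            mask = picked[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot, None] * experts[e](x[mask])
    return out

print(moe_forward(torch.randn(4, d_model)).shape)  # torch.Size([4, 64])
```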
NVIDIA Nemotron Nano 9B v2
NVIDIA · 8.9B
NVIDIA Nemotron Nano 9B v2 is a compact yet capable chat model from NVIDIA, packing 8.9 billion parameters into a size that runs comfortably on a wide range of consumer GPUs. Built on NVIDIA's Nemotron architecture, it delivers strong instruction-following and conversational performance while keeping VRAM requirements modest. This second-generation Nano model reflects NVIDIA's push to make high-quality language models accessible on local hardware. It's an excellent starting point for users who want a responsive, general-purpose assistant without needing top-tier GPU memory.
NVIDIA Nemotron 3 Nano 30B A3B FP8
NVIDIA · 31.6B
NVIDIA Nemotron 3 Nano 30B A3B FP8 is the FP8-quantized version of NVIDIA's 31.6 billion parameter mixture-of-experts model. The 8-bit floating point format reduces memory requirements compared to BF16 while retaining strong output quality, making it a practical choice for GPUs with tighter VRAM budgets. With only about 3 billion parameters active per token, this model already runs efficiently. The FP8 quantization pushes the memory savings further without meaningful degradation, making it one of the best options for users who want MoE-class performance on mainstream hardware.
Llama 3.3 Nemotron Super 49B V1.5
NVIDIA · 49.9B
Llama 3.3 Nemotron Super 49B is a 49.9-billion parameter chat model by NVIDIA, built on a modified Llama 3.3 architecture. It fills the gap between the common 8B and 70B tiers, offering strong reasoning and conversational ability while requiring less VRAM than full 70B models. NVIDIA's Nemotron Super training pipeline applies extensive alignment tuning to optimize helpfulness and factual accuracy. The model typically needs 32GB or more of VRAM for local inference at reduced precision, placing it within reach of professional workstation cards or dual-GPU consumer setups; a single 24 GB RTX 4090 falls just short of that requirement.

Llama 3.1 Nemotron Nano 8B V1
NVIDIA · 8B
Llama 3.1 Nemotron Nano 8B is an 8-billion parameter chat model by NVIDIA, a compact entry in the Nemotron family derived from Meta's Llama 3.1 architecture. It applies NVIDIA's alignment and fine-tuning techniques to deliver improved response quality over the base Llama 3.1 8B Instruct model at the same parameter count. The model runs on consumer GPUs with 8GB or more of VRAM and supports a 128K token context window. Its small footprint and NVIDIA-tuned quality make it a practical option for local inference on mainstream hardware.
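One caveat worth quantifying: the 128K window is not free. The KV cache grows linearly with context, and with Llama 3.1 8B's grouped-query attention layout (32 layers, 8 KV heads, head dimension 128), a completely full FP16 cache alone is about 16 GiB, more than the weights of a quantized 8B model. A quick estimate:

```python
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes.
# Defaults below match Llama 3.1 8B's grouped-query attention configuration.

def kv_cache_gib(layers=32, kv_heads=8, head_dim=128,
                 tokens=131_072, bytes_per=2) -> float:
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per / 2**30

print(f"{kv_cache_gib():.1f} GiB")               # ~16.0 GiB at the full 128K window
print(f"{kv_cache_gib(tokens=8_192):.1f} GiB")   # ~1.0 GiB at 8K tokens
```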
NVIDIA Nemotron 3 Super 120B A12B NVFP4
NVIDIA · 67.2B
NVIDIA Nemotron 3 Super 120B A12B NVFP4 is a large-scale mixture-of-experts model compressed by NVIDIA's NVFP4 quantization to the memory footprint of a roughly 67.2-billion-parameter model. With 12 billion parameters active per token from a 123.6-billion total parameter pool, it delivers flagship-tier intelligence in a more accessible package. This is where the MoE architecture and aggressive quantization really shine together. A model that would normally require data center hardware becomes feasible on high-end consumer GPUs or multi-GPU setups. The NVFP4 format is purpose-built for NVIDIA silicon, keeping quality surprisingly close to the full-precision version.
NVIDIA Nemotron 3 Super 120B A12B FP8
NVIDIA · 123.6B
NVIDIA Nemotron 3 Super 120B A12B FP8 is the FP8 variant of NVIDIA's largest Nemotron 3 mixture-of-experts model, weighing in at 123.6 billion parameters. With 12 billion parameters active per token, it delivers exceptional reasoning and conversational depth while the FP8 format keeps memory usage lower than full precision. This model sits at the high end of what's achievable for local inference. You'll need serious GPU memory to run it, but the payoff is near-frontier model quality running entirely on your own hardware. The FP8 quantization offers a meaningful memory reduction over BF16 with minimal quality trade-off.
NVIDIA Nemotron Nano 9B v2 Japanese
NVIDIA · 8.9B
NVIDIA Nemotron Nano 9B v2 Japanese is a specialized variant of the Nemotron Nano 9B v2, fine-tuned for Japanese language understanding and generation. At 8.9 billion parameters, it maintains the same hardware-friendly footprint as the English version while delivering natural Japanese conversational ability. For users looking to run a Japanese-language assistant locally, this model offers a rare combination of compact size and dedicated language optimization from a major hardware vendor. It handles Japanese text with the fluency you'd expect from a purpose-built model rather than a multilingual afterthought.
NVIDIA Nemotron 3 Nano 30B A3B NVFP4
NVIDIA · 18.2B
NVIDIA Nemotron 3 Nano 30B A3B NVFP4 is the most aggressively quantized version of the Nemotron 3 Nano 30B, using NVIDIA's proprietary NVFP4 format to bring the effective footprint down to the memory usage of a roughly 18.2-billion-parameter model. This makes it accessible on GPUs that couldn't touch the BF16 or FP8 variants. NVFP4 is NVIDIA's custom 4-bit floating point quantization, optimized for their GPU architectures to minimize quality loss at extreme compression. If you're running a mid-range NVIDIA card and want MoE-level intelligence, this is the variant to try.
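The general idea behind block-scaled low-bit quantization can be sketched generically. The toy below quantizes weights in blocks of 16 with one scale per block; it illustrates the family of techniques NVFP4 belongs to, not NVIDIA's actual FP4 (E2M1) format with its per-block FP8 scales.

```python
# Toy block-scaled 4-bit-style quantizer: one scale per 16-weight block,
# values rounded to a small symmetric integer grid. Generic illustration
# only -- not NVIDIA's actual NVFP4 encoding.
import numpy as np

def quantize_blocks(w: np.ndarray, block: int = 16, levels: int = 7):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / levels  # one scale per block
    q = np.clip(np.round(w / scale), -levels, levels)      # int4-like grid
    return q, scale

def dequantize_blocks(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_blocks(w)
err = np.abs(w - dequantize_blocks(q, s)).mean()
print(f"mean abs reconstruction error: {err:.4f}")  # small vs weight magnitudes
```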
NVIDIA Nemotron 3 Nano 30B A3B Base BF16
NVIDIA · 31.6B
NVIDIA Nemotron 3 Nano 30B A3B Base BF16 is the foundation model version of the Nemotron 3 Nano 30B, offered in full BF16 precision. Unlike the chat-tuned variants, this base model hasn't been instruction-tuned, making it suitable for fine-tuning, research, or custom alignment workflows. At 31.6 billion total parameters with a mixture-of-experts architecture, the base model gives developers and researchers a strong starting point for building specialized applications. It retains all the architectural benefits of the MoE design while leaving the behavioral layer open for customization.
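As a sketch of what "open for customization" means in practice, here is a minimal LoRA fine-tuning setup using the peft library. The repo id and target module names are assumptions for illustration; check them against the actual checkpoint's configuration.

```python
# Minimal LoRA setup on a base (non-instruct) checkpoint -- a common first
# step in a custom alignment workflow. Repo id and module names assumed.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-3-Nano-30B-A3B-Base",  # assumed repo id; verify
    device_map="auto",
)
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # common Llama-style names; verify
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of 31.6B is trainable
```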
Qwen3.5 397B A17B NVFP4
NVIDIA · 397B
Qwen3.5 397B A17B NVFP4 is NVIDIA's NVFP4-quantized version of Alibaba's enormous Qwen3.5, a 397 billion parameter mixture-of-experts model with 17 billion active parameters per token. Even with aggressive quantization, this is one of the largest models you can attempt to run locally. This model represents the cutting edge of what's possible for local inference. The MoE architecture keeps per-token compute manageable despite the massive parameter count, and NVIDIA's NVFP4 quantization brings memory requirements down from utterly impossible to merely ambitious. Multi-GPU setups with substantial VRAM are essential, but the reward is near-frontier intelligence running entirely on your own machines.
Qwen3 Next 80B A3B Thinking NVFP4
NVIDIA · 80B
Qwen3 Next 80B A3B Thinking NVFP4 is NVIDIA's quantized version of Alibaba's Qwen3 Next 80B, a mixture-of-experts model with thinking capabilities and only 3 billion active parameters per token. The NVFP4 format significantly reduces memory requirements, bringing this 80B model within reach of high-end consumer hardware. The thinking mode enables explicit chain-of-thought reasoning, where the model works through problems step by step before delivering its answer. Combined with the MoE efficiency of activating just 3B parameters at a time, this model offers an unusual combination of deep reasoning and fast inference.
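In practice the reasoning trace arrives inline with the reply. The hedged sketch below follows Qwen3's convention of wrapping the chain of thought in <think>...</think> tags and splits it from the final answer; the repo id points at the base Qwen checkpoint, and the NVFP4 release name may differ.

```python
# Hedged sketch of consuming a thinking-mode model: the output contains a
# <think>...</think> reasoning block (Qwen3's convention) before the answer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Thinking"  # base checkpoint; NVFP4 id may differ
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Is 1009 prime? Reason it out."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
text = tokenizer.decode(model.generate(inputs, max_new_tokens=512)[0][inputs.shape[-1]:])

# Separate the chain-of-thought from the answer shown to the user.
reasoning, _, answer = text.partition("</think>")
print(answer.strip())
```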
Qwen3 30B A3B NVFP4
NVIDIA · 15.6B
Qwen3 30B A3B NVFP4 is NVIDIA's NVFP4-quantized version of Alibaba's Qwen3 30B mixture-of-experts model, compressed to the memory footprint of a roughly 15.6-billion-parameter model. With only 3 billion parameters active per token, it runs remarkably fast for a model of its intelligence class. This is one of the most efficient models available for local deployment. The combination of MoE architecture and NVFP4 quantization means you get 30B-class reasoning and instruction-following on hardware that would normally struggle with models half this size. It's an excellent choice for users who want strong performance without top-tier GPUs.
DeepSeek R1 0528 NVFP4 v2
NVIDIA · 393.6B
DeepSeek R1 0528 NVFP4 v2 is NVIDIA's optimized quantization of the massive DeepSeek R1 reasoning model (671 billion total parameters, of which about 37 billion are active per token), using the NVFP4 format to compress it to the memory footprint of a roughly 393.6-billion-parameter model and make this behemoth more practical for local deployment. DeepSeek R1 is renowned for its strong chain-of-thought reasoning, and this version preserves that capability at a fraction of the original memory cost. Running a model of this scale locally is no small feat even with aggressive quantization, but NVIDIA's NVFP4 format is specifically designed to squeeze maximum quality from minimal bits on their GPUs. For users with multi-GPU setups who want top-tier reasoning without cloud API dependencies, this is one of the most compelling options available.
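For the multi-GPU part, tensor parallelism is the usual approach: each layer's weights are split across the cards. A minimal vLLM sketch, with an assumed repo id and an eight-way split, looks like this:

```python
# vLLM tensor parallelism splits every layer across the GPUs. The repo id
# is an assumption for illustration; substitute the actual NVFP4 checkpoint
# and set tensor_parallel_size to your GPU count.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/DeepSeek-R1-0528-NVFP4-v2",  # assumed repo id; verify
    tensor_parallel_size=8,                    # shard across 8 GPUs
)
params = SamplingParams(max_tokens=1024, temperature=0.6)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], params)
print(outputs[0].outputs[0].text)
```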
Qwen3 32B NVFP4
NVIDIA · 17.2B
Qwen3 32B NVFP4 is NVIDIA's NVFP4-quantized version of Alibaba's dense Qwen3 32B model, reduced to the memory footprint of an approximately 17.2-billion-parameter model. Unlike the MoE variants, this is a traditional dense model where all parameters contribute to every token, often yielding more consistent output quality. Qwen3 32B has earned a strong reputation as one of the best models in its size class, and NVIDIA's NVFP4 quantization makes it accessible on a broader range of GPUs. If you prefer the predictability of a dense architecture over MoE's efficiency trade-offs, this is the variant to choose.