NVIDIA·Qwen·Qwen3ForCausalLM

Qwen3 32B NVFP4 — Hardware Requirements & GPU Compatibility

Chat

Qwen3 32B NVFP4 is NVIDIA's NVFP4-quantized version of Alibaba's dense Qwen3 32B model. The quantized checkpoint is listed at approximately 17.2 billion parameters, which is the figure used for the memory estimates on this page. Unlike the MoE variants, this is a traditional dense model: every parameter contributes to every token, which often yields more consistent output quality. Qwen3 32B has earned a strong reputation as one of the best models in its size class, and NVIDIA's NVFP4 quantization makes it accessible on a broader range of GPUs. If you prefer the predictability of a dense architecture over MoE's efficiency trade-offs, this is the variant to choose.

76.7K downloads · 13 likes · Sep 2025 · 41K context

Based on Qwen3 32B

Specifications

Publisher: NVIDIA
Family: Qwen
Parameters: 17.2B
Architecture: Qwen3ForCausalLM
Context Length: 40,960 tokens
Vocabulary Size: 151,936
Release Date: 2025-09-09
License: Apache 2.0

Get Started

HuggingFace

nvidia/Qwen3-32B-NVFP4

How Much VRAM Does Qwen3 32B NVFP4 Need?

Quantization	Bits per weight	VRAM	VRAM + full context	File Size	Description
Q4_K_M	4.80	10.9 GB	17.3 GB	10.30 GB	4-bit medium quantization — most popular sweet spot
Q5_0	5.00	11.4 GB	17.7 GB	10.72 GB	5-bit legacy quantization
Q5_K_M	5.70	12.9 GB	19.2 GB	12.23 GB	5-bit medium quantization — good quality/size tradeoff
Q6_K	6.60	14.8 GB	21.2 GB	14.16 GB	6-bit quantization, very good quality
Q8_0	8.00	17.8 GB	24.2 GB	17.16 GB	8-bit quantization, near-lossless
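As a rough illustration of how the table can be used, the sketch below picks the highest-bit quantization whose listed VRAM figure fits a given memory budget. The `QUANTS` list and the `best_quant` helper are hypothetical names introduced here; the numbers are copied from the table, and the selection rule (highest bits per weight that fits) is an assumption, not the site's own logic.

```python
# (name, bits per weight, VRAM in GB, VRAM with full 41K context in GB),
# copied from the table above.
QUANTS = [
    ("Q4_K_M", 4.80, 10.9, 17.3),
    ("Q5_0", 5.00, 11.4, 17.7),
    ("Q5_K_M", 5.70, 12.9, 19.2),
    ("Q6_K", 6.60, 14.8, 21.2),
    ("Q8_0", 8.00, 17.8, 24.2),
]


def best_quant(vram_gb, full_context=False):
    """Return the highest-bit quantization whose listed VRAM fits, or None."""
    column = 3 if full_context else 2
    fitting = [q for q in QUANTS if q[column] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]


print(best_quant(16))                     # budget of 16 GB, short context
print(best_quant(24, full_context=True))  # budget of 24 GB, full context
print(best_quant(8))                      # nothing in the table fits 8 GB
```

This ignores the extra headroom the page recommends for comfortable inference, so treat the result as an upper bound on what fits rather than a recommendation.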

Which GPUs Can Run Qwen3 32B NVFP4?

Q4_K_M · 10.9 GB

Qwen3 32B NVFP4 (Q4_K_M) requires 10.9 GB of VRAM to load the model weights. For comfortable inference with headroom for KV cache and system overhead, 15+ GB is recommended. Using the full 41K context window can add up to 6.4 GB, bringing total usage to 17.3 GB. 27 GPUs can run it, including NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 3090 Ti, NVIDIA GeForce RTX 5080.
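The three thresholds in the paragraph above (10.9 GB to load, 15+ GB recommended, 17.3 GB for the full context window) can be turned into a simple fit check. This is a minimal sketch: the `classify` function and its label strings are hypothetical and do not necessarily match the "Runs great" and "Decent" tiers used on this page.

```python
WEIGHTS_GB = 10.9       # Q4_K_M: minimum VRAM to load the model
COMFORTABLE_GB = 15.0   # recommended headroom for KV cache and overhead
FULL_CONTEXT_GB = 17.3  # total usage with the full 41K context window


def classify(vram_gb):
    """Map a GPU's VRAM to a rough fit label (labels are illustrative)."""
    if vram_gb < WEIGHTS_GB:
        return "does not fit"
    if vram_gb < COMFORTABLE_GB:
        return "fits, tight"
    if vram_gb < FULL_CONTEXT_GB:
        return "comfortable, reduced context"
    return "comfortable, full context"


for vram in (8, 12, 16, 24):
    print(vram, "GB:", classify(vram))
```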

Runs great

— Plenty of headroom

NVIDIA GeForce RTX 5090: ~107 tok/s
NVIDIA GeForce RTX 3090 Ti: ~60 tok/s
NVIDIA GeForce RTX 4090: ~60 tok/s
NVIDIA GeForce RTX 3090: ~56 tok/s
AMD Radeon RX 7900 XTX: ~48 tok/s
AMD Radeon RX 7900 XT: ~40 tok/s

Decent

— Enough VRAM, may be tight

Which Devices Can Run Qwen3 32B NVFP4?

Q4_K_M · 10.9 GB

27 devices with unified memory can run Qwen3 32B NVFP4, including NVIDIA DGX H100, NVIDIA DGX A100 640GB, Mac Mini M4 (16 GB).

Runs great

— Plenty of headroom

Decent

— Enough memory, may be tight

Mac Mini M4 (16 GB): ~7 tok/s
MacBook Air 13" M4 (16 GB): ~7 tok/s
MacBook Air 15" M4 (16 GB): ~7 tok/s
MacBook Pro 14" M4 (16 GB): ~7 tok/s
iPad Pro M4 13" (16 GB): ~7 tok/s
MacBook Air 13" M3 (16 GB): ~6 tok/s

Frequently Asked Questions

How much VRAM does Qwen3 32B NVFP4 need?: Qwen3 32B NVFP4 requires 10.9 GB of VRAM at Q4_K_M, or 17.8 GB at Q8_0. Full 41K context adds up to 6.4 GB (17.3 GB total).
VRAM = Weights + KV Cache + Overhead
Weights = 17.2B × 4.8 bits ÷ 8 = 10.3 GB
KV Cache + Overhead ≈ 0.6 GB (at 2K context + ~0.3 GB framework)
KV Cache + Overhead ≈ 7 GB (at full 41K context)
VRAM usage by quantization
Q4_K_M: 10.9 GB
Q4_K_M + full context: 17.3 GB
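The VRAM formula above can be checked with a few lines of arithmetic. This is a minimal sketch using the page's own inputs (17.2B parameters, 4.8 bits per weight, 0.6 GB and 7 GB for KV cache plus overhead); the function names are hypothetical, and GB is taken as 10^9 bytes.

```python
def weights_gb(params_billion, bits_per_weight):
    """Weights = parameters x bits per weight / 8, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def vram_gb(params_billion, bits_per_weight, kv_and_overhead_gb):
    """VRAM = Weights + KV Cache + Overhead."""
    return weights_gb(params_billion, bits_per_weight) + kv_and_overhead_gb


# Q4_K_M at 4.8 bits per weight
print(round(weights_gb(17.2, 4.8), 1))    # weights only
print(round(vram_gb(17.2, 4.8, 0.6), 1))  # 2K context + framework overhead
print(round(vram_gb(17.2, 4.8, 7.0), 1))  # full 41K context
```

The three results correspond to the 10.3 GB, 10.9 GB, and 17.3 GB figures quoted above.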
What's the best quantization for Qwen3 32B NVFP4?: For Qwen3 32B NVFP4, Q4_K_M (10.9 GB) offers the best balance of quality and VRAM usage. Q5_0 (11.4 GB) provides better quality if you have the VRAM.
VRAM requirement by quantization
Q4_K_M ★: 10.9 GB (~89%)
Q5_0: 11.4 GB (~90%)
Q5_K_M: 12.9 GB (~92%)
Q6_K: 14.8 GB (~95%)
Q8_0: 17.8 GB (~99%)
★ Recommended — best balance of quality and VRAM usage.
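Can I run Qwen3 32B NVFP4 on a Mac?: Yes, if the Mac has enough unified memory. Qwen3 32B NVFP4 requires at least 10.9 GB at Q4_K_M, which fits on Apple Silicon Macs with 16 GB of unified memory (estimated at ~6 to 7 tok/s) but not on 8 GB models. The full 41K context window brings total usage to 17.3 GB, so longer contexts and higher-bit quantizations call for a configuration with more memory.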
Can I run Qwen3 32B NVFP4 locally?: Yes — Qwen3 32B NVFP4 can run locally on consumer hardware. At Q4_K_M quantization it needs 10.9 GB of VRAM. Popular tools include Ollama, LM Studio, and llama.cpp.
How fast is Qwen3 32B NVFP4?: At Q4_K_M, Qwen3 32B NVFP4 is estimated to reach ~267 tok/s on AMD Instinct MI300X and ~60 tok/s on NVIDIA GeForce RTX 4090. Speed depends mainly on GPU memory bandwidth. Real-world results are typically within ±20% of these estimates.
tok/s = (bandwidth GB/s ÷ model GB) × efficiency
Example: AMD Instinct MI300X → 5300 ÷ 10.9 × 0.55 = ~267 tok/s
Estimated speed at Q4_K_M (10.9 GB)
AMD Instinct MI300X: ~267 tok/s
NVIDIA GeForce RTX 4090: ~60 tok/s
NVIDIA H100 SXM: ~199 tok/s
AMD Instinct MI250X: ~165 tok/s
Real-world results typically within ±20%. Speed depends on batch size, quantization kernel, and software stack.
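The throughput formula above is easy to reproduce. The sketch below recomputes the AMD Instinct MI300X example and the ±20% band around it. The function names are hypothetical, and the 0.55 efficiency factor is the only one this page states; the other listed figures may rely on different per-device factors, so treat `efficiency` as an assumption to supply for each GPU.

```python
def estimate_tok_s(bandwidth_gb_s, model_gb, efficiency):
    """tok/s = (bandwidth GB/s / model GB) x efficiency."""
    return bandwidth_gb_s / model_gb * efficiency


def real_world_band(tok_s, tolerance=0.20):
    """Range implied by 'typically within +/-20%' of the estimate."""
    return tok_s * (1 - tolerance), tok_s * (1 + tolerance)


# MI300X example from the page: 5300 GB/s, 10.9 GB model, 0.55 efficiency
mi300x = estimate_tok_s(5300, 10.9, 0.55)
low, high = real_world_band(mi300x)
print(round(mi300x))            # point estimate in tok/s
print(round(low), round(high))  # lower and upper bound in tok/s
```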
What's the download size of Qwen3 32B NVFP4?: At Q4_K_M, the download is about 10.30 GB. The 8-bit Q8_0 version, the largest listed here, is 17.16 GB.
Which GPUs can run Qwen3 32B NVFP4?: 27 consumer GPUs can run Qwen3 32B NVFP4 at Q4_K_M (10.9 GB). Top options include AMD Radeon RX 7900 XT, AMD Radeon RX 7900 XTX, NVIDIA GeForce RTX 3090, AMD Radeon RX 6700 XT. 6 GPUs have plenty of headroom for comfortable inference.
Which devices can run Qwen3 32B NVFP4?: 27 devices with unified memory can run Qwen3 32B NVFP4 at Q4_K_M (10.9 GB), including Mac Mini M4 (16 GB), Mac Mini M4 (32 GB), Mac Mini M4 Pro (24 GB), Mac Mini M4 Pro (48 GB). Apple Silicon Macs use unified memory shared between CPU and GPU, making them well-suited for local LLM inference.