All LLM Models
Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
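The arithmetic behind those numbers can be sketched in a few lines. This is a back-of-the-envelope weights-only estimate; the ~4.8 bits-per-weight figure for Q4_K_M and the optional overhead factor for KV cache and activations are assumptions, and real usage varies with context length and runtime.

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.0):
    """Rough VRAM estimate for model weights.

    overhead > 1.0 adds headroom for KV cache and activations
    (an assumption; actual overhead depends on context length).
    """
    # 1B parameters at 8 bits/weight occupy ~1 GB.
    return params_billion * bits_per_weight / 8 * overhead

# 7B model: FP16 (16 bits) vs Q4_K_M (~4.8 bits/weight, assumed)
fp16_gb = estimate_vram_gb(7, 16)    # ~14 GB of weights
q4km_gb = estimate_vram_gb(7, 4.8)   # ~4.2 GB of weights
print(f"FP16:   ~{fp16_gb:.1f} GB")
print(f"Q4_K_M: ~{q4km_gb:.1f} GB")
```

The same formula explains why halving the bit width roughly halves the VRAM requirement, independent of model size.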
Model List
SmolLM2 1.7B Instruct GGUF
Bartowski · 1.7B
DeepSeek R1 0528 NVFP4 v2
NVIDIA · 393.6B
DeepSeek R1 0528 NVFP4 v2 is NVIDIA's optimized quantization of the massive 393.6 billion parameter DeepSeek R1 reasoning model, using the NVFP4 format to make this behemoth more practical for local deployment. DeepSeek R1 is renowned for its strong chain-of-thought reasoning, and this version preserves that capability at a fraction of the original memory cost. Running a 393B parameter model locally is no small feat even with aggressive quantization, but NVIDIA's NVFP4 format is specifically designed to squeeze maximum quality from minimal bits on their GPUs. For users with multi-GPU setups who want top-tier reasoning without cloud API dependencies, this is one of the most compelling options available.
Gemma 3 12B IT GGUF
MaziyarPanahi · 12B
Qwen3.5 122B A10B NVFP4
txn545 · 64.4B
An NVFP4-quantized version of Alibaba's Qwen3.5 122B A10B, repackaged by txn545. This large mixture-of-experts model has 122 billion total parameters with roughly 10 billion activated per token; the quantized release is listed here at an effective 64.4B, reflecting its reduced memory footprint. The NVFP4 (NVIDIA FP4) format is designed specifically for NVIDIA GPUs with FP4 support, offering aggressive compression while leveraging hardware-level acceleration. Qwen3.5 represents a significant generational upgrade in Alibaba's model lineup, and the 122B A10B variant delivers strong reasoning, coding, and multilingual performance. Despite the model's large total parameter count, the sparse activation pattern combined with FP4 quantization makes it feasible to run on high-end consumer GPUs, though multi-GPU setups will provide the best experience for longer context lengths.
Qwen3 Coder 30B A3B Instruct MLX 8bit
LM Studio Community · 30.5B
An MLX 8-bit quantized version of Alibaba's Qwen3 Coder 30B A3B Instruct, converted by LM Studio Community for Apple Silicon Macs. Compared to the 5-bit variant, this 8-bit quantization retains more of the original model's precision and output quality, at the cost of higher memory usage. This mixture-of-experts coding model with 30.5 billion total parameters is well suited for developers on Apple Silicon who want near-full-quality code generation and are willing to dedicate more memory to achieve it. Users with 48GB or more of unified memory will get the best experience, though it may fit on 32GB machines with some context length limitations.
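The 5-bit vs 8-bit trade-off described above can be quantified with a weights-only footprint estimate. This counts weight storage only; KV cache, activations, and macOS system use come on top of it in unified memory, which is why the comfortable-fit thresholds sit above these raw figures.

```python
def mlx_weight_footprint_gb(params_billion, bits):
    """Weights-only unified-memory footprint; runtime overhead
    (KV cache, activations) is not modeled here."""
    return params_billion * bits / 8

for bits in (5, 8):
    gb = mlx_weight_footprint_gb(30.5, bits)
    print(f"{bits}-bit: ~{gb:.1f} GB of unified memory for weights")
```

At 30.5B parameters this works out to roughly 19 GB at 5-bit versus roughly 30.5 GB at 8-bit, consistent with the 32GB-vs-48GB guidance for the two variants.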
Qwen3 32B NVFP4
NVIDIA · 17.2B
Qwen3 32B NVFP4 is NVIDIA's NVFP4-quantized version of Alibaba's dense Qwen3 32B model, listed here at an effective 17.2B to reflect its reduced memory footprint. Unlike the MoE variants, this is a traditional dense model where all parameters contribute to every token, often yielding more consistent output quality. Qwen3 32B has earned a strong reputation as one of the best models in its size class, and NVIDIA's NVFP4 quantization makes it accessible on a broader range of GPUs. If you prefer the predictability of a dense architecture over MoE's efficiency trade-offs, this is the variant to choose.
Qmd Query Expansion 1.7B GGUF
tobil · 1.7B
Qwen2.5 1.5B Instruct GGUF
MaziyarPanahi · 1.5B
Mistral Small 24B Instruct 2501 GGUF
MaziyarPanahi · 24B
Gemma 3 27B IT GGUF
MaziyarPanahi · 27B
Qwen3 1.7B GGUF
MaziyarPanahi · 1.7B
A GGUF-quantized version of Alibaba's Qwen3 1.7B, repackaged by MaziyarPanahi. At 1.7 billion parameters, this lightweight model can run on virtually any modern hardware and offers solid general-purpose text generation for its size class. Qwen3 brings meaningful improvements in reasoning and instruction following over its predecessors. The GGUF format makes it easy to load in popular inference tools like llama.cpp and Ollama, with multiple quantization levels typically available to let you choose your preferred balance of quality and speed. A good option for users who want a fast, responsive small model for simple tasks and experimentation.
Qwen3 Coder 30B A3B Instruct MLX 5bit
LM Studio Community · 30.5B
An MLX 5-bit quantized version of Alibaba's Qwen3 Coder 30B A3B Instruct, converted by LM Studio Community for Apple Silicon Macs. This mixture-of-experts model has 30.5 billion total parameters but activates only a fraction per token, giving it strong code generation performance with better efficiency than a comparably sized dense model. The 5-bit quantization provides a middle ground between quality and memory usage, making it suitable for M-series Macs with 32GB or more of unified memory. It handles code completion, generation, refactoring, and explanation tasks well across a wide range of programming languages.
Qwen3 30B A3B Instruct 2507 GGUF
MaziyarPanahi · 30B
Qwen2.5 Coder 14B Instruct MLX 4bit
LM Studio Community · 14B
An MLX 4-bit quantized version of Alibaba's Qwen2.5 Coder 14B Instruct, converted by LM Studio Community for Apple Silicon Macs. Qwen2.5 Coder 14B is a capable mid-size coding model that handles code generation, completion, and explanation across many popular programming languages. The 4-bit quantization makes this model very accessible on Apple Silicon, fitting comfortably on Macs with 16GB or more of unified memory. It offers a strong balance of coding ability and resource efficiency, making it a practical everyday coding assistant for developers running local models on macOS.
Qwen3 4B Instruct 2507 GGUF
MaziyarPanahi · 4B
A GGUF-quantized version of Alibaba's Qwen3 4B Instruct 2507 (July 2025 release), repackaged by MaziyarPanahi. At 4 billion parameters, this model hits a sweet spot for users who want noticeably better output quality than sub-2B models while still running efficiently on modest hardware, including many integrated GPUs and older discrete cards. The 2507 revision reflects updated training and tuning from Alibaba, and the GGUF format ensures broad compatibility with llama.cpp, Ollama, LM Studio, and other popular local inference tools. A well-rounded small model for chat, writing assistance, and light reasoning tasks.