All LLM Models
Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
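The 7B example above is simple arithmetic: parameter count times bytes per weight. Here is a minimal sketch of that estimate. The function name and the bits-per-weight table are illustrative assumptions: FP16 is exactly 16 bits, while the Q8_0 and Q4_K_M values are assumed averages, not figures taken from this page. It covers weights only and ignores the KV cache and runtime buffers.

```python
# Approximate bits stored per weight. FP16 is exact; the Q8_0 and Q4_K_M
# values are assumed averages used for illustration only.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.5}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Estimate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weight_memory_gb(7, quant):.1f} GB")
```

Under these assumptions the 7B model comes out at 14 GB for FP16 and just under 4 GB for Q4_K_M, in line with the figures quoted above. Real usage is higher once context length and runtime overhead are added.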
Model List
QwQ 32B
Alibaba · 32.8B · runs from 9.8 GB
QwQ 32B is a 32-billion-parameter reasoning-focused model from Alibaba Cloud's Qwen family. Unlike standard chat models, QwQ is specifically optimized for step-by-step logical reasoning, complex problem solving, and mathematical tasks. It employs extended chain-of-thought processing, generating detailed internal reasoning before producing final answers, which significantly improves accuracy on challenging analytical problems. A GPU with 24 GB of VRAM comfortably handles quantized inference, and the model delivers reasoning performance competitive with much larger models. It is particularly well suited for users who need strong analytical capabilities for math, science, coding logic, and multi-step problem solving. Released under the Apache 2.0 license.
Qwen3 Next 80B A3B Thinking
Alibaba · 81.3B · runs from 22.8 GB
Qwen3 Next 80B A3B Thinking is an 81.3B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek V3 0324
DeepSeek · 684.5B · runs from 192.1 GB
DeepSeek V3 0324 is DeepSeek's flagship general-purpose chat model, featuring a 684.5 billion parameter mixture-of-experts architecture with roughly 37 billion parameters active per token. It delivers strong performance across a wide range of tasks including conversation, writing, analysis, coding, and instruction following, competing with the best closed-source models available. Like other large MoE models, V3 needs enough memory to hold all expert weights, even though only a fraction of them are active for any given token. Quantized versions make it feasible on multi-GPU setups, and its combination of broad capability with open weights has made it one of the most widely deployed open models for local and self-hosted use.
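The split between total and active parameters can be made concrete with a short sketch. The `MoEModel` class and its method names are hypothetical, introduced only for illustration. The parameter counts are the ones quoted above, and the 16 bits per weight is the FP16 case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    """Hypothetical container for the two parameter counts of an MoE model."""

    name: str
    total_params_b: float   # all experts, in billions (drives memory)
    active_params_b: float  # used per token, in billions (drives compute)

    def resident_weights_gb(self, bits_per_weight: float) -> float:
        # Every expert must stay loaded, so memory scales with the total.
        return self.total_params_b * bits_per_weight / 8

    def active_fraction(self) -> float:
        # Share of the weights that take part in any single token.
        return self.active_params_b / self.total_params_b


v3 = MoEModel("DeepSeek V3 0324", total_params_b=684.5, active_params_b=37.0)
print(f"{v3.name}: {v3.active_fraction():.1%} of weights active per token")
print(f"FP16 weights to keep loaded: ~{v3.resident_weights_gb(16):.0f} GB")
```

The memory estimate depends only on the total count, which is why an MoE model with a small active share still needs multi-GPU hardware.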
GLM 5.1
zai-org · 753.9B · runs from 211.5 GB
GLM 5.1 is a 753.9B-parameter open language model from zai-org in the GLM 5 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Meta Llama 3.1 405B Instruct
Meta · 405.9B · runs from 189.7 GB
Meta Llama 3.1 405B Instruct is a 405.9B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 E4B IT Qat Q4 0 Unquantized
Google · 7.9B · runs from 3.9 GB
Gemma 4 E4B IT Qat Q4 0 Unquantized is a 7.9B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 31B IT Uncensored Heretic
llmfan46 · 31.3B · runs from 14.9 GB
Gemma 4 31B IT Uncensored Heretic is a 31.3B-parameter open language model from llmfan46 in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek R1 Distill Qwen 14B
DeepSeek · 14.8B · runs from 5.1 GB
DeepSeek R1 Distill Qwen 14B sits in a sweet spot between the smaller 7B distill and the more demanding 32B version, offering strong reasoning performance at 14.8 billion parameters on the Qwen 2.5 architecture. It captures a meaningful share of the full R1's chain-of-thought capabilities while keeping resource requirements within the range of mainstream consumer GPUs. Quantized to 4-bit, it fits comfortably on GPUs with 12 GB of VRAM, delivering reliable step-by-step reasoning for math, logic, and analytical problems.
Gemma 3 270M IT
Google · 268M · runs from 0.1 GB
Gemma 3 270M IT is a 270-million-parameter instruction-tuned model from Google's Gemma 3 family, an experimental release pushing the boundaries of how small an effective chat model can be. The model runs on virtually any hardware, including entry-level GPUs and CPU-only setups, making it useful for experimentation, education, and exploring the limits of small-scale language modeling. Released under the Gemma license.
LFM2.5 8B A1B
LiquidAI · 8.5B · runs from 2.7 GB
LFM2.5 8B A1B is an 8.5B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 E2B IT Qat Q4 0 Unquantized
Google · 5.1B · runs from 2.5 GB
Gemma 4 E2B IT Qat Q4 0 Unquantized is a 5.1B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
SmolLM2 135M Instruct
Hugging Face · 135M · runs from 0.4 GB
SmolLM2 135M Instruct is the instruction-tuned variant of Hugging Face's 135-million-parameter SmolLM2 model. Fine-tuned to follow user prompts and engage in basic conversational exchanges, it delivers surprisingly coherent responses given its minimal size, making it ideal for testing chat interfaces or running on extremely constrained devices. This model is a practical choice when you need an instruction-following model that fits comfortably in under 1 GB of memory. It works well for simple question answering, text reformatting, and lightweight assistant tasks where response quality can be traded for instant inference speed.
Qwen2.5 Coder 3B Instruct
Alibaba · 3.1B · runs from 1.4 GB
Qwen2.5 Coder 3B Instruct is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek R1 Distill Qwen 1.5B
DeepSeek · 1.8B · runs from 0.8 GB
DeepSeek R1 Distill Qwen 1.5B is the smallest model in the R1 distillation family, packing chain-of-thought reasoning capabilities into a nominal 1.5 billion parameters (listed here at 1.8B) using the Qwen 2.5 architecture. It represents an ambitious attempt to bring structured reasoning to the smallest practical model size. At this scale, the model can run on virtually any modern GPU and even on CPU-only setups with acceptable speed. While its reasoning depth is naturally limited compared to its larger siblings, it still demonstrates structured thinking patterns that set it apart from generic models of similar size.
GLM 4.6V Flash
zai-org · 10.3B · runs from 3.2 GB
GLM 4.6V Flash is a 10.3B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3 4B Thinking 2507
Alibaba · 4.0B · runs from 1.6 GB
Qwen3 4B Thinking 2507 is the reasoning-optimized variant of Alibaba's compact 4-billion-parameter Qwen3 model, released in the July 2025 update cycle. Despite its small size, this thinking variant is tuned to produce chain-of-thought reasoning and step-by-step problem solving, making it a surprisingly capable lightweight reasoner. This model is ideal for users who want basic reasoning and analytical capabilities on very modest hardware. It can run on most consumer GPUs and even some CPU-only setups when quantized, providing an accessible entry point for experimenting with reasoning-style models without any significant hardware investment.
Gemma 4 E4B IT Ultra Uncensored Heretic
llmfan46 · 8.0B · runs from 3.9 GB
Gemma 4 E4B IT Ultra Uncensored Heretic is an 8.0B-parameter open language model from llmfan46 in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3.6 27B Uncensored Heretic v2 Native MTP Preserved
llmfan46 · 27.4B · runs from 12.4 GB
Qwen3.6 27B Uncensored Heretic v2 Native MTP Preserved is a 27.4B-parameter open language model from llmfan46 in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Phi 4 Mini Reasoning
Microsoft · 3.8B · runs from 1.6 GB
Phi 4 Mini Reasoning is a 3.8B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
MiniMax M2.5
MiniMaxAI · 228.7B · runs from 63.5 GB
MiniMax M2.5 is a large-scale mixture-of-experts model from MiniMax, a well-funded Chinese AI company. With roughly 229 billion total parameters and a MoE architecture that activates only a fraction per token, it aims to deliver performance competitive with much larger dense models while keeping inference costs manageable. Running it locally requires substantial hardware due to its large parameter footprint, but quantized versions can make it accessible to users with multi-GPU setups looking for a powerful multilingual model with strong Chinese and English capabilities.
DeepSeek R1 Distill Qwen 32B
DeepSeek · 32.8B · runs from 9.8 GB
DeepSeek R1 Distill Qwen 32B takes the reasoning capabilities developed in the full 684.5B R1 model and distills them into the 32.8 billion parameter Qwen 2.5 architecture. The result is a dense model that punches well above its weight class on math, science, and coding reasoning tasks, often matching models two to three times its size. Quantized to 4-bit precision, it fits comfortably on a single high-end consumer GPU, making it one of the most capable reasoning models you can run on a desktop workstation.
DeepSeek R1
DeepSeek · 684.5B · runs from 192.1 GB
DeepSeek R1 is a groundbreaking reasoning model that develops its chain-of-thought capabilities primarily through large-scale reinforcement learning rather than supervised fine-tuning alone. With 684.5 billion total parameters in a mixture-of-experts architecture (only 37 billion active per token), R1 achieves performance competitive with OpenAI's o1 on math, coding, and complex reasoning benchmarks while remaining fully open-weight. Running the full R1 locally is a serious undertaking: at 2 bytes per parameter (FP16), the weights alone come to roughly 1.4 TB, though quantized versions starting at 192.1 GB bring it within reach of multi-GPU setups. For users who want R1-level reasoning on more modest hardware, DeepSeek also released a family of distilled models that pack R1's reasoning patterns into smaller dense architectures.
DeepSeek R1 Distill Qwen 7B
DeepSeek · 7.6B · runs from 3.0 GB
DeepSeek R1 Distill Qwen 7B compresses the reasoning techniques from DeepSeek's full R1 model into a compact 7.6 billion parameter dense model built on the Qwen 2.5 architecture. Despite its small footprint, it demonstrates surprisingly capable step-by-step reasoning on math and logic problems that would stump many models several times its size. This is one of the most accessible reasoning models available for local use, fitting comfortably on GPUs with 6 GB or more of VRAM when quantized. It strikes a practical balance between genuine chain-of-thought reasoning ability and the hardware constraints of a typical consumer setup.
Qwen3.6 27B AEON Ultimate Uncensored BF16
AEON-7 · 27.4B · runs from 12.4 GB
Qwen3.6 27B AEON Ultimate Uncensored BF16 is a 27.4B-parameter open language model from AEON-7 in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek R1 Distill Llama 8B
DeepSeek · 8.0B · runs from 2.8 GB
DeepSeek R1 Distill Llama 8B brings R1's reinforcement-learned reasoning capabilities to the widely supported Llama 3.1 8B architecture. By distilling the full 684.5B R1 model's reasoning patterns into this 8 billion parameter dense model, DeepSeek created a version that benefits from the extensive Llama ecosystem of tools, quantizations, and inference engines. For users who prefer the Llama architecture or already have tooling built around it, this model offers a plug-and-play path to chain-of-thought reasoning. Its hardware requirements are very approachable, running well on consumer GPUs with 8 GB or more of VRAM at common quantization levels.
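The GPU sizes quoted for the R1 distills above (6 GB for the 7B, 12 GB for the 14B, 8 GB for the 8B) can be sanity-checked with a rough fit test. This is a sketch under stated assumptions, not the method this site uses: the function name, the 4.5 bits per weight for a 4-bit quantization, and the 1.5 GB of headroom for KV cache and runtime buffers are all illustrative values.

```python
def fits_on_gpu(
    params_billions: float,
    vram_gb: float,
    bits_per_weight: float = 4.5,  # assumed average for a 4-bit quantization
    headroom_gb: float = 1.5,      # assumed allowance for KV cache and buffers
) -> bool:
    """Return True if quantized weights plus headroom fit in the given VRAM."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + headroom_gb <= vram_gb


# (model, parameters in billions, GPU VRAM in GB) as quoted in the listing
pairings = [
    ("DeepSeek R1 Distill Qwen 7B", 7.6, 6),
    ("DeepSeek R1 Distill Llama 8B", 8.0, 8),
    ("DeepSeek R1 Distill Qwen 14B", 14.8, 12),
]

for name, params_b, vram in pairings:
    verdict = "fits" if fits_on_gpu(params_b, vram) else "does not fit"
    print(f"{name} on a {vram} GB GPU: {verdict}")
```

With these assumed values all three pairings pass, while the 14B model on an 8 GB card would not. Long contexts need more headroom than the default used here.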
Qwen3.6 35B A3B Claude 4.6 Opus Reasoning Distilled
hesamation · 36.0B · runs from 15.7 GB
Qwen3.6 35B A3B Claude 4.6 Opus Reasoning Distilled is a 36.0B-parameter open language model from hesamation in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
SmolLM2 1.7B Instruct
Hugging Face · 1.7B · runs from 1.4 GB
SmolLM2 1.7B Instruct is the largest instruction-tuned model in the SmolLM2 family, offering the best balance of capability and efficiency Hugging Face achieved with this generation. At 1.7 billion parameters it produces substantially more coherent and useful responses than its smaller siblings, handling multi-turn conversations, summarization, and simple reasoning tasks with competence. With VRAM requirements well under 4 GB at standard precision, this model runs effortlessly on entry-level GPUs, older laptops, and even some mobile devices. It is an excellent choice for developers building lightweight local assistants or chatbots who want genuine conversational quality without the hardware demands of larger models.
GLM 5
zai-org · 753.9B · runs from 211.5 GB
GLM 5 is Zhipu AI's flagship foundation model, a massive mixture-of-experts architecture with nearly 754 billion total parameters. It is one of the largest open-weight models available, offering state-of-the-art performance across reasoning, coding, math, and multilingual tasks in both Chinese and English. Running GLM 5 locally requires enterprise-grade multi-GPU infrastructure, but for users with access to such hardware, it provides a locally hosted alternative to the largest proprietary models.
Qwen2.5 Coder 1.5B Instruct
Alibaba · 1.5B · runs from 1.0 GB
Qwen2.5 Coder 1.5B Instruct is a 1.5B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Mistral Small 3.2 24B Instruct 2506
Mistral AI · 24.0B · runs from 7.3 GB
Mistral Small 3.2 24B Instruct 2506 is a 24.0B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.