All LLM Models

Browse 719 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
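As a rough sketch of the arithmetic behind those figures, weight memory is approximately the parameter count times the bits stored per weight, divided by 8. The helper `weight_gb` is hypothetical, and the bits-per-weight values are assumptions for illustration (16 for FP16, about 4.5 as an effective average for Q4_K_M). The estimate covers weights only and ignores the KV cache, activations, and runtime overhead.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Assumed bits per weight; the Q4_K_M value is an illustrative average,
# not an exact property of the format.
FP16_BITS = 16.0
Q4_K_M_BITS = 4.5

fp16 = weight_gb(7, FP16_BITS)
q4 = weight_gb(7, Q4_K_M_BITS)
print(f"7B at FP16:   ~{fp16:.1f} GB")
print(f"7B at Q4_K_M: ~{q4:.1f} GB")
```

Under these assumptions the two results land close to the ~14 GB and ~4 GB quoted above; real usage will be somewhat higher once context and overhead are included.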

Model List

Gemma 4 E2B IT Qat Q4 0 Unquantized

Google · 5.1B · runs from 2.5 GB

4.9K 17

Gemma 4 E2B IT Qat Q4 0 Unquantized is a 5.1B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM2 135M Instruct

Hugging Face · 135M · runs from 0.4 GB

1.7M 346

SmolLM2 135M Instruct is the instruction-tuned variant of Hugging Face's 135-million-parameter SmolLM2 model. Fine-tuned to follow user prompts and engage in basic conversational exchanges, it delivers surprisingly coherent responses given its minimal size, making it ideal for testing chat interfaces or running on extremely constrained devices. This model is a practical choice when you need an instruction-following model that fits comfortably in under 1 GB of memory. It works well for simple question answering, text reformatting, and lightweight assistant tasks where response quality can be traded for instant inference speed.

Chat

Qwen2.5 Coder 3B Instruct

Alibaba · 3.1B · runs from 1.4 GB

229.1K 111

Qwen2.5 Coder 3B Instruct is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

DeepSeek R1 Distill Qwen 1.5B

DeepSeek · 1.8B · runs from 0.8 GB

681.8K 1.5K

DeepSeek R1 Distill Qwen 1.5B is the smallest model in the R1 distillation family, packing chain-of-thought reasoning capabilities into just 1.5 billion parameters using the Qwen 2.5 architecture. It represents an ambitious attempt to bring structured reasoning to the smallest practical model size. At this scale, the model can run on virtually any modern GPU and even on CPU-only setups with acceptable speed. While its reasoning depth is naturally limited compared to its larger siblings, it still demonstrates structured thinking patterns that set it apart from generic models of similar size.

Chat · Reasoning

GLM 4.6V Flash

zai-org · 10.3B · runs from 3.2 GB

62.2K 609

GLM 4.6V Flash is a 10.3B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Qwen3 4B Thinking 2507

Alibaba · 4.0B · runs from 1.6 GB

527.0K 598

Qwen3 4B Thinking 2507 is the reasoning-optimized variant of Alibaba's compact 4-billion-parameter Qwen3 model, released in the July 2025 update cycle. Despite its small size, this thinking variant is tuned to produce chain-of-thought reasoning and step-by-step problem solving, making it a surprisingly capable lightweight reasoner. This model is ideal for users who want basic reasoning and analytical capabilities on very modest hardware. It can run on most consumer GPUs and even some CPU-only setups when quantized, providing an accessible entry point for experimenting with reasoning-style models without any significant hardware investment.

Chat

Gemma 4 E4B IT Ultra Uncensored Heretic

llmfan46 · 8.0B · runs from 3.9 GB

3.4K 22

Gemma 4 E4B IT Ultra Uncensored Heretic is an 8.0B-parameter open language model from llmfan46 in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3.6 27B Uncensored Heretic v2 Native MTP Preserved

llmfan46 · 27.4B · runs from 12.4 GB

38.1K 30

Qwen3.6 27B Uncensored Heretic v2 Native MTP Preserved is a 27.4B-parameter open language model from llmfan46 in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Phi 4 Mini Reasoning

Microsoft · 3.8B · runs from 1.6 GB

53.8K 234

Phi 4 Mini Reasoning is a 3.8B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math · Code · Reasoning

DeepSeek R1 Distill Qwen 32B

DeepSeek · 32.8B · runs from 9.8 GB

490.3K 1.6K

DeepSeek R1 Distill Qwen 32B takes the reasoning capabilities developed in the full 684.5B R1 model and distills them into the 32.8 billion parameter Qwen 2.5 architecture. The result is a dense model that punches well above its weight class on math, science, and coding reasoning tasks, often matching models two to three times its size. At around 32.8 billion parameters, this model fits comfortably on a single high-end consumer GPU when quantized to 4-bit precision, making it one of the most capable reasoning models you can run on a desktop workstation.

Chat · Reasoning

DeepSeek R1 Distill Qwen 7B

DeepSeek · 7.6B · runs from 3.0 GB

478.1K 842

DeepSeek R1 Distill Qwen 7B compresses the reasoning techniques from DeepSeek's full R1 model into a compact 7.6 billion parameter dense model built on the Qwen 2.5 architecture. Despite its small footprint, it demonstrates surprisingly capable step-by-step reasoning on math and logic problems that would stump many models several times its size. This is one of the most accessible reasoning models available for local use, fitting comfortably on GPUs with 6 GB or more of VRAM when quantized. It strikes a practical balance between genuine chain-of-thought reasoning ability and the hardware constraints of a typical consumer setup.

Chat · Reasoning

Qwen3.6 27B AEON Ultimate Uncensored BF16

AEON-7 · 27.4B · runs from 12.4 GB

29.0K 104

Qwen3.6 27B AEON Ultimate Uncensored BF16 is a 27.4B-parameter open language model from AEON-7 in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DeepSeek R1 Distill Llama 8B

DeepSeek · 8.0B · runs from 2.8 GB

439.0K 864

DeepSeek R1 Distill Llama 8B brings R1's reinforcement-learned reasoning capabilities to the widely supported Llama 3.1 8B architecture. By distilling the full 684.5B R1 model's reasoning patterns into this 8 billion parameter dense model, DeepSeek created a version that benefits from the extensive Llama ecosystem of tools, quantizations, and inference engines. For users who prefer the Llama architecture or already have tooling built around it, this model offers a plug-and-play path to chain-of-thought reasoning. Its hardware requirements are very approachable, running well on consumer GPUs with 8 GB or more of VRAM at common quantization levels.

Chat · Reasoning

Qwen3.6 35B A3B Claude 4.6 Opus Reasoning Distilled

hesamation · 36.0B · runs from 15.7 GB

4.2K 86

Qwen3.6 35B A3B Claude 4.6 Opus Reasoning Distilled is a 36.0B-parameter open language model from hesamation in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision · Reasoning · Chat

SmolLM2 1.7B Instruct

Hugging Face · 1.7B · runs from 1.4 GB

188.0K 733

SmolLM2 1.7B Instruct is the largest instruction-tuned model in the SmolLM2 family, offering the best balance of capability and efficiency Hugging Face achieved with this generation. At 1.7 billion parameters it produces substantially more coherent and useful responses than its smaller siblings, handling multi-turn conversations, summarization, and simple reasoning tasks with competence. With VRAM requirements well under 4 GB at standard precision, this model runs effortlessly on entry-level GPUs, older laptops, and even some mobile devices. It is an excellent choice for developers building lightweight local assistants or chatbots who want genuine conversational quality without the hardware demands of larger models.

Chat

Qwen2.5 Coder 1.5B Instruct

Alibaba · 1.5B · runs from 1.0 GB

748.8K 126

Qwen2.5 Coder 1.5B Instruct is a 1.5B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Mistral Small 3.2 24B Instruct 2506

Mistral AI · 24.0B · runs from 7.3 GB

588.9K 593

Mistral Small 3.2 24B Instruct 2506 is a 24.0B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3.6 35B A3B Uncensored Heretic Native MTP Preserved

llmfan46 · 35.1B · runs from 15.3 GB

23.0K 26

Qwen3.6 35B A3B Uncensored Heretic Native MTP Preserved is a 35.1B-parameter open language model from llmfan46 in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

MN 12B Mag Mell R1

inflatebot · 12.2B · runs from 4.1 GB

40.0K 239

MN 12B Mag Mell R1 is a 12.2B-parameter open language model from inflatebot. It supports a context window of up to 1,024,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

LFM2.5 1.2B Instruct

LiquidAI · 1.2B · runs from 0.8 GB

147.3K 604

LFM2.5 1.2B Instruct is an instruction-tuned model from Liquid AI that uses a novel hybrid architecture combining state-space models with attention mechanisms. At just 1.2 billion parameters, it is exceptionally lightweight and can run on virtually any hardware, including laptops and edge devices. Liquid AI's unconventional architecture aims to deliver better efficiency and longer context handling than traditional transformer models at this scale, making it an interesting option for users exploring alternatives to standard transformer-based LLMs.

Chat

Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop

DavidAU · 12.1B · runs from 5.6 GB

1.1K 24

Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop is a 12.1B-parameter open language model from DavidAU in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision · Roleplay

DeepSeek R1 Distill Llama 70B

DeepSeek · 70.6B · runs from 20.4 GB

119.2K 778

DeepSeek R1 Distill Llama 70B is the largest model in the R1 distillation lineup, combining the reasoning capabilities developed in the full 684.5B R1 with the robust Llama 3.1 70B architecture. At 70 billion parameters, it delivers the strongest reasoning performance of any dense R1 distill, approaching the full R1's quality on many math and coding benchmarks. Running this model locally requires a multi-GPU setup or a single GPU with very high VRAM capacity, though quantized versions can fit on hardware with 48 GB or more. For users who need top-tier open-weight reasoning and have the hardware to support a 70B dense model, this is one of the strongest options available.

Chat · Reasoning
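To see roughly why a 48 GB budget is enough for a quantized 70.6B model, the weight-size estimate can be inverted to ask how many bits per weight fit in a given amount of memory. This is a minimal sketch: `max_bits_per_weight` is a hypothetical helper, it counts weights only, and it treats 1 GB as 1e9 bytes, so the practical ceiling is lower once the KV cache and runtime overhead are reserved.

```python
def max_bits_per_weight(params_billion: float, vram_gb: float) -> float:
    """Largest average bits per weight whose weights alone fit in vram_gb."""
    # (vram_gb * 1e9 bytes * 8 bits) / (params_billion * 1e9 weights)
    return vram_gb * 8 / params_billion

budget = max_bits_per_weight(70.6, 48)
fits_fp16 = 16 <= budget  # would unquantized 16-bit weights fit?

print(f"70.6B in 48 GB: at most ~{budget:.2f} bits per weight")
print("16-bit weights fit:", fits_fp16)
```

The result is a little over 5 bits per weight, which is consistent with 4-bit and some 5-bit quantizations fitting while 16-bit weights do not.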

Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled

Jackrong · 27.8B · runs from 8.4 GB

61.6K 695

This is the full-precision version of Jackrong's Qwen3.5 27B reasoning distillation from Claude 4.6 Opus. With 27.8 billion parameters in unquantized form, it preserves the maximum quality from the distillation process but requires significantly more VRAM, typically 56 GB or more in BF16. It is primarily intended for users with professional-grade GPUs or multi-GPU setups. This variant is ideal for further fine-tuning, experimentation, or running at full fidelity when hardware allows. Most users looking to run the model locally for inference should consider the GGUF-quantized version instead, which offers a much better tradeoff between quality and resource usage.

Chat · Reasoning

Mistral 7B Instruct v0.1

Mistral AI · 7B · runs from 3.5 GB

448.7K 1.8K

Mistral 7B Instruct v0.1 was the first instruction-tuned variant of the original Mistral 7B, fine-tuned for conversational and instruction-following tasks. While it has since been superseded by v0.2 and v0.3, it remains a solid lightweight chat model and an important milestone in the open-weight model ecosystem. Its hardware requirements are identical to the base Mistral 7B, running smoothly on GPUs with as little as 6 GB of VRAM when quantized. Users seeking the best Mistral 7B experience should generally prefer the newer v0.3 release, but v0.1 is still useful for reproducibility and benchmarking purposes.

Chat

Qwen3 Next 80B A3B Instruct

Alibaba · 81.3B · runs from 22.8 GB

323.6K 1.0K

Qwen3 Next 80B A3B Instruct is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with approximately 81.3 billion total parameters and around 3 billion active parameters per forward pass. This extreme ratio between total and active parameters allows the model to encode extensive knowledge across its expert layers while maintaining very fast per-token inference, making it an unusually efficient design for its capability level. The model is instruction-tuned for general-purpose chat and requires VRAM proportional to its full parameter count (about 81.3B) for weight loading, typically needing high-VRAM GPUs or quantized multi-GPU setups. Its low active parameter count results in fast generation speeds despite the large total model size. It is released under the Apache 2.0 license.

Chat
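The split between total and active parameters in the Qwen3 Next 80B A3B entry can be illustrated with a small calculation: memory is sized by the total parameter count, while the weights touched per token scale with the active count. This is a simplified sketch; `moe_footprint` is a hypothetical helper, 16 bits per weight is an assumed precision, and real per-token traffic also depends on routing and the KV cache.

```python
def moe_footprint(total_b: float, active_b: float, bits_per_weight: float):
    """Return (weights held in memory, weights read per token), both in GB."""
    loaded_gb = total_b * bits_per_weight / 8      # every expert must be resident
    per_token_gb = active_b * bits_per_weight / 8  # only active weights are used
    return loaded_gb, per_token_gb

# 81.3B total and ~3B active come from the listing; 16 bits is an assumption.
loaded, per_token = moe_footprint(81.3, 3.0, 16)
active_fraction = 3.0 / 81.3

print(f"Weights held in memory: ~{loaded:.1f} GB")
print(f"Weights read per token: ~{per_token:.1f} GB")
print(f"Active fraction:        {active_fraction:.1%}")
```

At a lower bits-per-weight value both numbers shrink by the same factor, so quantization reduces the memory requirement without changing the total-to-active ratio.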

Deepseek Coder 6.7B Instruct

DeepSeek · 6.7B · runs from 4.2 GB

143.7K 496

DeepSeek Coder 6.7B Instruct is a first-generation code-specialized model trained on a large corpus of source code and programming-related data. At 6.7 billion parameters, it provides solid code completion, generation, and explanation capabilities across popular programming languages while remaining small enough to run on most consumer GPUs. While newer models in the DeepSeek lineup have surpassed it in raw capability, this model remains a practical choice for users who need a lightweight local coding assistant with minimal hardware requirements. It runs well on GPUs with as little as 6 GB of VRAM when quantized.

Chat · Code

GLM 4.7 Flash REAP 23B A3B

Cerebras · 23.0B · runs from 7.4 GB

421 76

GLM 4.7 Flash REAP 23B A3B is a 23.0B-parameter open language model from Cerebras in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2 24B A2B

LiquidAI · 23.8B · runs from 7.0 GB

20.5K 332

LFM2 24B A2B is a 23.8B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 14B Instruct

Alibaba · 14.8B · runs from 5.1 GB

1.9M 347

Qwen2.5 14B Instruct is a 14.8-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and provides a balanced tradeoff between quality and hardware requirements, running well on GPUs with 16 GB of VRAM in quantized formats. The model is fine-tuned for chat, instruction following, and general-purpose assistant tasks. It performs well across reasoning, coding, and multilingual benchmarks for its size class, making it a practical option for local deployment when larger models are not feasible. It is released under the Apache 2.0 license.

Chat

DeepSeek R1 Distill Qwen 32B Abliterated

huihui-ai · 32.8B · runs from 9.8 GB

33.7K 244

DeepSeek R1 Distill Qwen 32B Abliterated is a 32.8B-parameter open language model from huihui-ai in the DeepSeek R1 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning