All LLM Models

Browse 593 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Featured only

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.

Gemma 4 E4B IT Ultra Uncensored Heretic

llmfan46 · 8.0B · runs from 3.9 GB

Gemma 4 E4B IT Ultra Uncensored Heretic is a 8.0B-parameter open language model from llmfan46 in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 4 Mini Reasoning

Microsoft · 3.8B · runs from 1.6 GB

Phi 4 Mini Reasoning is a 3.8B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

ChatMathCodeReasoning

DeepSeek R1 Distill Qwen 32B

DeepSeek · 32.8B · runs from 9.8 GB

DeepSeek R1 Distill Qwen 32B takes the reasoning capabilities developed in the full 684.5B R1 model and distills them into the 32.8 billion parameter Qwen 2.5 architecture. The result is a dense model that punches well above its weight class on math, science, and coding reasoning tasks, often matching models two to three times its size. At around 32.8 billion parameters, this model fits comfortably on a single high-end consumer GPU when quantized to 4-bit precision, making it one of the most capable reasoning models you can run on a desktop workstation.

DeepSeek R1 Distill Qwen 7B

DeepSeek · 7.6B · runs from 3.0 GB

DeepSeek R1 Distill Qwen 7B compresses the reasoning techniques from DeepSeek's full R1 model into a compact 7.6 billion parameter dense model built on the Qwen 2.5 architecture. Despite its small footprint, it demonstrates surprisingly capable step-by-step reasoning on math and logic problems that would stump many models several times its size. This is one of the most accessible reasoning models available for local use, fitting comfortably on GPUs with 6 GB or more of VRAM when quantized. It strikes a practical balance between genuine chain-of-thought reasoning ability and the hardware constraints of a typical consumer setup.

DeepSeek R1 Distill Llama 8B

DeepSeek · 8.0B · runs from 2.8 GB

DeepSeek R1 Distill Llama 8B brings R1's reinforcement-learned reasoning capabilities to the widely supported Llama 3.1 8B architecture. By distilling the full 684.5B R1 model's reasoning patterns into this 8 billion parameter dense model, DeepSeek created a version that benefits from the extensive Llama ecosystem of tools, quantizations, and inference engines. For users who prefer the Llama architecture or already have tooling built around it, this model offers a plug-and-play path to chain-of-thought reasoning. Its hardware requirements are very approachable, running well on consumer GPUs with 8 GB or more of VRAM at common quantization levels.

SmolLM2 1.7B Instruct

Hugging Face · 1.7B · runs from 1.4 GB

SmolLM2 1.7B Instruct is the largest instruction-tuned model in the SmolLM2 family, offering the best balance of capability and efficiency Hugging Face achieved with this generation. At 1.7 billion parameters it produces substantially more coherent and useful responses than its smaller siblings, handling multi-turn conversations, summarization, and simple reasoning tasks with competence. With VRAM requirements well under 4 GB at standard precision, this model runs effortlessly on entry-level GPUs, older laptops, and even some mobile devices. It is an excellent choice for developers building lightweight local assistants or chatbots who want genuine conversational quality without the hardware demands of larger models.

Qwen2.5 Coder 1.5B Instruct

Alibaba · 1.5B · runs from 1.0 GB

Qwen2.5 Coder 1.5B Instruct is a 1.5B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Mistral Small 3.2 24B Instruct 2506

Mistral AI · 24.0B · runs from 7.3 GB

Mistral Small 3.2 24B Instruct 2506 is a 24.0B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

MN 12B Mag Mell R1

inflatebot · 12.2B · runs from 4.1 GB

MN 12B Mag Mell R1 is a 12.2B-parameter open language model from inflatebot. It supports a context window of up to 1,024,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2.5 1.2B Instruct

LiquidAI · 1.2B · runs from 0.8 GB

LFM2.5 1.2B Instruct is an instruction-tuned model from Liquid AI that uses a novel hybrid architecture combining state-space models with attention mechanisms. At just 1.2 billion parameters, it is exceptionally lightweight and can run on virtually any hardware, including laptops and edge devices. Liquid AI's unconventional architecture aims to deliver better efficiency and longer context handling than traditional transformer models at this scale, making it an interesting option for users exploring alternatives to standard transformer-based LLMs.

Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop

DavidAU · 12.1B · runs from 5.6 GB

Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop is a 12.1B-parameter open language model from DavidAU in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled

Jackrong · 27.8B · runs from 8.4 GB

The full-precision version of Jackrong's Qwen3.5 27B reasoning distillation from Claude 4.6 Opus. With 27.8 billion parameters in unquantized form, this model preserves the maximum quality from the distillation process but requires significantly more VRAM, typically 56 GB or more in BF16. It is primarily intended for users with professional-grade GPUs or multi-GPU setups. This variant is ideal for further fine-tuning, experimentation, or running at full fidelity when hardware allows. Most users looking to run the model locally for inference should consider the GGUF-quantized version instead, which offers a much better tradeoff between quality and resource usage.

Mistral 7B Instruct v0.1

Mistral AI · 7B · runs from 3.5 GB

Mistral 7B Instruct v0.1 was the first instruction-tuned variant of the original Mistral 7B, fine-tuned for conversational and instruction-following tasks. While it has since been superseded by v0.2 and v0.3, it remains a solid lightweight chat model and an important milestone in the open-weight model ecosystem. Its hardware requirements are identical to the base Mistral 7B, running smoothly on GPUs with as little as 6 GB of VRAM when quantized. Users seeking the best Mistral 7B experience should generally prefer the newer v0.3 release, but v0.1 is still useful for reproducibility and benchmarking purposes.

Deepseek Coder 6.7B Instruct

DeepSeek · 6.7B · runs from 4.2 GB

DeepSeek Coder 6.7B Instruct is a first-generation code-specialized model trained on a large corpus of source code and programming-related data. At 6.7 billion parameters, it provides solid code completion, generation, and explanation capabilities across popular programming languages while remaining small enough to run on most consumer GPUs. While newer models in the DeepSeek lineup have surpassed it in raw capability, this model remains a practical choice for users who need a lightweight local coding assistant with minimal hardware requirements. It runs well on GPUs with as little as 6 GB of VRAM when quantized.

GLM 4.7 Flash REAP 23B A3B

Cerebras · 23.0B · runs from 7.4 GB

GLM 4.7 Flash REAP 23B A3B is a 23.0B-parameter open language model from Cerebras in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2 24B A2B

LiquidAI · 23.8B · runs from 7.0 GB

LFM2 24B A2B is a 23.8B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 14B Instruct

Alibaba · 14.8B · runs from 5.1 GB

Qwen2.5 14B Instruct is a 14-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and provides a balanced tradeoff between quality and hardware requirements, running well on GPUs with 16GB of VRAM in quantized formats. The model is fine-tuned for chat, instruction following, and general-purpose assistant tasks. It performs well across reasoning, coding, and multilingual benchmarks for its size class, making it a practical option for local deployment when larger models are not feasible. Released under the Apache 2.0 license.

DeepSeek R1 Distill Qwen 32B Abliterated

huihui-ai · 32.8B · runs from 9.8 GB

DeepSeek R1 Distill Qwen 32B Abliterated is a 32.8B-parameter open language model from huihui-ai in the DeepSeek R1 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 12B IT Heretic

igorls · 12.0B · runs from 6.1 GB

Gemma 4 12B IT Heretic is a 12.0B-parameter open language model from igorls in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Huihui Qwen3 Coder 30B A3B Instruct Abliterated

huihui-ai · 30.5B · runs from 8.8 GB

Huihui Qwen3 Coder 30B A3B Instruct Abliterated is a 30.5B-parameter open language model from huihui-ai in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 2 9B IT

Google · 9.2B · runs from 3.0 GB

Google Gemma 2 9B IT is a 9.2-billion parameter instruction-tuned model from Google's Gemma 2 series. It is a text-only chat model optimized for conversational tasks, instruction following, and general-purpose assistance. At release, it was recognized for delivering unusually strong performance relative to its parameter count. The model runs efficiently on consumer GPUs with 8-12GB of VRAM in quantized formats, making it accessible on mainstream hardware. It is a popular choice for local inference among users who want strong quality without the VRAM demands of larger models. Released under the Gemma license.

DeepSeek Coder v2 Lite Instruct

DeepSeek · 15.7B · runs from 7.2 GB

DeepSeek Coder V2 Lite Instruct is a code-focused mixture-of-experts model with 15.7 billion total parameters, trained to handle both programming tasks and general conversation. It supports a wide range of programming languages and excels at code generation, debugging, explanation, and refactoring. The MoE architecture keeps compute costs manageable despite the model's broad capabilities, and the Lite variant is sized to run on a single consumer GPU. For developers looking for a capable local coding assistant that can also handle general chat, this model offers an appealing combination of code specialization and practical hardware requirements.

Gemma 4 E2B IT Uncensored

TrevorJS · 5.1B · runs from 2.5 GB

Gemma 4 E2B IT Uncensored is a 5.1B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

SmolLM2 360M Instruct

Hugging Face · 362M · runs from 0.5 GB

SmolLM2 360M Instruct is an instruction-tuned model from Hugging Face that occupies the sweet spot between the 135M and 1.7B entries in the SmolLM2 lineup. At 360 million parameters, it offers noticeably better coherence and instruction-following ability than the smallest variants while still running comfortably on virtually any modern GPU or even on CPU. This model is well suited for on-device assistants, embedded applications, and rapid prototyping where you need real conversational ability without dedicating significant hardware resources. It handles short-form generation, summarization, and basic reasoning tasks with reasonable quality.

Gemma 4 26B A4B IT Uncensored

TrevorJS · 25.8B · runs from 11.6 GB

Gemma 4 26B A4B IT Uncensored is a 25.8B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Nemotron 3 Nano Omni 30B A3B Reasoning BF16

NVIDIA · 33.0B · runs from 10.0 GB

Nemotron 3 Nano Omni 30B A3B Reasoning BF16 is a 33.0B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 2

Microsoft · 2.8B · runs from 2.1 GB

Microsoft Phi 2 is a 2.8-billion parameter language model from Microsoft Research that pioneered the concept of small but highly capable language models. Released in late 2023, Phi 2 demonstrated that strategic data curation and training methodology could allow a sub-3B model to outperform many 7B and 13B models on reasoning and coding benchmarks. The model runs on virtually any modern GPU and even on CPU-only setups. While succeeded by Phi 3 and Phi 4, Phi 2 remains historically significant as the model that proved small-scale language models could be genuinely useful for practical tasks. Released under the MIT license.

Granite 4.1 3B

IBM · 3.4B · runs from 1.6 GB

Granite 4.1 3B is a 3.4B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

NVIDIA Nemotron Nano 9B v2 Japanese

NVIDIA · 8.9B · runs from 4.4 GB

NVIDIA Nemotron Nano 9B v2 Japanese is a specialized variant of the Nemotron Nano 9B v2, fine-tuned for Japanese language understanding and generation. At 8.9 billion parameters, it maintains the same hardware-friendly footprint as the English version while delivering natural Japanese conversational ability. For users looking to run a Japanese-language assistant locally, this model offers a rare combination of compact size and dedicated language optimization from a major hardware vendor. It handles Japanese text with the fluency you'd expect from a purpose-built model rather than a multilingual afterthought.

Cydonia 24B V4.3

TheDrummer · 23.6B · runs from 7.8 GB

Cydonia 24B V4.3 is a 23.6B-parameter open language model from TheDrummer. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.