All LLM Models

Browse 593 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
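
The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate of weight memory only, under stated assumptions: the bits-per-weight values below are approximate averages chosen for illustration (they are not taken from this page), and the helper name `weight_memory_gb` is hypothetical. Runtime overhead such as context cache and buffers is not included.

```python
# Approximate bits stored per weight for each format.
# Assumed averages for illustration only, not values published on this page.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
}


def weight_memory_gb(params_billions: float, fmt: str) -> float:
    """Estimate memory for the weights alone, in decimal gigabytes."""
    bits = BITS_PER_WEIGHT[fmt]
    total_bytes = params_billions * 1e9 * bits / 8  # 8 bits per byte
    return total_bytes / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"7B @ {fmt}: ~{weight_memory_gb(7.0, fmt):.1f} GB")
```

Under these assumptions a 7B model comes out at about 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the figures quoted above. Actual usage is higher once the context window and runtime buffers are allocated.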

Qwen2.5 3B Instruct

Alibaba · 3.1B · runs from 1.4 GB

Qwen2.5 3B Instruct is a 3.1-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 family. It is designed for efficient local inference on consumer hardware and supports a 32K token context window. The model can run on GPUs with as little as 4 GB of VRAM when quantized. Despite its small size, Qwen2.5 3B Instruct delivers competitive performance for basic conversational tasks, summarization, and simple instruction following. It is a good option for edge deployment and resource-constrained environments. Released under the Qwen Research license.

DeepSeek R1 0528 Qwen3 8B

DeepSeek · 8.2B · runs from 2.9 GB

DeepSeek R1 0528 Qwen3 8B is an 8.2B-parameter open language model from DeepSeek in the DeepSeek R1 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 3.5 Mini Instruct

Microsoft · 3.8B · runs from 2.3 GB

Phi 3.5 Mini Instruct is a 3.8B-parameter open language model from Microsoft in the Phi 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 3 12B IT

Google · 12.2B · runs from 3.7 GB

Google Gemma 3 12B IT is a 12-billion parameter multimodal instruction-tuned model from Google's Gemma 3 series. It supports both text and image inputs, offering vision-language capabilities at a more accessible size point than the 27B variant. Gemma 3 12B IT runs on consumer GPUs with 12-16GB of VRAM in quantized formats, making it a practical choice for local multimodal inference without requiring top-tier hardware. Released under the Gemma license.

GLM 4.7 Flash

zai-org · 31.2B · runs from 9.7 GB

GLM 4.7 Flash is a 31.2B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 Coder 32B Instruct

Alibaba · 32.8B · runs from 9.8 GB

Qwen2.5 Coder 32B Instruct is a 32.8-billion parameter code-specialized model from Alibaba Cloud, instruction-tuned for programming assistance and code generation. It is trained on a large corpus of source code alongside natural language data, making it highly capable for tasks such as code completion, debugging, code explanation, and software engineering dialogue. The model supports a 128K token context window and delivers code generation quality competitive with the best open-weight coding models at any scale. A GPU with 24 GB of VRAM is the practical target for 4-bit quantized inference, although lower-bit quantizations start at 9.8 GB. Released under the Apache 2.0 license.

Meta Llama 3 8B Instruct

Meta · 8.0B · runs from 2.6 GB

Meta Llama 3 8B Instruct is the instruction-tuned version of Meta's Llama 3 8B base model, with 8 billion parameters. It is fine-tuned for dialogue and chat use cases using supervised fine-tuning and RLHF, making it ready for conversational applications out of the box. The model supports an 8K token context window and performs well across coding, reasoning, and general knowledge tasks. Its efficient size makes it one of the most popular models for local inference on consumer hardware. Released under the Meta Llama 3 Community License.

Qwen3 4B Instruct 2507

Alibaba · 4.0B · runs from 1.6 GB

Qwen3 4B Instruct 2507 is a July 2025 refresh of Alibaba's compact 4-billion-parameter chat model from the Qwen3 family. This updated release brings improved instruction following and conversational quality while remaining lightweight enough to run on most modern GPUs and even some higher-end integrated graphics setups. With its modest size, the 4B Instruct 2507 strikes a practical balance between capability and resource efficiency. It is well suited for everyday chat, summarization, and light assistant tasks on consumer hardware, making it one of the more accessible entry points into the Qwen3 lineup.

Qwen2.5 Coder 14B Instruct

Alibaba · 14.8B · runs from 5.1 GB

Qwen2.5 Coder 14B Instruct is a 14.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 0.5B Instruct

Alibaba · 494M · runs from 0.5 GB

Qwen2.5 0.5B Instruct is the smallest instruction-tuned model in Alibaba Cloud's Qwen 2.5 family, with just 494 million parameters. It is designed for ultra-lightweight deployment scenarios where minimal hardware resources are available, running comfortably on virtually any modern GPU or even CPU-only configurations. Despite its tiny footprint, the model supports a 32K token context window and can handle basic chat, simple summarization, and lightweight instruction following. It is primarily useful for edge deployment, experimentation, and prototyping where model size is a critical constraint. Released under the Apache 2.0 license.

Phi 4 Mini Instruct

Microsoft · 3.8B · runs from 2.2 GB

Microsoft Phi 4 Mini Instruct is a 3.8-billion parameter instruction-tuned model from Microsoft Research's Phi 4 family. It applies the Phi series' data-centric training philosophy to a compact model, delivering strong performance in coding, reasoning, and chat tasks relative to its small footprint. The model runs on consumer GPUs with as little as 4-6GB of VRAM when quantized, making it accessible on mainstream and even entry-level hardware. Released under the MIT license.

Gemma 3 1B IT

Google · 1.0B · runs from 0.3 GB

Google Gemma 3 1B IT is a 1-billion parameter instruction-tuned model from Google's Gemma 3 family. It is an ultra-compact text-only chat model designed for deployment on minimal hardware, including low-VRAM GPUs and edge devices. The model handles basic conversational tasks, simple instruction following, and lightweight text generation. It can run on virtually any modern GPU and even on CPU-only setups with acceptable latency. Released under the Gemma license.

Mistral 7B Instruct v0.2

Mistral AI · 7.2B · runs from 3.6 GB

Mistral 7B Instruct v0.2 is a 7.2B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

TinyLlama 1.1B Chat v1.0

TinyLlama · 1.1B · runs from 0.8 GB

TinyLlama 1.1B Chat is a 1.1-billion parameter chat model built on the Llama 2 architecture and trained on approximately 3 trillion tokens, an unusually large dataset for a model of its size. The TinyLlama project demonstrated that small models can achieve strong performance when given sufficient training compute, making it a standout in the sub-2B parameter class. The Chat variant is fine-tuned for conversational use and runs on virtually any modern GPU, including entry-level cards with 4GB of VRAM or less. It is a practical choice for lightweight local inference, edge deployment, and experimentation where hardware resources are limited.

Mistral 7B Instruct v0.3

Mistral AI · 7.2B · runs from 2.7 GB

Mistral 7B Instruct v0.3 is the latest instruction-tuned release of Mistral AI's original 7-billion-parameter model, delivering meaningful improvements in instruction following, function calling, and multilingual support over its predecessors. With an extended 32K-token vocabulary and refined chat capabilities, v0.3 remains one of the most capable sub-10B models available. At 7.2 billion parameters it sits comfortably in the sweet spot for local inference: 4-bit quantized builds run well on GPUs with 6–8 GB of VRAM, lower-bit quantizations can fit on 4 GB cards, and full FP16 precision needs roughly 14 GB. It is an excellent default choice for anyone getting started with local LLMs who wants strong conversational performance without heavy hardware.

Gemma 3 27B IT

Google · 27.4B · runs from 8.3 GB

Google Gemma 3 27B IT is a 27.4-billion parameter multimodal instruction-tuned model from Google's Gemma 3 family. It supports both text and image inputs, making it one of the most capable openly available vision-language models for local inference. The model handles conversational AI, visual question answering, image description, and complex reasoning tasks across modalities. A GPU with 24 GB of VRAM is the practical target for 4-bit quantized inference, placing the model within reach of high-end consumer cards like the RTX 4090, although lower-bit quantizations start at 8.3 GB. It uses a dense Transformer architecture with a 128K-token context window and benefits from Google's extensive pretraining pipeline. Released under the Gemma license.

Mistral Nemo Instruct 2407

Mistral AI · 12.2B · runs from 4.8 GB

Mistral Nemo Instruct 2407 is a 12.2B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 4

Microsoft · 14.7B · runs from 5.1 GB

Microsoft Phi 4 is a 14-billion parameter language model from Microsoft Research's Phi series, designed to deliver strong reasoning, mathematical, and coding performance at an efficient size. Phi 4 continues the Phi family's focus on maximizing capability per parameter through high-quality training data curation, achieving benchmark scores that rival much larger models on reasoning and STEM tasks. The model runs well on consumer GPUs with 12-16GB of VRAM in quantized formats. It excels at mathematical problem solving, code generation, and structured reasoning. Released under the MIT license.

Mistral Small 24B Instruct 2501

Mistral AI · 23.6B · runs from 7.8 GB

Mistral Small 24B Instruct is Mistral AI's January 2025 release targeting the mid-range parameter sweet spot. At 24 billion parameters it sits between lightweight 7B models and heavier 70B-class offerings, delivering strong instruction-following, reasoning, and coding performance without demanding top-tier hardware. This model fits comfortably on a single GPU with 16–24 GB of VRAM at common quantization levels, making it an attractive option for users with cards like the RTX 4090 or RTX 3090 who want a noticeable step up from 7B models. It strikes an appealing balance between quality and resource requirements for serious local use.

QwQ 32B

Alibaba · 32.8B · runs from 9.8 GB

QwQ 32B is a 32-billion parameter reasoning-focused model from Alibaba Cloud's Qwen family. Unlike standard chat models, QwQ is specifically optimized for step-by-step logical reasoning, complex problem solving, and mathematical tasks. It employs extended chain-of-thought processing, generating detailed internal reasoning before producing final answers, which significantly improves accuracy on challenging analytical problems. A GPU with 24 GB of VRAM is the practical target for 4-bit quantized inference, although lower-bit quantizations start at 9.8 GB, and the model delivers reasoning performance competitive with much larger models. It is particularly well suited for users who need strong analytical capabilities for math, science, coding logic, and multi-step problem solving. Released under the Apache 2.0 license.

Gemma 4 E4B IT Qat Q4 0 Unquantized

Google · 7.9B · runs from 3.9 GB

Gemma 4 E4B IT Qat Q4 0 Unquantized is a 7.9B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek R1 Distill Qwen 14B

DeepSeek · 14.8B · runs from 5.1 GB

DeepSeek R1 Distill Qwen 14B sits in a sweet spot between the smaller 7B distill and the more demanding 32B version, offering strong reasoning performance at 14.8 billion parameters on the Qwen 2.5 architecture. It captures a meaningful share of the full R1's chain-of-thought capabilities while keeping resource requirements within the range of mainstream consumer GPUs. Quantized to 4-bit, it fits comfortably on GPUs with 12 GB of VRAM, delivering reliable step-by-step reasoning for math, logic, and analytical problems.
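
A quick way to sanity-check a claim like "fits on a 12 GB card at 4-bit" is to test each format against a VRAM budget. The sketch below is illustrative only: the bits-per-weight figures are assumed approximations, the 1.5 GB headroom for context cache and runtime buffers is a hypothetical default, and `best_fit` is not part of any library.

```python
from typing import Optional

# Assumed average bits per weight, ordered from highest to lowest precision.
# Illustrative approximations only, not values published on this page.
FORMATS = [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]


def best_fit(params_billions: float, vram_gb: float,
             headroom_gb: float = 1.5) -> Optional[str]:
    """Return the highest-precision format whose weights fit in the budget."""
    # Hypothetical reserve for context cache and runtime buffers.
    usable_gb = vram_gb - headroom_gb
    for name, bits in FORMATS:
        # Billions of parameters times bytes per parameter gives decimal GB.
        weights_gb = params_billions * bits / 8
        if weights_gb <= usable_gb:
            return name
    return None  # nothing in the table fits this budget


if __name__ == "__main__":
    for vram in (8.0, 12.0, 24.0):
        print(f"14.8B on {vram:.0f} GB: {best_fit(14.8, vram)}")
```

With these assumptions, a 14.8B model on a 12 GB card selects Q4_K_M (about 8.9 GB of weights), a 24 GB card has room for Q8_0, and an 8 GB card would need a lower-bit format than the table covers.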

Gemma 3 270M IT

Google · 268M · runs from 0.1 GB

Google Gemma 3 270M IT is a 270-million parameter instruction-tuned model from Google's Gemma 3 family, an experimental release pushing the boundaries of how small an effective chat model can be. The model runs on virtually any hardware, including entry-level GPUs and CPU-only setups, making it useful for experimentation, education, and exploring the limits of small-scale language modeling. Released under the Gemma license.

LFM2.5 8B A1B

LiquidAI · 8.5B · runs from 2.7 GB

LFM2.5 8B A1B is an 8.5B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 E2B IT Qat Q4 0 Unquantized

Google · 5.1B · runs from 2.5 GB

Gemma 4 E2B IT Qat Q4 0 Unquantized is a 5.1B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

SmolLM2 135M Instruct

Hugging Face · 135M · runs from 0.4 GB

SmolLM2 135M Instruct is the instruction-tuned variant of Hugging Face's 135-million-parameter SmolLM2 model. Fine-tuned to follow user prompts and engage in basic conversational exchanges, it delivers surprisingly coherent responses given its minimal size, making it ideal for testing chat interfaces or running on extremely constrained devices. This model is a practical choice when you need an instruction-following model that fits comfortably in under 1 GB of memory. It works well for simple question answering, text reformatting, and lightweight assistant tasks where response quality can be traded for instant inference speed.

Qwen2.5 Coder 3B Instruct

Alibaba · 3.1B · runs from 1.4 GB

Qwen2.5 Coder 3B Instruct is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek R1 Distill Qwen 1.5B

DeepSeek · 1.8B · runs from 0.8 GB

DeepSeek R1 Distill Qwen 1.5B is the smallest model in the R1 distillation family, packing chain-of-thought reasoning capabilities into just 1.5 billion parameters using the Qwen 2.5 architecture. It represents an ambitious attempt to bring structured reasoning to the smallest practical model size. At this scale, the model can run on virtually any modern GPU and even on CPU-only setups with acceptable speed. While its reasoning depth is naturally limited compared to its larger siblings, it still demonstrates structured thinking patterns that set it apart from generic models of similar size.

GLM 4.6V Flash

zai-org · 10.3B · runs from 3.2 GB

GLM 4.6V Flash is a 10.3B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 4B Thinking 2507

Alibaba · 4.0B · runs from 1.6 GB

Qwen3 4B Thinking 2507 is the reasoning-optimized variant of Alibaba's compact 4-billion-parameter Qwen3 model, released in the July 2025 update cycle. Despite its small size, this thinking variant is tuned to produce chain-of-thought reasoning and step-by-step problem solving, making it a surprisingly capable lightweight reasoner. This model is ideal for users who want basic reasoning and analytical capabilities on very modest hardware. It can run on most consumer GPUs and even some CPU-only setups when quantized, providing an accessible entry point for experimenting with reasoning-style models without any significant hardware investment.