All LLM Models
Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
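This rule of thumb reduces to a quick back-of-the-envelope calculation: parameters × bits per weight ÷ 8 gives the weight footprint in bytes. The sketch below assumes Q4_K_M averages roughly 4.5 effective bits per weight (a common approximation, not an exact figure) and counts weights only; KV cache and runtime overhead add more on top in practice.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate in GB.

    Excludes KV cache, activations, and framework overhead, which
    typically add another 1-4+ GB depending on context length.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9, 1)

# 7B model at FP16 (16 bits/weight)
print(estimate_vram_gb(7, 16))   # → 14.0
# 7B model at Q4_K_M (~4.5 effective bits/weight, an approximation)
print(estimate_vram_gb(7, 4.5))  # → 3.9
```

These numbers match the ~14 GB and ~4 GB figures quoted above; treat the output as a lower bound rather than a guarantee that a model fits.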
Model List
DeepSeek R1
DeepSeek · 684.5B
DeepSeek R1 is a groundbreaking reasoning model that uses reinforcement learning to develop chain-of-thought capabilities without relying on supervised fine-tuning. With 684.5 billion total parameters in a mixture-of-experts architecture (only 37 billion active per token), R1 achieves performance competitive with OpenAI's o1 on math, coding, and complex reasoning benchmarks while remaining fully open-weight. Running the full R1 locally is a serious undertaking, requiring well over 300 GB of VRAM at full precision, though quantized versions bring it within reach of multi-GPU setups. For users who want R1-level reasoning on more modest hardware, DeepSeek also released a family of distilled models that pack R1's reasoning patterns into smaller dense architectures.
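The memory math behind MoE models like R1 can be sketched as follows: VRAM scales with the total parameter count, because every expert must stay resident, while per-token compute scales only with the active subset. The 4.5 bits/weight figure below is an assumed Q4-class quantization level, not an official recommendation.

```python
def moe_memory_vs_compute(total_params_b: float, active_params_b: float,
                          bits_per_weight: float = 4.5) -> dict:
    """Weight memory scales with TOTAL params (all experts loaded);
    per-token compute scales only with the ACTIVE fraction."""
    weights_gb = total_params_b * 1e9 * bits_per_weight / 8 / 1e9
    return {
        "weights_gb": round(weights_gb),
        "active_fraction": round(active_params_b / total_params_b, 3),
    }

# DeepSeek R1: 684.5B total, ~37B active per token
print(moe_memory_vs_compute(684.5, 37))
# → {'weights_gb': 385, 'active_fraction': 0.054}
```

Even at 4-bit-class quantization the weights alone approach 400 GB, which is why R1 remains a multi-GPU proposition and the distilled models exist for modest hardware.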
Meta Llama 3 8B
Meta · 8.0B
Meta Llama 3 8B is an 8-billion parameter base (pretrained) language model from Meta's Llama 3 release. As a base model, it is not fine-tuned for chat or instructions and is intended for further fine-tuning, research, or as a foundation for custom applications. It uses grouped-query attention and was trained on over 15 trillion tokens. Llama 3 8B supports an 8K token context window and delivers strong benchmark performance across language understanding, reasoning, and coding tasks for its size. It is released under the Meta Llama 3 Community License and runs efficiently on consumer GPUs with 8GB or more of VRAM.
Llama 3.1 8B Instruct
Meta · 8B
Meta Llama 3.1 8B Instruct is an 8-billion parameter instruction-tuned language model from Meta. Part of the Llama 3.1 release, it supports a 128K token context window and is fine-tuned for conversational use, tool calling, and general assistant tasks. Its compact size makes it well-suited for local deployment on modern consumer GPUs with 8GB or more of VRAM. Llama 3.1 8B Instruct delivers strong performance for its parameter class across benchmarks in reasoning, coding, and multilingual understanding. It is released under the Llama 3.1 Community License and is widely supported by inference frameworks such as llama.cpp, vLLM, and Ollama.
Llama 2 7B Chat HF
Meta · 7B
Meta Llama 2 7B Chat is a 7-billion parameter instruction-tuned model from Meta's Llama 2 family, optimized for dialogue use cases. It was fine-tuned using supervised fine-tuning and RLHF on top of the Llama 2 7B base model, with a 4K token context window. This model is suitable for basic conversational AI tasks and runs efficiently on consumer GPUs. While newer Llama generations offer improved performance, Llama 2 7B Chat remains a well-understood and widely-supported option for local inference. Released under the Llama 2 Community License.
Mixtral 8x7B Instruct v0.1
Mistral AI · 46.7B
Mixtral 8x7B Instruct v0.1 is Mistral AI's flagship Mixture-of-Experts model. Each layer routes tokens to two of eight expert feed-forward networks; because attention and embedding layers are shared across experts, the total comes to 46.7B parameters rather than a naive 8 × 7B, with only about 12.9 billion active per token. This sparse architecture delivers performance that rivals much larger dense models at a fraction of the inference cost, excelling across reasoning, code generation, and multilingual tasks. Because the full weights must still be loaded into memory, you will need around 24–48 GB of VRAM depending on quantization level, making it best suited for multi-GPU desktop setups or high-VRAM workstation cards. If your hardware can accommodate it, Mixtral offers one of the best performance-per-active-parameter ratios available for local deployment.
GPT OSS 120B
OpenAI · 120.4B
GPT-OSS 120B is the larger of OpenAI's open-source model releases, bringing 120.4 billion parameters of GPT-lineage capability to the open-weight ecosystem. It represents near-frontier performance across reasoning, knowledge, code generation, and conversational tasks, rivaling top proprietary offerings in many benchmarks. Running this model locally is a serious hardware commitment, typically requiring multiple high-VRAM GPUs or a professional-grade setup with 80+ GB of combined VRAM even at aggressive quantization levels. It is best suited for enthusiasts with multi-GPU rigs or workstation hardware who want the strongest possible local model from OpenAI's catalog.
GPT OSS 20B
OpenAI · 21.5B
GPT-OSS 20B is one of OpenAI's first open-source model releases, marking a historic shift in the company's approach to open weights. At 21.5 billion parameters it delivers strong general-purpose chat and reasoning capabilities informed by the research behind the GPT family, making it a compelling option for users who want OpenAI-grade quality in a locally deployable package. The model runs comfortably on a single high-end consumer GPU such as an RTX 4090 at 4-bit quantization, or on workstation cards with 24 GB or more of VRAM at higher precision. It occupies a practical middle ground between lightweight 7B models and resource-heavy 70B+ offerings.
Meta Llama 3 8B Instruct
Meta · 8.0B
Meta Llama 3 8B Instruct is the instruction-tuned version of Meta's Llama 3 8B base model, with 8 billion parameters. It is fine-tuned for dialogue and chat use cases using supervised fine-tuning and RLHF, making it ready for conversational applications out of the box. The model supports an 8K token context window and performs well across coding, reasoning, and general knowledge tasks. Its efficient size makes it one of the most popular models for local inference on consumer hardware. Released under the Meta Llama 3 Community License.
Mistral 7B v0.1
Mistral AI · 7B
Mistral 7B v0.1 is the original base model from Mistral AI that helped reshape expectations for small open-weight language models when it launched in late 2023. As a pretrained foundation model without instruction tuning, it is designed for fine-tuning, research, and custom downstream tasks rather than direct conversational use. With 7 billion parameters and support for grouped-query attention and sliding-window attention, it remains a popular starting point for practitioners building specialized models. Its modest VRAM requirements of roughly 6 GB at 4-bit quantization keep it accessible on a wide range of consumer GPUs.
Phi 2
Microsoft · 2.8B
Microsoft Phi 2 is a 2.8-billion parameter language model from Microsoft Research that pioneered the concept of small but highly capable language models. Released in late 2023, Phi 2 demonstrated that strategic data curation and training methodology could allow a sub-3B model to outperform many 7B and 13B models on reasoning and coding benchmarks. The model runs on virtually any modern GPU and even on CPU-only setups. While succeeded by Phi 3 and Phi 4, Phi 2 remains historically significant as the model that proved small-scale language models could be genuinely useful for practical tasks. Released under the MIT license.
Gemma 7B
Google · 7B
Google Gemma 7B is a 7-billion parameter base (pretrained) model from the original Gemma generation, Google's first openly available family of language models. It represents Google's initial entry into the open-weight LLM space. While superseded by Gemma 2 and Gemma 3 in terms of benchmark performance, the original Gemma 7B remains a solid foundation model and a useful reference point in the evolution of Google's open models. Released under the Gemma license.
GPT-2
OpenAI · 137M
GPT-2 is the landmark 2019 language model from OpenAI that helped ignite widespread interest in large-scale text generation. At only 137 million parameters it is tiny by modern standards, but it holds an important place in AI history as the model that was initially deemed too dangerous to release in full. Today GPT-2 runs effortlessly on virtually any hardware, including CPUs, making it ideal for educational purposes, experimentation, and understanding transformer fundamentals. It should not be expected to match the quality of modern instruction-tuned models, but it remains a useful teaching tool and conversation starter.
DeepSeek V3 0324
DeepSeek · 684.5B
DeepSeek V3 0324 is DeepSeek's flagship general-purpose chat model, featuring a 684.5 billion parameter mixture-of-experts architecture with roughly 37 billion parameters active per token. It delivers strong performance across a wide range of tasks including conversation, writing, analysis, coding, and instruction following, competing with the best closed-source models available. Like other large MoE models, V3 requires substantial memory to load all expert weights even though only a fraction are used during inference. Quantized versions make it feasible on multi-GPU setups, and its combination of broad capability with open weights has made it one of the most widely deployed open models for local and self-hosted use.
QwQ 32B
Alibaba · 32B
QwQ 32B is a 32-billion parameter reasoning-focused model from Alibaba Cloud's Qwen family. Unlike standard chat models, QwQ is specifically optimized for step-by-step logical reasoning, complex problem solving, and mathematical tasks. It employs extended chain-of-thought processing, generating detailed internal reasoning before producing final answers, which significantly improves accuracy on challenging analytical problems. The model requires a GPU with at least 24GB of VRAM for quantized inference and delivers reasoning performance competitive with much larger models. It is particularly well suited for users who need strong analytical capabilities for math, science, coding logic, and multi-step problem solving. Released under the Apache 2.0 license.
Llama 3.3 70B Instruct
Meta · 70B
Meta Llama 3.3 70B Instruct is a 70-billion parameter large language model from Meta, released as part of the Llama 3.3 generation. It is an instruction-tuned model optimized for dialogue and chat use cases, offering strong performance across reasoning, coding, and multilingual tasks. Llama 3.3 70B delivers quality competitive with much larger models while remaining feasible to run on high-end consumer or workstation GPUs with sufficient VRAM. The model uses a grouped-query attention architecture with a 128K token context window and was trained on a massive multilingual corpus. It is released under the Llama 3.3 Community License, making it one of the most capable openly available models for local inference.
Mistral 7B Instruct v0.3
Mistral AI · 7.2B
Mistral 7B Instruct v0.3 is the latest instruction-tuned release of Mistral AI's original 7-billion-parameter model, delivering meaningful improvements in instruction following, function calling, and multilingual support over its predecessors. With an extended 32K-token vocabulary and refined chat capabilities, v0.3 remains one of the most capable sub-10B models available. At 7.2 billion parameters it sits comfortably in the sweet spot for local inference: it needs roughly 14 GB of VRAM at FP16, but runs well on 8 GB cards at 8-bit quantization and even on 4–6 GB cards at 4-bit. It is an excellent default choice for anyone getting started with local LLMs who wants strong conversational performance without heavy hardware.
DeepSeek R1 0528
DeepSeek · 684.5B
DeepSeek R1 0528 is an updated release of the R1 reasoning model, incorporating improvements to training and inference that sharpen its performance on complex multi-step problems. It retains the same 684.5 billion parameter mixture-of-experts architecture as the original R1, with approximately 37 billion parameters active per forward pass. This revision addresses several edge cases where the original R1 struggled, delivering more consistent reasoning chains and fewer hallucinations on difficult math and coding tasks. Hardware requirements remain identical to the original R1, so users already set up to run the first version can swap in the 0528 weights with no changes to their infrastructure.
Llama 3.2 1B
Meta · 1.2B
Meta Llama 3.2 1B is a 1.2-billion parameter base (pretrained) model from Meta's Llama 3.2 release. It is the smallest model in the Llama 3.2 family and is designed for research, fine-tuning, and embedding into resource-constrained environments. It supports a 128K token context window. As a base model, it is not optimized for conversational use without further fine-tuning. Its minimal resource requirements make it suitable for experimentation, edge deployment, and as a starting point for domain-specific fine-tuning. Released under the Llama 3.2 Community License.
Kimi K2 Instruct
Moonshot AI · 1026.5B
Kimi K2 Instruct is Moonshot AI's massive Mixture-of-Experts model, weighing in at over one trillion total parameters. It represents one of the largest open-weight models available, delivering frontier-class performance across reasoning, coding, and multilingual tasks through its sparse MoE architecture that activates only a fraction of its full parameter count per token. Running Kimi K2 locally is an extreme undertaking, requiring professional multi-GPU setups with hundreds of gigabytes of combined VRAM even at aggressive quantization. This model is best suited for research labs, enterprise deployments, or enthusiasts with access to server-grade hardware who want to explore trillion-parameter-scale inference.
Llama 2 7B HF
Meta · 6.7B
Meta Llama 2 7B is a 6.7-billion parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.
Phi 4
Microsoft · 14B
Microsoft Phi 4 is a 14-billion parameter language model from Microsoft Research's Phi series, designed to deliver strong reasoning, mathematical, and coding performance at an efficient size. Phi 4 continues the Phi family's focus on maximizing capability per parameter through high-quality training data curation, achieving benchmark scores that rival much larger models on reasoning and STEM tasks. The model runs well on consumer GPUs with 12–16 GB of VRAM in quantized formats. It excels at mathematical problem solving, code generation, and structured reasoning. Released under the MIT license.
Llama 3.1 8B
Meta · 8B
Meta Llama 3.1 8B is an 8-billion parameter base (pretrained) model from the Llama 3.1 family. It is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Compared to Llama 3 8B, it extends the context window to 128K tokens and benefits from improved training data and methodology. The model uses grouped-query attention and was trained on a multilingual corpus. It is released under the Llama 3.1 Community License and is widely used as a foundation for community fine-tunes and specialized models.
Llama 3.1 Nemotron 70B Instruct HF
NVIDIA · 70B
Llama 3.1 Nemotron 70B Instruct is a 70-billion parameter chat model by NVIDIA, created by applying reinforcement learning from human feedback (RLHF) to Meta's Llama 3.1 70B base model. NVIDIA's Nemotron training pipeline focuses on improving helpfulness, accuracy, and response quality beyond the standard Llama instruction tuning. The model requires substantial VRAM for local inference, typically needing multi-GPU setups or high-end professional GPUs. In quantized formats it becomes accessible on workstation-class hardware. It is available in Hugging Face Transformers format and is supported by popular inference engines.
Llama 3.2 3B Instruct
Meta · 3B
Meta Llama 3.2 3B Instruct is a 3-billion parameter instruction-tuned model from Meta's Llama 3.2 release, designed for efficient local inference on resource-constrained hardware. It supports a 128K token context window and is optimized for conversational AI, summarization, and general assistant tasks. Despite its small footprint, Llama 3.2 3B Instruct delivers competitive performance for its size class and can run on GPUs with as little as 4GB of VRAM when quantized. It is released under the Llama 3.2 Community License and is a practical choice for edge deployment and lightweight local inference.
Qwen2.5 Coder 32B Instruct
Alibaba · 32.8B
Qwen2.5 Coder 32B Instruct is a 32.8-billion parameter code-specialized model from Alibaba Cloud, instruction-tuned for programming assistance and code generation. It is trained on a large corpus of source code alongside natural language data, making it highly capable for tasks such as code completion, debugging, code explanation, and software engineering dialogue. The model supports a 128K token context window and delivers code generation quality competitive with the best open-weight coding models at any scale. It requires a GPU with at least 24GB of VRAM for quantized inference. Released under the Apache 2.0 license.
GLM 4.7
zai-org · 358.3B
GLM 4.7 is an earlier generation of Zhipu AI's GLM foundation model series, featuring a mixture-of-experts architecture with approximately 358 billion total parameters. It delivers strong performance on reasoning, language understanding, and bilingual Chinese-English tasks while being significantly more manageable to run locally than its GLM 5 successor. For users with multi-GPU setups, GLM 4.7 offers a practical balance between capability and hardware requirements within the GLM model family.
Gemma 3 27B IT
Google · 27.4B
Google Gemma 3 27B IT is a 27.4-billion parameter multimodal instruction-tuned model from Google's Gemma 3 family. It supports both text and image inputs, making it one of the most capable openly available vision-language models for local inference. The model handles conversational AI, visual question answering, image description, and complex reasoning tasks across modalities. Gemma 3 27B IT requires a GPU with at least 24GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. It uses a dense Transformer architecture with a large context window and benefits from Google's extensive pretraining pipeline. Released under the Gemma license.
Mistral 7B Instruct v0.1
Mistral AI · 7B
Mistral 7B Instruct v0.1 was the first instruction-tuned variant of the original Mistral 7B, fine-tuned for conversational and instruction-following tasks. While it has since been superseded by v0.2 and v0.3, it remains a solid lightweight chat model and an important milestone in the open-weight model ecosystem. Its hardware requirements are identical to the base Mistral 7B, running smoothly on GPUs with as little as 6 GB of VRAM when quantized. Users seeking the best Mistral 7B experience should generally prefer the newer v0.3 release, but v0.1 is still useful for reproducibility and benchmarking purposes.
GLM 5
zai-org · 753.9B
GLM 5 is Zhipu AI's flagship foundation model, a massive mixture-of-experts architecture with nearly 754 billion total parameters. It represents one of the largest open-weight models available, offering state-of-the-art performance across reasoning, coding, math, and multilingual tasks in both Chinese and English. Running GLM 5 locally requires enterprise-grade multi-GPU infrastructure, but for users with access to such hardware, it provides a locally-hosted alternative to the largest proprietary models.
TinyLlama 1.1B Chat v1.0
TinyLlama · 1.1B
TinyLlama 1.1B Chat is a 1.1-billion parameter chat model built on the Llama 2 architecture and trained on approximately 3 trillion tokens, an unusually large dataset for a model of its size. The TinyLlama project demonstrated that small models can achieve strong performance when given sufficient training compute, making it a standout in the sub-2B parameter class. The Chat variant is fine-tuned for conversational use and runs on virtually any modern GPU, including entry-level cards with 4GB of VRAM or less. It is a practical choice for lightweight local inference, edge deployment, and experimentation where hardware resources are limited.