All LLM Models

Browse 51 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
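The arithmetic behind these figures is simple: multiply the parameter count by the effective bytes per weight of the chosen format, then add headroom for the KV cache and activations. The Python sketch below uses approximate effective byte widths for common GGUF quantizations; exact values vary by model architecture and quantization implementation, so treat the outputs as rough estimates.

```python
# Rough, weights-only VRAM estimate. Real usage adds overhead for the
# KV cache, activations, and framework buffers (often 10-30% extra).
# Bytes-per-weight values are approximate effective averages for common
# GGUF quantization formats, not exact figures.
BYTES_PER_WEIGHT = {
    "FP16":   2.00,
    "Q8_0":   1.06,  # ~8.5 bits per weight effective
    "Q5_K_M": 0.68,
    "Q4_K_M": 0.57,  # ~4.5 bits per weight effective
}

def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Weights-only footprint in GiB for the given quantization."""
    total_bytes = params_billions * 1e9 * BYTES_PER_WEIGHT[quant]
    return total_bytes / 1024**3

for quant in BYTES_PER_WEIGHT:
    print(f"7B at {quant}: ~{estimate_vram_gb(7, quant):.1f} GB")
# FP16   -> ~13 GB of weights (plus overhead, hence the ~14 GB figure)
# Q4_K_M -> ~3.7 GB of weights (hence the ~4 GB figure)
```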

Model List

QwQ 32B

Alibaba · 32B

QwQ 32B is a 32-billion parameter reasoning-focused model from Alibaba Cloud's Qwen family. Unlike standard chat models, QwQ is specifically optimized for step-by-step logical reasoning, complex problem solving, and mathematical tasks. It employs extended chain-of-thought processing, generating detailed internal reasoning before producing final answers, which significantly improves accuracy on challenging analytical problems. The model requires a GPU with at least 24GB of VRAM for quantized inference and delivers reasoning performance competitive with much larger models. It is particularly well suited for users who need strong analytical capabilities for math, science, coding logic, and multi-step problem solving. Released under the Apache 2.0 license.

Chat · Reasoning

Qwen2.5 Coder 32B Instruct

Alibaba · 32.8B

Qwen2.5 Coder 32B Instruct is a 32.8-billion parameter code-specialized model from Alibaba Cloud, instruction-tuned for programming assistance and code generation. It is trained on a large corpus of source code alongside natural language data, making it highly capable for tasks such as code completion, debugging, code explanation, and software engineering dialogue. The model supports a 128K token context window and delivers code generation quality competitive with the best open-weight coding models at any scale. It requires a GPU with at least 24GB of VRAM for quantized inference. Released under the Apache 2.0 license.

Chat · Code

Qwen3 Coder 480B A35B Instruct

Alibaba · 480.2B

Qwen3 Coder 480B A35B Instruct is Alibaba's largest code-specialized model, a massive 480.2-billion-parameter mixture-of-experts system with roughly 35 billion parameters active per token. This is the most powerful open-weight coding model in the Qwen3 family, designed for professional-grade code generation, analysis, and software engineering tasks. Running this model locally is a serious undertaking that requires multi-GPU server-class hardware with several hundred gigabytes of combined VRAM. For users with access to such infrastructure, it offers exceptional code quality and understanding that rivals leading proprietary coding assistants, all while keeping data and computation entirely under local control.

Chat · Code

Qwen3 0.6B

Alibaba · 752M

Qwen3 0.6B is the smallest instruction-tuned model in Alibaba Cloud's Qwen 3 family, with approximately 752 million parameters. It is designed for ultra-lightweight deployment where minimal hardware resources are available, running comfortably on virtually any modern GPU and even on CPU-only setups. The model supports hybrid thinking mode despite its tiny footprint. While limited in reasoning depth compared to larger variants, Qwen3 0.6B handles basic chat, simple summarization, and lightweight instruction following. It is primarily useful for edge deployment, rapid prototyping, and experimentation where model size is a critical constraint. Released under the Apache 2.0 license.

Chat

Qwen2.5 7B Instruct

Alibaba · 7.6B

Qwen2.5 7B Instruct is a 7.6-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and is fine-tuned for conversational AI, instruction following, and general assistant tasks. Its efficient size makes it well-suited for local deployment on consumer GPUs with 8GB or more of VRAM. The model delivers strong performance for its parameter class across reasoning, multilingual understanding, and coding tasks. It benefits from the improved pretraining data and techniques of the Qwen 2.5 generation. Released under the Apache 2.0 license and widely supported by inference frameworks such as llama.cpp, vLLM, and Ollama; a minimal usage sketch follows this entry.

Chat
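As a quick illustration of that framework support, here is a minimal sketch using the `ollama` Python client against a locally running Ollama server. The model tag `qwen2.5:7b-instruct` is an assumption; check `ollama list` or the Ollama model library for the exact tag available in your install.

```python
# Minimal local chat with Qwen2.5 7B Instruct via Ollama.
# Assumes the Ollama server is running and the model has been pulled,
# e.g. with: ollama pull qwen2.5:7b-instruct  (tag may differ)
import ollama  # pip install ollama

response = ollama.chat(
    model="qwen2.5:7b-instruct",
    messages=[{"role": "user", "content": "Explain quantization in two sentences."}],
)
print(response["message"]["content"])
```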

Qwen3 Coder Next

Alibaba · 79.7B

Qwen3 Coder Next is a 79.7-billion parameter code-specialized instruction-tuned model from Alibaba Cloud, the next generation of the Qwen Coder series. It is trained extensively on source code and programming-related data, delivering strong performance across code generation, completion, debugging, refactoring, and software engineering dialogue. The model represents a significant step up in coding capability within the Qwen family. Due to its large parameter count, running Qwen3 Coder Next locally requires substantial VRAM, typically 48GB or more at reduced precision, placing it in the territory of professional GPUs or multi-GPU consumer setups. It is a top-tier choice for developers who need the most capable local coding assistant available. Released under the Apache 2.0 license.

Chat · Code

Qwen3 235B A22B

Alibaba · 235.1B

Qwen3 235B A22B is the largest model in Alibaba Cloud's Qwen 3 series, a Mixture of Experts (MoE) model with 235 billion total parameters and approximately 22 billion active parameters per forward pass. The MoE architecture enables it to deliver performance competitive with the best available open-weight models while requiring significantly less compute per token than a comparably sized dense model; the sketch after this entry illustrates the memory-versus-compute tradeoff. It supports hybrid thinking mode for flexible chain-of-thought reasoning. Due to its massive total parameter count, running Qwen3 235B A22B locally requires substantial VRAM to load all expert weights, typically needing multiple high-end professional GPUs even at reduced precision. In heavily quantized formats it becomes accessible on workstation-class multi-GPU setups. Released under the Apache 2.0 license.

Chat
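The sketch below makes the MoE tradeoff concrete under simple assumptions: weight memory scales with total parameters, while per-token compute (roughly two FLOPs per active parameter in a forward pass) scales with active parameters. The byte width assumes a ~4.5-bit quantization as in the earlier estimate; the numbers are illustrative, not measured.

```python
# MoE vs dense: weight memory tracks TOTAL parameters, while per-token
# compute tracks ACTIVE parameters (~2 FLOPs per active parameter).
def profile(total_b: float, active_b: float, bytes_per_weight: float = 0.57):
    weights_gib = total_b * 1e9 * bytes_per_weight / 1024**3
    gflops_per_token = 2.0 * active_b  # active_b given in billions
    return weights_gib, gflops_per_token

for name, total, active in [
    ("Qwen3 235B A22B (MoE)",   235.0,  22.0),
    ("Hypothetical 235B dense", 235.0, 235.0),
]:
    mem, flops = profile(total, active)
    print(f"{name}: ~{mem:.0f} GiB weights (Q4), ~{flops:.0f} GFLOPs/token")
# Same weight footprint, but the MoE does ~10x less compute per token.
```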

Qwen3 8B

Alibaba · 8.2B

Qwen3 8B is an 8.2-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 series. It is a general-purpose chat model that delivers strong performance across reasoning, multilingual understanding, and coding tasks while remaining efficient enough to run on consumer GPUs with 8GB or more of VRAM. Like other Qwen 3 models, it supports hybrid thinking mode for flexible reasoning depth. The model benefits from the improved pretraining data and training methodology of the Qwen 3 generation, offering notable quality gains over Qwen 2.5 at the same parameter count. It is widely supported by inference frameworks including llama.cpp, vLLM, and Ollama. Released under the Apache 2.0 license.

Chat

Qwen3 Coder 30B A3B Instruct

Alibaba · 30B

Qwen3 Coder 30B A3B Instruct is a code-specialized Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 Coder series, with 30 billion total parameters and approximately 3 billion active parameters per forward pass. The MoE architecture allows it to deliver strong coding performance while keeping per-token compute costs low, making it faster at inference than comparably capable dense models. The model is instruction-tuned for programming assistance, code generation, debugging, and software engineering conversation. It requires VRAM proportional to its total 30B parameter count for loading weights, but benefits from efficient inference throughput due to its low active parameter count. Released under the Apache 2.0 license.

Chat · Code

Qwen3 Next 80B A3B Instruct

Alibaba · 81.3B

Qwen3 Next 80B A3B Instruct is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with approximately 81.3 billion total parameters and around 3 billion active parameters per forward pass. This extreme ratio between total and active parameters allows the model to encode extensive knowledge across its expert layers while maintaining very fast per-token inference, making it an unusually efficient design for its capability level. The model is instruction-tuned for general-purpose chat and requires VRAM proportional to its full 80B parameter count for weight loading, typically needing high-VRAM GPUs or quantized multi-GPU setups. Its low active parameter count results in fast generation speeds despite the large total model size. Released under the Apache 2.0 license.

Chat

Qwen2.5 72B Instruct

Alibaba · 72.7B

Qwen2.5 72B Instruct is the flagship model of the Qwen 2.5 series from Alibaba Cloud, with 72.7 billion parameters. It is instruction-tuned for conversational use and excels across reasoning, coding, mathematics, and multilingual tasks. Qwen2.5 72B delivers performance competitive with leading open-weight 70B-class models while supporting a 128K token context window and structured output generation. The model uses a Transformer architecture with grouped-query attention and was pretrained on a diverse multilingual corpus of over 18 trillion tokens. Running it locally requires high-VRAM GPUs or multi-GPU setups, though quantized formats make it accessible on workstation-class hardware. Released under the Apache 2.0 license.

Chat

Qwen3 30B A3B

Alibaba · 30B

Qwen3 30B A3B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with 30 billion total parameters and approximately 3 billion active parameters per forward pass. The MoE architecture delivers quality significantly above what a standard 3B dense model could achieve, while keeping per-token compute costs low. It supports hybrid thinking mode for flexible reasoning. The model requires VRAM proportional to its full 30B parameter count for weight loading, but its low active parameter count results in fast inference throughput. It is an efficient option for users who want quality beyond dense small models without the full cost of larger architectures. Released under the Apache 2.0 license.

Chat

Qwen3 30B A3B Instruct 2507

Alibaba · 30B

Qwen3 30B A3B Instruct 2507 is a July 2025 updated mixture-of-experts model from Alibaba with 30 billion total parameters but only around 3 billion active during inference. The MoE architecture gives it a remarkably small compute footprint relative to its total parameter count, letting users run a model with broad knowledge on mid-range hardware. The 2507 instruct refresh improves alignment and instruction-following quality over the original release. All 30 billion weights must still be loaded, but because only a fraction are active for any given token, inactive experts can be offloaded to system RAM with a modest speed penalty; in practice, quantized builds can often run with a single consumer GPU holding 8 GB or more of VRAM, making this an excellent choice for users who want strong chat performance without heavyweight hardware.

Chat

Qwen3 4B Instruct 2507

Alibaba · 4B

Qwen3 4B Instruct 2507 is a July 2025 refresh of Alibaba's compact 4-billion-parameter chat model from the Qwen3 family. This updated release brings improved instruction following and conversational quality while remaining lightweight enough to run on most modern GPUs and even some higher-end integrated graphics setups. With its modest size, the 4B Instruct 2507 strikes a practical balance between capability and resource efficiency. It is well suited for everyday chat, summarization, and light assistant tasks on consumer hardware, making it one of the more accessible entry points into the Qwen3 lineup.

Chat

Qwen3 235B A22B Instruct 2507

Alibaba · 235B

Qwen3 235B A22B Instruct 2507 is Alibaba's flagship instruction-tuned model from the July 2025 update, featuring 235 billion total parameters with approximately 22 billion active during inference. As the largest instruct model in the Qwen3 lineup, it delivers top-tier conversational quality, knowledge depth, and instruction following. Despite its massive total parameter count, the MoE architecture keeps active compute manageable. Running this model locally still requires substantial hardware, typically multi-GPU setups with 48 GB or more of total VRAM, but the 2507 refresh makes it one of the most capable open-weight models available for users with high-end local infrastructure.

Chat

Qwen3 32B

Alibaba · 32B

Qwen3 32B is the flagship dense model in Alibaba Cloud's Qwen 3 series, with 32 billion parameters. It is instruction-tuned for chat and delivers strong performance across reasoning, coding, mathematics, and multilingual tasks. Qwen3 32B supports a hybrid thinking mode that allows the model to engage in extended chain-of-thought reasoning or respond quickly depending on the task, giving users flexibility between depth and speed; a sketch of toggling this mode follows this entry. The model requires a GPU with at least 24GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. It represents a significant generational improvement over Qwen 2.5 in both instruction following and knowledge breadth. Released under the Apache 2.0 license.

Chat
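Qwen's model cards document an `enable_thinking` switch in the Qwen3 chat template for toggling this hybrid behavior. The sketch below shows the idea with Hugging Face transformers; flag behavior can vary across template versions, so treat it as illustrative rather than definitive.

```python
# Toggling Qwen3's hybrid thinking mode via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 30?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for fast replies without reasoning traces
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```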

Qwen2.5 Coder 7B Instruct

Alibaba · 7.6B

Qwen2.5 Coder 7B Instruct is a 7.6-billion parameter code-specialized instruction-tuned model from Alibaba Cloud. It is trained on a large corpus of source code and natural language, fine-tuned for programming assistance tasks such as code generation, completion, debugging, and code explanation. The model supports a 128K token context window and runs efficiently on consumer GPUs with 8GB or more of VRAM. It provides a good balance between coding capability and hardware requirements for developers looking to run a local coding assistant. Released under the Apache 2.0 license.

Chat · Code

Qwen2.5 1.5B Instruct

Alibaba · 1.5B

Qwen2.5 1.5B Instruct is a 1.5-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It is a lightweight model suitable for deployment on minimal hardware, including low-VRAM GPUs and even CPU-only setups with acceptable latency. It supports a 128K token context window. The model handles basic conversational tasks, simple question answering, and text generation. While limited in reasoning depth compared to larger variants, it is useful for applications where fast response times and minimal resource consumption are priorities. Released under the Apache 2.0 license.

Chat

Qwen3 4B

Alibaba · 4B

Qwen3 4B is a compact 4-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 family. It is designed for efficient local inference on consumer hardware, supporting chat and general assistant tasks while fitting comfortably on GPUs with 6GB or more of VRAM in quantized formats. The model supports hybrid thinking mode, allowing it to balance reasoning depth and response speed. Despite its small footprint, Qwen3 4B delivers quality competitive with larger models from previous generations, making it a practical choice for lightweight local deployments and resource-constrained environments. Released under the Apache 2.0 license.

Chat

Qwen3 4B Thinking 2507

Alibaba · 4B

Qwen3 4B Thinking 2507 is the reasoning-optimized variant of Alibaba's compact 4-billion-parameter Qwen3 model, released in the July 2025 update cycle. Despite its small size, this thinking variant is tuned to produce chain-of-thought reasoning and step-by-step problem solving, making it a surprisingly capable lightweight reasoner. This model is ideal for users who want basic reasoning and analytical capabilities on very modest hardware. It can run on most consumer GPUs and even some CPU-only setups when quantized, providing an accessible entry point for experimenting with reasoning-style models without any significant hardware investment.

Chat

Qwen2.5 0.5B Instruct

Alibaba · 494M

Qwen2.5 0.5B Instruct is the smallest instruction-tuned model in Alibaba Cloud's Qwen 2.5 family, with just 494 million parameters. It is designed for ultra-lightweight deployment scenarios where minimal hardware resources are available, running comfortably on virtually any modern GPU or even CPU-only configurations. Despite its tiny footprint, the model supports a 128K token context window and can handle basic chat, simple summarization, and lightweight instruction following. It is primarily useful for edge deployment, experimentation, and prototyping where model size is a critical constraint; a minimal CPU-only usage sketch follows this entry. Released under the Apache 2.0 license.

Chat
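For a model this small, CPU-only inference is practical. Below is a minimal sketch with llama-cpp-python; the GGUF filename is a placeholder for whatever quantized build you have downloaded.

```python
# CPU-only chat with a small quantized model via llama-cpp-python.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="qwen2.5-0.5b-instruct-q4_k_m.gguf",  # placeholder filename
    n_ctx=4096,      # context window; the full 128K would need far more RAM
    n_gpu_layers=0,  # 0 = run entirely on the CPU
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give a one-line summary of MoE models."}]
)
print(out["choices"][0]["message"]["content"])
```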

Qwen3 1.7B

Alibaba · 1.7B

Qwen3 1.7B is a 1.7-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 series. It is a lightweight model designed for deployment on minimal hardware, including low-VRAM GPUs and even CPU-only configurations with acceptable latency. Despite its compact size, it supports hybrid thinking mode and handles basic conversational tasks, simple question answering, and text generation. The model is useful for edge deployment, embedded applications, and scenarios where fast inference with minimal resource consumption is the priority. It represents a significant quality improvement over Qwen 2.5 at the sub-2B scale. Released under the Apache 2.0 license.

Chat

Qwen2.5 3B Instruct

Alibaba · 3.1B

Qwen2.5 3B Instruct is a 3.1-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 family. It is designed for efficient local inference on consumer hardware, supporting a 128K token context window despite its compact footprint. The model can run on GPUs with as little as 4GB of VRAM when quantized. Despite its small size, Qwen2.5 3B Instruct delivers competitive performance for basic conversational tasks, summarization, and simple instruction following. It is a good option for edge deployment and resource-constrained environments. Released under the Apache 2.0 license.

Chat

Qwen3 235B A22B Thinking 2507

Alibaba · 235B

Qwen3 235B A22B Thinking 2507 is the reasoning and chain-of-thought variant of Alibaba's largest Qwen3 mixture-of-experts model, updated in July 2025. With 235 billion total parameters and about 22 billion active per forward pass, it represents the pinnacle of Qwen3's reasoning capabilities. This model excels at complex multi-step problems, mathematical reasoning, code analysis, and tasks requiring deep logical thinking. It demands serious hardware to run locally, but for users with multi-GPU setups, it offers reasoning performance that rivals the best proprietary models while keeping all computation on local hardware.

Chat

Qwen2.5 0.5B

Alibaba · 494M

Qwen2.5 0.5B is the smallest base (pretrained) model in Alibaba Cloud's Qwen 2.5 family, with 494 million parameters. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a foundation for custom applications. It supports a 128K token context window. Its minimal size makes it suitable for experimentation, rapid prototyping, and resource-constrained fine-tuning tasks. The model can run on virtually any hardware. Released under the Apache 2.0 license.

Chat

Qwen3 14B

Alibaba · 14B

Qwen3 14B is a 14-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 series. It occupies a practical middle ground in the Qwen 3 lineup, offering stronger reasoning and generation quality than the 8B variant while remaining manageable on GPUs with 16GB or more of VRAM in quantized formats. The model supports hybrid thinking mode for flexible reasoning depth. Qwen3 14B is well suited for chat, instruction following, coding assistance, and multilingual tasks. It benefits from the generational improvements of Qwen 3 in pretraining data and alignment techniques, delivering performance that competes with larger models from previous generations. Released under the Apache 2.0 license.

Chat

Qwen3 30B A3B Thinking 2507

Alibaba · 30B

Qwen3 30B A3B Thinking 2507 is the reasoning-focused variant of Alibaba's 30-billion-parameter mixture-of-experts model, updated in July 2025. Like its instruct sibling, it activates only about 3 billion parameters per token, keeping resource demands low while enabling multi-step reasoning and chain-of-thought problem solving. This thinking variant is designed for tasks that benefit from deliberate, step-by-step logic such as math, coding puzzles, and analytical questions. Its efficient MoE design means users with modest GPUs can still access strong reasoning capabilities without needing datacenter-class hardware.

Chat

Qwen2.5 32B Instruct

Alibaba · 32B

Qwen2.5 32B Instruct is a 32-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 family. It occupies a practical sweet spot between the 14B and 72B variants, offering strong reasoning and multilingual capabilities while remaining feasible to run on a single high-end consumer GPU with 24GB or more of VRAM at reduced precision. The model supports a 128K token context window and is optimized for conversational use, instruction following, and structured output generation. It is a popular choice for local inference when the 72B model is too demanding but users need more capability than the 14B variant. Released under the Apache 2.0 license.

Chat

Qwen2.5 14B Instruct

Alibaba · 14B

Qwen2.5 14B Instruct is a 14-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and provides a balanced tradeoff between quality and hardware requirements, running well on GPUs with 16GB of VRAM in quantized formats. The model is fine-tuned for chat, instruction following, and general-purpose assistant tasks. It performs well across reasoning, coding, and multilingual benchmarks for its size class, making it a practical option for local deployment when larger models are not feasible. Released under the Apache 2.0 license.

Chat

Qwen1.5 MoE A2.7B

Alibaba · 14.3B

Qwen1.5 MoE A2.7B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 1.5 generation, with 14.3 billion total parameters but only 2.7 billion active parameters per forward pass. The MoE architecture allows it to deliver performance closer to dense 7B models while requiring less compute during inference, as only a subset of experts is activated for each token. The model supports a 32K token context window and requires VRAM proportional to its total parameter count for loading, despite the lower compute cost per token. It is an interesting architectural variant for users exploring efficient inference and MoE models locally. Released under a custom Qwen license.

Chat