All LLM Models

Browse 80 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
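The arithmetic behind those figures can be sketched in a few lines. This is a rough illustration, not the method this site uses: the function name and the table are hypothetical, and the bits-per-weight values for Q8_0 and Q4_K_M are approximate averages assumed for the example. It counts only the model weights, so the KV cache and runtime overhead come on top.

```python
# Approximate bits stored per weight for a few formats.
# FP16 is exact; the Q8_0 and Q4_K_M values are assumed averages
# used for illustration only.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the model weights in GB (10^9 bytes).

    Covers weights only; context (KV cache) and runtime overhead are extra.
    """
    bits = BITS_PER_WEIGHT[quant]
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model comes out at about 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the example above.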

Model List

Qwen2.5 72B Instruct

Alibaba · 72.7B · runs from 21.0 GB

455.1K 951

Qwen2.5 72B Instruct is the flagship model of the Qwen 2.5 series from Alibaba Cloud, with 72.7 billion parameters. It is instruction-tuned for conversational use and excels across reasoning, coding, mathematics, and multilingual tasks. Qwen2.5 72B delivers performance competitive with leading open-weight 70B-class models while supporting a 128K token context window and structured output generation. The model uses a Transformer architecture with grouped-query attention and was pretrained on a diverse multilingual corpus of over 18 trillion tokens. Running it locally requires high-VRAM GPUs or multi-GPU setups, though quantized formats make it accessible on workstation-class hardware. Released under the Apache 2.0 license.

Chat

Qwen2.5 Coder 32B

Alibaba · 32.8B · runs from 9.8 GB

3.6K 156

Qwen2.5 Coder 32B is a 32.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Qwen3 235B A22B

Alibaba · 235.1B · runs from 100.4 GB

539.2K 1.1K

Qwen3 235B A22B is the largest model in Alibaba Cloud's Qwen 3 series, a Mixture of Experts (MoE) model with 235 billion total parameters and approximately 22 billion active parameters per forward pass. The MoE architecture enables it to deliver performance competitive with the best available open-weight models while requiring significantly less compute per token than a comparably sized dense model. It supports hybrid thinking mode for flexible chain-of-thought reasoning. Due to its massive total parameter count, running Qwen3 235B A22B locally requires substantial VRAM to load all expert weights, typically needing multiple high-end professional GPUs even at reduced precision. In heavily quantized formats it becomes accessible on workstation-class multi-GPU setups. Released under the Apache 2.0 license.

Chat
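The Qwen3 235B A22B entry above separates total parameters, which drive VRAM, from active parameters, which drive compute per token. A minimal sketch of that split follows. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token and ignores attention and KV-cache costs; the function name is hypothetical.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> tuple[float, float]:
    """Return (weight memory in GB, approximate GFLOPs per token).

    Memory scales with the total parameter count because every expert
    must be loaded; compute scales with the active parameter count.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    # Rule-of-thumb assumption: ~2 FLOPs per active parameter per token.
    gflops_per_token = 2 * active_params_b
    return weights_gb, gflops_per_token


if __name__ == "__main__":
    moe = moe_footprint(235, 22, 16)     # MoE: 235B total, 22B active
    dense = moe_footprint(235, 235, 16)  # dense model of the same total size
    print(f"MoE:   {moe[0]:.0f} GB weights, ~{moe[1]:.0f} GFLOPs/token")
    print(f"Dense: {dense[0]:.0f} GB weights, ~{dense[1]:.0f} GFLOPs/token")
```

Both configurations need the same memory for weights at a given precision, but the MoE model does roughly a tenth of the per-token compute, which is why it can be fast to run yet still demanding to load.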

Qwen2.5 Coder 7B

Alibaba · 7.6B · runs from 3.6 GB

205.3K 139

Qwen2.5 Coder 7B is a 7.6-billion parameter code-specialized base (pretrained) model from Alibaba Cloud's Qwen 2.5 Coder series. It is trained on a large dataset of source code and natural language but is not instruction-tuned, making it suitable for fine-tuning, code-related research, and custom downstream applications. The model supports a 128K token context window and runs efficiently on consumer GPUs. It serves as the foundation for the Qwen2.5 Coder 7B Instruct variant and community fine-tunes targeting specific programming languages or workflows. Released under the Apache 2.0 license.

Chat · Code

Qwen3 30B A3B Thinking 2507

Alibaba · 30.5B · runs from 8.8 GB

138.1K 379

Qwen3 30B A3B Thinking 2507 is the reasoning-focused variant of Alibaba's 30-billion-parameter mixture-of-experts model, updated in July 2025. Like its instruct sibling, it activates only about 3 billion parameters per token, keeping resource demands low while enabling multi-step reasoning and chain-of-thought problem solving. This thinking variant is designed for tasks that benefit from deliberate, step-by-step logic such as math, coding puzzles, and analytical questions. Its efficient MoE design means users with modest GPUs can still access strong reasoning capabilities without needing datacenter-class hardware.

Chat

Qwen3 Coder 480B A35B Instruct

Alibaba · 480.2B · runs from 144.6 GB

41.6K 1.3K

Qwen3 Coder 480B A35B Instruct is Alibaba's largest code-specialized model, a 480.2-billion-parameter mixture-of-experts system with roughly 35 billion parameters active per token. It is the most powerful open-weight coding model in the Qwen3 family, designed for professional-grade code generation, analysis, and software engineering tasks. Running it locally is a serious undertaking that requires multi-GPU server-class hardware: about 145 GB of combined VRAM at the smallest listed quantization, and several hundred gigabytes at higher precision. For users with access to such infrastructure, it offers code quality and understanding that rival leading proprietary coding assistants, while keeping data and computation entirely under local control.

Chat · Code

Qwen2.5 Coder 3B

Alibaba · 3.1B · runs from 1.4 GB

717.7K 51

Qwen2.5 Coder 3B is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Qwen3 235B A22B Instruct 2507

Alibaba · 235.1B · runs from 71.0 GB

123.5K 784

Qwen3 235B A22B Instruct 2507 is Alibaba's flagship instruction-tuned model from the July 2025 update, featuring 235 billion total parameters with approximately 22 billion active during inference. As the largest instruct model in the Qwen3 lineup, it delivers top-tier conversational quality, knowledge depth, and instruction following. Despite its massive total parameter count, the MoE architecture keeps active compute manageable. Running this model locally still requires substantial hardware, typically multi-GPU setups with roughly 71 GB or more of total VRAM even at the smallest listed quantization, but the 2507 refresh makes it one of the most capable open-weight models available for users with high-end local infrastructure.

Chat

Qwen2.5 Coder 1.5B

Alibaba · 1.5B · runs from 1 GB

584.8K 85

Qwen2.5 Coder 1.5B is a 1.5-billion parameter code-specialized model from Alibaba Cloud's Qwen 2.5 Coder series. It is among the smallest Coder variants and balances meaningful code generation capability with very low resource requirements, running on GPUs with as little as 2 to 4 GB of VRAM. The model is suitable for lightweight code completion, simple code generation tasks, and use as a compact local coding assistant in resource-constrained environments. It supports a 128K token context window. Released under the Apache 2.0 license.

Chat · Code

Qwen2.5 3B

Alibaba · 3.1B · runs from 1.6 GB

499.8K 190

Qwen2.5 3B is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 0.6B Base

Alibaba · 596M · runs from 0.7 GB

478.3K 174

Qwen3 0.6B Base is the smallest pretrained foundation model in Alibaba Cloud's Qwen 3 family, with approximately 600 million parameters. As a base model, it is not tuned for chat or instructions and is intended for fine-tuning, research, and experimentation. Its minimal size makes it suitable for rapid prototyping and resource-constrained training experiments. The model runs on virtually any hardware, including CPU-only setups. It is useful for educational purposes, architecture exploration, and as a compact foundation for task-specific fine-tuning where model size is a primary constraint. Released under the Apache 2.0 license.

Chat

Qwen3 1.7B Base

Alibaba · 1.7B · runs from 1.0 GB

336.3K 65

Qwen3 1.7B Base is a 1.7-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 family. It is a compact base model designed for fine-tuning, research, and custom applications rather than direct conversational use. Its small size makes it accessible for resource-constrained fine-tuning and rapid experimentation. The model can run on virtually any modern GPU and benefits from the improved pretraining data of the Qwen 3 generation. It is suitable as a lightweight foundation for domain-specific fine-tunes and student models in distillation pipelines. Released under the Apache 2.0 license.

Chat

Qwen2.5 7B

Alibaba · 7.6B · runs from 3.6 GB

802.3K 291

Qwen2.5 7B is a 7.6B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 72B

Alibaba · 72.7B · runs from 31.0 GB

30.8K 99

Qwen2.5 72B is a 72.7B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 14B

Alibaba · 14.8B · runs from 6.8 GB

60.9K 154

Qwen2.5 14B is a 14.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 4B Base

Alibaba · 4.0B · runs from 2.2 GB

758.6K 95

Qwen3 4B Base is a 4.0B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 32B

Alibaba · 32.8B · runs from 14.3 GB

65.7K 178

Qwen2.5 32B is a 32.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 1.5B

Alibaba · 1.5B · runs from 1 GB

1.2M 187

Qwen2.5 1.5B is a 1.5B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 0.5B Chat

Alibaba · 620M · runs from 0.8 GB

85.9K 95

Qwen1.5 0.5B Chat is an early-generation small language model from Alibaba's Qwen series with just 620 million parameters. As one of the smallest models in the Qwen family, it was designed to demonstrate that useful conversational ability is possible even at sub-billion parameter scales. This model runs easily on virtually any hardware including CPUs, older GPUs, and even mobile devices. While its capabilities are limited compared to larger Qwen models, it remains a useful option for embedded applications, rapid prototyping, or situations where minimal resource consumption is the top priority.

Chat

Qwen2.5 0.5B

Alibaba · 494M · runs from 0.5 GB

2.0M 421

Qwen2.5 0.5B is the smallest base (pretrained) model in Alibaba Cloud's Qwen 2.5 family, with 494 million parameters. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a foundation for custom applications. It supports a 128K token context window. Its minimal size makes it suitable for experimentation, rapid prototyping, and resource-constrained fine-tuning tasks. The model can run on virtually any hardware. Released under the Apache 2.0 license.

Chat

Qwen3 8B Base

Alibaba · 8.2B · runs from 4.1 GB

453.7K 107

Qwen3 8B Base is an 8.2-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 series. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a starting point for custom downstream applications. It was trained on a large multilingual corpus with improved data quality and training methodology compared to the Qwen 2.5 generation. The model runs efficiently on consumer GPUs with 8GB or more of VRAM and serves as the foundation for the Qwen3 8B instruction-tuned variant and community fine-tunes. It is a strong choice for practitioners building specialized models through further training. Released under the Apache 2.0 license.

Chat

Qwen3Guard Gen 8B

Alibaba · 8.2B · runs from 4.1 GB

70.1K 114

Qwen3Guard Gen 8B is an 8.2B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 14B Chat

Alibaba · 14.2B · runs from 8 GB

10.9K 112

Qwen1.5 14B Chat is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2 72B Instruct

Alibaba · 72.7B · runs from 21.0 GB

20.5K 717

Qwen2 72B Instruct is a 72.7B-parameter open language model from Alibaba in the Qwen 2 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 7B Chat

Alibaba · 7.7B · runs from 4.7 GB

12.5K 186

Qwen1.5 7B Chat is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 32B Chat

Alibaba · 32.5B · runs from 14.3 GB

9.6K 109

Qwen1.5 32B Chat is a 32.5B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen 14B Chat

Alibaba · 14.2B · runs from 6.6 GB

1.7K 373

Qwen 14B Chat is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 7B

Alibaba · 7.7B · runs from 4.7 GB

133.4K 56

Qwen1.5 7B is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen 7B

Alibaba · 7.7B · runs from 3.6 GB

17.3K 399

Qwen 7B is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 72B Chat

Alibaba · 72.3B · runs from 35.5 GB

9.2K 217

Qwen1.5 72B Chat is a 72.3B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat