All LLM Models

Browse 51 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
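
As a rough way to reproduce these numbers, here is a minimal Python sketch of the standard rule of thumb (weights ≈ parameters × bits per weight ÷ 8). The bits-per-weight values are approximate averages for each format, and real usage adds KV-cache and runtime overhead on top:

```python
# Approximate average bits per weight for common formats.
# Quantized GGUF formats carry per-block scale factors, so these are averages, not exact.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def weight_vram_gb(params_billions: float, fmt: str) -> float:
    """VRAM needed for the weights alone; budget another ~1-2 GB for KV cache and buffers."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8

for fmt in BITS_PER_WEIGHT:
    print(f"7B @ {fmt}: ~{weight_vram_gb(7, fmt):.1f} GB")
# 7B @ FP16: ~14.0 GB, 7B @ Q4_K_M: ~4.2 GB -- matching the figures above
```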

Model List

Qwen2.5 Coder 7B Instruct GGUF

Alibaba · 7B

99.3K 206

Qwen2.5 Coder 7B Instruct is a code-focused model from Alibaba's Qwen team, provided in official GGUF format for straightforward local use. At 7 billion parameters it offers solid code generation, completion, and explanation capabilities while remaining runnable on a single consumer GPU with 8 GB or more of VRAM. This model is a practical choice for developers who want a local coding assistant without the hardware demands of larger models. It handles Python, JavaScript, TypeScript, SQL, and many other languages competently, and its instruction tuning makes it responsive to natural-language prompts describing the code you need.

Chat · Code
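
To illustrate the "straightforward local use" claim, below is a minimal llama-cpp-python sketch of running a quantized build of this model. The GGUF file name is an assumption; point it at whichever quantization you actually downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path is an assumption: substitute the GGUF file you downloaded.
llm = Llama(
    model_path="qwen2.5-coder-7b-instruct-q4_k_m.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU; lower this on cards under 8 GB
    n_ctx=8192,       # context window; larger values use more VRAM for the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Write a Python function that parses an ISO 8601 date."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```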

Qwen2.5 Coder 32B Instruct GGUF

Alibaba · 32B

165.9K 189

Qwen2.5 Coder 32B Instruct is the flagship code-specialized model in Alibaba's Qwen2.5 lineup, released in official GGUF format. With 32 billion parameters trained heavily on programming data, it delivers strong performance across code generation, refactoring, debugging, and technical explanation, rivaling much larger proprietary coding assistants on many benchmarks. Running the 32B model locally requires a higher-end setup, typically 24 GB or more of VRAM at moderate quantization levels, but the payoff is a highly capable offline coding companion with no API costs or data-privacy concerns. Lower quantizations can bring it within reach of 16 GB cards with some quality trade-off.

Chat · Code

Qwen3 Coder 30B A3B Instruct FP8

Alibaba · 30.5B

314.4K 168

Qwen3 Coder 30B A3B Instruct FP8 is a code-focused mixture-of-experts model from Alibaba with 30.5 billion total parameters and roughly 3 billion active per token, served in FP8 precision. The combination of MoE routing and FP8 weights makes this an efficient coding assistant that punches well above its active-parameter weight class. Designed for code generation, completion, review, and technical conversation, it benefits from specialized coding training on top of the Qwen3 MoE architecture. Note that the low active parameter count reduces per-token compute rather than weight memory: all 30.5 billion parameters must be resident, roughly 30 GB at FP8, so plan on a high-memory GPU or partial offloading of experts to system RAM (see the sketch below).

Chat · Code
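
The sketch below makes the total-versus-active distinction concrete with back-of-the-envelope numbers: weight memory scales with total parameters, while per-token compute scales with active parameters (using the common ~2 FLOPs-per-parameter approximation):

```python
def weight_gb(total_params_b: float, bytes_per_weight: float) -> float:
    """Resident weight memory: every expert must be loaded, so use TOTAL params."""
    return total_params_b * bytes_per_weight

def gflops_per_token(active_params_b: float) -> float:
    """Per-token compute: only the routed experts run, so use ACTIVE params.
    Params are given in billions, so ~2 FLOPs/param yields a result in GFLOPs."""
    return 2 * active_params_b

total_b, active_b = 30.5, 3.0
print(f"weights @ FP8 (1 byte/weight): ~{weight_gb(total_b, 1.0):.0f} GB resident")
print(f"compute per token: ~{gflops_per_token(active_b):.0f} GFLOPs (like a 3B dense model)")
```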

Qwen3 8B GGUF

Alibaba · 8B

74.3K 161

Qwen3 8B GGUF is the official GGUF-format release of Alibaba's 8-billion-parameter Qwen3 model. The GGUF format is optimized for llama.cpp and compatible inference engines, making this one of the easiest Qwen3 models to get running locally with tools like Ollama, LM Studio, or Jan. At 8 billion parameters, this model offers a solid middle ground in the Qwen3 lineup, delivering capable chat and general-purpose performance while remaining runnable on most consumer GPUs with 6 GB or more of VRAM. The GGUF packaging supports flexible quantization levels, letting users choose their own quality-versus-memory tradeoff.

Chat
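
Since GGUF ships the same model at many quantization levels, the choice reduces to "the largest quant whose weights fit my card, with room left for the KV cache." A toy Python sketch of that decision, reusing approximate bits-per-weight averages:

```python
# Approximate average bits per weight per GGUF quantization level.
QUANTS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def best_fit(params_b: float, vram_gb: float, reserve_gb: float = 1.0) -> str | None:
    """Highest-quality quant whose weights fit, keeping reserve_gb for the KV cache."""
    budget = vram_gb - reserve_gb
    for name, bits in sorted(QUANTS.items(), key=lambda kv: -kv[1]):
        if params_b * bits / 8 <= budget:
            return name
    return None  # nothing fits entirely on-GPU; offload some layers to CPU

print(best_fit(8, 6))   # 8B on a 6 GB card  -> Q4_K_M (~4.8 GB of weights)
print(best_fit(8, 12))  # 8B on a 12 GB card -> Q8_0   (~8.5 GB of weights)
```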

Qwen3 0.6B Base

Alibaba · 0.6B

222.8K 155

Qwen3 0.6B Base is the smallest pretrained foundation model in Alibaba Cloud's Qwen 3 family, with approximately 600 million parameters. As a base model, it is not tuned for chat or instructions and is intended for fine-tuning, research, and experimentation. Its minimal size makes it suitable for rapid prototyping and resource-constrained training experiments. The model runs on virtually any hardware, including CPU-only setups. It is useful for educational purposes, architecture exploration, and as a compact foundation for task-specific fine-tuning where model size is a primary constraint. Released under the Apache 2.0 license.

Chat
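
Since the base models are positioned for fine-tuning, here is a minimal causal-LM fine-tuning sketch with Hugging Face transformers. The repo id Qwen/Qwen3-0.6B-Base and the two-line toy corpus are assumptions for illustration; a real run needs a proper dataset and hyperparameter tuning:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen3-0.6B-Base"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # padding token needed for batched training
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy corpus: replace with your domain data.
texts = ["def add(a, b):\n    return a + b", "SELECT name FROM users;"]
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tok(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen3-0.6b-ft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```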

Qwen2.5 Coder 7B

Alibaba · 7.6B

205.3K 139

Qwen2.5 Coder 7B is a 7.6-billion parameter code-specialized base (pretrained) model from Alibaba Cloud's Qwen 2.5 Coder series. It is trained on a large dataset of source code and natural language but is not instruction-tuned, making it suitable for fine-tuning, code-related research, and custom downstream applications. The model supports a 128K token context window and runs efficiently on consumer GPUs. It serves as the foundation for the Qwen2.5 Coder 7B Instruct variant and community fine-tunes targeting specific programming languages or workflows. Released under the Apache 2.0 license.

Chat · Code

Qwen2.5 7B Instruct GGUF

Alibaba · 7B

50.7K 131

Qwen2.5 7B Instruct is Alibaba's general-purpose 7-billion-parameter model in official GGUF format, instruction-tuned for conversational and task-oriented use. It represents the most popular size class in the Qwen2.5 family, offering a well-rounded mix of reasoning ability, factual knowledge, and multilingual support that works on widely available consumer hardware. With quantized GGUF variants, the 7B model fits comfortably on GPUs with 8 GB of VRAM and runs at interactive speeds. It is a versatile workhorse for local AI, capable of drafting content, answering questions, extracting information, and holding coherent multi-turn conversations.

Chat

Qwen3 32B AWQ

Alibaba · 32.8B

666.0K 130

Qwen3 32B AWQ is an AWQ-quantized version of Alibaba's 32.8-billion-parameter Qwen3 dense model. AWQ (Activation-aware Weight Quantization) reduces the model's memory footprint significantly while preserving most of the original quality, making this large model much more accessible on consumer GPUs with 16 to 24 GB of VRAM. For users who want the full dense 32B Qwen3 experience but lack the VRAM to run it at full precision, the AWQ variant is an excellent compromise. It retains strong general-purpose capabilities across chat, reasoning, and creative tasks while fitting into a fraction of the memory that the unquantized model would require.

Chat
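
For reference, an AWQ checkpoint loads through the ordinary transformers API once the autoawq package is installed; the quantization config travels with the checkpoint. A minimal sketch, where the repo id is assumed to be Alibaba's official AWQ release:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers autoawq accelerate

model_id = "Qwen/Qwen3-32B-AWQ"  # assumed repo id for the official AWQ release
tok = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" (via accelerate) spreads layers across available GPU memory.
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.float16,
                                             device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize what AWQ quantization does."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=200)[0],
                 skip_special_tokens=True))
```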

Qwen3 Coder Next FP8

Alibaba · 79.7B

589.7K 113

Qwen3 Coder Next FP8 is Alibaba's 79.7-billion-parameter code-specialized model served in FP8 precision. As the next-generation coding model in the Qwen3 family, it is trained and tuned specifically for software engineering tasks including code generation, debugging, refactoring, and technical explanation. Even with FP8 halving the footprint relative to BF16, the weights alone occupy roughly 80 GB, placing the model beyond any single consumer GPU. Users with multi-GPU setups or workstation-class accelerators will find it delivers strong code completion and generation quality that competes with much larger models.

Chat · Code

Qwen2 1.5B

Alibaba · 1.5B

108.4K 100

Qwen2 1.5B is a 1.5-billion parameter base (pretrained) model from Alibaba Cloud's older Qwen 2 generation. It was trained on a multilingual corpus and supports a context window of up to 32K tokens. As a base model, it is designed for fine-tuning and research rather than direct conversational use. While superseded by the Qwen 2.5 series in terms of training data quality and benchmark performance, Qwen2 1.5B remains a lightweight option for experimentation and as a baseline for comparison. Released under the Apache 2.0 license.

Chat

Qwen2.5 Coder 14B Instruct GGUF

Alibaba · 14B

51.3K 98

Qwen2.5 Coder 14B Instruct is a mid-range code-specialized model from Alibaba, released in official GGUF format. Its 14 billion parameters give it a meaningful quality advantage over the 7B coding variant, producing more accurate completions, better handling of complex logic, and stronger performance on multi-file refactoring tasks. The 14B size is well suited to users with a 12-to-16 GB GPU who want the best coding capability their hardware can support. Quantized GGUF options make it feasible on cards like the RTX 4070 or RTX 3090, delivering a strong local coding experience without resorting to cloud APIs.

Chat · Code

Qwen1.5 0.5B Chat

Alibaba · 620M

92.5K 93

Qwen1.5 0.5B Chat is an early-generation small language model from Alibaba's Qwen series with just 620 million parameters. As one of the smallest models in the Qwen family, it was designed to demonstrate that useful conversational ability is possible even at sub-billion parameter scales. This model runs easily on virtually any hardware including CPUs, older GPUs, and even mobile devices. While its capabilities are limited compared to larger Qwen models, it remains a useful option for embedded applications, rapid prototyping, or situations where minimal resource consumption is the top priority.

Chat

Qwen3 8B Base

Alibaba · 8.2B

1.9M 90

Qwen3 8B Base is an 8.2-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 series. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a starting point for custom downstream applications. It was trained on a large multilingual corpus with improved data quality and training methodology compared to the Qwen 2.5 generation. The model runs efficiently on consumer GPUs with 8 GB or more of VRAM and serves as the foundation for the Qwen3 8B instruction-tuned variant and community fine-tunes. It is a strong choice for practitioners building specialized models through further training. Released under the Apache 2.0 license.

Chat

Qwen2.5 3B Instruct GGUF

Alibaba · 3B

333.6K 88

Qwen2.5 3B Instruct is Alibaba's official GGUF release of the 3-billion-parameter instruction-tuned model from the Qwen2.5 family. It delivers noticeably stronger reasoning and more coherent long-form output than its smaller siblings while still fitting comfortably in the VRAM of a mid-range consumer GPU or running on CPU with acceptable speed. For users who need a step up from ultra-light models without jumping to the resource demands of 7B+, the 3B variant occupies a sweet spot. It handles multi-turn conversation, basic code assistance, and structured data extraction well, and quantized GGUF formats let you tune the quality-versus-memory trade-off to match your hardware.

Chat

Qwen2.5 1.5B Instruct GGUF

Alibaba · 1.5B

340.6K 86

Qwen2.5 1.5B Instruct is a compact general-purpose language model from Alibaba's Qwen team, offered here in official GGUF format for easy local deployment. With 1.5 billion parameters, it strikes a practical balance between capability and resource efficiency, handling everyday tasks like summarization, Q&A, and light creative writing without demanding a powerful GPU. This model is an excellent entry point for users who want a responsive local assistant on modest hardware. It runs comfortably on most modern laptops and even some higher-end single-board computers, making it one of the most accessible instruction-tuned models in the Qwen2.5 lineup.

Chat

Qwen2.5 Coder 1.5B

Alibaba · 1.5B

584.8K 85

Qwen2.5 Coder 1.5B is a 1.5-billion parameter code-specialized model from Alibaba Cloud's Qwen 2.5 Coder series. It is the smallest Coder variant that balances meaningful code generation capability with extremely low resource requirements, running on GPUs with as little as 2-4 GB of VRAM. The model is suitable for lightweight code completion, simple code generation tasks, and as a compact local coding assistant in resource-constrained environments. It supports a 128K token context window. Released under the Apache 2.0 license.

Chat · Code

Qwen3 30B A3B FP8

Alibaba · 30B

87.6K 82

Qwen3 30B A3B FP8 is the FP8 precision version of Alibaba's 30-billion-parameter mixture-of-experts model with approximately 3 billion active parameters per token. FP8 provides a good balance between quantization efficiency and output quality, sitting between full precision and more aggressive 4-bit formats. This variant is aimed at users who want near-original model quality with meaningful memory savings. The MoE architecture keeps compute demands low, and FP8 roughly halves the weight footprint relative to BF16; the full expert set must still be resident, however, so plan on roughly 30 GB for the weights, which means a 32 GB-class GPU, a multi-GPU setup, or partial offloading of experts to system RAM.

Chat

Qwen2.5 0.5B Instruct GGUF

Alibaba · 0.5B

67.5K 81

Qwen2.5 0.5B Instruct is the smallest instruction-tuned model in Alibaba's Qwen2.5 series, offered in official GGUF format. With just 500 million parameters it is designed for extremely resource-constrained environments, running on virtually any modern CPU without a dedicated GPU and consuming minimal RAM. Despite its tiny footprint, the 0.5B variant can handle simple question answering, short text generation, and basic classification tasks. It is ideal for experimentation, edge deployment, or as an always-on local model where speed and low resource usage matter more than peak output quality.

Chat

Qwen3 1.7B Base

Alibaba · 1.7B

336.3K 65

Qwen3 1.7B Base is a 1.7-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 family. It is a compact base model designed for fine-tuning, research, and custom applications rather than direct conversational use. Its small size makes it accessible for resource-constrained fine-tuning and rapid experimentation. The model can run on virtually any modern GPU and benefits from the improved pretraining data of the Qwen 3 generation. It is suitable as a lightweight foundation for domain-specific fine-tunes and student models in distillation pipelines. Released under the Apache 2.0 license.

Chat

Qwen3 30B A3B GPTQ Int4

Alibaba · 30.5B

104.7K 49

Qwen3 30B A3B GPTQ Int4 is a GPTQ INT4 quantized version of Alibaba's 30.5-billion-parameter mixture-of-experts model. The aggressive INT4 quantization combined with the MoE architecture's low active parameter count makes this one of the most efficient ways to run a 30B-class model locally. With only about 3 billion parameters active per token, inference is fast, and with weights compressed to 4-bit precision the model occupies roughly 16 GB, bringing a 30B-class model within reach of a single 24 GB consumer GPU. It is an excellent option for users who want to maximize model capability on a single card, though some quality degradation compared to higher-precision formats is expected.

Chat
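
A minimal vLLM sketch of serving this checkpoint follows; the repo id matches Qwen's naming convention for this release but should be treated as an assumption, and vLLM reads the GPTQ quantization config from the checkpoint itself:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Repo id is an assumption based on Qwen's naming convention for this release.
llm = LLM(model="Qwen/Qwen3-30B-A3B-GPTQ-Int4", max_model_len=8192)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```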

Qwen2.5 Coder 0.5B

Alibaba · 494M

63.5K 46

Qwen2.5 Coder 0.5B is a 494-million parameter code-specialized model from Alibaba Cloud, the smallest in the Qwen 2.5 Coder series. It is designed for ultra-lightweight deployment where code-aware text generation is needed with minimal hardware resources. The model runs on virtually any GPU and even on CPU-only setups. While limited in capability compared to larger coding models, it is useful for basic code completion, prototyping, and experimentation. It supports a 128K token context window. Released under the Apache 2.0 license.

Chat · Code