All LLM Models

Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
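The arithmetic above can be sketched as a quick estimator. The bits-per-weight figures below are rough approximations (quantized GGUF formats store scaling factors alongside the 4- or 8-bit weights, so they average slightly more than their nominal bit width), and real inference needs extra memory for the KV cache and runtime buffers:

```python
# Rough VRAM estimate for model weights: parameters × bytes per parameter.
# Bits-per-weight values are approximations; actual files vary slightly,
# and inference needs additional memory for KV cache and runtime buffers.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # ~8.5 bits/weight including scale factors
    "Q4_K_M": 4.85,  # ~4.85 bits/weight including scale factors
}

def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights alone, in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for quant in ("FP16", "Q4_K_M"):
    print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

This reproduces the figures above: a 7B model works out to 14 GB at FP16 and a little over 4 GB at Q4_K_M before overhead.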

Model List

SmolLM3 3B Base

Hugging Face · 3B

89.8K 150

SmolLM3 3B Base is the pretrained foundation model from Hugging Face's third-generation SmolLM family. Without instruction tuning or chat alignment, it serves as a versatile starting point for researchers and developers who want to fine-tune the model for specific domains, tasks, or behavioral profiles. With 3 billion parameters and the architectural improvements introduced in SmolLM3, this base model offers strong general language capabilities in a package that remains practical to train and adapt on consumer-grade hardware. It is an excellent choice for custom fine-tuning projects where off-the-shelf chat behavior is not needed.

Chat

OPT 350M

Meta · 350M

156.3K 149

Meta OPT 350M is a 350-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project, released in 2022 as part of a suite of models ranging from 125M to 175B parameters. It was designed to provide researchers with open access to models comparable to GPT-3 at various scales. The 350M variant runs on minimal hardware and is suitable for research, prototyping, and educational use. While it has been surpassed by modern architectures in terms of capability, it remains a lightweight option for basic text generation experiments and as a benchmark baseline.

Chat

SmolLM2 1.7B

Hugging Face · 1.7B

97.2K 145

SmolLM2 1.7B is the base pretrained model from Hugging Face's second-generation SmolLM family. Unlike the instruct variant, this model has not been fine-tuned for chat or instruction following, making it a strong foundation for custom fine-tuning, domain adaptation, or research into small-scale language model behavior. At 1.7 billion parameters, it provides meaningful language understanding and generation capabilities while remaining lightweight enough to train and experiment with on consumer hardware. Researchers and developers who want full control over downstream behavior will find this base model more flexible than the instruction-tuned version.

Chat

GLM 5 FP8

zai-org · 753.9B

4.3M 143

GLM 5 FP8 is the FP8 quantized release of Zhipu AI's 754 billion parameter flagship model, reducing memory requirements by storing weights in 8-bit floating point precision. This quantization roughly halves the VRAM needed compared to the full-precision version while preserving most of the model's capability across reasoning, coding, and multilingual tasks. It remains a demanding model to run locally, but FP8 quantization meaningfully lowers the hardware barrier for users with high-end multi-GPU setups.

Chat

LFM2.5 1.2B Instruct GGUF

LiquidAI · 1.2B

59.3K 141

This is the GGUF-quantized release of Liquid AI's LFM2.5 1.2B Instruct, packaged for easy local inference with llama.cpp and compatible tools. At 1.2 billion parameters, the quantized versions are tiny enough to run on almost anything, from a Raspberry Pi to a basic laptop. GGUF quantization at various bit levels lets users choose their preferred tradeoff between quality and size, making this one of the most hardware-friendly models available for quick local experimentation.

Chat
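A simple way to act on the quality-versus-size tradeoff described above is to pick the highest-precision quantization whose weights fit your memory budget. The sketch below is illustrative only: the bits-per-weight figures are rough community estimates, and the 20% headroom factor for KV cache and runtime overhead is an assumption, not anything llama.cpp computes for you.

```python
# Pick the highest-precision GGUF quantization whose weights fit a memory
# budget, leaving headroom for KV cache and runtime overhead.
QUANTS = [  # (name, approx bits per weight), best quality first
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.69),
    ("Q4_K_M", 4.85),
    ("Q3_K_M", 3.91),
]

def pick_quant(params_billion: float, budget_gb: float, headroom: float = 0.8):
    """Return the best quant whose weights fit in headroom * budget, or None."""
    usable = budget_gb * headroom
    for name, bits in QUANTS:
        size_gb = params_billion * bits / 8  # 1e9 params ≈ 1 GB per 8 bits
        if size_gb <= usable:
            return name
    return None

print(pick_quant(1.2, 2.0))   # tiny model, tight budget
print(pick_quant(7.6, 8.0))   # 7B-class model on an 8 GB GPU
```

For a 1.2B model like this one, even a 2 GB budget leaves room for the highest-quality quantizations, which is why it runs on almost anything.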

Qwen2.5 Coder 7B

Alibaba · 7.6B

205.3K 139

Qwen2.5 Coder 7B is a 7.6-billion parameter code-specialized base (pretrained) model from Alibaba Cloud's Qwen 2.5 Coder series. It is trained on a large dataset of source code and natural language but is not instruction-tuned, making it suitable for fine-tuning, code-related research, and custom downstream applications. The model supports a 128K token context window and runs efficiently on consumer GPUs. It serves as the foundation for the Qwen2.5 Coder 7B Instruct variant and community fine-tunes targeting specific programming languages or workflows. Released under the Apache 2.0 license.

Chat · Code

NVIDIA Nemotron 3 Super 120B A12B NVFP4

NVIDIA · 67.2B

167.7K 135

NVIDIA Nemotron 3 Super 120B A12B NVFP4 is a large-scale mixture-of-experts model compressed through NVIDIA's NVFP4 quantization to an effective memory footprint equivalent to roughly 67.2 billion full-precision parameters. With 12 billion parameters active per token from a 120 billion parameter pool, it delivers flagship-tier intelligence in a more accessible package. This is where the MoE architecture and aggressive quantization shine together: a model that would normally require data center hardware becomes feasible on high-end consumer GPUs or multi-GPU setups. The NVFP4 format is purpose-built for NVIDIA silicon, keeping quality surprisingly close to the full-precision version.

Chat

DeepSeek R1 Distill Qwen 1.5B GGUF

Unsloth · 1.5B

55.2K 133

A GGUF-quantized version of DeepSeek R1 Distill Qwen 1.5B, repackaged by Unsloth. This model distills the reasoning capabilities of the much larger DeepSeek R1 into a compact 1.5 billion parameter architecture based on Qwen. It is designed specifically for chain-of-thought reasoning tasks, offering surprisingly capable step-by-step problem solving for its size. Despite its small footprint, the R1 distillation process preserves a meaningful share of the original model's logical reasoning ability. It runs easily on low-end hardware and is well suited for users who want to explore reasoning-focused models without dedicating significant GPU memory.

Chat · Reasoning

NVIDIA Nemotron 3 Super 120B A12B FP8

NVIDIA · 123.6B

64.6K 132

NVIDIA Nemotron 3 Super 120B A12B FP8 is the FP8 variant of NVIDIA's largest Nemotron 3 mixture-of-experts model, weighing in at 123.6 billion parameters. With 12 billion parameters active per token, it delivers exceptional reasoning and conversational depth while the FP8 format keeps memory usage lower than full precision. This model sits at the high end of what's achievable for local inference. You'll need serious GPU memory to run it, but the payoff is near-frontier model quality running entirely on your own hardware. The FP8 quantization offers a meaningful memory reduction over BF16 with minimal quality trade-off.

Chat

Qwen2.5 7B Instruct GGUF

Alibaba · 7B

50.7K 131

Qwen2.5 7B Instruct is Alibaba's general-purpose 7-billion-parameter model in official GGUF format, instruction-tuned for conversational and task-oriented use. It represents the most popular size class in the Qwen2.5 family, offering a well-rounded mix of reasoning ability, factual knowledge, and multilingual support that works on widely available consumer hardware. With quantized GGUF variants, the 7B model fits comfortably on GPUs with 8 GB of VRAM and runs at interactive speeds. It is a versatile workhorse for local AI, capable of drafting content, answering questions, extracting information, and holding coherent multi-turn conversations.

Chat

Mistral 7B Instruct v0.3 GGUF

MaziyarPanahi · 7B

122.6K 131
Chat

Qwen3 32B AWQ

Alibaba · 32.8B

666.0K 130

Qwen3 32B AWQ is an AWQ-quantized version of Alibaba's 32.8-billion-parameter Qwen3 dense model. AWQ (Activation-aware Weight Quantization) reduces the model's memory footprint significantly while preserving most of the original quality, making this large model much more accessible on consumer GPUs with 16 to 24 GB of VRAM. For users who want the full dense 32B Qwen3 experience but lack the VRAM to run it at full precision, the AWQ variant is an excellent compromise. It retains strong general-purpose capabilities across chat, reasoning, and creative tasks while fitting into a fraction of the memory that the unquantized model would require.

Chat

NVIDIA Nemotron Nano 9B v2 Japanese

NVIDIA · 8.9B

281.4K 124

NVIDIA Nemotron Nano 9B v2 Japanese is a specialized variant of the Nemotron Nano 9B v2, fine-tuned for Japanese language understanding and generation. At 8.9 billion parameters, it maintains the same hardware-friendly footprint as the English version while delivering natural Japanese conversational ability. For users looking to run a Japanese-language assistant locally, this model offers a rare combination of compact size and dedicated language optimization from a major hardware vendor. It handles Japanese text with the fluency you'd expect from a purpose-built model rather than a multilingual afterthought.

Chat

Olmo 3 7B Instruct

Allen AI · 7B

105.4K 122

OLMo 3 7B Instruct is an instruction-tuned language model from the Allen Institute for AI, built as part of their Open Language Model initiative. Like all OLMo releases, it comes with fully open training data, code, and intermediate checkpoints, setting a high standard for reproducibility and scientific transparency in the LLM space. At roughly 7 billion parameters, this model delivers competitive performance on instruction following, reasoning, and general knowledge tasks while remaining runnable on consumer GPUs with 8 GB or more of VRAM. It is an excellent choice for users who value open science and want a capable, well-documented model for local chat and assistant applications.

Chat

Qwen3 0.6B GGUF

Unsloth · 0.6B

108.9K 115

This is a GGUF-quantized version of Alibaba's Qwen3 0.6B, repackaged by Unsloth. Qwen3 0.6B is an ultra-compact model from the Qwen3 family, designed for deployment on heavily resource-constrained devices where even small models struggle to fit. With just 0.6 billion parameters, this is among the smallest available models in the Qwen3 lineup. The GGUF format from Unsloth makes it compatible with llama.cpp and related tools. While its capabilities are naturally limited by its tiny size, it can handle basic text tasks and is ideal for experimentation, embedded applications, or running on devices with minimal memory and compute.

Chat

NVIDIA Nemotron 3 Nano 30B A3B NVFP4

NVIDIA · 18.2B

430.8K 113

NVIDIA Nemotron 3 Nano 30B A3B NVFP4 is the most aggressively quantized version of the Nemotron 3 Nano 30B, using NVIDIA's proprietary NVFP4 format to shrink its effective memory footprint to the equivalent of roughly 18.2 billion full-precision parameters. This makes it accessible on GPUs that couldn't handle the BF16 or FP8 variants. NVFP4 is NVIDIA's custom 4-bit floating point quantization, optimized for their GPU architectures to minimize quality loss at extreme compression. If you're running a mid-range NVIDIA card and want MoE-level intelligence, this is the variant to try.

Chat

Qwen3 Coder Next FP8

Alibaba · 79.7B

589.7K 113

Qwen3 Coder Next FP8 is Alibaba's 79.7-billion-parameter code-specialized model served in FP8 precision. As the next-generation coding model in the Qwen3 family, it is trained and tuned specifically for software engineering tasks including code generation, debugging, refactoring, and technical explanation. At nearly 80 billion parameters, this is a substantial model that benefits greatly from FP8 quantization to reduce memory requirements. Users with high-end consumer GPUs or multi-GPU setups will find it delivers strong code completion and generation quality that competes with much larger models, though it does require significant VRAM to run comfortably.

Chat · Code

NVIDIA Nemotron 3 Nano 30B A3B Base BF16

NVIDIA · 31.6B

68.0K 113

NVIDIA Nemotron 3 Nano 30B A3B Base BF16 is the foundation model version of the Nemotron 3 Nano 30B, offered in full BF16 precision. Unlike the chat-tuned variants, this base model hasn't been instruction-tuned, making it suitable for fine-tuning, research, or custom alignment workflows. At 31.6 billion total parameters with a mixture-of-experts architecture, the base model gives developers and researchers a strong starting point for building specialized applications. It retains all the architectural benefits of the MoE design while leaving the behavioral layer open for customization.

Chat

Moonlight 16B A3B

Moonshot AI · 16.0B

72.7K 109

Moonlight 16B A3B is a compact Mixture-of-Experts model from Moonshot AI that packs 16 billion total parameters while activating only around 3 billion per token. This efficient sparse design lets it punch well above its active parameter count, delivering surprisingly strong chat performance for its effective inference cost. The small active parameter count means Moonlight runs briskly on modest hardware, fitting comfortably on GPUs with 8–12 GB of VRAM at common quantization levels. It is an appealing choice for users who want MoE-level performance diversity without the heavy memory footprint typically associated with mixture models.

Chat
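The total-versus-active distinction above drives how MoE models are sized: all experts must be resident in memory, so weight footprint scales with total parameters, while per-token compute scales only with the active subset. A rough illustration, assuming ~4.85 bits per weight for a Q4-class quantization (figures approximate; KV cache and attention costs are ignored):

```python
# MoE sizing rule of thumb: memory scales with TOTAL parameters (all experts
# must be loaded), while per-token compute scales with ACTIVE parameters.
def moe_profile(total_b: float, active_b: float, bits_per_weight: float = 4.85):
    """Approximate weight memory (GB) and compute ratio vs. a dense model."""
    weight_gb = total_b * bits_per_weight / 8  # 1e9 params ≈ 1 GB per 8 bits
    compute_vs_dense = active_b / total_b      # fraction of a dense forward pass
    return weight_gb, compute_vs_dense

mem_gb, ratio = moe_profile(16.0, 3.0)  # Moonlight: 16B total, ~3B active
print(f"~{mem_gb:.1f} GB of weights at ~Q4, ~{ratio:.0%} of dense compute")
```

Under these assumptions Moonlight's weights land just under 10 GB at Q4-class precision, consistent with the 8–12 GB guidance above, while each token costs roughly a fifth of a dense 16B forward pass.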

Meta Llama 3 8B Instruct GGUF

MaziyarPanahi · 8B

118.1K 101
Chat

Prometheus 7B V2.0

prometheus-eval · 7.2B

75.4K 101

Prometheus 7B V2.0 is a specialized judge model trained by prometheus-eval to evaluate the quality of outputs from other language models. At 7.2 billion parameters, it is designed to score and critique LLM responses against custom rubrics, making it a valuable tool for automated evaluation pipelines and benchmarking. Unlike general-purpose chat models, Prometheus is purpose-built for assessment tasks. It can provide structured feedback on dimensions like helpfulness, accuracy, and coherence. Useful for researchers, developers building LLM applications, and anyone who needs consistent automated evaluation without relying on paid API calls to frontier models. Runs comfortably on most modern GPUs with 8 GB or more of VRAM.

Chat

Qwen2 1.5B

Alibaba · 1.5B

108.4K 100

Qwen2 1.5B is a 1.5-billion parameter base (pretrained) model from Alibaba Cloud's older Qwen 2 generation. It was trained on a multilingual corpus and supports a context window of up to 32K tokens. As a base model, it is designed for fine-tuning and research rather than direct conversational use. While superseded by the Qwen 2.5 series in terms of training data quality and benchmark performance, Qwen2 1.5B remains a lightweight option for experimentation and as a baseline for comparison. Released under the Apache 2.0 license.

Chat

Qwen2.5 Coder 14B Instruct GGUF

Alibaba · 14B

51.3K 98

Qwen2.5 Coder 14B Instruct is a mid-range code-specialized model from Alibaba, released in official GGUF format. Its 14 billion parameters give it a meaningful quality advantage over the 7B coding variant, producing more accurate completions, better handling of complex logic, and stronger performance on multi-file refactoring tasks. The 14B size is well suited to users with a 12-to-16 GB GPU who want the best coding capability their hardware can support. Quantized GGUF options make it feasible on cards like the RTX 4070 or RTX 3090, delivering a strong local coding experience without resorting to cloud APIs.

Chat · Code

Meta Llama 3.1 8B Instruct

Unsloth · 8.0B

415.1K 94

This is an Unsloth repack of Meta's Llama 3.1 8B Instruct, optimized for efficient fine-tuning and inference. Llama 3.1 8B Instruct is one of the most widely used open-weight instruction-tuned models, delivering strong performance across general conversation, reasoning, and multilingual tasks. Unsloth's version provides the full-precision model weights in an optimized layout designed for their training and inference framework. At 8 billion parameters, this model offers a strong balance of capability and efficiency, suitable for users who want to fine-tune or run the model locally without additional quantization.

Chat

Qwen1.5 0.5B Chat

Alibaba · 620M

92.5K 93

Qwen1.5 0.5B Chat is an early-generation small language model from Alibaba's Qwen series with just 620 million parameters. As one of the smallest models in the Qwen family, it was designed to demonstrate that useful conversational ability is possible even at sub-billion parameter scales. This model runs easily on virtually any hardware including CPUs, older GPUs, and even mobile devices. While its capabilities are limited compared to larger Qwen models, it remains a useful option for embedded applications, rapid prototyping, or situations where minimal resource consumption is the top priority.

Chat

Qwen3 8B Base

Alibaba · 8.2B

1.9M 90

Qwen3 8B Base is an 8.2-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 series. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a starting point for custom downstream applications. It was trained on a large multilingual corpus with improved data quality and training methodology compared to the Qwen 2.5 generation. The model runs efficiently on consumer GPUs with 8 GB or more of VRAM and serves as the foundation for the Qwen3 8B instruction-tuned variant and community fine-tunes. It is a strong choice for practitioners building specialized models through further training. Released under the Apache 2.0 license.

Chat

Qwen3.5 9B Claude 4.6 Opus Reasoning Distilled GGUF

Jackrong · 9B

57.3K 88

A compact 9-billion-parameter GGUF model distilled from Claude 4.6 Opus reasoning using the Qwen3.5 9B base. This is the smallest model in Jackrong's reasoning distillation series and is designed to run comfortably on consumer GPUs with as little as 6 to 8 GB of VRAM at aggressive quantization levels. While the smaller size necessarily limits the depth of reasoning compared to its 27B sibling, this model punches above its weight class on structured thinking tasks thanks to the distillation approach. A practical choice for users with modest hardware who want improved reasoning over a standard 9B model without investing in a bigger GPU.

Chat · Reasoning

Meta Llama 3.1 8B Instruct AWQ INT4

Hugging Face · 8B

169.6K 88

This is an AWQ INT4-quantized version of Meta's Llama 3.1 8B Instruct, produced by Hugging Face's Hugging Quants project. Llama 3.1 8B Instruct is a widely adopted open-weight model known for its strong instruction-following, reasoning, and multilingual abilities. The AWQ INT4 quantization from Hugging Face aggressively compresses the model to 4-bit integer precision using activation-aware weight quantization, significantly reducing VRAM requirements while retaining most of the original model's quality. This format is optimized for GPU inference with frameworks like vLLM, AutoAWQ, and Transformers, making it a practical choice for users who want fast, memory-efficient local inference on consumer GPUs.

Chat

Qwen2.5 3B Instruct GGUF

Alibaba · 3B

333.6K 88

Qwen2.5 3B Instruct is Alibaba's official GGUF release of the 3-billion-parameter instruction-tuned model from the Qwen2.5 family. It delivers noticeably stronger reasoning and more coherent long-form output than its smaller siblings while still fitting comfortably in the VRAM of a mid-range consumer GPU or running on CPU with acceptable speed. For users who need a step up from ultra-light models without jumping to the resource demands of 7B+, the 3B variant occupies a sweet spot. It handles multi-turn conversation, basic code assistance, and structured data extraction well, and quantized GGUF formats let you tune the quality-versus-memory trade-off to match your hardware.

Chat

Qwen2.5 1.5B Instruct GGUF

Alibaba · 1.5B

340.6K 86

Qwen2.5 1.5B Instruct is a compact general-purpose language model from Alibaba's Qwen team, offered here in official GGUF format for easy local deployment. With 1.5 billion parameters, it strikes a practical balance between capability and resource efficiency, handling everyday tasks like summarization, Q&A, and light creative writing without demanding a powerful GPU. This model is an excellent entry point for users who want a responsive local assistant on modest hardware. It runs comfortably on most modern laptops and even some higher-end single-board computers, making it one of the most accessible instruction-tuned models in the Qwen2.5 lineup.

Chat