All LLM Models

Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
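The rule of thumb above can be turned into a quick back-of-the-envelope estimate: weight memory ≈ parameters × bits per weight ÷ 8. A minimal Python sketch — the bits-per-weight figures are approximate averages for each GGUF quantization type, and real usage adds KV-cache and runtime overhead on top:

```python
# Back-of-the-envelope weight memory: params * bits_per_weight / 8.
# Bits-per-weight values are approximate averages for each quant type;
# actual GGUF files mix tensor formats, and the KV cache and runtime
# buffers add a few extra GB on top of the weights.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.85,
}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate decimal gigabytes needed to hold the weights alone."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"7B @ FP16:   {weight_memory_gb(7, 'FP16'):.1f} GB")    # ~14 GB
print(f"7B @ Q4_K_M: {weight_memory_gb(7, 'Q4_K_M'):.1f} GB")  # ~4 GB
```

The same arithmetic reproduces the figures quoted above: 7 × 16 ÷ 8 = 14 GB at FP16, and 7 × ~4.85 ÷ 8 ≈ 4.2 GB at Q4_K_M.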

Model List

Nemotron 3 Nano 30B A3B GGUF

Unsloth · 30B

116.9K 277

This is a GGUF-quantized version of NVIDIA's Nemotron 3 Nano 30B A3B, repackaged by Unsloth. Nemotron 3 Nano is NVIDIA's efficient language model using a Mixture-of-Experts (MoE) architecture with 30 billion total parameters and approximately 3 billion active parameters per token, designed to deliver strong performance with minimal computational overhead. The sparse MoE design makes this model far more efficient than its total parameter count suggests: per-token compute is comparable to a dense 3B model, although all 30 billion parameters must still be loaded into memory, and output quality competes with much larger architectures. Unsloth's GGUF conversion enables compatibility with llama.cpp and popular local inference frontends, making it an appealing option for users who want high-quality local inference without the compute demands of a full 30B dense model.

Chat

Yi 1.5 34B Chat

01.AI · 34.4B

12.2K 274

Yi 1.5 34B Chat is a 34.4-billion parameter instruction-tuned model by 01.AI, the Chinese AI lab founded by Kai-Fu Lee. It is a bilingual model with strong performance in both English and Chinese, making it particularly well suited for users who need high-quality generation in either language. Yi 1.5 represents an improved iteration of the Yi model family with enhanced reasoning and coding ability. The 34B size requires a GPU with at least 24GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. Released under the Yi License.

Chat

SmolLM 135M

Hugging Face · 135M

175.5K 253

SmolLM 135M is the first-generation small language model from Hugging Face, designed to push the boundaries of what is achievable at extremely low parameter counts. With just 135 million parameters, it was a pioneering effort in making capable language models accessible on the most resource-constrained hardware. While the SmolLM2 and SmolLM3 families have since surpassed it in quality, the original SmolLM 135M remains a useful reference point for research and a practical option for ultra-lightweight deployment scenarios where every megabyte of memory counts.

Chat

OPT 125M

Meta · 125M

6.8M 234

Meta OPT 125M is a 125-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project. Released in 2022, it was part of Meta's effort to provide the research community with openly available large language models that replicate the performance of GPT-3 class models at various scales. As one of the smallest models in the OPT family, the 125M variant is primarily useful for research, experimentation, and educational purposes. It can run on virtually any hardware, including CPU-only setups. While significantly less capable than modern models, it remains a useful reference point in LLM research.

Chat

GPT OSS 120B GGUF

Unsloth · 120B

83.7K 229

This is a GGUF-quantized version of OpenAI's GPT-OSS 120B, repackaged by Unsloth. GPT-OSS 120B is the larger variant of OpenAI's open-source model family, packing 120 billion parameters for significantly enhanced reasoning, knowledge, and generation capabilities compared to its smaller sibling. Unsloth's GGUF conversion enables this large model to run with llama.cpp and compatible tools. Even with aggressive quantization, a 120B-parameter model demands significant hardware, typically requiring multi-GPU setups or systems with very high VRAM capacity. For users with the hardware to support it, this model represents one of the most capable open-weight options available for local deployment.

Chat

Llama 3.3 Nemotron Super 49B v1.5

NVIDIA · 49.9B

56.5K 227

Llama 3.3 Nemotron Super 49B is a 49.9-billion parameter chat model by NVIDIA, built on a modified Llama 3.3 architecture. It occupies a unique size point between the common 70B and 8B tiers, offering strong reasoning and conversational ability while requiring less VRAM than full 70B models. NVIDIA's Nemotron Super training pipeline applies extensive alignment tuning to optimize helpfulness and factual accuracy. The model typically needs 32GB or more of VRAM for local inference at reduced precision, placing it within reach of high-end consumer GPUs like the RTX 4090 or professional workstation cards.

Chat

Qwen1.5 MoE A2.7B

Alibaba · 14.3B

153.9K 221

Qwen1.5 MoE A2.7B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 1.5 generation, with 14.3 billion total parameters but only 2.7 billion active parameters per forward pass. The MoE architecture allows it to deliver performance closer to dense 7B models while requiring less compute during inference, as only a subset of expert layers are activated for each token. The model supports a 32K token context window and requires VRAM proportional to its total parameter count for loading, despite lower compute cost per token. It is an interesting architectural variant for users exploring efficient inference and MoE models locally. Released under a custom Qwen license.

Chat
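The memory-versus-compute split described in the entry above can be sketched numerically. This is a rough model under stated assumptions (≈2 FLOPs per active parameter per generated token, and an approximate 4.85 bits per weight for Q4_K_M-style quantization); the `moe_profile` helper is illustrative, not part of any library:

```python
# MoE tradeoff: memory scales with TOTAL parameters (every expert must
# be resident so the router can pick any of them), while per-token
# compute scales with the ACTIVE parameters actually used per token.
def moe_profile(total_b: float, active_b: float, bits_per_weight: float = 4.85):
    weights_gb = total_b * bits_per_weight / 8   # decimal GB just to load weights
    flops_per_token = 2 * active_b * 1e9         # ~2 FLOPs per active parameter
    return weights_gb, flops_per_token

# Qwen1.5 MoE A2.7B: 14.3B total parameters, 2.7B active per token.
weights_gb, flops = moe_profile(14.3, 2.7)
# Loads like a ~14B model (roughly 8-9 GB at Q4-level quantization),
# but each token costs only about as much compute as a dense 2.7B model.
```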

Llama 3.1 Nemotron Nano 8B V1

NVIDIA · 8B

308.6K 219

Llama 3.1 Nemotron Nano 8B is an 8-billion parameter chat model by NVIDIA, a compact entry in the Nemotron family derived from Meta's Llama 3.1 architecture. It applies NVIDIA's alignment and fine-tuning techniques to deliver improved response quality over the base Llama 3.1 8B Instruct model at the same parameter count. The model runs on consumer GPUs with 8GB or more of VRAM and supports a 128K token context window. Its small footprint and NVIDIA-tuned quality make it a practical option for local inference on mainstream hardware.

Chat

TinyLlama 1.1B Chat v1.0 GGUF

TheBloke · 1.1B

131.6K 213
Chat

Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled GGUF

Jackrong · 27B

144.6K 209

A GGUF-quantized version of Jackrong's Qwen3.5 27B model, distilled from Claude 4.6 Opus with a focus on reasoning capabilities. This 27-billion-parameter model aims to capture the structured thinking and chain-of-thought abilities of a much larger frontier model in a size that can run on high-end consumer hardware. Available in multiple quantization levels, it offers a practical way to get strong reasoning performance locally without needing datacenter GPUs. As a distilled model, expect solid performance on logic puzzles, math, and multi-step problem solving, though it will not fully match its teacher model. The GGUF format makes it easy to run with llama.cpp, Ollama, or LM Studio. Best suited for users who prioritize analytical and reasoning tasks over raw creative generation.

Chat · Reasoning

Qwen2.5 Coder 7B Instruct GGUF

Alibaba · 7B

99.3K 206

Qwen2.5 Coder 7B Instruct is a code-focused model from Alibaba's Qwen team, provided in official GGUF format for straightforward local use. At 7 billion parameters it offers solid code generation, completion, and explanation capabilities while remaining runnable on a single consumer GPU with 8 GB or more of VRAM. This model is a practical choice for developers who want a local coding assistant without the hardware demands of larger models. It handles Python, JavaScript, TypeScript, SQL, and many other languages competently, and its instruction tuning makes it responsive to natural-language prompts describing the code you need.

Chat · Code

Qwen3 4B GGUF

Unsloth · 4B

80.2K 201

This is a GGUF-quantized version of Alibaba's Qwen3 4B, repackaged by Unsloth. Qwen3 4B is a compact yet capable model from the latest generation of the Qwen series, offering strong multilingual performance and solid reasoning abilities in a small footprint. At 4 billion parameters in GGUF format, this model is lightweight enough to run comfortably on most consumer hardware, including laptops and systems with modest GPUs. Unsloth's conversion ensures compatibility with llama.cpp and its ecosystem of tools, making it an accessible option for users who want a responsive local model for everyday tasks without heavy resource demands.

Chat

GPT-2 Medium

OpenAI · 380M

716.0K 196

GPT-2 Medium scales the original GPT-2 architecture to 380 million parameters, offering noticeably improved text generation quality over the base 137M variant while remaining extremely lightweight by current standards. It supports the same autoregressive language modeling tasks as its smaller and larger siblings. Like all GPT-2 variants, it runs comfortably on virtually any modern hardware including CPU-only setups, making it an accessible option for learning, prototyping, and lightweight text generation experiments without needing a dedicated GPU.

Chat

Gemma 3 27B IT GGUF

Unsloth · 27B

129.2K 193
Vision

Llama 3.2 3B Instruct GGUF

Bartowski · 3B

407.1K 193

This is a GGUF-quantized version of Meta's Llama 3.2 3B Instruct, repackaged by Bartowski. Llama 3.2 3B Instruct is part of Meta's lightweight Llama 3.2 release, optimized for on-device deployment and edge inference while delivering surprisingly capable instruction following and text generation. At 3 billion parameters, this model hits a sweet spot between size and capability. The GGUF format provided by Bartowski enables compatibility with llama.cpp-based tools, making it easy to run locally. It's an excellent choice for users who want a responsive, low-resource model for everyday tasks like summarization, Q&A, and general chat.

Chat

Qwen2.5 Coder 32B Instruct GGUF

Alibaba · 32B

165.9K 189

Qwen2.5 Coder 32B Instruct is the flagship code-specialized model in Alibaba's Qwen2.5 lineup, released in official GGUF format. With 32 billion parameters trained heavily on programming data, it delivers strong performance across code generation, refactoring, debugging, and technical explanation, rivaling much larger proprietary coding assistants on many benchmarks. Running the 32B model locally requires a higher-end setup, typically 24 GB or more of VRAM at moderate quantization levels, but the payoff is a highly capable offline coding companion with no API costs or data-privacy concerns. Lower quantizations can bring it within reach of 16 GB cards with some quality trade-off.

Chat · Code

SmolLM2 360M Instruct

Hugging Face · 360M

189.5K 181

SmolLM2 360M Instruct is an instruction-tuned model from Hugging Face that occupies the sweet spot between the 135M and 1.7B entries in the SmolLM2 lineup. At 360 million parameters, it offers noticeably better coherence and instruction-following ability than the smallest variants while still running comfortably on virtually any modern GPU or even on CPU. This model is well suited for on-device assistants, embedded applications, and rapid prototyping where you need real conversational ability without dedicating significant hardware resources. It handles short-form generation, summarization, and basic reasoning tasks with reasonable quality.

Chat

SmolLM 1.7B

Hugging Face · 1.7B

64.3K 181

SmolLM 1.7B is the largest model in Hugging Face's first-generation SmolLM family. At 1.7 billion parameters, it delivers solid general-purpose text generation in a compact package that runs easily on entry-level hardware, though it has been superseded by the improved SmolLM2 and SmolLM3 series. This model remains a reasonable choice for applications where proven stability matters more than cutting-edge performance. For most new projects, however, users should consider the SmolLM2 1.7B or SmolLM3 3B models, which offer better quality at comparable or only slightly higher resource requirements.

Chat

VulnLLM R 7B

UCSB-SURFI · 7.6B

59.7K 179

VulnLLM R 7B is a security-focused model developed by UCSB-SURFI, built on the Qwen2.5-7B base and fine-tuned specifically for vulnerability analysis and security reasoning. With 7.6 billion parameters, it targets tasks like identifying code vulnerabilities, explaining security flaws, and reasoning about attack vectors. This model fills a niche for security researchers and developers who want a locally-hosted assistant for code auditing and vulnerability assessment without sending sensitive code to external APIs. Its specialized training gives it an edge over general-purpose models on security-related tasks, though it is not a replacement for professional security tools. Runs on consumer GPUs with 8 GB of VRAM at typical quantization levels.

Chat · Reasoning

Hermes 3 Llama 3.2 3B

Nous Research · 3B

77.3K 175

Hermes 3 Llama 3.2 3B is a 3-billion parameter instruction-tuned model by Nous Research, fine-tuned from Meta's Llama 3.2 3B base. It applies the Hermes training methodology to a compact model, targeting strong instruction following and conversational quality at minimal hardware cost. Despite its small size, this model benefits from the Hermes fine-tuning approach that emphasizes system prompt adherence and structured output. It can run on GPUs with as little as 4GB of VRAM when quantized, making it suitable for lightweight local deployments and resource-constrained environments.

Chat · Roleplay

SmolLM2 135M

Hugging Face · 135M

817.7K 171

SmolLM2 135M is one of the smallest capable language models available, developed by Hugging Face as part of their SmolLM2 family. With just 135 million parameters, it requires virtually no VRAM and can run on almost any hardware, making it an excellent starting point for researchers experimenting with language model behavior, fine-tuning workflows, or edge deployment scenarios. Despite its tiny footprint, SmolLM2 135M benefits from improved training data and techniques compared to its first-generation predecessor. It is best suited for lightweight text generation tasks, prototyping, and educational purposes rather than production-grade applications.

Chat

Qwen3 Coder 30B A3B Instruct FP8

Alibaba · 30.5B

314.4K 168

Qwen3 Coder 30B A3B Instruct FP8 is a code-focused mixture-of-experts model from Alibaba with 30.5 billion total parameters and roughly 3 billion active per token, served in FP8 precision. The combination of MoE efficiency and FP8 quantization makes this a remarkably accessible coding assistant that punches well above its effective weight class. Designed for code generation, completion, review, and technical conversation, this model benefits from specialized coding training on top of the Qwen3 MoE architecture. Its low active parameter count means it can run on consumer GPUs with moderate VRAM, making it one of the most hardware-friendly dedicated coding models available.

Chat · Code

DeepSeek v2 Lite

DeepSeek · 15.7B

226.3K 168

DeepSeek V2 Lite is a compact mixture-of-experts model with 15.7 billion total parameters, designed to deliver a strong quality-to-compute ratio for general chat and instruction following. It uses the same innovative MLA (Multi-Head Latent Attention) architecture as the larger V2, which reduces memory requirements during inference. With its modest parameter count, V2 Lite runs comfortably on a single consumer GPU, making it accessible to users who want to try DeepSeek's MoE approach without needing specialized hardware. It handles everyday conversational tasks, summarization, and light analysis well, offering a practical entry point into the DeepSeek model family.

Chat

Qwen3 8B GGUF

Alibaba · 8B

74.3K 161

Qwen3 8B GGUF is the official GGUF-format release of Alibaba's 8-billion-parameter Qwen3 model. The GGUF format is optimized for llama.cpp and compatible inference engines, making this one of the easiest Qwen3 models to get running locally with tools like Ollama, LM Studio, or Jan. At 8 billion parameters, this model offers a solid middle ground in the Qwen3 lineup, delivering capable chat and general-purpose performance while remaining runnable on most consumer GPUs with 6 GB or more of VRAM. The GGUF packaging supports flexible quantization levels, letting users choose their own quality-versus-memory tradeoff.

Chat
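The quality-versus-memory tradeoff mentioned in the entry above can be sketched as a simple selector: walk down the quant ladder from highest quality and take the first level whose weights fit the card, leaving headroom for the KV cache. The bits-per-weight figures and the 1.5 GB headroom are rough assumptions, not exact file sizes:

```python
# Pick the highest-quality GGUF quant whose weights fit a VRAM budget.
# Bits-per-weight values are approximate averages, listed best-first.
QUANT_LADDER = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.5),
    ("Q4_K_M", 4.85),
    ("Q3_K_M", 3.9),
    ("Q2_K", 3.35),
]

def best_fit(params_b: float, vram_gb: float, headroom_gb: float = 1.5):
    """Return the first (highest-quality) quant level whose weights fit,
    reserving headroom for the KV cache and runtime buffers."""
    for name, bits in QUANT_LADDER:
        if params_b * bits / 8 + headroom_gb <= vram_gb:
            return name
    return None  # even the most aggressive quant does not fit

print(best_fit(8, 8))   # 8B model on an 8 GB card -> Q5_K_M here
print(best_fit(8, 16))  # a 16 GB card fits Q8_0
```

With these assumed figures, an 8B model lands on Q5_K_M for an 8 GB card and Q8_0 for a 16 GB card, while a 70B model returns None on 8 GB — consistent with the multi-GPU advice given for the large models in this list.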

Deepseek Coder 1.3B Instruct

DeepSeek · 1.3B

89.7K 159

DeepSeek Coder 1.3B Instruct is an ultra-compact code model designed for environments where hardware resources are extremely limited. Despite having just 1.3 billion parameters, it can handle basic code completion, simple generation tasks, and code Q&A across common programming languages. This is one of the smallest viable code models available, capable of running on integrated graphics or very low-end dedicated GPUs. It is well suited for edge deployment, embedded development environments, or as a fast local autocomplete engine where response speed matters more than handling complex multi-file reasoning tasks.

Chat · Code

Gemma 3 12B IT GGUF

Unsloth · 12B

105.1K 157
Vision

Llama 3.2 1B Instruct GGUF

Bartowski · 1B

100.1K 156

This is a GGUF-quantized version of Meta's Llama 3.2 1B Instruct, repackaged by Bartowski. At just 1 billion parameters, Llama 3.2 1B Instruct is Meta's smallest instruction-tuned model, purpose-built for ultra-lightweight deployment on edge devices and resource-constrained hardware. The GGUF format from Bartowski makes this tiny model compatible with llama.cpp and its ecosystem. While it won't match larger models on complex reasoning, it excels at simple tasks like text classification, basic Q&A, and short-form generation. Its minimal resource requirements mean it can run on almost anything, making it ideal for experimentation or always-on local assistants on low-power hardware.

Chat

Qwen3 0.6B Base

Alibaba · 0.6B

222.8K 155

Qwen3 0.6B Base is the smallest pretrained foundation model in Alibaba Cloud's Qwen 3 family, with approximately 600 million parameters. As a base model, it is not tuned for chat or instructions and is intended for fine-tuning, research, and experimentation. Its minimal size makes it suitable for rapid prototyping and resource-constrained training experiments. The model runs on virtually any hardware, including CPU-only setups. It is useful for educational purposes, architecture exploration, and as a compact foundation for task-specific fine-tuning where model size is a primary constraint. Released under the Apache 2.0 license.

Chat

Gemma 3 270M IT GGUF

Unsloth · 270M

71.7K 154

A GGUF-quantized version of Google's Gemma 3 270M Instruct-Tuned, repackaged by Unsloth. With just 270 million parameters, this is one of the smallest instruction-tuned models available, making it an excellent choice for experimentation, testing inference pipelines, or running on extremely resource-constrained hardware. Don't expect strong reasoning or complex generation from a model this size, but it can handle simple completions and basic instruction following with remarkably low memory requirements.

Chat

Dolphin3.0 Llama3.1 8B GGUF

dphn · 8B

66.8K 153
Chat