All LLM Models

Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
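The arithmetic behind those figures can be sketched in a few lines. The bits-per-weight values below are approximate effective sizes for common llama.cpp quantization formats (the exact numbers vary slightly by model), and real usage is higher once the KV cache and activation buffers are added:

```python
# Rough VRAM estimate for model weights alone: parameters * bits-per-weight / 8.
# The bits-per-weight figures are approximate effective values for common
# llama.cpp quantization formats; actual usage is higher once the KV cache
# and activation buffers are counted.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Approximate weight memory in GB (decimal) for a given quantization."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weight_vram_gb(7, quant):.1f} GB")
```

For a 7B model this reproduces the figures above: about 14 GB at FP16 and a little over 4 GB at Q4_K_M.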

Model List

Llama 3.2 8X3B MOE Dark Champion Instruct Uncensored Abliterated 18.4B GGUF

DavidAU · 3B

75.5K 511

A creative frankenmerge Mixture-of-Experts model built by DavidAU from eight copies of Llama 3.2 3B, totaling 18.4 billion parameters with only a fraction active per token. This abliterated, uncensored MoE is specifically designed for roleplay, creative writing, and storytelling without content restrictions. As a custom MoE architecture, this model offers an unusual tradeoff: it provides the diversity and capacity of a much larger model while keeping per-token compute closer to a 3B model. The GGUF format makes it straightforward to run locally. Best suited for creative and narrative use cases rather than factual or analytical tasks. Expect imaginative and unrestricted outputs with the distinctive character of community-built experimental merges.

Chat · Roleplay

Qwen2.5 0.5B Instruct

Alibaba · 494M

6.1M 483

Qwen2.5 0.5B Instruct is the smallest instruction-tuned model in Alibaba Cloud's Qwen 2.5 family, with just 494 million parameters. It is designed for ultra-lightweight deployment scenarios where minimal hardware resources are available, running comfortably on virtually any modern GPU or even CPU-only configurations. Despite its tiny footprint, the model supports a 128K token context window and can handle basic chat, simple summarization, and lightweight instruction following. It is primarily useful for edge deployment, experimentation, and prototyping where model size is a critical constraint. Released under the Apache 2.0 license.

Chat

NVIDIA Nemotron Nano 9B v2

NVIDIA · 8.9B

308.0K 482

NVIDIA Nemotron Nano 9B v2 is a compact yet capable chat model from NVIDIA, packing 8.9 billion parameters into a size that runs comfortably on a wide range of consumer GPUs. Built on NVIDIA's Nemotron architecture, it delivers strong instruction-following and conversational performance while keeping VRAM requirements modest. This second-generation Nano model reflects NVIDIA's push to make high-quality language models accessible on local hardware. It's an excellent starting point for users who want a responsive, general-purpose assistant without needing top-tier GPU memory.

Chat

Deepseek Coder 6.7B Instruct

DeepSeek · 6.7B

127.0K 481

DeepSeek Coder 6.7B Instruct is a first-generation code-specialized model trained on a large corpus of source code and programming-related data. At 6.7 billion parameters, it provides solid code completion, generation, and explanation capabilities across popular programming languages while remaining small enough to run on most consumer GPUs. While newer models in the DeepSeek lineup have surpassed it in raw capability, this model remains a practical choice for users who need a lightweight local coding assistant with minimal hardware requirements. It runs well on GPUs with as little as 6 GB of VRAM when quantized.

Chat · Code

OpenAi GPT OSS 20B Abliterated Uncensored NEO Imatrix GGUF

DavidAU · 20B

82.1K 465

A GGUF-quantized, abliterated version of OpenAI's GPT-OSS 20B, processed by DavidAU using imatrix quantization for improved quality at lower bit depths. Based on huihui-ai's uncensored abliteration of the original model, this 20-billion-parameter variant removes built-in refusal behaviors while preserving the model's general capabilities. The imatrix quantization technique uses importance-weighted calibration data to minimize quality loss during compression, making this a well-optimized package for local inference. Suitable for users who want an unrestricted general-purpose assistant model at the 20B scale. Runs well on GPUs with 12 to 16 GB of VRAM depending on quantization level.

Chat · Code · Reasoning

Apertus 8B Instruct 2509

swiss-ai · 8B

117.9K 439

Apertus 8B Instruct is an open-source instruction-tuned model from Swiss AI, a collaborative research initiative. Built on an 8 billion parameter base, it emphasizes transparency, open data, and European AI sovereignty. For local users, it delivers solid general-purpose chat and instruction-following in a standard 8B footprint that runs well on consumer GPUs with 8 to 10 GB of VRAM, making it a practical choice for those who value open, community-driven model development.

Chat

Qwen3 1.7B

Alibaba · 1.7B

7.0M 427

Qwen3 1.7B is a 1.7-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 series. It is a lightweight model designed for deployment on minimal hardware, including low-VRAM GPUs and even CPU-only configurations with acceptable latency. Despite its compact size, it supports hybrid thinking mode and handles basic conversational tasks, simple question answering, and text generation. The model is useful for edge deployment, embedded applications, and scenarios where fast inference with minimal resource consumption is the priority. It represents a significant quality improvement over Qwen 2.5 at the sub-2B scale. Released under the Apache 2.0 license.

Chat

Sqlcoder 7B 2

defog · 6.7B

76.7K 421

SQLCoder 7B 2 is a 6.7-billion-parameter model from Defog, purpose-built for converting natural-language questions into SQL queries. Fine-tuned specifically on text-to-SQL tasks, it consistently outperforms much larger general-purpose models when the job is generating accurate, executable SQL against real database schemas. For developers and data analysts who regularly query databases, running SQLCoder locally means fast, private SQL generation without sending proprietary schema details to an external API. It works best when provided with table definitions as context and is particularly strong on PostgreSQL, MySQL, and SQLite dialects.

Chat · Code
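Since text-to-SQL models like this one work best with the table definitions supplied in the prompt, here is a minimal sketch of assembling such a prompt. The template and helper name are illustrative only; check the SQLCoder model card for the exact prompt format the model was fine-tuned on:

```python
# Build a text-to-SQL prompt that includes the table definitions as context.
# This template is a generic illustration, not Defog's exact training format.
def build_sql_prompt(question: str, schema_ddl: str) -> str:
    return (
        "### Task\n"
        f"Generate a SQL query to answer the following question: {question}\n\n"
        "### Database Schema\n"
        f"{schema_ddl}\n\n"
        "### SQL\n"
    )

schema = """CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    total NUMERIC,
    created_at TIMESTAMP
);"""

prompt = build_sql_prompt("What was total revenue last month?", schema)
print(prompt)
```

The key point is that the model sees the actual `CREATE TABLE` statements, so the generated SQL references real column names instead of guessed ones.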

Qwen2.5 3B Instruct

Alibaba · 3.1B

7.0M 415

Qwen2.5 3B Instruct is a 3.1-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 family. It is designed for efficient local inference on consumer hardware, supporting a 128K token context window despite its compact footprint. The model can run on GPUs with as little as 4 GB of VRAM when quantized. Even at this scale, Qwen2.5 3B Instruct delivers competitive performance for basic conversational tasks, summarization, and simple instruction following. It is a good option for edge deployment and resource-constrained environments. Released under the Apache 2.0 license.

Chat

Medgemma 27B Text IT

Google · 27B

58.5K 411

Google MedGemma 27B Text IT is a 27-billion parameter instruction-tuned model specialized for the medical domain, built on the Gemma architecture by Google. It is fine-tuned on medical and clinical text data to provide improved performance on healthcare-related tasks such as medical question answering, clinical reasoning, and health information summarization. The model requires a GPU with at least 24 GB of VRAM for quantized inference. Its domain specialization makes it notably more capable than general models on clinical benchmarks, though it should not be used as a substitute for professional medical advice. Released under the Gemma license.

Chat

Llama 3.1 70B

Meta · 70.6B

76.7K 410

Meta Llama 3.1 70B is a 70.6-billion parameter base (pretrained) model from the Llama 3.1 family. It supports a 128K token context window and was trained on a massive multilingual corpus. As a base model, it is designed for fine-tuning and research rather than direct conversational use. The model serves as the foundation for the Llama 3.1 70B Instruct variant and numerous community fine-tunes. It delivers strong performance across language understanding and generation benchmarks. Released under the Llama 3.1 Community License.

Chat

Qwen3 235B A22B Thinking 2507

Alibaba · 235B

53.5K 399

Qwen3 235B A22B Thinking 2507 is the reasoning and chain-of-thought variant of Alibaba's largest Qwen3 mixture-of-experts model, updated in July 2025. With 235 billion total parameters and about 22 billion active per forward pass, it represents the pinnacle of Qwen3's reasoning capabilities. This model excels at complex multi-step problems, mathematical reasoning, code analysis, and tasks requiring deep logical thinking. It demands serious hardware to run locally, but for users with multi-GPU setups, it offers reasoning performance that rivals the best proprietary models while keeping all computation on your own machines.

Chat

Hermes 3 Llama 3.1 8B

Nous Research · 8.0B

471.6K 393

Hermes 3 Llama 3.1 8B is an 8-billion parameter instruction-tuned model by Nous Research, built on Meta's Llama 3.1 8B base. It is fine-tuned for advanced instruction following, multi-turn conversation, structured output, and creative roleplay scenarios. The Hermes series is known for producing highly steerable models that respond well to system prompts. This model supports a 128K token context window inherited from the Llama 3.1 architecture and runs efficiently on consumer GPUs with 8 GB or more of VRAM. It is a popular choice among local inference enthusiasts who value strong instruction adherence and versatile conversational ability.

Chat · Roleplay

Kimi Dev 72B

Moonshot AI · 72B

2.9K 384

Kimi Dev 72B is Moonshot AI's developer-focused model built on the Qwen2.5-72B architecture, specifically optimized for coding tasks, tool use, and agentic workflows. It combines strong general-purpose chat abilities with specialized developer capabilities, making it a compelling choice for software engineering assistance. At 72 billion parameters it requires substantial hardware, typically needing 40+ GB of VRAM at 4-bit quantization, which puts it in reach of dual consumer GPU setups or single professional cards like the A100 or RTX 6000 Ada. If you are primarily looking for a local coding assistant with strong reasoning skills, Kimi Dev is a top-tier option in the 70B class.

Chat · Code

Qwen2.5 0.5B

Alibaba · 494M

1.5M 381

Qwen2.5 0.5B is the smallest base (pretrained) model in Alibaba Cloud's Qwen 2.5 family, with 494 million parameters. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a foundation for custom applications. It supports a 128K token context window. Its minimal size makes it suitable for experimentation, rapid prototyping, and resource-constrained fine-tuning tasks. The model can run on virtually any hardware. Released under the Apache 2.0 license.

Chat

Qwen3 14B

Alibaba · 14B

2.7M 380

Qwen3 14B is a 14-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 series. It occupies a practical middle ground in the Qwen 3 lineup, offering stronger reasoning and generation quality than the 8B variant while remaining manageable on GPUs with 16 GB or more of VRAM in quantized formats. The model supports hybrid thinking mode for flexible reasoning depth. Qwen3 14B is well suited for chat, instruction following, coding assistance, and multilingual tasks. It benefits from the generational improvements of Qwen 3 in pretraining data and alignment techniques, delivering performance that competes with larger models from previous generations. Released under the Apache 2.0 license.

Chat

Gpt2 Xl

OpenAI · 1.6B

200.1K 376

GPT-2 XL is the largest variant of the GPT-2 family at 1.6 billion parameters, representing the full release of the model OpenAI originally withheld over safety concerns in 2019. It produces the most coherent and capable outputs of the GPT-2 lineup, though it remains far behind modern multi-billion-parameter instruction-tuned models. At its size, GPT-2 XL still runs easily on most consumer GPUs and even on CPUs with reasonable speed, making it useful for experimentation, fine-tuning projects, and as a baseline for comparing against newer architectures. It requires roughly 3 GB of VRAM at full precision.

Chat

Qwen3 30B A3B Thinking 2507

Alibaba · 30B

1.1M 368

Qwen3 30B A3B Thinking 2507 is the reasoning-focused variant of Alibaba's 30-billion-parameter mixture-of-experts model, updated in July 2025. Like its instruct sibling, it activates only about 3 billion parameters per token, keeping resource demands low while enabling multi-step reasoning and chain-of-thought problem solving. This thinking variant is designed for tasks that benefit from deliberate, step-by-step logic such as math, coding puzzles, and analytical questions. Its efficient MoE design means users with modest GPUs can still access strong reasoning capabilities without needing datacenter-class hardware.

Chat

Llama 7B

huggyllama · 6.7B

152.1K 356

This is a community reupload of Meta's original Llama 1 7B model, published by the huggyllama account on Hugging Face. The original Llama 1 was a 6.7-billion parameter base model released by Meta in early 2023, trained on 1 trillion tokens of publicly available data. It pioneered the wave of open-weight large language models. As a first-generation Llama model, it has been superseded by Llama 2 and Llama 3 in terms of quality and capability. It remains of historical and research interest as the model that catalyzed the open-source LLM ecosystem. This upload provides convenient access in Hugging Face Transformers format.

Chat

Qwen2.5 32B Instruct

Alibaba · 32B

2.8M 337

Qwen2.5 32B Instruct is a 32-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 family. It occupies a practical sweet spot between the 14B and 72B variants, offering strong reasoning and multilingual capabilities while remaining feasible to run on a single high-end consumer GPU with 24 GB or more of VRAM at reduced precision. The model supports a 128K token context window and is optimized for conversational use, instruction following, and structured output generation. It is a popular choice for local inference when the 72B model is too demanding but users need more capability than the 14B variant. Released under the Apache 2.0 license.

Chat

LFM2 8B A1B

LiquidAI · 8.3B

62.9K 335

LFM2 8B A1B is Liquid AI's larger mixture-of-experts model, combining the company's novel hybrid architecture with approximately 8 billion total parameters. It uses a MoE design to keep active compute per token low while maintaining strong general performance across chat and reasoning tasks. For local users, it offers an intriguing alternative to conventional 8B transformers, with Liquid AI's architecture promising improved efficiency and throughput on consumer-grade hardware.

Chat

Meta Llama 3.1 8B Instruct GGUF

Bartowski · 8B

292.6K 329

This is a GGUF-quantized version of Meta's Llama 3.1 8B Instruct, repackaged by Bartowski. Llama 3.1 8B Instruct is one of the most popular open-weight models available, offering strong general-purpose instruction following, reasoning, and multilingual capabilities in a highly efficient 8-billion-parameter package. Bartowski's GGUF conversion makes this model ready to use with llama.cpp and compatible frontends like Ollama, LM Studio, and KoboldCpp. At 8B parameters, it strikes an excellent balance between quality and hardware requirements, running well on modern consumer GPUs with 8 GB or more of VRAM, and even on CPU for users willing to trade speed for accessibility.

Chat

Qwen2.5 14B Instruct

Alibaba · 14B

2.0M 322

Qwen2.5 14B Instruct is a 14-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and provides a balanced tradeoff between quality and hardware requirements, running well on GPUs with 16 GB of VRAM in quantized formats. The model is fine-tuned for chat, instruction following, and general-purpose assistant tasks. It performs well across reasoning, coding, and multilingual benchmarks for its size class, making it a practical option for local deployment when larger models are not feasible. Released under the Apache 2.0 license.

Chat

NVIDIA Nemotron 3 Nano 30B A3B FP8

NVIDIA · 31.6B

1.5M 305

NVIDIA Nemotron 3 Nano 30B A3B FP8 is the FP8-quantized version of NVIDIA's 31.6 billion parameter mixture-of-experts model. The 8-bit floating point format reduces memory requirements compared to BF16 while retaining strong output quality, making it a practical choice for GPUs with tighter VRAM budgets. With only about 3 billion parameters active per token, this model already runs efficiently. The FP8 quantization pushes the memory savings further without meaningful degradation, making it one of the best options for users who want MoE-class performance on mainstream hardware.

Chat

SmolLM2 135M Instruct

Hugging Face · 135M

763.5K 301

SmolLM2 135M Instruct is the instruction-tuned variant of Hugging Face's 135-million-parameter SmolLM2 model. Fine-tuned to follow user prompts and engage in basic conversational exchanges, it delivers surprisingly coherent responses given its minimal size, making it ideal for testing chat interfaces or running on extremely constrained devices. This model is a practical choice when you need an instruction-following model that fits comfortably in under 1 GB of memory. It works well for simple question answering, text reformatting, and lightweight assistant tasks where response quality can be traded for instant inference speed.

Chat

Qwen3 30B A3B Instruct 2507 GGUF

Unsloth · 30B

50.4K 297

Qwen3 30B A3B Instruct 2507 GGUF is Unsloth's GGUF conversion of Alibaba's updated 30-billion-parameter mixture-of-experts instruct model from July 2025. With only about 3 billion parameters active per token, it delivers quality well above its active-compute class while keeping inference fast on consumer hardware. The GGUF packaging makes it ready to run with llama.cpp and compatible frontends such as Ollama and LM Studio, with a range of quantization levels to fit different VRAM budgets.

Chat

DeepSeek R1 Distill Llama 8B GGUF

Unsloth · 8B

52.5K 296

DeepSeek R1 Distill Llama 8B GGUF is Unsloth's GGUF build of DeepSeek's R1 distillation onto the Llama 3.1 8B base. The model was fine-tuned on reasoning traces generated by DeepSeek R1, bringing chain-of-thought style problem solving to an 8-billion-parameter package that runs comfortably on consumer GPUs when quantized. It is a popular entry point for local reasoning workloads on modest hardware.

Chat · Reasoning

Qwen3 14B Claude 4.5 Opus High Reasoning Distill GGUF

TeichAI · 14B

105.9K 291

A GGUF-quantized distillation of Qwen3 14B trained on reasoning traces from Claude 4.5 Opus by TeichAI. At 14 billion parameters, this model sits in a sweet spot for users with mid-range GPUs who want improved reasoning without the memory demands of larger models. The distillation process targets high-quality chain-of-thought and analytical outputs. The smaller parameter count compared to 27B or 70B alternatives means faster inference and lower VRAM requirements, making it accessible on GPUs with 12 to 16 GB of memory at common quantization levels. A good option for users who need capable reasoning on a budget and are willing to trade some depth for speed and efficiency.

Chat · Reasoning

Openai GPT

OpenAI · 120M

231.9K 288

OpenAI GPT is the original 2018 transformer-based language model that started the GPT lineage, based on the paper "Improving Language Understanding by Generative Pre-Training." At just 120 million parameters, it is a historically significant model that demonstrated the power of unsupervised pretraining followed by supervised fine-tuning. This model is primarily of academic and historical interest today. It runs on essentially any hardware and can be useful for educational exploration of transformer architectures, but it should not be compared to modern instruction-tuned models in terms of practical capability.

Chat

Llama Guard 3 8B

Meta · 8.0B

180.6K 278

Meta Llama Guard 3 8B is an 8-billion parameter safety classifier model built on the Llama 3.1 architecture. Unlike general-purpose chat models, Llama Guard is specifically designed to classify whether prompts or responses contain unsafe content across categories such as violence, sexual content, criminal planning, and other policy violations. The model is intended to be used as a moderation layer in LLM-based applications, providing input and output safety filtering. It follows a taxonomy-based classification approach and can be customized for different safety policies. Released under the Llama 3.1 Community License.

Chat