All LLM Models

Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
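The rule of thumb above can be sketched with back-of-envelope arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8, and the KV cache and runtime overhead add a few GB on top. The bits-per-weight figures below are approximate averages for common GGUF quantization levels, so treat this as a sketch rather than an exact calculator:

```python
# Approximate average bits per weight for common formats; quantized GGUF
# levels carry a little metadata, so their averages sit above the nominal bit count.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}

def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Rough decimal-GB estimate for the model weights alone (no KV cache)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in ("FP16", "Q4_K_M"):
    print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

This reproduces the figures quoted above: a 7B model needs about 14 GB at FP16 and a little over 4 GB at Q4_K_M, before accounting for context-length-dependent KV cache.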

Model List

Qwen2.5 Coder 1.5B

Alibaba · 1.5B

584.8K 85

Qwen2.5 Coder 1.5B is a 1.5-billion parameter code-specialized model from Alibaba Cloud's Qwen 2.5 Coder series. It is the smallest Coder variant that balances meaningful code generation capability with extremely low resource requirements, running on GPUs with as little as 2-4GB of VRAM. The model is suitable for lightweight code completion, simple code generation tasks, and as a compact local coding assistant in resource-constrained environments. It supports a 128K token context window. Released under the Apache 2.0 license.

Chat · Code

Gemma 2 2B IT GGUF

Bartowski · 2B

860.5K 85

This is a GGUF-quantized version of Google's Gemma 2 2B IT, repackaged by Bartowski. Gemma 2 2B IT is a lightweight instruction-tuned model from Google's Gemma 2 family, designed for efficient on-device inference while maintaining strong performance on conversational and instruction-following tasks. With only 2 billion parameters, this is one of the smallest capable instruction-tuned models available. The GGUF format makes it compatible with llama.cpp and its ecosystem of frontends, and the small size means it can run comfortably on virtually any modern hardware, including systems with limited VRAM or even CPU-only setups.

Chat

Qwen3 30B A3B FP8

Alibaba · 30B

87.6K 82

Qwen3 30B A3B FP8 is the FP8 precision version of Alibaba's 30-billion-parameter mixture-of-experts model with approximately 3 billion active parameters per token. FP8 provides a good balance between quantization efficiency and output quality, sitting between full precision and more aggressive INT4 formats. This variant is aimed at users who want near-original model quality with meaningful memory savings. The MoE architecture keeps per-token compute low, and FP8 halves the weight footprint relative to BF16, but at roughly 30 GB of weights the model still calls for a high-VRAM workstation GPU, a multi-GPU setup, or partial offloading of expert weights to system RAM.

Chat
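The memory-versus-compute split of a mixture-of-experts model like the one above can be made concrete: all expert weights must be resident in memory, but only the active subset participates in each token's forward pass. A rough sketch for a 30B-total / 3B-active model at FP8 (using the common ~2 FLOPs per active parameter per token approximation, which is an estimate, not a measured figure):

```python
# MoE back-of-envelope: total parameters set the weight memory, active
# parameters set the per-token compute.
total_params = 30e9   # all experts must fit in memory
active_params = 3e9   # only these run per token
bytes_per_param = 1   # FP8 stores one byte per weight

weight_gb = total_params * bytes_per_param / 1e9
flops_per_token = 2 * active_params  # ~2 FLOPs per active parameter

print(f"weights: ~{weight_gb:.0f} GB, compute: ~{flops_per_token / 1e9:.0f} GFLOPs/token")
```

This is why MoE models feel fast for their size: per-token compute matches a small dense model, while memory demand still tracks the full parameter count.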

Qwen2.5 0.5B Instruct GGUF

Alibaba · 0.5B

67.5K 81

Qwen2.5 0.5B Instruct is the smallest instruction-tuned model in Alibaba's Qwen2.5 series, offered in official GGUF format. With just 500 million parameters it is designed for extremely resource-constrained environments, running on virtually any modern CPU without a dedicated GPU and consuming minimal RAM. Despite its tiny footprint, the 0.5B variant can handle simple question answering, short text generation, and basic classification tasks. It is ideal for experimentation, edge deployment, or as an always-on local model where speed and low resource usage matter more than peak output quality.

Chat

Gemma 3 1B IT GGUF

Unsloth · 1B

53.3K 80

A GGUF-quantized version of Google's Gemma 3 1B Instruct-Tuned, repackaged by Unsloth. At 1 billion parameters, this model sits in the lightweight tier and can run comfortably on virtually any modern hardware, including older GPUs and even CPU-only setups. It offers a meaningful step up from the 270M variant in coherence and instruction following, making it a practical option for simple chat tasks, summarization, and local prototyping where speed and low resource usage matter more than peak quality.

Chat

Salamandra 7B Instruct

BSC-LT · 7.8B

70.3K 76

Salamandra 7B Instruct is a 7.8-billion-parameter multilingual model developed by the Barcelona Supercomputing Center (BSC-LT) as part of a European initiative to build high-quality open language models. It has particular strength in Iberian languages including Spanish, Catalan, Portuguese, and Basque, while also supporting English and other major European languages. This model is an excellent choice for users who need strong performance in Spanish or other Iberian languages that are often underserved by mainstream LLMs. Running it locally ensures data privacy for sensitive multilingual workflows, and at 7B parameters it fits comfortably on a single consumer GPU with 8 GB or more of VRAM.

Chat

Amber

LLM360 · 6.7B

53.0K 72

Amber is a 6.7 billion parameter model from LLM360, an initiative dedicated to full transparency in large language model training. Every aspect of Amber's creation has been publicly documented and released, including the complete training data, all intermediate checkpoints, training code, and evaluation results. This level of openness makes Amber uniquely valuable for researchers studying training dynamics, data influence, and model behavior at scale. For local deployment, it offers solid general-purpose text generation at a size that fits comfortably on mid-range consumer GPUs, though users primarily seeking chat performance may prefer models specifically tuned for instruction following.

Chat

Qwen3.5 397B A17B NVFP4

NVIDIA · 397B

116.1K 71

Qwen3.5 397B A17B NVFP4 is NVIDIA's NVFP4-quantized version of Alibaba's enormous Qwen3.5, a 397 billion parameter mixture-of-experts model with 17 billion active parameters per token. Even with aggressive quantization, this is one of the largest models you can attempt to run locally. This model represents the cutting edge of what's possible for local inference. The MoE architecture keeps per-token compute manageable despite the massive parameter count, and NVIDIA's NVFP4 quantization brings memory requirements down from utterly impossible to merely ambitious. Multi-GPU setups with substantial VRAM are essential, but the reward is near-frontier intelligence running entirely on your own machines.

Chat

Mamba 130M HF

State Spaces · 129M

179.2K 69

Mamba 130M is a state-space model developed by State Spaces that offers a fundamentally different architecture from the Transformer-based models that dominate the LLM landscape. Using selective state-space layers instead of attention, Mamba achieves linear-time inference scaling with sequence length, making it particularly efficient for processing long inputs. At 130 million parameters this is primarily a research and demonstration model, but it showcases the potential of state-space architectures for local deployment. Users interested in exploring alternatives to Transformer-based language models will find Mamba 130M a lightweight and accessible entry point for experimentation.

Chat
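The linear-versus-quadratic scaling mentioned above can be illustrated with a toy cost model (illustrative proportionality only, ignoring constant factors and per-layer dimensions): self-attention does work proportional to the square of sequence length, while a selective state-space layer does work proportional to sequence length.

```python
# Toy scaling comparison: attention is O(L^2) in sequence length L,
# a state-space (SSM) layer is O(L). Constants are omitted deliberately.
def attention_cost(seq_len: int) -> int:
    return seq_len ** 2

def ssm_cost(seq_len: int) -> int:
    return seq_len

for L in (1_000, 10_000, 100_000):
    ratio = attention_cost(L) / ssm_cost(L)
    print(f"L={L:>7}: attention/SSM cost ratio ~{ratio:,.0f}x")
```

The gap grows linearly with context length, which is why state-space models are attractive for very long inputs.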

Qwen3 1.7B Base

Alibaba · 1.7B

336.3K 65

Qwen3 1.7B Base is a 1.7-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 family. It is a compact base model designed for fine-tuning, research, and custom applications rather than direct conversational use. Its small size makes it accessible for resource-constrained fine-tuning and rapid experimentation. The model can run on virtually any modern GPU and benefits from the improved pretraining data of the Qwen 3 generation. It is suitable as a lightweight foundation for domain-specific fine-tunes and student models in distillation pipelines. Released under the Apache 2.0 license.

Chat

Qwen3 1.7B GGUF

Unsloth · 1.7B

66.5K 65

Chat

Llama XLAM 2 8B Fc R

Salesforce · 8B

64.1K 59

xLAM 2 8B FC-R is an 8-billion parameter model by Salesforce, specifically optimized for function calling and tool use. Built on the Llama architecture, it is designed to reliably generate structured function call outputs, making it suitable for agentic workflows and applications that require models to interact with external tools and APIs. Unlike general-purpose chat models, xLAM 2 focuses on accurately parsing user intent into structured tool invocations with proper argument formatting. It runs on consumer GPUs with 8GB or more of VRAM and is a strong choice for developers building local AI agent systems that need reliable function-calling capabilities.

Chat · Functions
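The structured function-call outputs described above are typically JSON objects naming a tool and its arguments. A minimal sketch of the receiving side, using a hypothetical tool name and schema (not from xLAM's documentation) to show how an agent framework might parse and validate such output before dispatching:

```python
import json

# Hypothetical example of the structured output a function-calling model
# might emit; the tool name and arguments here are invented for illustration.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Parse a model's JSON tool call and do minimal shape validation."""
    call = json.loads(raw)
    if "name" not in call or "arguments" not in call:
        raise ValueError("not a well-formed tool call")
    return call["name"], call["arguments"]

name, args = parse_tool_call(model_output)
print(name, args)  # an agent loop would dispatch to the real function here
```

Models tuned for function calling, like xLAM 2, are valued precisely because this parse step rarely fails: the output is consistently valid JSON with correctly typed arguments.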

Dolphin 2.9.1 Yi 1.5 34B

dphn · 34.4B

4.7M 57

Dolphin 2.9.1 Yi 1.5 34B is a 34.4-billion parameter chat model created by Eric Hartford's Dolphin project, fine-tuned from 01.AI's Yi 1.5 34B base. The Dolphin series is known for producing uncensored fine-tunes that remove alignment-based refusals, giving users more direct and unrestricted model responses. This model combines the strong bilingual capabilities of Yi 1.5 with Dolphin's open fine-tuning approach. It requires a GPU with at least 24GB of VRAM for quantized local inference and is popular among users who prefer models without built-in content restrictions.

Chat

Qwen3 Next 80B A3B Thinking NVFP4

NVIDIA · 80B

128.7K 52

Qwen3 Next 80B A3B Thinking NVFP4 is NVIDIA's quantized version of Alibaba's Qwen3 Next 80B, a mixture-of-experts model with thinking capabilities and only 3 billion active parameters per token. The NVFP4 format significantly reduces memory requirements, bringing this 80B model within reach of high-end consumer hardware. The thinking mode enables explicit chain-of-thought reasoning, where the model works through problems step by step before delivering its answer. Combined with the MoE efficiency of activating just 3B parameters at a time, this model offers an unusual combination of deep reasoning and fast inference.

Chat

Qwen3 30B A3B GPTQ Int4

Alibaba · 30.5B

104.7K 49

Qwen3 30B A3B GPTQ Int4 is a GPTQ INT4 quantized version of Alibaba's 30.5-billion-parameter mixture-of-experts model. The aggressive INT4 quantization combined with the MoE architecture's low active parameter count makes this one of the most memory-efficient ways to run a 30B-class model locally. With only about 3 billion parameters active per token and weights compressed to 4-bit precision, the full model comes to roughly 16 GB of weights, so it fits on 16 GB-class GPUs, or on smaller cards when expert weights are offloaded to system RAM. It is an excellent option for users who want to maximize model capability on budget hardware, though some quality degradation compared to higher-precision formats is expected.

Chat

Jan v3 4B Base Instruct GGUF

janhq · 4B

315.6K 48

Jan v3 4B Base Instruct is a 4-billion-parameter model from the Jan AI project, provided in GGUF format for local deployment. Built on the Menlo Jan-v3-4B architecture, it is designed as a capable small assistant for both code and general chat, balancing helpfulness with a compact size that runs on modest consumer hardware. This model is a solid option for users exploring the Jan ecosystem or anyone who wants a lightweight local assistant that handles coding questions and everyday conversation in a single package. Its small parameter count keeps memory usage low, making it viable on laptops and entry-level desktop GPUs alike.

Chat · Code

Qwen2.5 Coder 0.5B

Alibaba · 494M

63.5K 46

Qwen2.5 Coder 0.5B is a 494-million parameter code-specialized model from Alibaba Cloud, the smallest in the Qwen 2.5 Coder series. It is designed for ultra-lightweight deployment where code-aware text generation is needed with minimal hardware resources. The model runs on virtually any GPU and even on CPU-only setups. While limited in capability compared to larger coding models, it is useful for basic code completion, prototyping, and experimentation. It supports a 128K token context window. Released under the Apache 2.0 license.

Chat · Code

Qwen3 Coder 30B A3B Instruct AWQ 4bit

cyankiwi · 5.3B

142.8K 44

An AWQ 4-bit quantized version of Alibaba's Qwen3 Coder 30B A3B Instruct, repackaged by cyankiwi. This mixture-of-experts coding model has 30 billion total parameters but activates only around 3 billion per token, for roughly 5.3 billion effective parameters during inference. The AWQ quantization format is optimized for GPU inference and preserves model quality well at 4-bit precision. Designed specifically for code generation tasks, this model handles completion, debugging, refactoring, and explanation across many programming languages. The combination of sparse activation and 4-bit quantization makes it remarkably efficient to run: the quantized weights come to roughly 16 GB, so a 16 GB-class GPU (or a smaller card with expert offloading to system RAM) can host it, while its coding performance punches well above what the effective parameter count might suggest.

Chat · Code

Gemma 3 12B IT GGUF

LM Studio Community · 12B

62.0K 42

Vision

Llama 3.3 70B Instruct Awq

Casper Hansen · 70.6B

849.3K 41

This is an AWQ-quantized version of Meta's Llama 3.3 70B Instruct, repackaged by Casper Hansen. Llama 3.3 70B Instruct is one of the most capable open-weight models available, delivering performance competitive with much larger models across reasoning, coding, math, and multilingual tasks. Casper Hansen's AWQ (Activation-aware Weight Quantization) conversion reduces memory requirements while preserving model quality, making this 70.6-billion-parameter model more accessible for local deployment. AWQ quantization is designed for GPU inference and works with frameworks like vLLM and AutoAWQ. Running this model still requires substantial VRAM, but the quantization brings it within reach of high-end consumer or professional multi-GPU setups.

Chat

Qwen2.5 72B Instruct Abliterated

huihui-ai · 72.7B

179.7K 40

An abliterated (uncensored) version of Alibaba's Qwen2.5 72B Instruct, modified by huihui-ai. Abliteration is a technique that removes or weakens the model's built-in refusal mechanisms and safety guardrails, resulting in a model that is more willing to respond to a broader range of prompts without declining. The base Qwen2.5 72B Instruct is one of Alibaba's flagship open models at 72.7 billion parameters. This is a full-precision or minimally modified version of the weights, so running it locally requires substantial VRAM, typically 40GB or more even with quantization applied on top. Users interested in this model should understand that abliterated models lack standard safety filtering and should be used responsibly. The underlying Qwen2.5 72B architecture delivers strong performance across reasoning, coding, writing, and multilingual tasks.

Chat

Meta Llama 3.1 70B Instruct GGUF

MaziyarPanahi · 70B

114.5K 40

Chat

Pythia 160M

EleutherAI · 160M

2.6M 39

Pythia 160M is part of EleutherAI's Pythia training suite, a collection of models trained on the same data in the same order at multiple scales to enable rigorous scientific research into how language models learn. At 160 million parameters, it is the smallest model in the suite and runs on virtually any hardware. This model is primarily valuable for researchers studying scaling laws, training dynamics, and emergent capabilities across model sizes. EleutherAI released full training checkpoints, data, and code, making Pythia 160M one of the most transparent and reproducible models available for academic study.

Chat

Qwen3 Coder Next FP8 Dynamic

Unsloth · 79.7B

72.7K 37

An FP8 dynamic quantized version of Alibaba's Qwen3 Coder Next, repackaged by Unsloth. At 79.7 billion parameters, this is a large code-focused model that benefits substantially from dynamic FP8 quantization, which reduces memory requirements while preserving strong code generation quality across many programming languages. Qwen3 Coder Next represents Alibaba's latest generation of specialized coding models, with strong performance on code completion, generation, debugging, and explanation tasks. The FP8 dynamic format offers a good balance between model fidelity and memory savings, though you will still need a high-VRAM GPU or multi-GPU setup to run this model locally.

Chat · Code

GPT OSS 20B Unsloth Bnb 4bit

Unsloth · 20.9B

202.1K 37

This is a BitsAndBytes 4-bit quantized version of OpenAI's GPT-OSS 20B, prepared by Unsloth. GPT-OSS 20B is OpenAI's open-source model, and this 4-bit quantization dramatically reduces its memory footprint for local inference and fine-tuning. The BitsAndBytes (BnB) 4-bit format is designed for use with the Hugging Face Transformers ecosystem and is particularly well-suited for QLoRA fine-tuning workflows. At 20.9 billion parameters compressed to 4-bit precision, this variant makes the model accessible on consumer GPUs while retaining strong performance for users who want to run or customize OpenAI's open-source offering locally.

Chat

GPT OSS 20B MXFP4 Q8

MLX Community · 20B

542.2K 36

An MLX-optimized MXFP4-Q8 quantized version of OpenAI's GPT-OSS 20B, converted by MLX Community for Apple Silicon Macs. This model uses a mixed-precision quantization scheme with MXFP4 weights and Q8 attention, designed to maximize performance on Apple's unified memory architecture while keeping the memory footprint manageable. GPT-OSS 20B is OpenAI's open-source entry at 20 billion parameters, and this MLX conversion makes it straightforward to run natively on M-series Macs without any CUDA dependency. Users with 32GB or more of unified memory should be able to run this model comfortably for general-purpose chat, writing, and reasoning tasks.

Chat

MiniMax M2.5 BF16 INT4 AWQ

mratsim · 39.1B

56.0K 34

An AWQ INT4 quantization of MiniMax M2.5 prepared by mratsim, featuring 39.1 billion effective parameters in a Mixture-of-Experts architecture. This model supports chat, code generation, and function calling, making it a versatile general-purpose assistant. The AWQ quantization is optimized for GPU inference with minimal quality loss. As an MoE model, M2.5 activates only a subset of its parameters per token, offering strong performance relative to its total size while keeping inference costs manageable. The INT4 AWQ format requires GPU inference with compatible frameworks like vLLM or AutoAWQ. Expect to need 24 GB or more of VRAM. A solid choice for users with high-end GPUs who want a capable all-rounder with function calling support.

Chat · Code · Functions

GPT OSS 20B BF16

Unsloth · 20.9B

128.2K 32

This is a BFloat16-precision repack of OpenAI's GPT-OSS 20B, prepared by Unsloth. GPT-OSS 20B is OpenAI's open-source model release, and this BF16 version preserves the full model quality without any lossy quantization. At 20.9 billion parameters in BF16 precision, this variant requires substantial VRAM to run but delivers the highest fidelity to the original model weights. It is best suited for users with high-end GPUs who want maximum quality for inference or as a starting point for full-precision fine-tuning. The Unsloth repack ensures compatibility with popular training and inference frameworks.

Chat

Qwen3 30B A3B Instruct 2507 AWQ 4bit

cyankiwi · 5.3B

87.0K 31

An AWQ 4-bit quantized version of Alibaba's Qwen3 30B A3B Instruct 2507 (July 2025 release), repackaged by cyankiwi. This general-purpose mixture-of-experts model has 30 billion total parameters with approximately 3 billion activated per token, yielding around 5.3 billion effective parameters. AWQ quantization is optimized for GPU inference and maintains strong output quality at 4-bit precision. The 2507 revision brings updated training from Alibaba, improving the model's instruction following, reasoning, and multilingual capabilities. Thanks to its sparse activation pattern and aggressive quantization, this model runs efficiently on GPUs with limited VRAM while providing versatile general-purpose performance for chat, writing, analysis, and reasoning tasks.

Chat

Qwen3 30B A3B NVFP4

NVIDIA · 15.6B

79.4K 25

Qwen3 30B A3B NVFP4 is NVIDIA's NVFP4-quantized version of Alibaba's Qwen3 30B mixture-of-experts model; the 4-bit NVFP4 format cuts the weight footprint to roughly a quarter of FP16, around 15 to 16 GB. With only 3 billion parameters active per token, it runs remarkably fast for a model of its intelligence class. This is one of the most efficient models available for local deployment. The combination of MoE architecture and NVFP4 quantization delivers 30B-class reasoning and instruction following on hardware that could not handle a dense model of this size. It's an excellent choice for users who want strong performance without top-tier GPUs.

Chat