All LLM Models
Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends mainly on the model size and quantization level. Quantization reduces the precision of model weights, trading a small quality loss for significantly lower VRAM usage. For example, the weights of a 7B-parameter model take ~14 GB at FP16 (2 bytes per parameter) but only ~4 GB at Q4_K_M quantization. The KV cache for the context window adds to these figures.
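As a rough illustration of this arithmetic, weight memory is approximately the parameter count times bits per weight, divided by eight. The sketch below assumes approximate bits-per-weight values; the `BITS_PER_WEIGHT` table and the `weight_vram_gb` helper are illustrative names, not part of any library, and the estimate ignores the KV cache and runtime overhead.

```python
# Approximate bits per weight for a few common formats.
# These values are assumptions for illustration; real files vary by model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def weight_vram_gb(params_billions: float, fmt: str) -> float:
    """Estimate memory for the model weights only, in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[fmt]
    # (params_billions * 1e9 params) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billions * bits / 8


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"7B at {fmt}: ~{weight_vram_gb(7, fmt):.1f} GB")
```

Under these assumptions a 7B model comes out at 14 GB for FP16 and a little over 4 GB for Q4_K_M, matching the figures quoted above.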
Model List
Qwen3.6 35B A3B Uncensored Heretic Native MTP Preserved
llmfan46 · 35.1B · runs from 15.3 GB
Qwen3.6 35B A3B Uncensored Heretic Native MTP Preserved is a 35.1B-parameter open language model from llmfan46 in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
MN 12B Mag Mell R1
inflatebot · 12.2B · runs from 4.1 GB
MN 12B Mag Mell R1 is a 12.2B-parameter open language model from inflatebot. It supports a context window of up to 1,024,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
NVIDIA Nemotron 3 Super 120B A12B BF16
NVIDIA · 123.6B · runs from 34.5 GB
NVIDIA Nemotron 3 Super 120B A12B BF16 is a 123.6B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
LFM2.5 1.2B Instruct
LiquidAI · 1.2B · runs from 0.8 GB
LFM2.5 1.2B Instruct is an instruction-tuned model from Liquid AI that uses a novel hybrid architecture combining state-space models with attention mechanisms. At just 1.2 billion parameters, it is exceptionally lightweight and can run on virtually any hardware, including laptops and edge devices. Liquid AI's unconventional architecture aims to deliver better efficiency and longer context handling than traditional transformer models at this scale, making it an interesting option for users exploring alternatives to standard transformer-based LLMs.
Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop
DavidAU · 12.1B · runs from 5.6 GB
Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop is a 12.1B-parameter open language model from DavidAU in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek R1 Distill Llama 70B
DeepSeek · 70.6B · runs from 20.4 GB
DeepSeek R1 Distill Llama 70B is the largest model in the R1 distillation lineup, combining the reasoning capabilities developed in the full 684.5B R1 with the Llama 3.3 70B Instruct base model. At 70 billion parameters, it delivers the strongest reasoning performance of any dense R1 distill, approaching the full R1's quality on many math and coding benchmarks. Running this model locally requires a multi-GPU setup or a single GPU with very high VRAM capacity; 4-bit quantized versions fit on hardware with 48 GB or more, and more aggressive quantization lowers the floor to about 20 GB. For users who need top-tier open-weight reasoning and have the hardware to support a 70B dense model, this is one of the strongest options available.
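One way to sanity-check the 48 GB figure is to pick the highest-precision format whose weights, plus some headroom, fit a given VRAM budget. This is a minimal sketch: the `FORMATS` list, the 10% overhead allowance, and the `best_fit` helper are all assumptions made for illustration.

```python
from typing import Optional

# Illustrative bits-per-weight values, highest precision first (approximate).
FORMATS = [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]


def best_fit(params_billions: float, vram_gb: float,
             overhead: float = 0.10) -> Optional[str]:
    """Return the highest-precision format whose weights plus overhead fit."""
    for name, bits in FORMATS:
        # Weight size in GB, padded by an assumed allowance for KV cache etc.
        needed_gb = params_billions * bits / 8 * (1 + overhead)
        if needed_gb <= vram_gb:
            return name
    return None  # nothing in the list fits this budget


if __name__ == "__main__":
    print(best_fit(70.6, 48))  # a 70.6B dense model on a 48 GB budget
    print(best_fit(70.6, 24))  # the same model on a 24 GB budget
```

With these assumed values, a 70.6B model on 48 GB resolves to Q4_K_M, while a 24 GB budget returns `None` because the list contains nothing below 4-bit.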
Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled
Jackrong · 27.8B · runs from 8.4 GB
This is the full-precision version of Jackrong's Qwen3.5 27B reasoning distillation from Claude 4.6 Opus. With 27.8 billion parameters in unquantized form, this model preserves the maximum quality from the distillation process but requires significantly more VRAM, typically 56 GB or more in BF16. It is primarily intended for users with professional-grade GPUs or multi-GPU setups. This variant is ideal for further fine-tuning, experimentation, or running at full fidelity when hardware allows. Most users looking to run the model locally for inference should consider the GGUF-quantized version instead, which offers a much better tradeoff between quality and resource usage.
Mistral 7B Instruct v0.1
Mistral AI · 7B · runs from 3.5 GB
Mistral 7B Instruct v0.1 was the first instruction-tuned variant of the original Mistral 7B, fine-tuned for conversational and instruction-following tasks. While it has since been superseded by v0.2 and v0.3, it remains a solid lightweight chat model and an important milestone in the open-weight model ecosystem. Its hardware requirements are identical to the base Mistral 7B, running smoothly on GPUs with as little as 6 GB of VRAM when quantized. Users seeking the best Mistral 7B experience should generally prefer the newer v0.3 release, but v0.1 is still useful for reproducibility and benchmarking purposes.
Qwen3 Next 80B A3B Instruct
Alibaba · 81.3B · runs from 22.8 GB
Qwen3 Next 80B A3B Instruct is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with approximately 81.3 billion total parameters and around 3 billion active parameters per forward pass. This extreme ratio between total and active parameters allows the model to encode extensive knowledge across its expert layers while maintaining very fast per-token inference, making it an unusually efficient design for its capability level. The model is instruction-tuned for general-purpose chat. Because all experts must be loaded, it requires VRAM proportional to its full ~81B parameter count, which typically means high-VRAM GPUs, a multi-GPU setup, or a quantized format. Its low active parameter count results in fast generation speeds despite the large total model size. Released under the Apache 2.0 license.
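To make the total-versus-active distinction concrete, here is a minimal sketch. The `moe_profile` helper is hypothetical, and the figure of about 2 FLOPs per active parameter per token is a common rule of thumb, not a measured value for this model.

```python
def moe_profile(total_b: float, active_b: float,
                bits_per_weight: float = 16.0) -> dict:
    """Contrast memory (set by total params) with compute (set by active params)."""
    return {
        # Every expert must be resident, so memory follows the total count.
        "weights_gb": total_b * bits_per_weight / 8,
        # Share of parameters actually used for each token.
        "active_fraction": active_b / total_b,
        # Rule-of-thumb approximation: ~2 FLOPs per active parameter per token.
        "gflops_per_token": 2 * active_b,
    }


if __name__ == "__main__":
    profile = moe_profile(total_b=81.3, active_b=3.0)
    for key, value in profile.items():
        print(f"{key}: {value:.3f}")
```

For 81.3B total and 3B active parameters at 16-bit precision, the weights come to roughly 163 GB while under 4% of the parameters are used per token, which is why memory is the bottleneck and not generation speed.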
DeepSeek Coder 6.7B Instruct
DeepSeek · 6.7B · runs from 4.2 GB
DeepSeek Coder 6.7B Instruct is a first-generation code-specialized model trained on a large corpus of source code and programming-related data. At 6.7 billion parameters, it provides solid code completion, generation, and explanation capabilities across popular programming languages while remaining small enough to run on most consumer GPUs. While newer models in the DeepSeek lineup have surpassed it in raw capability, this model remains a practical choice for users who need a lightweight local coding assistant with minimal hardware requirements. It runs well on GPUs with as little as 6 GB of VRAM when quantized.
GLM 4.7 Flash REAP 23B A3B
Cerebras · 23.0B · runs from 7.4 GB
GLM 4.7 Flash REAP 23B A3B is a 23.0B-parameter open language model from Cerebras in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
LFM2 24B A2B
LiquidAI · 23.8B · runs from 7.0 GB
LFM2 24B A2B is a 23.8B-parameter open language model from Liquid AI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen2.5 14B Instruct
Alibaba · 14.8B · runs from 5.1 GB
Qwen2.5 14B Instruct is a 14-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and provides a balanced tradeoff between quality and hardware requirements, running well on GPUs with 16 GB of VRAM in quantized formats. The model is fine-tuned for chat, instruction following, and general-purpose assistant tasks. It performs well across reasoning, coding, and multilingual benchmarks for its size class, making it a practical option for local deployment when larger models are not feasible. Released under the Apache 2.0 license.
DeepSeek R1 Distill Qwen 32B Abliterated
huihui-ai · 32.8B · runs from 9.8 GB
DeepSeek R1 Distill Qwen 32B Abliterated is a 32.8B-parameter open language model from huihui-ai in the DeepSeek R1 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 12B IT Heretic
igorls · 12.0B · runs from 6.1 GB
Gemma 4 12B IT Heretic is a 12.0B-parameter open language model from igorls in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3 235B A22B Thinking 2507
Alibaba · 235.1B · runs from 71.0 GB
Qwen3 235B A22B Thinking 2507 is the reasoning and chain-of-thought variant of Alibaba's largest Qwen3 mixture-of-experts model, updated in July 2025. With 235 billion total parameters and about 22 billion active per forward pass, it represents the pinnacle of Qwen3's reasoning capabilities. This model excels at complex multi-step problems, mathematical reasoning, code analysis, and tasks requiring deep logical thinking. It demands serious hardware to run locally, but for users with multi-GPU setups, it offers reasoning performance that rivals the best proprietary models while keeping all computation on your own machines.
Huihui Qwen3 Coder 30B A3B Instruct Abliterated
huihui-ai · 30.5B · runs from 8.8 GB
Huihui Qwen3 Coder 30B A3B Instruct Abliterated is a 30.5B-parameter open language model from huihui-ai in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen2.5 72B Instruct
Alibaba · 72.7B · runs from 21.0 GB
Qwen2.5 72B Instruct is the flagship model of the Qwen 2.5 series from Alibaba Cloud, with 72.7 billion parameters. It is instruction-tuned for conversational use and excels across reasoning, coding, mathematics, and multilingual tasks. Qwen2.5 72B delivers performance competitive with leading open-weight 70B-class models while supporting a 128K token context window and structured output generation. The model uses a Transformer architecture with grouped-query attention and was pretrained on a diverse multilingual corpus of up to 18 trillion tokens. Running it locally requires high-VRAM GPUs or multi-GPU setups, though quantized formats make it accessible on workstation-class hardware. Released under the Apache 2.0 license.
Gemma 2 9B IT
Google · 9.2B · runs from 3.0 GB
Google Gemma 2 9B IT is a 9.2-billion parameter instruction-tuned model from Google's Gemma 2 series. It is a text-only chat model optimized for conversational tasks, instruction following, and general-purpose assistance. At release, it was recognized for delivering unusually strong performance relative to its parameter count. The model runs efficiently on consumer GPUs with 8-12 GB of VRAM in quantized formats, making it accessible on mainstream hardware. It is a popular choice for local inference among users who want strong quality without the VRAM demands of larger models. Released under the Gemma license.
DeepSeek V4 Pro
DeepSeek · 861.6B · runs from 366.5 GB
DeepSeek V4 Pro is an 861.6B-parameter open language model from DeepSeek in the DeepSeek V4 family. It supports a context window of up to 1,048,576 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek Coder V2 Lite Instruct
DeepSeek · 15.7B · runs from 7.2 GB
DeepSeek Coder V2 Lite Instruct is a code-focused mixture-of-experts model with 15.7 billion total parameters, trained to handle both programming tasks and general conversation. It supports a wide range of programming languages and excels at code generation, debugging, explanation, and refactoring. The MoE architecture keeps compute costs manageable despite the model's broad capabilities, and the Lite variant is sized to run on a single consumer GPU. For developers looking for a capable local coding assistant that can also handle general chat, this model offers an appealing combination of code specialization and practical hardware requirements.
Llama 4 Scout 17B 16E Instruct
Meta · 108.6B · runs from 32.9 GB
Llama 4 Scout 17B 16E Instruct is a 108.6B-parameter open language model from Meta in the Llama 4 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 E2B IT Uncensored
TrevorJS · 5.1B · runs from 2.5 GB
Gemma 4 E2B IT Uncensored is a 5.1B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
SmolLM2 360M Instruct
Hugging Face · 362M · runs from 0.5 GB
SmolLM2 360M Instruct is an instruction-tuned model from Hugging Face that occupies the sweet spot between the 135M and 1.7B entries in the SmolLM2 lineup. At 360 million parameters, it offers noticeably better coherence and instruction-following ability than the smallest variants while still running comfortably on virtually any modern GPU or even on CPU. This model is well suited for on-device assistants, embedded applications, and rapid prototyping where you need real conversational ability without dedicating significant hardware resources. It handles short-form generation, summarization, and basic reasoning tasks with reasonable quality.
Gemma 4 26B A4B IT Uncensored
TrevorJS · 25.8B · runs from 11.6 GB
Gemma 4 26B A4B IT Uncensored is a 25.8B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron 3 Nano Omni 30B A3B Reasoning BF16
NVIDIA · 33.0B · runs from 10.0 GB
Nemotron 3 Nano Omni 30B A3B Reasoning BF16 is a 33.0B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Phi 2
Microsoft · 2.8B · runs from 2.1 GB
Microsoft Phi 2 is a 2.8-billion parameter language model from Microsoft Research that helped establish the idea of small but highly capable language models. Released in late 2023, Phi 2 demonstrated that strategic data curation and training methodology could allow a sub-3B model to outperform many 7B and 13B models on reasoning and coding benchmarks. The model runs on virtually any modern GPU and even on CPU-only setups. While succeeded by Phi 3 and Phi 4, Phi 2 remains historically significant as an early demonstration that small-scale language models could be genuinely useful for practical tasks. Released under the MIT license.
Step 3.5 Flash
stepfun-ai · 199.4B · runs from 68.5 GB
Step 3.5 Flash is an efficient mixture-of-experts model from StepFun AI, a Chinese AI startup, featuring roughly 199 billion total parameters. The Flash designation signals its focus on speed and low-latency inference, making it well-suited for interactive applications despite its large total parameter count. Running it locally requires a multi-GPU setup, but its MoE architecture means only a portion of the model activates per token, delivering strong multilingual performance with better throughput than a comparably sized dense model.
Kimi K2 Instruct
Moonshot AI · 1026.5B · runs from 286.2 GB
Kimi K2 Instruct is Moonshot AI's massive Mixture-of-Experts model, weighing in at over one trillion total parameters. It represents one of the largest open-weight models available, delivering frontier-class performance across reasoning, coding, and multilingual tasks through its sparse MoE architecture that activates only a fraction of its full parameter count per token. Running Kimi K2 locally is an extreme undertaking, requiring professional multi-GPU setups with hundreds of gigabytes of combined VRAM even at aggressive quantization. This model is best suited for research labs, enterprise deployments, or enthusiasts with access to server-grade hardware who want to explore trillion-parameter-scale inference.
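A back-of-the-envelope way to size such a setup is to divide the required memory by the usable VRAM per card. The sketch below assumes the weights split evenly across GPUs and that roughly 90% of each card's VRAM is usable; both are simplifying assumptions, and `gpus_needed` is a hypothetical helper.

```python
import math


def gpus_needed(required_gb: float, per_gpu_gb: float,
                usable: float = 0.9) -> int:
    """Minimum number of identical GPUs whose usable VRAM covers required_gb."""
    # 'usable' is an assumed fraction left after driver and runtime overhead.
    return math.ceil(required_gb / (per_gpu_gb * usable))


if __name__ == "__main__":
    required = 286.2  # smallest listed footprint for this model, in GB
    for card_gb in (80, 48, 24):
        print(f"{card_gb} GB cards: {gpus_needed(required, card_gb)}")
```

Under these assumptions the 286.2 GB minimum works out to 4 cards at 80 GB each or 14 cards at 24 GB each, before accounting for any multi-GPU communication overhead.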
Granite 4.1 3B
IBM · 3.4B · runs from 1.6 GB
Granite 4.1 3B is a 3.4B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.