All LLM Models

Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
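The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough estimate of weight size only, not the site's actual calculation. The function name `estimate_weight_gb` and the bits-per-weight table are illustrative assumptions: FP16 is exactly 16 bits per weight, while the quantized values are approximations because real quantized files mix tensor types and vary by model.

```python
# Approximate bits per weight for a few formats.
# FP16 is exact; the quantized figures are rough assumptions.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7.0, quant):.1f} GB")
```

Actual VRAM usage is higher than this estimate, because the context cache and runtime overhead come on top of the weights.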

Model List

Qwen3.6 35B A3B Uncensored Heretic Native MTP Preserved

llmfan46 · 35.1B · runs from 15.3 GB

23.0K 26

Qwen3.6 35B A3B Uncensored Heretic Native MTP Preserved is a 35.1B-parameter open language model from llmfan46 in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

MN 12B Mag Mell R1

inflatebot · 12.2B · runs from 4.1 GB

40.0K 239

MN 12B Mag Mell R1 is a 12.2B-parameter open language model from inflatebot. It supports a context window of up to 1,024,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

NVIDIA Nemotron 3 Super 120B A12B BF16

NVIDIA · 123.6B · runs from 34.5 GB

769.5K 383

NVIDIA Nemotron 3 Super 120B A12B BF16 is a 123.6B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2.5 1.2B Instruct

LiquidAI · 1.2B · runs from 0.8 GB

147.3K 604

LFM2.5 1.2B Instruct is an instruction-tuned model from Liquid AI that uses a novel hybrid architecture combining state-space models with attention mechanisms. At just 1.2 billion parameters, it is exceptionally lightweight and can run on virtually any hardware, including laptops and edge devices. Liquid AI's unconventional architecture aims to deliver better efficiency and longer context handling than traditional transformer models at this scale, making it an interesting option for users exploring alternatives to standard transformer-based LLMs.

Chat

Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop

DavidAU · 12.1B · runs from 5.6 GB

1.1K 24

Qwen3.6 12B IQ Ultra Heretic Uncensored Thinking v2 Hightop is a 12.1B-parameter open language model from DavidAU in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision · Roleplay

DeepSeek R1 Distill Llama 70B

DeepSeek · 70.6B · runs from 20.4 GB

119.2K 778

DeepSeek R1 Distill Llama 70B is the largest model in the R1 distillation lineup, combining the reasoning capabilities developed in the full 684.5B R1 with the robust Llama 3.1 70B architecture. At 70 billion parameters, it delivers the strongest reasoning performance of any dense R1 distill, approaching the full R1's quality on many math and coding benchmarks. Running this model locally requires a multi-GPU setup or a single GPU with very high VRAM capacity, though quantized versions can fit on hardware with 48 GB or more. For users who need top-tier open-weight reasoning and have the hardware to support a 70B dense model, this is one of the strongest options available.

Chat · Reasoning

Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled

Jackrong · 27.8B · runs from 8.4 GB

61.6K 695

This is the full-precision version of Jackrong's Qwen3.5 27B reasoning distillation from Claude 4.6 Opus. With 27.8 billion parameters in unquantized form, this model preserves the maximum quality from the distillation process but requires significantly more VRAM, typically 56 GB or more in BF16. It is primarily intended for users with professional-grade GPUs or multi-GPU setups. This variant is ideal for further fine-tuning, experimentation, or running at full fidelity when hardware allows. Most users looking to run the model locally for inference should consider the GGUF-quantized version instead, which offers a much better tradeoff between quality and resource usage.

Chat · Reasoning

Mistral 7B Instruct v0.1

Mistral AI · 7B · runs from 3.5 GB

448.7K 1.8K

Mistral 7B Instruct v0.1 was the first instruction-tuned variant of the original Mistral 7B, fine-tuned for conversational and instruction-following tasks. While it has since been superseded by v0.2 and v0.3, it remains a solid lightweight chat model and an important milestone in the open-weight model ecosystem. Its hardware requirements are identical to the base Mistral 7B, running smoothly on GPUs with as little as 6 GB of VRAM when quantized. Users seeking the best Mistral 7B experience should generally prefer the newer v0.3 release, but v0.1 is still useful for reproducibility and benchmarking purposes.

Chat

Qwen3 Next 80B A3B Instruct

Alibaba · 81.3B · runs from 22.8 GB

323.6K 1.0K

Qwen3 Next 80B A3B Instruct is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with approximately 81.3 billion total parameters and around 3 billion active parameters per forward pass. This extreme ratio between total and active parameters allows the model to encode extensive knowledge across its expert layers while maintaining very fast per-token inference, making it an unusually efficient design for its capability level. The model is instruction-tuned for general-purpose chat and requires VRAM proportional to its full 80B parameter count for weight loading, typically needing high-VRAM GPUs or quantized multi-GPU setups. Its low active parameter count results in fast generation speeds despite the large total model size. Released under the Apache 2.0 license.

Chat
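The total-versus-active distinction in the Qwen3 Next 80B A3B Instruct entry above can be made concrete with a small sketch. The helper `moe_summary` is hypothetical and uses the entry's own figures (81.3B total, about 3B active). It assumes weight memory scales with total parameters and that the active share is a rough proxy for per-token compute.

```python
def moe_summary(total_b: float, active_b: float, bits_per_weight: float) -> dict:
    """Rough MoE figures: weight memory follows total parameters,
    while per-token compute follows the active parameters."""
    weights_gb = total_b * bits_per_weight / 8  # GB (10^9 bytes), weights only
    active_fraction = active_b / total_b        # share of weights used per token
    return {"weights_gb": weights_gb, "active_fraction": active_fraction}


if __name__ == "__main__":
    summary = moe_summary(total_b=81.3, active_b=3.0, bits_per_weight=16.0)
    print(f"Weights at 16 bits: ~{summary['weights_gb']:.1f} GB")
    print(f"Active per token: ~{summary['active_fraction']:.1%} of parameters")
```

All experts must be resident in memory even though only a small share is used for any given token, which is why the memory figure tracks the total count.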

DeepSeek Coder 6.7B Instruct

DeepSeek · 6.7B · runs from 4.2 GB

143.7K 496

DeepSeek Coder 6.7B Instruct is a first-generation code-specialized model trained on a large corpus of source code and programming-related data. At 6.7 billion parameters, it provides solid code completion, generation, and explanation capabilities across popular programming languages while remaining small enough to run on most consumer GPUs. While newer models in the DeepSeek lineup have surpassed it in raw capability, this model remains a practical choice for users who need a lightweight local coding assistant with minimal hardware requirements. It runs well on GPUs with as little as 6 GB of VRAM when quantized.

Chat · Code

GLM 4.7 Flash REAP 23B A3B

Cerebras · 23.0B · runs from 7.4 GB

421 76

GLM 4.7 Flash REAP 23B A3B is a 23.0B-parameter open language model from Cerebras in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2 24B A2B

LiquidAI · 23.8B · runs from 7.0 GB

20.5K 332

LFM2 24B A2B is a 23.8B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 14B Instruct

Alibaba · 14.8B · runs from 5.1 GB

1.9M 347

Qwen2.5 14B Instruct is a 14-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and provides a balanced tradeoff between quality and hardware requirements, running well on GPUs with 16 GB of VRAM in quantized formats. The model is fine-tuned for chat, instruction following, and general-purpose assistant tasks. It performs well across reasoning, coding, and multilingual benchmarks for its size class, making it a practical option for local deployment when larger models are not feasible. Released under the Apache 2.0 license.

Chat

DeepSeek R1 Distill Qwen 32B Abliterated

huihui-ai · 32.8B · runs from 9.8 GB

33.7K 244

DeepSeek R1 Distill Qwen 32B Abliterated is a 32.8B-parameter open language model from huihui-ai in the DeepSeek R1 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Gemma 4 12B IT Heretic

igorls · 12.0B · runs from 6.1 GB

1.4K 11

Gemma 4 12B IT Heretic is a 12.0B-parameter open language model from igorls in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 235B A22B Thinking 2507

Alibaba · 235.1B · runs from 71.0 GB

56.8K 406

Qwen3 235B A22B Thinking 2507 is the reasoning and chain-of-thought variant of Alibaba's largest Qwen3 mixture-of-experts model, updated in July 2025. With 235 billion total parameters and about 22 billion active per forward pass, it represents the pinnacle of Qwen3's reasoning capabilities. This model excels at complex multi-step problems, mathematical reasoning, code analysis, and tasks requiring deep logical thinking. It demands serious hardware to run locally, but for users with multi-GPU setups, it offers reasoning performance that rivals the best proprietary models while keeping all computation on your own machines.

Chat

Huihui Qwen3 Coder 30B A3B Instruct Abliterated

huihui-ai · 30.5B · runs from 8.8 GB

3.1K 35

Huihui Qwen3 Coder 30B A3B Instruct Abliterated is a 30.5B-parameter open language model from huihui-ai in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Qwen2.5 72B Instruct

Alibaba · 72.7B · runs from 21.0 GB

455.1K 951

Qwen2.5 72B Instruct is the flagship model of the Qwen 2.5 series from Alibaba Cloud, with 72.7 billion parameters. It is instruction-tuned for conversational use and excels across reasoning, coding, mathematics, and multilingual tasks. Qwen2.5 72B delivers performance competitive with leading open-weight 70B-class models while supporting a 128K token context window and structured output generation. The model uses a Transformer architecture with grouped-query attention and was pretrained on a diverse multilingual corpus of over 18 trillion tokens. Running it locally requires high-VRAM GPUs or multi-GPU setups, though quantized formats make it accessible on workstation-class hardware. Released under the Apache 2.0 license.

Chat

Gemma 2 9B IT

Google · 9.2B · runs from 3.0 GB

391.0K 826

Google Gemma 2 9B IT is a 9.2-billion-parameter instruction-tuned model from Google's Gemma 2 series. It is a text-only chat model optimized for conversational tasks, instruction following, and general-purpose assistance. At release, it was recognized for delivering unusually strong performance relative to its parameter count. The model runs efficiently on consumer GPUs with 8-12 GB of VRAM in quantized formats, making it accessible on mainstream hardware. It is a popular choice for local inference among users who want strong quality without the VRAM demands of larger models. Released under the Gemma license.

Chat

DeepSeek V4 Pro

DeepSeek · 861.6B · runs from 366.5 GB

3.4M 4.8K

DeepSeek V4 Pro is an 861.6B-parameter open language model from DeepSeek in the DeepSeek V4 family. It supports a context window of up to 1,048,576 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DeepSeek Coder V2 Lite Instruct

DeepSeek · 15.7B · runs from 7.2 GB

894.1K 609

DeepSeek Coder V2 Lite Instruct is a code-focused mixture-of-experts model with 15.7 billion total parameters, trained to handle both programming tasks and general conversation. It supports a wide range of programming languages and excels at code generation, debugging, explanation, and refactoring. The MoE architecture keeps compute costs manageable despite the model's broad capabilities, and the Lite variant is sized to run on a single consumer GPU. For developers looking for a capable local coding assistant that can also handle general chat, this model offers an appealing combination of code specialization and practical hardware requirements.

Chat · Code

Llama 4 Scout 17B 16E Instruct

Meta · 108.6B · runs from 32.9 GB

454.9K 1.3K

Llama 4 Scout 17B 16E Instruct is a 108.6B-parameter open language model from Meta in the Llama 4 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Gemma 4 E2B IT Uncensored

TrevorJS · 5.1B · runs from 2.5 GB

1.3K 20

Gemma 4 E2B IT Uncensored is a 5.1B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM2 360M Instruct

Hugging Face · 362M · runs from 0.5 GB

283.9K 193

SmolLM2 360M Instruct is an instruction-tuned model from Hugging Face that occupies the sweet spot between the 135M and 1.7B entries in the SmolLM2 lineup. At 360 million parameters, it offers noticeably better coherence and instruction-following ability than the smallest variants while still running comfortably on virtually any modern GPU or even on CPU. This model is well suited for on-device assistants, embedded applications, and rapid prototyping where you need real conversational ability without dedicating significant hardware resources. It handles short-form generation, summarization, and basic reasoning tasks with reasonable quality.

Chat

Gemma 4 26B A4B IT Uncensored

TrevorJS · 25.8B · runs from 11.6 GB

218.7K 41

Gemma 4 26B A4B IT Uncensored is a 25.8B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron 3 Nano Omni 30B A3B Reasoning BF16

NVIDIA · 33.0B · runs from 10.0 GB

340.0K 343

Nemotron 3 Nano Omni 30B A3B Reasoning BF16 is a 33.0B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Reasoning

Phi 2

Microsoft · 2.8B · runs from 2.1 GB

437.6K 3.5K

Microsoft Phi 2 is a 2.8-billion-parameter language model from Microsoft Research that pioneered the concept of small but highly capable language models. Released in late 2023, Phi 2 demonstrated that strategic data curation and training methodology could allow a sub-3B model to outperform many 7B and 13B models on reasoning and coding benchmarks. The model runs on virtually any modern GPU and even on CPU-only setups. While succeeded by Phi 3 and Phi 4, Phi 2 remains historically significant as the model that proved small-scale language models could be genuinely useful for practical tasks. Released under the MIT license.

Chat · Code

Step 3.5 Flash

stepfun-ai · 199.4B · runs from 68.5 GB

358.2K 821

Step 3.5 Flash is an efficient mixture-of-experts model from StepFun AI, a Chinese AI startup, featuring roughly 199 billion total parameters. The Flash designation signals its focus on speed and low-latency inference, making it well-suited for interactive applications despite its large total parameter count. Running it locally requires a multi-GPU setup, but its MoE architecture means only a portion of the model activates per token, delivering strong multilingual performance with better throughput than a comparably sized dense model.

Chat

Kimi K2 Instruct

Moonshot AI · 1026.5B · runs from 286.2 GB

531.9K 2.4K

Kimi K2 Instruct is Moonshot AI's massive Mixture-of-Experts model, weighing in at over one trillion total parameters. It represents one of the largest open-weight models available, delivering frontier-class performance across reasoning, coding, and multilingual tasks through its sparse MoE architecture that activates only a fraction of its full parameter count per token. Running Kimi K2 locally is an extreme undertaking, requiring professional multi-GPU setups with hundreds of gigabytes of combined VRAM even at aggressive quantization. This model is best suited for research labs, enterprise deployments, or enthusiasts with access to server-grade hardware who want to explore trillion-parameter-scale inference.

Chat

Granite 4.1 3B

IBM · 3.4B · runs from 1.6 GB

197.9K 75

Granite 4.1 3B is a 3.4B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat