All LLM Models

Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
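The rule of thumb above can be sketched as a quick estimator. The bits-per-weight figures below are approximations (Q4_K_M averages roughly 4.85 bits per weight in practice), and real-world usage varies by runtime, KV cache, and context length:

```python
# Rough VRAM estimator for model weights only.
# Bits-per-weight figures are approximations; actual usage also depends
# on the inference runtime, KV cache, and context length.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,  # approximate average for this GGUF quant type
}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Estimate GB (decimal) needed just to hold the weights."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in ("FP16", "Q4_K_M"):
    print(f"7B @ {quant}: ~{weight_vram_gb(7, quant):.1f} GB")
```

For a 7B model this reproduces the figures above: 14.0 GB at FP16 and roughly 4.2 GB at Q4_K_M, before any KV-cache or runtime overhead.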

Model List

Qwen3 30B A3B

Alibaba · 30B

1.2M 864

Qwen3 30B A3B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with 30 billion total parameters and approximately 3 billion active parameters per forward pass. The MoE architecture delivers quality significantly above what a standard 3B dense model could achieve, while keeping per-token compute costs low. It supports hybrid thinking mode for flexible reasoning. The model requires VRAM proportional to its full 30B parameter count for weight loading, but its low active parameter count results in fast inference throughput. It is an efficient option for users who want quality beyond dense small models without the full cost of larger architectures. Released under the Apache 2.0 license.

Chat
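The MoE tradeoff described above, where weight memory follows the total parameter count while per-token compute follows the active parameter count, can be sketched numerically. The quantization level and the ~2 FLOPs-per-active-parameter rule of thumb are assumptions for illustration, not measured figures:

```python
# Illustrative MoE-vs-dense comparison (hypothetical figures):
# weight memory scales with TOTAL parameters, while per-token compute
# (roughly 2 FLOPs per active parameter) scales with ACTIVE parameters.
def weight_gb(total_params_b: float, bits_per_weight: float = 4.85) -> float:
    """Weight memory in GB, assuming ~Q4_K_M-level quantization."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

def tflops_per_token(active_params_b: float) -> float:
    """Approximate forward-pass compute per generated token."""
    return 2 * active_params_b * 1e9 / 1e12

models = [
    {"name": "30B-A3B MoE", "total": 30, "active": 3},
    {"name": "3B dense",    "total": 3,  "active": 3},
]
for m in models:
    print(f"{m['name']}: ~{weight_gb(m['total']):.0f} GB weights, "
          f"~{tflops_per_token(m['active']):.3f} TFLOPs/token")
```

The sketch shows the shape of the tradeoff: the MoE model needs roughly ten times the weight memory of a 3B dense model, but its per-token compute is the same.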

DeepSeek R1 Distill Llama 8B

DeepSeek · 8B

857.1K 850

DeepSeek R1 Distill Llama 8B brings R1's reinforcement-learned reasoning capabilities to the widely supported Llama 3.1 8B architecture. By distilling the full 684.5B R1 model's reasoning patterns into this 8 billion parameter dense model, DeepSeek created a version that benefits from the extensive Llama ecosystem of tools, quantizations, and inference engines. For users who prefer the Llama architecture or already have tooling built around it, this model offers a plug-and-play path to chain-of-thought reasoning. Its hardware requirements are very approachable, running well on consumer GPUs with 8 GB or more of VRAM at common quantization levels.

Chat · Reasoning

DeepSeek R1 Distill Qwen 7B

DeepSeek · 7.6B

613.6K 804

DeepSeek R1 Distill Qwen 7B compresses the reasoning techniques from DeepSeek's full R1 model into a compact 7.6 billion parameter dense model built on the Qwen 2.5 architecture. Despite its small footprint, it demonstrates surprisingly capable step-by-step reasoning on math and logic problems that would stump many models several times its size. This is one of the most accessible reasoning models available for local use, fitting comfortably on GPUs with 6 GB or more of VRAM when quantized. It strikes a practical balance between genuine chain-of-thought reasoning ability and the hardware constraints of a typical consumer setup.

Chat · Reasoning

Qwen3 30B A3B Instruct 2507

Alibaba · 30B

1.2M 783

Qwen3 30B A3B Instruct 2507 is a July 2025 updated mixture-of-experts model from Alibaba with 30 billion total parameters but only around 3 billion active during inference. This MoE architecture gives it a remarkably small memory and compute footprint relative to its total parameter count, letting users run a model with broad knowledge on mid-range hardware. The 2507 instruct refresh improves alignment and instruction-following quality over the original release. Because only a fraction of the weights are active at any given time, this model can often run on a single consumer GPU with 8 GB or more of VRAM when quantized, making it an excellent choice for users who want strong chat performance without heavyweight hardware.

Chat

Gemma 2 9B IT

Google · 9.2B

239.7K 779

Google Gemma 2 9B IT is a 9.2-billion parameter instruction-tuned model from Google's Gemma 2 series. It is a text-only chat model optimized for conversational tasks, instruction following, and general-purpose assistance. At release, it was recognized for delivering unusually strong performance relative to its parameter count. The model runs efficiently on consumer GPUs with 8-12GB of VRAM in quantized formats, making it accessible on mainstream hardware. It is a popular choice for local inference among users who want strong quality without the VRAM demands of larger models. Released under the Gemma license.

Chat

Qwen3 4B Instruct 2507

Alibaba · 4B

3.8M 768

Qwen3 4B Instruct 2507 is a July 2025 refresh of Alibaba's compact 4-billion-parameter chat model from the Qwen3 family. This updated release brings improved instruction following and conversational quality while remaining lightweight enough to run on most modern GPUs and even some higher-end integrated graphics setups. With its modest size, the 4B Instruct 2507 strikes a practical balance between capability and resource efficiency. It is well suited for everyday chat, summarization, and light assistant tasks on consumer hardware, making it one of the more accessible entry points into the Qwen3 lineup.

Chat

Qwen3 235B A22B Instruct 2507

Alibaba · 235B

166.8K 765

Qwen3 235B A22B Instruct 2507 is Alibaba's flagship instruction-tuned model from the July 2025 update, featuring 235 billion total parameters with approximately 22 billion active during inference. As the largest instruct model in the Qwen3 lineup, it delivers top-tier conversational quality, knowledge depth, and instruction following. Despite its massive total parameter count, the MoE architecture keeps active compute manageable. Running this model locally still requires substantial hardware, typically multi-GPU setups with 48 GB or more of total VRAM, but the 2507 refresh makes it one of the most capable open-weight models available for users with high-end local infrastructure.

Chat

DeepSeek R1 Distill Llama 70B

DeepSeek · 70B

92.5K 753

DeepSeek R1 Distill Llama 70B is the largest model in the R1 distillation lineup, combining the reasoning capabilities developed in the full 684.5B R1 with the robust Llama 3.1 70B architecture. At 70 billion parameters, it delivers the strongest reasoning performance of any dense R1 distill, approaching the full R1's quality on many math and coding benchmarks. Running this model locally requires a multi-GPU setup or a single GPU with very high VRAM capacity, though quantized versions can fit on hardware with 48 GB or more. For users who need top-tier open-weight reasoning and have the hardware to support a 70B dense model, this is one of the strongest options available.

Chat · Reasoning

SmolLM2 1.7B Instruct

Hugging Face · 1.7B

119.7K 721

SmolLM2 1.7B Instruct is the largest instruction-tuned model in the SmolLM2 family, offering the best balance of capability and efficiency Hugging Face achieved with this generation. At 1.7 billion parameters it produces substantially more coherent and useful responses than its smaller siblings, handling multi-turn conversations, summarization, and simple reasoning tasks with competence. With VRAM requirements well under 4 GB at standard precision, this model runs effortlessly on entry-level GPUs, older laptops, and even some mobile devices. It is an excellent choice for developers building lightweight local assistants or chatbots who want genuine conversational quality without the hardware demands of larger models.

Chat

Step 3.5 Flash

stepfun-ai · 199.4B

83.6K 714

Step 3.5 Flash is an efficient mixture-of-experts model from StepFun AI, a Chinese AI startup, featuring roughly 199 billion total parameters. The Flash designation signals its focus on speed and low-latency inference, making it well-suited for interactive applications despite its large total parameter count. Running it locally requires a multi-GPU setup, but its MoE architecture means only a portion of the model activates per token, delivering strong multilingual performance with better throughput than a comparably sized dense model.

Chat

Llama 3.2 3B

Meta · 3.2B

1.6M 705

Meta Llama 3.2 3B is a 3.2-billion parameter base (pretrained) model from Meta's Llama 3.2 family. It supports a 128K token context window and is intended for fine-tuning, research, and custom applications rather than direct conversational use. The model provides a good balance between capability and efficiency at the small model scale. It is popular as a foundation for community fine-tunes and domain-specific adaptations. Released under the Llama 3.2 Community License.

Chat

Phi 4 Mini Instruct

Microsoft · 3.8B

309.1K 699

Microsoft Phi 4 Mini Instruct is a 3.8-billion parameter instruction-tuned model from Microsoft Research's Phi 4 family. It applies the Phi series' data-centric training philosophy to a compact model, delivering strong performance in coding, reasoning, and chat tasks relative to its small footprint. The model runs on consumer GPUs with as little as 4-6GB of VRAM when quantized, making it accessible on mainstream and even entry-level hardware. Released under the MIT license.

Chat · Code

Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled

Jackrong · 27.8B

61.6K 695

This is the full-precision version of Jackrong's Qwen3.5 27B reasoning distillation from Claude 4.6 Opus. With 27.8 billion parameters in unquantized form, this model preserves the maximum quality from the distillation process but requires significantly more VRAM, typically 56 GB or more in BF16. It is primarily intended for users with professional-grade GPUs or multi-GPU setups. This variant is ideal for further fine-tuning, experimentation, or running at full fidelity when hardware allows. Most users looking to run the model locally for inference should consider the GGUF-quantized version instead, which offers a much better tradeoff between quality and resource usage.

Chat · Reasoning

Gemma 3 12B IT

Google · 12B

1.9M 678

Google Gemma 3 12B IT is a 12-billion parameter multimodal instruction-tuned model from Google's Gemma 3 series. It supports both text and image inputs, offering vision-language capabilities at a more accessible size point than the 27B variant. Gemma 3 12B IT runs on consumer GPUs with 12-16GB of VRAM in quantized formats, making it a practical choice for local multimodal inference without requiring top-tier hardware. Released under the Gemma license.

Vision

NVIDIA Nemotron 3 Nano 30B A3B BF16

NVIDIA · 31.6B

924.9K 669

NVIDIA Nemotron 3 Nano 30B A3B is a mixture-of-experts model with 31.6 billion total parameters but only around 3 billion active per token, giving it the intelligence of a much larger model with the speed of a small one. This BF16 version preserves full precision for maximum output quality. The MoE architecture makes this model especially interesting for local deployment. You get reasoning and instruction-following capabilities that punch well above what a traditional 3B model can deliver, while inference stays fast because only a fraction of the network fires for each token.

Chat

Qwen3 32B

Alibaba · 32B

5.0M 668

Qwen3 32B is the flagship dense model in Alibaba Cloud's Qwen 3 series, with 32 billion parameters. It is instruction-tuned for chat and delivers strong performance across reasoning, coding, mathematics, and multilingual tasks. Qwen3 32B supports a hybrid thinking mode that allows the model to engage in extended chain-of-thought reasoning or respond quickly depending on the task, giving users flexibility between depth and speed. The model requires a GPU with at least 24GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. It represents a significant generational improvement over Qwen 2.5 in both instruction following and knowledge breadth. Released under the Apache 2.0 license.

Chat

Qwen2.5 Coder 7B Instruct

Alibaba · 7.6B

2.1M 666

Qwen2.5 Coder 7B Instruct is a 7.6-billion parameter code-specialized instruction-tuned model from Alibaba Cloud. It is trained on a large corpus of source code and natural language, fine-tuned for programming assistance tasks such as code generation, completion, debugging, and code explanation. The model supports a 128K token context window and runs efficiently on consumer GPUs with 8GB or more of VRAM. It provides a good balance between coding capability and hardware requirements for developers looking to run a local coding assistant. Released under the Apache 2.0 license.

Chat · Code
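Long context windows like the 128K above cost VRAM beyond the weights themselves, because the KV cache grows linearly with the number of tokens in context. The architecture numbers in this sketch (layer count, KV heads, head dimension) are assumptions typical of a 7B-class model with grouped-query attention, not values read from any specific config file:

```python
# KV-cache size grows linearly with context length and comes on top of
# weight memory. Architecture numbers are assumed values typical of a
# 7B model with grouped-query attention, not from a specific config.
def kv_cache_gb(seq_len: int, layers: int = 28, kv_heads: int = 4,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV-cache memory in GB; 2x for keys and values, FP16 cache assumed."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return seq_len * per_token_bytes / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.2f} GB KV cache")
```

Under these assumptions a 4K context adds only ~0.2 GB, but filling the full 128K window adds several GB on top of the quantized weights, which is why long-context sessions can exceed an 8 GB card even when the weights alone fit.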

MiMo v2 Flash

XiaomiMiMo · 309.8B

337.1K 650

MiMo V2 Flash is Xiaomi's large-scale mixture-of-experts language model, built with nearly 310 billion total parameters. Designed for fast inference despite its size, the Flash variant prioritizes throughput and responsiveness, making it well-suited for interactive chat and real-time applications. Running it locally is a serious undertaking that demands high-end multi-GPU configurations, but it brings flagship-level Chinese and English language capabilities to users who have the hardware to support it.

Chat

Qwen2.5 1.5B Instruct

Alibaba · 1.5B

8.8M 637

Qwen2.5 1.5B Instruct is a 1.5-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It is a lightweight model suitable for deployment on minimal hardware, including low-VRAM GPUs and even CPU-only setups with acceptable latency. It supports a 128K token context window. The model handles basic conversational tasks, simple question answering, and text generation. While limited in reasoning depth compared to larger variants, it is useful for applications where fast response times and minimal resource consumption are priorities. Released under the Apache 2.0 license.

Chat

Gemma 2 2B

Google · 2B

276.5K 629

Google Gemma 2 2B is a 2-billion parameter base (pretrained) model from Google's Gemma 2 family. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Its compact size makes it suitable for experimentation, rapid prototyping, and domain-specific fine-tuning on consumer hardware with minimal VRAM. Released under the Gemma license.

Chat

GPT OSS 20B GGUF

Unsloth · 20B

324.2K 627

This is a GGUF-quantized version of OpenAI's GPT-OSS 20B, repackaged by Unsloth. GPT-OSS 20B is OpenAI's open-weight release, bringing the company's model-building expertise to the open community with a 20-billion-parameter architecture. Unsloth's GGUF conversion makes this model compatible with llama.cpp and popular frontends like Ollama and LM Studio. At 20B parameters, it sits in a productive middle ground, large enough to deliver strong reasoning and generation quality while remaining runnable on consumer GPUs with 16GB or more of VRAM at appropriate quantization levels.

Chat

Distilgpt2

distilbert · 88M

2.3M 618

DistilGPT-2 is a distilled version of OpenAI's GPT-2 model, compressed to just 88 million parameters while retaining much of the original model's text generation ability. Created using knowledge distillation techniques, it offers significantly faster inference than the full GPT-2 with only a modest reduction in output quality. This model is one of the lightest autoregressive language models available and can run on virtually any hardware, including CPUs. It is a practical choice for educational projects, quick prototyping, and applications where inference speed and minimal resource usage are more important than state-of-the-art generation quality.

Chat

DeepSeek R1 Distill Qwen 14B

DeepSeek · 14.8B

742.0K 613

DeepSeek R1 Distill Qwen 14B sits in a sweet spot between the smaller 7B distill and the more demanding 32B version, offering strong reasoning performance at 14.8 billion parameters on the Qwen 2.5 architecture. It captures a meaningful share of the full R1's chain-of-thought capabilities while keeping resource requirements within the range of mainstream consumer GPUs. Quantized to 4-bit, it fits comfortably on GPUs with 12 GB of VRAM, delivering reliable step-by-step reasoning for math, logic, and analytical problems.

Chat · Reasoning

Qwen3 4B

Alibaba · 4B

6.2M 570

Qwen3 4B is a compact 4-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 family. It is designed for efficient local inference on consumer hardware, supporting chat and general assistant tasks while fitting comfortably on GPUs with 6GB or more of VRAM in quantized formats. The model supports hybrid thinking mode, allowing it to balance reasoning depth and response speed. Despite its small footprint, Qwen3 4B delivers quality competitive with larger models from previous generations, making it a practical choice for lightweight local deployments and resource-constrained environments. Released under the Apache 2.0 license.

Chat

Qwen3 4B Thinking 2507

Alibaba · 4B

514.8K 567

Qwen3 4B Thinking 2507 is the reasoning-optimized variant of Alibaba's compact 4-billion-parameter Qwen3 model, released in the July 2025 update cycle. Despite its small size, this thinking variant is tuned to produce chain-of-thought reasoning and step-by-step problem solving, making it a surprisingly capable lightweight reasoner. This model is ideal for users who want basic reasoning and analytical capabilities on very modest hardware. It can run on most consumer GPUs and even some CPU-only setups when quantized, providing an accessible entry point for experimenting with reasoning-style models without any significant hardware investment.

Chat

Gemma 3 270M IT

Google · 270M

95.6K 564

Google Gemma 3 270M IT is a 270-million parameter instruction-tuned model from Google's Gemma 3 family, an experimental release pushing the boundaries of how small an effective chat model can be. The model runs on virtually any hardware, including entry-level GPUs and CPU-only setups, making it useful for experimentation, education, and exploring the limits of small-scale language modeling. Released under the Gemma license.

Chat

Gemma 2 27B IT

Google · 27.2B

401.5K 560

Google Gemma 2 27B IT is a 27.2-billion parameter instruction-tuned model from Google's Gemma 2 generation. It is a text-only chat model optimized for conversational use, reasoning, and instruction following. Gemma 2 27B IT was one of the strongest openly available models in its size class at release. The model requires a GPU with at least 24GB of VRAM for quantized local inference. It is widely supported by popular inference engines and remains a strong choice for users seeking high-quality local chat without needing 70B-class hardware. Released under the Gemma license.

Chat

DeepSeek Coder v2 Lite Instruct

DeepSeek · 15.7B

239.5K 559

DeepSeek Coder V2 Lite Instruct is a code-focused mixture-of-experts model with 15.7 billion total parameters, trained to handle both programming tasks and general conversation. It supports a wide range of programming languages and excels at code generation, debugging, explanation, and refactoring. The MoE architecture keeps compute costs manageable despite the model's broad capabilities, and the Lite variant is sized to run on a single consumer GPU. For developers looking for a capable local coding assistant that can also handle general chat, this model offers an appealing combination of code specialization and practical hardware requirements.

Chat · Code

Qwen3 Coder 30B A3B Instruct GGUF

Unsloth · 30B

150.9K 536

This is a GGUF-quantized version of Alibaba's Qwen3 Coder 30B A3B Instruct, repackaged by Unsloth. Qwen3 Coder is a code-specialized model from the Qwen3 family that uses a Mixture-of-Experts (MoE) architecture with 30 billion total parameters but only around 3 billion active parameters per inference step, delivering strong coding performance with efficient resource usage. The MoE design means this model punches well above its active parameter count in code generation, debugging, and explanation tasks. Unsloth's GGUF format makes it compatible with llama.cpp-based tools. Thanks to the sparse activation pattern, it requires significantly less VRAM than a dense 30B model, making it a compelling choice for developers who want a capable local coding assistant without top-tier hardware.

Chat · Code

LFM2.5 1.2B Instruct

LiquidAI · 1.2B

219.2K 527

LFM2.5 1.2B Instruct is an instruction-tuned model from Liquid AI that uses a novel hybrid architecture combining state-space models with attention mechanisms. At just 1.2 billion parameters, it is exceptionally lightweight and can run on virtually any hardware, including laptops and edge devices. Liquid AI's unconventional architecture aims to deliver better efficiency and longer context handling than traditional transformer models at this scale, making it an interesting option for users exploring alternatives to standard transformer-based LLMs.

Chat