All LLM Models

Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
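The rule of thumb above can be sketched as a quick estimator. The bits-per-weight figures below are approximations (Q4_K_M averages roughly 4.85 bits per weight in practice), and real-world usage varies by runtime, KV cache, and context length:

```python
# Rough VRAM estimator for model weights only.
# Bits-per-weight figures are approximations; actual usage also depends
# on the inference runtime, KV cache, and context length.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,  # approximate average for this GGUF quant type
}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Estimate GB (decimal) needed just to hold the weights."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in ("FP16", "Q4_K_M"):
    print(f"7B @ {quant}: ~{weight_vram_gb(7, quant):.1f} GB")
```

For a 7B model this reproduces the figures above: 14.0 GB at FP16 and roughly 4.2 GB at Q4_K_M, before any KV-cache or runtime overhead.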

Model List

Qwen3 30B A3B

Alibaba · 30B

1.2M 864

Qwen3 30B A3B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with 30 billion total parameters and approximately 3 billion active parameters per forward pass. The MoE architecture delivers quality significantly above what a standard 3B dense model could achieve, while keeping per-token compute costs low. It supports hybrid thinking mode for flexible reasoning. The model requires VRAM proportional to its full 30B parameter count for weight loading, but its low active parameter count results in fast inference throughput. It is an efficient option for users who want quality beyond dense small models without the full cost of larger architectures. Released under the Apache 2.0 license.

Chat
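The MoE tradeoff described above, where weight memory follows the total parameter count while per-token compute follows the active parameter count, can be sketched numerically. The quantization level and the ~2 FLOPs-per-active-parameter rule of thumb are assumptions for illustration, not measured figures:

```python
# Illustrative MoE-vs-dense comparison (hypothetical figures):
# weight memory scales with TOTAL parameters, while per-token compute
# (roughly 2 FLOPs per active parameter) scales with ACTIVE parameters.
def weight_gb(total_params_b: float, bits_per_weight: float = 4.85) -> float:
    """Weight memory in GB, assuming ~Q4_K_M-level quantization."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

def tflops_per_token(active_params_b: float) -> float:
    """Approximate forward-pass compute per generated token."""
    return 2 * active_params_b * 1e9 / 1e12

models = [
    {"name": "30B-A3B MoE", "total": 30, "active": 3},
    {"name": "3B dense",    "total": 3,  "active": 3},
]
for m in models:
    print(f"{m['name']}: ~{weight_gb(m['total']):.0f} GB weights, "
          f"~{tflops_per_token(m['active']):.3f} TFLOPs/token")
```

The sketch shows the shape of the tradeoff: the MoE model needs roughly ten times the weight memory of a 3B dense model, but its per-token compute is the same.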

DeepSeek R1 Distill Llama 8B

DeepSeek · 8B

857.1K 850

DeepSeek R1 Distill Llama 8B brings R1's reinforcement-learned reasoning capabilities to the widely supported Llama 3.1 8B architecture. By distilling the full 684.5B R1 model's reasoning patterns into this 8 billion parameter dense model, DeepSeek created a version that benefits from the extensive Llama ecosystem of tools, quantizations, and inference engines. For users who prefer the Llama architecture or already have tooling built around it, this model offers a plug-and-play path to chain-of-thought reasoning. Its hardware requirements are very approachable, running well on consumer GPUs with 8 GB or more of VRAM at common quantization levels.

Chat · Reasoning

DeepSeek R1 Distill Qwen 7B

DeepSeek · 7.6B

613.6K 804

DeepSeek R1 Distill Qwen 7B compresses the reasoning techniques from DeepSeek's full R1 model into a compact 7.6 billion parameter dense model built on the Qwen 2.5 architecture. Despite its small footprint, it demonstrates surprisingly capable step-by-step reasoning on math and logic problems that would stump many models several times its size. This is one of the most accessible reasoning models available for local use, fitting comfortably on GPUs with 6 GB or more of VRAM when quantized. It strikes a practical balance between genuine chain-of-thought reasoning ability and the hardware constraints of a typical consumer setup.

Chat · Reasoning

Qwen3 30B A3B Instruct 2507

Alibaba · 30B

1.2M 783

Qwen3 30B A3B Instruct 2507 is a July 2025 updated mixture-of-experts model from Alibaba with 30 billion total parameters but only around 3 billion active during inference. This MoE architecture gives it a remarkably small memory and compute footprint relative to its total parameter count, letting users run a model with broad knowledge on mid-range hardware. The 2507 instruct refresh improves alignment and instruction-following quality over the original release. Because only a fraction of the weights are active at any given time, this model can often run on a single consumer GPU with 8 GB or more of VRAM when quantized, making it an excellent choice for users who want strong chat performance without heavyweight hardware.

Chat

Gemma 2 9B IT

Google · 9.2B

239.7K 779

Google Gemma 2 9B IT is a 9.2-billion parameter instruction-tuned model from Google's Gemma 2 series. It is a text-only chat model optimized for conversational tasks, instruction following, and general-purpose assistance. At release, it was recognized for delivering unusually strong performance relative to its parameter count. The model runs efficiently on consumer GPUs with 8-12GB of VRAM in quantized formats, making it accessible on mainstream hardware. It is a popular choice for local inference among users who want strong quality without the VRAM demands of larger models. Released under the Gemma license.

Chat

Qwen3 4B Instruct 2507

Alibaba · 4B

3.8M 768

Qwen3 4B Instruct 2507 is a July 2025 refresh of Alibaba's compact 4-billion-parameter chat model from the Qwen3 family. This updated release brings improved instruction following and conversational quality while remaining lightweight enough to run on most modern GPUs and even some higher-end integrated graphics setups. With its modest size, the 4B Instruct 2507 strikes a practical balance between capability and resource efficiency. It is well suited for everyday chat, summarization, and light assistant tasks on consumer hardware, making it one of the more accessible entry points into the Qwen3 lineup.

Chat

Qwen3 235B A22B Instruct 2507

Alibaba · 235B

166.8K 765

Qwen3 235B A22B Instruct 2507 is Alibaba's flagship instruction-tuned model from the July 2025 update, featuring 235 billion total parameters with approximately 22 billion active during inference. As the largest instruct model in the Qwen3 lineup, it delivers top-tier conversational quality, knowledge depth, and instruction following. Despite its massive total parameter count, the MoE architecture keeps active compute manageable. Running this model locally still requires substantial hardware, typically multi-GPU setups with 48 GB or more of total VRAM, but the 2507 refresh makes it one of the most capable open-weight models available for users with high-end local infrastructure.

Chat

DeepSeek R1 Distill Llama 70B

DeepSeek · 70B

92.5K 753

DeepSeek R1 Distill Llama 70B is the largest model in the R1 distillation lineup, combining the reasoning capabilities developed in the full 684.5B R1 with the robust Llama 3.1 70B architecture. At 70 billion parameters, it delivers the strongest reasoning performance of any dense R1 distill, approaching the full R1's quality on many math and coding benchmarks. Running this model locally requires a multi-GPU setup or a single GPU with very high VRAM capacity, though quantized versions can fit on hardware with 48 GB or more. For users who need top-tier open-weight reasoning and have the hardware to support a 70B dense model, this is one of the strongest options available.

Chat · Reasoning

SmolLM2 1.7B Instruct

Hugging Face · 1.7B

119.7K 721

SmolLM2 1.7B Instruct is the largest instruction-tuned model in the SmolLM2 family, offering the best balance of capability and efficiency Hugging Face achieved with this generation. At 1.7 billion parameters it produces substantially more coherent and useful responses than its smaller siblings, handling multi-turn conversations, summarization, and simple reasoning tasks with competence. With VRAM requirements well under 4 GB at standard precision, this model runs effortlessly on entry-level GPUs, older laptops, and even some mobile devices. It is an excellent choice for developers building lightweight local assistants or chatbots who want genuine conversational quality without the hardware demands of larger models.

Chat

Step 3.5 Flash

stepfun-ai · 199.4B

83.6K 714

Step 3.5 Flash is an efficient mixture-of-experts model from StepFun AI, a Chinese AI startup, featuring roughly 199 billion total parameters. The Flash designation signals its focus on speed and low-latency inference, making it well-suited for interactive applications despite its large total parameter count. Running it locally requires a multi-GPU setup, but its MoE architecture means only a portion of the model activates per token, delivering strong multilingual performance with better throughput than a comparably sized dense model.

Chat

Llama 3.2 3B

Meta · 3.2B

1.6M 705

Meta Llama 3.2 3B is a 3.2-billion parameter base (pretrained) model from Meta's Llama 3.2 family. It supports a 128K token context window and is intended for fine-tuning, research, and custom applications rather than direct conversational use. The model provides a good balance between capability and efficiency at the small model scale. It is popular as a foundation for community fine-tunes and domain-specific adaptations. Released under the Llama 3.2 Community License.

Chat

Phi 4 Mini Instruct

Microsoft · 3.8B

309.1K 699

Microsoft Phi 4 Mini Instruct is a 3.8-billion parameter instruction-tuned model from Microsoft Research's Phi 4 family. It applies the Phi series' data-centric training philosophy to a compact model, delivering strong performance in coding, reasoning, and chat tasks relative to its small footprint. The model runs on consumer GPUs with as little as 4-6GB of VRAM when quantized, making it accessible on mainstream and even entry-level hardware. Released under the MIT license.

Chat · Code

Qwen3.5 27B Claude 4.6 Opus Reasoning Distilled

Jackrong · 27.8B

61.6K 695

This is the full-precision version of Jackrong's Qwen3.5 27B reasoning distillation from Claude 4.6 Opus. With 27.8 billion parameters in unquantized form, this model preserves the maximum quality from the distillation process but requires significantly more VRAM, typically 56 GB or more in BF16. It is primarily intended for users with professional-grade GPUs or multi-GPU setups. This variant is ideal for further fine-tuning, experimentation, or running at full fidelity when hardware allows. Most users looking to run the model locally for inference should consider the GGUF-quantized version instead, which offers a much better tradeoff between quality and resource usage.

Chat · Reasoning

Gemma 3 12B IT

Google · 12B

1.9M 678

Google Gemma 3 12B IT is a 12-billion parameter multimodal instruction-tuned model from Google's Gemma 3 series. It supports both text and image inputs, offering vision-language capabilities at a more accessible size point than the 27B variant. Gemma 3 12B IT runs on consumer GPUs with 12-16GB of VRAM in quantized formats, making it a practical choice for local multimodal inference without requiring top-tier hardware. Released under the Gemma license.

Vision

NVIDIA Nemotron 3 Nano 30B A3B BF16

NVIDIA · 31.6B

924.9K 669

NVIDIA Nemotron 3 Nano 30B A3B is a mixture-of-experts model with 31.6 billion total parameters but only around 3 billion active per token, giving it the intelligence of a much larger model with the speed of a small one. This BF16 version preserves full precision for maximum output quality. The MoE architecture makes this model especially interesting for local deployment. You get reasoning and instruction-following capabilities that punch well above what a traditional 3B model can deliver, while inference stays fast because only a fraction of the network fires for each token.

Chat

Qwen3 32B

Alibaba · 32B

5.0M 668

Qwen3 32B is the flagship dense model in Alibaba Cloud's Qwen 3 series, with 32 billion parameters. It is instruction-tuned for chat and delivers strong performance across reasoning, coding, mathematics, and multilingual tasks. Qwen3 32B supports a hybrid thinking mode that allows the model to engage in extended chain-of-thought reasoning or respond quickly depending on the task, giving users flexibility between depth and speed. The model requires a GPU with at least 24GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. It represents a significant generational improvement over Qwen 2.5 in both instruction following and knowledge breadth. Released under the Apache 2.0 license.

Chat

Qwen2.5 Coder 7B Instruct

Alibaba · 7.6B

2.1M 666

Qwen2.5 Coder 7B Instruct is a 7.6-billion parameter code-specialized instruction-tuned model from Alibaba Cloud. It is trained on a large corpus of source code and natural language, fine-tuned for programming assistance tasks such as code generation, completion, debugging, and code explanation. The model supports a 128K token context window and runs efficiently on consumer GPUs with 8GB or more of VRAM. It provides a good balance between coding capability and hardware requirements for developers looking to run a local coding assistant. Released under the Apache 2.0 license.

Chat · Code
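Long context windows like the 128K above cost VRAM beyond the weights themselves, because the KV cache grows linearly with the number of tokens in context. The architecture numbers in this sketch (layer count, KV heads, head dimension) are assumptions typical of a 7B-class model with grouped-query attention, not values read from any specific config file:

```python
# KV-cache size grows linearly with context length and comes on top of
# weight memory. Architecture numbers are assumed values typical of a
# 7B model with grouped-query attention, not from a specific config.
def kv_cache_gb(seq_len: int, layers: int = 28, kv_heads: int = 4,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV-cache memory in GB; 2x for keys and values, FP16 cache assumed."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return seq_len * per_token_bytes / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.2f} GB KV cache")
```

Under these assumptions a 4K context adds only ~0.2 GB, but filling the full 128K window adds several GB on top of the quantized weights, which is why long-context sessions can exceed an 8 GB card even when the weights alone fit.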

MiMo v2 Flash

XiaomiMiMo · 309.8B

337.1K 650

MiMo V2 Flash is Xiaomi's large-scale mixture-of-experts language model, built with nearly 310 billion total parameters. Designed for fast inference despite its size, the Flash variant prioritizes throughput and responsiveness, making it well-suited for interactive chat and real-time applications. Running it locally is a serious undertaking that demands high-end multi-GPU configurations, but it brings flagship-level Chinese and English language capabilities to users who have the hardware to support it.

Chat

Qwen2.5 1.5B Instruct

Alibaba · 1.5B

8.8M 637

Qwen2.5 1.5B Instruct is a 1.5-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It is a lightweight model suitable for deployment on minimal hardware, including low-VRAM GPUs and even CPU-only setups with acceptable latency. It supports a 128K token context window. The model handles basic conversational tasks, simple question answering, and text generation. While limited in reasoning depth compared to larger variants, it is useful for applications where fast response times and minimal resource consumption are priorities. Released under the Apache 2.0 license.

Chat

Gemma 2 2B

Google · 2B

276.5K 629

Google Gemma 2 2B is a 2-billion parameter base (pretrained) model from Google's Gemma 2 family. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Its compact size makes it suitable for experimentation, rapid prototyping, and domain-specific fine-tuning on consumer hardware with minimal VRAM. Released under the Gemma license.

Chat

GPT OSS 20B GGUF

Unsloth · 20B

324.2K 627

This is a GGUF-quantized version of OpenAI's GPT-OSS 20B, repackaged by Unsloth. GPT-OSS 20B is OpenAI's open-weight release, bringing the company's model-building expertise to the open community with a 20-billion-parameter architecture. Unsloth's GGUF conversion makes this model compatible with llama.cpp and popular frontends like Ollama and LM Studio. At 20B parameters, it sits in a productive middle ground, large enough to deliver strong reasoning and generation quality while remaining runnable on consumer GPUs with 16GB or more of VRAM at appropriate quantization levels.

Chat

Distilgpt2

distilbert · 88M

2.3M 618

DistilGPT-2 is a distilled version of OpenAI's GPT-2 model, compressed to just 88 million parameters while retaining much of the original model's text generation ability. Created using knowledge distillation techniques, it offers significantly faster inference than the full GPT-2 with only a modest reduction in output quality. This model is one of the lightest autoregressive language models available and can run on virtually any hardware, including CPUs. It is a practical choice for educational projects, quick prototyping, and applications where inference speed and minimal resource usage are more important than state-of-the-art generation quality.

Chat

DeepSeek R1 Distill Qwen 14B

DeepSeek · 14.8B

742.0K 613

DeepSeek R1 Distill Qwen 14B sits in a sweet spot between the smaller 7B distill and the more demanding 32B version, offering strong reasoning performance at 14.8 billion parameters on the Qwen 2.5 architecture. It captures a meaningful share of the full R1's chain-of-thought capabilities while keeping resource requirements within the range of mainstream consumer GPUs. Quantized to 4-bit, it fits comfortably on GPUs with 12 GB of VRAM, delivering reliable step-by-step reasoning for math, logic, and analytical problems.

Chat · Reasoning

Qwen3 4B

Alibaba · 4B

6.2M 570

Qwen3 4B is a compact 4-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 family. It is designed for efficient local inference on consumer hardware, supporting chat and general assistant tasks while fitting comfortably on GPUs with 6GB or more of VRAM in quantized formats. The model supports hybrid thinking mode, allowing it to balance reasoning depth and response speed. Despite its small footprint, Qwen3 4B delivers quality competitive with larger models from previous generations, making it a practical choice for lightweight local deployments and resource-constrained environments. Released under the Apache 2.0 license.

Chat

Qwen3 4B Thinking 2507

Alibaba · 4B

514.8K 567

Qwen3 4B Thinking 2507 is the reasoning-optimized variant of Alibaba's compact 4-billion-parameter Qwen3 model, released in the July 2025 update cycle. Despite its small size, this thinking variant is tuned to produce chain-of-thought reasoning and step-by-step problem solving, making it a surprisingly capable lightweight reasoner. This model is ideal for users who want basic reasoning and analytical capabilities on very modest hardware. It can run on most consumer GPUs and even some CPU-only setups when quantized, providing an accessible entry point for experimenting with reasoning-style models without any significant hardware investment.

Chat

Gemma 3 270M IT

Google · 270M

95.6K 564

Google Gemma 3 270M IT is a 270-million parameter instruction-tuned model from Google's Gemma 3 family, an experimental release pushing the boundaries of how small an effective chat model can be. The model runs on virtually any hardware, including entry-level GPUs and CPU-only setups, making it useful for experimentation, education, and exploring the limits of small-scale language modeling. Released under the Gemma license.

Chat

Gemma 2 27B IT

Google · 27.2B

401.5K 560

Google Gemma 2 27B IT is a 27.2-billion parameter instruction-tuned model from Google's Gemma 2 generation. It is a text-only chat model optimized for conversational use, reasoning, and instruction following. Gemma 2 27B IT was one of the strongest openly available models in its size class at release. The model requires a GPU with at least 24GB of VRAM for quantized local inference. It is widely supported by popular inference engines and remains a strong choice for users seeking high-quality local chat without needing 70B-class hardware. Released under the Gemma license.

Chat

DeepSeek Coder v2 Lite Instruct

DeepSeek · 15.7B

239.5K 559

DeepSeek Coder V2 Lite Instruct is a code-focused mixture-of-experts model with 15.7 billion total parameters, trained to handle both programming tasks and general conversation. It supports a wide range of programming languages and excels at code generation, debugging, explanation, and refactoring. The MoE architecture keeps compute costs manageable despite the model's broad capabilities, and the Lite variant is sized to run on a single consumer GPU. For developers looking for a capable local coding assistant that can also handle general chat, this model offers an appealing combination of code specialization and practical hardware requirements.

Chat · Code

Qwen3 Coder 30B A3B Instruct GGUF

Unsloth · 30B

150.9K 536

This is a GGUF-quantized version of Alibaba's Qwen3 Coder 30B A3B Instruct, repackaged by Unsloth. Qwen3 Coder is a code-specialized model from the Qwen3 family that uses a Mixture-of-Experts (MoE) architecture with 30 billion total parameters but only around 3 billion active parameters per inference step, delivering strong coding performance with efficient resource usage. The MoE design means this model punches well above its active parameter count in code generation, debugging, and explanation tasks. Unsloth's GGUF format makes it compatible with llama.cpp-based tools. Thanks to the sparse activation pattern, it requires significantly less VRAM than a dense 30B model, making it a compelling choice for developers who want a capable local coding assistant without top-tier hardware.

Chat · Code

LFM2.5 1.2B Instruct

LiquidAI · 1.2B

219.2K 527

LFM2.5 1.2B Instruct is an instruction-tuned model from Liquid AI that uses a novel hybrid architecture combining state-space models with attention mechanisms. At just 1.2 billion parameters, it is exceptionally lightweight and can run on virtually any hardware, including laptops and edge devices. Liquid AI's unconventional architecture aims to deliver better efficiency and longer context handling than traditional transformer models at this scale, making it an interesting option for users exploring alternatives to standard transformer-based LLMs.

Chat