All LLM Models

Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
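As a rule of thumb, weight memory is parameter count times bytes per weight, plus overhead for the KV cache and activations. The sketch below makes that arithmetic concrete; the bits-per-weight figures and the 20% overhead factor are ballpark assumptions, not exact numbers for any particular inference engine.

```python
# Rough VRAM estimate: weights = params * bytes/param, plus overhead
# for KV cache and activations. The 1.2 overhead factor and the
# bits-per-weight figures below are approximations, not exact values
# for any specific runtime.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,  # approximate effective bits for this GGUF quant
}

def estimate_vram_gb(params_billions: float, quant: str, overhead: float = 1.2) -> float:
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    weight_gb = params_billions * bytes_per_weight  # 1e9 params * bytes -> GB
    return round(weight_gb * overhead, 1)

# A 7B model at FP16 vs Q4_K_M:
print(estimate_vram_gb(7, "FP16"))    # 16.8 with overhead; weights alone ~14 GB
print(estimate_vram_gb(7, "Q4_K_M"))  # 5.1 with overhead; weights alone ~4.2 GB
```

The same estimate scales to the larger entries below: at roughly 4-5 effective bits per weight, a 70B model lands near 40-50 GB and a 405B model near 230-300 GB, which is why those sizes call for workstation or multi-GPU hardware.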

Model List

DeepSeek R1 Distill Qwen 32B

DeepSeek · 32.8B

938.1K 1.5K

DeepSeek R1 Distill Qwen 32B takes the reasoning capabilities developed in the full 684.5B R1 model and distills them into the 32.8 billion parameter Qwen 2.5 architecture. The result is a dense model that punches well above its weight class on math, science, and coding reasoning tasks, often matching models two to three times its size. At around 32.8 billion parameters, this model fits comfortably on a single high-end consumer GPU when quantized to 4-bit precision, making it one of the most capable reasoning models you can run on a desktop workstation.

Chat · Reasoning


Meta Llama 3 70B Instruct

Meta · 70.6B

83.9K 1.5K

Meta Llama 3 70B Instruct is a 70.6-billion parameter instruction-tuned model from Meta's Llama 3 release. It is fine-tuned for dialogue, coding assistance, and complex reasoning tasks using supervised fine-tuning and RLHF. At the time of release, it was among the most capable openly available models. The model supports an 8K token context window and requires substantial VRAM for local inference, typically needing multi-GPU setups or high-VRAM professional GPUs. It has been widely adopted for local deployment in quantized formats. Released under the Meta Llama 3 Community License.

Chat

DeepSeek R1 Distill Qwen 1.5B

DeepSeek · 1.5B

982.4K 1.5K

DeepSeek R1 Distill Qwen 1.5B is the smallest model in the R1 distillation family, packing chain-of-thought reasoning capabilities into just 1.5 billion parameters using the Qwen 2.5 architecture. It represents an ambitious attempt to bring structured reasoning to the smallest practical model size. At this scale, the model can run on virtually any modern GPU and even on CPU-only setups with acceptable speed. While its reasoning depth is naturally limited compared to its larger siblings, it still demonstrates structured thinking patterns that set it apart from generic models of similar size.

Chat · Reasoning

Phi 3 Mini 4k Instruct

Microsoft · 3.8B

926.3K 1.4K

Microsoft Phi 3 Mini 4K Instruct is a 3.8-billion parameter instruction-tuned model from Microsoft Research's Phi 3 generation, with a 4K token context window. The Phi 3 family demonstrated that small models trained on carefully curated, high-quality data can achieve performance competitive with models several times their size. The model runs on consumer GPUs with as little as 4-6GB of VRAM when quantized, making it one of the most accessible capable chat models for local deployment. Released under the MIT license.

Chat · Code

Llama 3.2 1B Instruct

Meta · 1B

4.1M 1.3K

Meta Llama 3.2 1B Instruct is a 1-billion parameter instruction-tuned model from Meta, the smallest in the Llama 3.2 family. It is designed for ultra-lightweight deployment scenarios where minimal hardware resources are available, supporting a 128K token context window despite its compact size. This model is suitable for basic conversational tasks, text summarization, and simple instruction following. It can run on virtually any modern GPU and even on CPU-only setups with acceptable performance. Released under the Llama 3.2 Community License.

Chat

Qwen3 Coder 480B A35B Instruct

Alibaba · 480.2B

76.6K 1.3K

Qwen3 Coder 480B A35B Instruct is Alibaba's largest code-specialized model, a massive 480.2-billion-parameter mixture-of-experts system with roughly 35 billion parameters active per token. This is the most powerful open-weight coding model in the Qwen3 family, designed for professional-grade code generation, analysis, and software engineering tasks. Running this model locally is a serious undertaking that requires multi-GPU server-class hardware with several hundred gigabytes of combined VRAM. For users with access to such infrastructure, it offers exceptional code quality and understanding that rivals leading proprietary coding assistants, all while keeping data and computation entirely under local control.

Chat · Code

DeepSeek V3.2

DeepSeek · 685.4B

273.6K 1.3K

DeepSeek V3.2 is the latest iteration of DeepSeek's general-purpose flagship, building on the V3 architecture with 685.4 billion total parameters in a mixture-of-experts configuration. This update refines the model's conversational abilities, instruction following, and multilingual performance compared to earlier V3 releases. Running V3.2 locally requires significant GPU resources due to the large total parameter count, though the MoE design means only a subset of parameters are active for any given token. Users with multi-GPU workstations or servers can run quantized versions effectively, making this one of the most powerful open-weight chat models available for self-hosted deployment.

Chat

Gemma 2 2B IT

Google · 2B

433.7K 1.3K

Google Gemma 2 2B IT is a 2-billion parameter instruction-tuned model from Google's Gemma 2 family, the smallest variant in the Gemma 2 series. It is designed for efficient local inference on resource-constrained hardware, handling basic conversational tasks and simple instruction following at minimal compute cost. The model can run on GPUs with as little as 4GB of VRAM when quantized, and even on CPU-only setups. Released under the Gemma license.

Chat

MiniMax M2.1

MiniMaxAI · 228.7B

53.2K 1.3K

MiniMax M2.1 is an earlier generation of MiniMax's large mixture-of-experts model series, built on the same 228-billion-total-parameter MoE architecture as its successor, M2.5. It offers strong multilingual performance across Chinese and English tasks, including conversation, reasoning, and content generation. While M2.5 refines the formula, M2.1 remains a capable option for users with the multi-GPU hardware needed to host a model of this scale locally.

Chat

Gemma 7B IT

Google · 7B

54.4K 1.2K

Google Gemma 7B IT is a 7-billion parameter instruction-tuned model from the original Gemma generation. It is fine-tuned for conversational use and general instruction following, running efficiently on consumer GPUs with 8GB or more of VRAM. As a first-generation Gemma model, it has been superseded by Gemma 2 and Gemma 3 models in quality and capability, but it remains well-supported by inference frameworks. Released under the Gemma license.

Chat

MiniMax M2.5

MiniMaxAI · 228.7B

520.4K 1.2K

MiniMax M2.5 is a large-scale mixture-of-experts model from MiniMax, a well-funded Chinese AI company. With roughly 228 billion total parameters and a MoE architecture that activates only a fraction per token, it aims to deliver performance competitive with much larger dense models while keeping inference costs manageable. Running it locally requires substantial hardware due to its large parameter footprint, but quantized versions can make it accessible to users with multi-GPU setups looking for a powerful multilingual model with strong Chinese and English capabilities.

Chat

Qwen3 0.6B

Alibaba · 752M

12.2M 1.1K

Qwen3 0.6B is the smallest instruction-tuned model in Alibaba Cloud's Qwen 3 family, with approximately 752 million parameters. It is designed for ultra-lightweight deployment where minimal hardware resources are available, running comfortably on virtually any modern GPU or CPU-only setups. The model supports hybrid thinking mode despite its tiny footprint. While limited in reasoning depth compared to larger variants, Qwen3 0.6B handles basic chat, simple summarization, and lightweight instruction following. It is primarily useful for edge deployment, rapid prototyping, and experimentation where model size is a critical constraint. Released under the Apache 2.0 license.

Chat

Qwen2.5 7B Instruct

Alibaba · 7.6B

22.8M 1.1K

Qwen2.5 7B Instruct is a 7.6-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K token context window and is fine-tuned for conversational AI, instruction following, and general assistant tasks. Its efficient size makes it well-suited for local deployment on consumer GPUs with 8GB or more of VRAM. The model delivers strong performance for its parameter class across reasoning, multilingual understanding, and coding tasks. It benefits from the improved pretraining data and techniques of the Qwen 2.5 generation. Released under the Apache 2.0 license and widely supported by inference frameworks such as llama.cpp, vLLM, and Ollama.

Chat

Qwen3 Coder Next

Alibaba · 79.7B

1.1M 1.1K

Qwen3 Coder Next is a 79.7-billion parameter code-specialized instruction-tuned model from Alibaba Cloud, the next generation of the Qwen Coder series. It is trained extensively on source code and programming-related data, delivering strong performance across code generation, completion, debugging, refactoring, and software engineering dialogue. The model represents a significant step up in coding capability within the Qwen family. Due to its large parameter count, running Qwen3 Coder Next locally requires substantial VRAM, typically 48GB or more at reduced precision, placing it in the territory of professional GPUs or multi-GPU consumer setups. It is a top-tier choice for developers who want one of the most capable local coding assistants available. Released under the Apache 2.0 license.

Chat · Code

Llama 2 13B Chat HF

Meta · 13B

146.9K 1.1K

Meta Llama 2 13B Chat is a 13-billion parameter instruction-tuned model from Meta's Llama 2 family, fine-tuned for dialogue and chat applications. It offers improved reasoning and generation quality over the 7B variant while maintaining manageable hardware requirements with a 4K token context window. The model was fine-tuned using supervised fine-tuning and RLHF. It can run on consumer GPUs with 16GB or more of VRAM at reduced precision. Released under the Llama 2 Community License.

Chat

Falcon 7B

TII UAE · 7B

155.3K 1.1K

Falcon 7B was one of the first truly competitive open-source large language models, released in mid-2023 by the Technology Innovation Institute in Abu Dhabi. Trained on the massive RefinedWeb dataset, it demonstrated that carefully curated web data could rival models trained on more traditionally assembled corpora. At 7 billion parameters, Falcon 7B helped establish the 7B class as the sweet spot for local inference, offering genuine language understanding on consumer GPUs with as little as 6 GB of VRAM.

Chat

Qwen3 235B A22B

Alibaba · 235.1B

759.7K 1.1K

Qwen3 235B A22B is the largest model in Alibaba Cloud's Qwen 3 series, a Mixture of Experts (MoE) model with 235 billion total parameters and approximately 22 billion active parameters per forward pass. The MoE architecture enables it to deliver performance competitive with the best available open-weight models while requiring significantly less compute per token than a comparably sized dense model. It supports hybrid thinking mode for flexible chain-of-thought reasoning. Due to its massive total parameter count, running Qwen3 235B A22B locally requires substantial VRAM to load all expert weights, typically needing multiple high-end professional GPUs even at reduced precision. In heavily quantized formats it becomes accessible on workstation-class multi-GPU setups. Released under the Apache 2.0 license.

Chat
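
The memory-versus-compute split that makes MoE models like Qwen3 235B A22B practical can be made concrete with a little arithmetic. This sketch uses standard back-of-envelope assumptions (2 bytes per weight at FP16, ~2 FLOPs per active parameter per generated token), not measured figures for any specific hardware:

```python
# MoE tradeoff: VRAM scales with TOTAL parameters (all expert weights
# must be resident), while per-token compute scales with ACTIVE
# parameters. Assumes 2 bytes/weight (FP16) and ~2 FLOPs per active
# parameter per token -- rough rules of thumb, not exact figures.

def moe_footprint(total_b: float, active_b: float):
    vram_gb = total_b * 2.0                  # FP16 weight memory, in GB
    tflops_per_token = active_b * 2 / 1000   # TFLOPs to generate one token
    return vram_gb, tflops_per_token

# Qwen3 235B A22B: dense-class memory cost, mid-size-class compute cost.
vram, tflops = moe_footprint(total_b=235, active_b=22)
print(vram)    # 470.0 GB of FP16 weights
print(tflops)  # 0.044 TFLOPs per token, comparable to a ~22B dense model
```

The same arithmetic shows why even heavily quantized builds of this model need multi-GPU setups: at roughly 4.5 effective bits per weight, 235B parameters still amounts to around 130 GB of weights.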

Falcon 7B Instruct

TII UAE · 7.2B

50.8K 1.0K

Falcon 7B Instruct is the instruction-tuned version of TII's Falcon 7B, fine-tuned on a mix of chat and instruction datasets to follow user prompts more reliably. It was among the early open models to show that a well-tuned 7B model could handle conversational tasks, summarization, and basic reasoning without requiring massive hardware. While newer models have since raised the bar, Falcon 7B Instruct remains a lightweight option for users who want a responsive local assistant with modest resource requirements.

Chat

Nanbeige4.1 3B

Nanbeige · 3.9B

689.2K 1.0K

Nanbeige4.1 3B is a compact chat model from Nanbeige, a Chinese AI startup focused on building efficient small-scale language models. At just under 4 billion parameters, it is designed to run on virtually any modern GPU or even on CPU, making it one of the more accessible options for users with limited hardware. Despite its small size, it handles basic conversation, simple reasoning, and Chinese-English bilingual tasks, serving as a practical entry point for local LLM experimentation.

Chat

Gemma 3 270M

Google · 270M

83.4K 995

Google Gemma 3 270M is a 270-million parameter base (pretrained) model from Google's Gemma 3 family. It is an experimental release intended for research, fine-tuning, and exploring the capabilities of ultra-small language models. The model runs on virtually any hardware with negligible resource requirements. Released under the Gemma license.

Chat

Qwen3 8B

Alibaba · 8.2B

8.2M 985

Qwen3 8B is an 8.2-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 series. It is a general-purpose chat model that delivers strong performance across reasoning, multilingual understanding, and coding tasks while remaining efficient enough to run on consumer GPUs with 8GB or more of VRAM. Like other Qwen 3 models, it supports hybrid thinking mode for flexible reasoning depth. The model benefits from the improved pretraining data and training methodology of the Qwen 3 generation, offering notable quality gains over Qwen 2.5 at the same parameter count. It is widely supported by inference frameworks including llama.cpp, vLLM, and Ollama. Released under the Apache 2.0 license.

Chat

Qwen3 Coder 30B A3B Instruct

Alibaba · 30B

1.0M 976

Qwen3 Coder 30B A3B Instruct is a code-specialized Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 Coder series, with 30 billion total parameters and approximately 3 billion active parameters per forward pass. The MoE architecture allows it to deliver strong coding performance while keeping per-token compute costs low, making it faster at inference than comparably capable dense models. The model is instruction-tuned for programming assistance, code generation, debugging, and software engineering conversation. It requires VRAM proportional to its total 30B parameter count for loading weights, but benefits from efficient inference throughput due to its low active parameter count. Released under the Apache 2.0 license.

Chat · Code

Llama 3.1 405B

Meta · 405B

514.6K 965

Meta Llama 3.1 405B is the largest model in the Llama family with 405 billion parameters. It represents Meta's most capable open-weight model, delivering performance competitive with leading proprietary models across reasoning, coding, math, and multilingual tasks. It features a 128K token context window. Due to its massive size, running Llama 3.1 405B locally requires significant hardware, typically multiple high-end professional GPUs with a combined VRAM of 200GB or more at reduced precision. It is primarily used in quantized formats for local inference or via multi-node setups. Released under the Llama 3.1 Community License.

Chat

Mistral Small 24B Instruct 2501

Mistral AI · 24B

178.1K 955

Mistral Small 24B Instruct is Mistral AI's January 2025 release targeting the mid-range parameter sweet spot. At 24 billion parameters it sits between lightweight 7B models and heavier 70B-class offerings, delivering strong instruction-following, reasoning, and coding performance without demanding top-tier hardware. This model fits comfortably on a single GPU with 16–24 GB of VRAM at common quantization levels, making it an attractive option for users with cards like the RTX 4090 or RTX 3090 who want a noticeable step up from 7B models. It strikes an appealing balance between quality and resource requirements for serious local use.

Chat

Qwen3 Next 80B A3B Instruct

Alibaba · 81.3B

930.5K 949

Qwen3 Next 80B A3B Instruct is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 3 series, with approximately 81.3 billion total parameters and around 3 billion active parameters per forward pass. This extreme ratio between total and active parameters allows the model to encode extensive knowledge across its expert layers while maintaining very fast per-token inference, making it an unusually efficient design for its capability level. The model is instruction-tuned for general-purpose chat and requires VRAM proportional to its full 80B parameter count for weight loading, typically needing high-VRAM GPUs or quantized multi-GPU setups. Its low active parameter count results in fast generation speeds despite the large total model size. Released under the Apache 2.0 license.

Chat

Qwen2.5 72B Instruct

Alibaba · 72.7B

733.3K 917

Qwen2.5 72B Instruct is the flagship model of the Qwen 2.5 series from Alibaba Cloud, with 72.7 billion parameters. It is instruction-tuned for conversational use and excels across reasoning, coding, mathematics, and multilingual tasks. Qwen2.5 72B delivers performance competitive with leading open-weight 70B-class models while supporting a 128K token context window and structured output generation. The model uses a Transformer architecture with grouped-query attention and was pretrained on a diverse multilingual corpus of over 18 trillion tokens. Running it locally requires high-VRAM GPUs or multi-GPU setups, though quantized formats make it accessible on workstation-class hardware. Released under the Apache 2.0 license.

Chat

SmolLM3 3B

Hugging Face · 3B

167.8K 907

SmolLM3 3B is Hugging Face's latest-generation compact language model, representing a significant step up from the SmolLM2 series. At 3 billion parameters, it delivers considerably stronger reasoning, instruction following, and general language understanding while maintaining modest hardware requirements that keep it accessible on most consumer GPUs. This model benefits from improved training data, architectural refinements, and lessons learned from previous SmolLM generations. It is well positioned for local chatbot applications, coding assistance, and content generation tasks where you want strong performance without dedicating the resources required by 7B-class models.

Chat

Llama 3.1 70B Instruct

Meta · 70.6B

840.4K 896

Meta Llama 3.1 70B Instruct is a 70.6-billion parameter instruction-tuned model from Meta's Llama 3.1 family. It features a 128K token context window and is optimized for chat, tool use, and complex reasoning tasks. The 70B size offers a strong balance between capability and hardware requirements, running well on multi-GPU setups or high-VRAM workstation cards. This model was trained on over 15 trillion tokens and fine-tuned with reinforcement learning from human feedback (RLHF). It excels at coding assistance, mathematical reasoning, and multilingual dialogue. Released under the Llama 3.1 Community License.

Chat

OpenHermes 2.5 Mistral 7B

Teknium · 7B

151.8K 888

OpenHermes 2.5 is a community-driven fine-tune of Mistral 7B created by Teknium, trained on over 900,000 entries of high-quality synthetic data generated primarily by GPT-4. It quickly became one of the most popular open chat models of its era, consistently topping community benchmarks for 7B-class models. For local users, it offers strong instruction-following, creative writing, and coding assistance in a package that runs comfortably on a single consumer GPU with 8 GB of VRAM.

Chat

Gemma 3 1B IT

Google · 1B

3.3M 877

Google Gemma 3 1B IT is a 1-billion parameter instruction-tuned model from Google's Gemma 3 family. It is an ultra-compact text-only chat model designed for deployment on minimal hardware, including low-VRAM GPUs and edge devices. The model handles basic conversational tasks, simple instruction following, and lightweight text generation. It can run on virtually any modern GPU and even on CPU-only setups with acceptable latency. Released under the Gemma license.

Chat