All LLM Models
Browse 225 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
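This rule of thumb reduces to a quick back-of-the-envelope calculation: parameters × bits per weight ÷ 8 gives the weight footprint in bytes. The sketch below assumes Q4_K_M averages roughly 4.5 effective bits per weight (a common approximation, not an exact figure) and counts weights only; KV cache and runtime overhead add more on top in practice.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate in GB.

    Excludes KV cache, activations, and framework overhead, which
    typically add another 1-4+ GB depending on context length.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9, 1)

# 7B model at FP16 (16 bits/weight)
print(estimate_vram_gb(7, 16))   # → 14.0
# 7B model at Q4_K_M (~4.5 effective bits/weight, an approximation)
print(estimate_vram_gb(7, 4.5))  # → 3.9
```

These numbers match the ~14 GB and ~4 GB figures quoted above; treat the output as a lower bound rather than a guarantee that a model fits.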
Model List
DeepSeek R1
DeepSeek · 684.5B
DeepSeek R1 is a groundbreaking reasoning model that uses reinforcement learning to develop chain-of-thought capabilities without relying on supervised fine-tuning. With 684.5 billion total parameters in a mixture-of-experts architecture (only 37 billion active per token), R1 achieves performance competitive with OpenAI's o1 on math, coding, and complex reasoning benchmarks while remaining fully open-weight. Running the full R1 locally is a serious undertaking, requiring well over 300 GB of VRAM at full precision, though quantized versions bring it within reach of multi-GPU setups. For users who want R1-level reasoning on more modest hardware, DeepSeek also released a family of distilled models that pack R1's reasoning patterns into smaller dense architectures.
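The memory math behind MoE models like R1 can be sketched as follows: VRAM scales with the total parameter count, because every expert must stay resident, while per-token compute scales only with the active subset. The 4.5 bits/weight figure below is an assumed Q4-class quantization level, not an official recommendation.

```python
def moe_memory_vs_compute(total_params_b: float, active_params_b: float,
                          bits_per_weight: float = 4.5) -> dict:
    """Weight memory scales with TOTAL params (all experts loaded);
    per-token compute scales only with the ACTIVE fraction."""
    weights_gb = total_params_b * 1e9 * bits_per_weight / 8 / 1e9
    return {
        "weights_gb": round(weights_gb),
        "active_fraction": round(active_params_b / total_params_b, 3),
    }

# DeepSeek R1: 684.5B total, ~37B active per token
print(moe_memory_vs_compute(684.5, 37))
# → {'weights_gb': 385, 'active_fraction': 0.054}
```

Even at 4-bit-class quantization the weights alone approach 400 GB, which is why R1 remains a multi-GPU proposition and the distilled models exist for modest hardware.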
Meta Llama 3 8B
Meta · 8.0B
Meta Llama 3 8B is an 8-billion parameter base (pretrained) language model from Meta's Llama 3 release. As a base model, it is not fine-tuned for chat or instructions and is intended for further fine-tuning, research, or as a foundation for custom applications. It uses grouped-query attention and was trained on over 15 trillion tokens. Llama 3 8B supports an 8K token context window and delivers strong benchmark performance across language understanding, reasoning, and coding tasks for its size. It is released under the Meta Llama 3 Community License and runs efficiently on consumer GPUs with 8GB or more of VRAM.
Llama 3.1 8B Instruct
Meta · 8B
Meta Llama 3.1 8B Instruct is an 8-billion parameter instruction-tuned language model from Meta. Part of the Llama 3.1 release, it supports a 128K token context window and is fine-tuned for conversational use, tool calling, and general assistant tasks. Its compact size makes it well-suited for local deployment on modern consumer GPUs with 8GB or more of VRAM. Llama 3.1 8B Instruct delivers strong performance for its parameter class across benchmarks in reasoning, coding, and multilingual understanding. It is released under the Llama 3.1 Community License and is widely supported by inference frameworks such as llama.cpp, vLLM, and Ollama.
Llama 2 7B Chat HF
Meta · 7B
Meta Llama 2 7B Chat is a 7-billion parameter instruction-tuned model from Meta's Llama 2 family, optimized for dialogue use cases. It was fine-tuned using supervised fine-tuning and RLHF on top of the Llama 2 7B base model, with a 4K token context window. This model is suitable for basic conversational AI tasks and runs efficiently on consumer GPUs. While newer Llama generations offer improved performance, Llama 2 7B Chat remains a well-understood and widely-supported option for local inference. Released under the Llama 2 Community License.
Mixtral 8x7B Instruct v0.1
Mistral AI · 46.7B
Mixtral 8x7B Instruct v0.1 is Mistral AI's flagship Mixture-of-Experts model. Each layer routes tokens to two of eight expert feed-forward networks; because attention and embedding layers are shared across experts, the total comes to 46.7B parameters rather than a naive 8 × 7B, with only about 12.9 billion active per token. This sparse architecture delivers performance that rivals much larger dense models at a fraction of the inference cost, excelling across reasoning, code generation, and multilingual tasks. Because the full weights must still be loaded into memory, you will need around 24–48 GB of VRAM depending on quantization level, making it best suited for multi-GPU desktop setups or high-VRAM workstation cards. If your hardware can accommodate it, Mixtral offers one of the best performance-per-active-parameter ratios available for local deployment.
GPT OSS 120B
OpenAI · 120.4B
GPT-OSS 120B is the larger of OpenAI's open-source model releases, bringing 120.4 billion parameters of GPT-lineage capability to the open-weight ecosystem. It represents near-frontier performance across reasoning, knowledge, code generation, and conversational tasks, rivaling top proprietary offerings in many benchmarks. Running this model locally is a serious hardware commitment, typically requiring multiple high-VRAM GPUs or a professional-grade setup with 80+ GB of combined VRAM even at aggressive quantization levels. It is best suited for enthusiasts with multi-GPU rigs or workstation hardware who want the strongest possible local model from OpenAI's catalog.
GPT OSS 20B
OpenAI · 21.5B
GPT-OSS 20B is one of OpenAI's first open-source model releases, marking a historic shift in the company's approach to open weights. At 21.5 billion parameters it delivers strong general-purpose chat and reasoning capabilities informed by the research behind the GPT family, making it a compelling option for users who want OpenAI-grade quality in a locally deployable package. The model runs comfortably on a single high-end consumer GPU such as an RTX 4090 at 4-bit quantization, or on workstation cards with 24 GB or more of VRAM at higher precision. It occupies a practical middle ground between lightweight 7B models and resource-heavy 70B+ offerings.
Meta Llama 3 8B Instruct
Meta · 8.0B
Meta Llama 3 8B Instruct is the instruction-tuned version of Meta's Llama 3 8B base model, with 8 billion parameters. It is fine-tuned for dialogue and chat use cases using supervised fine-tuning and RLHF, making it ready for conversational applications out of the box. The model supports an 8K token context window and performs well across coding, reasoning, and general knowledge tasks. Its efficient size makes it one of the most popular models for local inference on consumer hardware. Released under the Meta Llama 3 Community License.
Mistral 7B v0.1
Mistral AI · 7B
Mistral 7B v0.1 is the original base model from Mistral AI that helped reshape expectations for small open-weight language models when it launched in late 2023. As a pretrained foundation model without instruction tuning, it is designed for fine-tuning, research, and custom downstream tasks rather than direct conversational use. With 7 billion parameters and support for grouped-query attention and sliding-window attention, it remains a popular starting point for practitioners building specialized models. Its modest VRAM requirements of roughly 6 GB at 4-bit quantization keep it accessible on a wide range of consumer GPUs.
Phi 2
Microsoft · 2.8B
Microsoft Phi 2 is a 2.8-billion parameter language model from Microsoft Research that pioneered the concept of small but highly capable language models. Released in late 2023, Phi 2 demonstrated that strategic data curation and training methodology could allow a sub-3B model to outperform many 7B and 13B models on reasoning and coding benchmarks. The model runs on virtually any modern GPU and even on CPU-only setups. While succeeded by Phi 3 and Phi 4, Phi 2 remains historically significant as the model that proved small-scale language models could be genuinely useful for practical tasks. Released under the MIT license.
Gemma 7B
Google · 7B
Google Gemma 7B is a 7-billion parameter base (pretrained) model from the original Gemma generation, Google's first openly available family of language models. It represents Google's initial entry into the open-weight LLM space. While superseded by Gemma 2 and Gemma 3 in terms of benchmark performance, the original Gemma 7B remains a solid foundation model and a useful reference point in the evolution of Google's open models. Released under the Gemma license.
GPT-2
OpenAI · 137M
GPT-2 is the landmark 2019 language model from OpenAI that helped ignite widespread interest in large-scale text generation. At only 137 million parameters it is tiny by modern standards, but it holds an important place in AI history as the model that was initially deemed too dangerous to release in full. Today GPT-2 runs effortlessly on virtually any hardware, including CPUs, making it ideal for educational purposes, experimentation, and understanding transformer fundamentals. It should not be expected to match the quality of modern instruction-tuned models, but it remains a useful teaching tool and conversation starter.
DeepSeek V3 0324
DeepSeek · 684.5B
DeepSeek V3 0324 is DeepSeek's flagship general-purpose chat model, featuring a 684.5 billion parameter mixture-of-experts architecture with roughly 37 billion parameters active per token. It delivers strong performance across a wide range of tasks including conversation, writing, analysis, coding, and instruction following, competing with the best closed-source models available. Like other large MoE models, V3 requires substantial memory to load all expert weights even though only a fraction are used during inference. Quantized versions make it feasible on multi-GPU setups, and its combination of broad capability with open weights has made it one of the most widely deployed open models for local and self-hosted use.
QwQ 32B
Alibaba · 32B
QwQ 32B is a 32-billion parameter reasoning-focused model from Alibaba Cloud's Qwen family. Unlike standard chat models, QwQ is specifically optimized for step-by-step logical reasoning, complex problem solving, and mathematical tasks. It employs extended chain-of-thought processing, generating detailed internal reasoning before producing final answers, which significantly improves accuracy on challenging analytical problems. The model requires a GPU with at least 24GB of VRAM for quantized inference and delivers reasoning performance competitive with much larger models. It is particularly well suited for users who need strong analytical capabilities for math, science, coding logic, and multi-step problem solving. Released under the Apache 2.0 license.
Llama 3.3 70B Instruct
Meta · 70B
Meta Llama 3.3 70B Instruct is a 70-billion parameter large language model from Meta, released as part of the Llama 3.3 generation. It is an instruction-tuned model optimized for dialogue and chat use cases, offering strong performance across reasoning, coding, and multilingual tasks. Llama 3.3 70B delivers quality competitive with much larger models while remaining feasible to run on high-end consumer or workstation GPUs with sufficient VRAM. The model uses a grouped-query attention architecture with a 128K token context window and was trained on a massive multilingual corpus. It is released under the Llama 3.3 Community License, making it one of the most capable openly available models for local inference.
Mistral 7B Instruct v0.3
Mistral AI · 7.2B
Mistral 7B Instruct v0.3 is the latest instruction-tuned release of Mistral AI's original 7-billion-parameter model, delivering meaningful improvements in instruction following, function calling, and multilingual support over its predecessors. With an extended 32K-token vocabulary and refined chat capabilities, v0.3 remains one of the most capable sub-10B models available. At 7.2 billion parameters it sits comfortably in the sweet spot for local inference: it needs roughly 14 GB of VRAM at FP16, but runs well on 8 GB cards at 8-bit quantization and even on 4–6 GB cards at 4-bit. It is an excellent default choice for anyone getting started with local LLMs who wants strong conversational performance without heavy hardware.
DeepSeek R1 0528
DeepSeek · 684.5B
DeepSeek R1 0528 is an updated release of the R1 reasoning model, incorporating improvements to training and inference that sharpen its performance on complex multi-step problems. It retains the same 684.5 billion parameter mixture-of-experts architecture as the original R1, with approximately 37 billion parameters active per forward pass. This revision addresses several edge cases where the original R1 struggled, delivering more consistent reasoning chains and fewer hallucinations on difficult math and coding tasks. Hardware requirements remain identical to the original R1, so users already set up to run the first version can swap in the 0528 weights with no changes to their infrastructure.
Llama 3.2 1B
Meta · 1.2B
Meta Llama 3.2 1B is a 1.2-billion parameter base (pretrained) model from Meta's Llama 3.2 release. It is the smallest model in the Llama 3.2 family and is designed for research, fine-tuning, and embedding into resource-constrained environments. It supports a 128K token context window. As a base model, it is not optimized for conversational use without further fine-tuning. Its minimal resource requirements make it suitable for experimentation, edge deployment, and as a starting point for domain-specific fine-tuning. Released under the Llama 3.2 Community License.
Kimi K2 Instruct
Moonshot AI · 1026.5B
Kimi K2 Instruct is Moonshot AI's massive Mixture-of-Experts model, weighing in at over one trillion total parameters. It represents one of the largest open-weight models available, delivering frontier-class performance across reasoning, coding, and multilingual tasks through its sparse MoE architecture that activates only a fraction of its full parameter count per token. Running Kimi K2 locally is an extreme undertaking, requiring professional multi-GPU setups with hundreds of gigabytes of combined VRAM even at aggressive quantization. This model is best suited for research labs, enterprise deployments, or enthusiasts with access to server-grade hardware who want to explore trillion-parameter-scale inference.
Llama 2 7B HF
Meta · 6.7B
Meta Llama 2 7B is a 6.7-billion parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.
Phi 4
Microsoft · 14B
Microsoft Phi 4 is a 14-billion parameter language model from Microsoft Research's Phi series, designed to deliver strong reasoning, mathematical, and coding performance at an efficient size. Phi 4 continues the Phi family's focus on maximizing capability per parameter through high-quality training data curation, achieving benchmark scores that rival much larger models on reasoning and STEM tasks. The model runs well on consumer GPUs with 12–16 GB of VRAM in quantized formats. It excels at mathematical problem solving, code generation, and structured reasoning. Released under the MIT license.
Llama 3.1 8B
Meta · 8B
Meta Llama 3.1 8B is an 8-billion parameter base (pretrained) model from the Llama 3.1 family. It is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Compared to Llama 3 8B, it extends the context window to 128K tokens and benefits from improved training data and methodology. The model uses grouped-query attention and was trained on a multilingual corpus. It is released under the Llama 3.1 Community License and is widely used as a foundation for community fine-tunes and specialized models.
Llama 3.1 Nemotron 70B Instruct HF
NVIDIA · 70B
Llama 3.1 Nemotron 70B Instruct is a 70-billion parameter chat model by NVIDIA, created by applying reinforcement learning from human feedback (RLHF) to Meta's Llama 3.1 70B base model. NVIDIA's Nemotron training pipeline focuses on improving helpfulness, accuracy, and response quality beyond the standard Llama instruction tuning. The model requires substantial VRAM for local inference, typically needing multi-GPU setups or high-end professional GPUs. In quantized formats it becomes accessible on workstation-class hardware. It is available in Hugging Face Transformers format and is supported by popular inference engines.
Llama 3.2 3B Instruct
Meta · 3B
Meta Llama 3.2 3B Instruct is a 3-billion parameter instruction-tuned model from Meta's Llama 3.2 release, designed for efficient local inference on resource-constrained hardware. It supports a 128K token context window and is optimized for conversational AI, summarization, and general assistant tasks. Despite its small footprint, Llama 3.2 3B Instruct delivers competitive performance for its size class and can run on GPUs with as little as 4GB of VRAM when quantized. It is released under the Llama 3.2 Community License and is a practical choice for edge deployment and lightweight local inference.
Qwen2.5 Coder 32B Instruct
Alibaba · 32.8B
Qwen2.5 Coder 32B Instruct is a 32.8-billion parameter code-specialized model from Alibaba Cloud, instruction-tuned for programming assistance and code generation. It is trained on a large corpus of source code alongside natural language data, making it highly capable for tasks such as code completion, debugging, code explanation, and software engineering dialogue. The model supports a 128K token context window and delivers code generation quality competitive with the best open-weight coding models at any scale. It requires a GPU with at least 24GB of VRAM for quantized inference. Released under the Apache 2.0 license.
GLM 4.7
zai-org · 358.3B
GLM 4.7 is an earlier generation of Zhipu AI's GLM foundation model series, featuring a mixture-of-experts architecture with approximately 358 billion total parameters. It delivers strong performance on reasoning, language understanding, and bilingual Chinese-English tasks while being significantly more manageable to run locally than its GLM 5 successor. For users with multi-GPU setups, GLM 4.7 offers a practical balance between capability and hardware requirements within the GLM model family.
Gemma 3 27B IT
Google · 27.4B
Google Gemma 3 27B IT is a 27.4-billion parameter multimodal instruction-tuned model from Google's Gemma 3 family. It supports both text and image inputs, making it one of the most capable openly available vision-language models for local inference. The model handles conversational AI, visual question answering, image description, and complex reasoning tasks across modalities. Gemma 3 27B IT requires a GPU with at least 24GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. It uses a dense Transformer architecture with a large context window and benefits from Google's extensive pretraining pipeline. Released under the Gemma license.
Mistral 7B Instruct v0.1
Mistral AI · 7B
Mistral 7B Instruct v0.1 was the first instruction-tuned variant of the original Mistral 7B, fine-tuned for conversational and instruction-following tasks. While it has since been superseded by v0.2 and v0.3, it remains a solid lightweight chat model and an important milestone in the open-weight model ecosystem. Its hardware requirements are identical to the base Mistral 7B, running smoothly on GPUs with as little as 6 GB of VRAM when quantized. Users seeking the best Mistral 7B experience should generally prefer the newer v0.3 release, but v0.1 is still useful for reproducibility and benchmarking purposes.
GLM 5
zai-org · 753.9B
GLM 5 is Zhipu AI's flagship foundation model, a massive mixture-of-experts architecture with nearly 754 billion total parameters. It represents one of the largest open-weight models available, offering state-of-the-art performance across reasoning, coding, math, and multilingual tasks in both Chinese and English. Running GLM 5 locally requires enterprise-grade multi-GPU infrastructure, but for users with access to such hardware, it provides a locally-hosted alternative to the largest proprietary models.
TinyLlama 1.1B Chat v1.0
TinyLlama · 1.1B
TinyLlama 1.1B Chat is a 1.1-billion parameter chat model built on the Llama 2 architecture and trained on approximately 3 trillion tokens, an unusually large dataset for a model of its size. The TinyLlama project demonstrated that small models can achieve strong performance when given sufficient training compute, making it a standout in the sub-2B parameter class. The Chat variant is fine-tuned for conversational use and runs on virtually any modern GPU, including entry-level cards with 4GB of VRAM or less. It is a practical choice for lightweight local inference, edge deployment, and experimentation where hardware resources are limited.