All LLM Models

Browse 36 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Featured only

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.

Llama 3.2 1B Instruct

Meta · 1.2B · runs from 0.4 GB

Meta Llama 3.2 1B Instruct is a 1-billion parameter instruction-tuned model from Meta, the smallest in the Llama 3.2 family. It is designed for ultra-lightweight deployment scenarios where minimal hardware resources are available, supporting a 128K token context window despite its compact size. This model is suitable for basic conversational tasks, text summarization, and simple instruction following. It can run on virtually any modern GPU and even on CPU-only setups with acceptable performance. Released under the Llama 3.2 Community License.

Llama 3.1 8B Instruct

Meta · 8.0B · runs from 3.6 GB

Meta Llama 3.1 8B Instruct is an 8-billion parameter instruction-tuned language model from Meta. Part of the Llama 3.1 release, it supports a 128K token context window and is fine-tuned for conversational use, tool calling, and general assistant tasks. Its compact size makes it well-suited for local deployment on modern consumer GPUs with 8GB or more of VRAM. Llama 3.1 8B Instruct delivers strong performance for its parameter class across benchmarks in reasoning, coding, and multilingual understanding. It is released under the Llama 3.1 Community License and is widely supported by inference frameworks such as llama.cpp, vLLM, and Ollama.

Llama 3.2 3B Instruct

Meta · 3.2B · runs from 1.0 GB

Meta Llama 3.2 3B Instruct is a 3-billion parameter instruction-tuned model from Meta's Llama 3.2 release, designed for efficient local inference on resource-constrained hardware. It supports a 128K token context window and is optimized for conversational AI, summarization, and general assistant tasks. Despite its small footprint, Llama 3.2 3B Instruct delivers competitive performance for its size class and can run on GPUs with as little as 4GB of VRAM when quantized. It is released under the Llama 3.2 Community License and is a practical choice for edge deployment and lightweight local inference.

Meta Llama 3.1 8B Instruct

Meta · 8.0B · runs from 2.4 GB

Meta Llama 3.1 8B Instruct is a 8.0B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3 8B Instruct

Meta · 8.0B · runs from 2.6 GB

Meta Llama 3 8B Instruct is the instruction-tuned version of Meta's Llama 3 8B base model, with 8 billion parameters. It is fine-tuned for dialogue and chat use cases using supervised fine-tuning and RLHF, making it ready for conversational applications out of the box. The model supports an 8K token context window and performs well across coding, reasoning, and general knowledge tasks. Its efficient size makes it one of the most popular models for local inference on consumer hardware. Released under the Meta Llama 3 Community License.

Llama 3.3 70B Instruct

Meta · 70.6B · runs from 21.3 GB

Meta Llama 3.3 70B Instruct is a 70-billion parameter large language model from Meta, released as part of the Llama 3.3 generation. It is an instruction-tuned model optimized for dialogue and chat use cases, offering strong performance across reasoning, coding, and multilingual tasks. Llama 3.3 70B delivers quality competitive with much larger models while remaining feasible to run on high-end consumer or workstation GPUs with sufficient VRAM. The model uses a grouped-query attention architecture with a 128K token context window and was trained on a massive multilingual corpus. It is released under the Llama 3.3 Community License, making it one of the most capable openly available models for local inference.

Meta Llama 3.1 70B Instruct

Meta · 70.6B · runs from 21.3 GB

Meta Llama 3.1 70B Instruct is a 70.6B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3.1 405B Instruct

Meta · 405.9B · runs from 189.7 GB

Meta Llama 3.1 405B Instruct is a 405.9B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 4 Scout 17B 16E Instruct

Meta · 108.6B · runs from 32.9 GB

Llama 4 Scout 17B 16E Instruct is a 108.6B-parameter open language model from Meta in the Llama 4 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.2 11B Vision Instruct

Meta · 10.7B · runs from 5.0 GB

Llama 3.2 11B Vision Instruct is a 10.7B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 4 Maverick 17B 128E Instruct

Meta · 401.6B · runs from 121.5 GB

Llama 4 Maverick 17B 128E Instruct is a 401.6B-parameter open language model from Meta in the Llama 4 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

Meta Llama 3.1 8B is an 8-billion parameter base (pretrained) model from the Llama 3.1 family. It is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Compared to Llama 3 8B, it extends the context window to 128K tokens and benefits from improved training data and methodology. The model uses grouped-query attention and was trained on a multilingual corpus. It is released under the Llama 3.1 Community License and is widely used as a foundation for community fine-tunes and specialized models.

Meta Llama 3 70B Instruct

Meta · 70.6B · runs from 23.3 GB

Meta Llama 3 70B Instruct is a 70.6-billion parameter instruction-tuned model from Meta's Llama 3 release. It is fine-tuned for dialogue, coding assistance, and complex reasoning tasks using supervised fine-tuning and RLHF. At the time of release, it was among the most capable openly available models. The model supports an 8K token context window and requires substantial VRAM for local inference, typically needing multi-GPU setups or high-VRAM professional GPUs. It has been widely adopted for local deployment in quantized formats. Released under the Meta Llama 3 Community License.

Llama 3.2 1B

Meta · 1.2B · runs from 0.6 GB

Meta Llama 3.2 1B is a 1.2-billion parameter base (pretrained) model from Meta's Llama 3.2 release. It is the smallest model in the Llama 3.2 family and is designed for research, fine-tuning, and embedding into resource-constrained environments. It supports a 128K token context window. As a base model, it is not optimized for conversational use without further fine-tuning. Its minimal resource requirements make it suitable for experimentation, edge deployment, and as a starting point for domain-specific fine-tuning. Released under the Llama 3.2 Community License.

Llama Guard 3 8B

Meta · 8.0B · runs from 2.4 GB

Meta Llama Guard 3 8B is an 8-billion parameter safety classifier model built on the Llama 3.1 architecture. Unlike general-purpose chat models, Llama Guard is specifically designed to classify whether prompts or responses contain unsafe content across categories such as violence, sexual content, criminal planning, and other policy violations. The model is intended to be used as a moderation layer in LLM-based applications, providing input and output safety filtering. It follows a taxonomy-based classification approach and can be customized for different safety policies. Released under the Llama 3.1 Community License.

Meta Llama 3 8B

Meta · 8.0B · runs from 3.8 GB

Meta Llama 3 8B is an 8-billion parameter base (pretrained) language model from Meta's Llama 3 release. As a base model, it is not fine-tuned for chat or instructions and is intended for further fine-tuning, research, or as a foundation for custom applications. It uses grouped-query attention and was trained on over 15 trillion tokens. Llama 3 8B supports an 8K token context window and delivers strong benchmark performance across language understanding, reasoning, and coding tasks for its size. It is released under the Meta Llama 3 Community License and runs efficiently on consumer GPUs with 8GB or more of VRAM.

Llama 2 7B Chat HF

Meta · 6.7B · runs from 3.1 GB

Meta Llama 2 7B Chat is a 7-billion parameter instruction-tuned model from Meta's Llama 2 family, optimized for dialogue use cases. It was fine-tuned using supervised fine-tuning and RLHF on top of the Llama 2 7B base model, with a 4K token context window. This model is suitable for basic conversational AI tasks and runs efficiently on consumer GPUs. While newer Llama generations offer improved performance, Llama 2 7B Chat remains a well-understood and widely-supported option for local inference. Released under the Llama 2 Community License.

Llama 3.1 70B Instruct

Meta · 70.6B · runs from 33.0 GB

Meta Llama 3.1 70B Instruct is a 70.6-billion parameter instruction-tuned model from Meta's Llama 3.1 family. It features a 128K token context window and is optimized for chat, tool use, and complex reasoning tasks. The 70B size offers a strong balance between capability and hardware requirements, running well on multi-GPU setups or high-VRAM workstation cards. This model was trained on over 15 trillion tokens and fine-tuned with reinforcement learning from human feedback (RLHF). It excels at coding assistance, mathematical reasoning, and multilingual dialogue. Released under the Llama 3.1 Community License.

Opt 125M

Meta · 125M · runs from 0.3 GB

Meta OPT 125M is a 125-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project. Released in 2022, it was part of Meta's effort to provide the research community with openly available large language models that replicate the performance of GPT-3 class models at various scales. As one of the smallest models in the OPT family, the 125M variant is primarily useful for research, experimentation, and educational purposes. It can run on virtually any hardware, including CPU-only setups. While significantly less capable than modern models, it remains a useful reference point in LLM research.

Meta Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

Meta Llama 3.1 8B is a 8.0B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.2 3B

Meta · 3.2B · runs from 1.5 GB

Meta Llama 3.2 3B is a 3.2-billion parameter base (pretrained) model from Meta's Llama 3.2 family. It supports a 128K token context window and is intended for fine-tuning, research, and custom applications rather than direct conversational use. The model provides a good balance between capability and efficiency at the small model scale. It is popular as a foundation for community fine-tunes and domain-specific adaptations. Released under the Llama 3.2 Community License.

Llama 2 7B HF

Meta · 6.7B · runs from 3.1 GB

Meta Llama 2 7B is a 6.7-billion parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.

Llama 3.1 405B

Meta · 405.9B · runs from 189.7 GB

Meta Llama 3.1 405B is the largest model in the Llama family with 405 billion parameters. It represents Meta's most capable open-weight model, delivering performance competitive with leading proprietary models across reasoning, coding, math, and multilingual tasks. It features a 128K token context window. Due to its massive size, running Llama 3.1 405B locally requires significant hardware, typically multiple high-end professional GPUs with a combined VRAM of 200GB or more at reduced precision. It is primarily used in quantized formats for local inference or via multi-node setups. Released under the Llama 3.1 Community License.

Llama 3.1 405B Instruct

Meta · 405.9B · runs from 189.7 GB

Llama 3.1 405B Instruct is a 405.9B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3 70B

Meta · 70.6B · runs from 33.0 GB

Meta Llama 3 70B is a 70.6B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Opt 350M

Meta · 350M · runs from 0.8 GB

Meta OPT 350M is a 350-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project, released in 2022 as part of a suite of models ranging from 125M to 175B parameters. It was designed to provide researchers with open access to models comparable to GPT-3 at various scales. The 350M variant runs on minimal hardware and is suitable for research, prototyping, and educational use. While it has been surpassed by modern architectures in terms of capability, it remains a lightweight option for basic text generation experiments and as a benchmark baseline.

Llama 2 13B Chat HF

Meta · 13.0B · runs from 6.1 GB

Meta Llama 2 13B Chat is a 13-billion parameter instruction-tuned model from Meta's Llama 2 family, fine-tuned for dialogue and chat applications. It offers improved reasoning and generation quality over the 7B variant while maintaining manageable hardware requirements with a 4K token context window. The model was fine-tuned using supervised fine-tuning and RLHF. It can run on consumer GPUs with 16GB or more of VRAM at reduced precision. Released under the Llama 2 Community License.

Llama 3.1 70B

Meta · 70.6B · runs from 33.0 GB

Meta Llama 3.1 70B is a 70.6-billion parameter base (pretrained) model from the Llama 3.1 family. It supports a 128K token context window and was trained on a massive multilingual corpus. As a base model, it is designed for fine-tuning and research rather than direct conversational use. The model serves as the foundation for the Llama 3.1 70B Instruct variant and numerous community fine-tunes. It delivers strong performance across language understanding and generation benchmarks. Released under the Llama 3.1 Community License.

Llama 2 70B Chat HF

Meta · 69.0B · runs from 32.3 GB

Llama 2 70B Chat HF is a 69.0B-parameter open language model from Meta in the Llama 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama Guard 3 1B

Meta · 1.5B · runs from 3.3 GB

Llama Guard 3 1B is a 1.5B-parameter open language model from Meta in the Llama family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.