All LLM Models

Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
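
The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough, weights-only estimate: the function name and the bits-per-weight table are illustrative assumptions rather than values taken from this page, and the Q4_K_M figure is an approximate average. KV cache and runtime overhead are not included, so real usage will be somewhat higher.

```python
# Approximate bits stored per weight for a few formats.
# ASSUMPTION: the Q4_K_M value is a rough average used for illustration only.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Reproduce the 7B example from the paragraph above.
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7.0, quant):.1f} GB")
```

Under these assumptions a 7B model comes out at 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the example above.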

Model List

Qwen3.5 4B

Alibaba · 4.7B · runs from 2.5 GB

9.0M 632

Qwen3.5 4B is a 4.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Qwen3.5 9B

Alibaba · 9.7B · runs from 4.7 GB

8.5M 1.6K

Qwen3.5 9B is a 9.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Tiny Qwen2ForCausalLM 2.5

trl-internal-testing · 2M · runs from 0.3 GB

5.5M 7

Tiny Qwen2ForCausalLM 2.5 is a 2M-parameter open language model from trl-internal-testing in the Qwen 2 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Pythia 160M

EleutherAI · 160M · runs from 0.1 GB

2.6M 39

Pythia 160M is part of EleutherAI's Pythia training suite, a collection of models trained on the same data in the same order at multiple scales to enable rigorous scientific research into how language models learn. At 160 million parameters, it is one of the smallest models in the suite and runs on virtually any hardware. This model is primarily valuable for researchers studying scaling laws, training dynamics, and emergent capabilities across model sizes. EleutherAI released full training checkpoints, data, and code, making Pythia 160M one of the most transparent and reproducible models available for academic study.

Chat

Qwen3.5 0.8B

Alibaba · 873M · runs from 0.7 GB

2.4M 570

Qwen3.5 0.8B is an 873M-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Meta Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

1.3M 2.3K

Meta Llama 3.1 8B is an 8.0B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 1.5B Quantized.w8a8

RedHatAI · 1.8B · runs from 1.1 GB

1.3M 4

Qwen2.5 1.5B Quantized.w8a8 is a 1.8B-parameter open language model from RedHatAI in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.2 3B

Meta · 3.2B · runs from 1.5 GB

817.3K 814

Meta Llama 3.2 3B is a 3.2-billion parameter base (pretrained) model from Meta's Llama 3.2 family. It supports a 128K token context window and is intended for fine-tuning, research, and custom applications rather than direct conversational use. The model provides a good balance between capability and efficiency at the small model scale. It is popular as a foundation for community fine-tunes and domain-specific adaptations. Released under the Llama 3.2 Community License.

Chat

Llama 2 7B HF

Meta · 6.7B · runs from 3.1 GB

713.5K 2.3K

Meta Llama 2 7B is a 6.7-billion parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.

Chat

ChatGLM2 6B

zai-org · 6B · runs from 2.8 GB

434.3K 2.1K

ChatGLM2 6B is a 6B-parameter open language model from zai-org in the GLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DeepSeek V2 Lite

DeepSeek · 15.7B · runs from 7.4 GB

339.9K 178

DeepSeek V2 Lite is a compact mixture-of-experts model with 15.7 billion total parameters, designed to deliver a strong quality-to-compute ratio for general chat and instruction following. It uses the same innovative MLA (Multi-Head Latent Attention) architecture as the larger V2, which reduces memory requirements during inference. With its modest parameter count, V2 Lite runs comfortably on a single consumer GPU, making it accessible to users who want to try DeepSeek's MoE approach without needing specialized hardware. It handles everyday conversational tasks, summarization, and light analysis well, offering a practical entry point into the DeepSeek model family.

Chat

GPT-J 6B

EleutherAI · 6B · runs from 2.8 GB

242.5K 1.5K

GPT-J 6B is a 6B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenAI GPT

OpenAI · 120M · runs from 0.1 GB

231.9K 288

OpenAI GPT is the original 2018 transformer-based language model that started the GPT lineage, based on the paper "Improving Language Understanding by Generative Pre-Training." At just 120 million parameters, it is a historically significant model that demonstrated the power of unsupervised pretraining followed by supervised fine-tuning. This model is primarily of academic and historical interest today. It runs on essentially any hardware and can be useful for educational exploration of transformer architectures, but it should not be compared to modern instruction-tuned models in terms of practical capability.

Chat

Llama 3.1 405B

Meta · 405.9B · runs from 189.7 GB

224.6K 977

Meta Llama 3.1 405B is the largest model in the Llama 3.1 family, with 405 billion parameters. It is the most capable open-weight model of that generation, delivering performance competitive with leading proprietary models across reasoning, coding, math, and multilingual tasks. It features a 128K token context window. Due to its massive size, running Llama 3.1 405B locally requires significant hardware, typically multiple high-end professional GPUs with a combined VRAM of 200 GB or more at reduced precision. It is primarily used in quantized formats for local inference or via multi-node setups. Released under the Llama 3.1 Community License.
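
As a rough illustration of what "multiple GPUs" means in practice, the sketch below divides a memory requirement by the VRAM of a single card and rounds up. The function name and the per-GPU sizes (80, 48, and 24 GB) are assumptions chosen for illustration; the 189.7 GB input is the "runs from" figure listed for this model. The result is a lower bound only, since it ignores KV cache, activations, and multi-GPU overhead.

```python
import math


def min_gpus(required_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest number of identical GPUs whose combined VRAM covers required_gb."""
    if required_gb <= 0 or vram_per_gpu_gb <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(required_gb / vram_per_gpu_gb)


if __name__ == "__main__":
    required = 189.7  # GB, the "runs from" figure listed for Llama 3.1 405B
    # ASSUMPTION: example per-GPU VRAM sizes, not taken from this page.
    for vram in (80.0, 48.0, 24.0):
        print(f"{vram:.0f} GB cards: at least {min_gpus(required, vram)}")
```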

Chat

Gemma 2 2B

Google · 2.6B · runs from 1.2 GB

206.8K 655

Google Gemma 2 2B is a 2.6-billion parameter base (pretrained) model from Google's Gemma 2 family. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Its compact size makes it suitable for experimentation, rapid prototyping, and domain-specific fine-tuning on consumer hardware with minimal VRAM. Released under the Gemma license.

Chat

Gemma 4 12B

Google · 12.0B · runs from 6.1 GB

198.3K 525

Gemma 4 12B is a 12.0B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.1 405B Instruct

Meta · 405.9B · runs from 189.7 GB

197.6K 595

Llama 3.1 405B Instruct is a 405.9B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Meta Llama 3 70B

Meta · 70.6B · runs from 33.0 GB

183.1K 878

Meta Llama 3 70B is a 70.6B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 MoE A2.7B

Alibaba · 14.3B · runs from 6.8 GB

181.8K 225

Qwen1.5 MoE A2.7B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 1.5 generation, with 14.3 billion total parameters but only 2.7 billion active parameters per forward pass. The MoE architecture allows it to deliver performance close to that of dense 7B models while requiring less compute during inference, because only a subset of experts is activated for each token. The model supports a 32K token context window. Loading it requires VRAM proportional to its total parameter count, even though the compute cost per token is lower. It is an interesting architectural variant for users exploring efficient inference and MoE models locally. Released under a custom Qwen license.
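
The split between memory cost and compute cost can be made concrete with a small sketch. The class and method names are hypothetical, the 2 bytes per parameter corresponds to FP16 weights, and the "about 2 FLOPs per active parameter per token" figure is a common rule of thumb assumed here, not a measurement of this model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    total_params_b: float   # all experts, in billions; must be resident in memory
    active_params_b: float  # parameters actually used per token, in billions

    def weight_gb(self, bytes_per_param: float = 2.0) -> float:
        """Weights-only memory in GB; scales with TOTAL parameters."""
        return self.total_params_b * bytes_per_param

    def gflops_per_token(self) -> float:
        """Approximate forward-pass cost; scales with ACTIVE parameters.

        ASSUMPTION: roughly 2 FLOPs per active parameter per token.
        """
        return 2.0 * self.active_params_b


if __name__ == "__main__":
    moe = MoEModel(total_params_b=14.3, active_params_b=2.7)
    print(f"FP16 weights: ~{moe.weight_gb():.1f} GB")
    print(f"Compute per token: ~{moe.gflops_per_token():.1f} GFLOPs")
    print(f"Active fraction: {moe.active_params_b / moe.total_params_b:.0%}")
```

Under these assumptions the model occupies memory like a 14.3B model while its per-token compute resembles that of a 2.7B dense model.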

Chat

DeepSeek V3.2 Exp

DeepSeek · 685.4B · runs from 295.2 GB

174.1K 992

DeepSeek V3.2 Exp is a 685.4B-parameter open language model from DeepSeek in the DeepSeek V3 family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

MiMo 7B Base

XiaomiMiMo · 7.8B · runs from 3.9 GB

162.1K 134

MiMo 7B Base is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OPT 350M

Meta · 350M · runs from 0.8 GB

156.3K 149

Meta OPT 350M is a 350-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project, released in 2022 as part of a suite of models ranging from 125M to 175B parameters. It was designed to provide researchers with open access to models comparable to GPT-3 at various scales. The 350M variant runs on minimal hardware and is suitable for research, prototyping, and educational use. While it has been surpassed by modern architectures in terms of capability, it remains a lightweight option for basic text generation experiments and as a benchmark baseline.

Chat

Wildguard

Allen AI · 7.2B · runs from 3.4 GB

154.1K 49

Wildguard is a 7.2B-parameter open language model from Allen AI. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 7B

huggyllama · 6.7B · runs from 3.1 GB

152.1K 356

This is a community reupload of Meta's original Llama 1 7B model, published by the huggyllama account on Hugging Face. The original Llama 1 was a 6.7-billion parameter base model released by Meta in early 2023, trained on 1 trillion tokens of publicly available data. It pioneered the wave of open-weight large language models. As a first-generation Llama model, it has been superseded by Llama 2 and Llama 3 in terms of quality and capability. It remains of historical and research interest as the model that catalyzed the open-source LLM ecosystem. This upload provides convenient access in Hugging Face Transformers format.

Chat

Mistral Small Instruct 2409

Mistral AI · 22.2B · runs from 10.2 GB

127.5K 393

Mistral Small Instruct 2409 is a 22.2B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

TinyStories 1M

roneneldan · 1M · runs from 0.0 GB

113.3K 67

TinyStories 1M is a 1M-parameter open language model from roneneldan. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 2 13B Chat HF

Meta · 13.0B · runs from 6.1 GB

108.6K 1.1K

Meta Llama 2 13B Chat is a 13-billion parameter instruction-tuned model from Meta's Llama 2 family, fine-tuned for dialogue and chat applications. It offers improved reasoning and generation quality over the 7B variant while maintaining manageable hardware requirements with a 4K token context window. The model was fine-tuned using supervised fine-tuning and RLHF. It can run on consumer GPUs with 16GB or more of VRAM at reduced precision. Released under the Llama 2 Community License.

Chat

Qwen2 1.5B

Alibaba · 1.5B · runs from 1.0 GB

108.4K 100

Qwen2 1.5B is a 1.5-billion parameter base (pretrained) model from Alibaba Cloud's older Qwen 2 generation. It was trained on a multilingual corpus and supports a context window of up to 32K tokens. As a base model, it is designed for fine-tuning and research rather than direct conversational use. While superseded by the Qwen 2.5 series in terms of training data quality and benchmark performance, Qwen2 1.5B remains a lightweight option for experimentation and as a baseline for comparison. Released under the Apache 2.0 license.

Chat

MiMo 7B RL

XiaomiMiMo · 7.8B · runs from 3.9 GB

102.5K 276

MiMo 7B RL is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Baichuan 7B

baichuan-inc · 7B · runs from 15.4 GB

98.8K 841

Baichuan 7B is a 7B-parameter open language model from baichuan-inc in the Baichuan family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat