All LLM Models

Browse 739 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
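The arithmetic behind those figures can be sketched as parameters times bits per weight, divided by 8. This is a weights-only estimate under stated assumptions: the bits-per-weight value used for Q4_K_M is an approximate figure chosen for illustration, the function name `estimate_weight_gb` is hypothetical, and real usage is higher once the KV cache and runtime overhead are included.

```python
# Rough weights-only VRAM estimate: parameters x bits per weight / 8.
# The Q4_K_M value is an assumed approximation for illustration; actual
# effective bits per weight vary by model and quantization build.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,    # 8-bit weights plus per-block scale
    "Q4_K_M": 4.8,  # assumed approximate effective bits
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Treat the result as a lower bound when checking whether a model fits on a given GPU, since longer context windows add memory on top of the weights.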

Model List

Falcon 7B

TII UAE · 7.2B · runs from 3.4 GB

378.1K 1.1K

Falcon 7B was one of the first truly competitive open-source large language models, released in mid-2023 by the Technology Innovation Institute in Abu Dhabi. Trained on the massive RefinedWeb dataset, it demonstrated that carefully curated web data could rival models trained on more traditionally assembled corpora. At 7 billion parameters, Falcon 7B helped establish the 7B class as the sweet spot for local inference, offering genuine language understanding on consumer GPUs with as little as 6 GB of VRAM.

Chat

Qwen1.5 14B

Alibaba · 14.2B · runs from 8 GB

9.9K 41

Qwen1.5 14B is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen 14B

Alibaba · 14.2B · runs from 6.6 GB

1.8K 214

Qwen 14B is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DistilGPT2

distilbert · 88M · runs from under 0.1 GB

2.3M 618

DistilGPT-2 is a distilled version of OpenAI's GPT-2 model, compressed to just 88 million parameters while retaining much of the original model's text generation ability. Created using knowledge distillation techniques, it offers significantly faster inference than the full GPT-2 with only a modest reduction in output quality. This model is one of the lightest autoregressive language models available and can run on virtually any hardware, including CPUs. It is a practical choice for educational projects, quick prototyping, and applications where inference speed and minimal resource usage are more important than state-of-the-art generation quality.

Chat

Qwen1.5 32B

Alibaba · 32.5B · runs from 14.3 GB

9.5K 85

Qwen1.5 32B is a 32.5B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

QwQ 32B Preview

Alibaba · 32.8B · runs from 14.8 GB

20.8K 1.7K

QwQ 32B Preview is a 32.8B-parameter open language model from Alibaba in the QwQ family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Qwen 1.8B

Alibaba · 1.8B · runs from 0.9 GB

1.7K 73

Qwen 1.8B is a 1.8B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Baichuan2 13B Base

baichuan-inc · 13B · runs from 6.1 GB

1.6K 82

Baichuan2 13B Base is a 13B-parameter open language model from baichuan-inc in the Baichuan family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Baichuan 13B Base

baichuan-inc · 13B · runs from 6.1 GB

990 187

Baichuan 13B Base is a 13B-parameter open language model from baichuan-inc in the Baichuan family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OPT 125M

Meta · 125M · runs from 0.3 GB

10.5M 266

Meta OPT 125M is a 125-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project. Released in 2022, it was part of Meta's effort to provide the research community with openly available large language models that replicate the performance of GPT-3 class models at various scales. As one of the smallest models in the OPT family, the 125M variant is primarily useful for research, experimentation, and educational purposes. It can run on virtually any hardware, including CPU-only setups. While significantly less capable than modern models, it remains a useful reference point in LLM research.

Chat

Qwen3.5 4B

Alibaba · 4.7B · runs from 2.5 GB

9.0M 632

Qwen3.5 4B is a 4.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Qwen3.5 9B

Alibaba · 9.7B · runs from 4.7 GB

8.5M 1.6K

Qwen3.5 9B is a 9.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Tiny Qwen2ForCausalLM 2.5

trl-internal-testing · 2M · runs from 0.3 GB

5.5M 7

Tiny Qwen2ForCausalLM 2.5 is a 2M-parameter open language model from trl-internal-testing in the Qwen 2 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Pythia 160M

EleutherAI · 160M · runs from 0.1 GB

2.6M 39

Pythia 160M is part of EleutherAI's Pythia training suite, a collection of models trained on the same data in the same order at multiple scales to enable rigorous scientific research into how language models learn. At 160 million parameters, it is one of the smallest models in the suite and runs on virtually any hardware. This model is primarily valuable for researchers studying scaling laws, training dynamics, and emergent capabilities across model sizes. EleutherAI released full training checkpoints, data, and code, making Pythia 160M one of the most transparent and reproducible models available for academic study.

Chat

Qwen3.5 0.8B

Alibaba · 873M · runs from 0.7 GB

2.4M 570

Qwen3.5 0.8B is an 873M-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Meta Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

1.3M 2.3K

Meta Llama 3.1 8B is an 8.0B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 1.5B Quantized.w8a8

RedHatAI · 1.8B · runs from 1.1 GB

1.3M 4

Qwen2.5 1.5B Quantized.w8a8 is a 1.8B-parameter open language model from RedHatAI in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.2 3B

Meta · 3.2B · runs from 1.5 GB

817.3K 814

Meta Llama 3.2 3B is a 3.2-billion parameter base (pretrained) model from Meta's Llama 3.2 family. It supports a 128K token context window and is intended for fine-tuning, research, and custom applications rather than direct conversational use. The model provides a good balance between capability and efficiency at the small model scale. It is popular as a foundation for community fine-tunes and domain-specific adaptations. Released under the Llama 3.2 Community License.

Chat

Llama 2 7B HF

Meta · 6.7B · runs from 3.1 GB

713.5K 2.3K

Meta Llama 2 7B is a 6.7-billion parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.

Chat

ChatGLM2 6B

zai-org · 6B · runs from 2.8 GB

434.3K 2.1K

ChatGLM2 6B is a 6B-parameter open language model from zai-org in the GLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DeepSeek V2 Lite

DeepSeek · 15.7B · runs from 7.4 GB

339.9K 178

DeepSeek V2 Lite is a compact mixture-of-experts model with 15.7 billion total parameters, designed to deliver a strong quality-to-compute ratio for general chat and instruction following. It uses the same innovative MLA (Multi-Head Latent Attention) architecture as the larger V2, which reduces memory requirements during inference. With its modest parameter count, V2 Lite runs comfortably on a single consumer GPU, making it accessible to users who want to try DeepSeek's MoE approach without needing specialized hardware. It handles everyday conversational tasks, summarization, and light analysis well, offering a practical entry point into the DeepSeek model family.

Chat

GPT-J 6B

EleutherAI · 6B · runs from 2.8 GB

242.5K 1.5K

GPT-J 6B is a 6B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenAI GPT

OpenAI · 120M · runs from 0.1 GB

231.9K 288

OpenAI GPT is the original 2018 transformer-based language model that started the GPT lineage, based on the paper "Improving Language Understanding by Generative Pre-Training." At just 120 million parameters, it is a historically significant model that demonstrated the power of unsupervised pretraining followed by supervised fine-tuning. This model is primarily of academic and historical interest today. It runs on essentially any hardware and can be useful for educational exploration of transformer architectures, but it should not be compared to modern instruction-tuned models in terms of practical capability.

Chat

Gemma 2 2B

Google · 2.6B · runs from 1.2 GB

206.8K 655

Google Gemma 2 2B is a 2.6-billion parameter base (pretrained) model from Google's Gemma 2 family. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Its compact size makes it suitable for experimentation, rapid prototyping, and domain-specific fine-tuning on consumer hardware with minimal VRAM. Released under the Gemma license.

Chat

Gemma 4 12B

Google · 12.0B · runs from 6.1 GB

198.3K 525

Gemma 4 12B is a 12.0B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 MoE A2.7B

Alibaba · 14.3B · runs from 6.8 GB

181.8K 225

Qwen1.5 MoE A2.7B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 1.5 generation, with 14.3 billion total parameters but only 2.7 billion active parameters per forward pass. The MoE architecture allows it to deliver performance closer to dense 7B models while requiring less compute during inference, because only a subset of the experts is activated for each token. The model supports a 32K token context window. Loading it requires VRAM proportional to its total parameter count, even though the compute cost per token tracks the smaller active count. It is an interesting architectural variant for users exploring efficient inference and MoE models locally. Released under a custom Qwen license.

Chat
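The Qwen1.5 MoE A2.7B entry above separates memory cost (total parameters) from compute cost (active parameters). A minimal sketch of that distinction follows; the `MoEModel` class is hypothetical, the dense 7B comparison model is made up for illustration, and the figure of roughly 2 FLOPs per active parameter per token is a common rule of thumb rather than a measured value.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    """Hypothetical helper separating memory cost from compute cost."""
    name: str
    total_params_b: float   # billions of parameters that must be loaded
    active_params_b: float  # billions of parameters used per token

    def weight_gb(self, bits_per_weight: float) -> float:
        # Memory scales with TOTAL parameters: every expert must be resident.
        return self.total_params_b * bits_per_weight / 8

    def flops_per_token(self) -> float:
        # Rule of thumb (assumption): ~2 FLOPs per ACTIVE parameter per token.
        return 2 * self.active_params_b * 1e9


# Parameter counts for the MoE model are taken from the listing above.
qwen_moe = MoEModel("Qwen1.5 MoE A2.7B", total_params_b=14.3, active_params_b=2.7)
# Illustrative dense model: all parameters are active for every token.
dense_7b = MoEModel("dense 7B (illustrative)", total_params_b=7.0, active_params_b=7.0)

if __name__ == "__main__":
    for m in (qwen_moe, dense_7b):
        print(f"{m.name}: ~{m.weight_gb(16):.1f} GB at FP16, "
              f"~{m.flops_per_token() / 1e9:.1f} GFLOPs per token")
```

Under these assumptions the MoE model needs about twice the memory of the dense 7B model but less than half the compute per token, which is why VRAM sizing should use the total parameter count.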

MiMo 7B Base

XiaomiMiMo · 7.8B · runs from 3.9 GB

162.1K 134

MiMo 7B Base is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OPT 350M

Meta · 350M · runs from 0.8 GB

156.3K 149

Meta OPT 350M is a 350-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project, released in 2022 as part of a suite of models ranging from 125M to 175B parameters. It was designed to provide researchers with open access to models comparable to GPT-3 at various scales. The 350M variant runs on minimal hardware and is suitable for research, prototyping, and educational use. While it has been surpassed by modern architectures in terms of capability, it remains a lightweight option for basic text generation experiments and as a benchmark baseline.

Chat

Wildguard

Allen AI · 7.2B · runs from 3.4 GB

154.1K 49

Wildguard is a 7.2B-parameter open language model from Allen AI. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 7B

huggyllama · 6.7B · runs from 3.1 GB

152.1K 356

This is a community reupload of Meta's original Llama 1 7B model, published by the huggyllama account on Hugging Face. The original Llama 1 was a 6.7-billion parameter base model released by Meta in early 2023, trained on 1 trillion tokens of publicly available data. It pioneered the wave of open-weight large language models. As a first-generation Llama model, it has been superseded by Llama 2 and Llama 3 in terms of quality and capability. It remains of historical and research interest as the model that catalyzed the open-source LLM ecosystem. This upload provides convenient access in Hugging Face Transformers format.

Chat