All LLM Models

Browse 739 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
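The arithmetic behind those figures can be sketched roughly as parameters times bits per weight, divided by eight. The snippet below is only an illustration: the function names are hypothetical, the bits-per-weight values for the quantized formats are approximate assumptions (real files vary by model and tool), and it counts weights only, ignoring the KV cache and runtime overhead.

```python
# Rough weights-only memory estimate. Does not include KV cache,
# activations, or framework overhead, so real usage will be higher.

# Assumed average bits per weight. FP16 is exact; the quantized
# values are approximations chosen for illustration.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
}


def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def estimate_table(params_billions: float) -> dict:
    """Map each assumed format to its rounded weights-only estimate."""
    return {
        name: round(weight_memory_gb(params_billions, bits), 1)
        for name, bits in BITS_PER_WEIGHT.items()
    }


if __name__ == "__main__":
    for name, gb in estimate_table(7.0).items():
        print(f"{name}: ~{gb} GB")
```

Under these assumptions a 7B model comes out at about 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the figures quoted above.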

Falcon 7B

TII UAE · 7.2B · runs from 3.4 GB

Falcon 7B was one of the first truly competitive open-source large language models, released in mid-2023 by the Technology Innovation Institute in Abu Dhabi. Trained on the massive RefinedWeb dataset, it demonstrated that carefully curated web data could rival models trained on more traditionally assembled corpora. At 7 billion parameters, Falcon 7B helped establish the 7B class as the sweet spot for local inference, offering genuine language understanding on consumer GPUs with as little as 6 GB of VRAM.

Qwen1.5 14B

Alibaba · 14.2B · runs from 8 GB

Qwen1.5 14B is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen 14B

Alibaba · 14.2B · runs from 6.6 GB

Qwen 14B is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DistilGPT-2

distilbert · 88M · runs from 0.0 GB

DistilGPT-2 is a distilled version of OpenAI's GPT-2 model, compressed to just 88 million parameters while retaining much of the original model's text generation ability. Created using knowledge distillation techniques, it offers significantly faster inference than the full GPT-2 with only a modest reduction in output quality. This model is one of the lightest autoregressive language models available and can run on virtually any hardware, including CPUs. It is a practical choice for educational projects, quick prototyping, and applications where inference speed and minimal resource usage are more important than state-of-the-art generation quality.

Qwen1.5 32B

Alibaba · 32.5B · runs from 14.3 GB

Qwen1.5 32B is a 32.5B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

QwQ 32B Preview

Alibaba · 32.8B · runs from 14.8 GB

QwQ 32B Preview is a 32.8B-parameter open language model from Alibaba in the QwQ family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen 1.8B

Alibaba · 1.8B · runs from 0.9 GB

Qwen 1.8B is a 1.8B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Baichuan2 13B Base

baichuan-inc · 13B · runs from 6.1 GB

Baichuan2 13B Base is a 13B-parameter open language model from baichuan-inc in the Baichuan family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Baichuan 13B Base

baichuan-inc · 13B · runs from 6.1 GB

Baichuan 13B Base is a 13B-parameter open language model from baichuan-inc in the Baichuan family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OPT 125M

Meta · 125M · runs from 0.3 GB

Meta OPT 125M is a 125-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project. Released in 2022, it was part of Meta's effort to provide the research community with openly available large language models that replicate the performance of GPT-3 class models at various scales. As one of the smallest models in the OPT family, the 125M variant is primarily useful for research, experimentation, and educational purposes. It can run on virtually any hardware, including CPU-only setups. While significantly less capable than modern models, it remains a useful reference point in LLM research.

Qwen3.5 4B

Alibaba · 4.7B · runs from 2.5 GB

Qwen3.5 4B is a 4.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3.5 9B

Alibaba · 9.7B · runs from 4.7 GB

Qwen3.5 9B is a 9.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Tiny Qwen2ForCausalLM 2.5

trl-internal-testing · 2M · runs from 0.3 GB

Tiny Qwen2ForCausalLM 2.5 is a 2M-parameter open language model from trl-internal-testing in the Qwen 2 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Pythia 160M

EleutherAI · 160M · runs from 0.1 GB

Pythia 160M is part of EleutherAI's Pythia training suite, a collection of models trained on the same data in the same order at multiple scales to enable rigorous scientific research into how language models learn. At 160 million parameters, it is one of the smallest models in the suite and runs on virtually any hardware. This model is primarily valuable for researchers studying scaling laws, training dynamics, and emergent capabilities across model sizes. EleutherAI released full training checkpoints, data, and code, making Pythia 160M one of the most transparent and reproducible models available for academic study.

Qwen3.5 0.8B

Alibaba · 873M · runs from 0.7 GB

Qwen3.5 0.8B is an 873M-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

Meta Llama 3.1 8B is an 8.0B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 1.5B Quantized.w8a8

RedHatAI · 1.8B · runs from 1.1 GB

Qwen2.5 1.5B Quantized.w8a8 is a 1.8B-parameter open language model from RedHatAI in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.2 3B

Meta · 3.2B · runs from 1.5 GB

Meta Llama 3.2 3B is a 3.2-billion parameter base (pretrained) model from Meta's Llama 3.2 family. It supports a 128K token context window and is intended for fine-tuning, research, and custom applications rather than direct conversational use. The model provides a good balance between capability and efficiency at the small model scale. It is popular as a foundation for community fine-tunes and domain-specific adaptations. Released under the Llama 3.2 Community License.

Llama 2 7B HF

Meta · 6.7B · runs from 3.1 GB

Meta Llama 2 7B is a 6.7-billion parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.

ChatGLM2 6B

zai-org · 6B · runs from 2.8 GB

ChatGLM2 6B is a 6B-parameter open language model from zai-org in the GLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek V2 Lite

DeepSeek · 15.7B · runs from 7.4 GB

DeepSeek V2 Lite is a compact mixture-of-experts model with 15.7 billion total parameters, designed to deliver a strong quality-to-compute ratio for general chat and instruction following. It uses the same MLA (Multi-Head Latent Attention) architecture as the larger V2, which compresses the key-value cache and so reduces memory requirements during inference. With its modest parameter count, V2 Lite runs comfortably on a single consumer GPU, making it accessible to users who want to try DeepSeek's MoE approach without needing specialized hardware. It handles everyday conversational tasks, summarization, and light analysis well, offering a practical entry point into the DeepSeek model family.

GPT-J 6B

EleutherAI · 6B · runs from 2.8 GB

GPT-J 6B is a 6B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OpenAI GPT

OpenAI · 120M · runs from 0.1 GB

OpenAI GPT is the original 2018 transformer-based language model that started the GPT lineage, based on the paper "Improving Language Understanding by Generative Pre-Training." At just 120 million parameters, it is a historically significant model that demonstrated the power of unsupervised pretraining followed by supervised fine-tuning. This model is primarily of academic and historical interest today. It runs on essentially any hardware and can be useful for educational exploration of transformer architectures, but it should not be compared to modern instruction-tuned models in terms of practical capability.

Gemma 2 2B

Google · 2.6B · runs from 1.2 GB

Google Gemma 2 2B is a 2.6-billion parameter base (pretrained) model from Google's Gemma 2 family. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Its compact size makes it suitable for experimentation, rapid prototyping, and domain-specific fine-tuning on consumer hardware with minimal VRAM. Released under the Gemma license.

Gemma 4 12B

Google · 12.0B · runs from 6.1 GB

Gemma 4 12B is a 12.0B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen1.5 MoE A2.7B

Alibaba · 14.3B · runs from 6.8 GB

Qwen1.5 MoE A2.7B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 1.5 generation, with 14.3 billion total parameters but only 2.7 billion active parameters per forward pass. The MoE architecture allows it to deliver performance closer to dense 7B models while requiring less compute during inference, as only a subset of the experts is activated for each token. The model supports a 32K token context window. Because every expert must be loaded, it requires VRAM proportional to its total parameter count, even though the compute cost per token is lower. It is an interesting architectural variant for users exploring efficient inference and MoE models locally. Released under a custom Qwen license.
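The split between memory and compute described above can be sketched with two back-of-the-envelope numbers. This is a rough illustration, not a benchmark: the function name is hypothetical, the "about 2 FLOPs per active parameter per token" figure is a common rule of thumb assumed here, and the memory estimate covers weights only.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float = 16.0):
    """Return (weights_gb, flops_per_token) for a mixture-of-experts model.

    Memory scales with TOTAL parameters, since all experts are loaded.
    Compute scales with ACTIVE parameters, since only the routed experts
    run for each token. Uses the assumed ~2 FLOPs per active parameter
    rule of thumb for one forward pass.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    flops_per_token = 2 * active_params_b * 1e9
    return weights_gb, flops_per_token


if __name__ == "__main__":
    # Figures from the listing: 14.3B total, 2.7B active.
    moe_gb, moe_flops = moe_footprint(14.3, 2.7)
    # A dense model of the same total size activates every parameter.
    dense_gb, dense_flops = moe_footprint(14.3, 14.3)
    print(f"MoE:   {moe_gb:.1f} GB weights, {moe_flops:.2e} FLOPs/token")
    print(f"Dense: {dense_gb:.1f} GB weights, {dense_flops:.2e} FLOPs/token")
    print(f"Compute ratio (MoE / dense): {moe_flops / dense_flops:.2f}")
```

Under these assumptions the two models need the same memory for their weights at a given precision, while the MoE model does under a fifth of the per-token compute.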

MiMo 7B Base

XiaomiMiMo · 7.8B · runs from 3.9 GB

MiMo 7B Base is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OPT 350M

Meta · 350M · runs from 0.8 GB

Meta OPT 350M is a 350-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project, released in 2022 as part of a suite of models ranging from 125M to 175B parameters. It was designed to provide researchers with open access to models comparable to GPT-3 at various scales. The 350M variant runs on minimal hardware and is suitable for research, prototyping, and educational use. While it has been surpassed by modern architectures in terms of capability, it remains a lightweight option for basic text generation experiments and as a benchmark baseline.

Wildguard

Allen AI · 7.2B · runs from 3.4 GB

Wildguard is a 7.2B-parameter open language model from Allen AI. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 7B

huggyllama · 6.7B · runs from 3.1 GB

This is a community reupload of Meta's original Llama 1 7B model, published by the huggyllama account on Hugging Face. The original Llama 1 was a 6.7-billion parameter base model released by Meta in early 2023, trained on 1 trillion tokens of publicly available data. It pioneered the wave of open-weight large language models. As a first-generation Llama model, it has been superseded by Llama 2 and Llama 3 in terms of quality and capability. It remains of historical and research interest as the model that catalyzed the open-source LLM ecosystem. This upload provides convenient access in Hugging Face Transformers format.