All LLM Models

Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB at FP16 (2 bytes per weight) but only ~4 GB at Q4_K_M quantization.
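
The arithmetic behind the FP16 and Q4_K_M figures above can be sketched as follows. This is a rough estimate of weight memory only, assuming 16 bits per weight for FP16 and an average of about 4.85 bits per weight for Q4_K_M (an approximation; the real value varies by model). The function name and lookup table are illustrative, and the estimate ignores the KV cache and runtime overhead.

```python
# Approximate bits stored per weight. The Q4_K_M value is an assumed
# average, not an exact specification; real files vary by model.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q4_K_M": 4.85}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Estimate memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    # Billions of parameters times bytes per weight gives GB directly.
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

For a 7B model this gives about 14 GB at FP16 and a little over 4 GB at Q4_K_M. Actual usage is higher once context and runtime buffers are included.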

Qwen3.5 4B

Alibaba · 4.7B · runs from 2.5 GB

Qwen3.5 4B is a 4.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3.5 9B

Alibaba · 9.7B · runs from 4.7 GB

Qwen3.5 9B is a 9.7B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Tiny Qwen2ForCausalLM 2.5

trl-internal-testing · 2M · runs from 0.3 GB

Tiny Qwen2ForCausalLM 2.5 is a 2M-parameter open language model from trl-internal-testing in the Qwen 2 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Pythia 160M

EleutherAI · 160M · runs from 0.1 GB

Pythia 160M is part of EleutherAI's Pythia training suite, a collection of models trained on the same data in the same order at multiple scales to enable rigorous scientific research into how language models learn. At 160 million parameters, it is the smallest model in the suite and runs on virtually any hardware. This model is primarily valuable for researchers studying scaling laws, training dynamics, and emergent capabilities across model sizes. EleutherAI released full training checkpoints, data, and code, making Pythia 160M one of the most transparent and reproducible models available for academic study.

Qwen3.5 0.8B

Alibaba · 873M · runs from 0.7 GB

Qwen3.5 0.8B is an 873M-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

Meta Llama 3.1 8B is an 8.0B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 1.5B Quantized.w8a8

RedHatAI · 1.8B · runs from 1.1 GB

Qwen2.5 1.5B Quantized.w8a8 is a 1.8B-parameter open language model from RedHatAI in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.2 3B

Meta · 3.2B · runs from 1.5 GB

Meta Llama 3.2 3B is a 3.2-billion parameter base (pretrained) model from Meta's Llama 3.2 family. It supports a 128K token context window and is intended for fine-tuning, research, and custom applications rather than direct conversational use. The model provides a good balance between capability and efficiency at the small model scale. It is popular as a foundation for community fine-tunes and domain-specific adaptations. Released under the Llama 3.2 Community License.

Llama 2 7B HF

Meta · 6.7B · runs from 3.1 GB

Meta Llama 2 7B is a 6.7-billion parameter base (pretrained) language model from Meta's Llama 2 generation, provided in Hugging Face Transformers format. It was trained on 2 trillion tokens with a 4K token context window and represented a significant step in openly available large language models when released. As a base model, it is designed for further fine-tuning and research rather than direct chat use. While superseded by Llama 3 and later releases in terms of benchmark performance, Llama 2 7B remains widely used in the research community and as a baseline for comparison. Released under the Llama 2 Community License.

ChatGLM2 6B

zai-org · 6B · runs from 2.8 GB

ChatGLM2 6B is a 6B-parameter open language model from zai-org in the GLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek V2 Lite

DeepSeek · 15.7B · runs from 7.4 GB

DeepSeek V2 Lite is a compact mixture-of-experts model with 15.7 billion total parameters, designed to deliver a strong quality-to-compute ratio for general chat and instruction following. It uses the same innovative MLA (Multi-Head Latent Attention) architecture as the larger V2, which reduces memory requirements during inference. With its modest parameter count, V2 Lite runs comfortably on a single consumer GPU, making it accessible to users who want to try DeepSeek's MoE approach without needing specialized hardware. It handles everyday conversational tasks, summarization, and light analysis well, offering a practical entry point into the DeepSeek model family.

GPT-J 6B

EleutherAI · 6B · runs from 2.8 GB

GPT-J 6B is a 6B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OpenAI GPT

OpenAI · 120M · runs from 0.1 GB

OpenAI GPT is the original 2018 transformer-based language model that started the GPT lineage, based on the paper "Improving Language Understanding by Generative Pre-Training." At just 120 million parameters, it is a historically significant model that demonstrated the power of unsupervised pretraining followed by supervised fine-tuning. This model is primarily of academic and historical interest today. It runs on essentially any hardware and can be useful for educational exploration of transformer architectures, but it should not be compared to modern instruction-tuned models in terms of practical capability.

Llama 3.1 405B

Meta · 405.9B · runs from 189.7 GB

Meta Llama 3.1 405B is the largest model in the Llama 3.1 family, with 405 billion parameters. At release it was Meta's most capable open-weight model, delivering performance competitive with leading proprietary models across reasoning, coding, math, and multilingual tasks. It features a 128K token context window. Because of its size, running Llama 3.1 405B locally requires significant hardware, typically multiple high-end professional GPUs with a combined VRAM of 200 GB or more at reduced precision. It is primarily used in quantized formats for local inference or via multi-node setups. Released under the Llama 3.1 Community License.
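
To make the multi-GPU requirement concrete, here is a minimal sketch that estimates how many identical cards are needed to hold a model of a given size. It assumes the weights can be split evenly across cards and that about 90% of each card's VRAM is usable. The `min_gpus` helper, the 0.9 factor, and the 80 GB and 24 GB card sizes are illustrative assumptions; the 189.7 GB figure is the "runs from" value listed for this model.

```python
import math


def min_gpus(required_gb: float, gpu_vram_gb: float,
             usable_fraction: float = 0.9) -> int:
    """Smallest number of identical GPUs whose usable VRAM covers required_gb."""
    # Assumed headroom: only a fraction of each card is available for weights.
    usable_per_gpu = gpu_vram_gb * usable_fraction
    return math.ceil(required_gb / usable_per_gpu)


if __name__ == "__main__":
    for vram in (80, 24):  # hypothetical card sizes in GB
        print(f"{vram} GB cards: {min_gpus(189.7, vram)}")
```

Under these assumptions the lowest listed quantization needs three 80 GB cards or nine 24 GB cards. Longer contexts and higher-precision quantizations raise the count.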

Gemma 2 2B

Google · 2.6B · runs from 1.2 GB

Google Gemma 2 2B is a 2.6-billion parameter base (pretrained) model from Google's Gemma 2 family. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Its compact size makes it suitable for experimentation, rapid prototyping, and domain-specific fine-tuning on consumer hardware with minimal VRAM. Released under the Gemma license.

Gemma 4 12B

Google · 12.0B · runs from 6.1 GB

Gemma 4 12B is a 12.0B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.1 405B Instruct

Meta · 405.9B · runs from 189.7 GB

Llama 3.1 405B Instruct is a 405.9B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3 70B

Meta · 70.6B · runs from 33.0 GB

Meta Llama 3 70B is a 70.6B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen1.5 MoE A2.7B

Alibaba · 14.3B · runs from 6.8 GB

Qwen1.5 MoE A2.7B is a Mixture of Experts (MoE) model from Alibaba Cloud's Qwen 1.5 generation, with 14.3 billion total parameters but only 2.7 billion active parameters per forward pass. The MoE architecture allows it to deliver performance closer to dense 7B models while requiring less compute during inference, because only a subset of the experts is activated for each token. The model supports a 32K token context window. Loading it requires VRAM proportional to its total parameter count, even though the compute cost per token is lower. It is an interesting architectural variant for users exploring efficient inference and MoE models locally. Released under a custom Qwen license.
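
The split between memory cost and compute cost can be illustrated with a short sketch. It assumes weight memory scales with total parameters at a fixed number of bits per weight, and that per-token work scales roughly with the active share of parameters. The `moe_footprint` helper is hypothetical, and the 16-bit default is an assumption for the example.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float = 16.0) -> tuple[float, float]:
    """Return (weight memory in GB, fraction of parameters active per token)."""
    # Every expert must be resident in memory, so size follows the total count.
    weight_gb = total_params_b * bits_per_weight / 8
    # Only the routed experts run per token, so compute follows the active count.
    active_fraction = active_params_b / total_params_b
    return weight_gb, active_fraction


if __name__ == "__main__":
    gb, frac = moe_footprint(14.3, 2.7)
    print(f"weights: ~{gb:.1f} GB at 16-bit, active share: {frac:.0%}")
```

With 14.3B total and 2.7B active parameters, the weights take roughly 28.6 GB at 16-bit precision, while only about 19% of the parameters take part in each token's forward pass.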

DeepSeek V3.2 Exp

DeepSeek · 685.4B · runs from 295.2 GB

DeepSeek V3.2 Exp is a 685.4B-parameter open language model from DeepSeek in the DeepSeek V3 family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

MiMo 7B Base

XiaomiMiMo · 7.8B · runs from 3.9 GB

MiMo 7B Base is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OPT 350M

Meta · 350M · runs from 0.8 GB

Meta OPT 350M is a 350-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project, released in 2022 as part of a suite of models ranging from 125M to 175B parameters. It was designed to provide researchers with open access to models comparable to GPT-3 at various scales. The 350M variant runs on minimal hardware and is suitable for research, prototyping, and educational use. While it has been surpassed by modern architectures in terms of capability, it remains a lightweight option for basic text generation experiments and as a benchmark baseline.

Wildguard

Allen AI · 7.2B · runs from 3.4 GB

Wildguard is a 7.2B-parameter open language model from Allen AI. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 7B

huggyllama · 6.7B · runs from 3.1 GB

This is a community reupload of Meta's original Llama 1 7B model, published by the huggyllama account on Hugging Face. The original Llama 1 was a 6.7-billion parameter base model released by Meta in early 2023, trained on 1 trillion tokens of publicly available data. It pioneered the wave of open-weight large language models. As a first-generation Llama model, it has been superseded by Llama 2 and Llama 3 in terms of quality and capability. It remains of historical and research interest as the model that catalyzed the open-source LLM ecosystem. This upload provides convenient access in Hugging Face Transformers format.

Mistral Small Instruct 2409

Mistral AI · 22.2B · runs from 10.2 GB

Mistral Small Instruct 2409 is a 22.2B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

TinyStories 1M

roneneldan · 1M · runs from 0.0 GB

TinyStories 1M is a 1M-parameter open language model from roneneldan. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 2 13B Chat HF

Meta · 13.0B · runs from 6.1 GB

Meta Llama 2 13B Chat is a 13-billion parameter instruction-tuned model from Meta's Llama 2 family, fine-tuned for dialogue and chat applications. It offers improved reasoning and generation quality over the 7B variant while maintaining manageable hardware requirements with a 4K token context window. The model was fine-tuned using supervised fine-tuning and RLHF. It can run on consumer GPUs with 16GB or more of VRAM at reduced precision. Released under the Llama 2 Community License.

Qwen2 1.5B

Alibaba · 1.5B · runs from 1.0 GB

Qwen2 1.5B is a 1.5-billion parameter base (pretrained) model from Alibaba Cloud's older Qwen 2 generation. It was trained on a multilingual corpus and supports a context window of up to 32K tokens. As a base model, it is designed for fine-tuning and research rather than direct conversational use. While superseded by the Qwen 2.5 series in terms of training data quality and benchmark performance, Qwen2 1.5B remains a lightweight option for experimentation and as a baseline for comparison. Released under the Apache 2.0 license.

MiMo 7B RL

XiaomiMiMo · 7.8B · runs from 3.9 GB

MiMo 7B RL is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Baichuan 7B

baichuan-inc · 7B · runs from 15.4 GB

Baichuan 7B is a 7B-parameter open language model from baichuan-inc in the Baichuan family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.