All LLM Models

Browse 719 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
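The 7B figures above follow from simple arithmetic: weight memory is roughly the parameter count times the bits stored per weight. The sketch below illustrates this under stated assumptions. The function name `weight_memory_gb` and the bits-per-weight table are hypothetical, and the Q4_K_M value of about 4.8 bits is an approximation, because quantized files mix precisions across tensors. The estimate covers weights only, not the KV cache or runtime overhead.

```python
# Approximate bits stored per weight for each format.
# FP16 is exact; the quantized values are assumptions for illustration.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,    # 8-bit values plus a per-block scale
    "Q4_K_M": 4.8,  # rough average, varies by model
}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Estimate memory for the weights alone, in decimal gigabytes."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in ("FP16", "Q8_0", "Q4_K_M"):
        print(f"7B at {quant}: ~{weight_memory_gb(7, quant):.1f} GB")
```

Actual usage will be somewhat higher than these numbers, since the context window and the inference framework add memory on top of the weights.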

Model List

GPT-2 Medium

OpenAI · 380M · runs from 0.2 GB

716.0K 196

GPT-2 Medium scales the original GPT-2 architecture to 380 million parameters, offering noticeably improved text generation quality over the base 137M variant while remaining extremely lightweight by current standards. It supports the same autoregressive language modeling tasks as its smaller and larger siblings. Like all GPT-2 variants, it runs comfortably on virtually any modern hardware including CPU-only setups, making it an accessible option for learning, prototyping, and lightweight text generation experiments without needing a dedicated GPU.

Chat

OLMoE 1B 7B 0125 Instruct

Allen AI · 6.9B · runs from 2.5 GB

102.2K 65

OLMoE 1B 7B 0125 Instruct is a 6.9B-parameter open language model from Allen AI in the OLMo family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Mamba 130M HF

State Spaces · 129M · runs from 0.1 GB

795.6K 73

Mamba 130M is a state-space model developed by State Spaces that offers a fundamentally different architecture from the Transformer-based models that dominate the LLM landscape. Using selective state-space layers instead of attention, Mamba achieves linear-time inference scaling with sequence length, making it particularly efficient for processing long inputs. At 130 million parameters this is primarily a research and demonstration model, but it showcases the potential of state-space architectures for local deployment. Users interested in exploring alternatives to Transformer-based language models will find Mamba 130M a lightweight and accessible entry point for experimentation.

Chat
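To illustrate the linear-time claim in the Mamba 130M entry, here is a minimal sketch of a state-space recurrence. It is not Mamba's actual layer: Mamba makes its parameters input-dependent (the "selective" part), uses a multi-dimensional state, and runs a hardware-aware scan. The names `ssm_scan` and `attention_pair_count` and the fixed scalars `a`, `b`, `c` are assumptions chosen for illustration only.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0
    outputs = []
    for x in inputs:
        # One constant-cost update per token, so total work grows linearly.
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


def attention_pair_count(n):
    """Query-key pairs scored by causal self-attention over n tokens."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (1_000, 10_000):
        steps = len(ssm_scan([1.0] * n))
        print(f"tokens={n} recurrence_steps={steps} "
              f"attention_pairs={attention_pair_count(n)}")
```

The recurrence carries a fixed-size state from token to token, while causal attention compares each token with every earlier one, which is where the quadratic growth comes from.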

Qwen2.5 1.5B

Alibaba · 1.5B · runs from 1 GB

1.2M 187

Qwen2.5 1.5B is a 1.5B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 0.5B Chat

Alibaba · 620M · runs from 0.8 GB

85.9K 95

Qwen1.5 0.5B Chat is an early-generation small language model from Alibaba's Qwen series with just 620 million parameters. As one of the smallest models in the Qwen family, it was designed to demonstrate that useful conversational ability is possible even at sub-billion-parameter scale. This model runs easily on virtually any hardware, including CPUs, older GPUs, and even mobile devices. While its capabilities are limited compared to larger Qwen models, it remains a useful option for embedded applications, rapid prototyping, or situations where minimal resource consumption is the top priority.

Chat

Llama3 OpenBioLLM 8B

aaditya · 8B · runs from 3.9 GB

58.0K 242

Llama3 OpenBioLLM 8B is an 8B-parameter open language model from aaditya in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Meta Llama 3 8B Instruct

Nous Research · 8B · runs from 3.9 GB

27.5K 103

Meta Llama 3 8B Instruct is an 8B-parameter open language model from Nous Research in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenELM 1 1B Instruct

Apple · 1.1B · runs from 0.5 GB

1.5M 75

OpenELM 1 1B Instruct is a 1.1B-parameter open language model from Apple. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 7B IT

Google · 8.5B · runs from 4.0 GB

21.4K 1.2K

Google Gemma 7B IT is the instruction-tuned model from the original Gemma generation; despite the 7B name, its full parameter count is about 8.5 billion. It is fine-tuned for conversational use and general instruction following, running efficiently on consumer GPUs with 8 GB or more of VRAM. As a first-generation Gemma model, it has been superseded by Gemma 2 and Gemma 3 models in quality and capability, but it remains well-supported by inference frameworks. Released under the Gemma license.

Chat

Yi 9B

01.AI · 8.8B · runs from 4.1 GB

7.8K 187

Yi 9B is an 8.8B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Salamandra 7B Instruct

BSC-LT · 7.8B · runs from 3.8 GB

70.3K 76

Salamandra 7B Instruct is a 7.8-billion-parameter multilingual model developed by the Barcelona Supercomputing Center (BSC-LT) as part of a European initiative to build high-quality open language models. It has particular strength in Iberian languages including Spanish, Catalan, Portuguese, and Basque, while also supporting English and other major European languages. This model is an excellent choice for users who need strong performance in Spanish or other Iberian languages that are often underserved by mainstream LLMs. Running it locally ensures data privacy for sensitive multilingual workflows, and at 7.8B parameters it fits comfortably on a single consumer GPU with 8 GB or more of VRAM.

Chat

Tiny LLM

arnir0 · 13M · runs from 0.3 GB

53.5K 52

Tiny LLM is a 13M-parameter open language model from arnir0. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Bloom 560M

BigScience · 559M · runs from 0.3 GB

497.7K 374

Bloom 560M is a 559M-parameter open language model from BigScience. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Sarvam 30B

sarvamai · 32.2B · runs from 14 GB

54.6K 205

Sarvam 30B is a 32.2B-parameter open language model from sarvamai. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LocoTrainer 4B

LocoreMind · 4.0B · runs from 2.2 GB

1.2K 163

LocoTrainer 4B is a 4.0B-parameter open language model from LocoreMind. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code · Functions

NVIDIA Nemotron 3 Nano 4B BF16

NVIDIA · 4.0B · runs from 2.2 GB

342.4K 90

NVIDIA Nemotron 3 Nano 4B BF16 is a 4.0B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Prometheus 7B V2.0

prometheus-eval · 7.2B · runs from 3.6 GB

75.4K 101

Prometheus 7B V2.0 is a specialized judge model trained by prometheus-eval to evaluate the quality of outputs from other language models. At 7.2 billion parameters, it is designed to score and critique LLM responses against custom rubrics, making it a valuable tool for automated evaluation pipelines and benchmarking. Unlike general-purpose chat models, Prometheus is purpose-built for assessment tasks. It can provide structured feedback on dimensions like helpfulness, accuracy, and coherence. Useful for researchers, developers building LLM applications, and anyone who needs consistent automated evaluation without relying on paid API calls to frontier models. Runs comfortably on most modern GPUs with 8 GB or more of VRAM.

Chat

ReaderLM v2

jinaai · 1.5B · runs from 1.0 GB

361.3K 792

ReaderLM v2 is a 1.5B-parameter open language model from jinaai. It supports a context window of up to 512,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 68M

JackFram · 68M · runs from under 0.1 GB

186.6K 39

Llama 68M is a 68M-parameter open language model from JackFram in the Llama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

C4ai Command R V01

Cohere · 35.0B · runs from 15.9 GB

28.5K 1.1K

C4ai Command R V01 is a 35.0B-parameter open language model from Cohere in the Command R family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

StableBeluga2

Stability AI · 70B · runs from 20.2 GB

829 882

StableBeluga2 is a 70B-parameter open language model from Stability AI. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Amber

LLM360 · 6.7B · runs from 3.2 GB

53.0K 72

Amber is a 6.7-billion-parameter model from LLM360, an initiative dedicated to full transparency in large language model training. Every aspect of Amber's creation has been publicly documented and released, including the complete training data, all intermediate checkpoints, training code, and evaluation results. This level of openness makes Amber uniquely valuable for researchers studying training dynamics, data influence, and model behavior at scale. For local deployment, it offers solid general-purpose text generation at a size that fits comfortably on mid-range consumer GPUs, though users primarily seeking chat performance may prefer models specifically tuned for instruction following.

Chat

Mellum 4B Base

JetBrains · 4.0B · runs from 2.8 GB

2.4K 449

Mellum 4B Base is a 4.0B-parameter open language model from JetBrains in the Mellum family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Baichuan2 13B Chat

baichuan-inc · 13B · runs from 3.9 GB

10.5K 432

Baichuan2 13B Chat is a 13B-parameter open language model from baichuan-inc in the Baichuan family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM3 3B Base

Hugging Face · 3B · runs from 1.3 GB

89.8K 150

SmolLM3 3B Base is the pretrained foundation model from Hugging Face's third-generation SmolLM family. Without instruction tuning or chat alignment, it serves as a versatile starting point for researchers and developers who want to fine-tune the model for specific domains, tasks, or behavioral profiles. With 3 billion parameters and the architectural improvements introduced in SmolLM3, this base model offers strong general language capabilities in a package that remains practical to train and adapt on consumer-grade hardware. It is an excellent choice for custom fine-tuning projects where off-the-shelf chat behavior is not needed.

Chat

GPT-2 Large

OpenAI · 812M · runs from 0.4 GB

2.1M 353

GPT-2 Large is an 812M-parameter open language model from OpenAI. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Pythia 410M

EleutherAI · 506M · runs from 0.2 GB

104.6K 37

Pythia 410M is a 506M-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Dolphin 2.9.1 Yi 1.5 34B

dphn · 34.4B · runs from 10.3 GB

4.7M 57

Dolphin 2.9.1 Yi 1.5 34B is a 34.4-billion-parameter chat model created by Eric Hartford's Dolphin project, fine-tuned from 01.AI's Yi 1.5 34B base. The Dolphin series is known for producing uncensored fine-tunes that remove alignment-based refusals, giving users more direct and unrestricted model responses. This model combines the strong bilingual capabilities of Yi 1.5 with Dolphin's open fine-tuning approach. It requires a GPU with at least 24 GB of VRAM for quantized local inference and is popular among users who prefer models without built-in content restrictions.

Chat

Falcon 11B

TII UAE · 11.1B · runs from 5.0 GB

4.5K 219

Falcon 11B is an 11.1B-parameter open language model from TII UAE in the Falcon family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 0.5B

Alibaba · 494M · runs from 0.5 GB

2.0M 421

Qwen2.5 0.5B is the smallest base (pretrained) model in Alibaba Cloud's Qwen 2.5 family, with 494 million parameters. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a foundation for custom applications. It supports a 32K-token (32,768) context window. Its minimal size makes it suitable for experimentation, rapid prototyping, and resource-constrained fine-tuning tasks. The model can run on virtually any hardware. Released under the Apache 2.0 license.

Chat