All LLM Models

Browse 593 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
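The arithmetic behind those figures can be sketched as parameters times bits per weight, divided by 8 to get bytes. The snippet below is a minimal estimate of weight memory only. The function name `estimate_weight_gb` and the bits-per-weight table are illustrative assumptions (the quantized values are approximate averages, not figures taken from this page), and real usage is higher once the KV cache and runtime overhead are added.

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8 bytes.
# The bits-per-weight values below are approximate assumptions for illustration.
BITS_PER_WEIGHT = {
    "FP16": 16.0,   # 2 bytes per parameter
    "Q8_0": 8.5,    # approximate average, including block scales
    "Q4_K_M": 4.8,  # approximate average for this mixed 4-bit scheme
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the model weights in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model comes out at about 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the example above. Longer context windows add KV-cache memory on top of this estimate.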

Model List

Gemma 2B

Google · 2.5B · runs from 1.2 GB

248.0K 1.2K

Gemma 2B is a 2.5B-parameter open language model from Google in the original Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DeepSeek v2 Lite Chat

DeepSeek · 15.7B · runs from 5.1 GB

968.2K 141

DeepSeek v2 Lite Chat is a 15.7B-parameter open language model from DeepSeek in the DeepSeek V2 family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 2 9B

Google · 9.2B · runs from 4.2 GB

75.4K 709

Gemma 2 9B is a 9.2B-parameter open language model from Google in the Gemma 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 7B

Alibaba · 7.6B · runs from 3.6 GB

802.3K 291

Qwen2.5 7B is a 7.6B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

EXAONE 4.0 1.2B

LGAI-EXAONE · 1.3B · runs from 1.0 GB

16.3K 184

EXAONE 4.0 1.2B is a 1.3B-parameter open language model from LGAI-EXAONE in the EXAONE family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2 1.2B RAG

LiquidAI · 1.2B · runs from 0.9 GB

534 122

LFM2 1.2B RAG is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Racka 4B

elte-nlp · 4.0B · runs from 1.9 GB

633 20

Racka 4B is a 4.0B-parameter open language model from elte-nlp. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM 135M

Hugging Face · 135M · runs from 0.4 GB

210.8K 260

SmolLM 135M is the original first-generation small language model from Hugging Face, designed to push the boundaries of what is achievable at extremely low parameter counts. With just 135 million parameters, it was a pioneering effort in making capable language models accessible on the most resource-constrained hardware. While the SmolLM2 and SmolLM3 families have since surpassed it in quality, the original SmolLM 135M remains a useful reference point for research and a practical option for ultra-lightweight deployment scenarios where every megabyte of memory counts.

Chat

Supra 50M Reasoning

SupraLabs · 52M · runs from 0.3 GB

3.0K 44

Supra 50M Reasoning is a 52M-parameter open language model from SupraLabs. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Bitnet B1.58 2B 4T

Microsoft · 850M · runs from 2.2 GB

8.8K 1.5K

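Bitnet B1.58 2B 4T is an open language model from Microsoft, listed here at 850M parameters; its name refers to a roughly 2B-parameter model trained on 4T tokens, so the listed count may reflect packed low-bit weights. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.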

Chat

Qwen2.5 14B

Alibaba · 14.8B · runs from 6.8 GB

60.9K 154

Qwen2.5 14B is a 14.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 4B Base

Alibaba · 4.0B · runs from 2.2 GB

758.6K 95

Qwen3 4B Base is a 4.0B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 E2B IT Qat Q4 0 Unquantized Heretic

coder3101 · 5.1B · runs from 2.5 GB

1.5K 4

Gemma 4 E2B IT Qat Q4 0 Unquantized Heretic is a 5.1B-parameter open language model from coder3101 in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Code

Eurus 2 7B PRIME

PRIME-RL · 7.6B · runs from 3.0 GB

1.5K 62

Eurus 2 7B PRIME is a 7.6B-parameter open language model from PRIME-RL. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Saiga Llama3 8B

IlyaGusev · 8.0B · runs from 4.0 GB

407.0K 141

Saiga Llama3 8B is an 8.0B-parameter open language model from IlyaGusev in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Jan v3 4B Base Instruct

janhq · 4.4B · runs from 2.0 GB

1.9K 59

Jan v3 4B Base Instruct is a 4.4B-parameter open language model from janhq. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

SQLCoder 7B 2

defog · 6.7B · runs from 4.2 GB

24.1K 436

SQLCoder 7B 2 is a 6.7-billion-parameter model from Defog, purpose-built for converting natural-language questions into SQL queries. Fine-tuned specifically on text-to-SQL tasks, it consistently outperforms much larger general-purpose models when the job is generating accurate, executable SQL against real database schemas. For developers and data analysts who regularly query databases, running SQLCoder locally means fast, private SQL generation without sending proprietary schema details to an external API. It works best when provided with table definitions as context and is particularly strong on PostgreSQL, MySQL, and SQLite dialects.

Chat · Code

Llama 2 7B Chat HF

Meta · 6.7B · runs from 3.1 GB

258.1K 4.8K

Meta Llama 2 7B Chat is a 7-billion-parameter instruction-tuned model from Meta's Llama 2 family, optimized for dialogue use cases. It was fine-tuned using supervised fine-tuning and RLHF on top of the Llama 2 7B base model, with a 4K-token context window. This model is suitable for basic conversational AI tasks and runs efficiently on consumer GPUs. While newer Llama generations offer improved performance, Llama 2 7B Chat remains a well-understood and widely supported option for local inference. Released under the Llama 2 Community License.

Chat

Yi 6B Chat

01.AI · 6.1B · runs from 2.9 GB

36.7K 70

Yi 6B Chat is a 6.1B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

GPT-2 Medium

OpenAI · 380M · runs from 0.2 GB

716.0K 196

GPT-2 Medium scales the original GPT-2 architecture to 380 million parameters, offering noticeably improved text generation quality over the base 137M variant while remaining extremely lightweight by current standards. It supports the same autoregressive language modeling tasks as its smaller and larger siblings. Like all GPT-2 variants, it runs comfortably on virtually any modern hardware including CPU-only setups, making it an accessible option for learning, prototyping, and lightweight text generation experiments without needing a dedicated GPU.

Chat

OLMoE 1B 7B 0125 Instruct

Allen AI · 6.9B · runs from 2.5 GB

102.2K 65

OLMoE 1B 7B 0125 Instruct is a 6.9B-parameter open language model from Allen AI in the OLMo family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Mamba 130M HF

State Spaces · 129M · runs from 0.1 GB

795.6K 73

Mamba 130M is a state-space model developed by State Spaces that offers a fundamentally different architecture from the Transformer-based models that dominate the LLM landscape. Using selective state-space layers instead of attention, Mamba achieves linear-time inference scaling with sequence length, making it particularly efficient for processing long inputs. At 130 million parameters this is primarily a research and demonstration model, but it showcases the potential of state-space architectures for local deployment. Users interested in exploring alternatives to Transformer-based language models will find Mamba 130M a lightweight and accessible entry point for experimentation.

Chat

Qwen2.5 1.5B

Alibaba · 1.5B · runs from 1.0 GB

1.2M 187

Qwen2.5 1.5B is a 1.5B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 0.5B Chat

Alibaba · 620M · runs from 0.8 GB

85.9K 95

Qwen1.5 0.5B Chat is an early-generation small language model from Alibaba's Qwen series with just 620 million parameters. As one of the smallest models in the Qwen family, it was designed to demonstrate that useful conversational ability is possible even at sub-billion parameter scales. This model runs easily on virtually any hardware including CPUs, older GPUs, and even mobile devices. While its capabilities are limited compared to larger Qwen models, it remains a useful option for embedded applications, rapid prototyping, or situations where minimal resource consumption is the top priority.

Chat

Llama3 OpenBioLLM 8B

aaditya · 8B · runs from 3.9 GB

58.0K 242

Llama3 OpenBioLLM 8B is an 8B-parameter open language model from aaditya in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Meta Llama 3 8B Instruct

Nous Research · 8B · runs from 3.9 GB

27.5K 103

Meta Llama 3 8B Instruct is an 8B-parameter open language model from Nous Research in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenELM 1 1B Instruct

Apple · 1.1B · runs from 0.5 GB

1.5M 75

OpenELM 1 1B Instruct is a 1.1B-parameter open language model from Apple. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 7B IT

Google · 8.5B · runs from 4.0 GB

21.4K 1.2K

Google Gemma 7B IT is a 7-billion-parameter instruction-tuned model from the original Gemma generation. It is fine-tuned for conversational use and general instruction following, running efficiently on consumer GPUs with 8 GB or more of VRAM. As a first-generation Gemma model, it has been superseded by Gemma 2 and Gemma 3 models in quality and capability, but it remains well supported by inference frameworks. Released under the Gemma license.

Chat

Yi 9B

01.AI · 8.8B · runs from 4.1 GB

7.8K 187

Yi 9B is an 8.8B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Salamandra 7B Instruct

BSC-LT · 7.8B · runs from 3.8 GB

70.3K 76

Salamandra 7B Instruct is a 7.8-billion-parameter multilingual model developed by the Barcelona Supercomputing Center (BSC-LT) as part of a European initiative to build high-quality open language models. It has particular strength in Iberian languages including Spanish, Catalan, Portuguese, and Basque, while also supporting English and other major European languages. This model is an excellent choice for users who need strong performance in Spanish or other Iberian languages that are often underserved by mainstream LLMs. Running it locally ensures data privacy for sensitive multilingual workflows, and at 7B parameters it fits comfortably on a single consumer GPU with 8 GB or more of VRAM.

Chat