All LLM Models

Browse 671 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB for its weights at FP16 (2 bytes per parameter) but only ~4 GB at Q4_K_M quantization. The KV cache and runtime overhead add to both figures, and the extra grows with context length.
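The arithmetic behind figures like these can be sketched in a few lines of Python. This is a rough, weights-only estimate, not the method this site uses. The function name `estimate_weight_gb` and the table `BITS_PER_WEIGHT` are hypothetical, and the 4.85 bits per weight used for Q4_K_M is an assumed average that varies somewhat by model.

```python
# Approximate bits stored per weight for each format.
# FP16 is exactly 16; Q8_0 stores 32 8-bit weights plus one 16-bit scale (8.5).
# The Q4_K_M value is an ASSUMED average, since it mixes several block types.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes).

    Covers the weights only: KV cache, activations, and runtime overhead
    are not included and must be budgeted separately.
    """
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model comes out at 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the example above. Real usage will be higher once the context window is filled.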

Model List

HRM Text 1B

sapientinc · 1.2B · runs from 1 GB

123.0K 752

HRM Text 1B is a 1.2B-parameter open language model from sapientinc. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Falcon H1 7B Instruct

TII UAE · 7.6B · runs from 2.6 GB

13.4K 33

Falcon H1 7B Instruct is a 7.6B-parameter open language model from TII UAE in the Falcon family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 31B IT Speculator.eagle3

RedHatAI · 31B · runs from 14.5 GB

100.2K 49

Gemma 4 31B IT Speculator.eagle3 is a 31B-parameter open language model from RedHatAI in the Gemma 4 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.2 1B

Meta · 1.2B · runs from 0.6 GB

2.1M 2.4K

Meta Llama 3.2 1B is a 1.2-billion parameter base (pretrained) model from Meta's Llama 3.2 release. It is the smallest model in the Llama 3.2 family and is designed for research, fine-tuning, and embedding into resource-constrained environments. It supports a 128K token context window. As a base model, it is not optimized for conversational use without further fine-tuning. Its minimal resource requirements make it suitable for experimentation, edge deployment, and as a starting point for domain-specific fine-tuning. Released under the Llama 3.2 Community License.

Chat

Nanbeige4.1 3B Heretic

heretic-org · 3.9B · runs from 2.1 GB

3.0K 41

Nanbeige4.1 3B Heretic is a 3.9B-parameter open language model from heretic-org. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Medgemma 27B Text IT

Google · 27.0B · runs from 8.2 GB

26.0K 440

Google MedGemma 27B Text IT is a 27-billion parameter instruction-tuned model specialized for the medical domain, built on the Gemma architecture by Google. It is fine-tuned on medical and clinical text data to improve performance on healthcare-related tasks such as medical question answering, clinical reasoning, and health information summarization. A GPU with at least 24 GB of VRAM is recommended for quantized inference, although the smallest quantization listed here runs from 8.2 GB. Its domain specialization makes it more capable than general-purpose models on clinical benchmarks, though it should not be used as a substitute for professional medical advice. Released under the Gemma license.

Chat

Qwopus3.6 27B v2

Jackrong · 27.8B · runs from 12.6 GB

9.9K 40

Qwopus3.6 27B v2 is a 27.8B-parameter open language model from Jackrong. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision · Reasoning · Functions

Hy MT2 30B A3B

tencent · 30.1B · runs from 13.2 GB

5.7K 454

Hy MT2 30B A3B is a 30.1B-parameter open language model from tencent. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Translation

Cogito V1 Preview Qwen 32B

deepcogito · 32B · runs from 10.4 GB

43.2K 116

Cogito V1 Preview Qwen 32B is a 32B-parameter open language model from deepcogito in the Qwen family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OmniCoder 9B

Tesslate · 9.4B · runs from 3.5 GB

5.9K 645

OmniCoder 9B is a 9.4B-parameter open language model from Tesslate. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code · Functions

Hermes 4.3 36B

Nous Research · 36.2B · runs from 10.5 GB

27.5K 239

Hermes 4.3 36B is a 36.2B-parameter open language model from Nous Research in the Hermes family. It supports a context window of up to 524,288 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning · Roleplay

North Mini Code 1.0

Cohere · 30.5B · runs from 8.8 GB

4.1K 330

North Mini Code 1.0 is a 30.5B-parameter open language model from Cohere. It supports a context window of up to 500,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code · Functions

Zephyr 7B Beta

Hugging Face · 7.2B · runs from 3.6 GB

144.5K 1.8K

Zephyr 7B Beta is a 7.2B-parameter open language model from Hugging Face. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi 4 Reasoning

Microsoft · 14.7B · runs from 4.8 GB

8.2K 227

Phi 4 Reasoning is a 14.7B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math · Code · Reasoning

Huihui MiniCPM5 1B Abliterated

huihui-ai · 1.1B · runs from 0.6 GB

124 6

Huihui MiniCPM5 1B Abliterated is a 1.1B-parameter open language model from huihui-ai in the MiniCPM family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

ERNIE 4.5 21B A3B PT

Baidu · 21B · runs from 6.2 GB

25.6K 165

ERNIE 4.5 21B A3B PT is a 21B-parameter open language model from Baidu in the ERNIE family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

INTELLECT 1 Instruct

PrimeIntellect · 10.2B · runs from 3.7 GB

248 125

INTELLECT 1 Instruct is a 10.2B-parameter open language model from PrimeIntellect. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Mellum2 12B A2.5B Instruct

JetBrains · 12.1B · runs from 5.5 GB

990 64

Mellum2 12B A2.5B Instruct is a 12.1B-parameter open language model from JetBrains in the Mellum family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Apertus 8B Instruct 2509

swiss-ai · 8B · runs from 2.8 GB

117.9K 439

Apertus 8B Instruct is an open-source instruction-tuned model from Swiss AI, a collaborative research initiative. Built on an 8 billion parameter base, it emphasizes transparency, open data, and European AI sovereignty. For local users, it delivers solid general-purpose chat and instruction-following in a standard 8B footprint that runs well on consumer GPUs with 8 to 10 GB of VRAM, making it a practical choice for those who value open, community-driven model development.

Chat

Starcoder2 7B

BigCode · 7.2B · runs from 3.5 GB

12.3K 215

Starcoder2 7B is a 7.2B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Qwen3.5 9B Claude 4.6 Opus Reasoning Distilled

Jackrong · 9.7B · runs from 4.7 GB

5.0K 29

Qwen3.5 9B Claude 4.6 Opus Reasoning Distilled is a 9.7B-parameter open language model from Jackrong in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Starcoder2 3B

BigCode · 3.0B · runs from 1.6 GB

123.0K 219

Starcoder2 3B is a 3.0B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Carnice V1 9B Hermes Agent Stage2 Merged

kai-os · 9.0B · runs from 4.4 GB

2.1K 183

Carnice V1 9B Hermes Agent Stage2 Merged is a 9.0B-parameter open language model from kai-os in the Hermes family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Functions · Reasoning

BioMistral 7B

BioMistral · 7B · runs from 3.5 GB

102.4K 506

BioMistral 7B is a 7B-parameter open language model from BioMistral in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2.5 1.2B JP 202606

LiquidAI · 1.2B · runs from 0.9 GB

2.7K 61

LFM2.5 1.2B JP 202606 is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Mini 4B Instruct

NVIDIA · 4B · runs from 1.8 GB

473.9K 182

Nemotron Mini 4B Instruct is a 4B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi 3.5 MoE Instruct

Microsoft · 41.9B · runs from 12.1 GB

123.9K 574

Phi 3.5 MoE Instruct is a 41.9B-parameter open language model from Microsoft in the Phi 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Granite 3.3 8B Instruct

IBM · 8.2B · runs from 2.9 GB

70.4K 157

Granite 3.3 8B Instruct is an 8.2B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama Guard 3 8B

Meta · 8.0B · runs from 2.4 GB

53.9K 299

Meta Llama Guard 3 8B is an 8-billion parameter safety classifier model built on the Llama 3.1 architecture. Unlike general-purpose chat models, Llama Guard is specifically designed to classify whether prompts or responses contain unsafe content across categories such as violence, sexual content, criminal planning, and other policy violations. The model is intended to be used as a moderation layer in LLM-based applications, providing input and output safety filtering. It follows a taxonomy-based classification approach and can be customized for different safety policies. Released under the Llama 3.1 Community License.

Chat

WhiteRabbitNeo 13B V1

WhiteRabbitNeo · 13B · runs from 7.5 GB

2.9K 459

WhiteRabbitNeo 13B V1 is a 13B-parameter open language model from WhiteRabbitNeo. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat