All LLM Models
Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends mainly on the model size and quantization level. Quantization reduces the precision of model weights, trading a small loss in quality for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB at FP16 (2 bytes per parameter) but only ~4 GB at Q4_K_M quantization.
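The arithmetic behind the 7B example can be sketched in a few lines of Python. This is a rough, weights-only estimate and not how this site computes its figures: the function name `estimate_weight_gb` is hypothetical, the bits-per-weight values (in particular ~4.85 for Q4_K_M) are approximate assumptions, and the KV cache and runtime overhead are ignored.

```python
# Rough weights-only VRAM estimate: parameters x bits per weight / 8.
# The bits-per-weight table is an assumption for illustration; real
# quantized files mix precisions, so actual sizes vary by model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,    # 2 bytes per parameter
    "Q8_0": 8.5,     # 8-bit weights plus a per-block scale
    "Q4_K_M": 4.85,  # approximate average, assumed here
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B @ {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Treat the result as a lower bound when checking whether a model fits on a given GPU or Mac, since long context windows add memory on top of the weights.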
Model List
Kimi K2 Instruct 0905
Moonshot AI · 1026.5B · runs from 286.2 GB
Kimi K2 Instruct 0905 is a 1026.5B-parameter open language model from Moonshot AI in the Kimi K2 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwopus3.6 27B v2
Jackrong · 27.8B · runs from 12.6 GB
Qwopus3.6 27B v2 is a 27.8B-parameter open language model from Jackrong. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Hy MT2 30B A3B
tencent · 30.1B · runs from 13.2 GB
Hy MT2 30B A3B is a 30.1B-parameter open language model from tencent. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Cogito V1 Preview Qwen 32B
deepcogito · 32B · runs from 10.4 GB
Cogito V1 Preview Qwen 32B is a 32B-parameter open language model from deepcogito in the Qwen family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
OmniCoder 9B
Tesslate · 9.4B · runs from 3.5 GB
OmniCoder 9B is a 9.4B-parameter open language model from Tesslate. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Hermes 4.3 36B
Nous Research · 36.2B · runs from 10.5 GB
Hermes 4.3 36B is a 36.2B-parameter open language model from Nous Research in the Hermes family. It supports a context window of up to 524,288 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
North Mini Code 1.0
Cohere · 30.5B · runs from 8.8 GB
North Mini Code 1.0 is a 30.5B-parameter open language model from Cohere. It supports a context window of up to 500,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Zephyr 7B Beta
Hugging Face · 7.2B · runs from 3.6 GB
Zephyr 7B Beta is a 7.2B-parameter open language model from Hugging Face. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Phi 4 Reasoning
Microsoft · 14.7B · runs from 4.8 GB
Phi 4 Reasoning is a 14.7B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Huihui MiniCPM5 1B Abliterated
huihui-ai · 1.1B · runs from 0.6 GB
Huihui MiniCPM5 1B Abliterated is a 1.1B-parameter open language model from huihui-ai in the MiniCPM family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
ERNIE 4.5 21B A3B PT
Baidu · 21B · runs from 6.2 GB
ERNIE 4.5 21B A3B PT is a 21B-parameter open language model from Baidu in the ERNIE family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
INTELLECT 1 Instruct
PrimeIntellect · 10.2B · runs from 3.7 GB
INTELLECT 1 Instruct is a 10.2B-parameter open language model from PrimeIntellect. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Mellum2 12B A2.5B Instruct
JetBrains · 12.1B · runs from 5.5 GB
Mellum2 12B A2.5B Instruct is a 12.1B-parameter open language model from JetBrains in the Mellum family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Apertus 8B Instruct 2509
swiss-ai · 8B · runs from 2.8 GB
Apertus 8B Instruct is an open-source instruction-tuned model from Swiss AI, a collaborative research initiative. Built on an 8-billion-parameter base, it emphasizes transparency, open data, and European AI sovereignty. For local users, it delivers solid general-purpose chat and instruction-following in a standard 8B footprint that runs well on consumer GPUs with 8 to 10 GB of VRAM, making it a practical choice for those who value open, community-driven model development.
Starcoder2 7B
BigCode · 7.2B · runs from 3.5 GB
Starcoder2 7B is a 7.2B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3.5 9B Claude 4.6 Opus Reasoning Distilled
Jackrong · 9.7B · runs from 4.7 GB
Qwen3.5 9B Claude 4.6 Opus Reasoning Distilled is a 9.7B-parameter open language model from Jackrong in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Starcoder2 3B
BigCode · 3.0B · runs from 1.6 GB
Starcoder2 3B is a 3.0B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Carnice V1 9B Hermes Agent Stage2 Merged
kai-os · 9.0B · runs from 4.4 GB
Carnice V1 9B Hermes Agent Stage2 Merged is a 9.0B-parameter open language model from kai-os in the Hermes family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
BioMistral 7B
BioMistral · 7B · runs from 3.5 GB
BioMistral 7B is a 7B-parameter open language model from BioMistral in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
LFM2.5 1.2B JP 202606
LiquidAI · 1.2B · runs from 0.9 GB
LFM2.5 1.2B JP 202606 is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Nemotron Mini 4B Instruct
NVIDIA · 4B · runs from 1.8 GB
Nemotron Mini 4B Instruct is a 4B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Phi 3.5 MoE Instruct
Microsoft · 41.9B · runs from 12.1 GB
Phi 3.5 MoE Instruct is a 41.9B-parameter open language model from Microsoft in the Phi 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Hermes 3 Llama 3.1 70B
Nous Research · 70.6B · runs from 20.4 GB
Hermes 3 Llama 3.1 70B is a 70.6B-parameter open language model from Nous Research in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Granite 3.3 8B Instruct
IBM · 8.2B · runs from 2.9 GB
Granite 3.3 8B Instruct is an 8.2B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Llama Guard 3 8B
Meta · 8.0B · runs from 2.4 GB
Meta Llama Guard 3 8B is an 8-billion-parameter safety classifier model built on the Llama 3.1 architecture. Unlike general-purpose chat models, Llama Guard is specifically designed to classify whether prompts or responses contain unsafe content across categories such as violence, sexual content, criminal planning, and other policy violations. The model is intended to be used as a moderation layer in LLM-based applications, providing input and output safety filtering. It follows a taxonomy-based classification approach and can be customized for different safety policies. It is released under the Llama 3.1 Community License.
WhiteRabbitNeo 13B V1
WhiteRabbitNeo · 13B · runs from 7.5 GB
WhiteRabbitNeo 13B V1 is a 13B-parameter open language model from WhiteRabbitNeo. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Laguna XS.2
poolside · 33.4B · runs from 14.6 GB
Laguna XS.2 is a 33.4B-parameter open language model from poolside. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3.5 35B A3B Claude 4.6 Opus Reasoning Distilled
Jackrong · 36.0B · runs from 72.3 GB
Qwen3.5 35B A3B Claude 4.6 Opus Reasoning Distilled is a 36.0B-parameter open language model from Jackrong in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek V4 Flash 162B
0xSero · 92.2B · runs from 39.5 GB
DeepSeek V4 Flash 162B is a 92.2B-parameter open language model from 0xSero in the DeepSeek V4 family. It supports a context window of up to 1,048,576 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek TNG R1T2 Chimera
tngtech · 684.5B · runs from 192.1 GB
DeepSeek TNG R1T2 Chimera is a 684.5B-parameter open language model from tngtech in the DeepSeek family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.