All LLM Models
Browse 739 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends mainly on the model's parameter count and its quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB at FP16 (2 bytes per parameter) but only ~4 GB at Q4_K_M quantization (roughly 4 to 5 bits per parameter).
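The arithmetic behind the 7B example can be sketched in a few lines of Python. This is a rough, weights-only estimate and not the method this site uses for its listed figures: the function name `estimate_weight_gb` and the table `BITS_PER_WEIGHT` are hypothetical, and the 4.8 bits-per-weight value for Q4_K_M is an assumed average, since the real figure varies by model.

```python
# Approximate average bits stored per weight for each format.
# FP16 is exact (16 bits). The Q4_K_M value is an ASSUMED average;
# actual files vary because different tensors use different block types.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.8,  # assumption for illustration
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the KV cache, activations, and runtime overhead, so real
    VRAM usage will be higher than this number.
    """
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Because the estimate covers weights only, treat it as a lower bound: context length (through the KV cache) and the inference runtime add memory on top.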
Model List
ERNIE 4.5 0.3B Paddle
Baidu · 361M · runs from 1.0 GB
ERNIE 4.5 0.3B Paddle is a 361M-parameter open language model from Baidu in the ERNIE family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
XiYanSQL QwenCoder 32B 2504
XGenerationLab · 32B · runs from 14.4 GB
XiYanSQL QwenCoder 32B 2504 is a 32B-parameter open language model from XGenerationLab in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
OpenPangu 7B Diffusion DeepDiver
DLLM-Agent · 8.0B · runs from 16.6 GB
OpenPangu 7B Diffusion DeepDiver is an 8.0B-parameter open language model from DLLM-Agent. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
GLM 4.7 Flash Ultimate Irrefusable Heretic
llmfan46 · 29.9B · runs from 13.8 GB
GLM 4.7 Flash Ultimate Irrefusable Heretic is a 29.9B-parameter open language model from llmfan46 in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3.5 9B Humanize DPO Round2
XiangJinYu · 9B · runs from 19.8 GB
Qwen3.5 9B Humanize DPO Round2 is a 9B-parameter open language model from XiangJinYu in the Qwen 3.5 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Valuoty Industry Plc 4B
ICSFR-HF-ORG-01 · 4.4B · runs from 2.4 GB
Valuoty Industry Plc 4B is a 4.4B-parameter open language model from ICSFR-HF-ORG-01. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Styx 12B
DarkArtsForge · 12.2B · runs from 5.9 GB
Styx 12B is a 12.2B-parameter open language model from DarkArtsForge. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Llama 3.3 Nemotron 70B Reward
NVIDIA · 70.6B · runs from 31.0 GB
Llama 3.3 Nemotron 70B Reward is a 70.6B-parameter open language model from NVIDIA in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
MedScholar 1.5B
yasserrmd · 1.5B · runs from 1.0 GB
MedScholar 1.5B is a 1.5B-parameter open language model from yasserrmd. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
GPT S 1.4M
AxiomicLabs · 1M · runs from 0.3 GB
GPT S 1.4M is a 1M-parameter open language model from AxiomicLabs. It supports a context window of up to 384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Ethereal Stardust 12B
Vortex5 · 12.2B · runs from 5.9 GB
Ethereal Stardust 12B is a 12.2B-parameter open language model from Vortex5. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Human Like LLama3 8B Instruct
HumanLLMs · 8.0B · runs from 4.0 GB
Human Like LLama3 8B Instruct is an 8.0B-parameter open language model from HumanLLMs in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3 1.7B
AXERA-TECH · 1.7B · runs from 0.8 GB
Qwen3 1.7B is a 1.7B-parameter open language model from AXERA-TECH in the Qwen 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3.5 2B Text Only
principled-intelligence · 1.9B · runs from 4.2 GB
Qwen3.5 2B Text Only is a 1.9B-parameter open language model from principled-intelligence in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3 4B Hindi Instruct v2
pankajpandey-dev · 4.0B · runs from 2.2 GB
Qwen3 4B Hindi Instruct v2 is a 4.0B-parameter open language model from pankajpandey-dev in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
LFM2.5 1.2B Instruct Uncensored
zaakirio · 1.2B · runs from 0.9 GB
LFM2.5 1.2B Instruct Uncensored is a 1.2B-parameter open language model from zaakirio. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3 4B Domino B16
Huang2020 · 588M · runs from 0.6 GB
Qwen3 4B Domino B16 is a 588M-parameter open language model from Huang2020 in the Qwen 3 family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
MobileLLM R1.5 950M
Meta · 950M · runs from 2.1 GB
MobileLLM R1.5 950M is a 950M-parameter open language model from Meta. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Deepseek Coder 1.3B Kexer
JetBrains · 1.3B · runs from 1.3 GB
Deepseek Coder 1.3B Kexer is a 1.3B-parameter open language model from JetBrains in the DeepSeek Coder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.