All LLM Models

Browse 719 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
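The arithmetic behind those figures can be sketched roughly as parameters times bits per weight, divided by 8. This is a minimal weights-only estimate, not how this page computes its numbers. The function name `estimate_weight_gb` and the bits-per-weight table are illustrative assumptions, and the Q4_K_M value in particular is an approximate average because that format mixes precisions across tensors. KV cache and runtime overhead are not included, so real usage will be higher.

```python
# Approximate bits stored per weight for a few formats.
# FP16 is exact; Q8_0 stores 32 int8 weights plus one FP16 scale per block (8.5 bits);
# the Q4_K_M value is an assumed average, since it varies slightly by model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,  # assumption: rough average, not an exact figure
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate memory for model weights only, in GB (1 GB = 10^9 bytes).

    Excludes KV cache, activations, and framework overhead.
    """
    bits = BITS_PER_WEIGHT[quant]
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in ("FP16", "Q8_0", "Q4_K_M"):
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

For a 7B model this gives about 14 GB at FP16 and a little over 4 GB at Q4_K_M, in line with the example above.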

Model List

Qehwa Pashto Llm

junaid008 · 7.6B · runs from 3.6 GB

158 2

Qehwa Pashto Llm is a 7.6B-parameter open language model from junaid008. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Flash 3B

NVIDIA · 2.7B · runs from 6.0 GB

157 17

Nemotron Flash 3B is a 2.7B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 29,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Apex 1 Instruct 350M

LH-Tech-AI · 350M · runs from 0.8 GB

156 10

Apex 1 Instruct 350M is a 350M-parameter open language model from LH-Tech-AI. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qliphoth 12B V1.2

OccultAI · 12.2B · runs from 5.9 GB

156 5

Qliphoth 12B V1.2 is a 12.2B-parameter open language model from OccultAI. It supports a context window of up to 1,024,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Ouro Hybrid 1.4B

chili-lab · 1.5B · runs from 3.7 GB

152 8

Ouro Hybrid 1.4B is a 1.5B-parameter open language model from chili-lab. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

KernelLLM

Meta · 8.0B · runs from 4.0 GB

137 202

KernelLLM is an 8.0B-parameter open language model from Meta. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenCodeReasoning Nemotron 1.1 32B

NVIDIA · 32.8B · runs from 14.8 GB

136 48

OpenCodeReasoning Nemotron 1.1 32B is a 32.8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code · Reasoning

NextCoder 7B

Microsoft · 7.6B · runs from 3.6 GB

135 33

NextCoder 7B is a 7.6B-parameter open language model from Microsoft. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Riva Translate 4B Instruct

NVIDIA · 4.2B · runs from 2.3 GB

131 18

Riva Translate 4B Instruct is a 4.2B-parameter open language model from NVIDIA. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Meme Trix MoE 14B A8B V1

Naphula · 13.7B · runs from 6.4 GB

130 8

Meme Trix MoE 14B A8B V1 is a 13.7B-parameter open language model from Naphula. It supports a context window of up to 1,073,152 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Supra Mini V6 1M

SupraLabs · 1M · runs from 0.3 GB

129 4

Supra Mini V6 1M is a 1M-parameter open language model from SupraLabs. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

ERNIE 4.5 0.3B Paddle

Baidu · 361M · runs from 1.0 GB

127 28

ERNIE 4.5 0.3B Paddle is a 361M-parameter open language model from Baidu in the ERNIE family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

XiYanSQL QwenCoder 32B 2504

XGenerationLab · 32B · runs from 14.4 GB

127 19

XiYanSQL QwenCoder 32B 2504 is a 32B-parameter open language model from XGenerationLab in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

OpenPangu 7B Diffusion DeepDiver

DLLM-Agent · 8.0B · runs from 16.6 GB

125 5

OpenPangu 7B Diffusion DeepDiver is an 8.0B-parameter open language model from DLLM-Agent. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

GLM 4.7 Flash Ultimate Irrefusable Heretic

llmfan46 · 29.9B · runs from 13.8 GB

125 2

GLM 4.7 Flash Ultimate Irrefusable Heretic is a 29.9B-parameter open language model from llmfan46 in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3.5 9B Humanize DPO Round2

XiangJinYu · 9B · runs from 19.8 GB

121 4

Qwen3.5 9B Humanize DPO Round2 is a 9B-parameter open language model from XiangJinYu in the Qwen 3.5 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Valuoty Industry Plc 4B

ICSFR-HF-ORG-01 · 4.4B · runs from 2.4 GB

119 5

Valuoty Industry Plc 4B is a 4.4B-parameter open language model from ICSFR-HF-ORG-01. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Styx 12B

DarkArtsForge · 12.2B · runs from 5.9 GB

115 4

Styx 12B is a 12.2B-parameter open language model from DarkArtsForge. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

MedScholar 1.5B

yasserrmd · 1.5B · runs from 1.0 GB

109 24

MedScholar 1.5B is a 1.5B-parameter open language model from yasserrmd. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

GPT S 1.4M

AxiomicLabs · 1M · runs from 0.3 GB

108 6

GPT S 1.4M is a 1M-parameter open language model from AxiomicLabs. It supports a context window of up to 384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Ethereal Stardust 12B

Vortex5 · 12.2B · runs from 5.9 GB

108 5

Ethereal Stardust 12B is a 12.2B-parameter open language model from Vortex5. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Roleplay

Human Like LLama3 8B Instruct

HumanLLMs · 8.0B · runs from 4.0 GB

105 24

Human Like LLama3 8B Instruct is an 8.0B-parameter open language model from HumanLLMs in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 1.7B

AXERA-TECH · 1.7B · runs from 0.8 GB

105 2

Qwen3 1.7B is a 1.7B-parameter open language model from AXERA-TECH in the Qwen 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3.5 2B Text Only

principled-intelligence · 1.9B · runs from 4.2 GB

104 3

Qwen3.5 2B Text Only is a 1.9B-parameter open language model from principled-intelligence in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 4B Hindi Instruct v2

pankajpandey-dev · 4.0B · runs from 2.2 GB

102 4

Qwen3 4B Hindi Instruct v2 is a 4.0B-parameter open language model from pankajpandey-dev in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2.5 1.2B Instruct Uncensored

zaakirio · 1.2B · runs from 0.9 GB

100 2

LFM2.5 1.2B Instruct Uncensored is a 1.2B-parameter open language model from zaakirio. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 4B Domino B16

Huang2020 · 588M · runs from 0.6 GB

89 3

Qwen3 4B Domino B16 is a 588M-parameter open language model from Huang2020 in the Qwen 3 family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

MobileLLM R1.5 950M

Meta · 950M · runs from 2.1 GB

56 19

MobileLLM R1.5 950M is a 950M-parameter open language model from Meta. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Deepseek Coder 1.3B Kexer

JetBrains · 1.3B · runs from 1.3 GB

18 9

Deepseek Coder 1.3B Kexer is a 1.3B-parameter open language model from JetBrains in the DeepSeek Coder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code