All LLM Models
Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.
Understanding LLM VRAM Requirements
How much VRAM you need depends mainly on the model's parameter count and its quantization level. Quantization reduces the precision of the model weights, trading a small loss in quality for significantly lower VRAM usage. For example, the weights of a 7B-parameter model take about 14 GB at FP16 (2 bytes per parameter) but only about 4 GB at Q4_K_M quantization. The KV cache for the context window adds to these figures.
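The arithmetic behind the 7B example can be sketched in a few lines of Python. This is a rough, weights-only estimate: the function name and the table are hypothetical, and the bits-per-weight values for the quantized formats are approximate assumptions rather than figures taken from this page.

```python
# Approximate average bits stored per weight for a few formats.
# FP16 is exact; the quantized values are rough assumptions for illustration.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

For a 7B model this lands near 14 GB at FP16 and a little over 4 GB at Q4_K_M. Real usage is higher, because the KV cache and runtime buffers are not counted here.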
Model List
NVIDIA Nemotron Nano 9B v2 Japanese
NVIDIA · 8.9B · runs from 4.4 GB
NVIDIA Nemotron Nano 9B v2 Japanese is a specialized variant of the Nemotron Nano 9B v2, fine-tuned for Japanese language understanding and generation. At 8.9 billion parameters, it maintains the same hardware-friendly footprint as the English version while delivering natural Japanese conversational ability. For users looking to run a Japanese-language assistant locally, this model offers a rare combination of compact size and dedicated language optimization from a major hardware vendor. It handles Japanese text with the fluency you'd expect from a purpose-built model rather than a multilingual afterthought.
Cydonia 24B V4.3
TheDrummer · 23.6B · runs from 7.8 GB
Cydonia 24B V4.3 is a 23.6B-parameter open language model from TheDrummer. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 E2B IT Qat Mobile Transformers
Google · 2.3B · runs from 1.4 GB
Gemma 4 E2B IT Qat Mobile Transformers is a 2.3B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3.6 27B DFlash
z-lab · 27B · runs from 12.6 GB
Qwen3.6 27B DFlash is a 27B-parameter open language model from z-lab in the Qwen 3.6 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Llama 3.3 70B Instruct Abliterated
huihui-ai · 70.6B · runs from 20.4 GB
Llama 3.3 70B Instruct Abliterated is a 70.6B-parameter open language model from huihui-ai in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Mixtral 8x7B Instruct v0.1
Mistral AI · 46.7B · runs from 20.4 GB
Mixtral 8x7B Instruct v0.1 is Mistral AI's flagship Mixture-of-Experts model. Each layer holds eight expert feed-forward networks, and because the attention weights are shared across experts, the total comes to 46.7B parameters rather than 8 × 7B = 56B. Only about 12.9 billion parameters are active per token. This sparse architecture delivers performance that rivals much larger dense models at a fraction of the inference cost, excelling across reasoning, code generation, and multilingual tasks. Because the full weights must still be loaded into memory, you will need from about 20 GB at the most aggressive quantization to roughly 48 GB at higher-precision quantization, making it best suited for multi-GPU desktop setups or high-VRAM workstation cards. If your hardware can accommodate it, Mixtral offers one of the best performance-per-active-parameter ratios available for local deployment.
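The gap between the 46.7B total and the roughly 12.9B active parameters can be reproduced with a short count. The shape values below are assumptions taken from the publicly released Mixtral 8x7B configuration, not from this page, and the function name is hypothetical. Treat it as a sketch of the accounting, not an official breakdown.

```python
# Assumed shape values (from the published Mixtral 8x7B config, not this page).
HIDDEN = 4096            # model width
FFN = 14336              # expert feed-forward width
LAYERS = 32
HEADS = 32
KV_HEADS = 8             # grouped-query attention
HEAD_DIM = HIDDEN // HEADS
EXPERTS = 8
EXPERTS_PER_TOKEN = 2    # the router picks two experts per token
VOCAB = 32000


def count_params(experts_used: int) -> int:
    """Count parameters when `experts_used` experts per layer are included."""
    expert = 3 * HIDDEN * FFN  # gate, up, and down projections
    # q and o projections are full width; k and v are narrower under GQA.
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    router = HIDDEN * EXPERTS
    norms = 2 * HIDDEN
    per_layer = experts_used * expert + attention + router + norms
    embeddings = 2 * VOCAB * HIDDEN  # input embeddings and output head
    return LAYERS * per_layer + embeddings + HIDDEN  # plus the final norm


total = count_params(EXPERTS)
active = count_params(EXPERTS_PER_TOKEN)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Memory is sized by the total, since every expert must be resident, while per-token compute follows the active count.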
Gemma 4 12B OBLITERATED
OBLITERATUS · 12.0B · runs from 4.3 GB
Gemma 4 12B OBLITERATED is a 12.0B-parameter open language model from OBLITERATUS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 26B A4B IT Uncensored Heretic
llmfan46 · 25.8B · runs from 11.6 GB
Gemma 4 26B A4B IT Uncensored Heretic is a 25.8B-parameter open language model from llmfan46 in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 26B A4B IT Assistant
Google · 26B · runs from 11.4 GB
Gemma 4 26B A4B IT Assistant is a 26B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
MiniMax M2.1
MiniMaxAI · 228.7B · runs from 63.5 GB
MiniMax M2.1 is an earlier generation of MiniMax's large mixture-of-experts model series, featuring the same 228 billion total parameter architecture as its successor. It offers strong bilingual performance across Chinese and English tasks, including conversation, reasoning, and content generation. While M2.5 refines the formula, M2.1 remains a capable option for users with the multi-GPU hardware needed to host a model of this scale locally.
LFM2 8B A1B
LiquidAI · 8.3B · runs from 2.7 GB
LFM2 8B A1B is Liquid AI's larger mixture-of-experts model, combining the company's novel hybrid architecture with approximately 8 billion total parameters. It uses a MoE design to keep active compute per token low while maintaining strong general performance across chat and reasoning tasks. For local users, it offers an intriguing alternative to conventional 8B transformers, with Liquid AI's architecture promising improved efficiency and throughput on consumer-grade hardware.
Deepseek Coder 33B Instruct
DeepSeek · 33.3B · runs from 14.6 GB
Deepseek Coder 33B Instruct is a 33.3B-parameter open language model from DeepSeek in the DeepSeek Coder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
NVIDIA Nemotron 3 Ultra 550B A55B BF16
NVIDIA · 560.5B · runs from 169.6 GB
NVIDIA Nemotron 3 Ultra 550B A55B BF16 is a 560.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
SmolLM3 3B
Hugging Face · 3.1B · runs from 1.3 GB
SmolLM3 3B is Hugging Face's latest-generation compact language model, representing a significant step up from the SmolLM2 series. At 3 billion parameters, it delivers considerably stronger reasoning, instruction following, and general language understanding while maintaining modest hardware requirements that keep it accessible on most consumer GPUs. This model benefits from improved training data, architectural refinements, and lessons learned from previous SmolLM generations. It is well positioned for local chatbot applications, coding assistance, and content generation tasks where you want strong performance without dedicating the resources required by 7B-class models.
GLM 4.5 Air
zai-org · 110.5B · runs from 30.8 GB
GLM 4.5 Air is a 110.5B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
DeepSeek V3.1
DeepSeek · 684.5B · runs from 192.1 GB
DeepSeek V3.1 is a 684.5B-parameter open language model from DeepSeek in the DeepSeek V3 family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 3n E2B IT
Google · 5.4B · runs from 1.6 GB
Gemma 3n E2B IT is a 5.4B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 4 E4B IT Assistant
Google · 4B · runs from 2 GB
Gemma 4 E4B IT Assistant is a 4B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Llama 3.2 11B Vision Instruct
Meta · 10.7B · runs from 5.0 GB
Llama 3.2 11B Vision Instruct is a 10.7B-parameter open vision-language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Jan Code 4B
janhq · 4.4B · runs from 2.4 GB
Jan Code 4B is a 4.4B-parameter open language model from janhq. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen2.5 Coder 32B
Alibaba · 32.8B · runs from 9.8 GB
Qwen2.5 Coder 32B is a 32.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
MiniMax M2.7 BF16 Ultra Uncensored Heretic
llmfan46 · 228.7B · runs from 97.8 GB
MiniMax M2.7 BF16 Ultra Uncensored Heretic is a 228.7B-parameter open language model from llmfan46 in the MiniMax family. It supports a context window of up to 204,800 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Gemma 3n E4B IT
Google · 7.8B · runs from 2.4 GB
Gemma 3n E4B IT is a 7.8B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen3 235B A22B
Alibaba · 235.1B · runs from 100.4 GB
Qwen3 235B A22B is the largest model in Alibaba Cloud's Qwen 3 series, a Mixture of Experts (MoE) model with 235 billion total parameters and approximately 22 billion active parameters per forward pass. The MoE architecture enables it to deliver performance competitive with the best available open-weight models while requiring significantly less compute per token than a comparably sized dense model. It supports hybrid thinking mode for flexible chain-of-thought reasoning. Due to its massive total parameter count, running Qwen3 235B A22B locally requires substantial VRAM to load all expert weights, typically needing multiple high-end professional GPUs even at reduced precision. In heavily quantized formats it becomes accessible on workstation-class multi-GPU setups. Released under the Apache 2.0 license.
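To get a feel for what "multiple GPUs" means here, the listed 100.4 GB starting footprint can be divided across cards of a given size. This is a minimal sketch: the function name and the 90% usable-memory headroom are assumptions, and it ignores KV cache growth and uneven layer splits.

```python
import math


def gpus_needed(model_gb: float, gpu_vram_gb: float, usable_fraction: float = 0.9) -> int:
    """Return how many identical GPUs are needed to hold `model_gb` of weights.

    `usable_fraction` is an assumed share of each card's VRAM that is left
    for weights after runtime overhead.
    """
    usable_gb = gpu_vram_gb * usable_fraction
    return math.ceil(model_gb / usable_gb)


# Smallest listed footprint for Qwen3 235B A22B on this page: 100.4 GB.
for vram in (24, 48, 80):
    print(f"{vram} GB cards: {gpus_needed(100.4, vram)}")
```

Under these assumptions the smallest quantization needs five 24 GB cards, three 48 GB cards, or two 80 GB cards.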
Hermes 4 14B
Nous Research · 14.8B · runs from 5.1 GB
Hermes 4 14B is a 14.8B-parameter open language model from Nous Research in the Hermes family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
GLM 4.6
zai-org · 356.8B · runs from 98.7 GB
GLM 4.6 is a 356.8B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
LFM2.5 350M
LiquidAI · 354M · runs from 0.5 GB
LFM2.5 350M is a 354M-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
LFM2.5 1.2B Thinking
LiquidAI · 1.2B · runs from 0.9 GB
LFM2.5 1.2B Thinking is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.
Qwen2.5 Coder 7B
Alibaba · 7.6B · runs from 3.6 GB
Qwen2.5 Coder 7B is a 7.6-billion parameter code-specialized base (pretrained) model from Alibaba Cloud's Qwen 2.5 Coder series. It is trained on a large dataset of source code and natural language but is not instruction-tuned, making it suitable for fine-tuning, code-related research, and custom downstream applications. The model supports a 128K token context window and runs efficiently on consumer GPUs. It serves as the foundation for the Qwen2.5 Coder 7B Instruct variant and community fine-tunes targeting specific programming languages or workflows. Released under the Apache 2.0 license.
Hermes 3 Llama 3.1 8B
Nous Research · 8.0B · runs from 3.3 GB
Hermes 3 Llama 3.1 8B is an 8-billion-parameter instruction-tuned model by Nous Research, built on Meta's Llama 3.1 8B base. It is fine-tuned for advanced instruction following, multi-turn conversation, structured output, and creative roleplay scenarios. The Hermes series is known for producing highly steerable models that respond well to system prompts. This model supports a 128K token context window inherited from the Llama 3.1 architecture and runs efficiently on consumer GPUs with 8 GB or more of VRAM. It is a popular choice among local inference enthusiasts who value strong instruction adherence and versatile conversational ability.