All LLM Models

Browse 719 LLM models with VRAM requirements, quantization options, and hardware compatibility.


Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
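The arithmetic behind those figures can be sketched roughly as parameters times bits per weight. The function name `estimate_weight_gb` and the bits-per-weight table below are illustrative assumptions, not this site's calculator. The Q4_K_M value in particular is an approximate average, because that format mixes precisions across tensors.

```python
# Approximate bits stored per weight for each format.
# FP16 is exact; Q8_0 includes per-block scales; Q4_K_M is an assumed
# rough average, since K-quants mix precisions across tensors.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the weights alone, in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # (params_billions * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B @ {quant}: ~{estimate_weight_gb(7.0, quant):.1f} GB")
```

This covers the weights only. Actual VRAM use is higher once the KV cache (which grows with context length) and runtime overhead are included, so treat the result as a lower bound.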

Llama 3.1 Nemotron 70B Instruct HF

NVIDIA · 70.6B · runs from 20.4 GB

Llama 3.1 Nemotron 70B Instruct is a 70-billion parameter chat model by NVIDIA, created by applying reinforcement learning from human feedback (RLHF) to Meta's Llama 3.1 70B Instruct model. NVIDIA's Nemotron training pipeline focuses on improving helpfulness, accuracy, and response quality beyond the standard Llama instruction tuning. At full precision the model requires substantial VRAM for local inference, typically needing multi-GPU setups or high-end professional GPUs. In quantized formats it becomes accessible on workstation-class hardware. It is available in Hugging Face Transformers format and is supported by popular inference engines.

Moonlight 16B A3B Instruct

Moonshot AI · 16.0B · runs from 5.1 GB

Moonlight 16B A3B Instruct is a 16.0B-parameter open language model from Moonshot AI in the Moonlight family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Granite 4.0 Micro

IBM · 3.4B · runs from 1.4 GB

Granite 4.0 Micro is a 3.4B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 31B IT Qat Q4 0 Unquantized Assistant

Google · 31B · runs from 13.5 GB

Gemma 4 31B IT Qat Q4 0 Unquantized Assistant is a 31B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 Coder 1.5B

Alibaba · 1.5B · runs from 1 GB

Qwen2.5 Coder 1.5B is a 1.5-billion parameter code-specialized model from Alibaba Cloud's Qwen 2.5 Coder series. It is one of the smallest Coder variants, balancing meaningful code generation capability with extremely low resource requirements, and runs on GPUs with as little as 2-4 GB of VRAM. The model is suitable for lightweight code completion, simple code generation tasks, and as a compact local coding assistant in resource-constrained environments. It supports a 32K token context window. Released under the Apache 2.0 license.

Pantheon Reasoning 27B

Gryphe · 27.8B · runs from 8.4 GB

Pantheon Reasoning 27B is a 27.8B-parameter open language model from Gryphe. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Roleplay · Reasoning

Nanbeige4.1 3B

Nanbeige · 3.9B · runs from 2.1 GB

Nanbeige4.1 3B is a compact chat model from Nanbeige, a Chinese AI startup focused on building efficient small-scale language models. At just under 4 billion parameters, it is designed to run on virtually any modern GPU or even on CPU, making it one of the more accessible options for users with limited hardware. Despite its small size, it handles basic conversation, simple reasoning, and Chinese-English bilingual tasks, serving as a practical entry point for local LLM experimentation.

Starcoder2 15B

BigCode · 16.0B · runs from 7.3 GB

Starcoder2 15B is a 16.0B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Huihui Qwen3.6 35B A3B Claude 4.7 Opus Abliterated

huihui-ai · 36.0B · runs from 15.7 GB

Huihui Qwen3.6 35B A3B Claude 4.7 Opus Abliterated is a 36.0B-parameter open language model from huihui-ai in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 3 Mini 4k Instruct

Microsoft · 3.8B · runs from 2.7 GB

Microsoft Phi 3 Mini 4K Instruct is a 3.8-billion parameter instruction-tuned model from Microsoft Research's Phi 3 generation, with a 4K token context window. The Phi 3 family demonstrated that small models trained on carefully curated, high-quality data can achieve performance competitive with models several times their size. The model runs on consumer GPUs with as little as 4-6GB of VRAM when quantized, making it one of the most accessible capable chat models for local deployment. Released under the MIT license.

Qwopus3.5 9B v3

Jackrong · 9.7B · runs from 19.9 GB

Qwopus3.5 9B v3 is a 9.7B-parameter open language model from Jackrong. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision · Reasoning

Qwen2.5 3B

Alibaba · 3.1B · runs from 1.6 GB

Qwen2.5 3B is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Magistral Small 2506

Mistral AI · 23.6B · runs from 7.2 GB

Magistral Small 2506 is a 23.6B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

HRM Text 1B

sapientinc · 1.2B · runs from 1 GB

HRM Text 1B is a 1.2B-parameter open language model from sapientinc. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3 70B Instruct

Meta · 70.6B · runs from 23.3 GB

Meta Llama 3 70B Instruct is a 70.6-billion parameter instruction-tuned model from Meta's Llama 3 release. It is fine-tuned for dialogue, coding assistance, and complex reasoning tasks using supervised fine-tuning and RLHF. At the time of release, it was among the most capable openly available models. The model supports an 8K token context window and requires substantial VRAM for local inference, typically needing multi-GPU setups or high-VRAM professional GPUs. It has been widely adopted for local deployment in quantized formats. Released under the Meta Llama 3 Community License.

Falcon H1 7B Instruct

TII UAE · 7.6B · runs from 2.6 GB

Falcon H1 7B Instruct is a 7.6B-parameter open language model from TII UAE in the Falcon family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 31B IT Speculator.eagle3

RedHatAI · 31B · runs from 14.5 GB

Gemma 4 31B IT Speculator.eagle3 is a 31B-parameter open language model from RedHatAI in the Gemma 4 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.2 1B

Meta · 1.2B · runs from 0.6 GB

Meta Llama 3.2 1B is a 1.2-billion parameter base (pretrained) model from Meta's Llama 3.2 release. It is the smallest model in the Llama 3.2 family and is designed for research, fine-tuning, and embedding into resource-constrained environments. It supports a 128K token context window. As a base model, it is not optimized for conversational use without further fine-tuning. Its minimal resource requirements make it suitable for experimentation, edge deployment, and as a starting point for domain-specific fine-tuning. Released under the Llama 3.2 Community License.

Nanbeige4.1 3B Heretic

heretic-org · 3.9B · runs from 2.1 GB

Nanbeige4.1 3B Heretic is a 3.9B-parameter open language model from heretic-org. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Medgemma 27B Text IT

Google · 27.0B · runs from 8.2 GB

Google MedGemma 27B Text IT is a 27-billion parameter instruction-tuned model specialized for the medical domain, built on the Gemma architecture by Google. It is fine-tuned on medical and clinical text data to provide improved performance on healthcare-related tasks such as medical question answering, clinical reasoning, and health information summarization. The model requires a GPU with at least 24GB of VRAM for quantized inference. Its domain specialization makes it notably more capable than general models on clinical benchmarks, though it should not be used as a substitute for professional medical advice. Released under the Gemma license.

Qwopus3.6 27B v2

Jackrong · 27.8B · runs from 12.6 GB

Qwopus3.6 27B v2 is a 27.8B-parameter open language model from Jackrong. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision · Reasoning · Functions

Hy MT2 30B A3B

tencent · 30.1B · runs from 13.2 GB

Hy MT2 30B A3B is a 30.1B-parameter open language model from tencent. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Cogito V1 Preview Qwen 32B

deepcogito · 32B · runs from 10.4 GB

Cogito V1 Preview Qwen 32B is a 32B-parameter open language model from deepcogito in the Qwen family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OmniCoder 9B

Tesslate · 9.4B · runs from 3.5 GB

OmniCoder 9B is a 9.4B-parameter open language model from Tesslate. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code · Functions

Hermes 4.3 36B

Nous Research · 36.2B · runs from 10.5 GB

Hermes 4.3 36B is a 36.2B-parameter open language model from Nous Research in the Hermes family. It supports a context window of up to 524,288 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning · Roleplay

North Mini Code 1.0

Cohere · 30.5B · runs from 8.8 GB

North Mini Code 1.0 is a 30.5B-parameter open language model from Cohere. It supports a context window of up to 500,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code · Functions

Zephyr 7B Beta

Hugging Face · 7.2B · runs from 3.6 GB

Zephyr 7B Beta is a 7.2B-parameter open language model from Hugging Face. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 4 Reasoning

Microsoft · 14.7B · runs from 4.8 GB

Phi 4 Reasoning is a 14.7B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math · Code · Reasoning

Huihui MiniCPM5 1B Abliterated

huihui-ai · 1.1B · runs from 0.6 GB

Huihui MiniCPM5 1B Abliterated is a 1.1B-parameter open language model from huihui-ai in the MiniCPM family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

ERNIE 4.5 21B A3B PT

Baidu · 21B · runs from 6.2 GB

ERNIE 4.5 21B A3B PT is a 21B-parameter open language model from Baidu in the ERNIE family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.