All LLM Models

Browse 671 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
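
The 7B example above can be reproduced with a rough back-of-the-envelope calculation. The sketch below is only an approximation of weight memory: the function name is hypothetical, the figure of about 4.5 bits per weight for Q4_K_M is an assumption (the real average varies by model and quantizer version), and it ignores the KV cache, activations, and runtime overhead, all of which add to real VRAM usage.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes.

    billions of parameters * bits per weight / 8 bits per byte = gigabytes.
    """
    return params_billions * bits_per_weight / 8


# Assumed bits per weight, for illustration only:
# 16 for FP16, roughly 4.5 as a stand-in for Q4_K_M.
fp16_gb = weight_memory_gb(7, 16)
q4_gb = weight_memory_gb(7, 4.5)

print(f"7B at FP16:   ~{fp16_gb:.1f} GB")
print(f"7B at Q4_K_M: ~{q4_gb:.1f} GB")
```

Under these assumptions the estimate lands close to the ~14 GB and ~4 GB figures quoted above. Treat it as a lower bound when deciding whether a model fits on a given GPU.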

MiniCPM5 1B

openbmb · 1.1B · runs from 0.6 GB

MiniCPM5 1B is a 1.1B-parameter open language model from openbmb in the MiniCPM family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 E4B IT OBLITERATED

OBLITERATUS · 8.0B · runs from 2.7 GB

Gemma 4 E4B IT OBLITERATED is an 8.0B-parameter open language model from OBLITERATUS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 30B A3B Thinking 2507

Alibaba · 30.5B · runs from 8.8 GB

Qwen3 30B A3B Thinking 2507 is the reasoning-focused variant of Alibaba's 30-billion-parameter mixture-of-experts model, updated in July 2025. Like its instruct sibling, it activates only about 3 billion parameters per token, which keeps per-token compute low (the full set of weights must still be loaded into memory) while enabling multi-step reasoning and chain-of-thought problem solving. This thinking variant is designed for tasks that benefit from deliberate, step-by-step logic such as math, coding puzzles, and analytical questions. Its efficient MoE design means users with modest GPUs can still access strong reasoning capabilities without needing datacenter-class hardware.
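
For a mixture-of-experts model like this one, it can help to separate the parameters used per token from the parameters that must be stored. The sketch below uses the 30.5B total and roughly 3B active figures from this entry. The function name and the 4.5 bits-per-weight value are assumptions chosen for illustration, so the result is not the page's "runs from" figure, which reflects a smaller quantization.

```python
def moe_summary(total_b: float, active_b: float, bits_per_weight: float) -> dict:
    """Contrast per-token activity with stored weight size for an MoE model.

    total_b and active_b are parameter counts in billions.
    """
    return {
        # Share of parameters used for each token (drives compute per token).
        "active_fraction": active_b / total_b,
        # All experts must be resident, so memory follows the total count.
        "weights_gb": total_b * bits_per_weight / 8,
    }


# 30.5B total and ~3B active come from the listing; 4.5 bits/weight is assumed.
summary = moe_summary(total_b=30.5, active_b=3.0, bits_per_weight=4.5)

print(f"Active per token: {summary['active_fraction']:.1%}")
print(f"Approx. weight memory: {summary['weights_gb']:.1f} GB")
```

The point of the sketch is that generation speed tracks the small active share, while the memory footprint still tracks the full parameter count.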

Diffusiongemma 26B A4B IT

Google · 25.8B · runs from 11.6 GB

Diffusiongemma 26B A4B IT is a 25.8B-parameter open language model from Google in the Gemma family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 12B IT AEON Abliterated K4 BF16

AEON-7 · 12.0B · runs from 6.1 GB

Gemma 4 12B IT AEON Abliterated K4 BF16 is a 12.0B-parameter open language model from AEON-7 in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning · Functions

Hermes 3 Llama 3.2 3B

Nous Research · 3B · runs from 1.6 GB

Hermes 3 Llama 3.2 3B is a 3-billion parameter instruction-tuned model by Nous Research, fine-tuned from Meta's Llama 3.2 3B base. It applies the Hermes training methodology to a compact model, targeting strong instruction following and conversational quality at minimal hardware cost. Despite its small size, this model benefits from the Hermes fine-tuning approach that emphasizes system prompt adherence and structured output. It can run on GPUs with as little as 4 GB of VRAM when quantized, making it suitable for lightweight local deployments and resource-constrained environments.

Ternary Bonsai 8B Unpacked

prism-ml · 8.2B · runs from 4.1 GB

Ternary Bonsai 8B Unpacked is an 8.2B-parameter open language model from prism-ml. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Nex N2 Mini

nex-agi · 35.1B · runs from 14.0 GB

Nex N2 Mini is a 35.1B-parameter open language model from nex-agi. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OpenHermes 2.5 Mistral 7B

Teknium · 7B · runs from 3.5 GB

OpenHermes 2.5 is a community-driven fine-tune of Mistral 7B created by Teknium, trained on over 900,000 entries of high-quality synthetic data generated primarily by GPT-4. It quickly became one of the most popular open chat models of its era, consistently topping community benchmarks for 7B-class models. For local users, it offers strong instruction-following, creative writing, and coding assistance in a package that runs comfortably on a single consumer GPU with 8 GB of VRAM.

Mistral 7B v0.1

Mistral AI · 7B · runs from 3.5 GB

Mistral 7B v0.1 is the original base model from Mistral AI that helped reshape expectations for small open-weight language models when it launched in late 2023. As a pretrained foundation model without instruction tuning, it is designed for fine-tuning, research, and custom downstream tasks rather than direct conversational use. With 7 billion parameters and support for grouped-query attention and sliding-window attention, it remains a popular starting point for practitioners building specialized models. Its modest VRAM requirements of roughly 6 GB at 4-bit quantization keep it accessible on a wide range of consumer GPUs.

Qwen3.6 28B REAP20 A3B

0xSero · 28.2B · runs from 11.3 GB

Qwen3.6 28B REAP20 A3B is a 28.2B-parameter open language model from 0xSero in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 2 27B IT

Google · 27.2B · runs from 9.0 GB

Google Gemma 2 27B IT is a 27.2-billion parameter instruction-tuned model from Google's Gemma 2 generation. It is a text-only chat model optimized for conversational use, reasoning, and instruction following. Gemma 2 27B IT was one of the strongest openly available models in its size class at release. The model requires a GPU with at least 24 GB of VRAM for quantized local inference. It is widely supported by popular inference engines and remains a strong choice for users seeking high-quality local chat without needing 70B-class hardware. Released under the Gemma license.

Gemma 3 270M

Google · 268M · runs from 0.1 GB

Google Gemma 3 270M is a 270-million parameter base (pretrained) model from Google's Gemma 3 family. It is an experimental release intended for research, fine-tuning, and exploring the capabilities of ultra-small language models. The model runs on virtually any hardware with negligible resource requirements. Released under the Gemma license.

Qwen2.5 Coder 3B

Alibaba · 3.1B · runs from 1.4 GB

Qwen2.5 Coder 3B is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 4 Reasoning Plus

Microsoft · 14.7B · runs from 4.8 GB

Phi 4 Reasoning Plus is a 14.7B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math · Code · Reasoning

Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

Meta Llama 3.1 8B is an 8-billion parameter base (pretrained) model from the Llama 3.1 family. It is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Compared to Llama 3 8B, it extends the context window to 128K tokens and benefits from improved training data and methodology. The model uses grouped-query attention and was trained on a multilingual corpus. It is released under the Llama 3.1 Community License and is widely used as a foundation for community fine-tunes and specialized models.

NeuralDaredevil 8B Abliterated

mlabonne · 8.0B · runs from 4.0 GB

NeuralDaredevil 8B Abliterated is an 8.0B-parameter open language model from mlabonne. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek Coder 1.3B Instruct

DeepSeek · 1.3B · runs from 1.3 GB

DeepSeek Coder 1.3B Instruct is an ultra-compact code model designed for environments where hardware resources are extremely limited. Despite having just 1.3 billion parameters, it can handle basic code completion, simple generation tasks, and code Q&A across common programming languages. This is one of the smallest viable code models available, capable of running on integrated graphics or very low-end dedicated GPUs. It is well suited for edge deployment, embedded development environments, or as a fast local autocomplete engine where response speed matters more than handling complex multi-file reasoning tasks.

OLMo 3 7B Instruct

Allen AI · 7.3B · runs from 3.4 GB

OLMo 3 7B Instruct is an instruction-tuned language model from the Allen Institute for AI, built as part of their Open Language Model initiative. Like all OLMo releases, it comes with fully open training data, code, and intermediate checkpoints, setting a high standard for reproducibility and scientific transparency in the LLM space. At roughly 7 billion parameters, this model delivers competitive performance on instruction following, reasoning, and general knowledge tasks while remaining runnable on consumer GPUs with 8 GB or more of VRAM. It is an excellent choice for users who value open science and want a capable, well-documented model for local chat and assistant applications.

Moonlight 16B A3B Instruct

Moonshot AI · 16.0B · runs from 5.1 GB

Moonlight 16B A3B Instruct is a 16.0B-parameter open language model from Moonshot AI in the Moonlight family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Granite 4.0 Micro

IBM · 3.4B · runs from 1.4 GB

Granite 4.0 Micro is a 3.4B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 31B IT Qat Q4 0 Unquantized Assistant

Google · 31B · runs from 13.5 GB

Gemma 4 31B IT Qat Q4 0 Unquantized Assistant is a 31B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 Coder 1.5B

Alibaba · 1.5B · runs from 1 GB

Qwen2.5 Coder 1.5B is a 1.5-billion parameter code-specialized model from Alibaba Cloud's Qwen 2.5 Coder series. It is the smallest Coder variant that balances meaningful code generation capability with extremely low resource requirements, running on GPUs with as little as 2 to 4 GB of VRAM. The model is suitable for lightweight code completion, simple code generation tasks, and as a compact local coding assistant in resource-constrained environments. It supports a 128K token context window. Released under the Apache 2.0 license.

Pantheon Reasoning 27B

Gryphe · 27.8B · runs from 8.4 GB

Pantheon Reasoning 27B is a 27.8B-parameter open language model from Gryphe. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Roleplay · Reasoning

Nanbeige4.1 3B

Nanbeige · 3.9B · runs from 2.1 GB

Nanbeige4.1 3B is a compact chat model from Nanbeige, a Chinese AI startup focused on building efficient small-scale language models. At just under 4 billion parameters, it is designed to run on virtually any modern GPU or even on CPU, making it one of the more accessible options for users with limited hardware. Despite its small size, it handles basic conversation, simple reasoning, and Chinese-English bilingual tasks, serving as a practical entry point for local LLM experimentation.

StarCoder2 15B

BigCode · 16.0B · runs from 7.3 GB

StarCoder2 15B is a 16.0B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Huihui Qwen3.6 35B A3B Claude 4.7 Opus Abliterated

huihui-ai · 36.0B · runs from 15.7 GB

Huihui Qwen3.6 35B A3B Claude 4.7 Opus Abliterated is a 36.0B-parameter open language model from huihui-ai in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Phi 3 Mini 4K Instruct

Microsoft · 3.8B · runs from 2.7 GB

Microsoft Phi 3 Mini 4K Instruct is a 3.8-billion parameter instruction-tuned model from Microsoft Research's Phi 3 generation, with a 4K token context window. The Phi 3 family demonstrated that small models trained on carefully curated, high-quality data can achieve performance competitive with models several times their size. The model runs on consumer GPUs with as little as 4 to 6 GB of VRAM when quantized, making it one of the most accessible capable chat models for local deployment. Released under the MIT license.

Qwen2.5 3B

Alibaba · 3.1B · runs from 1.6 GB

Qwen2.5 3B is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Magistral Small 2506

Mistral AI · 23.6B · runs from 7.2 GB

Magistral Small 2506 is a 23.6B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.