All LLM Models

Browse 593 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
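The arithmetic behind those figures can be sketched as parameter count times bits per weight. This is a rough illustration, not the method used to compute the numbers on this page: the function name and the bits-per-weight table are hypothetical, the Q4_K_M value of about 4.8 bits is an assumed average (the format mixes block types, so the real figure varies by model), and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
# Assumption: the Q4_K_M entry is an approximate average, not an exact constant.
BITS_PER_WEIGHT = {
    "FP16": 16.0,    # two bytes per weight
    "Q4_K_M": 4.8,   # assumed effective average for this mixed 4-bit format
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the estimated size of the weights alone, in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in ("FP16", "Q4_K_M"):
        size = estimate_weight_gb(7.0, quant)
        print(f"7B at {quant}: ~{size:.1f} GB")
```

For a 7B model this gives about 14 GB at FP16 and a little over 4 GB at Q4_K_M, in line with the example above. Actual usage will be higher once context length and runtime buffers are included.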

Gemma 4 E2B IT Qat Mobile Transformers

Google · 2.3B · runs from 1.4 GB

Gemma 4 E2B IT Qat Mobile Transformers is a 2.3B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 12B OBLITERATED

OBLITERATUS · 12.0B · runs from 4.3 GB

Gemma 4 12B OBLITERATED is a 12.0B-parameter open language model from OBLITERATUS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 26B A4B IT Uncensored Heretic

llmfan46 · 25.8B · runs from 11.6 GB

Gemma 4 26B A4B IT Uncensored Heretic is a 25.8B-parameter open language model from llmfan46 in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 26B A4B IT Assistant

Google · 26B · runs from 11.4 GB

Gemma 4 26B A4B IT Assistant is a 26B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2 8B A1B

LiquidAI · 8.3B · runs from 2.7 GB

LFM2 8B A1B is Liquid AI's larger mixture-of-experts model, combining the company's novel hybrid architecture with approximately 8 billion total parameters. It uses a MoE design to keep active compute per token low while maintaining strong general performance across chat and reasoning tasks. For local users, it offers an intriguing alternative to conventional 8B transformers, with Liquid AI's architecture promising improved efficiency and throughput on consumer-grade hardware.

SmolLM3 3B

Hugging Face · 3.1B · runs from 1.3 GB

SmolLM3 3B is Hugging Face's latest-generation compact language model, representing a significant step up from the SmolLM2 series. At 3 billion parameters, it delivers considerably stronger reasoning, instruction following, and general language understanding while maintaining modest hardware requirements that keep it accessible on most consumer GPUs. This model benefits from improved training data, architectural refinements, and lessons learned from previous SmolLM generations. It is well positioned for local chatbot applications, coding assistance, and content generation tasks where you want strong performance without dedicating the resources required by 7B-class models.

Gemma 3n E2B IT

Google · 5.4B · runs from 1.6 GB

Gemma 3n E2B IT is a 5.4B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 E4B IT Assistant

Google · 4B · runs from 2 GB

Gemma 4 E4B IT Assistant is a 4B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.2 11B Vision Instruct

Meta · 10.7B · runs from 5.0 GB

Llama 3.2 11B Vision Instruct is a 10.7B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Jan Code 4B

janhq · 4.4B · runs from 2.4 GB

Jan Code 4B is a 4.4B-parameter open language model from janhq. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Functions · Code

Qwen2.5 Coder 32B

Alibaba · 32.8B · runs from 9.8 GB

Qwen2.5 Coder 32B is a 32.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 3n E4B IT

Google · 7.8B · runs from 2.4 GB

Gemma 3n E4B IT is a 7.8B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Hermes 4 14B

Nous Research · 14.8B · runs from 5.1 GB

Hermes 4 14B is a 14.8B-parameter open language model from Nous Research in the Hermes family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning · Roleplay

LFM2.5 350M

LiquidAI · 354M · runs from 0.5 GB

LFM2.5 350M is a 354M-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2.5 1.2B Thinking

LiquidAI · 1.2B · runs from 0.9 GB

LFM2.5 1.2B Thinking is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 Coder 7B

Alibaba · 7.6B · runs from 3.6 GB

Qwen2.5 Coder 7B is a 7.6-billion parameter code-specialized base (pretrained) model from Alibaba Cloud's Qwen 2.5 Coder series. It is trained on a large dataset of source code and natural language but is not instruction-tuned, making it suitable for fine-tuning, code-related research, and custom downstream applications. The model supports a 128K token context window and runs efficiently on consumer GPUs. It serves as the foundation for the Qwen2.5 Coder 7B Instruct variant and community fine-tunes targeting specific programming languages or workflows. Released under the Apache 2.0 license.

Hermes 3 Llama 3.1 8B

Nous Research · 8.0B · runs from 3.3 GB

Hermes 3 Llama 3.1 8B is an 8-billion parameter instruction-tuned model by Nous Research, built on Meta's Llama 3.1 8B base. It is fine-tuned for advanced instruction following, multi-turn conversation, structured output, and creative roleplay scenarios. The Hermes series is known for producing highly steerable models that respond well to system prompts. This model supports a 128K token context window inherited from the Llama 3.1 architecture and runs efficiently on consumer GPUs with 8 GB or more of VRAM. It is a popular choice among local inference enthusiasts who value strong instruction adherence and versatile conversational ability.

Llama 3.1 8B Lexi Uncensored v2

Orenguteng · 8.0B · runs from 3.3 GB

Llama 3.1 8B Lexi Uncensored v2 is an 8.0B-parameter open language model from Orenguteng in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Mellum2 12B A2.5B Thinking

JetBrains · 12.1B · runs from 5.5 GB

Mellum2 12B A2.5B Thinking is a 12.1B-parameter open language model from JetBrains in the Mellum family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Huihui GPT OSS 20B BF16 Abliterated

huihui-ai · 20.9B · runs from 9.3 GB

Huihui GPT OSS 20B BF16 Abliterated is a 20.9B-parameter open language model from huihui-ai in the GPT-OSS family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Functiongemma 270M IT

Google · 268M · runs from 0.1 GB

Functiongemma 270M IT is a 268M-parameter open language model from Google in the Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

MiniCPM5 1B

openbmb · 1.1B · runs from 0.6 GB

MiniCPM5 1B is a 1.1B-parameter open language model from openbmb in the MiniCPM family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 E4B IT OBLITERATED

OBLITERATUS · 8.0B · runs from 2.7 GB

Gemma 4 E4B IT OBLITERATED is an 8.0B-parameter open language model from OBLITERATUS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 30B A3B Thinking 2507

Alibaba · 30.5B · runs from 8.8 GB

Qwen3 30B A3B Thinking 2507 is the reasoning-focused variant of Alibaba's 30-billion-parameter mixture-of-experts model, updated in July 2025. Like its instruct sibling, it activates only about 3 billion parameters per token, keeping resource demands low while enabling multi-step reasoning and chain-of-thought problem solving. This thinking variant is designed for tasks that benefit from deliberate, step-by-step logic such as math, coding puzzles, and analytical questions. Its efficient MoE design means users with modest GPUs can still access strong reasoning capabilities without needing datacenter-class hardware.

Diffusiongemma 26B A4B IT

Google · 25.8B · runs from 11.6 GB

Diffusiongemma 26B A4B IT is a 25.8B-parameter open language model from Google in the Gemma family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 12B IT AEON Abliterated K4 BF16

AEON-7 · 12.0B · runs from 6.1 GB

Gemma 4 12B IT AEON Abliterated K4 BF16 is a 12.0B-parameter open language model from AEON-7 in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning · Functions

Hermes 3 Llama 3.2 3B

Nous Research · 3B · runs from 1.6 GB

Hermes 3 Llama 3.2 3B is a 3-billion parameter instruction-tuned model by Nous Research, fine-tuned from Meta's Llama 3.2 3B base. It applies the Hermes training methodology to a compact model, targeting strong instruction following and conversational quality at minimal hardware cost. Despite its small size, this model benefits from the Hermes fine-tuning approach that emphasizes system prompt adherence and structured output. It can run on GPUs with as little as 4 GB of VRAM when quantized, making it suitable for lightweight local deployments and resource-constrained environments.

Ternary Bonsai 8B Unpacked

prism-ml · 8.2B · runs from 4.1 GB

Ternary Bonsai 8B Unpacked is an 8.2B-parameter open language model from prism-ml. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

OpenHermes 2.5 Mistral 7B

Teknium · 7B · runs from 3.5 GB

OpenHermes 2.5 is a community-driven fine-tune of Mistral 7B created by Teknium, trained on over 900,000 entries of high-quality synthetic data generated primarily by GPT-4. It quickly became one of the most popular open chat models of its era, consistently topping community benchmarks for 7B-class models. For local users, it offers strong instruction-following, creative writing, and coding assistance in a package that runs comfortably on a single consumer GPU with 8 GB of VRAM.

Mistral 7B v0.1

Mistral AI · 7B · runs from 3.5 GB

Mistral 7B v0.1 is the original base model from Mistral AI that helped reshape expectations for small open-weight language models when it launched in late 2023. As a pretrained foundation model without instruction tuning, it is designed for fine-tuning, research, and custom downstream tasks rather than direct conversational use. With 7 billion parameters and support for grouped-query attention and sliding-window attention, it remains a popular starting point for practitioners building specialized models. Its modest VRAM requirements, starting at about 3.5 GB when quantized, keep it accessible on a wide range of consumer GPUs.