All LLM Models

Browse 856 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
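The 7B example above follows from a simple rule of thumb: weight memory is roughly the parameter count times the bits stored per weight, divided by 8. The sketch below is illustrative only. The function name and the bits-per-weight table are assumptions, not values taken from this page, and the ~4.8 bits for Q4_K_M is an approximate average. The estimate covers weights only, so the KV cache and runtime overhead come on top.

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8.
# The bits-per-weight values are assumptions for illustration;
# Q4_K_M mixes block formats, so ~4.8 is only an approximate average.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.8,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7.0, quant):.1f} GB")
```

Real usage is higher than this figure because long context windows add KV cache memory, so treat the result as a lower bound rather than a guarantee that a model fits.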

Model List

Gemma 4 31B IT Uncensored

TrevorJS · 32.7B · runs from 15.5 GB

6.7K 24

Gemma 4 31B IT Uncensored is a 32.7B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 4 Maverick 17B 128E Instruct

Meta · 401.6B · runs from 121.5 GB

26.4K 494

Llama 4 Maverick 17B 128E Instruct is a 401.6B-parameter open language model from Meta in the Llama 4 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Llama 3.1 8B Lexi Uncensored v2

Orenguteng · 8.0B · runs from 3.3 GB

27.9K 303

Llama 3.1 8B Lexi Uncensored v2 is an 8.0B-parameter open language model from Orenguteng in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Mellum2 12B A2.5B Thinking

JetBrains · 12.1B · runs from 5.5 GB

2.6K 283

Mellum2 12B A2.5B Thinking is a 12.1B-parameter open language model from JetBrains in the Mellum family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Huihui GPT OSS 20B BF16 Abliterated

huihui-ai · 20.9B · runs from 9.3 GB

40.6K 216

Huihui GPT OSS 20B BF16 Abliterated is a 20.9B-parameter open language model from huihui-ai in the GPT-OSS family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Functiongemma 270M IT

Google · 268M · runs from 0.1 GB

133.8K 1.0K

Functiongemma 270M IT is a 268M-parameter open language model from Google in the Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3.6 40B Claude 4.6 Opus Deckard Heretic Uncensored Thinking

DavidAU · 39.5B · runs from 80.0 GB

42.3K 76

Qwen3.6 40B Claude 4.6 Opus Deckard Heretic Uncensored Thinking is a 39.5B-parameter open language model from DavidAU in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision · Roleplay

MiniCPM5 1B

openbmb · 1.1B · runs from 0.6 GB

78.9K 797

MiniCPM5 1B is a 1.1B-parameter open language model from openbmb in the MiniCPM family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 E4B IT OBLITERATED

OBLITERATUS · 8.0B · runs from 2.7 GB

303.3K 702

Gemma 4 E4B IT OBLITERATED is an 8.0B-parameter open language model from OBLITERATUS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 30B A3B Thinking 2507

Alibaba · 30.5B · runs from 8.8 GB

138.1K 379

Qwen3 30B A3B Thinking 2507 is the reasoning-focused variant of Alibaba's 30-billion-parameter mixture-of-experts model, updated in July 2025. Like its instruct sibling, it activates only about 3 billion parameters per token, keeping resource demands low while enabling multi-step reasoning and chain-of-thought problem solving. This thinking variant is designed for tasks that benefit from deliberate, step-by-step logic such as math, coding puzzles, and analytical questions. Its efficient MoE design means users with modest GPUs can still access strong reasoning capabilities without needing datacenter-class hardware.

Chat

Diffusiongemma 26B A4B IT

Google · 25.8B · runs from 11.6 GB

20.7K 593

Diffusiongemma 26B A4B IT is a 25.8B-parameter open language model from Google in the Gemma family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Gemma 4 12B IT AEON Abliterated K4 BF16

AEON-7 · 12.0B · runs from 6.1 GB

2.3K 25

Gemma 4 12B IT AEON Abliterated K4 BF16 is a 12.0B-parameter open language model from AEON-7 in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning · Functions

Hermes 3 Llama 3.2 3B

Nous Research · 3B · runs from 1.6 GB

77.3K 175

Hermes 3 Llama 3.2 3B is a 3-billion-parameter instruction-tuned model by Nous Research, fine-tuned from Meta's Llama 3.2 3B base. It applies the Hermes training methodology to a compact model, targeting strong instruction following and conversational quality at minimal hardware cost. Despite its small size, this model benefits from the Hermes fine-tuning approach that emphasizes system prompt adherence and structured output. It can run on GPUs with as little as 4 GB of VRAM when quantized, making it suitable for lightweight local deployments and resource-constrained environments.

Chat · Roleplay

GLM 4.7

zai-org · 358.3B · runs from 99.2 GB

68.6K 2.0K

GLM 4.7 is an earlier generation of Zhipu AI's GLM foundation model series, featuring a mixture-of-experts architecture with approximately 358 billion total parameters. It delivers strong performance on reasoning, language understanding, and bilingual Chinese-English tasks while being significantly more manageable to run locally than its GLM 5 successor. For users with multi-GPU setups, GLM 4.7 offers a practical balance between capability and hardware requirements within the GLM model family.

Chat

MiMo V2 Flash

XiaomiMiMo · 309.8B · runs from 85.6 GB

66.5K 738

MiMo V2 Flash is Xiaomi's large-scale mixture-of-experts language model, built with nearly 310 billion total parameters. Designed for fast inference despite its size, the Flash variant prioritizes throughput and responsiveness, making it well-suited for interactive chat and real-time applications. Running it locally is a serious undertaking that demands high-end multi-GPU configurations, but it brings flagship-level Chinese and English language capabilities to users who have the hardware to support it.

Chat

Qwen3 Coder 480B A35B Instruct

Alibaba · 480.2B · runs from 144.6 GB

41.6K 1.3K

Qwen3 Coder 480B A35B Instruct is Alibaba's largest code-specialized model, a massive 480.2-billion-parameter mixture-of-experts system with roughly 35 billion parameters active per token. This is the most powerful open-weight coding model in the Qwen3 family, designed for professional-grade code generation, analysis, and software engineering tasks. Running this model locally is a serious undertaking that requires multi-GPU server-class hardware with several hundred gigabytes of combined VRAM. For users with access to such infrastructure, it offers exceptional code quality and understanding that rivals leading proprietary coding assistants, all while keeping data and computation entirely under local control.

Chat · Code

Ternary Bonsai 8B Unpacked

prism-ml · 8.2B · runs from 4.1 GB

226.6K 12

Ternary Bonsai 8B Unpacked is an 8.2B-parameter open language model from prism-ml. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nex N2 Mini

nex-agi · 35.1B · runs from 14.0 GB

2.8K 178

Nex N2 Mini is a 35.1B-parameter open language model from nex-agi. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OpenHermes 2.5 Mistral 7B

Teknium · 7B · runs from 3.5 GB

151.8K 888

OpenHermes 2.5 is a community-driven fine-tune of Mistral 7B created by Teknium, trained on over 900,000 entries of high-quality synthetic data generated primarily by GPT-4. It quickly became one of the most popular open chat models of its era, consistently topping community benchmarks for 7B-class models. For local users, it offers strong instruction-following, creative writing, and coding assistance in a package that runs comfortably on a single consumer GPU with 8 GB of VRAM.

Chat

Mistral 7B v0.1

Mistral AI · 7B · runs from 3.5 GB

539.9K 4.1K

Mistral 7B v0.1 is the original base model from Mistral AI that helped reshape expectations for small open-weight language models when it launched in late 2023. As a pretrained foundation model without instruction tuning, it is designed for fine-tuning, research, and custom downstream tasks rather than direct conversational use. With 7 billion parameters and support for grouped-query attention and sliding-window attention, it remains a popular starting point for practitioners building specialized models. Its modest VRAM requirements of roughly 6 GB at 4-bit quantization keep it accessible on a wide range of consumer GPUs.

Chat

Qwen3.6 28B REAP20 A3B

0xSero · 28.2B · runs from 11.3 GB

1.1K 27

Qwen3.6 28B REAP20 A3B is a 28.2B-parameter open language model from 0xSero in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 2 27B IT

Google · 27.2B · runs from 9.0 GB

128.3K 568

Google Gemma 2 27B IT is a 27.2-billion-parameter instruction-tuned model from Google's Gemma 2 generation. It is a text-only chat model optimized for conversational use, reasoning, and instruction following. Gemma 2 27B IT was one of the strongest openly available models in its size class at release. The model requires a GPU with at least 24 GB of VRAM for quantized local inference. It is widely supported by popular inference engines and remains a strong choice for users seeking high-quality local chat without needing 70B-class hardware. Released under the Gemma license.

Chat

Gemma 3 270M

Google · 268M · runs from 0.1 GB

7.5M 1.0K

Google Gemma 3 270M is a 270-million parameter base (pretrained) model from Google's Gemma 3 family. It is an experimental release intended for research, fine-tuning, and exploring the capabilities of ultra-small language models. The model runs on virtually any hardware with negligible resource requirements. Released under the Gemma license.

Chat

Qwen2.5 Coder 3B

Alibaba · 3.1B · runs from 1.4 GB

717.7K 51

Qwen2.5 Coder 3B is a 3.1B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

MiMo V2.5 Pro

XiaomiMiMo · 1023.2B · runs from 281.9 GB

52.5K 619

MiMo V2.5 Pro is a 1023.2B-parameter open language model from XiaomiMiMo. It supports a context window of up to 1,048,576 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Functions · Code

Qwen3 235B A22B Instruct 2507

Alibaba · 235.1B · runs from 71.0 GB

123.5K 784

Qwen3 235B A22B Instruct 2507 is Alibaba's flagship instruction-tuned model from the July 2025 update, featuring 235 billion total parameters with approximately 22 billion active during inference. As the largest instruct model in the Qwen3 lineup, it delivers top-tier conversational quality, knowledge depth, and instruction following. Despite its massive total parameter count, the MoE architecture keeps active compute manageable. Running this model locally still requires substantial hardware, typically a multi-GPU setup, since all 235 billion parameters must be held in memory and the smallest listed configuration starts at 71.0 GB. Even so, the 2507 refresh makes it one of the most capable open-weight models available for users with high-end local infrastructure.

Chat

Phi 4 Reasoning Plus

Microsoft · 14.7B · runs from 4.8 GB

24.7K 343

Phi 4 Reasoning Plus is a 14.7B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math · Code · Reasoning

MiMo V2.5

XiaomiMiMo · 310.8B · runs from 85.9 GB

141.2K 297

MiMo V2.5 is a 310.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 1,048,576 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Functions

Mixtral 8x7B v0.1

Mistral AI · 46.7B · runs from 19.8 GB

58.2K 1.8K

Mixtral 8x7B v0.1 is a 46.7B-parameter open language model from Mistral AI in the Mixtral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.1 8B

Meta · 8.0B · runs from 3.8 GB

1.3M 2.3K

Meta Llama 3.1 8B is an 8-billion parameter base (pretrained) model from the Llama 3.1 family. It is not instruction-tuned and is intended for fine-tuning, research, and custom downstream applications. Compared to Llama 3 8B, it extends the context window to 128K tokens and benefits from improved training data and methodology. The model uses grouped-query attention and was trained on a multilingual corpus. It is released under the Llama 3.1 Community License and is widely used as a foundation for community fine-tunes and specialized models.

Chat