All LLM Models

Browse 739 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
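
The arithmetic behind those figures can be sketched as parameters times bits per weight, divided by eight. This is a minimal weight-only estimate: the function name `weight_memory_gb` and the bits-per-weight table are assumptions made for illustration (the Q8_0 and Q4_K_M averages in particular are approximate and vary by model), and it ignores the KV cache, activations, and runtime overhead.

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8.
# The bits-per-weight values below are approximations assumed for this sketch.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # assumed: 8-bit values plus per-block scale overhead
    "Q4_K_M": 4.85,  # assumed average; the real figure varies by architecture
}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of params * bits / 8 bits-per-byte = billions of bytes = GB
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B @ {quant}: ~{weight_memory_gb(7.0, quant):.1f} GB")
```

Actual usage is typically somewhat higher than this estimate, since long context windows add a KV cache on top of the weights.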

Model List

Gemma 4 12B IT Heretic

igorls · 12.0B · runs from 6.1 GB

1.4K 11

Gemma 4 12B IT Heretic is a 12.0B-parameter open language model from igorls in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Huihui Qwen3 Coder 30B A3B Instruct Abliterated

huihui-ai · 30.5B · runs from 8.8 GB

3.1K 35

Huihui Qwen3 Coder 30B A3B Instruct Abliterated is a 30.5B-parameter open language model from huihui-ai in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Qwen2.5 72B Instruct

Alibaba · 72.7B · runs from 21.0 GB

455.1K 951

Qwen2.5 72B Instruct is the flagship model of the Qwen 2.5 series from Alibaba Cloud, with 72.7 billion parameters. It is instruction-tuned for conversational use and excels across reasoning, coding, mathematics, and multilingual tasks. Qwen2.5 72B delivers performance competitive with leading open-weight 70B-class models while supporting a 128K token context window and structured output generation. The model uses a Transformer architecture with grouped-query attention and was pretrained on a diverse multilingual corpus of over 18 trillion tokens. Running it locally requires high-VRAM GPUs or multi-GPU setups, though quantized formats make it accessible on workstation-class hardware. Released under the Apache 2.0 license.

Chat

Gemma 2 9B IT

Google · 9.2B · runs from 3.0 GB

391.0K 826

Google Gemma 2 9B IT is a 9.2-billion-parameter instruction-tuned model from Google's Gemma 2 series. It is a text-only chat model optimized for conversational tasks, instruction following, and general-purpose assistance. At release, it was recognized for delivering unusually strong performance relative to its parameter count. In quantized formats the model runs efficiently on consumer GPUs with 8–12 GB of VRAM, making it accessible on mainstream hardware. It is a popular choice for local inference among users who want strong quality without the VRAM demands of larger models. Released under the Gemma license.

Chat

DeepSeek Coder v2 Lite Instruct

DeepSeek · 15.7B · runs from 7.2 GB

894.1K 609

DeepSeek Coder V2 Lite Instruct is a code-focused mixture-of-experts model with 15.7 billion total parameters, trained to handle both programming tasks and general conversation. It supports a wide range of programming languages and excels at code generation, debugging, explanation, and refactoring. The MoE architecture keeps compute costs manageable despite the model's broad capabilities, and the Lite variant is sized to run on a single consumer GPU. For developers looking for a capable local coding assistant that can also handle general chat, this model offers an appealing combination of code specialization and practical hardware requirements.

Chat · Code

Gemma 4 E2B IT Uncensored

TrevorJS · 5.1B · runs from 2.5 GB

1.3K 20

Gemma 4 E2B IT Uncensored is a 5.1B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM2 360M Instruct

Hugging Face · 362M · runs from 0.5 GB

283.9K 193

SmolLM2 360M Instruct is an instruction-tuned model from Hugging Face that occupies the sweet spot between the 135M and 1.7B entries in the SmolLM2 lineup. At 360 million parameters, it offers noticeably better coherence and instruction-following ability than the smallest variants while still running comfortably on virtually any modern GPU or even on CPU. This model is well suited for on-device assistants, embedded applications, and rapid prototyping where you need real conversational ability without dedicating significant hardware resources. It handles short-form generation, summarization, and basic reasoning tasks with reasonable quality.

Chat

Gemma 4 26B A4B IT Uncensored

TrevorJS · 25.8B · runs from 11.6 GB

218.7K 41

Gemma 4 26B A4B IT Uncensored is a 25.8B-parameter open language model from TrevorJS in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron 3 Nano Omni 30B A3B Reasoning BF16

NVIDIA · 33.0B · runs from 10.0 GB

340.0K 343

Nemotron 3 Nano Omni 30B A3B Reasoning BF16 is a 33.0B-parameter open language model from NVIDIA in the Nemotron family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Reasoning

Phi 2

Microsoft · 2.8B · runs from 2.1 GB

437.6K 3.5K

Microsoft Phi 2 is a 2.8-billion-parameter language model from Microsoft Research that helped establish small but highly capable language models. Released in late 2023, Phi 2 demonstrated that strategic data curation and training methodology could allow a sub-3B model to outperform many 7B and 13B models on reasoning and coding benchmarks. The model runs on virtually any modern GPU and even on CPU-only setups. Although it has been succeeded by Phi 3 and Phi 4, Phi 2 remains historically significant as a model that showed small-scale language models could be genuinely useful for practical tasks. Released under the MIT license.

Chat · Code

Granite 4.1 3B

IBM · 3.4B · runs from 1.6 GB

197.9K 75

Granite 4.1 3B is a 3.4B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

NVIDIA Nemotron Nano 9B v2 Japanese

NVIDIA · 8.9B · runs from 4.4 GB

281.4K 124

NVIDIA Nemotron Nano 9B v2 Japanese is a specialized variant of the Nemotron Nano 9B v2, fine-tuned for Japanese language understanding and generation. At 8.9 billion parameters, it maintains the same hardware-friendly footprint as the English version while delivering natural Japanese conversational ability. For users looking to run a Japanese-language assistant locally, this model offers a rare combination of compact size and dedicated language optimization from a major hardware vendor. It handles Japanese text with the fluency you'd expect from a purpose-built model rather than a multilingual afterthought.

Chat

Cydonia 24B V4.3

TheDrummer · 23.6B · runs from 7.8 GB

6.0K 118

Cydonia 24B V4.3 is a 23.6B-parameter open language model from TheDrummer. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 E2B IT Qat Mobile Transformers

Google · 2.3B · runs from 1.4 GB

1.7K 28

Gemma 4 E2B IT Qat Mobile Transformers is a 2.3B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3.6 27B DFlash

z-lab · 27B · runs from 12.6 GB

68.7K 345

Qwen3.6 27B DFlash is a 27B-parameter open language model from z-lab in the Qwen 3.6 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.3 70B Instruct Abliterated

huihui-ai · 70.6B · runs from 20.4 GB

4.3K 74

Llama 3.3 70B Instruct Abliterated is a 70.6B-parameter open language model from huihui-ai in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Mixtral 8x7B Instruct v0.1

Mistral AI · 46.7B · runs from 20.4 GB

806.7K 4.7K

Mixtral 8x7B Instruct v0.1 is Mistral AI's flagship Mixture-of-Experts model. Each layer holds eight expert feed-forward networks, two of which are routed per token; because the attention layers are shared across experts, the total comes to 46.7B parameters rather than a full 8 × 7B, with only about 12.9 billion parameters active per token. This sparse architecture delivers performance that rivals much larger dense models at a fraction of the inference cost, excelling across reasoning, code generation, and multilingual tasks. Because the full weights must still be loaded into memory, you will need around 24–48 GB of VRAM depending on quantization level, making it best suited for multi-GPU desktop setups or high-VRAM workstation cards. If your hardware can accommodate it, Mixtral offers one of the best performance-per-active-parameter ratios available for local deployment.

Chat
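
To illustrate why the Mixtral entry above sizes memory by total rather than active parameters, here is a small sketch using the 46.7B and 12.9B figures from its description. The helper `weights_gb` and the two bits-per-weight values are assumptions chosen for illustration, not figures published for this model, and the estimate covers weights only.

```python
# Parameter counts quoted in the Mixtral 8x7B description.
TOTAL_PARAMS_B = 46.7   # every expert must be resident in memory
ACTIVE_PARAMS_B = 12.9  # parameters actually used for each token


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only size in GB: billions of params * bits / 8."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Bits-per-weight values are assumed averages for 4-bit and 8-bit class quants.
    for label, bits in [("4-bit class", 4.85), ("8-bit class", 8.5)]:
        resident = weights_gb(TOTAL_PARAMS_B, bits)
        touched = weights_gb(ACTIVE_PARAMS_B, bits)
        print(f"{label}: ~{resident:.1f} GB resident, ~{touched:.1f} GB read per token")
```

The resident figure determines which hardware can load the model, while the smaller per-token figure is what keeps inference cost closer to that of a roughly 13B dense model.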

Gemma 4 12B OBLITERATED

OBLITERATUS · 12.0B · runs from 4.3 GB

43.6K 250

Gemma 4 12B OBLITERATED is a 12.0B-parameter open language model from OBLITERATUS in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 26B A4B IT Uncensored Heretic

llmfan46 · 25.8B · runs from 11.6 GB

2.1K 12

Gemma 4 26B A4B IT Uncensored Heretic is a 25.8B-parameter open language model from llmfan46 in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Gemma 4 26B A4B IT Assistant

Google · 26B · runs from 11.4 GB

126.5K 162

Gemma 4 26B A4B IT Assistant is a 26B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2 8B A1B

LiquidAI · 8.3B · runs from 2.7 GB

46.0K 367

LFM2 8B A1B is Liquid AI's larger mixture-of-experts model, combining the company's novel hybrid architecture with approximately 8 billion total parameters. It uses a MoE design to keep active compute per token low while maintaining strong general performance across chat and reasoning tasks. For local users, it offers an intriguing alternative to conventional 8B transformers, with Liquid AI's architecture promising improved efficiency and throughput on consumer-grade hardware.

Chat

Deepseek Coder 33B Instruct

DeepSeek · 33.3B · runs from 14.6 GB

6.2K 573

Deepseek Coder 33B Instruct is a 33.3B-parameter open language model from DeepSeek in the DeepSeek Coder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

SmolLM3 3B

Hugging Face · 3.1B · runs from 1.3 GB

516.4K 970

SmolLM3 3B is Hugging Face's latest-generation compact language model, representing a significant step up from the SmolLM2 series. At 3 billion parameters, it delivers considerably stronger reasoning, instruction following, and general language understanding while maintaining modest hardware requirements that keep it accessible on most consumer GPUs. This model benefits from improved training data, architectural refinements, and lessons learned from previous SmolLM generations. It is well positioned for local chatbot applications, coding assistance, and content generation tasks where you want strong performance without dedicating the resources required by 7B-class models.

Chat

GLM 4.5 Air

zai-org · 110.5B · runs from 30.8 GB

310.3K 608

GLM 4.5 Air is a 110.5B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 3n E2B IT

Google · 5.4B · runs from 1.6 GB

355.6K 305

Gemma 3n E2B IT is a 5.4B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Gemma 4 E4B IT Assistant

Google · 4B · runs from 2 GB

52.2K 108

Gemma 4 E4B IT Assistant is a 4B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.2 11B Vision Instruct

Meta · 10.7B · runs from 5.0 GB

138.3K 1.6K

Llama 3.2 11B Vision Instruct is a 10.7B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Jan Code 4B

janhq · 4.4B · runs from 2.4 GB

2.1K 68

Jan Code 4B is a 4.4B-parameter open language model from janhq. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Functions · Code

Qwen2.5 Coder 32B

Alibaba · 32.8B · runs from 9.8 GB

3.6K 156

Qwen2.5 Coder 32B is a 32.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Gemma 3n E4B IT

Google · 7.8B · runs from 2.4 GB

16.9K 917

Gemma 3n E4B IT is a 7.8B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision