All LLM Models

Browse 856 LLMs with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends mainly on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB for its weights at FP16 (2 bytes per parameter) but only ~4 GB at Q4_K_M quantization.
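The 7B example can be reproduced with a rough weight-only estimate: parameters times bits per weight, divided by 8 to get bytes. The sketch below is illustrative only. The function name and the bits-per-weight table are assumptions, not figures taken from this page, and the Q4_K_M value in particular is an approximate average that varies by model. Real usage is higher once the KV cache and runtime overhead are included.

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8.
# The bits-per-weight values are approximate assumptions for illustration;
# they are not taken from this page and vary slightly between models.
BITS_PER_WEIGHT = {
    "FP16": 16.0,    # 2 bytes per parameter
    "Q8_0": 8.5,     # 8-bit weights plus per-block scale
    "Q4_K_M": 4.85,  # approximate average for this mixed 4-bit format
}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes).

    Excludes the KV cache, activations, and runtime overhead.
    """
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7.0, quant):.1f} GB")
```

Under these assumptions, a 7B model comes out at 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the figures quoted above.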

Model List

Gemma 4 12B IT Qat Q4 0 Unquantized

Google · 12.0B · runs from 6.1 GB

17.7K 44

Gemma 4 12B IT Qat Q4 0 Unquantized is a 12.0B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 26B A4B IT Qat Q4 0 Unquantized

Google · 26.5B · runs from 11.9 GB

4.3K 22

Gemma 4 26B A4B IT Qat Q4 0 Unquantized is a 26.5B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Kimi K2.5

Moonshot AI · 1058.6B · runs from 295.0 GB

1.7M 2.8K

Kimi K2.5 is a 1058.6B-parameter open language model from Moonshot AI in the Kimi K2 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Gemma 3 4B IT

Google · 4.3B · runs from 1.3 GB

1.5M 1.4K

Gemma 3 4B IT is a 4.3B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Qwen3 1.7B

Alibaba · 2.0B · runs from 1.1 GB

4.7M 484

Qwen3 1.7B is a 1.7-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 3 series. It is a lightweight model designed for deployment on minimal hardware, including low-VRAM GPUs and even CPU-only configurations with acceptable latency. Despite its compact size, it supports hybrid thinking mode and handles basic conversational tasks, simple question answering, and text generation. The model is useful for edge deployment, embedded applications, and scenarios where fast inference with minimal resource consumption is the priority. It represents a significant quality improvement over Qwen 2.5 at the sub-2B scale. Released under the Apache 2.0 license.

Chat

MiniMax M2.7

MiniMaxAI · 228.7B · runs from 63.5 GB

2.6M 1.2K

MiniMax M2.7 is a 228.7B-parameter open language model from MiniMaxAI in the MiniMax family. It supports a context window of up to 204,800 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 3B Instruct

Alibaba · 3.1B · runs from 1.4 GB

12.7M 499

Qwen2.5 3B Instruct is a 3.1-billion parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 family. It is designed for efficient local inference on consumer hardware and supports a 32K token context window. The model can run on GPUs with as little as 4GB of VRAM when quantized. Despite its small size, Qwen2.5 3B Instruct delivers competitive performance for basic conversational tasks, summarization, and simple instruction following. It is a good option for edge deployment and resource-constrained environments. Released under the Qwen Research License.

Chat

DeepSeek R1 0528 Qwen3 8B

DeepSeek · 8.2B · runs from 2.9 GB

337.8K 1.1K

DeepSeek R1 0528 Qwen3 8B is an 8.2B-parameter open language model from DeepSeek in the DeepSeek R1 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Phi 3.5 Mini Instruct

Microsoft · 3.8B · runs from 2.3 GB

901.4K 987

Phi 3.5 Mini Instruct is a 3.8B-parameter open language model from Microsoft in the Phi 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Gemma 3 12B IT

Google · 12.2B · runs from 3.7 GB

2.6M 749

Google Gemma 3 12B IT is a 12-billion parameter multimodal instruction-tuned model from Google's Gemma 3 series. It supports both text and image inputs, offering vision-language capabilities at a more accessible size point than the 27B variant. Gemma 3 12B IT runs on consumer GPUs with 12-16GB of VRAM in quantized formats, making it a practical choice for local multimodal inference without requiring top-tier hardware. Released under the Gemma license.

Vision

GLM 4.7 Flash

zai-org · 31.2B · runs from 9.7 GB

1.2M 1.7K

GLM 4.7 Flash is a 31.2B-parameter open language model from zai-org in the GLM 4 family. It supports a context window of up to 202,752 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 Coder 32B Instruct

Alibaba · 32.8B · runs from 9.8 GB

1.6M 2.0K

Qwen2.5 Coder 32B Instruct is a 32.8-billion parameter code-specialized model from Alibaba Cloud, instruction-tuned for programming assistance and code generation. It is trained on a large corpus of source code alongside natural language data, making it highly capable for tasks such as code completion, debugging, code explanation, and software engineering dialogue. The model supports a 128K token context window and delivers code generation quality competitive with the best open-weight coding models at any scale. It requires a GPU with at least 24GB of VRAM for quantized inference. Released under the Apache 2.0 license.

Chat · Code

Meta Llama 3 8B Instruct

Meta · 8.0B · runs from 2.6 GB

1.3M 4.6K

Meta Llama 3 8B Instruct is the instruction-tuned version of Meta's Llama 3 8B base model, with 8 billion parameters. It is fine-tuned for dialogue and chat use cases using supervised fine-tuning and RLHF, making it ready for conversational applications out of the box. The model supports an 8K token context window and performs well across coding, reasoning, and general knowledge tasks. Its efficient size makes it one of the most popular models for local inference on consumer hardware. Released under the Meta Llama 3 Community License.

Chat

Qwen3 4B Instruct 2507

Alibaba · 4.0B · runs from 1.6 GB

4.4M 876

Qwen3 4B Instruct 2507 is a July 2025 refresh of Alibaba's compact 4-billion-parameter chat model from the Qwen3 family. This updated release brings improved instruction following and conversational quality while remaining lightweight enough to run on most modern GPUs and even some higher-end integrated graphics setups. With its modest size, the 4B Instruct 2507 strikes a practical balance between capability and resource efficiency. It is well suited for everyday chat, summarization, and light assistant tasks on consumer hardware, making it one of the more accessible entry points into the Qwen3 lineup.

Chat

Llama 3.3 70B Instruct

Meta · 70.6B · runs from 21.3 GB

658.5K 2.8K

Meta Llama 3.3 70B Instruct is a 70-billion parameter large language model from Meta, released as part of the Llama 3.3 generation. It is an instruction-tuned model optimized for dialogue and chat use cases, offering strong performance across reasoning, coding, and multilingual tasks. Llama 3.3 70B delivers quality competitive with much larger models while remaining feasible to run on high-end consumer or workstation GPUs with sufficient VRAM. The model uses a grouped-query attention architecture with a 128K token context window and was trained on a massive multilingual corpus. It is released under the Llama 3.3 Community License, making it one of the most capable openly available models for local inference.

Chat

Qwen2.5 Coder 14B Instruct

Alibaba · 14.8B · runs from 5.1 GB

3.0M 162

Qwen2.5 Coder 14B Instruct is a 14.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Qwen2.5 0.5B Instruct

Alibaba · 494M · runs from 0.5 GB

4.2M 530

Qwen2.5 0.5B Instruct is the smallest instruction-tuned model in Alibaba Cloud's Qwen 2.5 family, with just 494 million parameters. It is designed for ultra-lightweight deployment scenarios where minimal hardware resources are available, running comfortably on virtually any modern GPU or even CPU-only configurations. Despite its tiny footprint, the model supports a 32K token context window and can handle basic chat, simple summarization, and lightweight instruction following. It is primarily useful for edge deployment, experimentation, and prototyping where model size is a critical constraint. Released under the Apache 2.0 license.

Chat

Phi 4 Mini Instruct

Microsoft · 3.8B · runs from 2.2 GB

1.1M 764

Microsoft Phi 4 Mini Instruct is a 3.8-billion parameter instruction-tuned model from Microsoft Research's Phi 4 family. It applies the Phi series' data-centric training philosophy to a compact model, delivering strong performance in coding, reasoning, and chat tasks relative to its small footprint. The model runs on consumer GPUs with as little as 4-6GB of VRAM when quantized, making it accessible on mainstream and even entry-level hardware. Released under the MIT license.

Chat · Code

Gemma 3 1B IT

Google · 1.0B · runs from 0.3 GB

1.8M 1.0K

Google Gemma 3 1B IT is a 1-billion parameter instruction-tuned model from Google's Gemma 3 family. It is an ultra-compact text-only chat model designed for deployment on minimal hardware, including low-VRAM GPUs and edge devices. The model handles basic conversational tasks, simple instruction following, and lightweight text generation. It can run on virtually any modern GPU and even on CPU-only setups with acceptable latency. Released under the Gemma license.

Chat

Mistral 7B Instruct v0.2

Mistral AI · 7.2B · runs from 3.6 GB

1.4M 3.2K

Mistral 7B Instruct v0.2 is a 7.2B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 31B IT Qat Q4 0 Unquantized

Google · 32.7B · runs from 15.5 GB

5.6K 24

Gemma 4 31B IT Qat Q4 0 Unquantized is a 32.7B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

DeepSeek R1 0528

DeepSeek · 684.5B · runs from 192.1 GB

6.8M 2.5K

DeepSeek R1 0528 is an updated release of the R1 reasoning model, incorporating improvements to training and inference that sharpen its performance on complex multi-step problems. It retains the same 684.5 billion parameter mixture-of-experts architecture as the original R1, with approximately 37 billion parameters active per forward pass. This revision addresses several edge cases where the original R1 struggled, delivering more consistent reasoning chains and fewer hallucinations on difficult math and coding tasks. Hardware requirements remain identical to the original R1, so users already set up to run the first version can swap in the 0528 weights with no changes to their infrastructure.

Chat · Reasoning

TinyLlama 1.1B Chat v1.0

TinyLlama · 1.1B · runs from 0.8 GB

2.0M 1.6K

TinyLlama 1.1B Chat is a 1.1-billion parameter chat model built on the Llama 2 architecture and trained on approximately 3 trillion tokens, an unusually large dataset for a model of its size. The TinyLlama project demonstrated that small models can achieve strong performance when given sufficient training compute, making it a standout in the sub-2B parameter class. The Chat variant is fine-tuned for conversational use and runs on virtually any modern GPU, including entry-level cards with 4GB of VRAM or less. It is a practical choice for lightweight local inference, edge deployment, and experimentation where hardware resources are limited.

Chat

Qwen3.5 122B A10B

Alibaba · 125.1B · runs from 53.5 GB

791.7K 568

Qwen3.5 122B A10B is a 125.1B-parameter open language model from Alibaba in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Mistral 7B Instruct v0.3

Mistral AI · 7.2B · runs from 2.7 GB

3.1M 2.6K

Mistral 7B Instruct v0.3 is the latest instruction-tuned release of Mistral AI's original 7-billion-parameter model, delivering meaningful improvements in instruction following, function calling, and multilingual support over its predecessors. With an extended 32K-token vocabulary and refined chat capabilities, v0.3 remains one of the most capable sub-10B models available. At 7.2 billion parameters it sits comfortably in the sweet spot for local inference: the weights need roughly 14 GB at FP16, but 4-bit quantization brings the model within reach of GPUs with 6–8 GB of VRAM. It is an excellent default choice for anyone getting started with local LLMs who wants strong conversational performance without heavy hardware.

Chat

Gemma 3 27B IT

Google · 27.4B · runs from 8.3 GB

1.4M 2.0K

Google Gemma 3 27B IT is a 27.4-billion parameter multimodal instruction-tuned model from Google's Gemma 3 family. It supports both text and image inputs, making it one of the most capable openly available vision-language models for local inference. The model handles conversational AI, visual question answering, image description, and complex reasoning tasks across modalities. Gemma 3 27B IT requires a GPU with at least 24GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. It uses a dense Transformer architecture with a large context window and benefits from Google's extensive pretraining pipeline. Released under the Gemma license.

Vision

Mistral Nemo Instruct 2407

Mistral AI · 12.2B · runs from 4.8 GB

451.4K 1.7K

Mistral Nemo Instruct 2407 is a 12.2B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Meta Llama 3.1 70B Instruct

Meta · 70.6B · runs from 21.3 GB

630.4K 924

Meta Llama 3.1 70B Instruct is a 70.6B-parameter open language model from Meta in the Llama 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi 4

Microsoft · 14.7B · runs from 5.1 GB

814.3K 2.3K

Microsoft Phi 4 is a 14-billion parameter language model from Microsoft Research's Phi series, designed to deliver strong reasoning, mathematical, and coding performance at an efficient size. Phi 4 continues the Phi family's focus on maximizing capability per parameter through high-quality training data curation, achieving benchmark scores that rival much larger models on reasoning and STEM tasks. The model runs well on consumer GPUs with 12-16GB of VRAM in quantized formats. It excels at mathematical problem solving, code generation, and structured reasoning. Released under the MIT license.

Chat · Math · Code

Mistral Small 24B Instruct 2501

Mistral AI · 23.6B · runs from 7.8 GB

56.8K 957

Mistral Small 24B Instruct is Mistral AI's January 2025 release targeting the mid-range parameter sweet spot. At 24 billion parameters it sits between lightweight 7B models and heavier 70B-class offerings, delivering strong instruction-following, reasoning, and coding performance without demanding top-tier hardware. This model fits comfortably on a single GPU with 16–24 GB of VRAM at common quantization levels, making it an attractive option for users with cards like the RTX 4090 or RTX 3090 who want a noticeable step up from 7B models. It strikes an appealing balance between quality and resource requirements for serious local use.

Chat