All LLM Models

Browse 739 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
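The 7B example above can be reproduced with simple arithmetic. This is a rough sketch that counts model weights only, not the KV cache or runtime overhead. The function name `weight_vram_gb` is hypothetical, and the 4.5 bits per weight used for Q4_K_M is an assumed average, not an exact figure for any particular file.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# FP16 stores each weight in 16 bits (2 bytes).
fp16_gb = weight_vram_gb(7, 16)

# Assumption: Q4_K_M averages roughly 4.5 bits per weight.
q4_gb = weight_vram_gb(7, 4.5)

print(f"7B at FP16: ~{fp16_gb:.1f} GB")
print(f"7B at Q4_K_M (assumed 4.5 bits/weight): ~{q4_gb:.1f} GB")
```

Actual usage is typically somewhat higher than these figures, since context length and the inference runtime add their own memory on top of the weights.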

Model List

Phi 4 Reasoning

Microsoft · 14.7B · runs from 4.8 GB

8.2K 227

Phi 4 Reasoning is a 14.7B-parameter open language model from Microsoft in the Phi 4 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math · Code · Reasoning

Huihui MiniCPM5 1B Abliterated

huihui-ai · 1.1B · runs from 0.6 GB

124 6

Huihui MiniCPM5 1B Abliterated is a 1.1B-parameter open language model from huihui-ai in the MiniCPM family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

ERNIE 4.5 21B A3B PT

Baidu · 21B · runs from 6.2 GB

25.6K 165

ERNIE 4.5 21B A3B PT is a 21B-parameter open language model from Baidu in the ERNIE family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

INTELLECT 1 Instruct

PrimeIntellect · 10.2B · runs from 3.7 GB

248 125

INTELLECT 1 Instruct is a 10.2B-parameter open language model from PrimeIntellect. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Mellum2 12B A2.5B Instruct

JetBrains · 12.1B · runs from 5.5 GB

990 64

Mellum2 12B A2.5B Instruct is a 12.1B-parameter open language model from JetBrains in the Mellum family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Apertus 8B Instruct 2509

swiss-ai · 8B · runs from 2.8 GB

117.9K 439

Apertus 8B Instruct is an open-source instruction-tuned model from Swiss AI, a collaborative research initiative. Built on an 8 billion parameter base, it emphasizes transparency, open data, and European AI sovereignty. For local users, it delivers solid general-purpose chat and instruction-following in a standard 8B footprint that runs well on consumer GPUs with 8 to 10 GB of VRAM, making it a practical choice for those who value open, community-driven model development.

Chat

Starcoder2 7B

BigCode · 7.2B · runs from 3.5 GB

12.3K 215

Starcoder2 7B is a 7.2B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Qwen3.5 9B Claude 4.6 Opus Reasoning Distilled

Jackrong · 9.7B · runs from 4.7 GB

5.0K 29

Qwen3.5 9B Claude 4.6 Opus Reasoning Distilled is a 9.7B-parameter open language model from Jackrong in the Qwen 3.5 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Starcoder2 3B

BigCode · 3.0B · runs from 1.6 GB

123.0K 219

Starcoder2 3B is a 3.0B-parameter open language model from BigCode in the StarCoder family. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Carnice V1 9B Hermes Agent Stage2 Merged

kai-os · 9.0B · runs from 4.4 GB

2.1K 183

Carnice V1 9B Hermes Agent Stage2 Merged is a 9.0B-parameter open language model from kai-os in the Hermes family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Functions · Reasoning

BioMistral 7B

BioMistral · 7B · runs from 3.5 GB

102.4K 506

BioMistral 7B is a 7B-parameter open language model from BioMistral in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2.5 1.2B JP 202606

LiquidAI · 1.2B · runs from 0.9 GB

2.7K 61

LFM2.5 1.2B JP 202606 is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Mini 4B Instruct

NVIDIA · 4B · runs from 1.8 GB

473.9K 182

Nemotron Mini 4B Instruct is a 4B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi 3.5 MoE Instruct

Microsoft · 41.9B · runs from 12.1 GB

123.9K 574

Phi 3.5 MoE Instruct is a 41.9B-parameter open language model from Microsoft in the Phi 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Hermes 3 Llama 3.1 70B

Nous Research · 70.6B · runs from 20.4 GB

131.8K 126

Hermes 3 Llama 3.1 70B is a 70.6B-parameter open language model from Nous Research in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Roleplay

Granite 3.3 8B Instruct

IBM · 8.2B · runs from 2.9 GB

70.4K 157

Granite 3.3 8B Instruct is an 8.2B-parameter open language model from IBM in the Granite family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama Guard 3 8B

Meta · 8.0B · runs from 2.4 GB

53.9K 299

Meta Llama Guard 3 8B is an 8-billion parameter safety classifier model built on the Llama 3.1 architecture. Unlike general-purpose chat models, Llama Guard is specifically designed to classify whether prompts or responses contain unsafe content across categories such as violence, sexual content, criminal planning, and other policy violations. The model is intended to be used as a moderation layer in LLM-based applications, providing input and output safety filtering. It follows a taxonomy-based classification approach and can be customized for different safety policies. Released under the Llama 3.1 Community License.

Chat

WhiteRabbitNeo 13B V1

WhiteRabbitNeo · 13B · runs from 7.5 GB

2.9K 459

WhiteRabbitNeo 13B V1 is a 13B-parameter open language model from WhiteRabbitNeo. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Laguna XS.2

poolside · 33.4B · runs from 14.6 GB

196.2K 292

Laguna XS.2 is a 33.4B-parameter open language model from poolside. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3.6 35B A3B DFlash

z-lab · 35B · runs from 15.2 GB

59.8K 238

Qwen3.6 35B A3B DFlash is a 35B-parameter open language model from z-lab in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Meta Llama 3 8B

Meta · 8.0B · runs from 3.8 GB

1.2M 6.6K

Meta Llama 3 8B is an 8-billion-parameter base (pretrained) language model from Meta's Llama 3 release. As a base model, it is not fine-tuned for chat or instructions and is intended for further fine-tuning, research, or as a foundation for custom applications. It uses grouped-query attention and was trained on over 15 trillion tokens. Llama 3 8B supports an 8K-token context window and delivers strong benchmark performance for its size across language understanding, reasoning, and coding tasks. It is released under the Meta Llama 3 Community License and runs efficiently on consumer GPUs with 8 GB or more of VRAM.

Chat

Ternary Bonsai 1.7B Unpacked

prism-ml · 1.7B · runs from 1.3 GB

1.2K 5

Ternary Bonsai 1.7B Unpacked is a 1.7B-parameter open language model from prism-ml. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Hermes 2 Theta Llama 3 70B

Nous Research · 70.6B · runs from 20.4 GB

1.3K 82

Hermes 2 Theta Llama 3 70B is a 70.6B-parameter open language model from Nous Research in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

GPT-NeoX 20B

EleutherAI · 20.7B · runs from 6.3 GB

649.3K 584

GPT-NeoX 20B is a 20.7B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

VulnLLM R 7B

UCSB-SURFI · 7.6B · runs from 2.5 GB

59.7K 179

VulnLLM R 7B is a security-focused model developed by UCSB-SURFI, built on the Qwen2.5-7B base and fine-tuned specifically for vulnerability analysis and security reasoning. With 7.6 billion parameters, it targets tasks like identifying code vulnerabilities, explaining security flaws, and reasoning about attack vectors. This model fills a niche for security researchers and developers who want a locally hosted assistant for code auditing and vulnerability assessment without sending sensitive code to external APIs. Its specialized training gives it an edge over general-purpose models on security-related tasks, though it is not a replacement for professional security tools. It runs on consumer GPUs with 8 GB of VRAM at typical quantization levels.

Chat · Reasoning

Llama 3.1 70B LatamGPT SFT 1.0

latam-gpt · 70.6B · runs from 24.8 GB

544 24

Llama 3.1 70B LatamGPT SFT 1.0 is a 70.6B-parameter open language model from latam-gpt in the Llama 3 family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

GPT-2

OpenAI · 137M · runs from 0.1 GB

13.2M 3.3K

GPT-2 is the landmark 2019 language model from OpenAI that helped ignite widespread interest in large-scale text generation. At only 137 million parameters it is tiny by modern standards, but it holds an important place in AI history as the model that was initially deemed too dangerous to release in full. Today GPT-2 runs effortlessly on virtually any hardware, including CPUs, making it ideal for educational purposes, experimentation, and understanding transformer fundamentals. It should not be expected to match the quality of modern instruction-tuned models, but it remains a useful teaching tool and conversation starter.

Chat

Llama 3.1 Tulu 3 70B DPO

Allen AI · 70.6B · runs from 20.4 GB

1.5K 10

Llama 3.1 Tulu 3 70B DPO is a 70.6B-parameter open language model from Allen AI in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 7B

Google · 8.5B · runs from 4.0 GB

26.2K 3.4K

Google Gemma 7B is a base (pretrained) model from the original Gemma generation, Google's first openly available family of language models. Despite the "7B" name, it has about 8.5 billion parameters in total. While superseded by Gemma 2 and Gemma 3 in terms of benchmark performance, the original Gemma 7B remains a solid foundation model and a useful reference point in the evolution of Google's open models. Released under the Gemma license.

Chat

Yi 34B Chat

01.AI · 34.4B · runs from 15.0 GB

41.1K 356

Yi 34B Chat is a 34.4B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat