All LLM Models

Browse 739 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

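How much VRAM you need depends mainly on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB at FP16 (2 bytes per weight) but only ~4 GB at Q4_K_M quantization (a little under 5 bits per weight on average). These figures cover the weights only; the context cache and runtime overhead add to them.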
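The arithmetic behind those numbers can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not how the listing computes its figures: the function name `estimate_weight_gb` is hypothetical, the bits-per-weight values for the quantized formats are approximate averages, and context cache and runtime overhead are ignored.

```python
# Approximate average bits stored per weight.
# FP16 is exact; the quantized values are assumptions for illustration.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Weights-only estimate for a 7B-parameter model at each precision.
    for quant in BITS_PER_WEIGHT:
        print(f"7B @ {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

The "runs from" figures in the list below can be lower than a Q4_K_M estimate because they presumably correspond to the smallest quantization available for each model.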

Model List

Qwen3 0.6B Base

Alibaba · 596M · runs from 0.7 GB

478.3K 174

Qwen3 0.6B Base is the smallest pretrained foundation model in Alibaba Cloud's Qwen 3 family, with approximately 600 million parameters. As a base model, it is not tuned for chat or instructions and is intended for fine-tuning, research, and experimentation. Its minimal size makes it suitable for rapid prototyping and resource-constrained training experiments. The model runs on virtually any hardware, including CPU-only setups. It is useful for educational purposes, architecture exploration, and as a compact foundation for task-specific fine-tuning where model size is a primary constraint. Released under the Apache 2.0 license.

Chat

Kappa 20B 131k

eousphoros · 20.9B · runs from 9.3 GB

400 12

Kappa 20B 131k is a 20.9B-parameter open language model from eousphoros. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Falcon Mamba 7B

TII UAE · 7.3B · runs from 2.7 GB

139.5K 243

Falcon Mamba 7B is a 7.3B-parameter open language model from TII UAE in the Falcon family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Internlm3 8B Instruct

InternLM · 8.8B · runs from 3.4 GB

89.5K 232

Internlm3 8B Instruct is an 8.8B-parameter open language model from InternLM in the InternLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Ternary Bonsai 4B Unpacked

prism-ml · 4.0B · runs from 2.2 GB

1.1K 4

Ternary Bonsai 4B Unpacked is a 4.0B-parameter open language model from prism-ml. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 1.7B Base

Alibaba · 1.7B · runs from 1.0 GB

336.3K 65

Qwen3 1.7B Base is a 1.7-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 family. It is a compact base model designed for fine-tuning, research, and custom applications rather than direct conversational use. Its small size makes it accessible for resource-constrained fine-tuning and rapid experimentation. The model can run on virtually any modern GPU and benefits from the improved pretraining data of the Qwen 3 generation. It is suitable as a lightweight foundation for domain-specific fine-tunes and student models in distillation pipelines. Released under the Apache 2.0 license.

Chat

NVIDIA Nemotron Nano 9B v2

NVIDIA · 8.9B · runs from 4.5 GB

308.0K 482

NVIDIA Nemotron Nano 9B v2 is a compact yet capable chat model from NVIDIA, packing 8.9 billion parameters into a size that runs comfortably on a wide range of consumer GPUs. Built on NVIDIA's Nemotron architecture, it delivers strong instruction-following and conversational performance while keeping VRAM requirements modest. This second-generation Nano model reflects NVIDIA's push to make high-quality language models accessible on local hardware. It's an excellent starting point for users who want a responsive, general-purpose assistant without needing top-tier GPU memory.

Chat

Gemma 2B IT

Google · 2.5B · runs from 1.2 GB

63.6K 911

Gemma 2B IT is a 2.5B-parameter open language model from Google in the Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Falcon 40B Instruct

TII UAE · 40B · runs from 12.1 GB

15.6K 1.2K

Falcon 40B Instruct is a 40B-parameter open language model from TII UAE in the Falcon family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Yi 6B

01.AI · 6.1B · runs from 2.9 GB

9.3K 375

Yi 6B is a 6.1B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

TinyLlama 1.1B Intermediate Step 1431k 3T

TinyLlama · 1.1B · runs from 0.8 GB

63.4K 192

TinyLlama 1.1B Intermediate Step 1431k 3T is a 1.1B-parameter open language model from TinyLlama in the TinyLlama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 2B

Google · 2.5B · runs from 1.2 GB

248.0K 1.2K

Gemma 2B is a 2.5B-parameter open language model from Google in the Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.3 Nemotron Super 49B v1.5

NVIDIA · 49.9B · runs from 15.1 GB

56.5K 227

Llama 3.3 Nemotron Super 49B is a 49.9-billion parameter chat model by NVIDIA, built on a modified Llama 3.3 architecture. It occupies a unique size point between the common 70B and 8B tiers, offering strong reasoning and conversational ability while requiring less VRAM than full 70B models. NVIDIA's Nemotron Super training pipeline applies extensive alignment tuning to optimize helpfulness and factual accuracy. At 4-bit quantization the weights alone take roughly 30 GB, so comfortable local inference typically needs 32 GB or more of VRAM, which means a professional workstation card or multiple GPUs. Only the most aggressive quantizations (the listing starts at 15.1 GB) fit a 24 GB consumer card like the RTX 4090.

Chat

DeepSeek v2 Lite Chat

DeepSeek · 15.7B · runs from 5.1 GB

968.2K 141

DeepSeek v2 Lite Chat is a 15.7B-parameter open language model from DeepSeek in the DeepSeek V2 family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 2 9B

Google · 9.2B · runs from 4.2 GB

75.4K 709

Gemma 2 9B is a 9.2B-parameter open language model from Google in the Gemma 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 7B

Alibaba · 7.6B · runs from 3.6 GB

802.3K 291

Qwen2.5 7B is a 7.6B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 72B

Alibaba · 72.7B · runs from 31.0 GB

30.8K 99

Qwen2.5 72B is a 72.7B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

EXAONE 4.0 1.2B

LGAI-EXAONE · 1.3B · runs from 1.0 GB

16.3K 184

EXAONE 4.0 1.2B is a 1.3B-parameter open language model from LGAI-EXAONE in the EXAONE family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2 1.2B RAG

LiquidAI · 1.2B · runs from 0.9 GB

534 122

LFM2 1.2B RAG is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Yi 34B

01.AI · 34.4B · runs from 15.0 GB

9.1K 1.3K

Yi 34B is a 34.4B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Racka 4B

elte-nlp · 4.0B · runs from 1.9 GB

633 20

Racka 4B is a 4.0B-parameter open language model from elte-nlp. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM 135M

Hugging Face · 135M · runs from 0.4 GB

210.8K 260

SmolLM 135M is the original first-generation small language model from Hugging Face, designed to push the boundaries of what is achievable at extremely low parameter counts. With just 135 million parameters, it was a pioneering effort in making capable language models accessible on the most resource-constrained hardware. While the SmolLM2 and SmolLM3 families have since surpassed it in quality, the original SmolLM 135M remains a useful reference point for research and a practical option for ultra-lightweight deployment scenarios where every megabyte of memory counts.

Chat

Supra 50M Reasoning

SupraLabs · 52M · runs from 0.3 GB

3.0K 44

Supra 50M Reasoning is a 52M-parameter open language model from SupraLabs. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Qwen2.5 72B Instruct Abliterated

huihui-ai · 72.7B · runs from 31.9 GB

244.3K 51

An abliterated (uncensored) version of Alibaba's Qwen2.5 72B Instruct, modified by huihui-ai. Abliteration is a technique that removes or weakens the model's built-in refusal mechanisms and safety guardrails, resulting in a model that is more willing to respond to a broader range of prompts without declining. The base Qwen2.5 72B Instruct is one of Alibaba's flagship open models at 72.7 billion parameters. This is a full-precision or minimally modified version of the weights, so running it locally requires substantial VRAM, typically 40GB or more even with quantization applied on top. Users interested in this model should understand that abliterated models lack standard safety filtering and should be used responsibly. The underlying Qwen2.5 72B architecture delivers strong performance across reasoning, coding, writing, and multilingual tasks.

Chat

Bitnet B1.58 2B 4T

Microsoft · 850M · runs from 2.2 GB

8.8K 1.5K

Bitnet B1.58 2B 4T is an 850M-parameter open language model from Microsoft. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Yi 1.5 34B Chat

01.AI · 34.4B · runs from 12.4 GB

18.3K 278

Yi 1.5 34B Chat is a 34.4-billion parameter instruction-tuned model by 01.AI, the Chinese AI lab founded by Kai-Fu Lee. It is a bilingual model with strong performance in both English and Chinese, making it particularly well suited for users who need high-quality generation in either language. Yi 1.5 represents an improved iteration of the Yi model family with enhanced reasoning and coding ability. The 34B size requires a GPU with at least 24 GB of VRAM for quantized inference, placing it within reach of high-end consumer cards like the RTX 4090. Released under the Apache 2.0 license.

Chat

Qwen2.5 14B

Alibaba · 14.8B · runs from 6.8 GB

60.9K 154

Qwen2.5 14B is a 14.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 4B Base

Alibaba · 4.0B · runs from 2.2 GB

758.6K 95

Qwen3 4B Base is a 4.0B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 E2B IT QAT Q4_0 Unquantized Heretic

coder3101 · 5.1B · runs from 2.5 GB

1.5K 4

Gemma 4 E2B IT QAT Q4_0 Unquantized Heretic is a 5.1B-parameter open language model from coder3101 in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Code

Qwen2.5 32B

Alibaba · 32.8B · runs from 14.3 GB

65.7K 178

Qwen2.5 32B is a 32.8B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat