All LLM Models

Browse 529 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB at FP16 (2 bytes per weight) but only ~4 GB at Q4_K_M quantization.
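The arithmetic behind those figures can be sketched roughly as follows. This is a minimal estimate of weight memory only, and it ignores the KV cache and runtime overhead. The function name `weight_memory_gb` and the ~4.5 bits-per-weight average for Q4_K_M are assumptions for illustration, not values taken from this page; real Q4_K_M files vary by model.

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8.
# ASSUMPTION: ~4.5 bits/weight for Q4_K_M is an illustrative average;
# actual files mix block formats, so treat the result as approximate.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.5,  # assumed average, not an exact specification
}


def weight_memory_gb(num_params: float, quant: str) -> float:
    """Return approximate weight memory in GB (10^9 bytes) for a model."""
    bits = BITS_PER_WEIGHT[quant]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in ("FP16", "Q4_K_M"):
        print(f"7B @ {quant}: ~{weight_memory_gb(7e9, quant):.1f} GB")
```

Actual requirements will typically be somewhat higher than this estimate, since the context window and the inference runtime also consume memory.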

Model List

SILMA 9B Instruct v1.0

silma-ai · 9.2B · runs from 4.8 GB

5.5K 83

SILMA 9B Instruct v1.0 is a 9.2B-parameter open language model from silma-ai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Sarvam 1

sarvamai · 2.5B · runs from 1.6 GB

5.4K 139

Sarvam 1 is a 2.5B-parameter open language model from sarvamai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Sarashina2.2 3B Instruct v0.1

sbintuitions · 3.4B · runs from 2.1 GB

5.0K 38

Sarashina2.2 3B Instruct v0.1 is a 3.4B-parameter open language model from sbintuitions. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

TinyLlama 1.1B Chat V0.6

TinyLlama · 1.1B · runs from 0.8 GB

4.9K 113

TinyLlama 1.1B Chat V0.6 is a 1.1B-parameter open language model from TinyLlama in the TinyLlama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Neural Chat 7B v3 3

Intel · 7.2B · runs from 3.6 GB

4.8K 83

Neural Chat 7B v3 3 is a 7.2B-parameter open language model from Intel. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math

Baguettotron

PleIAs · 321M · runs from 0.6 GB

4.8K 240

Baguettotron is a 321M-parameter open language model from PleIAs. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DeepHat V1 7B

DeepHat · 7.6B · runs from 3.6 GB

4.8K 150

DeepHat V1 7B is a 7.6B-parameter open language model from DeepHat. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

DeepSeek Coder v2 Lite Base

DeepSeek · 15.7B · runs from 7.4 GB

4.8K 105

DeepSeek Coder v2 Lite Base is a 15.7B-parameter open language model from DeepSeek in the DeepSeek Coder family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Pollux 4B Judge

ai-forever · 4.0B · runs from 2.2 GB

4.7K 4

Pollux 4B Judge is a 4.0B-parameter open language model from ai-forever. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Shieldgemma 2B

Google · 2.6B · runs from 1.2 GB

4.6K 122

Shieldgemma 2B is a 2.6B-parameter open language model from Google in the Gemma 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama Krikri 8B Instruct

ilsp · 8.2B · runs from 4.0 GB

4.3K 32

Llama Krikri 8B Instruct is an 8.2B-parameter open language model from ilsp in the Llama family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Deeplm 108M

samcheng0 · 108M · runs from 0.2 GB

4.3K 5

Deeplm 108M is a 108M-parameter open language model from samcheng0. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Vaultgemma 1B

Google · 1.0B · runs from 2.3 GB

4.2K 240

Vaultgemma 1B is a 1.0B-parameter open language model from Google in the Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 4 12B IT Abliterated Uncensored

OpenYourMind · 12.0B · runs from 6.1 GB

4.1K 48

Gemma 4 12B IT Abliterated Uncensored is a 12.0B-parameter open language model from OpenYourMind in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Mistral 7B v0.2

mistral-community · 7.2B · runs from 3.6 GB

4.1K 229

Mistral 7B v0.2 is a 7.2B-parameter open language model from mistral-community in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Josiefied Qwen3 8B Abliterated V1

Goekdeniz-Guelmez · 8.2B · runs from 4.1 GB

4.0K 205

Josiefied Qwen3 8B Abliterated V1 is an 8.2B-parameter open language model from Goekdeniz-Guelmez in the Qwen 3 family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

PLLuM 12B Chat

CYFRAGOVPL · 12.2B · runs from 5.9 GB

4.0K 7

PLLuM 12B Chat is a 12.2B-parameter open language model from CYFRAGOVPL. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

II Medical 8B

Intelligent-Internet · 8.2B · runs from 4.1 GB

3.9K 211

II Medical 8B is an 8.2B-parameter open language model from Intelligent-Internet. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Polyglot Ko 1.3B

EleutherAI · 1.4B · runs from 0.7 GB

3.9K 92

Polyglot Ko 1.3B is a 1.4B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Schematron 3B

inference-net · 3B · runs from 1.8 GB

3.9K 324

Schematron 3B is a 3B-parameter open language model from inference-net. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

MythoMax L2 13B

Gryphe · 13B · runs from 7.5 GB

3.9K 388

MythoMax L2 13B is a 13B-parameter open language model from Gryphe. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Fanar 1 9B Instruct

QCRI · 8.8B · runs from 4.7 GB

3.9K 33

Fanar 1 9B Instruct is an 8.8B-parameter open language model from QCRI. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Gemma 3n E2B IT Litert Lm

Google · 2B · runs from 0.9 GB

3.8K 439

Gemma 3n E2B IT Litert Lm is a 2B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Orchestrator 8B

NVIDIA · 8.2B · runs from 4.1 GB

3.8K 580

Nemotron Orchestrator 8B is an 8.2B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Bella Bartender 8B Llama3.1

juiceb0xc0de · 8.0B · runs from 3.0 GB

3.7K 5

Bella Bartender 8B Llama3.1 is an 8.0B-parameter open language model from juiceb0xc0de in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

KONI Llama3.1 8B Instruct 20241024

KISTI-KONI · 8.0B · runs from 4.0 GB

3.7K 2

KONI Llama3.1 8B Instruct 20241024 is an 8.0B-parameter open language model from KISTI-KONI in the Llama 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Saul 7B Instruct V1

Equall · 7.2B · runs from 3.6 GB

3.7K 115

Saul 7B Instruct V1 is a 7.2B-parameter open language model from Equall. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Cali 0.1B

Sandroeth · 124M · runs from 0.3 GB

3.6K 5

Cali 0.1B is a 124M-parameter open language model from Sandroeth. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 4B Gemini 3.1 Pro Reasoning Distilled

khazarai · 4B · runs from 2.2 GB

3.6K 2

Qwen3 4B Gemini 3.1 Pro Reasoning Distilled is a 4B-parameter open language model from khazarai in the Qwen 3 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

OpenMath Nemotron 1.5B

NVIDIA · 1.5B · runs from 1.0 GB

3.5K 29

OpenMath Nemotron 1.5B is a 1.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Math