All LLM Models

Browse 593 LLMs with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
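The arithmetic behind that example can be sketched in a few lines of Python. This is a rough, weight-only estimate and not the method this catalog uses. The function name `estimate_weight_vram_gb` and the bits-per-weight table are hypothetical, and the 4.5 bits per weight assumed for Q4_K_M is an approximation chosen to match the ~4 GB figure above. KV cache, activations, and runtime overhead are ignored.

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8.
# Ignores KV cache, activations, and runtime overhead.

# Assumed effective bits per weight. FP16 is exactly 16 bits;
# the Q4_K_M value is an approximation for illustration, not a spec.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.5,
}


def estimate_weight_vram_gb(params_billions: float, quant: str) -> float:
    """Return the approximate GB (10^9 bytes) needed to hold the weights alone."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in ("FP16", "Q4_K_M"):
        gb = estimate_weight_vram_gb(7.0, quant)
        print(f"7B @ {quant}: ~{gb:.1f} GB")
```

Under these assumptions a 7B model comes out at about 14 GB for FP16 and just under 4 GB for Q4_K_M. The "runs from" figures in the listings may differ from this estimate, since they may account for overhead or use a different quantization level.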

EuroLLM 9B Instruct 2512

utter-project · 9.2B · runs from 4.5 GB

EuroLLM 9B Instruct 2512 is a 9.2B-parameter open language model from utter-project. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Quasar 3B A1B Preview

silx-ai · 2.9B · runs from 6.5 GB

Quasar 3B A1B Preview is a 2.9B-parameter open language model from silx-ai. It supports a context window of up to 16,384 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3 Korean Bllossom 8B

MLP-KTLim · 8.0B · runs from 4.0 GB

Llama 3 Korean Bllossom 8B is an 8.0B-parameter open language model from MLP-KTLim in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Salamandra 2B Instruct

BSC-LT · 2.3B · runs from 1.7 GB

Salamandra 2B Instruct is a 2.3B-parameter open language model from BSC-LT. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Carbon 3B

HuggingFaceBio · 3.5B · runs from 1.9 GB

Carbon 3B is a 3.5B-parameter open language model from HuggingFaceBio. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Huihui Qwen3 8B Abliterated v2

huihui-ai · 8.2B · runs from 4.1 GB

Huihui Qwen3 8B Abliterated v2 is an 8.2B-parameter open language model from huihui-ai in the Qwen 3 family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Dolphin 2.9 Llama3 8B

dphn · 8.0B · runs from 4.0 GB

Dolphin 2.9 Llama3 8B is an 8.0B-parameter open language model from dphn in the Llama 3 family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3.5 4B PTBR

lucasmg09 · 4B · runs from 1.5 GB

Qwen3.5 4B PTBR is a 4B-parameter open language model from lucasmg09 in the Qwen 3.5 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Txgemma 2B Predict

Google · 2.6B · runs from 1.2 GB

Txgemma 2B Predict is a 2.6B-parameter open language model from Google in the Gemma 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Mamba 2.8B HF

State Spaces · 2.8B · runs from 1.3 GB

Mamba 2.8B HF is a 2.8B-parameter open language model from State Spaces. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

BitCPM CANN 8B

openbmb · 8B · runs from 3.8 GB

BitCPM CANN 8B is an 8B-parameter open language model from openbmb. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

SILMA 9B Instruct v1.0

silma-ai · 9.2B · runs from 4.8 GB

SILMA 9B Instruct v1.0 is a 9.2B-parameter open language model from silma-ai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Sarvam 1

sarvamai · 2.5B · runs from 1.6 GB

Sarvam 1 is a 2.5B-parameter open language model from sarvamai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Sarashina2.2 3B Instruct v0.1

sbintuitions · 3.4B · runs from 2.1 GB

Sarashina2.2 3B Instruct v0.1 is a 3.4B-parameter open language model from sbintuitions. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

TinyLlama 1.1B Chat V0.6

TinyLlama · 1.1B · runs from 0.8 GB

TinyLlama 1.1B Chat V0.6 is a 1.1B-parameter open language model from TinyLlama in the TinyLlama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Neural Chat 7B v3 3

Intel · 7.2B · runs from 3.6 GB

Neural Chat 7B v3 3 is a 7.2B-parameter open language model from Intel. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Baguettotron

PleIAs · 321M · runs from 0.6 GB

Baguettotron is a 321M-parameter open language model from PleIAs. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepHat V1 7B

DeepHat · 7.6B · runs from 3.6 GB

DeepHat V1 7B is a 7.6B-parameter open language model from DeepHat. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek Coder v2 Lite Base

DeepSeek · 15.7B · runs from 7.4 GB

DeepSeek Coder v2 Lite Base is a 15.7B-parameter open language model from DeepSeek in the DeepSeek Coder family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 26B A4B IT DFlash

z-lab · 26B · runs from 11.4 GB

Gemma 4 26B A4B IT DFlash is a 26B-parameter open language model from z-lab in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Pollux 4B Judge

ai-forever · 4.0B · runs from 2.2 GB

Pollux 4B Judge is a 4.0B-parameter open language model from ai-forever. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Shieldgemma 2B

Google · 2.6B · runs from 1.2 GB

Shieldgemma 2B is a 2.6B-parameter open language model from Google in the Gemma 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

GigaChat 20B A3B Base

ai-sage · 20B · runs from 9.0 GB

GigaChat 20B A3B Base is a 20B-parameter open language model from ai-sage. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama Krikri 8B Instruct

ilsp · 8.2B · runs from 4.0 GB

Llama Krikri 8B Instruct is an 8.2B-parameter open language model from ilsp in the Llama family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Deeplm 108M

samcheng0 · 108M · runs from 0.2 GB

Deeplm 108M is a 108M-parameter open language model from samcheng0. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vaultgemma 1B

Google · 1.0B · runs from 2.3 GB

Vaultgemma 1B is a 1.0B-parameter open language model from Google in the Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 4 12B IT Abliterated Uncensored

OpenYourMind · 12.0B · runs from 6.1 GB

Gemma 4 12B IT Abliterated Uncensored is a 12.0B-parameter open language model from OpenYourMind in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Mistral 7B v0.2

mistral-community · 7.2B · runs from 3.6 GB

Mistral 7B v0.2 is a 7.2B-parameter open language model from mistral-community in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Josiefied Qwen3 8B Abliterated V1

Goekdeniz-Guelmez · 8.2B · runs from 4.1 GB

Josiefied Qwen3 8B Abliterated V1 is an 8.2B-parameter open language model from Goekdeniz-Guelmez in the Qwen 3 family. It supports a context window of up to 40,960 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

PLLuM 12B Chat

CYFRAGOVPL · 12.2B · runs from 5.9 GB

PLLuM 12B Chat is a 12.2B-parameter open language model from CYFRAGOVPL. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.