All LLM Models

Browse 719 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
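
The arithmetic behind those figures can be sketched roughly as parameters times bits per weight. This is a minimal estimate, not the method this catalog uses: the function name `estimate_weight_gb` is hypothetical, the bits-per-weight value for Q4_K_M is an approximate average assumed for illustration, and the result covers weights only (the KV cache and runtime overhead come on top).

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bits per byte.
# The quantized values are approximate averages assumed for illustration;
# actual file sizes vary by model architecture and quantization tooling.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 8-bit weights plus a per-block scale
    "Q4_K_M": 4.85,  # assumed average for a mixed 4-bit scheme
}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

For a 7B model this gives about 14 GB at FP16 and a little over 4 GB at Q4_K_M, in line with the figures above. Longer context windows add KV-cache memory that this sketch ignores.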

MobileLLaMA 1.4B Chat

mtgv · 1.4B · runs from 1.3 GB

MobileLLaMA 1.4B Chat is a 1.4B-parameter open language model from mtgv in the Llama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Karnak 40B v1.0

Applied-Innovation-Center · 40.7B · runs from 17.7 GB

Karnak 40B v1.0 is a 40.7B-parameter open language model from Applied-Innovation-Center. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Moonlight 16B A3B

Moonshot AI · 16.0B · runs from 7.5 GB

Moonlight 16B A3B is a compact Mixture-of-Experts model from Moonshot AI that packs 16 billion total parameters while activating only around 3 billion per token. This sparse design lets it punch well above its active parameter count, delivering strong chat performance for its inference cost. The small active parameter count keeps per-token compute low, so Moonlight runs briskly on modest hardware. All 16 billion parameters still need to be loaded, and at common quantization levels the model fits on GPUs with 8–12 GB of VRAM. It is an appealing choice for users who want the quality of a Mixture-of-Experts model without the heavy memory footprint typically associated with such models.
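
The difference between total and active parameters can be sketched as follows. This is an illustrative estimate under stated assumptions, not a measurement: the function name `moe_footprint` is hypothetical, the 4 bits per weight is an assumed round figure, and the rule of roughly 2 floating-point operations per active parameter per generated token is a common approximation that ignores attention over the context.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> tuple[float, float]:
    """Return (weight memory in GB, approximate GFLOPs per generated token).

    Memory scales with TOTAL parameters, because every expert must be
    resident. Compute scales with ACTIVE parameters, because only the
    routed experts run for each token.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    gflops_per_token = 2 * active_params_b  # assumed ~2 FLOPs per active weight
    return weights_gb, gflops_per_token


if __name__ == "__main__":
    # Figures taken from the description: 16B total, about 3B active.
    moe_gb, moe_gflops = moe_footprint(16.0, 3.0, bits_per_weight=4.0)
    # A dense model of the same size activates every parameter.
    dense_gb, dense_gflops = moe_footprint(16.0, 16.0, bits_per_weight=4.0)
    print(f"MoE:   {moe_gb:.1f} GB weights, ~{moe_gflops:.0f} GFLOPs/token")
    print(f"Dense: {dense_gb:.1f} GB weights, ~{dense_gflops:.0f} GFLOPs/token")
```

Under these assumptions both models need the same weight memory, but the sparse one does a fraction of the per-token work, which is why it generates faster on the same GPU.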

Nemotron Labs Diffusion 8B

NVIDIA · 8.5B · runs from 17.6 GB

Nemotron Labs Diffusion 8B is an 8.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama XLAM 2 8B Fc R

Salesforce · 8B · runs from 4.0 GB

xLAM 2 8B FC-R is an 8-billion-parameter model by Salesforce, specifically optimized for function calling and tool use. Built on the Llama architecture, it is designed to reliably generate structured function call outputs, making it suitable for agentic workflows and applications that require models to interact with external tools and APIs. Unlike general-purpose chat models, xLAM 2 focuses on accurately parsing user intent into structured tool invocations with proper argument formatting. It runs on consumer GPUs with 8 GB or more of VRAM and is a strong choice for developers building local AI agent systems that need reliable function calling.

DialoGPT Small

Microsoft · 176M · runs from 0.1 GB

DialoGPT Small is a 176M-parameter open language model from Microsoft. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Distil Lfm25 Shellper

distil-labs · 354M · runs from 0.5 GB

Distil Lfm25 Shellper is a 354M-parameter open language model from distil-labs. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Ouro 1.4B

ByteDance · 1.4B · runs from 3.6 GB

Ouro 1.4B is a 1.4B-parameter open language model from ByteDance. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Falcon 7B Instruct

TII UAE · 7.2B · runs from 3.4 GB

Falcon 7B Instruct is the instruction-tuned version of TII's Falcon 7B, fine-tuned on a mix of chat and instruction datasets to follow user prompts more reliably. It was among the early open models to show that a well-tuned 7B model could handle conversational tasks, summarization, and basic reasoning without requiring massive hardware. While newer models have since raised the bar, Falcon 7B Instruct remains a lightweight option for users who want a responsive local assistant with modest resource requirements.

Nemotron Cascade 2 30B A3B

NVIDIA · 31.6B · runs from 13.8 GB

Nemotron Cascade 2 30B A3B is a 31.6B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 3 1B Pt

Google · 1B · runs from 0.5 GB

Gemma 3 1B Pt is a 1B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 14B Base

Alibaba · 14.8B · runs from 6.9 GB

Qwen3 14B Base is a 14.8B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 30B A3B Base

Alibaba · 30.5B · runs from 13.4 GB

Qwen3 30B A3B Base is a 30.5B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama Guard 3 1B

Meta · 1.5B · runs from 3.3 GB

Llama Guard 3 1B is a 1.5B-parameter open language model from Meta in the Llama family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

EXAONE 3.0 7.8B Instruct

LGAI-EXAONE · 7.8B · runs from 17.2 GB

EXAONE 3.0 7.8B Instruct is a 7.8B-parameter open language model from LGAI-EXAONE in the EXAONE family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Nandi Mini 150M

FrontiersMind · 153M · runs from 0.6 GB

Nandi Mini 150M is a 153M-parameter open language model from FrontiersMind. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Academic Ds 9B

ByteDance-Seed · 9.4B · runs from 4.5 GB

Academic Ds 9B is a 9.4B-parameter open language model from ByteDance-Seed. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2.5 350M Base

LiquidAI · 354M · runs from 0.5 GB

LFM2.5 350M Base is a 354M-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Aya Expanse 8B

Cohere · 8.0B · runs from 17.7 GB

Aya Expanse 8B is an 8.0B-parameter open language model from Cohere in the Aya family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek LLM 7B Base

DeepSeek · 7B · runs from 4.3 GB

DeepSeek LLM 7B Base is a 7B-parameter open language model from DeepSeek in the DeepSeek family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

H2o Danube3 500M Chat

h2oai · 514M · runs from 0.6 GB

H2o Danube3 500M Chat is a 514M-parameter open language model from h2oai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Ling Mini 2.0

Inclusion AI · 16.3B · runs from 7.3 GB

Ling Mini 2.0 is a 16.3B-parameter open language model from Inclusion AI. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

GPT OSS Safeguard 20B

OpenAI · 21.5B · runs from 9.5 GB

GPT OSS Safeguard 20B is a 21.5B-parameter open language model from OpenAI in the GPT-OSS family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

InternLM2.5 7B Chat

InternLM · 7B · runs from 3.5 GB

InternLM2.5 7B Chat is a 7B-parameter open language model from InternLM in the InternLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Seed OSS 36B Instruct

ByteDance-Seed · 36.2B · runs from 15.9 GB

Seed OSS 36B Instruct is a 36.2B-parameter open language model from ByteDance-Seed in the Seed-OSS family. It supports a context window of up to 524,288 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek MoE 16B Base

DeepSeek · 16.4B · runs from 7.7 GB

DeepSeek MoE 16B Base is a 16.4B-parameter open language model from DeepSeek in the DeepSeek family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Tongyi DeepResearch 30B A3B

Alibaba-NLP · 30.5B · runs from 13.4 GB

Tongyi DeepResearch 30B A3B is a 30.5B-parameter open language model from Alibaba-NLP. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek R1 Distill Qwen 1.5B

litert-community · 1.5B · runs from 0.7 GB

DeepSeek R1 Distill Qwen 1.5B is a 1.5B-parameter open language model from litert-community in the DeepSeek R1 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3.6 27B MTPLX Optimized Speed

Youssofal · 4.7B · runs from 2.7 GB

Qwen3.6 27B MTPLX Optimized Speed is a 4.7B-parameter open language model from Youssofal in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Nemotron Cascade 8B

NVIDIA · 8B · runs from 4 GB

Nemotron Cascade 8B is an 8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.