All LLM Models

Browse 719 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB for its weights at FP16 (2 bytes per parameter) but only ~4 GB at Q4_K_M quantization.
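The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate of weight memory only, not the method this site uses. The function name `weight_memory_gb` and the table `BITS_PER_WEIGHT` are made up for illustration, and the average bits per weight for Q4_K_M is an approximation that varies by model. The KV cache and runtime overhead come on top of the weights.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
# Bits-per-weight values for the quantized formats are approximate
# averages (an assumption); real files vary by model architecture.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 8-bit weights plus one 16-bit scale per 32-weight block
    "Q4_K_M": 4.85,  # assumed average; mixed 4-bit and 6-bit blocks
}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Return approximate weight size in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{weight_memory_gb(7, quant):.1f} GB")
```

Treat the result as a lower bound: long context windows can add several GB of KV cache on top of the weights.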

Model List

MobileLLaMA 1.4B Chat

mtgv · 1.4B · runs from 1.3 GB

82.4K 21

MobileLLaMA 1.4B Chat is a 1.4B-parameter open language model from mtgv in the Llama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Karnak 40B v1.0

Applied-Innovation-Center · 40.7B · runs from 17.7 GB

81.0K 36

Karnak 40B v1.0 is a 40.7B-parameter open language model from Applied-Innovation-Center. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Moonlight 16B A3B

Moonshot AI · 16.0B · runs from 7.5 GB

72.7K 109

Moonlight 16B A3B is a compact Mixture-of-Experts model from Moonshot AI that packs 16 billion total parameters while activating only around 3 billion per token. This sparse design lets it punch well above its active parameter count, delivering strong chat performance for its inference cost. The small active parameter count keeps generation fast on modest hardware, and at common quantization levels the model fits on GPUs with 8–12 GB of VRAM. It is an appealing choice for users who want the benefits of an MoE design without the heavy memory footprint typically associated with mixture models.

Chat

Nemotron Labs Diffusion 8B

NVIDIA · 8.5B · runs from 17.6 GB

70.3K 49

Nemotron Labs Diffusion 8B is an 8.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama XLAM 2 8B Fc R

Salesforce · 8B · runs from 4.0 GB

64.1K 59

xLAM 2 8B FC-R is an 8-billion-parameter model by Salesforce, specifically optimized for function calling and tool use. Built on the Llama architecture, it is designed to reliably generate structured function call outputs, making it suitable for agentic workflows and applications that require models to interact with external tools and APIs. Unlike general-purpose chat models, xLAM 2 focuses on accurately parsing user intent into structured tool invocations with proper argument formatting. It runs on consumer GPUs with 8 GB or more of VRAM and is a strong choice for developers building local AI agent systems that need reliable function-calling capabilities.

Chat · Functions

DialoGPT Small

Microsoft · 176M · runs from 0.1 GB

57.5K 146

DialoGPT Small is a 176M-parameter open language model from Microsoft. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Distil Lfm25 Shellper

distil-labs · 354M · runs from 0.5 GB

57.0K 11

Distil Lfm25 Shellper is a 354M-parameter open language model from distil-labs. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Functions

Ouro 1.4B

ByteDance · 1.4B · runs from 3.6 GB

54.2K 98

Ouro 1.4B is a 1.4B-parameter open language model from ByteDance. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Falcon 7B Instruct

TII UAE · 7.2B · runs from 3.4 GB

50.8K 1.0K

Falcon 7B Instruct is the instruction-tuned version of TII's Falcon 7B, fine-tuned on a mix of chat and instruction datasets to follow user prompts more reliably. It was among the early open models to show that a well-tuned 7B model could handle conversational tasks, summarization, and basic reasoning without requiring massive hardware. While newer models have since raised the bar, Falcon 7B Instruct remains a lightweight option for users who want a responsive local assistant with modest resource requirements.

Chat

Nemotron Cascade 2 30B A3B

NVIDIA · 31.6B · runs from 13.8 GB

49.3K 505

Nemotron Cascade 2 30B A3B is a 31.6B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Gemma 3 1B Pt

Google · 1000M · runs from 0.5 GB

47.6K 196

Gemma 3 1B Pt is a 1000M-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 14B Base

Alibaba · 14.8B · runs from 6.9 GB

45.2K 50

Qwen3 14B Base is a 14.8B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen3 30B A3B Base

Alibaba · 30.5B · runs from 13.4 GB

44.9K 70

Qwen3 30B A3B Base is a 30.5B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama Guard 3 1B

Meta · 1.5B · runs from 3.3 GB

44.9K 109

Llama Guard 3 1B is a 1.5B-parameter open language model from Meta in the Llama family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

EXAONE 3.0 7.8B Instruct

LGAI-EXAONE · 7.8B · runs from 17.2 GB

41.7K 420

EXAONE 3.0 7.8B Instruct is a 7.8B-parameter open language model from LGAI-EXAONE in the EXAONE family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nandi Mini 150M

FrontiersMind · 153M · runs from 0.6 GB

40.6K 139

Nandi Mini 150M is a 153M-parameter open language model from FrontiersMind. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Academic Ds 9B

ByteDance-Seed · 9.4B · runs from 4.5 GB

40.5K 16

Academic Ds 9B is a 9.4B-parameter open language model from ByteDance-Seed. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

LFM2.5 350M Base

LiquidAI · 354M · runs from 0.5 GB

39.9K 14

LFM2.5 350M Base is a 354M-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Aya Expanse 8B

Cohere · 8.0B · runs from 17.7 GB

39.4K 434

Aya Expanse 8B is an 8.0B-parameter open language model from Cohere in the Aya family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Deepseek Llm 7B Base

DeepSeek · 7B · runs from 4.3 GB

39.1K 138

Deepseek Llm 7B Base is a 7B-parameter open language model from DeepSeek in the DeepSeek family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

H2o Danube3 500M Chat

h2oai · 514M · runs from 0.6 GB

38.8K 42

H2o Danube3 500M Chat is a 514M-parameter open language model from h2oai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Ling Mini 2.0

Inclusion AI · 16.3B · runs from 7.3 GB

38.6K 195

Ling Mini 2.0 is a 16.3B-parameter open language model from Inclusion AI. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

GPT OSS Safeguard 20B

OpenAI · 21.5B · runs from 9.5 GB

38.1K 233

GPT OSS Safeguard 20B is a 21.5B-parameter open language model from OpenAI in the GPT-OSS family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Internlm2 5 7B Chat

InternLM · 7B · runs from 3.5 GB

37.8K 200

Internlm2 5 7B Chat is a 7B-parameter open language model from InternLM in the InternLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Seed OSS 36B Instruct

ByteDance-Seed · 36.2B · runs from 15.9 GB

37.2K 502

Seed OSS 36B Instruct is a 36.2B-parameter open language model from ByteDance-Seed in the Seed-OSS family. It supports a context window of up to 524,288 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Deepseek Moe 16B Base

DeepSeek · 16.4B · runs from 7.7 GB

36.1K 149

Deepseek Moe 16B Base is a 16.4B-parameter open language model from DeepSeek in the DeepSeek family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Tongyi DeepResearch 30B A3B

Alibaba-NLP · 30.5B · runs from 13.4 GB

35.9K 814

Tongyi DeepResearch 30B A3B is a 30.5B-parameter open language model from Alibaba-NLP. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DeepSeek R1 Distill Qwen 1.5B

litert-community · 1.5B · runs from 0.7 GB

32.8K 35

DeepSeek R1 Distill Qwen 1.5B is a 1.5B-parameter open language model from litert-community in the DeepSeek R1 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Qwen3.6 27B MTPLX Optimized Speed

Youssofal · 4.7B · runs from 2.7 GB

32.8K 42

Qwen3.6 27B MTPLX Optimized Speed is a 4.7B-parameter open language model from Youssofal in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Nemotron Cascade 8B

NVIDIA · 8B · runs from 4 GB

31.7K 65

Nemotron Cascade 8B is an 8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning