All LLM Models

Browse 671 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
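The arithmetic behind those figures can be sketched as parameters times bits per weight, divided by 8 to get bytes. This is a rough estimate of weight memory only; it leaves out the KV cache and runtime overhead. The bits-per-weight values below are assumptions for illustration (quantized formats store per-block scales, so the averages are not whole numbers), and `weight_memory_gb` is a hypothetical helper, not something this page provides.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# Bits-per-weight values are illustrative assumptions, not figures from this page.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,    # 8-bit weights plus a per-block scale
    "Q4_K_M": 4.8,  # assumed average; real files vary by model
}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), excluding KV cache."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{weight_memory_gb(7.0, quant):.1f} GB")
```

For a 7B model this gives 14.0 GB at FP16 and roughly 4.2 GB at the assumed Q4_K_M average, in line with the figures above. Actual usage is higher once context is allocated.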

Laguna XS.2

poolside · 33.4B · runs from 14.6 GB

Laguna XS.2 is a 33.4B-parameter open language model from poolside. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3.6 35B A3B DFlash

z-lab · 35B · runs from 15.2 GB

Qwen3.6 35B A3B DFlash is a 35B-parameter open language model from z-lab in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Meta Llama 3 8B

Meta · 8.0B · runs from 3.8 GB

Meta Llama 3 8B is an 8-billion-parameter base (pretrained) language model from Meta's Llama 3 release. As a base model, it is not fine-tuned for chat or instructions and is intended for further fine-tuning, research, or as a foundation for custom applications. It uses grouped-query attention and was trained on over 15 trillion tokens. Llama 3 8B supports an 8K-token context window and delivers strong benchmark performance across language understanding, reasoning, and coding tasks for its size. It is released under the Meta Llama 3 Community License and runs efficiently on consumer GPUs with 8 GB or more of VRAM.

Ternary Bonsai 1.7B Unpacked

prism-ml · 1.7B · runs from 1.3 GB

Ternary Bonsai 1.7B Unpacked is a 1.7B-parameter open language model from prism-ml. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

GPT-NeoX 20B

EleutherAI · 20.7B · runs from 6.3 GB

GPT-NeoX 20B is a 20.7B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

VulnLLM R 7B

UCSB-SURFI · 7.6B · runs from 2.5 GB

VulnLLM R 7B is a security-focused model developed by UCSB-SURFI, built on the Qwen2.5-7B base and fine-tuned specifically for vulnerability analysis and security reasoning. With 7.6 billion parameters, it targets tasks like identifying code vulnerabilities, explaining security flaws, and reasoning about attack vectors. This model fills a niche for security researchers and developers who want a locally hosted assistant for code auditing and vulnerability assessment without sending sensitive code to external APIs. Its specialized training gives it an edge over general-purpose models on security-related tasks, though it is not a replacement for professional security tools. It runs on consumer GPUs with 8 GB of VRAM at typical quantization levels.

GPT-2

OpenAI · 137M · runs from 0.1 GB

GPT-2 is the landmark 2019 language model from OpenAI that helped ignite widespread interest in large-scale text generation. At only 137 million parameters, this smallest checkpoint is tiny by modern standards, but the GPT-2 family holds an important place in AI history: OpenAI initially withheld the full-size model over misuse concerns and released it in stages. Today GPT-2 runs effortlessly on virtually any hardware, including CPUs, making it ideal for educational purposes, experimentation, and understanding transformer fundamentals. It should not be expected to match the quality of modern instruction-tuned models, but it remains a useful teaching tool and conversation starter.

Gemma 7B

Google · 8.5B · runs from 4.0 GB

Google Gemma 7B is a base (pretrained) model from the original Gemma generation, Google's first openly available family of language models. Despite the 7B name, it has about 8.5 billion parameters. It represents Google's initial entry into the open-weight LLM space. While Gemma 2 and Gemma 3 have superseded it in benchmark performance, the original Gemma 7B remains a solid foundation model and a useful reference point in the evolution of Google's open models. It is released under the Gemma license.

Yi 34B Chat

01.AI · 34.4B · runs from 15.0 GB

Yi 34B Chat is a 34.4B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 0.6B Base

Alibaba · 596M · runs from 0.7 GB

Qwen3 0.6B Base is the smallest pretrained foundation model in Alibaba Cloud's Qwen 3 family, with approximately 600 million parameters. As a base model, it is not tuned for chat or instructions and is intended for fine-tuning, research, and experimentation. Its minimal size makes it suitable for rapid prototyping and resource-constrained training experiments. The model runs on virtually any hardware, including CPU-only setups. It is useful for educational purposes, architecture exploration, and as a compact foundation for task-specific fine-tuning where model size is a primary constraint. Released under the Apache 2.0 license.

Kappa 20B 131k

eousphoros · 20.9B · runs from 9.3 GB

Kappa 20B 131k is a 20.9B-parameter open language model from eousphoros. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Falcon Mamba 7B

TII UAE · 7.3B · runs from 2.7 GB

Falcon Mamba 7B is a 7.3B-parameter open language model from TII UAE in the Falcon family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

InternLM3 8B Instruct

InternLM · 8.8B · runs from 3.4 GB

InternLM3 8B Instruct is an 8.8B-parameter open language model from InternLM in the InternLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Ternary Bonsai 4B Unpacked

prism-ml · 4.0B · runs from 2.2 GB

Ternary Bonsai 4B Unpacked is a 4.0B-parameter open language model from prism-ml. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 1.7B Base

Alibaba · 1.7B · runs from 1.0 GB

Qwen3 1.7B Base is a 1.7-billion parameter pretrained foundation model from Alibaba Cloud's Qwen 3 family. It is a compact base model designed for fine-tuning, research, and custom applications rather than direct conversational use. Its small size makes it accessible for resource-constrained fine-tuning and rapid experimentation. The model can run on virtually any modern GPU and benefits from the improved pretraining data of the Qwen 3 generation. It is suitable as a lightweight foundation for domain-specific fine-tunes and student models in distillation pipelines. Released under the Apache 2.0 license.

NVIDIA Nemotron Nano 9B v2

NVIDIA · 8.9B · runs from 4.5 GB

NVIDIA Nemotron Nano 9B v2 is a compact yet capable chat model from NVIDIA, packing 8.9 billion parameters into a size that runs comfortably on a wide range of consumer GPUs. Built on NVIDIA's Nemotron architecture, it delivers strong instruction-following and conversational performance while keeping VRAM requirements modest. This second-generation Nano model reflects NVIDIA's push to make high-quality language models accessible on local hardware. It's an excellent starting point for users who want a responsive, general-purpose assistant without needing top-tier GPU memory.

Gemma 2B IT

Google · 2.5B · runs from 1.2 GB

Gemma 2B IT is a 2.5B-parameter open language model from Google in the original Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Falcon 40B Instruct

TII UAE · 40B · runs from 12.1 GB

Falcon 40B Instruct is a 40B-parameter open language model from TII UAE in the Falcon family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Yi 6B

01.AI · 6.1B · runs from 2.9 GB

Yi 6B is a 6.1B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

TinyLlama 1.1B Intermediate Step 1431k 3T

TinyLlama · 1.1B · runs from 0.8 GB

TinyLlama 1.1B Intermediate Step 1431k 3T is a 1.1B-parameter open language model from TinyLlama in the TinyLlama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 2B

Google · 2.5B · runs from 1.2 GB

Gemma 2B is a 2.5B-parameter open language model from Google in the original Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 3.3 Nemotron Super 49B v1.5

NVIDIA · 49.9B · runs from 15.1 GB

Llama 3.3 Nemotron Super 49B is a 49.9-billion-parameter chat model by NVIDIA, built on a modified Llama 3.3 architecture. It sits between the common 8B and 70B tiers, offering strong reasoning and conversational ability while requiring less VRAM than full 70B models. NVIDIA's Nemotron Super training pipeline applies extensive alignment tuning to optimize helpfulness and factual accuracy. The smallest quantization listed here starts at 15.1 GB, which fits on a 24 GB consumer GPU such as the RTX 4090; higher-precision quantizations typically need 32 GB or more, which calls for a 32 GB-class GPU or a professional workstation card.
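Whether a given card can hold a model comes down to comparing the model's footprint with the card's VRAM. A minimal sketch of that check follows, assuming a 24 GB card (the RTX 4090's capacity) and a hypothetical 2 GB headroom for context and runtime overhead; `fits_in_vram` and the headroom value are illustrative, not how this site computes compatibility.

```python
def fits_in_vram(model_gb: float, gpu_vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """Return True if the model plus an assumed headroom fits in the card's VRAM.

    headroom_gb is a hypothetical allowance for KV cache and runtime overhead;
    real needs grow with context length.
    """
    return model_gb + headroom_gb <= gpu_vram_gb


if __name__ == "__main__":
    rtx_4090_vram_gb = 24.0  # assumed card capacity for this example
    print(fits_in_vram(15.1, rtx_4090_vram_gb))  # smallest listed footprint
    print(fits_in_vram(32.0, rtx_4090_vram_gb))  # a 32 GB footprint
```

Under these assumptions the 15.1 GB footprint fits on a 24 GB card, while a 32 GB footprint does not.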

DeepSeek v2 Lite Chat

DeepSeek · 15.7B · runs from 5.1 GB

DeepSeek v2 Lite Chat is a 15.7B-parameter open language model from DeepSeek in the DeepSeek V2 family. It supports a context window of up to 163,840 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 2 9B

Google · 9.2B · runs from 4.2 GB

Gemma 2 9B is a 9.2B-parameter open language model from Google in the Gemma 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen2.5 7B

Alibaba · 7.6B · runs from 3.6 GB

Qwen2.5 7B is a 7.6B-parameter open language model from Alibaba in the Qwen 2.5 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

EXAONE 4.0 1.2B

LGAI-EXAONE · 1.3B · runs from 1.0 GB

EXAONE 4.0 1.2B is a 1.3B-parameter open language model from LGAI-EXAONE in the EXAONE family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2 1.2B RAG

LiquidAI · 1.2B · runs from 0.9 GB

LFM2 1.2B RAG is a 1.2B-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Yi 34B

01.AI · 34.4B · runs from 15.0 GB

Yi 34B is a 34.4B-parameter open language model from 01.AI in the Yi family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Racka 4B

elte-nlp · 4.0B · runs from 1.9 GB

Racka 4B is a 4.0B-parameter open language model from elte-nlp. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

SmolLM 135M

Hugging Face · 135M · runs from 0.4 GB

SmolLM 135M is the original first-generation small language model from Hugging Face, designed to push the boundaries of what is achievable at extremely low parameter counts. With just 135 million parameters, it was a pioneering effort in making capable language models accessible on the most resource-constrained hardware. While the SmolLM2 and SmolLM3 families have since surpassed it in quality, the original SmolLM 135M remains a useful reference point for research and a practical option for ultra-lightweight deployment scenarios where every megabyte of memory counts.