All LLM Models

Browse 856 LLMs with their VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

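How much VRAM you need depends mainly on the model's parameter count and its quantization level. Quantization reduces the precision of the model weights, trading a small loss in quality for significantly lower VRAM usage. For example, a 7B-parameter model needs ~14 GB at FP16 (2 bytes per weight) but only ~4 GB at Q4_K_M quantization (a little under 5 bits per weight on average). These figures cover the weights only; the context (KV cache) and runtime overhead add to them.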
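The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough weight-only estimate, not the method this catalog uses: the function name `estimate_weight_gb` and the bits-per-weight table are illustrative assumptions, and the Q4_K_M value in particular is an approximate average that varies by model.

```python
# Rough weight-only memory estimate for a model at a given quantization.
# The bits-per-weight values are assumptions for illustration; real
# quantized files vary slightly from model to model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,    # 2 bytes per weight
    "Q8_0": 8.5,     # 8-bit weights plus per-block scale
    "Q4_K_M": 4.8,   # assumed approximate average
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7.0, quant):.1f} GB")
```

Actual usage will be higher than this estimate, since long context windows can add several gigabytes of KV cache on top of the weights.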

Model List

OLMo 2 0425 1B

Allen AI · 1.5B · runs from 1.2 GB

412.1K 77

OLMo 2 0425 1B is a 1.5B-parameter open language model from Allen AI in the OLMo family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.1 70B Instruct

Meta · 70.6B · runs from 33.0 GB

630.4K 924

Meta Llama 3.1 70B Instruct is a 70.6-billion-parameter instruction-tuned model from Meta's Llama 3.1 family. It features a 128K-token context window and is optimized for chat, tool use, and complex reasoning tasks. The 70B size offers a strong balance between capability and hardware requirements, running well on multi-GPU setups or high-VRAM workstation cards. The model was trained on over 15 trillion tokens and fine-tuned with reinforcement learning from human feedback (RLHF). It excels at coding assistance, mathematical reasoning, and multilingual dialogue. It is released under the Llama 3.1 Community License.

Chat

Qwen1.5 7B Chat

Alibaba · 7.7B · runs from 4.7 GB

12.5K 186

Qwen1.5 7B Chat is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM2 360M

Hugging Face · 362M · runs from 0.5 GB

60.8K 104

SmolLM2 360M is a 362M-parameter open language model from Hugging Face in the SmolLM family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 32B Chat

Alibaba · 32.5B · runs from 14.3 GB

9.6K 109

Qwen1.5 32B Chat is a 32.5B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi 1.5

Microsoft · 1.4B · runs from 0.7 GB

60.9K 1.4K

Phi 1.5 is a 1.4B-parameter open language model from Microsoft in the Phi family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Gemma 4 12B IT Assistant

Google · 12B · runs from 5.4 GB

29.2K 82

Gemma 4 12B IT Assistant is a 12B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Falcon 40B

TII UAE · 41.8B · runs from 19.6 GB

29.2K 2.4K

Falcon 40B is a 41.8B-parameter open language model from TII UAE in the Falcon family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen 14B Chat

Alibaba · 14.2B · runs from 6.6 GB

1.7K 373

Qwen 14B Chat is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 7B

Alibaba · 7.7B · runs from 4.7 GB

133.4K 56

Qwen1.5 7B is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM 1.7B

Hugging Face · 1.7B · runs from 1.4 GB

64.3K 181

SmolLM 1.7B is the largest model in Hugging Face's first-generation SmolLM family. At 1.7 billion parameters, it delivers solid general-purpose text generation in a compact package that runs easily on entry-level hardware, though it has been superseded by the improved SmolLM2 and SmolLM3 series. This model remains a reasonable choice for applications where proven stability matters more than cutting-edge performance. For most new projects, however, users should consider the SmolLM2 1.7B or SmolLM3 3B models, which offer better quality at comparable or only slightly higher resource requirements.

Chat

Qwen 7B

Alibaba · 7.7B · runs from 3.6 GB

17.3K 399

Qwen 7B is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Sarvam 30B Uncensored

aoxo · 32.2B · runs from 14 GB

413 6

Sarvam 30B Uncensored is a 32.2B-parameter open language model from aoxo. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 72B Chat

Alibaba · 72.3B · runs from 35.5 GB

9.2K 217

Qwen1.5 72B Chat is a 72.3B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi 3 Mini 128k Instruct

Microsoft · 3.8B · runs from 2.7 GB

248.6K 1.7K

Phi 3 Mini 128k Instruct is a 3.8B-parameter open language model from Microsoft in the Phi 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

CodeQwen1.5 7B

Alibaba · 7.3B · runs from 3.5 GB

2.2K 103

CodeQwen1.5 7B is a 7.3B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Llama 2 7B Chat HF

Nous Research · 6.7B · runs from 4.2 GB

18.1K 199

Llama 2 7B Chat HF is a 6.7B-parameter open language model in Meta's Llama 2 family, listed here from the Nous Research upload in Hugging Face Transformers format. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.1 Nemotron Nano 8B V1

NVIDIA · 8B · runs from 2.8 GB

308.6K 219

Llama 3.1 Nemotron Nano 8B is an 8-billion-parameter chat model by NVIDIA, a compact entry in the Nemotron family derived from Meta's Llama 3.1 architecture. It applies NVIDIA's alignment and fine-tuning techniques to deliver improved response quality over the base Llama 3.1 8B Instruct model at the same parameter count. The model runs on consumer GPUs with 8 GB or more of VRAM and supports a 128K-token context window. Its small footprint and NVIDIA-tuned quality make it a practical option for local inference on mainstream hardware.

Chat

Falcon 7B

TII UAE · 7.2B · runs from 3.4 GB

378.1K 1.1K

Falcon 7B was one of the first competitive open-source large language models, released in mid-2023 by the Technology Innovation Institute in Abu Dhabi. Trained on the RefinedWeb dataset, it demonstrated that models trained on carefully curated web data could rival those trained on more traditionally assembled corpora. At 7 billion parameters, Falcon 7B helped establish the 7B class as the sweet spot for local inference, running on consumer GPUs with as little as 6 GB of VRAM.

Chat

Qwen1.5 14B

Alibaba · 14.2B · runs from 8 GB

9.9K 41

Qwen1.5 14B is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen 14B

Alibaba · 14.2B · runs from 6.6 GB

1.8K 214

Qwen 14B is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

C4AI Command A 03-2025

Cohere · 111.1B · runs from 51.9 GB

2.0K 392

C4AI Command A 03-2025 is a 111.1B-parameter open language model from Cohere in the Command family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

DistilGPT-2

distilbert · 88M · runs from under 0.1 GB

2.3M 618

DistilGPT-2 is a distilled version of OpenAI's GPT-2 model, compressed to just 88 million parameters while retaining much of the original model's text generation ability. Created using knowledge distillation techniques, it offers significantly faster inference than the full GPT-2 with only a modest reduction in output quality. This model is one of the lightest autoregressive language models available and can run on virtually any hardware, including CPUs. It is a practical choice for educational projects, quick prototyping, and applications where inference speed and minimal resource usage are more important than state-of-the-art generation quality.

Chat

Qwen1.5 32B

Alibaba · 32.5B · runs from 14.3 GB

9.5K 85

Qwen1.5 32B is a 32.5B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

QwQ 32B Preview

Alibaba · 32.8B · runs from 14.8 GB

20.8K 1.7K

QwQ 32B Preview is a 32.8B-parameter open language model from Alibaba in the QwQ family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Reasoning

Qwen 1.8B

Alibaba · 1.8B · runs from 0.9 GB

1.7K 73

Qwen 1.8B is a 1.8B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

NVIDIA Nemotron 3 Ultra 550B A55B Base BF16

NVIDIA · 560.5B · runs from 262.1 GB

2.0K 25

NVIDIA Nemotron 3 Ultra 550B A55B Base BF16 is a 560.5B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Baichuan2 13B Base

baichuan-inc · 13B · runs from 6.1 GB

1.6K 82

Baichuan2 13B Base is a 13B-parameter open language model from baichuan-inc in the Baichuan family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Baichuan 13B Base

baichuan-inc · 13B · runs from 6.1 GB

990 187

Baichuan 13B Base is a 13B-parameter open language model from baichuan-inc in the Baichuan family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OPT 125M

Meta · 125M · runs from 0.3 GB

10.5M 266

Meta OPT 125M is a 125-million-parameter language model from Meta's Open Pre-trained Transformer (OPT) project. Released in 2022, it was part of Meta's effort to provide the research community with openly available large language models that replicate the performance of GPT-3-class models at various scales. As one of the smallest models in the OPT family, the 125M variant is primarily useful for research, experimentation, and education. It can run on virtually any hardware, including CPU-only setups. While significantly less capable than modern models, it remains a useful reference point in LLM research.

Chat