All LLM Models

Browse 739 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
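The arithmetic behind those figures can be sketched as parameters times bits per weight, divided by 8 to get bytes. The snippet below is a rough, weight-only estimate and not the method this catalog uses. The function name `estimate_weight_gb` and the bits-per-weight table are assumptions for illustration (Q4_K_M mixes block formats, so its effective size is taken here as about 4.85 bits per weight).

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8 bytes.
# The bits-per-weight figures below are assumptions for illustration;
# real quantized files vary by architecture and quantization recipe.
ASSUMED_BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def estimate_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-parameter model, matching the example in the text.
    for name, bits in ASSUMED_BITS_PER_WEIGHT.items():
        print(f"7B at {name}: ~{estimate_weight_gb(7e9, bits):.1f} GB")
```

This covers weights only. The KV cache, activations, and runtime overhead add to the total, and they grow with context length, so treat the result as a lower bound.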

Model List

SmolLM3 3B Base

Hugging Face · 3B · runs from 1.3 GB

89.8K 150

SmolLM3 3B Base is the pretrained foundation model from Hugging Face's third-generation SmolLM family. Without instruction tuning or chat alignment, it serves as a versatile starting point for researchers and developers who want to fine-tune the model for specific domains, tasks, or behavioral profiles. With 3 billion parameters and the architectural improvements introduced in SmolLM3, this base model offers strong general language capabilities in a package that remains practical to train and adapt on consumer-grade hardware. It is an excellent choice for custom fine-tuning projects where off-the-shelf chat behavior is not needed.

Chat

GPT-2 Large

OpenAI · 812M · runs from 0.4 GB

2.1M 353

GPT-2 Large is an 812M-parameter open language model from OpenAI. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Pythia 410M

EleutherAI · 506M · runs from 0.2 GB

104.6K 37

Pythia 410M is a 506M-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Dolphin 2.9.1 Yi 1.5 34B

dphn · 34.4B · runs from 10.3 GB

4.7M 57

Dolphin 2.9.1 Yi 1.5 34B is a 34.4-billion-parameter chat model created by Eric Hartford's Dolphin project, fine-tuned from 01.AI's Yi 1.5 34B base. The Dolphin series is known for producing uncensored fine-tunes that remove alignment-based refusals, giving users more direct and unrestricted model responses. This model combines the strong bilingual capabilities of Yi 1.5 with Dolphin's open fine-tuning approach. Quantized local inference is most comfortable on a GPU with at least 24 GB of VRAM, although the smallest quantization starts at about 10.3 GB. The model is popular among users who prefer models without built-in content restrictions.

Chat

Falcon 11B

TII UAE · 11.1B · runs from 5.0 GB

4.5K 219

Falcon 11B is an 11.1B-parameter open language model from TII UAE in the Falcon family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen2.5 0.5B

Alibaba · 494M · runs from 0.5 GB

2.0M 421

Qwen2.5 0.5B is the smallest base (pretrained) model in Alibaba Cloud's Qwen 2.5 family, with 494 million parameters. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a foundation for custom applications. It supports a context window of up to 32,768 tokens. Its minimal size makes it suitable for experimentation, rapid prototyping, and resource-constrained fine-tuning tasks, and it can run on virtually any hardware. It is released under the Apache 2.0 license.

Chat

Qwen3 8B Base

Alibaba · 8.2B · runs from 4.1 GB

453.7K 107

Qwen3 8B Base is an 8.2-billion-parameter pretrained foundation model from Alibaba Cloud's Qwen 3 series. As a base model, it is not instruction-tuned and is intended for fine-tuning, research, and as a starting point for custom downstream applications. It was trained on a large multilingual corpus with improved data quality and training methodology compared to the Qwen 2.5 generation. The model runs efficiently on consumer GPUs with 8 GB or more of VRAM and serves as the foundation for the Qwen3 8B instruction-tuned variant and community fine-tunes. It is a strong choice for practitioners building specialized models through further training. It is released under the Apache 2.0 license.

Chat

Qwen3Guard Gen 8B

Alibaba · 8.2B · runs from 4.1 GB

70.1K 114

Qwen3Guard Gen 8B is an 8.2B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

C4AI Command R v01

CohereForAI · 35.0B · runs from 16.4 GB

28.5K 1.1K

C4AI Command R v01 is a 35.0B-parameter open language model from CohereForAI in the Command R family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 14B Chat

Alibaba · 14.2B · runs from 8 GB

10.9K 112

Qwen1.5 14B Chat is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

GPT-2 XL

OpenAI · 1.6B · runs from 0.7 GB

189.2K 380

GPT-2 XL is the largest variant of the GPT-2 family at 1.6 billion parameters, representing the full release of the model OpenAI originally withheld over safety concerns in 2019. It produces the most coherent and capable outputs of the GPT-2 lineup, though it remains far behind modern multi-billion-parameter instruction-tuned models. At its size, GPT-2 XL still runs easily on most consumer GPUs and even on CPUs with reasonable speed, making it useful for experimentation, fine-tuning projects, and as a baseline for comparing against newer architectures. It requires roughly 3 GB of VRAM at FP16.

Chat

SmolLM2 1.7B

Hugging Face · 1.7B · runs from 1.4 GB

281.3K 152

SmolLM2 1.7B is the base pretrained model from Hugging Face's second-generation SmolLM family. Unlike the instruct variant, this model has not been fine-tuned for chat or instruction following, making it a strong foundation for custom fine-tuning, domain adaptation, or research into small-scale language model behavior. At 1.7 billion parameters, it provides meaningful language understanding and generation capabilities while remaining lightweight enough to train and experiment with on consumer hardware. Researchers and developers who want full control over downstream behavior will find this base model more flexible than the instruction-tuned version.

Chat

SmolLM2 135M

Hugging Face · 135M · runs from 0.4 GB

1.4M 199

SmolLM2 135M is one of the smallest capable language models available, developed by Hugging Face as part of their SmolLM2 family. With just 135 million parameters, it needs well under 1 GB of VRAM and can run on almost any hardware, making it an excellent starting point for researchers experimenting with language model behavior, fine-tuning workflows, or edge deployment scenarios. Despite its tiny footprint, SmolLM2 135M benefits from improved training data and techniques compared to its first-generation predecessor. It is best suited for lightweight text generation tasks, prototyping, and educational purposes rather than production-grade applications.

Chat

Qwen2 72B Instruct

Alibaba · 72.7B · runs from 21.0 GB

20.5K 717

Qwen2 72B Instruct is a 72.7B-parameter open language model from Alibaba in the Qwen 2 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

OLMo 2 0425 1B

Allen AI · 1.5B · runs from 1.2 GB

412.1K 77

OLMo 2 0425 1B is a 1.5B-parameter open language model from Allen AI in the OLMo family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 7B Chat

Alibaba · 7.7B · runs from 4.7 GB

12.5K 186

Qwen1.5 7B Chat is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM2 360M

Hugging Face · 362M · runs from 0.5 GB

60.8K 104

SmolLM2 360M is a 362M-parameter open language model from Hugging Face in the SmolLM family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 32B Chat

Alibaba · 32.5B · runs from 14.3 GB

9.6K 109

Qwen1.5 32B Chat is a 32.5B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi-1.5

Microsoft · 1.4B · runs from 0.7 GB

60.9K 1.4K

Phi-1.5 is a 1.4B-parameter open language model from Microsoft in the Phi family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Gemma 4 12B IT Assistant

Google · 12B · runs from 5.4 GB

29.2K 82

Gemma 4 12B IT Assistant is a 12B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Falcon 40B

TII UAE · 41.8B · runs from 19.6 GB

29.2K 2.4K

Falcon 40B is a 41.8B-parameter open language model from TII UAE in the Falcon family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen 14B Chat

Alibaba · 14.2B · runs from 6.6 GB

1.7K 373

Qwen 14B Chat is a 14.2B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Qwen1.5 7B

Alibaba · 7.7B · runs from 4.7 GB

133.4K 56

Qwen1.5 7B is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

SmolLM 1.7B

Hugging Face · 1.7B · runs from 1.4 GB

64.3K 181

SmolLM 1.7B is the largest model in Hugging Face's first-generation SmolLM family. At 1.7 billion parameters, it delivers solid general-purpose text generation in a compact package that runs easily on entry-level hardware, though it has been superseded by the improved SmolLM2 and SmolLM3 series. This model remains a reasonable choice for applications where proven stability matters more than cutting-edge performance. For most new projects, however, users should consider the SmolLM2 1.7B or SmolLM3 3B models, which offer better quality at comparable or only slightly higher resource requirements.

Chat

Qwen 7B

Alibaba · 7.7B · runs from 3.6 GB

17.3K 399

Qwen 7B is a 7.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Sarvam 30B Uncensored

aoxo · 32.2B · runs from 14 GB

413 6

Sarvam 30B Uncensored is a 32.2B-parameter open language model from aoxo. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Phi 3 Mini 128k Instruct

Microsoft · 3.8B · runs from 2.7 GB

248.6K 1.7K

Phi 3 Mini 128k Instruct is a 3.8B-parameter open language model from Microsoft in the Phi 3 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

CodeQwen1.5 7B

Alibaba · 7.3B · runs from 3.5 GB

2.2K 103

CodeQwen1.5 7B is a 7.3B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat · Code

Llama 2 7B Chat HF

Nous Research · 6.7B · runs from 4.2 GB

18.1K 199

Llama 2 7B Chat HF is a 6.7B-parameter open language model from Nous Research in the Llama 2 family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

Llama 3.1 Nemotron Nano 8B V1

NVIDIA · 8B · runs from 2.8 GB

308.6K 219

Llama 3.1 Nemotron Nano 8B is an 8-billion-parameter chat model by NVIDIA, a compact entry in the Nemotron family derived from Meta's Llama 3.1 architecture. It applies NVIDIA's alignment and fine-tuning techniques to deliver improved response quality over the base Llama 3.1 8B Instruct model at the same parameter count. The model runs on consumer GPUs with 8 GB or more of VRAM and supports a 128K token context window. Its small footprint and NVIDIA-tuned quality make it a practical option for local inference on mainstream hardware.

Chat