All LLM Models

Browse 529 LLM models with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
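
The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not the method this catalog uses: the function name and the bits-per-weight table are illustrative, the Q8_0 and Q4_K_M values are approximate averages, and the result covers model weights only (the KV cache and runtime overhead add to it).

```python
# Approximate bits stored per weight for a few formats.
# FP16 is exact; the Q8_0 and Q4_K_M values are rough averages (assumptions).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_weight_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model comes out at about 14 GB for FP16 and a little over 4 GB for Q4_K_M, in line with the figures quoted above.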

Qwen2 1.5B

Alibaba · 1.5B · runs from 1.0 GB

Qwen2 1.5B is a 1.5-billion parameter base (pretrained) model from Alibaba Cloud's older Qwen 2 generation. It was trained on a multilingual corpus and supports a context window of up to 32K tokens. As a base model, it is designed for fine-tuning and research rather than direct conversational use. Although the Qwen 2.5 series has superseded it in training data quality and benchmark performance, Qwen2 1.5B remains a lightweight option for experimentation and as a baseline for comparison. It is released under the Apache 2.0 license.

MiMo 7B RL

XiaomiMiMo · 7.8B · runs from 3.9 GB

MiMo 7B RL is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

MobileLLaMA 1.4B Chat

mtgv · 1.4B · runs from 1.3 GB

MobileLLaMA 1.4B Chat is a 1.4B-parameter open language model from mtgv in the Llama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Moonlight 16B A3B

Moonshot AI · 16.0B · runs from 7.5 GB

Moonlight 16B A3B is a compact Mixture-of-Experts model from Moonshot AI with 16 billion total parameters, of which only around 3 billion are active per token. This sparse design gives it strong chat performance relative to its per-token inference cost. The small active parameter count keeps generation fast on modest hardware, while all 16 billion weights must still be held in memory; at common quantization levels the model fits on GPUs with 8–12 GB of VRAM. It is an appealing choice for users who want the capacity of an MoE model without the heavy memory footprint typically associated with mixture models.
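
The split between total and active parameters can be illustrated with a small sketch. It rests on two common approximations that are assumptions here rather than measured values: weight memory scales with the total parameter count, and a forward pass costs roughly 2 FLOPs per active parameter per token. The function name and the 4-bit setting are illustrative.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> tuple[float, float]:
    """Return (weight memory in GB, approximate GFLOPs per generated token)."""
    # Every expert must be resident, so memory depends on total parameters.
    weight_gb = total_params_b * bits_per_weight / 8
    # Only the routed experts run, so compute depends on active parameters.
    gflops_per_token = 2 * active_params_b
    return weight_gb, gflops_per_token


if __name__ == "__main__":
    # 16B total, 3B active, assumed 4 bits per weight.
    weight_gb, gflops = moe_footprint(16.0, 3.0, 4.0)
    print(f"weights: ~{weight_gb:.1f} GB, compute: ~{gflops:.0f} GFLOPs/token")
```

In this estimate the memory cost is that of a 16B model while the per-token compute is closer to that of a 3B dense model.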

Llama XLAM 2 8B Fc R

Salesforce · 8B · runs from 4.0 GB

xLAM 2 8B FC-R is an 8-billion parameter model by Salesforce, specifically optimized for function calling and tool use. Built on the Llama architecture, it is designed to reliably generate structured function call outputs, making it suitable for agentic workflows and applications that require models to interact with external tools and APIs. Unlike general-purpose chat models, xLAM 2 focuses on accurately translating user intent into structured tool invocations with correctly formatted arguments. It runs on consumer GPUs with 8 GB or more of VRAM and is a strong choice for developers building local AI agent systems that need reliable function calling.
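
As an illustration of what consuming a structured tool invocation might look like on the application side, here is a minimal sketch. The JSON shape (`name` plus `arguments`), the `TOOLS` registry, and the `get_weather` tool are all hypothetical; the model's actual output format should be taken from its model card.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str, "unit": str},
}


def parse_tool_call(raw: str) -> dict:
    """Parse a JSON tool call and check it against the registry."""
    call = json.loads(raw)
    name = call["name"]
    args = call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    spec = TOOLS[name]
    missing = set(spec) - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    for key, expected_type in spec.items():
        if not isinstance(args[key], expected_type):
            raise TypeError(f"argument {key!r} must be {expected_type.__name__}")
    return {"name": name, "arguments": args}


if __name__ == "__main__":
    raw = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
    print(parse_tool_call(raw))
```

Validating the call before dispatching it keeps malformed or unexpected output from reaching the real tool.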

DialoGPT Small

Microsoft · 176M · runs from 0.1 GB

DialoGPT Small is a 176M-parameter open language model from Microsoft. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Distil Lfm25 Shellper

distil-labs · 354M · runs from 0.5 GB

Distil Lfm25 Shellper is a 354M-parameter open language model from distil-labs. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Ouro 1.4B

ByteDance · 1.4B · runs from 3.6 GB

Ouro 1.4B is a 1.4B-parameter open language model from ByteDance. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Falcon 7B Instruct

TII UAE · 7.2B · runs from 3.4 GB

Falcon 7B Instruct is the instruction-tuned version of TII's Falcon 7B, fine-tuned on a mix of chat and instruction datasets to follow user prompts more reliably. It was among the early open models to show that a well-tuned 7B model could handle conversational tasks, summarization, and basic reasoning without requiring massive hardware. While newer models have since raised the bar, Falcon 7B Instruct remains a lightweight option for users who want a responsive local assistant with modest resource requirements.

Gemma 3 1B PT

Google · 1B · runs from 0.5 GB

Gemma 3 1B PT is a 1B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 14B Base

Alibaba · 14.8B · runs from 6.9 GB

Qwen3 14B Base is a 14.8B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama Guard 3 1B

Meta · 1.5B · runs from 3.3 GB

Llama Guard 3 1B is a 1.5B-parameter open language model from Meta in the Llama family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Nandi Mini 150M

FrontiersMind · 153M · runs from 0.6 GB

Nandi Mini 150M is a 153M-parameter open language model from FrontiersMind. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Academic Ds 9B

ByteDance-Seed · 9.4B · runs from 4.5 GB

Academic Ds 9B is a 9.4B-parameter open language model from ByteDance-Seed. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2.5 350M Base

LiquidAI · 354M · runs from 0.5 GB

LFM2.5 350M Base is a 354M-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek LLM 7B Base

DeepSeek · 7B · runs from 4.3 GB

DeepSeek LLM 7B Base is a 7B-parameter open language model from DeepSeek in the DeepSeek family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

H2o Danube3 500M Chat

h2oai · 514M · runs from 0.6 GB

H2o Danube3 500M Chat is a 514M-parameter open language model from h2oai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Ling Mini 2.0

Inclusion AI · 16.3B · runs from 7.3 GB

Ling Mini 2.0 is a 16.3B-parameter open language model from Inclusion AI. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

InternLM2.5 7B Chat

InternLM · 7B · runs from 3.5 GB

InternLM2.5 7B Chat is a 7B-parameter open language model from InternLM in the InternLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek MoE 16B Base

DeepSeek · 16.4B · runs from 7.7 GB

DeepSeek MoE 16B Base is a 16.4B-parameter open language model from DeepSeek in the DeepSeek family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek R1 Distill Qwen 1.5B

litert-community · 1.5B · runs from 0.7 GB

DeepSeek R1 Distill Qwen 1.5B is a 1.5B-parameter open language model from litert-community in the DeepSeek R1 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3.6 27B MTPLX Optimized Speed

Youssofal · 4.7B · runs from 2.7 GB

Qwen3.6 27B MTPLX Optimized Speed is a 4.7B-parameter open language model from Youssofal in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Nemotron Cascade 8B

NVIDIA · 8B · runs from 4 GB

Nemotron Cascade 8B is an 8B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

MiniCPM 2B Sft BF16

openbmb · 2B · runs from 1.9 GB

MiniCPM 2B Sft BF16 is a 2B-parameter open language model from openbmb in the MiniCPM family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

CodeGemma 2B

Google · 2.5B · runs from 1.2 GB

CodeGemma 2B is a 2.5B-parameter open language model from Google in the Gemma family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen1.5 MoE A2.7B Chat

Alibaba · 2.7B · runs from 1.9 GB

Qwen1.5 MoE A2.7B Chat is a 2.7B-parameter open language model from Alibaba in the Qwen family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llm Jp 3.1 13B Instruct4

llm-jp · 13.7B · runs from 7.8 GB

Llm Jp 3.1 13B Instruct4 is a 13.7B-parameter open language model from llm-jp. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

ERNIE 4.5 0.3B PT

Baidu · 361M · runs from 0.5 GB

ERNIE 4.5 0.3B PT is a 361M-parameter open language model from Baidu in the ERNIE family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Pythia 1B

EleutherAI · 1.1B · runs from 0.5 GB

Pythia 1B is a 1.1B-parameter open language model from EleutherAI. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 2 13B HF

Meta · 13.0B · runs from 6.1 GB

Llama 2 13B HF is a 13.0B-parameter open language model from Meta in the Llama 2 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.