All LLM Models

Browse 671 LLMs with VRAM requirements, quantization options, and hardware compatibility.

Understanding LLM VRAM Requirements

How much VRAM you need depends on the model size and quantization level. Quantization reduces the precision of model weights, trading small quality losses for significantly lower VRAM usage. For example, a 7B parameter model needs ~14 GB at FP16 but only ~4 GB at Q4_K_M quantization.
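The arithmetic behind those figures can be sketched as follows. This is a rough estimate of weight storage only: it ignores the KV cache, activations, and runtime overhead, which add to real-world usage. The bits-per-weight value used for Q4_K_M is an assumed average (about 4.8), not an exact specification, and the function name is illustrative.

```python
# Approximate bits stored per weight for each format.
# The Q4_K_M value is an assumed average; real files vary by model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.8,
}


def estimate_weight_gb(num_params: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return num_params * bits / 8 / 1e9


for quant in ("FP16", "Q4_K_M"):
    size = estimate_weight_gb(7e9, quant)
    print(f"7B at {quant}: ~{size:.1f} GB")
```

For a 7B model this gives roughly 14 GB at FP16 and a little over 4 GB at Q4_K_M, in line with the figures quoted above.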

OPT 350M

Meta · 350M · runs from 0.8 GB

Meta OPT 350M is a 350-million parameter language model from Meta's Open Pre-trained Transformer (OPT) project, released in 2022 as part of a suite of models ranging from 125M to 175B parameters. It was designed to provide researchers with open access to models comparable to GPT-3 at various scales. The 350M variant runs on minimal hardware and is suitable for research, prototyping, and educational use. While it has been surpassed by modern architectures in terms of capability, it remains a lightweight option for basic text generation experiments and as a benchmark baseline.

Wildguard

Allen AI · 7.2B · runs from 3.4 GB

Wildguard is a 7.2B-parameter open language model from Allen AI. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 7B

huggyllama · 6.7B · runs from 3.1 GB

This is a community reupload of Meta's original Llama 1 7B model, published by the huggyllama account on Hugging Face. The original Llama 1 was a 6.7-billion parameter base model released by Meta in early 2023, trained on 1 trillion tokens of publicly available data. It pioneered the wave of open-weight large language models. As a first-generation Llama model, it has been superseded by Llama 2 and Llama 3 in terms of quality and capability. It remains of historical and research interest as the model that catalyzed the open-source LLM ecosystem. This upload provides convenient access in Hugging Face Transformers format.

Mistral Small Instruct 2409

Mistral AI · 22.2B · runs from 10.2 GB

Mistral Small Instruct 2409 is a 22.2B-parameter open language model from Mistral AI in the Mistral family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

TinyStories 1M

roneneldan · 1M · runs from under 0.1 GB

TinyStories 1M is a 1M-parameter open language model from roneneldan. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama 2 13B Chat HF

Meta · 13.0B · runs from 6.1 GB

Meta Llama 2 13B Chat is a 13-billion parameter instruction-tuned model from Meta's Llama 2 family, fine-tuned for dialogue and chat applications. It offers improved reasoning and generation quality over the 7B variant while maintaining manageable hardware requirements with a 4K token context window. The model was fine-tuned using supervised fine-tuning and RLHF. It can run on consumer GPUs with 16 GB or more of VRAM at reduced precision. Released under the Llama 2 Community License.

Qwen2 1.5B

Alibaba · 1.5B · runs from 1.0 GB

Qwen2 1.5B is a 1.5-billion parameter base (pretrained) model from Alibaba Cloud's older Qwen 2 generation. It was trained on a multilingual corpus and supports a context window of up to 32K tokens. As a base model, it is designed for fine-tuning and research rather than direct conversational use. While superseded by the Qwen 2.5 series in terms of training data quality and benchmark performance, Qwen2 1.5B remains a lightweight option for experimentation and as a baseline for comparison. Released under the Apache 2.0 license.

MiMo 7B RL

XiaomiMiMo · 7.8B · runs from 3.9 GB

MiMo 7B RL is a 7.8B-parameter open language model from XiaomiMiMo. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Baichuan 7B

baichuan-inc · 7B · runs from 15.4 GB

Baichuan 7B is a 7B-parameter open language model from baichuan-inc in the Baichuan family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

MobileLLaMA 1.4B Chat

mtgv · 1.4B · runs from 1.3 GB

MobileLLaMA 1.4B Chat is a 1.4B-parameter open language model from mtgv in the Llama family. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Moonlight 16B A3B

Moonshot AI · 16.0B · runs from 7.5 GB

Moonlight 16B A3B is a compact Mixture-of-Experts (MoE) model from Moonshot AI with 16 billion total parameters, of which only around 3 billion are active per token. This sparse design lets it perform well above what its active parameter count suggests, delivering strong chat performance for its inference cost. Because so few parameters are active per token, Moonlight generates quickly on modest hardware. All 16 billion parameters must still be loaded into memory, and at common quantization levels they fit on GPUs with 8–12 GB of VRAM. It is an appealing choice for users who want the speed benefits of an MoE model without the heavy memory footprint of larger mixture models.
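As a rough illustration of the distinction between total and active parameters: memory for the weights scales with the total count, while per-token compute scales with the active count. The sketch below assumes a flat 4 bits per weight and counts weights only. Both the function name and the 4-bit figure are assumptions for illustration, not values taken from this listing.

```python
def moe_footprint(total_params: float, active_params: float,
                  bits_per_weight: float) -> tuple[float, float]:
    """Return (weight size in GB, fraction of weights used per token).

    All experts must be resident in memory, so size depends on total_params.
    Only the routed experts run for each token, so compute depends on
    active_params.
    """
    weight_gb = total_params * bits_per_weight / 8 / 1e9
    active_fraction = active_params / total_params
    return weight_gb, active_fraction


# 16B total, ~3B active, assumed 4 bits per weight.
weight_gb, active_fraction = moe_footprint(16e9, 3e9, 4.0)
print(f"Weights: ~{weight_gb:.1f} GB, active per token: {active_fraction:.0%}")
```

Under these assumptions the weights occupy about 8 GB, while fewer than a fifth of them are used for any given token.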

Llama xLAM 2 8B FC-R

Salesforce · 8B · runs from 4.0 GB

xLAM 2 8B FC-R is an 8-billion parameter model by Salesforce, specifically optimized for function calling and tool use. Built on the Llama architecture, it is designed to reliably generate structured function call outputs, making it suitable for agentic workflows and applications that require models to interact with external tools and APIs. Unlike general-purpose chat models, xLAM 2 focuses on accurately parsing user intent into structured tool invocations with proper argument formatting. It runs on consumer GPUs with 8 GB or more of VRAM and is a strong choice for developers building local AI agent systems that need reliable function-calling capabilities.
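To make "structured tool invocation" concrete, here is a minimal sketch of how an application might consume such output. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch_tool_call` helper are all hypothetical and chosen for illustration. They are not xLAM's documented output format, which should be checked against the model card.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Weather lookup requested for {city}"


# Registry mapping tool names the model may emit to local functions.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(raw: str) -> str:
    """Parse a JSON tool call and run the matching registered function."""
    call = json.loads(raw)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)


# Assumed example of what a function-calling model might emit.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
result = dispatch_tool_call(model_output)
print(result)
```

The value of a function-calling model lies in emitting output that survives this kind of strict parsing: a valid tool name and arguments that match the function signature.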

DialoGPT Small

Microsoft · 176M · runs from 0.1 GB

DialoGPT Small is a 176M-parameter open language model from Microsoft. It supports a context window of up to 1,024 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Distil Lfm25 Shellper

distil-labs · 354M · runs from 0.5 GB

Distil Lfm25 Shellper is a 354M-parameter open language model from distil-labs. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Ouro 1.4B

ByteDance · 1.4B · runs from 3.6 GB

Ouro 1.4B is a 1.4B-parameter open language model from ByteDance. It supports a context window of up to 65,536 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Falcon 7B Instruct

TII UAE · 7.2B · runs from 3.4 GB

Falcon 7B Instruct is the instruction-tuned version of TII's Falcon 7B, fine-tuned on a mix of chat and instruction datasets to follow user prompts more reliably. It was among the early open models to show that a well-tuned 7B model could handle conversational tasks, summarization, and basic reasoning without requiring massive hardware. While newer models have since raised the bar, Falcon 7B Instruct remains a lightweight option for users who want a responsive local assistant with modest resource requirements.

Nemotron Cascade 2 30B A3B

NVIDIA · 31.6B · runs from 13.8 GB

Nemotron Cascade 2 30B A3B is a 31.6B-parameter open language model from NVIDIA in the Nemotron family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Gemma 3 1B PT

Google · 1B · runs from 0.5 GB

Gemma 3 1B PT is a 1B-parameter open language model from Google in the Gemma 3 family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 14B Base

Alibaba · 14.8B · runs from 6.9 GB

Qwen3 14B Base is a 14.8B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Qwen3 30B A3B Base

Alibaba · 30.5B · runs from 13.4 GB

Qwen3 30B A3B Base is a 30.5B-parameter open language model from Alibaba in the Qwen 3 family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Llama Guard 3 1B

Meta · 1.5B · runs from 3.3 GB

Llama Guard 3 1B is a 1.5B-parameter open language model from Meta in the Llama family. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Nandi Mini 150M

FrontiersMind · 153M · runs from 0.6 GB

Nandi Mini 150M is a 153M-parameter open language model from FrontiersMind. It supports a context window of up to 2,048 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Academic Ds 9B

ByteDance-Seed · 9.4B · runs from 4.5 GB

Academic Ds 9B is a 9.4B-parameter open language model from ByteDance-Seed. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

LFM2.5 350M Base

LiquidAI · 354M · runs from 0.5 GB

LFM2.5 350M Base is a 354M-parameter open language model from LiquidAI. It supports a context window of up to 128,000 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

DeepSeek LLM 7B Base

DeepSeek · 7B · runs from 4.3 GB

DeepSeek LLM 7B Base is a 7B-parameter open language model from DeepSeek in the DeepSeek family. It supports a context window of up to 4,096 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

H2O Danube3 500M Chat

h2oai · 514M · runs from 0.6 GB

H2O Danube3 500M Chat is a 514M-parameter open language model from h2oai. It supports a context window of up to 8,192 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Ling Mini 2.0

Inclusion AI · 16.3B · runs from 7.3 GB

Ling Mini 2.0 is a 16.3B-parameter open language model from Inclusion AI. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

GPT OSS Safeguard 20B

OpenAI · 21.5B · runs from 9.5 GB

GPT OSS Safeguard 20B is a 21.5B-parameter open language model from OpenAI in the GPT-OSS family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

InternLM2.5 7B Chat

InternLM · 7B · runs from 3.5 GB

InternLM2.5 7B Chat is a 7B-parameter open language model from InternLM in the InternLM family. It supports a context window of up to 32,768 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Seed OSS 36B Instruct

ByteDance-Seed · 36.2B · runs from 15.9 GB

Seed OSS 36B Instruct is a 36.2B-parameter open language model from ByteDance-Seed in the Seed-OSS family. It supports a context window of up to 524,288 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.