Can I Run an LLM Locally?

Find out which AI models your machine can actually run. Check GPU compatibility, VRAM requirements, and expected performance.

Model Rankings

Top models ranked by compatibility for 24 GB VRAM

145 models · 7 excellent · 15 good

LLM models ranked by compatibility and performance

Model            Params   Quant    Context   Fit         VRAM      Grade
Gemma 3 27B IT   27.4B    Q4_K_M   131K      GREAT FIT   18.1 GB   S 90
Qwen3.6 27B      27.8B    Q4_K_M   262K      GREAT FIT   17.4 GB   S 88
-                -        Q4_K_M   262K      GREAT FIT   16.6 GB   S 85
-                -        Q4_K_M   8K        GREAT FIT   18.0 GB   S 90
-                -        Q4_K_M   262K      GREAT FIT   18.7 GB   S 86
-                -        Q4_K_M   262K      GREAT FIT   18.7 GB   S 86
-                -        Q4_K_M   262K      GREAT FIT   18.7 GB   S 86
-                -        Q4_K_M   131K      GOOD FIT    15.1 GB   A 80
-                -        BF16     4K        GOOD FIT    15.4 GB   A 81
-                -        BF16     8K        GOOD FIT    15.3 GB   A 81
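
As a rough sanity check on VRAM figures like those in the ranking, the memory needed for a quantized model can be approximated as parameter count times bits per weight, plus an allowance for the KV cache and runtime buffers. The sketch below is only an illustration: the bits-per-weight values and the flat 1.5 GB overhead are assumptions, and the function names are hypothetical. The page does not state how its own numbers or grades are computed.

```python
# Rough VRAM estimate for a quantized model: weight bytes plus a flat overhead.
# The bits-per-weight values and the overhead are assumptions for illustration,
# not the method this site uses.

APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # assumed average; K-quants mix several block formats
    "Q8_0": 8.5,     # 8-bit values plus one 16-bit scale per 32-weight block
    "BF16": 16.0,
}


def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Estimate VRAM in GB (10^9 bytes): weights plus a fixed allowance."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    weights_gb = params_billions * bits / 8  # billions of params * bytes per param
    return weights_gb + overhead_gb


def fits(params_billions: float, quant: str, vram_gb: float) -> bool:
    """True if the estimate fits within the given VRAM budget."""
    return estimate_vram_gb(params_billions, quant) <= vram_gb


print(round(estimate_vram_gb(27.4, "Q4_K_M"), 1))  # 18.1
print(fits(27.4, "Q4_K_M", 24.0))                  # True
print(fits(27.4, "BF16", 24.0))                    # False
```

A 27.4B model at Q4_K_M lands near 18 GB under these assumptions, while the same model at BF16 needs well over 24 GB. Long contexts add to the estimate, so treat the overhead as a floor rather than a fixed cost.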

Browse by VRAM

Find the best models for your VRAM tier

Popular Devices

GPUs, MacBooks, AI boxes, and more — find what runs AI best

Popular Models

Qwen2.5 7B Instruct

Alibaba · 7.6B · runs from 2.7 GB

11.9M 1.4K

Qwen2.5 7B Instruct is a 7.6-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen 2.5 series. It supports a 128K-token context window and is fine-tuned for conversational AI, instruction following, and general assistant tasks. Its compact size makes it well suited to local deployment on consumer GPUs with 8 GB or more of VRAM. The model performs strongly for its parameter class on reasoning, multilingual understanding, and coding tasks, benefiting from the improved pretraining data and techniques of the Qwen 2.5 generation. It is released under the Apache 2.0 license and is widely supported by inference frameworks such as llama.cpp, vLLM, and Ollama.
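
A long context window has a VRAM cost of its own, because the key/value cache grows linearly with the number of tokens held in context. The sketch below estimates that cost with the standard formula for grouped-query attention. The attention shape used for Qwen2.5 7B (28 layers, 4 KV heads, head dimension 128) is an assumption recalled from the published model configuration and should be checked against the model's config.json; the function name is hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # Factor 2: one key vector and one value vector per layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed Qwen2.5 7B attention shape; verify against the model's config.json.
QWEN25_7B = dict(n_layers=28, n_kv_heads=4, head_dim=128)

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(n_tokens=ctx, **QWEN25_7B) / 2**30
    print(f"{ctx:>7} tokens: {gib:.2f} GiB")
```

Under these assumptions a 16-bit cache takes about 0.44 GiB at 8K tokens and 7 GiB at the full 128K, which is why a model that fits comfortably at short contexts may not fit at its maximum context length.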

Chat

Gemma 4 26B A4B IT

Google · 26.5B · runs from 8.0 GB

11.5M 1.1K

Gemma 4 26B A4B IT is a 26.5B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Gemma 4 31B IT

Google · 32.7B · runs from 10.6 GB

9.9M 3.0K

Gemma 4 31B IT is a 32.7B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Llama 3.2 1B Instruct

Meta · 1.2B · runs from 0.4 GB

7.4M 1.5K

Meta Llama 3.2 1B Instruct is a 1.2-billion-parameter instruction-tuned model from Meta, the smallest in the Llama 3.2 family. It is designed for ultra-lightweight deployment where minimal hardware resources are available, and it supports a 128K-token context window despite its compact size. The model is suitable for basic conversational tasks, text summarization, and simple instruction following. It can run on virtually any modern GPU, and even on CPU-only setups with acceptable performance. It is released under the Llama 3.2 Community License.

Chat

Qwen3.6 35B A3B

Alibaba · 36.0B · runs from 10.3 GB

3.7M 2.1K

Qwen3.6 35B A3B is a 36.0B-parameter open language model from Alibaba in the Qwen 3.6 family. It supports a context window of up to 262,144 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Vision

Gemma 4 E4B IT

Google · 8.0B · runs from 3.2 GB

5.6M 1.2K

Gemma 4 E4B IT is an 8.0B-parameter open language model from Google in the Gemma 4 family. It supports a context window of up to 131,072 tokens. See its VRAM requirements by quantization and which GPUs and Macs can run it locally below.

Chat

How It Works

Three steps to find your perfect local AI setup

1. Select Your Hardware

Pick your GPU or Apple Silicon device from the dropdown.

2. Check Compatibility

See which models fit in your VRAM with performance grades.

3. Run It

Install via Ollama, LM Studio, or download the GGUF file.
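
Once a model is installed through Ollama, it can be queried from a script over Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes an Ollama server is running on its default port (11434) and that a model has already been pulled. The tag "qwen2.5:7b" in the comment is an assumed example; substitute whatever tag you installed. The helper names are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request asking Ollama for a single, non-streamed reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires a running Ollama server and an installed model):
#   print(generate("qwen2.5:7b", "Explain VRAM in one sentence."))
```

Setting "stream" to false returns the whole reply as one JSON object, which keeps the example short; for interactive use, streaming the response is usually the better choice.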