VRAM Requirements for AI: How Much Do You Need?
Calculate exactly how much VRAM you need for AI models. Complete 2026 guide with requirements for Llama, Mistral, and more. VRAM tables and formulas included.
I ran out of VRAM three minutes into my first local AI experiment.
The model loaded fine. I typed my prompt. Then—nothing. My GPU fans spun up to jet-engine levels, my system stuttered, and eventually crashed with an “out of memory” error. The 7B model I was trying to run needed 14GB of VRAM. My GPU had 8GB.
That’s when I learned the most important lesson in local AI: VRAM is everything. Not clock speed. Not CUDA cores. Not tensor cores. VRAM—video memory—is the single factor that determines what AI models you can run locally.
This guide gives you the exact formulas and tables to calculate your VRAM requirements before you waste time (or money) on hardware that can’t run what you need.
Why VRAM Is the Bottleneck for AI
When you run a Large Language Model on your GPU, the entire model needs to fit in VRAM. Not “most of it.” Not the “active parts.” The whole thing—every parameter, every weight.
Here’s why: GPUs are extremely fast at the parallel math required for AI inference, but only when the data is in their local memory (VRAM). If even part of the model spills over to system RAM, performance falls off a cliff. We’re talking 10-100x slower, sometimes more.
Think of it like a chef’s workspace. System RAM is the pantry down the hall—you can store a lot there, but every trip takes time. VRAM is the counter in front of you—limited space, but everything is instantly accessible. For AI inference, you need your entire recipe (the model) on that counter.
VRAM vs System RAM:
- VRAM (GPU memory): High-speed memory on your graphics card. Directly accessible by GPU cores. This is what matters for AI.
- System RAM: Your computer’s main memory. Much slower for AI workloads. Used as fallback when VRAM runs out.
Some frameworks support “CPU offloading,” where portions of the model run on system RAM. This works in a pinch, but expect a 5-20x performance penalty. It’s a last resort, not a strategy.
The Simple Formula for VRAM Requirements
Here’s the formula that will save you hours of confusion:
VRAM Required = (Model Parameters × Bytes per Parameter) + Overhead
The “bytes per parameter” depends on the precision format:
| Precision Format | Bytes per Parameter | VRAM per Billion Parameters |
|---|---|---|
| FP32 (32-bit) | 4 bytes | ~4 GB |
| FP16 (16-bit) | 2 bytes | ~2 GB |
| INT8 (8-bit) | 1 byte | ~1 GB |
| INT4 (4-bit) | 0.5 bytes | ~0.5 GB |
Real example calculations (from Hugging Face model specs):
A Llama 3 8B model at different precisions:
- FP16 (standard): 8B × 2 bytes = 16 GB VRAM
- INT8 (quantized): 8B × 1 byte = 8 GB VRAM
- INT4 (heavily quantized): 8B × 0.5 bytes = 4 GB VRAM
A Llama 3 70B model:
- FP16: 70B × 2 bytes = 140 GB VRAM (not happening on consumer hardware)
- INT4: 70B × 0.5 bytes = 35 GB VRAM (possible on RTX 5090 or multiple GPUs)
These are baseline requirements. In practice, you need 10-20% extra for:
- KV cache (grows with context length)
- Framework overhead
- CUDA memory management
Rule of thumb: Add 15% to your calculated VRAM need for safety margin.
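If you'd rather let code do the arithmetic, here's a minimal Python sketch of the same formula. The model names and parameter counts mirror the examples above; the 15% overhead is the rule of thumb, not a measured value.

```python
# Rough VRAM estimate: (parameters x bytes per parameter) + ~15% overhead.
# Treat the output as a ballpark figure, not an exact measurement.

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

def estimate_vram_gb(params_billions: float, precision: str, overhead: float = 0.15) -> float:
    """Estimated VRAM in GB for a dense model at the given precision."""
    base_gb = params_billions * BYTES_PER_PARAM[precision]
    return base_gb * (1 + overhead)

for name, size in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    for precision in ("FP16", "INT8", "INT4"):
        print(f"{name} @ {precision}: ~{estimate_vram_gb(size, precision):.1f} GB")
```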
Complete VRAM Requirements Table (2026 Models)
Here’s a comprehensive table of popular models and their VRAM requirements. These are real-world numbers including typical overhead.
Llama Family (Meta)
| Model | Parameters | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Llama 3.2 1B | 1B | 0.8 GB | 1 GB | 1.5 GB | 2.5 GB |
| Llama 3.2 3B | 3B | 2 GB | 2.5 GB | 3.5 GB | 7 GB |
| Llama 3.1 8B | 8B | 5 GB | 6 GB | 9 GB | 17 GB |
| Llama 3.1 70B | 70B | 40 GB | 48 GB | 75 GB | 145 GB |
| Llama 3.1 405B | 405B | 230 GB | 275 GB | 430 GB | 850 GB |
Mistral Family
| Model | Parameters | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Mistral 7B | 7B | 4.5 GB | 5.5 GB | 8 GB | 15 GB |
| Mixtral 8x7B | 47B (active: 13B) | 27 GB | 33 GB | 50 GB | 100 GB |
| Mistral Large 2 | 123B | 70 GB | 85 GB | 130 GB | 260 GB |
Qwen Family (Alibaba)
| Model | Parameters | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Qwen 2.5 7B | 7B | 4.5 GB | 5.5 GB | 8 GB | 15 GB |
| Qwen 2.5 14B | 14B | 9 GB | 11 GB | 16 GB | 30 GB |
| Qwen 2.5 72B | 72B | 42 GB | 50 GB | 78 GB | 150 GB |
Smaller/Efficient Models
| Model | Parameters | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | 2.5 GB | 3 GB | 4.5 GB | 8 GB |
| Phi-3 Medium | 14B | 9 GB | 11 GB | 16 GB | 30 GB |
| Gemma 2 9B | 9B | 6 GB | 7 GB | 10 GB | 19 GB |
| Gemma 2 27B | 27B | 16 GB | 19 GB | 29 GB | 56 GB |
Reading the table: Q4_K_M and Q5_K_M are the most common quantization formats for daily use—good balance of quality and size. Q8_0 offers near-original quality. FP16 is the unquantized format.
For most users, Q4_K_M or Q5_K_M is the sweet spot. You’ll barely notice quality differences from FP16 for typical use cases.
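As a quick sanity check before a GPU purchase, you can encode a few table rows and filter by your VRAM budget. A small Python sketch, with the numbers copied from the Llama rows above (they're approximate, so keep extra headroom for long contexts):

```python
# Which model/quant combos fit in a given VRAM budget?
# Values are GB figures copied from the Llama table above; leave ~10%
# headroom on top for KV cache at longer contexts.

MODELS = {
    "Llama 3.2 3B":  {"Q4_K_M": 2,  "Q8_0": 3.5, "FP16": 7},
    "Llama 3.1 8B":  {"Q4_K_M": 5,  "Q8_0": 9,   "FP16": 17},
    "Llama 3.1 70B": {"Q4_K_M": 40, "Q8_0": 75,  "FP16": 145},
}

def fits(vram_gb: float, headroom: float = 0.10):
    budget = vram_gb * (1 - headroom)
    for model, quants in MODELS.items():
        for quant, need in quants.items():
            if need <= budget:
                yield f"{model} @ {quant} (~{need} GB)"

for option in fits(16):   # e.g. an RTX 4060 Ti 16GB
    print(option)
```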
Context Length: The Hidden VRAM Cost
Here’s what catches many people off guard: VRAM requirements grow with context length.
When an LLM processes text, it maintains a “KV cache”—a record of all previous tokens it needs to reference. This cache consumes VRAM, and it scales linearly with context length.
The additional VRAM math:
KV Cache VRAM ≈ 2 × Layers × Heads × (Head Dimension) × (Context Length) × Precision Bytes
That’s complex, so here’s a practical table for a typical 7B model:
| Context Length | Additional KV Cache VRAM (FP16) |
|---|---|
| 2,048 tokens | ~0.5 GB |
| 4,096 tokens | ~1 GB |
| 8,192 tokens | ~2 GB |
| 16,384 tokens | ~4 GB |
| 32,768 tokens | ~8 GB |
| 65,536 tokens | ~16 GB |
| 131,072 tokens | ~32 GB |
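If you want to plug in your own model, here's a small Python sketch of that formula. The 32-layer / 128-dimension figures are illustrative values for a generic 7B-class model, not any specific checkpoint; models that use grouped-query attention (GQA) cache far fewer KV heads, so real numbers land somewhere between the two bounds printed below, which is why the table above is approximate.

```python
# KV cache estimate: 2 (keys + values) x layers x kv_heads x head_dim
#                    x context_length x bytes_per_value.
# Layer/head/dimension defaults are illustrative for a generic 7B model.

def kv_cache_gb(context_len: int, layers: int = 32, kv_heads: int = 32,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    cache_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_value
    return cache_bytes / 1024**3

for ctx in (4_096, 32_768, 131_072):
    mha = kv_cache_gb(ctx, kv_heads=32)   # worst case: full multi-head attention
    gqa = kv_cache_gb(ctx, kv_heads=8)    # e.g. an 8-KV-head GQA model
    print(f"{ctx:>7} tokens: ~{mha:.1f} GB (MHA) / ~{gqa:.1f} GB (8-head GQA)")
```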
What this means in practice:
That 8B model that “fits” in 5GB at Q4 with 4K context? Try to use its full 128K context window, and you suddenly need 37GB+. This is why you might load a model successfully but crash when you paste in a long document.
Practical guidelines:
- For 8GB VRAM: Stick to 4K-8K context reliably
- For 16GB VRAM: Use up to 16K context comfortably
- For 24GB VRAM: 32K context is reasonable
- For 32GB+ VRAM: Extended context windows become practical
How Quantization Changes Everything
Quantization compresses model weights by reducing their numerical precision. It’s the magic that makes local AI practical.
How it works (simplified):
- Original models use 16-bit floating-point numbers (FP16)
- Quantization converts these to 8-bit, 4-bit, or even lower
- Each step roughly halves the VRAM requirement
- Quality degrades slightly, but often imperceptibly
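To make that concrete, here's a toy Python sketch of symmetric 4-bit quantization. Real GGUF formats (Q4_K_M and friends, listed below) use per-block scales plus extra refinements and pack two 4-bit values into each byte, so treat this purely as an illustration of why 4-bit weights cost roughly a quarter of the memory of FP16.

```python
import numpy as np

# Toy symmetric 4-bit quantization of one block of FP16 weights.
# Store small integers plus a scale instead of full-precision floats.

weights = np.random.randn(32).astype(np.float16)        # one block of 32 weights

scale = float(np.abs(weights).max()) / 7                # signed 4-bit range is -8..7
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale      # what inference actually sees

max_error = float(np.abs(weights.astype(np.float32) - dequantized).max())
print(f"max absolute error: {max_error:.4f}")
print("storage: 2 bytes/weight (FP16) -> 0.5 bytes/weight (4-bit) + one scale per block")
```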
Common quantization formats:
| Format | Quality Impact | Use Case |
|---|---|---|
| Q8_0 | Minimal (~1% degradation) | When quality is paramount |
| Q6_K | Very slight | High-quality default |
| Q5_K_M | Slight | Balanced choice |
| Q4_K_M | Moderate | Best value for most users |
| Q4_0 | Noticeable | When VRAM is very tight |
| Q3_K | Significant | Last resort |
| Q2_K | Heavy | Don’t use for serious work |
My recommendations:
- Primary workhorse: Q4_K_M or Q5_K_M. Best balance of quality and VRAM.
- Quality-critical tasks: Q8_0 if you have the VRAM
- VRAM-constrained: Q4_K_M is the floor for usable quality
- Avoid: Q3 and below for anything you care about
The “_K_M” suffix indicates K-quants with medium precision—a good middle ground. The “_S” variants save a bit more VRAM with slightly lower quality.
GPU VRAM Options in 2026
Let me map VRAM requirements to actual GPU options. For the latest GPU specs, check NVIDIA's GeForce page.
Consumer NVIDIA GPUs
| GPU | VRAM | Best Model Size | Street Price (Jan 2026) |
|---|---|---|---|
| RTX 4060 | 8 GB | 7B Q4 | ~$300 |
| RTX 4060 Ti 16GB | 16 GB | 13B Q4, 7B Q8 | ~$450 |
| RTX 4070 | 12 GB | 7B Q8, 13B Q4 | ~$550 |
| RTX 4070 Ti Super | 16 GB | 13B Q5 | ~$800 |
| RTX 4080 Super | 16 GB | 13B Q5 | ~$1,000 |
| RTX 4090 | 24 GB | 34B Q4, 13B FP16 | ~$1,600 |
| RTX 5090 | 32 GB | 70B Q4 | ~$2,000 |
| RTX 3090 (used) | 24 GB | 34B Q4, 13B FP16 | ~$650 |
AMD GPUs
| GPU | VRAM | Best Model Size | Street Price |
|---|---|---|---|
| RX 7600 | 8 GB | 7B Q4 | ~$250 |
| RX 7800 XT | 16 GB | 13B Q4 | ~$450 |
| RX 7900 XTX | 24 GB | 34B Q4 | ~$900 |
Apple Silicon
| Chip | Unified Memory Options | Notes |
|---|---|---|
| M3 | 8-24 GB | Shared with system |
| M3 Pro | 18-36 GB | Shared with system |
| M3 Max | 36-128 GB | Best for large models |
| M4 Pro | 24-48 GB | Good mid-range |
| M4 Max | 36-128 GB | Capacity rivals high-end discrete GPUs |
Apple Silicon uses unified memory shared between CPU, GPU, and system. It’s not directly comparable to dedicated VRAM, but for AI inference, a 64GB M3 Max can run models that would require a $2,000+ NVIDIA GPU.
Checking Your Current VRAM Usage
Before you run out of memory, monitor your usage:
NVIDIA (nvidia-smi)
# Check current VRAM usage
nvidia-smi
# Continuous monitoring (updates every 1 second)
watch -n 1 nvidia-smi
# Just the memory info
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
Reading the output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Memory-Usage         | GPU-Util             |
|===============================+======================+======================|
|   0  NVIDIA GeForce RTX 4090  | 18432MiB / 24564MiB  | 45%                  |
+-------------------------------+----------------------+----------------------+
This shows 18.4GB used of 24.5GB total. You have ~6GB headroom.
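If you'd rather query this from a script, the nvidia-ml-py bindings expose the same counters nvidia-smi reads. A minimal sketch, assuming an NVIDIA GPU and `pip install nvidia-ml-py`:

```python
# Query GPU memory programmatically via NVML (the library behind nvidia-smi).
# Requires an NVIDIA driver and: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first GPU
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):                          # older bindings return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"{name}: {mem.used / 1024**3:.1f} GiB used / {mem.total / 1024**3:.1f} GiB total")
finally:
    pynvml.nvmlShutdown()
```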
During Ollama Inference
Ollama reports memory usage when you load a model; you can also run `ollama ps` at any time to see which models are loaded and how much memory they occupy:
>>> /show info
Model: llama3:8b
Parameters: 8B
Quantization: Q4_K_M
Context Length: 8192
Memory Used: 4.92 GB
Using nvitop (Better Monitoring)
pip install nvitop
nvitop
This gives a beautiful, real-time dashboard of GPU usage.
What To Do When You Run Out of VRAM
Already hit an out-of-memory error? Here’s your troubleshooting checklist:
1. Use More Aggressive Quantization
If you’re running Q8, try Q5_K_M or Q4_K_M. The quality difference is usually acceptable.
# In Ollama, pull a smaller quantization
ollama pull llama3:8b-q4_K_M
2. Reduce Context Length
Most tools let you limit context length. Shorter context = less KV cache = less VRAM.
# Ollama example: shrink the context window inside a session
ollama run llama3
>>> /set parameter num_ctx 4096
3. Choose a Smaller Model
An 8B model outperforms a 70B model that doesn’t run. Sometimes smaller is better.
Quality rankings within similar sizes:
- Llama 3.1 8B > Mistral 7B > older models
- For code: DeepSeek Coder performs above its size class
4. Close Other GPU Applications
Check what else is using your GPU:
nvidia-smi
Browsers, video players, and even some desktop effects use VRAM. Close them before loading large models.
5. Enable CPU Offloading (Last Resort)
Some tools let you offload layers to CPU. It’s slow but works.
# llama.cpp example: put 20 layers on the GPU, keep the rest on CPU
# (current builds name the binary llama-cli; older builds call it main)
./llama-cli -m model.gguf -ngl 20
Expect 5-20x slower inference for offloaded layers.
6. Consider the Upgrade
If you’re constantly fighting VRAM limits, a hardware upgrade may be the real solution. See our GPU buying guide for recommendations.
Frequently Asked Questions
Can I run AI with only 4GB VRAM?
Barely. You can run very small models (1-3B parameters) or heavily quantized 7B models with short context. It’s usable for experimentation but frustrating for real work. For serious local AI, 8GB is the minimum; 16GB is comfortable.
Does faster VRAM (GDDR6X vs GDDR6) matter?
For inference, memory bandwidth matters more than capacity once you have enough VRAM. Higher-speed memory (GDDR6X, GDDR7) improves token generation speed. But if you don’t have enough VRAM to load the model, speed is irrelevant. Capacity first, bandwidth second.
Can I combine CPU and GPU memory?
Technically yes, but with severe performance penalties. When a model exceeds VRAM, frameworks like llama.cpp can offload layers to CPU. Expect each offloaded layer to slow things down significantly. It’s a workaround, not a solution.
How does Apple Silicon unified memory compare?
Apple’s unified memory is shared between CPU and GPU, making direct comparisons tricky. A 64GB M3 Max can run models requiring ~40GB of dedicated VRAM on NVIDIA, but memory bandwidth is lower, so generation speed is typically 30-50% slower. The advantage is that you can actually access that much memory on a laptop.
What about Intel Arc GPUs?
Intel Arc GPUs (like the A770 with 16GB) are budget-friendly and increasingly supported. Performance is lower than NVIDIA equivalents, and software compatibility is still maturing. They’re viable for experimentation but not my first recommendation for serious work.
Get Enough VRAM, Then Everything Else
Local AI becomes dramatically easier once you have enough VRAM. Models load instantly. No more cryptic crashes. No more mental math about what will fit.
Here’s my summary recommendation:
| Your Goal | Minimum VRAM | Recommended VRAM |
|---|---|---|
| Experiment with AI | 8 GB | 12 GB |
| Regular development | 12 GB | 16 GB |
| Run quality 13B models | 16 GB | 24 GB |
| Run 34B+ models | 24 GB | 32 GB+ |
Calculate your needs using the formulas above. Check the model table. Then buy the GPU that fits—or find ways to make your current GPU work with quantization and context limits.
For GPU buying advice, check our complete guide to the best GPUs for AI. For setting up local AI once you have the hardware, see our Ollama tutorial.
Your VRAM is your runway. Make sure it’s long enough for takeoff.