Build Offline AI: Run AI Without Internet (2026)
Complete guide to setting up AI that works without internet. Run local LLMs, speech recognition, and more in air-gapped environments.
Last summer I spent two weeks in a cabin with terrible satellite internet. Couldn’t load a webpage to save my life. But I had a full AI assistant running on my laptop—asking questions, debugging code, helping with writing—all completely offline.
That trip changed how I think about AI. Cloud services are convenient, but they’re also dependencies. When the internet goes down (and it WILL go down when you need it most), your AI assistant disappears too.
Building a fully offline AI setup isn’t just for survivalist scenarios. It’s for airplanes, remote locations, secure environments, and anyone who values AI that works reliably and privately.
Let me show you exactly how to build an AI system that works without a single internet connection.
Why Run AI Completely Offline
The mainstream AI experience is cloud-dependent. ChatGPT, Claude, Gemini—all require constant internet. Every query goes to distant servers. That’s fine until it isn’t.
Privacy and Security
When you use cloud AI:
- Your prompts travel over the internet
- They’re processed on servers you don’t control
- They may be logged, reviewed, or used for training
- Even “privacy-focused” services require trusting their policies
With offline AI:
- Nothing leaves your machine. Ever.
- No network traffic to intercept
- No server logs to worry about
- Complete, verifiable privacy
For sensitive work—legal documents, medical information, proprietary code, confidential business data—offline AI means genuine security. It’s the only way to be certain your data stays yours.
Reliability
Cloud services fail. I’ve been mid-conversation with Claude and had the service go down. Lost the context, couldn’t continue.
More commonly:
- Slow/unreliable internet degrades experience
- API rate limits can block you when you need help most
- Service outages happen (remember OpenAI’s big outage last year?)
- Geographic restrictions may block access while traveling
Offline AI doesn’t care about any of this. It runs on your hardware, using your power. Network problems become irrelevant.
Remote Locations
Some places simply don’t have internet:
- Airplanes (even with WiFi, it’s often unusable)
- Remote work sites
- Developing regions
- Wilderness and rural areas
- Underwater installations, ships, remote research stations
If your work takes you off-grid, offline AI lets you maintain AI-enhanced productivity anywhere.
Enterprise and Government Requirements
Many organizations prohibit cloud AI entirely:
- Classified environments (air-gapped networks)
- Healthcare (HIPAA compliance concerns)
- Finance (data residency requirements)
- Defense contractors
- Critical infrastructure
For these users, “just use ChatGPT” isn’t an option. Offline AI is the only AI.
Hardware Requirements
Let’s be realistic about what you need to run AI locally without compromises.
Minimum Specs
CPU-only setup (usable but slow):
- Any modern 64-bit CPU (Intel/AMD, last 5 years)
- 16 GB RAM minimum
- 50 GB free storage
- SSD strongly recommended
This runs 7B parameter models at ~5-10 tokens/second. Usable for simple tasks, a bit slow for interactive work.
Entry-level GPU setup:
- GPU with 8+ GB VRAM (RTX 3060, RTX 4060)
- 16 GB RAM
- 100 GB free storage
This runs 7B-13B models at ~20-40 tokens/second. Comfortable for most uses.
Recommended Specs
Good experience:
- GPU with 12-16 GB VRAM (RTX 4070 Ti, RTX 3080)
- 32 GB RAM
- 200 GB SSD storage
Excellent experience:
- GPU with 24+ GB VRAM (RTX 4090)
- 64 GB RAM
- 500 GB+ NVMe storage
With 24 GB VRAM you can comfortably run 34B models, and even 70B models become usable with aggressive quantization and partial CPU offload. Genuinely impressive capability.
GPU vs CPU
GPU advantages:
- 5-10x faster than CPU
- Essential for larger models
- Better power efficiency per token
CPU advantages:
- Works on more hardware (including laptops without discrete GPUs)
- No driver issues
- Still functional with enough RAM
My recommendation: Use a GPU if you have one. But don’t let lack of GPU stop you—CPU inference works, just slower.
For more on hardware choices, see our GPU guide for AI.
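Not sure what you're working with? A quick check, assuming an NVIDIA GPU and a Linux shell (Mac users can look at About This Mac instead):
# GPU model and total VRAM (NVIDIA)
nvidia-smi --query-gpu=name,memory.total --format=csv
# System RAM (Linux)
free -h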
Storage Needs
Models take significant space:
| Model Size | Quantized Size | Full Precision |
|---|---|---|
| 7B | 4 GB (Q4) | 14 GB |
| 13B | 7 GB (Q4) | 26 GB |
| 34B | 20 GB (Q4) | 68 GB |
| 70B | 40 GB (Q4) | 140 GB |
Keep multiple models for different tasks? Budget accordingly. I keep about 150 GB of models on my main machine.
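To see how much of that budget you've already spent, both of these report it (the path assumes Ollama's default model location, covered later in the walkthrough):
# Total disk space used by downloaded Ollama models
du -sh ~/.ollama/models
# Per-model sizes
ollama list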
Choosing Your Offline Software Stack
Several tools can run AI locally. Here’s how they compare.
Ollama (Recommended)
Ollama is my top recommendation for most users:
Pros:
- Simplest setup (one command to install, one to run)
- Large model library
- Cross-platform (Mac, Windows, Linux)
- Active community
- OpenAI-compatible API (example at the end of this section)
Cons:
- Less customization than raw llama.cpp
- Slightly more overhead than bare metal
Best for: Most users, especially beginners.
We have a detailed Ollama guide covering installation and usage.
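That OpenAI-compatible API is worth seeing once: tools built for the OpenAI chat endpoint can usually be pointed at Ollama instead. A minimal sketch, assuming Ollama is running on its default port 11434 and the llama4 tag used later in this guide has been pulled:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'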
LM Studio
LM Studio is a desktop application with a graphical interface:
Pros:
- Beautiful GUI—no command line needed
- Easy model downloading and management
- Built-in chat interface
- Cross-platform
Cons:
- GUI overhead uses more resources
- Less scriptable
- Larger installation
Best for: Users who prefer graphical interfaces, quick testing of models.
Check out our LM Studio guide for more details.
llama.cpp
The underlying engine that powers many local AI tools:
Pros:
- Maximum performance
- Minimum overhead
- Most customization
- Updated frequently with new optimizations
Cons:
- Requires building from source (or finding binaries)
- Command-line only
- Steeper learning curve
Best for: Power users, embedded/resource-constrained environments, maximum performance seekers.
Comparison Table
| Feature | Ollama | LM Studio | llama.cpp |
|---|---|---|---|
| Ease of use | ★★★★★ | ★★★★★ | ★★★☆☆ |
| Performance | ★★★★☆ | ★★★★☆ | ★★★★★ |
| GUI | No (can add) | Yes | No |
| Customization | ★★★★☆ | ★★★☆☆ | ★★★★★ |
| Model variety | ★★★★★ | ★★★★☆ | ★★★★☆ |
For this guide, I’ll use Ollama—the best balance of simplicity and capability.
Best Models for Offline Use
Not all models are equal for offline deployment. Here are my recommendations.
Llama 4 Variants
Meta’s Llama 4 family is the gold standard for open-source AI:
Llama 4 8B:
- Best for: 8-12 GB VRAM or 16+ GB RAM
- Capability: Very good for most tasks
- Speed: Fast, even on modest hardware
ollama pull llama4
Llama 4 70B (quantized):
- Best for: 24+ GB VRAM or 64+ GB RAM
- Capability: Excellent, approaching GPT-4 level
- Speed: Slower but worth it for harder tasks
ollama pull llama4:70b-instruct-q4_K_M
For more on the Llama family, see our Meta Llama 4 guide.
Mistral Models
Models from the French lab Mistral AI offer excellent quality:
Mistral 7B:
- Punches above its weight
- Great instruction following
- Excellent for coding
ollama pull mistral
Mistral Large (via API or quantized):
- Top-tier capability
- Larger memory requirements
- Worth it for complex tasks
See our Mistral guide for details.
Phi-4 (Compact Option)
Microsoft’s Phi-4 is surprisingly capable for its size:
Best for: Limited hardware, embedded systems, ARM devices
Capability: Impressive for its size, though not matching larger models
ollama pull phi4
Quantized, it even runs on a Raspberry Pi 5 or an older laptop, though expect modest speeds there.
Quantization Explained
Quantization compresses model weights to use less memory:
| Format | Bits | Size Reduction | Quality Impact |
|---|---|---|---|
| Full (fp16) | 16 | Baseline | Perfect |
| Q8 | 8 | ~50% | Minimal |
| Q6 | 6 | ~60% | Very slight |
| Q4 | 4 | ~75% | Slight |
| Q2 | 2 | ~85% | Noticeable |
For most uses, Q4 offers the best balance. You fit larger models with minimal quality loss. Q2 is useful for truly constrained environments but quality suffers.
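The table's sizes are simple arithmetic: roughly the parameter count times bits per weight, divided by 8 to convert bits to bytes. A 7B model at 4 bits works out to 7 × 4 ÷ 8 ≈ 3.5 GB; the extra headroom you see in practice comes from a few tensors kept at higher precision plus file overhead.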
When running with Ollama, models are typically available in Q4 by default:
ollama pull llama4:70b-instruct-q4_K_M # Q4 quantization
Complete Offline Setup Walkthrough
Let’s build a fully offline AI system from scratch.
Pre-downloading Everything
The key to offline operation: download everything while you have internet.
Step 1: Download Ollama
While online, download the installer for your platform from ollama.ai.
Save the installer file—you’ll use it on your offline machine.
Step 2: Download Models
Pull all models you’ll need:
# Primary general-purpose model
ollama pull llama4
# Code-focused model
ollama pull deepseek-coder-v2
# Fast/light model for quick tasks
ollama pull phi4
# Vision model (if needed)
ollama pull llava
Each download stores the model locally. They’ll work offline forever after.
Step 3: Note Model Locations
Ollama stores models in:
- macOS: ~/.ollama/models
- Linux: ~/.ollama/models
- Windows: %USERPROFILE%\.ollama\models
You can copy this entire folder to another machine for offline use.
Installing Ollama Offline
If your target machine has no internet:
- Copy the Ollama installer to a USB drive
- Copy the .ollama/models folder to the USB drive
- On the offline machine:
  - Install Ollama from the installer
  - Copy the models folder to the appropriate location
That’s it. Run ollama list to verify models are available.
Transferring Models
For air-gapped environments, you need to physically transfer models.
Export from online machine:
# Locate model files
ls ~/.ollama/models/manifests/registry.ollama.ai/library/
# Tar up everything (relative to your home directory so it extracts cleanly)
tar -czvf ollama-models.tar.gz -C ~ .ollama/models
Import on offline machine:
# Extract to correct location
tar -xzvf ollama-models.tar.gz -C ~/
# Verify
ollama list
This works across any transport medium: USB, external drive, sneakernet.
Verifying Offline Operation
Test that everything works without internet.
Step 1: Disable network
- Turn off WiFi
- Disconnect Ethernet
- Enable airplane mode
Step 2: Run a model
ollama run llama4 "What year was the Eiffel Tower built?"
If you get a response, congratulations—you have working offline AI.
Step 3: Test various models
# Test each downloaded model
ollama run phi4 "Explain quantum computing briefly"
ollama run deepseek-coder-v2 "Write a Python function to find prime numbers"
All should respond without network access.
Adding Offline GUI
Command line not your preference? Add a graphical interface.
Open WebUI (Offline Setup)
Open WebUI provides a ChatGPT-like interface for Ollama.
Pre-download (while online):
# Pull the Docker image
docker pull ghcr.io/open-webui/open-webui:main
# Save to file
docker save ghcr.io/open-webui/open-webui:main > openwebui.tar
Install offline:
# Load image from file
docker load < openwebui.tar
# Run
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. Full ChatGPT-style interface, completely offline.
Desktop Options
For no-Docker setups:
Chatbox - Cross-platform desktop app
- Download the installer while online
- Configure it to connect to http://localhost:11434 (Ollama's API)
- Works offline after initial setup
Msty (macOS) - Native Mac app
- Download from App Store while online
- Connects to local Ollama automatically
- Clean, native interface
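Whichever desktop app you pick, it is just a front end for Ollama's local API. Before configuring one, a quick sanity check that the endpoint is reachable and sees your models (assuming the default port):
# List locally available models through the API
curl http://localhost:11434/api/tags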
Terminal-Based Usage
For minimal setups without GUI overhead:
# Create a simple chat script
cat > chat.sh << 'EOF'
#!/bin/bash
while true; do
  echo -n "You: "
  read -r input
  if [ "$input" = "quit" ]; then break; fi
  ollama run llama4 "$input"
  echo
done
EOF
chmod +x chat.sh
# Run
./chat.sh
Simple, effective, works on any terminal.
Offline Speech and Vision
For complete offline AI, you might want voice and vision too.
Local Whisper
OpenAI’s Whisper can run locally for speech-to-text:
Install (while online):
pip install openai-whisper
Download models (while online):
import whisper
# This downloads the model
model = whisper.load_model("base") # or "small", "medium", "large"
Models are cached in ~/.cache/whisper/.
Use offline:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Works completely offline once models are downloaded. See our audio guide for more context.
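For a truly air-gapped machine, the Whisper models can be moved the same way as the Ollama models earlier, a sketch assuming the default cache path mentioned above:
# On the online machine: bundle the cached Whisper models
tar -czvf whisper-models.tar.gz -C ~ .cache/whisper
# On the offline machine: restore them to the same location
tar -xzvf whisper-models.tar.gz -C ~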
Local Image Models
For image understanding offline:
LLaVA via Ollama:
# Download while online
ollama pull llava
# Use offline
ollama run llava "Describe this image: [image path]"
LLaVA can analyze images, describe their contents, and answer questions about them.
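For scripted use, you can also send an image to LLaVA through Ollama's local API as a base64 string. A minimal sketch, assuming the default port; photo.jpg is a placeholder for your own file:
# Encode the image and ask LLaVA about it (stream disabled for a single JSON reply)
IMG=$(base64 < photo.jpg | tr -d '\n')
curl http://localhost:11434/api/generate -d "{
  \"model\": \"llava\",
  \"prompt\": \"What is in this picture?\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"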
Stable Diffusion for image generation:
Requires more setup but works fully offline. Download models while online, then generate images anywhere. See our Stable Diffusion tutorial.
Performance Optimization
Once your offline setup works, optimize it.
Memory Management
Monitor memory usage:
# Linux (macOS has no free command; use Activity Monitor or top instead)
watch -n 1 free -h
# Check Ollama specifically
ollama ps
Reduce memory usage:
- Use more quantized models (Q4 instead of Q8)
- Run one model at a time (see the unload sketch after this list)
- Close other applications during inference
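The unload sketch mentioned above: when switching models, you can tell Ollama to drop the previous one from memory right away instead of waiting for its idle timeout. Assuming the default port and the llama4 tag:
# keep_alive 0 asks Ollama to unload the model immediately
curl http://localhost:11434/api/generate -d '{"model": "llama4", "keep_alive": 0}'
# Confirm nothing is still loaded
ollama ps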
Speed Improvements
Use GPU when available: Ollama uses your GPU automatically once the drivers are set up (ollama ps also shows whether a loaded model sits on GPU or CPU). Verify with:
ollama run llama4 --verbose
# Check the eval rate in the timing output; GPU inference is several times faster than CPU-only
Match model to hardware:
- 8 GB VRAM → 7B models
- 12 GB VRAM → 13B models
- 24 GB VRAM → 34B or 70B models
Reduce context length for speed:
# Create Modelfile with shorter context
cat > Modelfile << EOF
FROM llama4
# Default is often 4096+; a shorter window uses less memory and speeds up prompt processing
PARAMETER num_ctx 2048
EOF
ollama create llama4-fast -f Modelfile
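Then use the trimmed-context variant like any other local model:
ollama run llama4-fast "Give me a two-sentence explanation of binary search."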
Battery Considerations
For laptops:
- Smaller models use less power
- CPU inference is possible but drains faster
- Consider “low-power” models like Phi-4 for battery operation
Heavy AI work can drain a laptop battery in 1-2 hours. Plan accordingly.
Frequently Asked Questions
Can I run all these models on a MacBook?
Yes! Apple Silicon Macs (M1/M2/M3/M4) run local AI exceptionally well. The unified memory architecture means even a 16GB MacBook Air can run 7B models smoothly. See our Mac AI guide.
What’s the quality difference vs ChatGPT?
Honest answer: cloud models still have an edge. GPT-5 and Claude 4 are generally more capable than Llama 4 70B. But the gap has narrowed dramatically. For most practical tasks, local models are “good enough”—and they never fail due to internet issues.
How do I update models when I’m back online?
# Re-pull to get updates
ollama pull llama4
# Then transfer to offline machine as before
Check for updates periodically. Model improvements are released regularly.
Can I run this on a Raspberry Pi?
Pi 5 with 8GB RAM can run small models (Phi, Gemma 2B). It’s slow but works. Great for embedded projects or always-on assistants where speed isn’t critical.
What if I need the latest information?
Offline AI can’t access the internet, so it doesn’t know recent events. For most tasks (coding, writing, analysis), this doesn’t matter. For current events, you’ll need to provide context or wait until you’re back online.
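One low-tech way to provide that context is to pipe it into the prompt yourself, for example (notes.txt is a placeholder for whatever reference text you have on disk):
ollama run llama4 "Summarize the key points of the following notes: $(cat notes.txt)"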
Conclusion
Setting up offline AI takes some initial effort, but the payoff is substantial:
- True privacy: Nothing ever leaves your machine
- Perfect reliability: Works anywhere, anytime
- Zero ongoing cost: Run as much as you want
- Complete independence: No account needed, no service to trust
The capabilities of local models have reached the point where offline AI isn’t a compromise—it’s a legitimate choice that makes sense for many use cases.
Start simple:
- Install Ollama
- Pull a few models
- Test offline operation
- Add GUI if desired
Once set up, you’ll wonder why you ever depended on cloud AI for everything.
For more local AI resources, explore our guides on Ollama, open source LLMs, and LM Studio.
Your AI, your hardware, your terms.
Last updated: January 11, 2026