Local LLM Setup Guide 2026: Run AI on Your Machine
Complete guide to running LLMs locally. Hardware requirements, model selection, setup tools, and real benchmarks from my RTX 4090 and MacBook M3.
Why Run Locally?
Cloud LLMs are great, but local models have unique advantages:
- Privacy: Your data never leaves your machine
- Cost: No API fees after hardware investment
- Offline: Works without internet
- Customization: Fine-tune for your use case
- Latency: No network round-trip (often faster for small models)
Trade-off? Model quality. The best local models are ~75-80% of GPT-4.1 quality. But for many tasks, that's good enough.
Hardware: What You Need
Minimum (Usable)
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 8GB | 12GB+ |
| System RAM | 16GB | 32GB |
| Storage | 50GB SSD | 100GB+ NVMe |
| CPU | Modern 4-core | 8-core+ |
GPU Selection by Model Size
| Model Size | Min VRAM | Recommended VRAM | Affordable GPUs |
|---|---|---|---|
| 7B params | 6GB | 8GB | RTX 3060, RTX 4060 |
| 14B params | 10GB | 12GB | RTX 3080, RTX 4070 |
| 32B params | 20GB | 24GB | RTX 3090, RTX 4090 |
| 70B params | 40GB | 48GB (2x 24GB) | 2x RTX 3090 or 4090 ($1500-3000) |
Mac Silicon (M-series)
Macs use unified memory — GPU and CPU share RAM:
- M1/M2/M3 8GB: 7B models only
- M1/M2/M3 16GB: 7-14B models
- M1/M2/M3 32GB: 14-32B models
- M1/M2/M3 64GB+: Up to 70B models
Warning: MacBooks throttle under sustained load. Desktop Mac Studio is better for heavy use.
Setup Tools Comparison
| Tool | Platform | Ease | Best For |
|---|---|---|---|
| Ollama | Mac, Linux, Windows (WSL2) | ⭐⭐⭐⭐⭐ | Quick start, CLI workflows |
| LM Studio | Mac, Windows, Linux | ⭐⭐⭐⭐⭐ | GUI, model experimentation |
| llama.cpp | All platforms | ⭐⭐⭐ | Maximum control, customization |
| text-generation-webui | All platforms | ⭐⭐⭐ | Chat interface, extensions |
| vLLM | Linux (CUDA) | ⭐⭐ | Production serving, high throughput |
Option 1: Ollama (Recommended for Most)
Easiest way to get started. One-line install.
Installation
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from https://ollama.com/download
# Verify
ollama --version
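Ollama also starts a local API server on localhost:11434 the first time it runs. If you want to confirm the server is reachable from Python, a minimal check against the /api/tags endpoint (which lists installed models) looks like this; the endpoint and port are Ollama's defaults, and the script is just a sketch:
# Quick health check: confirm the local Ollama server responds and list installed models
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Ollama is up. Installed models:", models or "none yet")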
Running Your First Model
# Download and run Llama 4 Scout (17B, great balance)
ollama run llama4:scout
# Other good options:
ollama run qwen3:8b # Faster, lighter
ollama run deepseek-r1:7b # Reasoning-focused
ollama run mistral:7b # Classic, stable
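If you'd rather script model management than type CLI commands, the official ollama Python package (pip install ollama, also used in the API section below) exposes the same operations; a minimal sketch:
# Download, inspect, and remove models from Python (mirrors the CLI commands)
import ollama

ollama.pull("qwen3:8b")           # same as: ollama pull qwen3:8b
info = ollama.show("qwen3:8b")    # parameters, template, license details
ollama.delete("mistral:7b")       # same as: ollama rm mistral:7b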
Performance Benchmarks (My Tests)
| Model | Hardware | Speed (tokens/s) | First Token | Quality |
|---|---|---|---|---|
| Llama 4 Scout 17B | RTX 4090 | 85 | 0.3s | ⭐⭐⭐⭐ |
| Llama 4 Scout 17B | M3 Max 36GB | 42 | 0.5s | ⭐⭐⭐⭐ |
| Qwen3 8B | RTX 4090 | 120 | 0.2s | ⭐⭐⭐ |
| Qwen3 8B | M3 MacBook 16GB | 38 | 0.4s | ⭐⭐⭐ |
| DeepSeek-R1 7B | RTX 4090 | 140 | 0.15s | ⭐⭐⭐+ |
| DeepSeek-R1 7B | RTX 3060 12GB | 45 | 0.3s | ⭐⭐⭐+ |
Using via API
# Ollama runs a local API server (default: localhost:11434)
# Python with requests
import requests
response = requests.post('http://localhost:11434/api/generate', json={
"model": "llama4:scout",
"prompt": "Explain RAG in one paragraph",
"stream": False
})
print(response.json()['response'])
# Or use official SDK
import ollama
response = ollama.chat(model='llama4:scout', messages=[
{'role': 'user', 'content': 'Explain RAG in one paragraph'}
])
print(response['message']['content'])
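The tokens/s numbers in the benchmark table above are easy to reproduce: Ollama's /api/generate response includes eval_count (tokens generated) and eval_duration (time spent generating, in nanoseconds), so throughput is just their ratio. A sketch using the same endpoint and model as above:
# Measure generation throughput for a local model via Ollama's API timing fields
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama4:scout",
    "prompt": "Explain RAG in one paragraph",
    "stream": False
}).json()

tokens = resp["eval_count"]             # tokens generated
seconds = resp["eval_duration"] / 1e9   # eval_duration is reported in nanoseconds
print(f"{tokens} tokens in {seconds:.2f}s -> {tokens / seconds:.1f} tokens/s")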
Option 2: LM Studio (GUI Experience)
Best if you prefer a GUI over command line.
Setup
- Download from lmstudio.ai
- Open the app
- Browse models in the "Discover" tab
- Download a model (I recommend Qwen3 8B or Llama 4 Scout)
- Chat in the UI or start a local server
Start Local Server
# In LM Studio:
# 1. Go to "Local Server" tab
# 2. Select your model
# 3. Click "Start Server"
# 4. Server runs on localhost:1234
# Test with curl:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"messages": [{"role": "user", "content": "Hello!"}]
}'
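Because the server is OpenAI-compatible, you can also drive it from Python with the openai SDK instead of curl. A sketch, assuming the default localhost:1234 port; the api_key value is a placeholder, since LM Studio doesn't check it:
# Talk to LM Studio's local server through the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model you loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)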
LM Studio Benefits
- Easy model management (download, delete, organize)
- Built-in chat interface
- OpenAI-compatible API server
- GPU selection (multi-GPU setups)
- Quantization options (Q4, Q5, Q8)
Model Recommendations
By Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| General chat/writing | Llama 4 Scout 17B | Best quality/size balance |
| Coding | Qwen3 14B Coder | Strong at code generation |
| Reasoning | DeepSeek-R1 Distill | Built-in CoT, strong logic |
| Fast responses | Qwen3 8B | Speed + decent quality |
| Limited VRAM | Phi-4 Mini 4B | 4GB VRAM, usable quality |
| Maximum quality | Llama 4 Maverick 400B (quantized) | Highest local quality; needs 48GB+ VRAM |
Quantization Explained
Quantization reduces model size by lowering precision:
| Quantization | Size Reduction | Quality Loss | Speed Gain |
|---|---|---|---|
| FP16 (full) | None | Baseline | 1x |
| Q8 (8-bit) | ~50% | ~2-3% | ~1.3x |
| Q6 (6-bit) | ~60% | ~3-5% | ~1.5x |
| Q4 (4-bit) | ~75% | ~5-8% | ~2x |
| Q4_K_M (recommended) | ~70% | ~5% | ~1.8x |
Recommendation: Q4_K_M or Q5_K_M for most use cases. Good size/quality trade-off.
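A handy rule of thumb when choosing a quantization: weight memory is roughly parameter count x bits per weight / 8, plus a gigabyte or two of overhead for the KV cache and runtime. A back-of-envelope sketch (the overhead figure and the ~4.8 bits/weight for Q4_K_M are assumptions, not measurements):
# Rough VRAM estimate: params (billions) x bits per weight / 8, plus runtime overhead
def estimate_vram_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

print(f"14B @ Q4_K_M: ~{estimate_vram_gb(14, 4.8):.1f} GB")  # ~9.9 GB, fits a 12GB card
print(f"14B @ FP16:   ~{estimate_vram_gb(14, 16):.1f} GB")   # ~29.5 GB, more than a single 24GB card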
Advanced: llama.cpp Direct
For maximum control and customization.
# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download a model (GGUF format)
# From HuggingFace: https://huggingface.co/models?search=gguf
# Run inference
./build/bin/llama-cli \
-m models/llama-4-scout-q4_k_m.gguf \
-p "Explain RAG in one paragraph" \
-n 256 \
--temp 0.7
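If you'd rather drive llama.cpp from Python than the CLI, the community llama-cpp-python bindings (pip install llama-cpp-python) wrap the same engine and load the same GGUF files; a sketch, reusing the model path from the command above:
# Run a GGUF model through the llama-cpp-python bindings
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-4-scout-q4_k_m.gguf",
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU; 0 keeps everything on CPU
    n_ctx=4096,       # context window size
)
out = llm("Explain RAG in one paragraph", max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])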
Production Considerations
Memory Management
# Check GPU memory usage (NVIDIA)
nvidia-smi
# Ollama: control GPU offload with the num_gpu parameter (number of layers kept on the GPU)
# There is no --gpu-layers flag on ollama run; set it inside an interactive session:
ollama run llama4:scout
>>> /set parameter num_gpu 35
# Fewer layers = less VRAM used, but slower
# More layers = more VRAM used, faster
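The same offload control is available per request through the API: both /api/generate and the Python SDK accept an options dict, and num_gpu sets how many layers go to the GPU. A sketch using the SDK from earlier:
# Limit GPU offload for a single request via the num_gpu option
import ollama

response = ollama.chat(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Explain RAG in one paragraph"}],
    options={"num_gpu": 20},  # offload 20 layers; lower this if you hit out-of-memory errors
)
print(response["message"]["content"])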
Concurrent Requests
Ollama queues requests by default. For concurrent handling:
# Set OLLAMA_NUM_PARALLEL
export OLLAMA_NUM_PARALLEL=4 # Up to 4 concurrent requests
# Or in systemd service:
# /etc/systemd/system/ollama.service
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Persistent Server
# Linux: Run as systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
# macOS: Run at login
# Ollama auto-starts after first run
# Windows: Runs as startup app automatically
Troubleshooting
"Out of memory" errors
- Use smaller quantization (Q4 instead of Q8)
- Reduce GPU layers: lower num_gpu (for example, /set parameter num_gpu 20 in the session)
- Try a smaller model (7B instead of 14B)
- Check for other GPU processes (browser, video editor)
Slow inference
- If inference is running on the CPU, offload more layers to the GPU by raising num_gpu
- Close other GPU-heavy applications
- Try smaller quantization
- Check for thermal throttling: nvidia-smi -q -d TEMPERATURE
Download failures
- Check internet connection
- Retry the pull: ollama pull <model>
- Download the GGUF manually from HuggingFace and import it with a Modelfile (FROM ./model.gguf, then ollama create)
Cost Analysis
When does local make sense vs cloud?
| Usage | Cloud Cost (Monthly) | Local Break-even |
|---|---|---|
| 100K tokens/day | $30-50 | 6-12 months (RTX 4070) |
| 500K tokens/day | $150-250 | 2-4 months (RTX 4070) |
| 1M tokens/day | $300-500 | 1-2 months (RTX 4090) |
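The break-even column is simple division: months to break even = hardware cost / monthly cloud spend (electricity adds a little to the local side but doesn't change the picture much). A quick sketch with assumed hardware prices:
# Back-of-envelope break-even: hardware cost vs. monthly cloud API spend
def breakeven_months(hardware_cost: float, monthly_cloud_cost: float) -> float:
    return hardware_cost / monthly_cloud_cost

print(breakeven_months(600, 200))  # ~$600 RTX 4070 vs $200/mo cloud -> 3.0 months
print(breakeven_months(600, 50))   # same card vs $50/mo light usage -> 12.0 months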
Local makes sense if:
- You process 100K+ tokens/day regularly
- Privacy is non-negotiable
- You need offline capability
- You want to fine-tune models
Cloud is better if:
- Usage is sporadic or low volume
- You need the absolute best model quality
- You don't want to manage hardware
Key Takeaways
- Start with Ollama: One-line install, works everywhere
- 8GB VRAM minimum: For usable 7B models; 12GB+ for 14B
- Llama 4 Scout 17B: Best quality/size balance for most users
- Q4_K_M quantization: Good compromise for size/quality
- Mac M-series works: Unified memory means RAM = VRAM
- Local pays off if you push 100K+ tokens/day or need privacy
Running LLMs locally isn't just for researchers anymore. With tools like Ollama and models like Llama 4 Scout, anyone with a decent GPU or Mac can have a capable AI assistant running entirely on their machine.