Tutorial · May 4, 2026 · 16 min read

Local LLM Setup Guide 2026: Run AI on Your Machine

Complete guide to running LLMs locally. Hardware requirements, model selection, setup tools, and real benchmarks from my RTX 4090 and MacBook M3.

Why Run Locally?

Cloud LLMs are great, but local models have unique advantages:

  • Privacy: Your data never leaves your machine
  • Cost: No API fees after hardware investment
  • Offline: Works without internet
  • Customization: Fine-tune for your use case
  • Latency: No network round-trip (often faster for small models)

Trade-off? Model quality. The best local models are ~75-80% of GPT-4.1 quality. But for many tasks, that's good enough.

Hardware: What You Need

Minimum (Usable)

| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 8GB | 12GB+ |
| System RAM | 16GB | 32GB |
| Storage | 50GB SSD | 100GB+ NVMe |
| CPU | Modern 4-core | 8-core+ |

GPU Selection by Model Size

| Model Size | Min VRAM | Recommended VRAM | Affordable Options |
|---|---|---|---|
| 7B params | 6GB | 8GB | RTX 3060, RTX 4060 |
| 14B params | 10GB | 12GB | RTX 3080, RTX 4070 |
| 32B params | 20GB | 24GB | RTX 3090, RTX 4090 |
| 70B params | 40GB | 48GB (2× 24GB) | 2× RTX 3090/4090, $1,500-3,000 |

Mac Silicon (M-series)

Macs use unified memory — GPU and CPU share RAM:

  • M1/M2/M3 8GB: 7B models only
  • M1/M2/M3 16GB: 7-14B models
  • M1/M2/M3 32GB: 14-32B models
  • M1/M2/M3 64GB+: Up to 70B models

Warning: MacBooks throttle under sustained load. Desktop Mac Studio is better for heavy use.

Setup Tools Comparison

| Tool | Platform | Ease | Best For |
|---|---|---|---|
| Ollama | Mac, Linux, Windows | ⭐⭐⭐⭐⭐ | Quick start, CLI workflows |
| LM Studio | Mac, Windows, Linux | ⭐⭐⭐⭐⭐ | GUI, model experimentation |
| llama.cpp | All platforms | ⭐⭐⭐ | Maximum control, customization |
| text-generation-webui | All platforms | ⭐⭐⭐ | Chat interface, extensions |
| vLLM | Linux (CUDA) | ⭐⭐ | Production serving, high throughput |

Option 1: Ollama (Recommended for Most)

Easiest way to get started. One-line install.

Installation

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: Download from https://ollama.com/download

# Verify
ollama --version

Running Your First Model

# Download and run Llama 4 Scout (17B, great balance)
ollama run llama4:scout

# Other good options:
ollama run qwen3:8b        # Faster, lighter
ollama run deepseek-r1:7b  # Reasoning-focused
ollama run mistral:7b      # Classic, stable

Performance Benchmarks (My Tests)

| Model | Hardware | Speed (tokens/s) | First Token | Quality |
|---|---|---|---|---|
| Llama 4 Scout 17B | RTX 4090 | 85 | 0.3s | ⭐⭐⭐⭐ |
| Llama 4 Scout 17B | M3 Max 36GB | 42 | 0.5s | ⭐⭐⭐⭐ |
| Qwen3 8B | RTX 4090 | 120 | 0.2s | ⭐⭐⭐ |
| Qwen3 8B | M3 MacBook 16GB | 38 | 0.4s | ⭐⭐⭐ |
| DeepSeek-R1 7B | RTX 4090 | 140 | 0.15s | ⭐⭐⭐+ |
| DeepSeek-R1 7B | RTX 3060 12GB | 45 | 0.3s | ⭐⭐⭐+ |

Using via API

# Ollama runs a local API server (default: localhost:11434)

# Python with requests
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    "model": "llama4:scout",
    "prompt": "Explain RAG in one paragraph",
    "stream": False
})

print(response.json()['response'])

# Or use official SDK
import ollama
response = ollama.chat(model='llama4:scout', messages=[
    {'role': 'user', 'content': 'Explain RAG in one paragraph'}
])
print(response['message']['content'])
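
The SDK also supports streaming, which makes interactive use feel much snappier because tokens print as they're generated. A minimal sketch, assuming the ollama Python package is installed and the model has already been pulled:

# Stream the reply token by token instead of waiting for the full response
import ollama

stream = ollama.chat(
    model='llama4:scout',
    messages=[{'role': 'user', 'content': 'Explain RAG in one paragraph'}],
    stream=True,
)

for chunk in stream:
    # each chunk carries the next fragment of the assistant's reply
    print(chunk['message']['content'], end='', flush=True)
print()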

Option 2: LM Studio (GUI Experience)

Best if you prefer a GUI over command line.

Setup

  1. Download from lmstudio.ai
  2. Open the app
  3. Browse models in the "Discover" tab
  4. Download a model (I recommend Qwen3 8B or Llama 4 Scout)
  5. Chat in the UI or start a local server

Start Local Server

# In LM Studio:
# 1. Go to "Local Server" tab
# 2. Select your model
# 3. Click "Start Server"
# 4. Server runs on localhost:1234

# Test with curl:
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
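
Since the endpoint is OpenAI-compatible, the standard openai Python client also works against it. A minimal sketch, assuming the openai package is installed (the API key is a dummy value LM Studio ignores, and "local-model" is a placeholder for whatever model you have loaded):

# Point the OpenAI Python client at LM Studio's local server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # dummy key, ignored locally

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves the model currently loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)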

LM Studio Benefits

  • Easy model management (download, delete, organize)
  • Built-in chat interface
  • OpenAI-compatible API server
  • GPU selection (multi-GPU setups)
  • Quantization options (Q4, Q5, Q8)

Model Recommendations

By Use Case

| Use Case | Recommended Model | Why |
|---|---|---|
| General chat/writing | Llama 4 Scout 17B | Best quality/size balance |
| Coding | Qwen3 14B Coder | Strong at code generation |
| Reasoning | DeepSeek-R1 Distill | Built-in CoT, strong logic |
| Fast responses | Qwen3 8B | Speed + decent quality |
| Limited VRAM | Phi-4 Mini 4B | 4GB VRAM, usable quality |
| Maximum quality | Llama 4 Maverick 400B (quantized) | Needs 48GB+ VRAM |

Quantization Explained

Quantization reduces model size by lowering precision:

| Quantization | Size Reduction | Quality Loss | Speed Gain |
|---|---|---|---|
| FP16 (full) | None | Baseline | 1x |
| Q8 (8-bit) | ~50% | ~2-3% | ~1.3x |
| Q6 (6-bit) | ~60% | ~3-5% | ~1.5x |
| Q4 (4-bit) | ~75% | ~5-8% | ~2x |
| Q4_K_M (recommended) | ~70% | ~5% | ~1.8x |

Recommendation: Q4_K_M or Q5_K_M for most use cases. Good size/quality trade-off.
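
To sanity-check whether a given quantization fits your GPU, a back-of-the-envelope estimate is parameter count times bytes per weight, plus some headroom for the KV cache and runtime buffers. A rough sketch in Python (the 20% overhead factor is my own assumption, not a measured constant):

# Rough VRAM estimate: parameters x bytes per weight, plus ~20% overhead
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q6": 0.75, "Q4": 0.5}

def estimate_vram_gb(params_billion, quant="Q4", overhead=1.2):
    # overhead loosely covers KV cache, activations, and runtime buffers
    return params_billion * BYTES_PER_WEIGHT[quant] * overhead

print(f"7B at Q4:   ~{estimate_vram_gb(7):.1f} GB")    # ~4.2 GB, fits an 8GB card
print(f"14B at Q4:  ~{estimate_vram_gb(14):.1f} GB")   # ~8.4 GB, wants 12GB
print(f"32B at Q4:  ~{estimate_vram_gb(32):.1f} GB")   # ~19.2 GB, wants 24GB

These rough numbers line up with the VRAM table earlier in the guide.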

Advanced: llama.cpp Direct

For maximum control and customization.

# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download a model (GGUF format)
# From HuggingFace: https://huggingface.co/models?search=gguf

# Run inference
./build/bin/llama-cli \
  -m models/llama-4-scout-q4_k_m.gguf \
  -p "Explain RAG in one paragraph" \
  -n 256 \
  --temp 0.7
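
If you'd rather drive the same GGUF files from Python, the llama-cpp-python bindings wrap the same engine. A minimal sketch, assuming you've installed llama-cpp-python (with GPU support enabled at build time) and that the model path matches a file you actually downloaded:

# Load a GGUF model through the llama-cpp-python bindings
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-4-scout-q4_k_m.gguf",  # example path; point at your own GGUF file
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU when one is available
    n_ctx=4096,       # context window
)

output = llm("Explain RAG in one paragraph", max_tokens=256, temperature=0.7)
print(output["choices"][0]["text"])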

Production Considerations

Memory Management

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Ollama: control GPU offload with the num_gpu parameter
# (interactive session: /set parameter num_gpu 35,
#  or via the API "options" field shown below)

# Fewer layers = less VRAM used, but slower
# More layers = more VRAM used, faster
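
The same knob can be set per request through the REST API's options field. A hedged sketch (num_gpu is Ollama's parameter for how many layers go to the GPU; 20 is just an example value):

# Limit GPU offload for a single request via the Ollama API
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    "model": "llama4:scout",
    "prompt": "Explain RAG in one paragraph",
    "stream": False,
    "options": {"num_gpu": 20}  # layers offloaded to the GPU (example value)
})
print(response.json()['response'])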

Concurrent Requests

Ollama queues requests by default. For concurrent handling:

# Set OLLAMA_NUM_PARALLEL
export OLLAMA_NUM_PARALLEL=4  # Up to 4 concurrent requests

# Or in systemd service:
# /etc/systemd/system/ollama.service
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
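
To check that parallelism is actually kicking in, you can fire a handful of requests at once from Python; with OLLAMA_NUM_PARALLEL set they should overlap instead of strictly queueing. A quick sketch (prompts and thread count are arbitrary):

# Send several requests concurrently to the local Ollama server
import requests
from concurrent.futures import ThreadPoolExecutor

def ask(prompt):
    r = requests.post('http://localhost:11434/api/generate', json={
        "model": "llama4:scout", "prompt": prompt, "stream": False
    })
    return r.json()['response']

prompts = ["Define RAG.", "Define LoRA.", "Define quantization.", "Define KV cache."]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")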

Persistent Server

# Linux: Run as systemd service
sudo systemctl enable ollama
sudo systemctl start ollama

# macOS: Run at login
# Ollama auto-starts after first run

# Windows: Runs as startup app automatically

Troubleshooting

"Out of memory" errors

  1. Use smaller quantization (Q4 instead of Q8)
  2. Reduce GPU offload: lower num_gpu via /set parameter or the API options field
  3. Try a smaller model (7B instead of 14B)
  4. Check for other GPU processes (browser, video editor)

Slow inference

  1. If inference is falling back to the CPU, offload more layers to the GPU (num_gpu in Ollama, --gpu-layers/-ngl in llama.cpp)
  2. Close other GPU-heavy applications
  3. Try smaller quantization
  4. Check thermal throttling: nvidia-smi -q -d TEMPERATURE

Download failures

  1. Check internet connection
  2. Retry the pull; if you're behind a proxy or firewall, set HTTPS_PROXY for the Ollama service
  3. Download manually from HuggingFace and import

Cost Analysis

When does local make sense vs cloud?

| Usage | Cloud Cost (Monthly) | Local Break-even |
|---|---|---|
| 100K tokens/day | $30-50 | 6-12 months (RTX 4070) |
| 500K tokens/day | $150-250 | 2-4 months (RTX 4070) |
| 1M tokens/day | $300-500 | 1-2 months (RTX 4090) |
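
The break-even math behind this table is simple division: hardware cost over avoided monthly API spend (electricity shaves a bit off the savings, but it's small next to a real API bill). A rough sketch with assumed prices, purely for illustration:

# Rough break-even: months until the GPU pays for itself in avoided API fees
def breakeven_months(gpu_cost_usd, monthly_api_cost_usd):
    return gpu_cost_usd / monthly_api_cost_usd

# Assumed prices, not quotes
print(f"{breakeven_months(600, 50):.0f} months")   # RTX 4070-class card vs $50/mo API usage -> ~12 months
print(f"{breakeven_months(600, 200):.0f} months")  # same card vs $200/mo API usage -> ~3 months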

Local makes sense if:

  • You process 100K+ tokens/day regularly
  • Privacy is non-negotiable
  • You need offline capability
  • You want to fine-tune models

Cloud is better if:

  • Usage is sporadic or low volume
  • You need the absolute best model quality
  • You don't want to manage hardware

Key Takeaways

  1. Start with Ollama: One-line install, works everywhere
  2. 8GB VRAM minimum: For usable 7B models; 12GB+ for 14B
  3. Llama 4 Scout 17B: Best quality/size balance for most users
  4. Q4_K_M quantization: Good compromise for size/quality
  5. Mac M-series works: Unified memory means RAM = VRAM
  6. Local pays off if you push 100K+ tokens/day or need privacy

Running LLMs locally isn't just for researchers anymore. With tools like Ollama and models like Llama 4 Scout, anyone with a decent GPU or Mac can have a capable AI assistant running entirely on their machine.