
Local AI is a systems question, not a checkbox

2026-03-14 • inspired by today’s Hacker News discussion: “Can I run AI locally?”

Figure: Local AI feasibility by hardware tier, showing VRAM, latency, and privacy trade-offs.

The best answer to “can I run AI locally?” is usually: yes, but define “run” first. If “run” means private drafts, coding help, and low-latency chat on your own hardware, local models are already very viable. If it means frontier-level multi-agent workflows at high throughput, you’re in infrastructure land.

The triangle that decides everything
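The three corners of that triangle are privacy posture, latency budget, and hardware budget: the same three fields the decision sketch below in this post uses. As a minimal illustration (thresholds are made up for the sketch, not benchmarks), routing a request from those three constraints might look like:

```python
def route(privacy: str, latency_sla_ms: int, vram_gb: int) -> str:
    """Pick a deployment target from the three constraints.

    privacy is one of "strict_local", "hybrid", "cloud_ok";
    thresholds here are illustrative only.
    """
    if privacy == "strict_local":
        return "local"   # data never leaves the machine, whatever the cost
    if vram_gb >= 16 and latency_sla_ms <= 1000:
        return "local"   # enough headroom for a small quantized model
    if privacy == "hybrid":
        return "hybrid"  # local drafting, cloud for the heavy lifting
    return "cloud"
```

For example, `route("strict_local", 3000, 8)` resolves to `"local"` regardless of hardware, while a relaxed privacy posture on the same 8 GB machine falls through to `"cloud"`.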

Why this matters operationally

Teams often over-focus on parameter count and under-focus on systems design. In practice, quality comes from a stack: retrieval quality, prompt hygiene, caching, and request shaping — not just one huge model.
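Caching and request shaping in particular are cheap systems wins. A hypothetical sketch (the `CachedGenerator` wrapper and its normalization rule are assumptions for illustration, not a real library API): normalize prompts before hashing so trivially different requests share one cache entry, and wrap any generate callable behind it.

```python
import hashlib

def _key(system_prompt: str, user_prompt: str) -> str:
    # Normalize before hashing so trivially different requests
    # (trailing whitespace, casing) collapse onto one cache entry.
    text = system_prompt.strip() + "\n" + user_prompt.strip().lower()
    return hashlib.sha256(text.encode()).hexdigest()

class CachedGenerator:
    """Wrap any generate(prompt) callable with an exact-match cache."""

    def __init__(self, generate):
        self.generate = generate
        self.cache = {}
        self.hits = 0

    def __call__(self, system_prompt: str, user_prompt: str) -> str:
        k = _key(system_prompt, user_prompt)
        if k in self.cache:
            self.hits += 1
            return self.cache[k]
        out = self.generate(system_prompt + "\n" + user_prompt)
        self.cache[k] = out
        return out
```

The design choice worth copying is that the cache key is computed from shaped input, not raw input, which is where most of the hit rate comes from in practice.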

# The decision inputs, as a concrete example (values are illustrative):
decision = {
    "privacy": "strict_local",   # or "hybrid" / "cloud_ok"
    "latency_sla_ms": 1000,      # acceptable range is roughly 300-5000
    "hardware": {"vram_gb": 16, "ram_gb": 64},
}

# Hypothetical workflow helpers, named for what they must do:
model = pick_smallest_model_that_passes_eval(decision)
model = apply_quantization(model)
throughput = measure_real_tokens_per_sec(model, decision["hardware"])
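To make the selection step concrete, here is a runnable sketch. The model catalog and eval scores are invented for illustration, and the VRAM rule of thumb (memory scales with bits per weight against a 16-bit baseline, ignoring activations and KV cache) is a deliberate simplification.

```python
# Illustrative catalog: (name, params_billions, eval_score, fp16_vram_gb).
# All numbers are made up for this sketch, not benchmarks.
CATALOG = [
    ("tiny-1b", 1, 0.55, 2),
    ("small-7b", 7, 0.72, 14),
    ("mid-13b", 13, 0.78, 26),
    ("big-70b", 70, 0.86, 140),
]

def quantized_vram_gb(fp16_vram_gb: float, bits: int = 4) -> float:
    # Rough rule of thumb: weight memory scales with bits per weight,
    # relative to the 16-bit baseline. Ignores activation/KV-cache overhead.
    return fp16_vram_gb * bits / 16

def pick_smallest_passing(min_score: float, vram_budget_gb: float,
                          bits: int = 4):
    """Smallest model that passes the eval bar and fits after quantization."""
    for name, params, score, vram in sorted(CATALOG, key=lambda m: m[1]):
        if score >= min_score and quantized_vram_gb(vram, bits) <= vram_budget_gb:
            return name
    return None  # nothing local qualifies: fall back to hybrid or cloud
```

With these invented numbers, an 8 GB card and a 0.70 eval bar lands on `"small-7b"` at 4-bit, while a 0.85 bar returns `None` and pushes you toward the hybrid path. The point is the shape of the loop: eval threshold first, memory fit second, size as the tiebreaker.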

The nerdy takeaway: local AI is less about ideology and more about constraint engineering. Start from requirements, not model hype, and you’ll usually land on a robust hybrid architecture.

Source inspiration: Hacker News front page discussion · canirun.ai