# Running AI Models Locally: A Complete Guide

## TL;DR
Run production-quality AI models locally using Ollama for simple deployment or llama.cpp for maximum control — quantized models make this feasible on consumer hardware.
Running large language models locally eliminates API costs, removes data privacy concerns, and gives you full control over inference. With quantized models and tools like Ollama, this is now practical on consumer hardware.
## Why Run Models Locally
Cloud APIs charge per token and require sending your data to third-party servers. Local inference solves both problems. You pay nothing per request after the initial hardware investment, and your data never leaves your machine.
Local inference is not about replacing cloud APIs — it is about having the right tool for workloads where privacy, cost, or latency constraints make cloud inference impractical.
The tradeoff is capability. Local models are smaller than frontier cloud models. But for many tasks — code generation, summarization, classification, extraction — a well-chosen local model performs more than adequately.
## Hardware Requirements
The primary constraint for local inference is RAM, not CPU or GPU. Quantized models load entirely into memory, so your available RAM determines the largest model you can run.
| Hardware | RAM | Max Model Size | Example Models |
|---|---|---|---|
| Laptop (M-series Mac) | 16 GB | 7B–13B (Q4) | Llama 3.1 8B, Mistral 7B |
| Workstation | 32 GB | 13B–34B (Q4) | CodeLlama 34B, Mixtral 8x7B |
| Server (GPU) | 48 GB VRAM | 70B (Q4) | Llama 3.1 70B, Qwen 72B |
| Multi-GPU | 96+ GB VRAM | 70B+ (FP16) | Full-precision large models |
Apple Silicon Macs are particularly effective for local inference because their unified memory architecture allows the GPU to access system RAM directly, avoiding the VRAM bottleneck that limits NVIDIA consumer GPUs.
## Understanding Quantization
Quantization reduces model weights from 16-bit floating point to lower precision formats. A 7B parameter model at FP16 requires ~14 GB of RAM. At Q4 (4-bit quantization), the same model needs ~4 GB — a 3.5x reduction with minimal quality loss.
The GGUF format, developed by the llama.cpp project, is the standard for quantized models. Common quantization levels include:
- Q8_0 — 8-bit, near-original quality, ~50% size reduction
- Q5_K_M — 5-bit, excellent quality, ~65% size reduction
- Q4_K_M — 4-bit, good quality, ~75% size reduction
- Q2_K — 2-bit, noticeable degradation, ~85% size reduction
For most use cases, Q4_K_M offers the best balance of quality and efficiency.
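The memory arithmetic above can be sketched as a small estimator. The bits-per-weight figures below are rough assumptions (quantized formats store per-block scales alongside the weights, so effective bits-per-weight is slightly above the nominal bit width), not exact GGUF file sizes:

```typescript
// Rough memory estimate: parameters × effective bits-per-weight / 8.
// These bits-per-weight values are illustrative assumptions; real GGUF
// sizes vary with the mix of tensor types in each quantization level.
const BITS_PER_WEIGHT: Record<string, number> = {
  FP16: 16,
  Q8_0: 8.5,   // 8-bit weights plus per-block scales
  Q5_K_M: 5.5,
  Q4_K_M: 4.5,
  Q2_K: 2.6,
};

function estimateModelGB(paramsBillions: number, quant: string): number {
  const bits = BITS_PER_WEIGHT[quant];
  if (bits === undefined) throw new Error(`Unknown quantization: ${quant}`);
  // params (billions) × 1e9 weights × bits / 8 bits-per-byte, in GB
  return (paramsBillions * 1e9 * bits) / 8 / 1e9;
}

// A 7B model: ~14 GB at FP16, ~4 GB at Q4_K_M
console.log(estimateModelGB(7, "FP16").toFixed(1));   // "14.0"
console.log(estimateModelGB(7, "Q4_K_M").toFixed(1)); // "3.9"
```

This matches the figures quoted above: a 7B model drops from roughly 14 GB at FP16 to roughly 4 GB at Q4.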
## Getting Started with Ollama
Ollama wraps llama.cpp in a simple interface with model management, an HTTP API, and one-command setup.
### Installation and First Model

```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b

# List downloaded models
ollama list

# Pull a specific quantization (available tags vary by model)
ollama pull llama3.1:8b-q4_K_M
```
### API Integration
Ollama serves an HTTP API on `localhost:11434`. Its native chat endpoint is `/api/chat`, and an OpenAI-compatible endpoint is also exposed under `/v1`. You can integrate the native API directly into your applications:
```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface OllamaResponse {
  message: ChatMessage;
  done: boolean;
  total_duration: number;
}

async function chat(messages: ChatMessage[]): Promise<string> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.1:8b",
      messages,
      stream: false,
    }),
  });

  const data: OllamaResponse = await response.json();
  return data.message.content;
}

// Usage
const result = await chat([
  { role: "system", content: "You are a code reviewer." },
  { role: "user", content: "Review this function for bugs..." },
]);
```
Because Ollama also serves an OpenAI-compatible endpoint at `/v1/chat/completions`, existing codebases that use the `openai` SDK can switch to local inference by changing only the base URL.
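As a sketch of that compatibility, here is a minimal client against the `/v1/chat/completions` endpoint using plain `fetch`. It assumes Ollama is running locally with `llama3.1:8b` pulled; the function is defined but only invoked when a server is available:

```typescript
// Minimal client for Ollama's OpenAI-compatible endpoint.
// Assumes a local Ollama instance with the named model already pulled.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Pure helper: builds the fetch options for an OpenAI-style chat request.
function buildRequest(model: string, messages: Message[]) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages }),
  };
}

async function chatCompletion(model: string, messages: Message[]): Promise<string> {
  const res = await fetch(
    "http://localhost:11434/v1/chat/completions",
    buildRequest(model, messages),
  );
  const data = await res.json();
  // OpenAI-style response shape: choices[0].message.content
  return data.choices[0].message.content;
}

// Usage (requires a running Ollama instance):
// const answer = await chatCompletion("llama3.1:8b", [
//   { role: "user", content: "Summarize the CAP theorem." },
// ]);
```

The same request shape works against any OpenAI-compatible server, which is what makes swapping between local and cloud inference a one-line change.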
## Using llama.cpp Directly
For maximum control over inference parameters, memory mapping, and batch processing, use llama.cpp directly.
```bash
# Clone and build (llama.cpp now uses CMake; the old Makefile build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run inference
./build/bin/llama-cli \
  -m models/llama-3.1-8b-q4_K_M.gguf \
  --ctx-size 4096 \
  --threads 8 \
  --n-gpu-layers 35 \
  -p "Explain the CAP theorem in distributed systems:"
```
llama.cpp supports GPU offloading via `--n-gpu-layers`. On Apple Silicon, set this to a high value to offload most layers to the Metal GPU. On NVIDIA systems, CUDA acceleration provides significant speedups.
### Server Mode
llama.cpp includes a built-in server that provides an OpenAI-compatible endpoint:
```bash
./build/bin/llama-server \
  -m models/llama-3.1-8b-q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 35
```
This gives you fine-grained control over context size, thread count, batch size, and GPU offloading that Ollama abstracts away.
## Choosing the Right Model
Model selection depends on your task and hardware. General guidelines:
- Code generation — CodeLlama 13B or DeepSeek Coder 6.7B
- General chat — Llama 3.1 8B or Mistral 7B
- Instruction following — Llama 3.1 8B Instruct
- Multilingual — Qwen 2.5 7B
Start with the largest model your hardware can run at Q4_K_M quantization. If inference speed is too slow, drop to a smaller model rather than a lower quantization level — model architecture matters more than precision for output quality.
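These guidelines can be encoded as a small selection helper. The model names, parameter counts, and the memory heuristic below are illustrative assumptions for the sketch (roughly 0.56 GB per billion parameters at Q4, plus headroom for the OS and context), not benchmarks:

```typescript
// Illustrative model picker based on the guidelines above.
// Parameter counts and the RAM heuristic are rough assumptions.
type Task = "code" | "chat" | "instruct" | "multilingual";

const DEFAULTS: Record<Task, { model: string; paramsB: number }> = {
  code: { model: "codellama:13b", paramsB: 13 },
  chat: { model: "llama3.1:8b", paramsB: 8 },
  instruct: { model: "llama3.1:8b-instruct", paramsB: 8 },
  multilingual: { model: "qwen2.5:7b", paramsB: 7 },
};

// Returns the default model for the task if it fits in RAM at Q4, else null.
function pickModel(task: Task, ramGB: number): string | null {
  const { model, paramsB } = DEFAULTS[task];
  const neededGB = paramsB * 0.56 + 4; // Q4 weights + OS/context headroom
  return neededGB <= ramGB ? model : null;
}

console.log(pickModel("chat", 16)); // "llama3.1:8b" — fits on a 16 GB laptop
console.log(pickModel("code", 8));  // null — 13B at Q4 needs more headroom
```

A `null` result is the cue to drop to a smaller model in the same family, per the advice above, rather than to a lower quantization level.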
## FAQ
### Can I run LLMs on my laptop?
Yes. Quantized models in GGUF format run efficiently on consumer hardware. A MacBook with 16 GB of RAM can run 7B–13B parameter models at usable speeds, typically generating 10–30 tokens per second depending on the model and quantization level. Even older Intel laptops with 16 GB RAM can run 7B models, though at slower speeds.
### What is model quantization?
Quantization reduces model precision from 16-bit floating point to 4-bit or 8-bit integers, dramatically reducing memory requirements while maintaining most of the model’s capability. The process maps continuous weight values to a smaller set of discrete values. Modern quantization techniques like GPTQ and AWQ use calibration data to minimize quality loss, making 4-bit models nearly indistinguishable from full-precision versions on many benchmarks.
### Is Ollama free to use?
Yes. Ollama is open-source and free. It provides a simple CLI and API for running open-source models locally without complex setup. There are no usage fees, token limits, or telemetry. You download models once and run them entirely on your own hardware with no internet connection required after the initial download.