# Running AI Models Locally: A Complete Guide

## TL;DR
Run production-quality AI models locally using Ollama for simple deployment or llama.cpp for maximum control — quantized models make this feasible on consumer hardware.
Running large language models locally eliminates API costs, removes data privacy concerns, and gives you full control over inference. With quantized models and tools like Ollama, this is now practical on consumer hardware.
## Why Run Models Locally
Cloud APIs charge per token and require sending your data to third-party servers. Local inference solves both problems. You pay nothing per request after the initial hardware investment, and your data never leaves your machine.
Local inference is not about replacing cloud APIs — it is about having the right tool for workloads where privacy, cost, or latency constraints make cloud inference impractical.
The tradeoff is capability. Local models are smaller than frontier cloud models. But for many tasks — code generation, summarization, classification, extraction — a well-chosen local model performs more than adequately.
## Hardware Requirements
The primary constraint for local inference is RAM, not CPU or GPU. Quantized models load entirely into memory, so your available RAM determines the largest model you can run.
| Hardware | RAM | Max Model Size | Example Models |
|---|---|---|---|
| Laptop (M-series Mac) | 16 GB | 7B–13B (Q4) | Llama 3.1 8B, Mistral 7B |
| Workstation | 32 GB | 13B–34B (Q4) | CodeLlama 34B, Mixtral 8x7B |
| Server (GPU) | 48 GB VRAM | 70B (Q4) | Llama 3.1 70B, Qwen 72B |
| Multi-GPU | 96+ GB VRAM | 70B+ (FP16) | Full-precision large models |
Apple Silicon Macs are particularly effective for local inference because their unified memory architecture allows the GPU to access system RAM directly, avoiding the VRAM bottleneck that limits NVIDIA consumer GPUs.
## Understanding Quantization
Quantization reduces model weights from 16-bit floating point to lower precision formats. A 7B parameter model at FP16 requires ~14 GB of RAM. At Q4 (4-bit quantization), the same model needs ~4 GB — a 3.5x reduction with minimal quality loss.
The GGUF format, developed by the llama.cpp project, is the standard for quantized models. Common quantization levels include:
- Q8_0 — 8-bit, near-original quality, ~50% size reduction
- Q5_K_M — 5-bit, excellent quality, ~65% size reduction
- Q4_K_M — 4-bit, good quality, ~75% size reduction
- Q2_K — 2-bit, noticeable degradation, ~85% size reduction
For most use cases, Q4_K_M offers the best balance of quality and efficiency.
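The memory arithmetic above can be sketched as a small estimator. The bits-per-weight figures below are rough assumptions (quantized formats store per-block scales alongside the weights, so effective bits-per-weight is slightly above the nominal bit width), not exact GGUF file sizes:

```typescript
// Rough memory estimate: parameters × effective bits-per-weight / 8.
// These bits-per-weight values are illustrative assumptions; real GGUF
// sizes vary with the mix of tensor types in each quantization level.
const BITS_PER_WEIGHT: Record<string, number> = {
  FP16: 16,
  Q8_0: 8.5,   // 8-bit weights plus per-block scales
  Q5_K_M: 5.5,
  Q4_K_M: 4.5,
  Q2_K: 2.6,
};

function estimateModelGB(paramsBillions: number, quant: string): number {
  const bits = BITS_PER_WEIGHT[quant];
  if (bits === undefined) throw new Error(`Unknown quantization: ${quant}`);
  // params (billions) × 1e9 weights × bits / 8 bits-per-byte, in GB
  return (paramsBillions * 1e9 * bits) / 8 / 1e9;
}

// A 7B model: ~14 GB at FP16, ~4 GB at Q4_K_M
console.log(estimateModelGB(7, "FP16").toFixed(1));   // "14.0"
console.log(estimateModelGB(7, "Q4_K_M").toFixed(1)); // "3.9"
```

This matches the figures quoted above: a 7B model drops from roughly 14 GB at FP16 to roughly 4 GB at Q4.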
## Getting Started with Ollama
Ollama wraps llama.cpp in a simple interface with model management, an HTTP API, and one-command setup.
### Installation and First Model

```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b

# List downloaded models
ollama list

# Pull a specific quantization (available tags vary by model)
ollama pull llama3.1:8b-q4_K_M
```
### API Integration
Ollama serves an HTTP API on `localhost:11434`. Its native chat endpoint is `/api/chat`, and an OpenAI-compatible endpoint is also exposed under `/v1`. You can integrate the native API directly into your applications:
```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface OllamaResponse {
  message: ChatMessage;
  done: boolean;
  total_duration: number;
}

async function chat(messages: ChatMessage[]): Promise<string> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.1:8b",
      messages,
      stream: false,
    }),
  });

  const data: OllamaResponse = await response.json();
  return data.message.content;
}

// Usage
const result = await chat([
  { role: "system", content: "You are a code reviewer." },
  { role: "user", content: "Review this function for bugs..." },
]);
```
Because Ollama also serves an OpenAI-compatible endpoint at `/v1/chat/completions`, existing codebases that use the `openai` SDK can switch to local inference by changing only the base URL.
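As a sketch of that compatibility, here is a minimal client against the `/v1/chat/completions` endpoint using plain `fetch`. It assumes Ollama is running locally with `llama3.1:8b` pulled; the function is defined but only invoked when a server is available:

```typescript
// Minimal client for Ollama's OpenAI-compatible endpoint.
// Assumes a local Ollama instance with the named model already pulled.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Pure helper: builds the fetch options for an OpenAI-style chat request.
function buildRequest(model: string, messages: Message[]) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages }),
  };
}

async function chatCompletion(model: string, messages: Message[]): Promise<string> {
  const res = await fetch(
    "http://localhost:11434/v1/chat/completions",
    buildRequest(model, messages),
  );
  const data = await res.json();
  // OpenAI-style response shape: choices[0].message.content
  return data.choices[0].message.content;
}

// Usage (requires a running Ollama instance):
// const answer = await chatCompletion("llama3.1:8b", [
//   { role: "user", content: "Summarize the CAP theorem." },
// ]);
```

The same request shape works against any OpenAI-compatible server, which is what makes swapping between local and cloud inference a one-line change.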
## Using llama.cpp Directly
For maximum control over inference parameters, memory mapping, and batch processing, use llama.cpp directly.
```bash
# Clone and build (llama.cpp now uses CMake; the old Makefile build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run inference
./build/bin/llama-cli \
  -m models/llama-3.1-8b-q4_K_M.gguf \
  --ctx-size 4096 \
  --threads 8 \
  --n-gpu-layers 35 \
  -p "Explain the CAP theorem in distributed systems:"
```
llama.cpp supports GPU offloading via `--n-gpu-layers`. On Apple Silicon, set this to a high value to offload most layers to the Metal GPU. On NVIDIA systems, CUDA acceleration provides significant speedups.
### Server Mode
llama.cpp includes a built-in server that provides an OpenAI-compatible endpoint:
```bash
./build/bin/llama-server \
  -m models/llama-3.1-8b-q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 35
```
This gives you fine-grained control over context size, thread count, batch size, and GPU offloading that Ollama abstracts away.
## Choosing the Right Model
Model selection depends on your task and hardware. General guidelines:
- Code generation — CodeLlama 13B or DeepSeek Coder 6.7B
- General chat — Llama 3.1 8B or Mistral 7B
- Instruction following — Llama 3.1 8B Instruct
- Multilingual — Qwen 2.5 7B
Start with the largest model your hardware can run at Q4_K_M quantization. If inference speed is too slow, drop to a smaller model rather than a lower quantization level — model architecture matters more than precision for output quality.
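These guidelines can be encoded as a small selection helper. The model names, parameter counts, and the memory heuristic below are illustrative assumptions for the sketch (roughly 0.56 GB per billion parameters at Q4, plus headroom for the OS and context), not benchmarks:

```typescript
// Illustrative model picker based on the guidelines above.
// Parameter counts and the RAM heuristic are rough assumptions.
type Task = "code" | "chat" | "instruct" | "multilingual";

const DEFAULTS: Record<Task, { model: string; paramsB: number }> = {
  code: { model: "codellama:13b", paramsB: 13 },
  chat: { model: "llama3.1:8b", paramsB: 8 },
  instruct: { model: "llama3.1:8b-instruct", paramsB: 8 },
  multilingual: { model: "qwen2.5:7b", paramsB: 7 },
};

// Returns the default model for the task if it fits in RAM at Q4, else null.
function pickModel(task: Task, ramGB: number): string | null {
  const { model, paramsB } = DEFAULTS[task];
  const neededGB = paramsB * 0.56 + 4; // Q4 weights + OS/context headroom
  return neededGB <= ramGB ? model : null;
}

console.log(pickModel("chat", 16)); // "llama3.1:8b" — fits on a 16 GB laptop
console.log(pickModel("code", 8));  // null — 13B at Q4 needs more headroom
```

A `null` result is the cue to drop to a smaller model in the same family, per the advice above, rather than to a lower quantization level.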
## FAQ
### Can I run LLMs on my laptop?
Yes. Quantized models in GGUF format run efficiently on consumer hardware. A MacBook with 16 GB of RAM can run 7B–13B parameter models at usable speeds, typically generating 10–30 tokens per second depending on the model and quantization level. Even older Intel laptops with 16 GB RAM can run 7B models, though at slower speeds.
### What is model quantization?
Quantization reduces model precision from 16-bit floating point to 4-bit or 8-bit integers, dramatically reducing memory requirements while maintaining most of the model’s capability. The process maps continuous weight values to a smaller set of discrete values. Modern quantization techniques like GPTQ and AWQ use calibration data to minimize quality loss, making 4-bit models nearly indistinguishable from full-precision versions on many benchmarks.
### Is Ollama free to use?
Yes. Ollama is open-source and free. It provides a simple CLI and API for running open-source models locally without complex setup. There are no usage fees, token limits, or telemetry. You download models once and run them entirely on your own hardware with no internet connection required after the initial download.