Local LLMs · a tour
01 / 10
Issue 01 · A nine-minute tour

Running LLMs
on your own
machine.

Three tools. One open model format. Zero cloud. A practical guide to putting a capable AI on your laptop.

GPT4All · LM Studio · Ollama
02
The case for local

Why bother running
models locally?

01

Privacy

Your data never leaves the device. Nothing is logged by OpenAI, Anthropic, or anyone else.

02

Cost

Once the model is downloaded, inference is free. No per-token fees, no monthly bills.

03

Offline

Works on a plane, in a basement, or on a laptop with no internet. The model lives on your disk.

04

Control

Swap models, tune prompts, build private assistants that know your work. You're the platform.

The tradeoff: local models are slower than frontier models and not as smart. But for a lot of work — summarizing, drafting, Q&A over your notes — they're plenty.
03
The cast of characters

Three tools dominate
the local-LLM space.

Tool 01 · Easiest

GPT4All

For beginners & non-technical users

Looks like a chat app. Hides the complexity. Great first step if you've never done this before.

Friendly desktop app
gpt4all.io
Tool 02 · Explorer

LM Studio

For model browsers & explorers

Feature-rich. Beautiful GUI. Built-in Hugging Face browser. Every knob exposed.

Rich desktop GUI
lmstudio.ai
Tool 03 · Developer

Ollama

For developers & scripters

A tiny command-line tool plus a minimal Mac app. Dead-simple API that speaks OpenAI.

CLI + simple app
ollama.com
All free · All run GGUF models · All can share each other's models
04
The most important question

Choose by your
device.

Apple Silicon — M1 to M4

Any of the three. LM Studio edges ahead with MLX support — roughly 30% faster on Apple chips.

Intel Mac / older laptop

Reach for GPT4All or Ollama. LM Studio works, but its MLX advantage disappears.

Windows / Linux

All three work. Ollama leads for headless server use, LM Studio for desktop.

RAM to model-size heuristic
8 GB · stick to 3B models (Phi, Gemma 2B/3B)
16 GB · 7–8B models: Llama 3.1 8B, Qwen 7B, Mistral 7B (the sweet spot)
32 GB · 13B comfortable; 70B quantized if you're patient
64 GB+ · 70B models at good quality
More parameters = smarter but slower. You don't need a 70B model to summarize your notes.
What about phones? On-device LLMs on iOS & Android
05
Tool 01 · The softest landing

GPT4All — the
friendliest front door.

GPT4All homepage

The tool I'd hand to a non-technical friend. Install, pick a model from a short curated list, start chatting. No terminal. No decision paralysis.

  • Looks like a normal chat app — no CLI, no config files
  • Built-in LocalDocs: drop a folder, it becomes searchable context
  • Curated model list — no decision paralysis
  • Cross-platform, free, open source
  • Familiar thread-style chat UX, like ChatGPT
LocalDocs is the killer feature. Private RAG over your own files with zero setup. You can always graduate to LM Studio or Ollama later.
Deep dive: LocalDocs · setup, RAG diagram, tips
06
Tool 02 · Power-user toolkit

LM Studio — everything
surfaced.

LM Studio homepage

Where GPT4All hides complexity, LM Studio surfaces it. Every model parameter is a visible knob.

  • Built-in model discovery — browse Hugging Face without leaving the app
  • MLX backend on Apple Silicon (noticeably faster)
  • Per-chat system prompt, temperature, context length, quantization
  • Local server mode with OpenAI-compatible API
  • lms CLI for scripting workflows
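
The server mode is what turns LM Studio into a backend for your own code. A minimal sketch, assuming you've loaded a model and started the local server (LM Studio defaults to port 1234; the model name below is a placeholder — use the identifier the app shows):

from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key can be any non-empty string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder — match the model identifier shown in LM Studio
    messages=[{"role": "user", "content": "Summarize this paragraph in one sentence."}],
)
print(response.choices[0].message.content)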
The tradeoff: more surface area means more to learn. If you're brand new, it can feel overwhelming.
07
Tool 03 · Developer foundation

Ollama — the
foundation layer.

# one-time install
$ brew install ollama

# pull a model
$ ollama pull llama3.1:8b

# run it
$ ollama run llama3.1:8b
Inside the REPL
$ ollama run llama3.1:8b
>>> Write a haiku about Monday morning.
Alarm clock lying,
coffee not strong enough yet —
the week stares me down.
>>> /?     # list slash commands
>>> /bye   # exit

One daemon, many clients

The real magic isn't the CLI — it's that every other chat app in the ecosystem can talk to the same Ollama instance.

Open WebUI
Chatbox
Msty
Page Assist
Obsidian plugins
your own scripts
Deep dive: API cookbook · curl, Python, streaming, tools
  • Runs as a background daemon — no window to babysit
  • Exposes an OpenAI-compatible API at localhost:11434
  • Any app that speaks ChatGPT can instead speak to your local Ollama
08
The Ollama power move

Bake personas
into the model itself.

The Ollama Mac app doesn't have a system-prompt field. At first that seems limiting — until you realize you can bake the personality straight into a model file and it becomes portable across every client.

1. Write the Modelfile

FROM gemma3:4b
SYSTEM """You are Steve Jobs. Direct. Simplicity over features.
Ask piercing questions. Demand insanely great."""

2. Build & run it

$ ollama create steve-jobs -f Modelfile
$ ollama run steve-jobs
Modelfile → ollama create → shows in every client's model dropdown → portable persona
Deep dive: Persona gallery · 5 ready-to-copy Modelfiles + syntax reference
09
Bonus round

Obsidian: your notes
become the AI's memory.

Local LLMs get dramatically more useful when they can reach into your actual knowledge — not just their training data. If you keep notes in Obsidian, three plugins turn your vault into live context.

PLUGIN 01

Copilot for Obsidian

Chat sidebar inside your vault. Points at your local Ollama or LM Studio — no cloud calls.

PLUGIN 02

Smart Connections

Embeds every note into a local vector DB. The LLM retrieves the most relevant pages before answering.

PROTOCOL 03

Ollama MCP

Your vault becomes a live context source any MCP-aware client can query on demand.

Your notes are the brain.
The LLM is the mouth.
Deep dive: Setup walkthrough · plugins, CORS fix, starter prompts
10
Pick your starting point

Where to begin.

Decision tree

Never tried local AI? → GPT4All
Apple Silicon, want it all? → LM Studio
Developer who wants flexibility? → Ollama

You can install all three. They share models through the GGUF format — your downloads aren't locked to any one tool.

Resources

Now go make your laptop a little smarter.


Local LLMs · 2026
Deep dive · GPT4All

LocalDocs — private RAG, zero setup.

Point GPT4All at a folder. It reads, indexes, and cites your documents when it answers — all on your machine, no cloud, no account.

What it actually does

LocalDocs turns a folder on your disk into retrieval context for whatever model you're chatting with. Ask “what's in my meeting notes from last Tuesday?” and it will find the relevant passages, hand them to the LLM, and ground the answer in your actual documents. The pattern is called retrieval-augmented generation — RAG for short — and it's how every “chat with your PDFs” product works under the hood.

How RAG works (simplified)

Documents (PDFs, notes, text) → Chunks (split into passages) → Embeddings (vectors per chunk) → Retrieve (match your query) → Answer (LLM + context)
Your question pulls the three-to-five most relevant chunks from your vault. The LLM only sees those — plus the question.
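
GPT4All does all of this for you, but the pattern itself is small enough to sketch. A minimal, illustrative version using Ollama's OpenAI-compatible endpoint and nomic-embed-text — placeholder chunks, not GPT4All's actual internals:

import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# placeholder "documents" — in LocalDocs these come from your indexed folder
chunks = [
    "Tuesday meeting notes: shipped the onboarding redesign.",
    "Grocery list: eggs, coffee, basil.",
    "Q3 planning: focus on retention, not acquisition.",
]

def embed(text):
    return client.embeddings.create(model="nomic-embed-text", input=text).data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

question = "what's in my meeting notes from last Tuesday?"
q_vec = embed(question)

# retrieve the most relevant chunks, then hand only those to the LLM
ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
context = "\n".join(ranked[:2])

answer = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)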

Setup in four steps

Step 01
Install GPT4All and download a chat model

Grab the installer from gpt4all.io. Open the app, go to Models, and pick a model that fits your RAM — Llama 3.1 8B Instruct is a solid default for 16 GB machines.

Step 02
Create a LocalDocs collection

Click LocalDocs in the sidebar, then + Add Collection. Give it a name (e.g. research, company-wiki) and point it at a folder. GPT4All will watch that folder and reindex when files change.

Good folders to try: your Obsidian vault, a folder of PDFs you've been meaning to read, a dump of exported Notion pages, or a project-specific docs directory.
Step 03
Pick an embedding model

The first time you add a collection, GPT4All will download a small embedding model (Nomic's embedder, about 137 MB). This runs locally too — no API calls. Once it finishes indexing, you'll see a chunk count next to the collection name.

Step 04
Chat with it enabled

Open a new chat, toggle your collection on in the top bar, and ask a question. GPT4All will cite the source files it used at the bottom of each answer. Click a citation to open the source passage.

What works well, what struggles

Works well

  • Markdown and text notes — clean structure, easy to chunk
  • PDFs with real text (not scanned images)
  • Medium-sized collections — hundreds to low-thousands of documents
  • Focused questions with specific keywords
  • Stable content that doesn't change daily

Struggles with

  • Huge codebases — structure-aware tools do better
  • Scanned PDFs without OCR — nothing to embed
  • Vague questions like “what's important?”
  • Cross-document reasoning that needs five sources synthesized
  • Fast-changing content — reindexing has cost

Tips from the field

Privacy note: LocalDocs never sends your documents to any server. The embedder runs locally, the LLM runs locally, the vector store lives on your disk. If you need to prove that to a security team, point them at the open-source code on GitHub.
Deep dive · Obsidian

Your vault, made chat-able.

A step-by-step for wiring Obsidian Copilot to a local Ollama, with the gotchas that will otherwise cost you an afternoon.

The stack you're building

You'll connect four pieces: Obsidian (your notes), Ollama (runs the model), a chat model (answers questions), and an embedding model (makes your notes searchable by meaning, not just keywords). The glue is the Copilot plugin by Logan Yang — it supports cloud models too, but we're going fully local.

Obsidian → Copilot plugin → Ollama daemon → chat model + embedding model

Step-by-step setup

Step 01
Install Ollama and pull two models

One model for chat, one for embeddings. nomic-embed-text is a small, fast embedder that pairs well with most chat models.

# chat model
$ ollama pull llama3.1:8b

# embedding model for RAG
$ ollama pull nomic-embed-text
Step 02
Fix the CORS trap (the afternoon-saving step)

Obsidian calls Ollama from its renderer process, and Ollama will refuse the call unless you whitelist the Obsidian origin. This is the single most common reason “it just won't connect.”

Quit the Ollama app, then in a terminal:

# macOS / Linux
$ OLLAMA_ORIGINS="app://obsidian.md*" ollama serve

# Windows (PowerShell)
$env:OLLAMA_ORIGINS="app://obsidian.md*"; ollama serve

Make this permanent on macOS: launchctl setenv OLLAMA_ORIGINS "app://obsidian.md*", then restart the Ollama app. You won't have to run ollama serve manually anymore.
Step 03
Install the Copilot plugin

In Obsidian: Settings → Community plugins → turn off restricted mode if it's on → Browse → search Copilot → install and enable the one by Logan Yang.

Step 04
Wire up the chat model

Open Copilot settings. Under Add Custom Model:

  • Model name: llama3.1:8b — must match exactly what ollama list shows
  • Provider: Ollama
  • Base URL: http://localhost:11434 (default, usually auto-filled)
  • CORS: enable it in the model form

Click Verify. Green check means you're in. Set it as your default chat model at the top of settings.
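
If Verify fails, first confirm the exact model name — it has to match what Ollama reports, character for character:

$ ollama list                            # the NAME column is what Copilot expects
$ curl http://localhost:11434/api/tags   # same list as JSON, if you prefer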

Step 05
Wire up the embedding model

In Copilot's QA settings, set:

  • Embedding model: nomic-embed-text
  • Provider: Ollama

Then click Index Vault (or similar). First-time indexing takes a few minutes for large vaults. Subsequent reindexes are incremental.

Starter prompts to try

Once it's live, open the Copilot side panel and try these — each one tests a different capability:

Smart Connections — the lightweight alternative

If Copilot feels like too much apparatus, Smart Connections is a simpler plugin. It embeds your notes into a local vector DB and shows the most similar notes in a sidebar as you write — no chat, just discovery. Pair it with Copilot for chat, or use it alone for serendipity.

Troubleshooting

It won't connect

  • CORS: re-check OLLAMA_ORIGINS is set and Ollama was restarted
  • Port conflict: is anything else on 11434?
  • Firewall: try curl localhost:11434 — should return “Ollama is running”

Answers are thin

  • Context window too short — try num_ctx 32768 in the model settings
  • Wrong embedding model — small vaults do fine with default; large vaults benefit from bigger embedders
  • Chat model too small — a 3B model will be weak on synthesis; try 8B+
Tip on context: Copilot respects the num_ctx you set on the Ollama model. If you're getting truncated answers on long notes, run ollama run llama3.1:8b, then /set parameter num_ctx 32768, then /save llama3.1:8b to bake it in.
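In the terminal, that sequence looks like:

$ ollama run llama3.1:8b
>>> /set parameter num_ctx 32768
>>> /save llama3.1:8b
>>> /bye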
Deep dive · Ollama

Ollama API cookbook.

The Ollama daemon speaks two protocols: its own native API and an OpenAI-compatible one. Pick the second and most of your existing code works unchanged.

The one-line mental model

Ollama runs a local HTTP server at http://localhost:11434. Any client that can make an HTTP request can use it. If that client already speaks the OpenAI API, point its base_url at http://localhost:11434/v1 and the API key can be any non-empty string.

Quickstart — curl

The simplest way to verify everything works. This hits Ollama's native chat API:

$ curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "why is the sky blue?"}],
  "stream": false
}'

Python with the OpenAI SDK

This is the pattern I reach for first. If you already use openai, change two lines and your whole codebase runs locally.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required, but unused
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Summarize the French Revolution in 3 bullets."},
    ],
)
print(response.choices[0].message.content)

Streaming

For responsive UIs, stream the response token-by-token. Same SDK, one flag flipped:

stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "tell me a short story"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Tool / function calling

Recent Ollama versions and recent models (Llama 3.1, Qwen 2.5, Mistral) support structured tool calls. Same OpenAI-compatible shape:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "weather in Paris?"}],
    tools=tools,
)
Heads up: smaller models are unreliable at tool-calling. Stick to 7B+ and check response.choices[0].message.tool_calls — it may be None if the model decided not to call the tool.
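
A sketch of that check, using the same response object as above — the dispatch to a real get_weather function is up to you:

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    call = tool_calls[0]
    print(call.function.name)       # "get_weather"
    print(call.function.arguments)  # a JSON string, e.g. '{"city": "Paris"}'
else:
    # the model chose to answer directly instead of calling the tool
    print(response.choices[0].message.content)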

Embeddings

Build your own semantic search or RAG pipeline:

response = client.embeddings.create(
    model="nomic-embed-text",
    input="the quick brown fox",
)
vector = response.data[0].embedding  # list[float], 768 dims

Give Ollama a nicer face — Open WebUI in 3 commands

If you want a ChatGPT-style web interface for your Ollama setup, Open WebUI is the standard pick. Runs in Docker:

$ docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    --name open-webui --restart always \
    ghcr.io/open-webui/open-webui:main

Visit http://localhost:3000. Create an account (stored locally), and it auto-discovers your Ollama models.

Common pitfalls

Deep dive · Modelfiles

Modelfile gallery.

Five ready-to-use personas plus the syntax reference you'll keep coming back to. Copy, customize, share.

Modelfile syntax reference

A Modelfile is a plain text file — no extension required. These are the instructions you'll use most:
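
At minimum, the three that every persona in the gallery below relies on (other instructions exist; these cover everything here):

FROM llama3.1:8b                 # base model to build on
SYSTEM """You are ..."""         # the persona, baked in as the system prompt
PARAMETER temperature 0.7        # sampling knobs (temperature, num_ctx, ...)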

Build any of the personas below with:

$ ollama create <name> -f Modelfile
$ ollama run <name>

The gallery

Persona 01 · Product

Steve Jobs

Direct. Simplicity over features.

Good for product critiques, pitch-deck review, and “should we even build this?” questions. Will tell you your idea is junk.

FROM llama3.1:8b
SYSTEM """You are Steve Jobs. You care about one thing: insanely great products.
Rules:
- Simplicity over features. Subtract before you add.
- Ask piercing questions about who this is for and why it matters.
- Demand taste. Call out mediocrity.
- Short, declarative sentences. No hedging.
- When something is bad, say so."""
PARAMETER temperature 0.7
Persona 02 · Creative

Rick Rubin

Taste as a discipline.

Good for creative work — writing, music, visual art. Asks what you're really trying to say. Slow, patient, unafraid of silence.

FROM llama3.1:8b
SYSTEM """You are Rick Rubin, the record producer. Your job is to help the artist
find what they are really trying to say.
Rules:
- Ask more than you answer. Questions before opinions.
- Point at the feeling, not the technique.
- Name what is fake without meanness.
- Celebrate what is true.
- Short, calm sentences. No filler."""
PARAMETER temperature 0.8
Persona 03 · Research

Skeptical Research Assistant

Show your work.

Good for fact-checking, argument-sharpening, and steelmanning. Will refuse to speculate without flagging it.

FROM llama3.1:8b
SYSTEM """You are a skeptical research assistant.
Rules:
- Distinguish clearly: what is established, what is contested, what is speculation.
- Flag confidence levels: high, medium, low, unknown.
- When you do not know, say so clearly. Do not invent citations.
- Offer the strongest counterargument before giving your own view.
- Prefer primary sources. Name them when you can."""
PARAMETER temperature 0.3
Persona 04 · Engineering

Terse Code Reviewer

No preamble, no praise.

Good for real PR review. Will not say “great start!” It reads the code and ships a list.

FROM qwen2.5-coder:7b
SYSTEM """You are a code reviewer. Assume the author is senior and short on time.
Rules:
- No preamble, no praise, no summaries.
- Output a terse bulleted list of issues, grouped: Correctness, Performance, Readability, Security.
- For each issue: file:line if known, then one sentence, then suggested fix.
- If the code is fine, say LGTM and stop.
- Never invent problems to look useful."""
PARAMETER temperature 0.2
Persona 05 · Visual

Mermaid Diagram Only

Outputs diagrams, never prose.

Good for system-design chats and flowcharts. Describe what you want; it returns a valid Mermaid diagram block and nothing else.

FROM llama3.1:8b
SYSTEM """You output Mermaid diagrams, nothing else.
Rules:
- Every response is a valid Mermaid code block: three backticks, then mermaid, then the diagram, then three backticks.
- No explanatory prose before or after.
- Pick the best Mermaid type (flowchart, sequenceDiagram, stateDiagram, erDiagram, classDiagram) for the request.
- Prefer clear node labels over clever ones.
- If the request is ambiguous, make a reasonable choice and diagram that."""
PARAMETER temperature 0.4

Sharing Modelfiles

A Modelfile is just a text file — drop it in a gist, email it, commit it to a repo. The person on the other end runs ollama create with it and gets the exact same persona you did. It's the most lightweight way to share AI tools that exists: no API keys, no accounts, no SaaS dependency.

You can also publish built models to ollama.com/library with ollama push yourname/model-name if you want a distributable version.
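
Assuming you're signed in to ollama.com and have registered your key (push requires an account), the two-step version looks like:

$ ollama create yourname/steve-jobs -f Modelfile
$ ollama push yourname/steve-jobs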

Temperature cheat sheet: 0.0–0.3 for code, extraction, and facts. 0.5–0.7 for general chat. 0.8–1.2 for creative writing and brainstorming. Above 1.5 and the model starts losing coherence.
Deep dive · Mobile

AI in your pocket.

On-device LLMs aren't a laptop-only story. Modern phones run small, capable models without a network — here's what's worth installing right now.

Why on-device on a phone is interesting

Phones have shipped neural processors for years; now those chips finally have something to do: Gemma 4, Llama 3.2, Phi-3 Mini, and Qwen 2.5 all ship variants designed specifically for mobile hardware. The pitch is the same as on a laptop — privacy, offline, zero per-query cost — with one sharper edge: your phone is always with you. A plane, a subway, a remote trail, a hospital waiting room. The cloud is intermittent there. Your device is not.

Realistic expectations. Mobile LLMs are small — usually 1B to 4B parameters. They're great at summarization, short Q&A, transcription, image description, and structured tasks. They are not a replacement for Claude or GPT on complex reasoning. Think of them as a competent pocket assistant, not a research companion.

The apps worth installing

App 01 · Google

Google AI Edge Gallery

The reference implementation from the folks making Gemma.

Google's open-source showcase app for on-device AI. Download Gemma 4 E2B or E4B right inside the app. Fully offline after the first download, no account required. This is the one that feels most like a real product while also being fully transparent — source code on GitHub, the whole thing.

What you get:

  • AI Chat with Thinking Mode — watch the model's reasoning step by step
  • Ask Image — take a photo, ask questions about it (multimodal)
  • Audio Scribe — on-device transcription and translation
  • Prompt Lab — tune temperature, top-k, test single-turn prompts
  • Agent Skills — tool-using agents that can hit Wikipedia, show maps, generate QR codes
  • Benchmarks — see tokens/sec on your specific device
Platform: iOS 17+ · Android 12+ · Cost: Free · Source: Open
App Store ↗ Play Store ↗ GitHub ↗
App 02 · Community

PocketPal AI

Any GGUF model, right on your phone.

Free, open-source, cross-platform. Built on llama.cpp, so anything you'd run on a laptop with Ollama can run in PocketPal — pick from its catalog or sideload a GGUF you already have. The power-user flavor of mobile LLM.

Why people use it:

  • Enable Metal acceleration on iOS (shift work from CPU to GPU) for a dramatic speed jump
  • Turn on Flash Attention on capable devices for a further boost
  • Make “Pals” — system-prompt personas, exactly like Ollama Modelfiles but in your pocket
  • Import your own GGUF from Hugging Face
Platform: iOS · Android · Cost: Free · Source: Open
App 03 · Apple

Apollo AI

Open-source iOS chat, community-driven.

iOS-only, open-source, llama.cpp under the hood. More technical than its polish-forward competitors — it's the “let me tinker” app. If you want fine control over model parameters and you're on an iPhone or iPad, this is the one.

Platform: iOS / iPadOS · Cost: Free · Source: Open
App 04 · Apple

Haplo AI

Polished, plug-and-play, across your Apple devices.

The commercial option — slick interface, no account required, runs across iPhone, iPad, and Mac with the same library. Good pick if you want local AI that feels like a real product and you're not trying to maximize tinkering.

Platform: iOS / iPadOS / macOS · Cost: Free tier + paid · Source: Closed

What to expect, performance-wise

Works well on mobile

  • Short Q&A, text rewriting, grammar fixes
  • Summarizing a single article or email
  • Image description and OCR (with multimodal models like Gemma 3n/4)
  • Voice transcription (Audio Scribe, on-device Whisper variants)
  • Brainstorming short lists, titles, captions

Works poorly on mobile

  • Long documents — context windows are small (often 512–2k tokens)
  • Heavy reasoning — the model is 2B, not 200B
  • Coding tasks beyond snippets
  • Sustained chat — battery drain and heat become noticeable
  • Accurate factual recall on obscure topics

A pragmatic workflow

The pattern that actually sticks: AI Edge Gallery as your first install (smooth, reliable, multimodal), then add PocketPal if you want to run your own models or experiment more. Use them for the narrow set of things mobile LLMs are genuinely good at — transcription, photo Q&A, quick rewrites — and let your laptop handle everything else.

Battery warning. LLM inference is compute-heavy. Expect noticeable battery drain and phone warmth during long sessions. If your phone gets hot, pause — the OS will throttle, and you'll stop getting good tokens/sec anyway.