Three tools. One open model format. Zero cloud. A practical guide to putting a capable AI on your laptop.
Your data never leaves the device. Nothing is logged by OpenAI, Anthropic, or anyone else.
Once the model is downloaded, inference is free. No per-token fees, no monthly bills.
Works on a plane, in a basement, or on a laptop with no internet. The model lives on your disk.
Swap models, tune prompts, build private assistants that know your work. You're the platform.
Looks like a chat app. Hides the complexity. Great first step if you've never done this before.
Feature-rich. Beautiful GUI. Built-in Hugging Face browser. Every knob exposed.
A tiny command-line tool plus a minimal Mac app. Dead-simple API that speaks OpenAI.
Any of the three. LM Studio edges ahead with MLX support — roughly 30% faster on Apple chips.
Reach for GPT4All or Ollama. LM Studio works, but its MLX advantage disappears.
All three work. Ollama leads for headless server use, LM Studio for desktop.
The tool I'd hand to a non-technical friend. Install, pick a model from a short curated list, start chatting. No terminal. No decision paralysis.
Where GPT4All hides complexity, LM Studio surfaces it. Every model parameter is a visible knob.
lms CLI for scripting workflows.

The real magic isn't the CLI: it's that every other chat app in the ecosystem can talk to the same Ollama instance.
Local API server at localhost:11434.

The Ollama Mac app doesn't have a system-prompt field. At first that seems limiting, until you realize you can bake the personality straight into a Modelfile and it becomes portable across every client.
Local LLMs get dramatically more useful when they can reach into your actual knowledge, not just their training data. If you keep notes in Obsidian, three plugins turn your vault into live context.
Chat sidebar inside your vault. Points at your local Ollama or LM Studio — no cloud calls.
Embeds every note into a local vector DB. The LLM retrieves the most relevant pages before answering.
Your vault becomes a live context source any MCP-aware client can query on demand.
You can install all three. They share models through the GGUF format — your downloads aren't locked to any one tool.
Now go make your laptop a little smarter.
Point GPT4All at a folder. It reads, indexes, and cites your documents when it answers — all on your machine, no cloud, no account.
LocalDocs turns a folder on your disk into retrieval context for whatever model you're chatting with. Ask “what's in my meeting notes from last Tuesday?” and it will find the relevant passages, hand them to the LLM, and ground the answer in your actual documents. The pattern is called retrieval-augmented generation — RAG for short — and it's how every “chat with your PDFs” product works under the hood.
Grab the installer from gpt4all.io. Open the app, go to Models, and pick a model that fits your RAM — Llama 3.1 8B Instruct is a solid default for 16 GB machines.
Click LocalDocs in the sidebar, then + Add Collection. Give it a name (e.g. research, company-wiki) and point it at a folder. GPT4All will watch that folder and reindex when files change.
The first time you add a collection, GPT4All will download a small embedding model (Nomic's embedder, about 137 MB). This runs locally too — no API calls. Once it finishes indexing, you'll see a chunk count next to the collection name.
Open a new chat, toggle your collection on in the top bar, and ask a question. GPT4All will cite the source files it used at the bottom of each answer. Click a citation to open the source passage.
A step-by-step for wiring Obsidian Copilot to a local Ollama, with the gotchas that will otherwise cost you an afternoon.
You'll connect four pieces: Obsidian (your notes), Ollama (runs the model), a chat model (answers questions), and an embedding model (makes your notes searchable by meaning, not just keywords). The glue is the Copilot plugin by Logan Yang — it supports cloud models too, but we're going fully local.
One model for chat, one for embeddings. nomic-embed-text is a small, fast embedder that pairs well with most chat models.
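Two quick pulls get you there (llama3.1:8b is the chat model this guide uses later; swap in whatever fits your RAM):

```
ollama pull llama3.1:8b
ollama pull nomic-embed-text
```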
Obsidian calls Ollama from its renderer process, and Ollama will refuse the call unless you whitelist the Obsidian origin. This is the single most common reason “it just won't connect.”
Quit the Ollama app, then in a terminal:
Windows (PowerShell): $env:OLLAMA_ORIGINS="app://obsidian.md*"; ollama serve
macOS: launchctl setenv OLLAMA_ORIGINS "app://obsidian.md*", then restart the Ollama app. You won't have to run ollama serve manually anymore.

In Obsidian: Settings → Community plugins → turn off restricted mode if it's on → Browse → search Copilot → install and enable the one by Logan Yang.
Open Copilot settings. Under Add Custom Model:
- Model name: llama3.1:8b (must match exactly what ollama list shows)
- Provider: Ollama
- Base URL: http://localhost:11434 (default, usually auto-filled)

Click Verify. A green check means you're in. Set it as your default chat model at the top of settings.
In Copilot's QA settings, set:
Embedding model: nomic-embed-text.

Then click Index Vault (or similar). First-time indexing takes a few minutes for large vaults. Subsequent reindexes are incremental.
Once it's live, open the Copilot side panel and try a few prompts, each testing a different capability: summarization ("Summarize my notes on project X"), retrieval ("What did I write about Y last month?"), and synthesis across notes ("What connects note A and note B?").
If Copilot feels like too much apparatus, Smart Connections is a simpler plugin. It embeds your notes into a local vector DB and shows the most similar notes in a sidebar as you write — no chat, just discovery. Pair it with Copilot for chat, or use it alone for serendipity.
If Copilot won't connect, check that:

- OLLAMA_ORIGINS is set and Ollama was restarted.
- curl localhost:11434 returns "Ollama is running".
- You've set num_ctx 32768 in the model settings if you work with long notes; Copilot can only use as much context as the num_ctx you set on the Ollama model.

If you're getting truncated answers on long notes, run ollama run llama3.1:8b, then /set parameter num_ctx 32768, then /save llama3.1:8b to bake it in.

The Ollama daemon speaks two protocols: its own native API and an OpenAI-compatible one. Pick the second and most of your existing code works unchanged.
Ollama runs a local HTTP server at http://localhost:11434. Any client that can make an HTTP request can use it. If that client already speaks the OpenAI API, point its base_url at http://localhost:11434/v1 and the API key can be any non-empty string.
The simplest way to verify everything works. This hits Ollama's native chat API:
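A minimal sketch, assuming you've already pulled llama3.1:8b:

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{ "role": "user", "content": "Why is the sky blue?" }],
  "stream": false
}'
```

Setting "stream": false returns one JSON object instead of a stream of chunks, which is easier to eyeball.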
This is the pattern I reach for first. If you already use openai, change two lines and your whole codebase runs locally.
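A minimal sketch with the official openai Python package, again assuming llama3.1:8b is pulled. The two changed lines are base_url and api_key:

```python
from openai import OpenAI

# The only two lines that differ from cloud usage:
# point base_url at Ollama and pass any non-empty api_key.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1:8b",  # must match what `ollama list` shows
    messages=[{"role": "user", "content": "Explain GGUF in two sentences."}],
)
print(resp.choices[0].message.content)
```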
For responsive UIs, stream the response token-by-token. Same SDK, one flag flipped:
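One way that could look, reusing the client from the example above:

```python
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,  # the one flipped flag
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (e.g. the final one)
        print(delta, end="", flush=True)
print()
```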
Recent Ollama versions and recent models (Llama 3.1, Qwen 2.5, Mistral) support structured tool calls. Same OpenAI-compatible shape:
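A sketch with a single hypothetical get_weather tool; the schema is illustrative, not from the original:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
```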
Check response.choices[0].message.tool_calls; it may be None if the model decided not to call the tool.

Build your own semantic search or RAG pipeline:
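A minimal sketch of semantic search over a handful of strings, reusing the client from the chat example and assuming your Ollama version exposes the OpenAI-compatible embeddings endpoint with nomic-embed-text pulled:

```python
import numpy as np

docs = [
    "Ollama serves models over a local HTTP API.",
    "Paris is the capital of France.",
    "GGUF is a file format for quantized LLMs.",
]

# Embed the documents and the query with the same local model.
doc_emb = client.embeddings.create(model="nomic-embed-text", input=docs)
doc_vecs = np.array([d.embedding for d in doc_emb.data])

query = "How do I talk to a local model over HTTP?"
q_emb = client.embeddings.create(model="nomic-embed-text", input=[query])
q_vec = np.array(q_emb.data[0].embedding)

# Cosine similarity; the highest score is the most relevant document.
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
print(docs[int(scores.argmax())])
```

Swap the list of strings for chunks of your own notes and you have the retrieval half of a RAG pipeline.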
If you want a ChatGPT-style web interface for your Ollama setup, Open WebUI is the standard pick. Runs in Docker:
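The one-liner from Open WebUI's README looks like this (check their docs for the current flags):

```
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```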
Visit http://localhost:3000. Create an account (stored locally), and it auto-discovers your Ollama models.
Quick recap:

- Start the server: ollama serve (or open the Mac app).
- Pull a model: ollama pull <model>.
- Calling from another app? Set OLLAMA_ORIGINS to your app's origin before starting Ollama.
- Long inputs getting truncated? Raise num_ctx.

Five ready-to-use personas plus the syntax reference you'll keep coming back to. Copy, customize, share.
A Modelfile is a plain text file — no extension required. These are the instructions you'll use most:
- FROM <model>: the base model to build on (required)
- SYSTEM """...""": the system prompt, baked in
- PARAMETER temperature 0.7: 0.0 is deterministic, 1.0+ is chaotic
- PARAMETER num_ctx 32768: context window size
- PARAMETER top_p 0.9: nucleus sampling cutoff
- TEMPLATE """...""": chat template override (rarely needed)
- MESSAGE user "...", MESSAGE assistant "...": few-shot examples baked into context

Build any of the personas below with:
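(A minimal sketch; the SYSTEM text here is illustrative, not one of the five personas.)

```
# Modelfile
FROM llama3.1:8b
PARAMETER temperature 0.7
SYSTEM """You are a terse code reviewer. Answer in numbered bullet points."""
```

```
ollama create reviewer -f ./Modelfile
ollama run reviewer
```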
Good for product critiques, pitch-deck review, and “should we even build this?” questions. Will tell you your idea is junk.
Good for creative work — writing, music, visual art. Asks what you're really trying to say. Slow, patient, unafraid of silence.
Good for fact-checking, argument-sharpening, and steelmanning. Will refuse to speculate without flagging it.
Good for real PR review. Will not say “great start!” It reads the code and ships a list.
Good for system-design chats and flowcharts. Describe what you want; it returns a valid Mermaid diagram block and nothing else.
A Modelfile is just a text file — drop it in a gist, email it, commit it to a repo. The person on the other end runs ollama create with it and gets the exact same persona you did. It's the most lightweight way to share AI tools that exists: no API keys, no accounts, no SaaS dependency.
You can also publish built models to ollama.com/library with ollama push yourname/model-name if you want a distributable version.
On-device LLMs aren't a laptop-only story. Modern phones run small, capable models without a network — here's what's worth installing right now.
Phones added neural processors years before they had anything to do. Now they do: Gemma 3n, Llama 3.2, Phi-3 Mini, and Qwen 2.5 all ship variants designed specifically for mobile hardware. The pitch is the same as on a laptop (privacy, offline, zero per-query cost) with one sharper edge: your phone is always with you. A plane, a subway, a remote trail, a hospital waiting room. The cloud is intermittent there. Your device is not.
Google's open-source showcase app for on-device AI. Download Gemma 3n E2B or E4B right inside the app. Fully offline after the first download, no account required. This is the one that feels most like a real product while also being fully transparent: source code on GitHub, the whole thing.
What you get: AI Chat for conversation, Ask Image for photo Q&A, and Prompt Lab for one-shot tasks like rewriting and summarizing, all running on-device.
Free, open-source, cross-platform. Built on llama.cpp, so anything you'd run on a laptop with Ollama can run in PocketPal — pick from its catalog or sideload a GGUF you already have. The power-user flavor of mobile LLM.
Why people use it: it runs any GGUF you already have, exposes sampling parameters directly, and works the same on iOS and Android.
iOS-only, open-source, llama.cpp under the hood. More technical than its polish-forward competitors — it's the “let me tinker” app. If you want fine control over model parameters and you're on an iPhone or iPad, this is the one.
The commercial option — slick interface, no account required, runs across iPhone, iPad, and Mac with the same library. Good pick if you want local AI that feels like a real product and you're not trying to maximize tinkering.
The pattern that actually sticks: AI Edge Gallery as your first install (smooth, reliable, multimodal), then add PocketPal if you want to run your own models or experiment more. Use them for the narrow set of things mobile LLMs are genuinely good at — transcription, photo Q&A, quick rewrites — and let your laptop handle everything else.