Local LLMs · a tour
01 / 10
Issue 01 · A nine-minute tour

Running LLMs
on your own
machine.

Three tools. One open model format. Zero cloud. A practical guide to putting a capable AI on your laptop.

GPT4All · LM Studio · Ollama
02
The case for local

Why bother running
models locally?

01

Privacy

Your data never leaves the device. Nothing is logged by OpenAI, Anthropic, or anyone else.

02

Cost

Once the model is downloaded, inference is free. No per-token fees, no monthly bills.

03

Offline

Works on a plane, in a basement, or on a laptop with no internet. The model lives on your disk.

04

Control

Swap models, tune prompts, build private assistants that know your work. You're the platform.

The tradeoff: local models are slower than frontier models and not as smart. But for a lot of work — summarizing, drafting, Q&A over your notes — they're plenty.
03
The cast of characters

Three tools dominate
the local-LLM space.

Tool 01 · Easiest

GPT4All

For beginners & non-technical users

Looks like a chat app. Hides the complexity. Great first step if you've never done this before.

Friendly desktop app
gpt4all.io
Tool 02 · Explorer

LM Studio

For model browsers & explorers

Feature-rich. Beautiful GUI. Built-in Hugging Face browser. Every knob exposed.

Rich desktop GUI
lmstudio.ai
Tool 03 · Developer

Ollama

For developers & scripters

A tiny command-line tool plus a minimal Mac app. Dead-simple API that speaks OpenAI.

CLI + simple app
ollama.com
All free · All run GGUF models · All can share each other's models
04
The most important question

Choose by your
device.

Apple Silicon — M1 to M4

Any of the three. LM Studio edges ahead with MLX support — roughly 30% faster on Apple chips.

Intel Mac / older laptop

Reach for GPT4All or Ollama. LM Studio works, but its MLX advantage disappears.

Windows / Linux

All three work. Ollama leads for headless server use, LM Studio for desktop.

RAM to model-size heuristic
8 GB · stick to 3B models (Phi, Gemma 2B/3B)
16 GB · 7–8B models: Llama 3.1 8B, Qwen 7B, Mistral 7B (the sweet spot)
32 GB · 13B comfortable; 70B quantized if you're patient
64 GB+ · 70B models at good quality
More parameters = smarter but slower. You don't need a 70B model to summarize your notes.
What about phones? On-device LLMs on iOS & Android
05
Tool 01 · The softest landing

GPT4All — the
friendliest front door.

GPT4All homepage

The tool I'd hand to a non-technical friend. Install, pick a model from a short curated list, start chatting. No terminal. No decision paralysis.

  • Looks like a normal chat app — no CLI, no config files
  • Built-in LocalDocs: drop a folder, it becomes searchable context
  • Curated model list — no decision paralysis
  • Cross-platform, free, open source
  • Familiar thread-style chat UX, like ChatGPT
LocalDocs is the killer feature. Private RAG over your own files with zero setup. You can always graduate to LM Studio or Ollama later.
Deep dive: LocalDocs · setup, RAG diagram, tips
06
Tool 02 · Power-user toolkit

LM Studio — everything
surfaced.

LM Studio homepage

Where GPT4All hides complexity, LM Studio surfaces it. Every model parameter is a visible knob.

  • Built-in model discovery — browse Hugging Face without leaving the app
  • MLX backend on Apple Silicon (noticeably faster)
  • Per-chat system prompt, temperature, context length, quantization
  • Local server mode with OpenAI-compatible API
  • lms CLI for scripting workflows
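
The server mode is what turns LM Studio into a backend for your own code. A minimal sketch, assuming you've loaded a model and started the local server (LM Studio defaults to port 1234; the model name below is a placeholder — use the identifier the app shows):

from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key can be any non-empty string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder — match the model identifier shown in LM Studio
    messages=[{"role": "user", "content": "Summarize this paragraph in one sentence."}],
)
print(response.choices[0].message.content)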
The tradeoff: more surface area means more to learn. If you're brand new, it can feel overwhelming.
07
Tool 03 · Developer foundation

Ollama — the
foundation layer.

# one-time install
$ brew install ollama

# pull a model
$ ollama pull llama3.1:8b

# run it
$ ollama run llama3.1:8b
Inside the REPL
$ ollama run llama3.1:8b
>>> Write a haiku about Monday morning.
Alarm clock lying,
coffee not strong enough yet —
the week stares me down.
>>> /?     # list slash commands
>>> /bye   # exit

One daemon, many clients

The real magic isn't the CLI — it's that every other chat app in the ecosystem can talk to the same Ollama instance.

Open WebUI
Chatbox
Msty
Page Assist
Obsidian plugins
your own scripts
Deep dive: API cookbook · curl, Python, streaming, tools
  • Runs as a background daemon — no window to babysit
  • Exposes an OpenAI-compatible API at localhost:11434
  • Any app that speaks ChatGPT can instead speak to your local Ollama
08
The Ollama power move

Bake personas
into the model itself.

The Ollama Mac app doesn't have a system-prompt field. At first that seems limiting — until you realize you can bake the personality straight into a model file and it becomes portable across every client.

1. Write the Modelfile

FROM gemma3:4b
SYSTEM """You are Steve Jobs. Direct. Simplicity over features.
Ask piercing questions. Demand insanely great."""

2. Build & run it

$ ollama create steve-jobs -f Modelfile
$ ollama run steve-jobs
Modelfile → ollama create → shows in every client's model dropdown → portable persona
Deep dive: Persona gallery · 5 ready-to-copy Modelfiles + syntax reference
09
Bonus round

Obsidian: your notes
become the AI's memory.

Local LLMs get dramatically more useful when they can reach into your actual knowledge — not just their training data. If you keep notes in Obsidian, three plugins turn your vault into live context.

PLUGIN 01

Copilot for Obsidian

Chat sidebar inside your vault. Points at your local Ollama or LM Studio — no cloud calls.

PLUGIN 02

Smart Connections

Embeds every note into a local vector DB. The LLM retrieves the most relevant pages before answering.

PROTOCOL 03

Ollama MCP

Your vault becomes a live context source any MCP-aware client can query on demand.

Your notes are the brain.
The LLM is the mouth.
Deep dive: Setup walkthrough · plugins, CORS fix, starter prompts
10
Pick your starting point

Where to begin.

Decision tree

Never tried local AI? → GPT4All
Apple Silicon, want it all? → LM Studio
Developer who wants flexibility? → Ollama

You can install all three. They share models through the GGUF format — your downloads aren't locked to any one tool.

Resources

Now go make your laptop a little smarter.


Local LLMs · 2026
Deep dive · GPT4All

LocalDocs — private RAG, zero setup.

Point GPT4All at a folder. It reads, indexes, and cites your documents when it answers — all on your machine, no cloud, no account.

What it actually does

LocalDocs turns a folder on your disk into retrieval context for whatever model you're chatting with. Ask “what's in my meeting notes from last Tuesday?” and it will find the relevant passages, hand them to the LLM, and ground the answer in your actual documents. The pattern is called retrieval-augmented generation — RAG for short — and it's how every “chat with your PDFs” product works under the hood.

How RAG works (simplified)

Documents (PDFs, notes, text) → Chunks (split into passages) → Embeddings (vectors per chunk) → Retrieve (match your query) → Answer (LLM + context)
Your question pulls the three-to-five most relevant chunks from your vault. The LLM only sees those — plus the question.
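
GPT4All does all of this for you, but the pattern itself is small enough to sketch. A minimal, illustrative version using Ollama's OpenAI-compatible endpoint and nomic-embed-text — placeholder chunks, not GPT4All's actual internals:

import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# placeholder "documents" — in LocalDocs these come from your indexed folder
chunks = [
    "Tuesday meeting notes: shipped the onboarding redesign.",
    "Grocery list: eggs, coffee, basil.",
    "Q3 planning: focus on retention, not acquisition.",
]

def embed(text):
    return client.embeddings.create(model="nomic-embed-text", input=text).data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

question = "what's in my meeting notes from last Tuesday?"
q_vec = embed(question)

# retrieve the most relevant chunks, then hand only those to the LLM
ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
context = "\n".join(ranked[:2])

answer = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)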

Setup in four steps

Step 01
Install GPT4All and download a chat model

Grab the installer from gpt4all.io. Open the app, go to Models, and pick a model that fits your RAM — Llama 3.1 8B Instruct is a solid default for 16 GB machines.

Step 02
Create a LocalDocs collection

Click LocalDocs in the sidebar, then + Add Collection. Give it a name (e.g. research, company-wiki) and point it at a folder. GPT4All will watch that folder and reindex when files change.

Good folders to try: your Obsidian vault, a folder of PDFs you've been meaning to read, a dump of exported Notion pages, or a project-specific docs directory.
Step 03
Pick an embedding model

The first time you add a collection, GPT4All will download a small embedding model (Nomic's embedder, about 137 MB). This runs locally too — no API calls. Once it finishes indexing, you'll see a chunk count next to the collection name.

Step 04
Chat with it enabled

Open a new chat, toggle your collection on in the top bar, and ask a question. GPT4All will cite the source files it used at the bottom of each answer. Click a citation to open the source passage.

What works well, what struggles

Works well

  • Markdown and text notes — clean structure, easy to chunk
  • PDFs with real text (not scanned images)
  • Medium-sized collections — hundreds to low-thousands of documents
  • Focused questions with specific keywords
  • Stable content that doesn't change daily

Struggles with

  • Huge codebases — structure-aware tools do better
  • Scanned PDFs without OCR — nothing to embed
  • Vague questions like “what's important?”
  • Cross-document reasoning that needs five sources synthesized
  • Fast-changing content — reindexing has cost

Tips from the field

Privacy note: LocalDocs never sends your documents to any server. The embedder runs locally, the LLM runs locally, the vector store lives on your disk. If you need to prove that to a security team, point them at the open-source code on GitHub.
Deep dive · Obsidian

Your vault, made chat-able.

A step-by-step for wiring Obsidian Copilot to a local Ollama, with the gotchas that will otherwise cost you an afternoon.

The stack you're building

You'll connect four pieces: Obsidian (your notes), Ollama (runs the model), a chat model (answers questions), and an embedding model (makes your notes searchable by meaning, not just keywords). The glue is the Copilot plugin by Logan Yang — it supports cloud models too, but we're going fully local.

Obsidian → Copilot plugin → Ollama daemon → chat model + embedding model

Step-by-step setup

Step 01
Install Ollama and pull two models

One model for chat, one for embeddings. nomic-embed-text is a small, fast embedder that pairs well with most chat models.

# chat model
$ ollama pull llama3.1:8b

# embedding model for RAG
$ ollama pull nomic-embed-text
Step 02
Fix the CORS trap (the afternoon-saving step)

Obsidian calls Ollama from its renderer process, and Ollama will refuse the call unless you whitelist the Obsidian origin. This is the single most common reason “it just won't connect.”

Quit the Ollama app, then in a terminal:

# macOS / Linux
$ OLLAMA_ORIGINS="app://obsidian.md*" ollama serve

# Windows (PowerShell)
$env:OLLAMA_ORIGINS="app://obsidian.md*"; ollama serve

Make this permanent on macOS: launchctl setenv OLLAMA_ORIGINS "app://obsidian.md*", then restart the Ollama app. You won't have to run ollama serve manually anymore.
Step 03
Install the Copilot plugin

In Obsidian: Settings → Community plugins → turn off restricted mode if it's on → Browse → search Copilot → install and enable the one by Logan Yang.

Step 04
Wire up the chat model

Open Copilot settings. Under Add Custom Model:

  • Model name: llama3.1:8b — must match exactly what ollama list shows
  • Provider: Ollama
  • Base URL: http://localhost:11434 (default, usually auto-filled)
  • CORS: enable it in the model form

Click Verify. Green check means you're in. Set it as your default chat model at the top of settings.
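
If Verify fails, first confirm the exact model name — it has to match what Ollama reports, character for character:

$ ollama list                            # the NAME column is what Copilot expects
$ curl http://localhost:11434/api/tags   # same list as JSON, if you prefer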

Step 05
Wire up the embedding model

In Copilot's QA settings, set:

  • Embedding model: nomic-embed-text
  • Provider: Ollama

Then click Index Vault (or similar). First-time indexing takes a few minutes for large vaults. Subsequent reindexes are incremental.

Starter prompts to try

Once it's live, open the Copilot side panel and try these — each one tests a different capability:

Smart Connections — the lightweight alternative

If Copilot feels like too much apparatus, Smart Connections is a simpler plugin. It embeds your notes into a local vector DB and shows the most similar notes in a sidebar as you write — no chat, just discovery. Pair it with Copilot for chat, or use it alone for serendipity.

Troubleshooting

It won't connect

  • CORS: re-check OLLAMA_ORIGINS is set and Ollama was restarted
  • Port conflict: is anything else on 11434?
  • Firewall: try curl localhost:11434 — should return “Ollama is running”

Answers are thin

  • Context window too short — try num_ctx 32768 in the model settings
  • Wrong embedding model — small vaults do fine with default; large vaults benefit from bigger embedders
  • Chat model too small — a 3B model will be weak on synthesis; try 8B+
Tip on context: Copilot respects the num_ctx you set on the Ollama model. If you're getting truncated answers on long notes, run ollama run llama3.1:8b, then /set parameter num_ctx 32768, then /save llama3.1:8b to bake it in.
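In the terminal, that sequence looks like:

$ ollama run llama3.1:8b
>>> /set parameter num_ctx 32768
>>> /save llama3.1:8b
>>> /bye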
Deep dive · Ollama

Ollama API cookbook.

The Ollama daemon speaks two protocols: its own native API and an OpenAI-compatible one. Pick the second and most of your existing code works unchanged.

The one-line mental model

Ollama runs a local HTTP server at http://localhost:11434. Any client that can make an HTTP request can use it. If that client already speaks the OpenAI API, point its base_url at http://localhost:11434/v1 and the API key can be any non-empty string.

Quickstart — curl

The simplest way to verify everything works. This hits Ollama's native chat API:

$ curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "why is the sky blue?"}],
  "stream": false
}'

Python with the OpenAI SDK

This is the pattern I reach for first. If you already use openai, change two lines and your whole codebase runs locally.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required, but unused
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Summarize the French Revolution in 3 bullets."},
    ],
)
print(response.choices[0].message.content)

Streaming

For responsive UIs, stream the response token-by-token. Same SDK, one flag flipped:

stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "tell me a short story"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Tool / function calling

Recent Ollama versions and recent models (Llama 3.1, Qwen 2.5, Mistral) support structured tool calls. Same OpenAI-compatible shape:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "weather in Paris?"}],
    tools=tools,
)
Heads up: smaller models are unreliable at tool-calling. Stick to 7B+ and check response.choices[0].message.tool_calls — it may be None if the model decided not to call the tool.
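
A sketch of that check, using the same response object as above — the dispatch to a real get_weather function is up to you:

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    call = tool_calls[0]
    print(call.function.name)       # "get_weather"
    print(call.function.arguments)  # a JSON string, e.g. '{"city": "Paris"}'
else:
    # the model chose to answer directly instead of calling the tool
    print(response.choices[0].message.content)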

Embeddings

Build your own semantic search or RAG pipeline:

response = client.embeddings.create(
    model="nomic-embed-text",
    input="the quick brown fox",
)
vector = response.data[0].embedding  # list[float], 768 dims

Give Ollama a nicer face — Open WebUI in 3 commands

If you want a ChatGPT-style web interface for your Ollama setup, Open WebUI is the standard pick. Runs in Docker:

$ docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    --name open-webui --restart always \
    ghcr.io/open-webui/open-webui:main

Visit http://localhost:3000. Create an account (stored locally), and it auto-discovers your Ollama models.

Common pitfalls

Deep dive · Modelfiles

Modelfile gallery.

Five ready-to-use personas plus the syntax reference you'll keep coming back to. Copy, customize, share.

Modelfile syntax reference

A Modelfile is a plain text file — no extension required. These are the instructions you'll use most:
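
At minimum, the three that every persona in the gallery below relies on (other instructions exist; these cover everything here):

FROM llama3.1:8b                 # base model to build on
SYSTEM """You are ..."""         # the persona, baked in as the system prompt
PARAMETER temperature 0.7        # sampling knobs (temperature, num_ctx, ...)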

Build any of the personas below with:

$ ollama create <name> -f Modelfile
$ ollama run <name>

The gallery

Persona 01 · Product

Steve Jobs

Direct. Simplicity over features.

Good for product critiques, pitch-deck review, and “should we even build this?” questions. Will tell you your idea is junk.

FROM llama3.1:8b
SYSTEM """You are Steve Jobs. You care about one thing: insanely great products.
Rules:
- Simplicity over features. Subtract before you add.
- Ask piercing questions about who this is for and why it matters.
- Demand taste. Call out mediocrity.
- Short, declarative sentences. No hedging.
- When something is bad, say so."""
PARAMETER temperature 0.7
Persona 02 · Creative

Rick Rubin

Taste as a discipline.

Good for creative work — writing, music, visual art. Asks what you're really trying to say. Slow, patient, unafraid of silence.

FROM llama3.1:8b
SYSTEM """You are Rick Rubin, the record producer. Your job is to help the artist
find what they are really trying to say.
Rules:
- Ask more than you answer. Questions before opinions.
- Point at the feeling, not the technique.
- Name what is fake without meanness.
- Celebrate what is true.
- Short, calm sentences. No filler."""
PARAMETER temperature 0.8
Persona 03 · Research

Skeptical Research Assistant

Show your work.

Good for fact-checking, argument-sharpening, and steelmanning. Will refuse to speculate without flagging it.

FROM llama3.1:8b
SYSTEM """You are a skeptical research assistant.
Rules:
- Distinguish clearly: what is established, what is contested, what is speculation.
- Flag confidence levels: high, medium, low, unknown.
- When you do not know, say so clearly. Do not invent citations.
- Offer the strongest counterargument before giving your own view.
- Prefer primary sources. Name them when you can."""
PARAMETER temperature 0.3
Persona 04 · Engineering

Terse Code Reviewer

No preamble, no praise.

Good for real PR review. Will not say “great start!” It reads the code and ships a list.

FROM qwen2.5-coder:7b
SYSTEM """You are a code reviewer. Assume the author is senior and short on time.
Rules:
- No preamble, no praise, no summaries.
- Output a terse bulleted list of issues, grouped: Correctness, Performance, Readability, Security.
- For each issue: file:line if known, then one sentence, then suggested fix.
- If the code is fine, say LGTM and stop.
- Never invent problems to look useful."""
PARAMETER temperature 0.2
Persona 05 · Visual

Mermaid Diagram Only

Outputs diagrams, never prose.

Good for system-design chats and flowcharts. Describe what you want; it returns a valid Mermaid diagram block and nothing else.

FROM llama3.1:8b
SYSTEM """You output Mermaid diagrams, nothing else.
Rules:
- Every response is a valid Mermaid code block: three backticks, then mermaid, then the diagram, then three backticks.
- No explanatory prose before or after.
- Pick the best Mermaid type (flowchart, sequenceDiagram, stateDiagram, erDiagram, classDiagram) for the request.
- Prefer clear node labels over clever ones.
- If the request is ambiguous, make a reasonable choice and diagram that."""
PARAMETER temperature 0.4

Sharing Modelfiles

A Modelfile is just a text file — drop it in a gist, email it, commit it to a repo. The person on the other end runs ollama create with it and gets the exact same persona you did. It's the most lightweight way to share AI tools that exists: no API keys, no accounts, no SaaS dependency.

You can also publish built models to ollama.com/library with ollama push yourname/model-name if you want a distributable version.
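
Assuming you're signed in to ollama.com and have registered your key (push requires an account), the two-step version looks like:

$ ollama create yourname/steve-jobs -f Modelfile
$ ollama push yourname/steve-jobs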

Temperature cheat sheet: 0.0–0.3 for code, extraction, and facts. 0.5–0.7 for general chat. 0.8–1.2 for creative writing and brainstorming. Above 1.5 and the model starts losing coherence.
Deep dive · Mobile

AI in your pocket.

On-device LLMs aren't a laptop-only story. Modern phones run small, capable models without a network — here's what's worth installing right now.

Why on-device on a phone is interesting

Phones have shipped neural processors for years; now those chips finally have something to do: Gemma 4, Llama 3.2, Phi-3 Mini, and Qwen 2.5 all ship variants designed specifically for mobile hardware. The pitch is the same as on a laptop — privacy, offline, zero per-query cost — with one sharper edge: your phone is always with you. A plane, a subway, a remote trail, a hospital waiting room. The cloud is intermittent there. Your device is not.

Realistic expectations. Mobile LLMs are small — usually 1B to 4B parameters. They're great at summarization, short Q&A, transcription, image description, and structured tasks. They are not a replacement for Claude or GPT on complex reasoning. Think of them as a competent pocket assistant, not a research companion.

The apps worth installing

App 01 · Google

Google AI Edge Gallery

The reference implementation from the folks making Gemma.

Google's open-source showcase app for on-device AI. Download Gemma 4 E2B or E4B right inside the app. Fully offline after the first download, no account required. This is the one that feels most like a real product while also being fully transparent — source code on GitHub, the whole thing.

What you get:

  • AI Chat with Thinking Mode — watch the model's reasoning step by step
  • Ask Image — take a photo, ask questions about it (multimodal)
  • Audio Scribe — on-device transcription and translation
  • Prompt Lab — tune temperature, top-k, test single-turn prompts
  • Agent Skills — tool-using agents that can hit Wikipedia, show maps, generate QR codes
  • Benchmarks — see tokens/sec on your specific device
Platform: iOS 17+ · Android 12+ · Cost: Free · Source: Open
App Store ↗ Play Store ↗ GitHub ↗
App 02 · Community

PocketPal AI

Any GGUF model, right on your phone.

Free, open-source, cross-platform. Built on llama.cpp, so anything you'd run on a laptop with Ollama can run in PocketPal — pick from its catalog or sideload a GGUF you already have. The power-user flavor of mobile LLM.

Why people use it:

  • Enable Metal acceleration on iOS (shift work from CPU to GPU) for a dramatic speed jump
  • Turn on Flash Attention on capable devices for a further boost
  • Make “Pals” — system-prompt personas, exactly like Ollama Modelfiles but in your pocket
  • Import your own GGUF from Hugging Face
Platform: iOS · Android · Cost: Free · Source: Open
App 03 · Apple

Apollo AI

Open-source iOS chat, community-driven.

iOS-only, open-source, llama.cpp under the hood. More technical than its polish-forward competitors — it's the “let me tinker” app. If you want fine control over model parameters and you're on an iPhone or iPad, this is the one.

Platform: iOS / iPadOS · Cost: Free · Source: Open
App 04 · Apple

Haplo AI

Polished, plug-and-play, across your Apple devices.

The commercial option — slick interface, no account required, runs across iPhone, iPad, and Mac with the same library. Good pick if you want local AI that feels like a real product and you're not trying to maximize tinkering.

Platform: iOS / iPadOS / macOS · Cost: Free tier + paid · Source: Closed

What to expect, performance-wise

Works well on mobile

  • Short Q&A, text rewriting, grammar fixes
  • Summarizing a single article or email
  • Image description and OCR (with multimodal models like Gemma 3n/4)
  • Voice transcription (Audio Scribe, on-device Whisper variants)
  • Brainstorming short lists, titles, captions

Works poorly on mobile

  • Long documents — context windows are small (often 512–2k tokens)
  • Heavy reasoning — the model is 2B, not 200B
  • Coding tasks beyond snippets
  • Sustained chat — battery drain and heat become noticeable
  • Accurate factual recall on obscure topics

A pragmatic workflow

The pattern that actually sticks: AI Edge Gallery as your first install (smooth, reliable, multimodal), then add PocketPal if you want to run your own models or experiment more. Use them for the narrow set of things mobile LLMs are genuinely good at — transcription, photo Q&A, quick rewrites — and let your laptop handle everything else.

Battery warning. LLM inference is compute-heavy. Expect noticeable battery drain and phone warmth during long sessions. If your phone gets hot, pause — the OS will throttle, and you'll stop getting good tokens/sec anyway.