Run multiple AI agents with different roles: one for code, one for research, one for memory. Orchestrate them locally with shared context.
⏱ ~60 minutes · 💻 Mac / Linux / Windows · 🧠 Ollama + Python
What You'll Need
Ollama running locally with at least one model pulled (see our Ollama guide)
Python 3.10+
16GB+ RAM recommended (agents share the same Ollama instance)
Basic comfort with Python (this is the most code-heavy guide)
💡 What's a multi-agent system? Instead of one AI doing everything, you split responsibilities across specialized agents. A coder agent writes code. A researcher agent finds information. A memory agent tracks what you've discussed before. A router decides which agent handles each request. Each agent can use a different model, different system prompt, and different tools.
1 Design Your Agent Roles
Start by deciding what agents you need. Here's a practical starting set:
| Agent | Role | Best Model | Why Separate? |
|---|---|---|---|
| Coder | Write & review code | qwen2.5-coder:7b | Coding models outperform generalists at code |
| Writer | Draft text, emails, docs | llama3.1:8b | Creative tasks need a different temperature & style |
| Researcher | Summarize, analyze, explain | gemma2:9b | Strong at following complex instructions |
| Memory | Store & retrieve context | llama3.2:3b | Small model is fast for lookups; main job is database queries |
| Router | Classify & dispatch | llama3.2:1b | Tiny model just picks the right agent; needs to be fast |
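If you haven't pulled these models yet, fetch one per role with `ollama pull` (each is a few GB; skip any role you don't plan to use):

```shell
# One model per role from the table above
ollama pull qwen2.5-coder:7b   # Coder
ollama pull llama3.1:8b        # Writer
ollama pull gemma2:9b          # Researcher
ollama pull llama3.2:3b        # Memory
ollama pull llama3.2:1b        # Router
```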
⚠️ Start with 2–3 agents. It's tempting to build a swarm of 10 agents, but start small. A router + coder + writer covers most use cases. Add more only when you hit a real limitation.
2 Build the Agent Framework
Each agent is a Python class with a system prompt, a model, and a method to call Ollama.
```python
coder = Agent(
    name="coder",
    model="qwen2.5-coder:7b",
    system_prompt="You are an expert programmer. Write clean, working code. Explain your reasoning briefly. Use best practices.",
    temperature=0.3,  # lower = more precise code
)

writer = Agent(
    name="writer",
    model="llama3.1:8b",
    system_prompt="You are a skilled writer. Write clear, engaging prose. Match the user's tone. Be concise.",
    temperature=0.8,  # higher = more creative
)

researcher = Agent(
    name="researcher",
    model="gemma2:9b",
    system_prompt="You are a research analyst. Analyze information thoroughly, cite your reasoning, and present findings clearly.",
    temperature=0.5,
)
```
Test an agent directly:
```python
print(coder.chat("Write a Python function to read a CSV and return the top 5 rows by a given column."))
```
3 Add Shared Memory
Agents need shared context: what's been discussed, what's been decided, what facts are known. SQLite is perfect for this: zero config, fast, local.
```python
import sqlite3


class SharedMemory:
    def __init__(self, db_path="agents.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memory (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                agent TEXT,
                role TEXT,
                content TEXT,
                timestamp TEXT DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS facts (
                key TEXT PRIMARY KEY,
                value TEXT,
                source_agent TEXT,
                updated TEXT DEFAULT CURRENT_TIMESTAMP
            )
        """)

    def log(self, agent_name: str, role: str, content: str):
        self.conn.execute(
            "INSERT INTO memory (agent, role, content) VALUES (?, ?, ?)",
            (agent_name, role, content)
        )
        self.conn.commit()

    def get_recent(self, limit=10) -> list:
        rows = self.conn.execute(
            "SELECT agent, role, content FROM memory ORDER BY id DESC LIMIT ?",
            (limit,)
        ).fetchall()
        return [{"agent": r[0], "role": r[1], "content": r[2]} for r in reversed(rows)]

    def set_fact(self, key: str, value: str, agent: str):
        self.conn.execute(
            "INSERT OR REPLACE INTO facts (key, value, source_agent) VALUES (?, ?, ?)",
            (key, value, agent)
        )
        self.conn.commit()

    def get_fact(self, key: str) -> str | None:
        row = self.conn.execute(
            "SELECT value FROM facts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
```
Now wire memory into the agent's chat method:
```python
memory = SharedMemory()

# Enhanced chat that logs everything and includes context
def chat_with_memory(agent: Agent, message: str) -> str:
    # Get recent conversation as context
    recent = memory.get_recent(limit=6)
    context = [{"role": m["role"], "content": m["content"]} for m in recent]

    # Log the user message
    memory.log(agent.name, "user", message)

    # Get response
    response = agent.chat(message, context=context)

    # Log the response
    memory.log(agent.name, "assistant", response)
    return response
```
💡 Two kinds of memory: The memory table is conversation history (what was said). The facts table is extracted knowledge ("user prefers Python", "project uses FastAPI"). Agents can read from both to stay context-aware across conversations.
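The facts table relies on `key` being a PRIMARY KEY plus `INSERT OR REPLACE`, so writing the same key twice updates in place rather than duplicating. A quick standalone check of that behavior (table and values here are just for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory DB for the demo
conn.execute("CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT, source_agent TEXT)")

# Two writes to the same key - the second overwrites the first
for value, agent in [("FastAPI", "researcher"), ("Flask", "coder")]:
    conn.execute(
        "INSERT OR REPLACE INTO facts (key, value, source_agent) VALUES (?, ?, ?)",
        ("project_framework", value, agent),
    )

row = conn.execute("SELECT value, source_agent FROM facts WHERE key = ?",
                   ("project_framework",)).fetchone()
print(row)  # ('Flask', 'coder') - only the latest write survives
print(conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0])  # 1 row, not 2
```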
4 Build a Router
The router agent looks at each incoming message and decides which specialist should handle it. This is the brain of the system.
```python
router = Agent(
    name="router",
    model="llama3.2:1b",  # small and fast; just classifying
    system_prompt="""You are a request router. Given a user message, respond with ONLY the name of the best agent to handle it. Choose from: coder, writer, researcher.

Rules:
- coder: anything about code, programming, debugging, scripts, APIs
- writer: emails, blog posts, documentation, creative text, rewording
- researcher: analysis, explanations, comparisons, summaries, questions about concepts

Respond with ONLY the agent name, nothing else.""",
    temperature=0.1,  # very deterministic
)

agents = {
    "coder": coder,
    "writer": writer,
    "researcher": researcher,
}

def route_and_respond(message: str) -> tuple[str, str]:
    # Ask the router which agent to use
    choice = router.chat(message).strip().lower()

    # Fall back to researcher if routing fails
    agent = agents.get(choice, researcher)

    # Get the response from the chosen agent
    response = chat_with_memory(agent, message)
    return agent.name, response
```
Test the full pipeline:
```python
# These should route to different agents
tests = [
    "Write a Python script to parse JSON files",
    "Draft an email declining a meeting politely",
    "What's the difference between REST and GraphQL?",
]

for msg in tests:
    agent_name, response = route_and_respond(msg)
    print(f"\n[{agent_name}] {msg}")
    print(response[:200] + "...")
```
⚠️ Routing isn't perfect. Small models sometimes misclassify. Two fixes: (1) use keyword matching as a fast pre-filter before hitting the LLM, (2) let users override with prefixes like /code or /write.
Add keyword pre-routing for speed:
```python
def fast_route(message: str) -> str | None:
    msg = message.lower()
    if msg.startswith("/code"):
        return "coder"
    if msg.startswith("/write"):
        return "writer"
    if msg.startswith("/research"):
        return "researcher"
    code_words = {"function", "code", "script", "debug", "api", "class", "import"}
    if any(w in msg for w in code_words):
        return "coder"
    return None  # fall through to LLM router
```
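One way to combine the two layers: try the cheap keyword pre-filter first and only pay for an LLM call on ambiguous messages. In this sketch the pre-filter and LLM router are passed in as plain functions (an assumption for testability), so the dispatch logic runs without Ollama; in practice you'd pass `fast_route` and a call to the router agent:

```python
KNOWN_AGENTS = {"coder", "writer", "researcher"}

def route(message: str, pre_filter, llm_router) -> str:
    """pre_filter returns an agent name or None; llm_router is the slow fallback."""
    choice = pre_filter(message)
    if choice is None:
        choice = llm_router(message).strip().lower()
    # Guard against the small model inventing an agent name
    return choice if choice in KNOWN_AGENTS else "researcher"

# Keyword hit: the LLM router is never consulted
print(route("/code fix this", lambda m: "coder" if m.startswith("/code") else None,
            lambda m: "writer"))  # coder

# Garbage from the LLM router falls back safely
print(route("hello", lambda m: None, lambda m: "Banana!"))  # researcher
```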
5 Expose as an API with FastAPI
Wrap everything in a FastAPI server so any client (a web UI, a CLI, or AI OS) can talk to your agent system.
```python
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Multi-Agent API")


class ChatRequest(BaseModel):
    message: str
    agent: str | None = None  # optional: force a specific agent


class ChatResponse(BaseModel):
    agent: str
    response: str


@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest):
    if req.agent and req.agent in agents:
        # Direct agent call
        response = chat_with_memory(agents[req.agent], req.message)
        return ChatResponse(agent=req.agent, response=response)
    # Auto-route
    agent_name, response = route_and_respond(req.message)
    return ChatResponse(agent=agent_name, response=response)


@app.get("/agents")
def list_agents():
    return [{"name": a.name, "model": a.model} for a in agents.values()]


@app.get("/memory/recent")
def recent_memory(limit: int = 20):
    return memory.get_recent(limit=limit)
```
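Save the server code to a file (main.py is an assumed filename here) and start it with uvicorn on port 8100, which is the port the curl examples use:

```shell
pip install fastapi "uvicorn[standard]"
uvicorn main:app --port 8100
```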
```shell
# Auto-route
curl -X POST http://localhost:8100/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Write a regex to match email addresses"}'

# Force a specific agent
curl -X POST http://localhost:8100/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Rewrite this more casually", "agent": "writer"}'

# List available agents
curl http://localhost:8100/agents
```
💡 This is how AI OS works under the hood. AI OS uses this exact pattern: specialized agents with shared memory, a router, and a FastAPI server. The guides you've been reading (Ollama, Whisper, Vision) are the capabilities. This guide is the architecture that ties them together.
What You Built
Specialized agents with different models and system prompts
Shared memory via SQLite: conversation history and extracted facts
A router that classifies requests and dispatches to the right agent
Keyword pre-routing for speed plus LLM fallback for ambiguous requests
A FastAPI server exposing the whole system as an API
Next Steps
Add tool use: let the coder agent run code in a sandbox, let the researcher agent search files. Agents that can act are far more useful than agents that only talk.
Add vision and speech: plug in Vision AI and Whisper as input channels. "What do you see?" routes to the vision agent.
Fine-tune your agents: use the fine-tuning guide to train each agent on examples of its specific task.
Agent-to-agent delegation: let agents call each other. The researcher finds info, hands it to the writer to draft a summary, hands that to the coder to format it.
Run on your AI server: deploy the multi-agent API on your always-on AI server so it's available 24/7 from any device.
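The delegation idea from the list above can be prototyped as a plain function chain. This sketch stubs the three agents as callables so the hand-offs are visible without any models running; in practice each stub would be a `chat_with_memory` call:

```python
def delegate(task: str, researcher, writer, coder) -> str:
    """Each stage consumes the previous stage's output."""
    findings = researcher(f"Research this: {task}")
    draft = writer(f"Summarize these findings:\n{findings}")
    return coder(f"Format this summary as a markdown report:\n{draft}")

# Stub agents that just tag their input, to show the flow
result = delegate(
    "local LLM routers",
    researcher=lambda p: f"[research] {p}",
    writer=lambda p: f"[draft] {p}",
    coder=lambda p: f"[report] {p}",
)
print(result.startswith("[report]"))  # True: the coder stage ran last
```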
💡 The complete picture. You now have every piece: Ollama (models) → Model Library (storage) → Fine-Tune (customize) → Whisper (hear) → Vision (see) → Multi-Agent (orchestrate) → Server (host) → NAS (store). That's a full local AI operating system. That's what AI OS is.