🧠 Advanced

Build a Multi-Agent System

Run multiple AI agents with different roles: one for code, one for research, one for memory. Orchestrate them locally with shared context.

⏱ ~60 minutes 💻 Mac / Linux / Windows 🧠 Ollama + Python

What You'll Need

  • Ollama running locally with at least one model pulled (see our Ollama guide)
  • Python 3.10+
  • 16GB+ RAM recommended (agents share the same Ollama instance)
  • Basic comfort with Python; this is the most code-heavy guide
💡 What's a multi-agent system? Instead of one AI doing everything, you split responsibilities across specialized agents. A coder agent writes code. A researcher agent finds information. A memory agent tracks what you've discussed before. A router decides which agent handles each request. Each agent can use a different model, different system prompt, and different tools.

1 Design Your Agent Roles

Start by deciding what agents you need. Here's a practical starting set:

| Agent | Role | Best Model | Why Separate? |
| --- | --- | --- | --- |
| Coder | Write & review code | qwen2.5-coder:7b | Coding models outperform generalists at code |
| Writer | Draft text, emails, docs | llama3.1:8b | Creative tasks need a different temperature & style |
| Researcher | Summarize, analyze, explain | gemma2:9b | Strong at following complex instructions |
| Memory | Store & retrieve context | llama3.2:3b | Small model is fast for lookups; main job is database queries |
| Router | Classify & dispatch | llama3.2:1b | Tiny model just picks the right agent; needs to be fast |
⚠️ Start with 2–3 agents. It's tempting to build a swarm of 10, but a router + coder + writer covers most use cases. Add more only when you hit a real limitation.

2 Build the Agent Framework

Each agent is a Python class with a system prompt, a model, and a method to call Ollama.

import requests
from dataclasses import dataclass

OLLAMA_URL = "http://localhost:11434/api/chat"

@dataclass
class Agent:
    name: str
    model: str
    system_prompt: str
    temperature: float = 0.7

    def chat(self, user_message: str, context: list | None = None) -> str:
        messages = [{"role": "system", "content": self.system_prompt}]
        if context:
            messages.extend(context)
        messages.append({"role": "user", "content": user_message})
        resp = requests.post(OLLAMA_URL, json={
            "model": self.model,
            "messages": messages,
            "stream": False,
            "options": {"temperature": self.temperature}
        })
        resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError below
        return resp.json()["message"]["content"]

Now define your agents:

coder = Agent(
    name="coder",
    model="qwen2.5-coder:7b",
    system_prompt="You are an expert programmer. Write clean, working code. Explain your reasoning briefly. Use best practices.",
    temperature=0.3,  # lower = more precise code
)

writer = Agent(
    name="writer",
    model="llama3.1:8b",
    system_prompt="You are a skilled writer. Write clear, engaging prose. Match the user's tone. Be concise.",
    temperature=0.8,  # higher = more creative
)

researcher = Agent(
    name="researcher",
    model="gemma2:9b",
    system_prompt="You are a research analyst. Analyze information thoroughly, cite your reasoning, and present findings clearly.",
    temperature=0.5,
)

Test an agent directly:

print(coder.chat("Write a Python function to read a CSV and return the top 5 rows by a given column."))

3 Add Shared Memory

Agents need shared context: what's been discussed, what's been decided, what facts are known. SQLite is perfect for this: zero config, fast, local.

import sqlite3

class SharedMemory:
    def __init__(self, db_path="agents.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memory (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                agent TEXT,
                role TEXT,
                content TEXT,
                timestamp TEXT DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS facts (
                key TEXT PRIMARY KEY,
                value TEXT,
                source_agent TEXT,
                updated TEXT DEFAULT CURRENT_TIMESTAMP
            )
        """)

    def log(self, agent_name: str, role: str, content: str):
        self.conn.execute(
            "INSERT INTO memory (agent, role, content) VALUES (?, ?, ?)",
            (agent_name, role, content)
        )
        self.conn.commit()

    def get_recent(self, limit=10) -> list:
        rows = self.conn.execute(
            "SELECT agent, role, content FROM memory ORDER BY id DESC LIMIT ?",
            (limit,)
        ).fetchall()
        return [{"agent": r[0], "role": r[1], "content": r[2]} for r in reversed(rows)]

    def set_fact(self, key: str, value: str, agent: str):
        self.conn.execute(
            "INSERT OR REPLACE INTO facts (key, value, source_agent) VALUES (?, ?, ?)",
            (key, value, agent)
        )
        self.conn.commit()

    def get_fact(self, key: str) -> str | None:
        row = self.conn.execute(
            "SELECT value FROM facts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

Now wire memory into the agent's chat method:

memory = SharedMemory()

# Enhanced chat that logs everything and includes context
def chat_with_memory(agent: Agent, message: str) -> str:
    # Get recent conversation as context
    recent = memory.get_recent(limit=6)
    context = [{"role": m["role"], "content": m["content"]} for m in recent]
    # Log the user message
    memory.log(agent.name, "user", message)
    # Get response
    response = agent.chat(message, context=context)
    # Log the response
    memory.log(agent.name, "assistant", response)
    return response
💡 Two kinds of memory: The memory table is conversation history (what was said). The facts table is extracted knowledge ("user prefers Python", "project uses FastAPI"). Agents can read from both to stay context-aware across conversations.
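As a concrete illustration of the facts pattern, here's a standalone sketch. It uses an in-memory database and a stripped-down copy of the facts table so it runs without the rest of the system; the point is that INSERT OR REPLACE on a PRIMARY KEY gives upsert semantics, so each key always holds the newest value:

```python
import sqlite3

# In-memory stand-in for the facts table defined in SharedMemory above
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        key TEXT PRIMARY KEY,
        value TEXT,
        source_agent TEXT
    )
""")

def set_fact(key, value, agent):
    # INSERT OR REPLACE: a second write to the same key overwrites the row
    conn.execute(
        "INSERT OR REPLACE INTO facts (key, value, source_agent) VALUES (?, ?, ?)",
        (key, value, agent),
    )

set_fact("preferred_language", "Python", "memory")
set_fact("preferred_language", "Rust", "memory")  # overwrites the old value

row = conn.execute(
    "SELECT value, source_agent FROM facts WHERE key = ?",
    ("preferred_language",),
).fetchone()
print(row)  # → ('Rust', 'memory')
```

Because the key is a primary key, the table never accumulates stale duplicates; one row per fact, always current.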

4 Build a Router

The router agent looks at each incoming message and decides which specialist should handle it. This is the brain of the system.

router = Agent(
    name="router",
    model="llama3.2:1b",  # small and fast: just classifying
    system_prompt="""You are a request router. Given a user message, respond with ONLY the name of the best agent to handle it. Choose from: coder, writer, researcher.

Rules:
- coder: anything about code, programming, debugging, scripts, APIs
- writer: emails, blog posts, documentation, creative text, rewording
- researcher: analysis, explanations, comparisons, summaries, questions about concepts

Respond with ONLY the agent name, nothing else.""",
    temperature=0.1,  # very deterministic
)

agents = {
    "coder": coder,
    "writer": writer,
    "researcher": researcher,
}

def route_and_respond(message: str) -> tuple[str, str]:
    # Ask the router which agent to use
    choice = router.chat(message).strip().lower()
    # Fall back to researcher if routing fails
    agent = agents.get(choice, researcher)
    # Get the response from the chosen agent
    response = chat_with_memory(agent, message)
    return agent.name, response

Test the full pipeline:

# These should route to different agents
tests = [
    "Write a Python script to parse JSON files",
    "Draft an email declining a meeting politely",
    "What's the difference between REST and GraphQL?",
]

for msg in tests:
    agent_name, response = route_and_respond(msg)
    print(f"\n[{agent_name}] {msg}")
    print(response[:200] + "...")
⚠️ Routing isn't perfect. Small models sometimes misclassify. Two fixes: (1) use keyword matching as a fast pre-filter before hitting the LLM, (2) let users override with prefixes like /code or /write.

Add keyword pre-routing for speed:

def fast_route(message: str) -> str | None:
    msg = message.lower()
    if msg.startswith("/code"):
        return "coder"
    if msg.startswith("/write"):
        return "writer"
    if msg.startswith("/research"):
        return "researcher"
    code_words = {"function", "code", "script", "debug", "api", "class", "import"}
    if any(w in msg for w in code_words):
        return "coder"
    return None  # fall through to LLM router

5 Expose as an API with FastAPI

Wrap everything in a FastAPI server so any client (a web UI, a CLI, or AI OS) can talk to your agent system.

# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Multi-Agent API")

class ChatRequest(BaseModel):
    message: str
    agent: str | None = None  # optional: force a specific agent

class ChatResponse(BaseModel):
    agent: str
    response: str

@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest):
    if req.agent and req.agent in agents:
        # Direct agent call
        response = chat_with_memory(agents[req.agent], req.message)
        return ChatResponse(agent=req.agent, response=response)
    # Auto-route
    agent_name, response = route_and_respond(req.message)
    return ChatResponse(agent=agent_name, response=response)

@app.get("/agents")
def list_agents():
    return [{"name": a.name, "model": a.model} for a in agents.values()]

@app.get("/memory/recent")
def recent_memory(limit: int = 20):
    return memory.get_recent(limit=limit)

Run the server (save the code above as multi_agent.py first):

uvicorn multi_agent:app --host 0.0.0.0 --port 8100

Now you can call it from anywhere:

# Auto-route
curl -X POST http://localhost:8100/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Write a regex to match email addresses"}'

# Force a specific agent
curl -X POST http://localhost:8100/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Rewrite this more casually", "agent": "writer"}'

# List available agents
curl http://localhost:8100/agents
💡 This is how AI OS works under the hood. AI OS uses this exact pattern: specialized agents with shared memory, a router, and a FastAPI server. The guides you've been reading (Ollama, Whisper, Vision) are the capabilities. This guide is the architecture that ties them together.

How It Fits Together

           ┌─────────────┐
           │    User     │
           │   Request   │
           └──────┬──────┘
                  │
           ┌──────▼──────┐
           │   Router    │ ← classifies intent
           │ (1B model)  │
           └──┬───┬───┬──┘
              │   │   │
    ┌─────────┘   │   └─────────┐
    │             │             │
┌───▼────────┐ ┌──▼───────┐ ┌───▼───────┐
│   Coder    │ │  Writer  │ │ Researcher│
│ (qwen 7b)  │ │(llama 8b)│ │ (gemma 9b)│
└──────┬─────┘ └────┬─────┘ └────┬──────┘
       │            │            │
       └────────────┼────────────┘
                    │
            ┌───────▼───────┐
            │    Shared     │
            │    Memory     │ ← SQLite
            │  (history +   │
            │     facts)    │
            └───────────────┘

✅ What You've Built

  • Specialized agents with different models and system prompts
  • Shared memory via SQLite: conversation history and extracted facts
  • A router that classifies requests and dispatches to the right agent
  • Keyword pre-routing for speed plus LLM fallback for ambiguous requests
  • A FastAPI server exposing the whole system as an API

Next Steps

  • Add tool use: let the coder agent run code in a sandbox, and let the researcher agent search files. Agents that can act are far more useful than agents that only talk.
  • Add vision and speech: plug in Vision AI and Whisper as input channels. "What do you see?" routes to the vision agent.
  • Fine-tune your agents: use the fine-tuning guide to train each agent on examples of its specific task.
  • Agent-to-agent delegation: let agents call each other. The researcher finds info, hands it to the writer to draft a summary, which hands that to the coder to format it.
  • Run on your AI server: deploy the multi-agent API on your always-on AI server so it's available 24/7 from any device.
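The delegation idea above can be sketched in a few lines. Here each step is just a text-to-text callable so the chain runs without Ollama; in the real system each step would wrap chat_with_memory with a specific agent (e.g. lambda msg: chat_with_memory(researcher, msg)):

```python
# Minimal delegation sketch: a pipeline where each agent's output
# becomes the next agent's input.
def delegate(steps, task: str) -> str:
    result = task
    for step in steps:
        result = step(result)
    return result

# Toy stand-ins for researcher -> writer -> coder
research = lambda t: f"findings({t})"
write    = lambda t: f"summary({t})"
fmt      = lambda t: f"formatted({t})"

print(delegate([research, write, fmt], "REST vs GraphQL"))
# → formatted(summary(findings(REST vs GraphQL)))
```

A fixed pipeline like this is the simplest form of delegation; the next step up is letting an agent decide at runtime which other agent to hand off to, which is where loop guards (max hops, cycle detection) become important.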
💡 The complete picture. You now have every piece: Ollama (models) → Model Library (storage) → Fine-Tune (customize) → Whisper (hear) → Vision (see) → Multi-Agent (orchestrate) → Server (host) → NAS (store). That's a full local AI operating system. That's what AI OS is.
