Run multiple AI agents with different roles: one for code, one for research, one for memory. Orchestrate them locally with shared context.
⏱ ~60 minutes · 💻 Mac / Linux / Windows · 🧠 Ollama + Python
What You'll Need
Ollama running locally with at least one model pulled (see our Ollama guide)
Python 3.10+
16GB+ RAM recommended (agents share the same Ollama instance)
Basic comfort with Python (this is the most code-heavy guide)
💡 What's a multi-agent system? Instead of one AI doing everything, you split responsibilities across specialized agents. A coder agent writes code. A researcher agent finds information. A memory agent tracks what you've discussed before. A router decides which agent handles each request. Each agent can use a different model, different system prompt, and different tools.
1 Design Your Agent Roles
Start by deciding what agents you need. Here's a practical starting set:
| Agent | Role | Best Model | Why Separate? |
|---|---|---|---|
| Coder | Write & review code | qwen2.5-coder:7b | Coding models outperform generalists at code |
| Writer | Draft text, emails, docs | llama3.1:8b | Creative tasks need a different temperature & style |
| Researcher | Summarize, analyze, explain | gemma2:9b | Strong at following complex instructions |
| Memory | Store & retrieve context | llama3.2:3b | Small model is fast for lookups; main job is database queries |
| Router | Classify & dispatch | llama3.2:1b | Tiny model just picks the right agent; needs to be fast |
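If you haven't pulled these models yet, fetch one per role with `ollama pull` (each is a few GB; skip any role you don't plan to use):

```shell
# One model per role from the table above
ollama pull qwen2.5-coder:7b   # Coder
ollama pull llama3.1:8b        # Writer
ollama pull gemma2:9b          # Researcher
ollama pull llama3.2:3b        # Memory
ollama pull llama3.2:1b        # Router
```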
⚠️ Start with 2–3 agents. It's tempting to build a swarm of 10 agents, but start small. A router + coder + writer covers most use cases. Add more only when you hit a real limitation.
2 Build the Agent Framework
Each agent is a Python class with a system prompt, a model, and a method to call Ollama.
```python
coder = Agent(
    name="coder",
    model="qwen2.5-coder:7b",
    system_prompt="You are an expert programmer. Write clean, working code. Explain your reasoning briefly. Use best practices.",
    temperature=0.3,  # lower = more precise code
)

writer = Agent(
    name="writer",
    model="llama3.1:8b",
    system_prompt="You are a skilled writer. Write clear, engaging prose. Match the user's tone. Be concise.",
    temperature=0.8,  # higher = more creative
)

researcher = Agent(
    name="researcher",
    model="gemma2:9b",
    system_prompt="You are a research analyst. Analyze information thoroughly, cite your reasoning, and present findings clearly.",
    temperature=0.5,
)
```
Test an agent directly:
```python
print(coder.chat("Write a Python function to read a CSV and return the top 5 rows by a given column."))
```
3 Add Shared Memory
Agents need shared context: what's been discussed, what's been decided, what facts are known. SQLite is perfect for this: zero config, fast, local.
```python
import sqlite3


class SharedMemory:
    def __init__(self, db_path="agents.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memory (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                agent TEXT,
                role TEXT,
                content TEXT,
                timestamp TEXT DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS facts (
                key TEXT PRIMARY KEY,
                value TEXT,
                source_agent TEXT,
                updated TEXT DEFAULT CURRENT_TIMESTAMP
            )
        """)

    def log(self, agent_name: str, role: str, content: str):
        self.conn.execute(
            "INSERT INTO memory (agent, role, content) VALUES (?, ?, ?)",
            (agent_name, role, content)
        )
        self.conn.commit()

    def get_recent(self, limit=10) -> list:
        rows = self.conn.execute(
            "SELECT agent, role, content FROM memory ORDER BY id DESC LIMIT ?",
            (limit,)
        ).fetchall()
        return [{"agent": r[0], "role": r[1], "content": r[2]} for r in reversed(rows)]

    def set_fact(self, key: str, value: str, agent: str):
        self.conn.execute(
            "INSERT OR REPLACE INTO facts (key, value, source_agent) VALUES (?, ?, ?)",
            (key, value, agent)
        )
        self.conn.commit()

    def get_fact(self, key: str) -> str | None:
        row = self.conn.execute(
            "SELECT value FROM facts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
```
Now wire memory into the agent's chat method:
```python
memory = SharedMemory()

# Enhanced chat that logs everything and includes context
def chat_with_memory(agent: Agent, message: str) -> str:
    # Get recent conversation as context
    recent = memory.get_recent(limit=6)
    context = [{"role": m["role"], "content": m["content"]} for m in recent]

    # Log the user message
    memory.log(agent.name, "user", message)

    # Get response
    response = agent.chat(message, context=context)

    # Log the response
    memory.log(agent.name, "assistant", response)
    return response
```
💡 Two kinds of memory: The memory table is conversation history (what was said). The facts table is extracted knowledge ("user prefers Python", "project uses FastAPI"). Agents can read from both to stay context-aware across conversations.
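The facts table relies on `key` being a PRIMARY KEY plus `INSERT OR REPLACE`, so writing the same key twice updates in place rather than duplicating. A quick standalone check of that behavior (table and values here are just for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory DB for the demo
conn.execute("CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT, source_agent TEXT)")

# Two writes to the same key - the second overwrites the first
for value, agent in [("FastAPI", "researcher"), ("Flask", "coder")]:
    conn.execute(
        "INSERT OR REPLACE INTO facts (key, value, source_agent) VALUES (?, ?, ?)",
        ("project_framework", value, agent),
    )

row = conn.execute("SELECT value, source_agent FROM facts WHERE key = ?",
                   ("project_framework",)).fetchone()
print(row)  # ('Flask', 'coder') - only the latest write survives
print(conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0])  # 1 row, not 2
```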
4 Build a Router
The router agent looks at each incoming message and decides which specialist should handle it. This is the brain of the system.
```python
router = Agent(
    name="router",
    model="llama3.2:1b",  # small and fast; just classifying
    system_prompt="""You are a request router. Given a user message, respond with ONLY the name of the best agent to handle it. Choose from: coder, writer, researcher.

Rules:
- coder: anything about code, programming, debugging, scripts, APIs
- writer: emails, blog posts, documentation, creative text, rewording
- researcher: analysis, explanations, comparisons, summaries, questions about concepts

Respond with ONLY the agent name, nothing else.""",
    temperature=0.1,  # very deterministic
)

agents = {
    "coder": coder,
    "writer": writer,
    "researcher": researcher,
}

def route_and_respond(message: str) -> tuple[str, str]:
    # Ask the router which agent to use
    choice = router.chat(message).strip().lower()

    # Fall back to researcher if routing fails
    agent = agents.get(choice, researcher)

    # Get the response from the chosen agent
    response = chat_with_memory(agent, message)
    return agent.name, response
```
Test the full pipeline:
```python
# These should route to different agents
tests = [
    "Write a Python script to parse JSON files",
    "Draft an email declining a meeting politely",
    "What's the difference between REST and GraphQL?",
]

for msg in tests:
    agent_name, response = route_and_respond(msg)
    print(f"\n[{agent_name}] {msg}")
    print(response[:200] + "...")
```
⚠️ Routing isn't perfect. Small models sometimes misclassify. Two fixes: (1) use keyword matching as a fast pre-filter before hitting the LLM, (2) let users override with prefixes like /code or /write.
Add keyword pre-routing for speed:
```python
def fast_route(message: str) -> str | None:
    msg = message.lower()
    if msg.startswith("/code"):
        return "coder"
    if msg.startswith("/write"):
        return "writer"
    if msg.startswith("/research"):
        return "researcher"
    code_words = {"function", "code", "script", "debug", "api", "class", "import"}
    if any(w in msg for w in code_words):
        return "coder"
    return None  # fall through to LLM router
```
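One way to combine the two layers: try the cheap keyword pre-filter first and only pay for an LLM call on ambiguous messages. In this sketch the pre-filter and LLM router are passed in as plain functions (an assumption for testability), so the dispatch logic runs without Ollama; in practice you'd pass `fast_route` and a call to the router agent:

```python
KNOWN_AGENTS = {"coder", "writer", "researcher"}

def route(message: str, pre_filter, llm_router) -> str:
    """pre_filter returns an agent name or None; llm_router is the slow fallback."""
    choice = pre_filter(message)
    if choice is None:
        choice = llm_router(message).strip().lower()
    # Guard against the small model inventing an agent name
    return choice if choice in KNOWN_AGENTS else "researcher"

# Keyword hit: the LLM router is never consulted
print(route("/code fix this", lambda m: "coder" if m.startswith("/code") else None,
            lambda m: "writer"))  # coder

# Garbage from the LLM router falls back safely
print(route("hello", lambda m: None, lambda m: "Banana!"))  # researcher
```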
5 Expose as an API with FastAPI
Wrap everything in a FastAPI server so any client (a web UI, a CLI, or AI OS) can talk to your agent system.
```python
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Multi-Agent API")


class ChatRequest(BaseModel):
    message: str
    agent: str | None = None  # optional: force a specific agent


class ChatResponse(BaseModel):
    agent: str
    response: str


@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest):
    if req.agent and req.agent in agents:
        # Direct agent call
        response = chat_with_memory(agents[req.agent], req.message)
        return ChatResponse(agent=req.agent, response=response)
    # Auto-route
    agent_name, response = route_and_respond(req.message)
    return ChatResponse(agent=agent_name, response=response)


@app.get("/agents")
def list_agents():
    return [{"name": a.name, "model": a.model} for a in agents.values()]


@app.get("/memory/recent")
def recent_memory(limit: int = 20):
    return memory.get_recent(limit=limit)
```
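Save the server code to a file (main.py is an assumed filename here) and start it with uvicorn on port 8100, which is the port the curl examples use:

```shell
pip install fastapi "uvicorn[standard]"
uvicorn main:app --port 8100
```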
```shell
# Auto-route
curl -X POST http://localhost:8100/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Write a regex to match email addresses"}'

# Force a specific agent
curl -X POST http://localhost:8100/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Rewrite this more casually", "agent": "writer"}'

# List available agents
curl http://localhost:8100/agents
```
💡 This is how AI OS works under the hood. AI OS uses this exact pattern: specialized agents with shared memory, a router, and a FastAPI server. The guides you've been reading (Ollama, Whisper, Vision) are the capabilities. This guide is the architecture that ties them together.
What You Built
Specialized agents with different models and system prompts
Shared memory via SQLite: conversation history and extracted facts
A router that classifies requests and dispatches to the right agent
Keyword pre-routing for speed plus LLM fallback for ambiguous requests
A FastAPI server exposing the whole system as an API
Next Steps
Add tool use: let the coder agent run code in a sandbox, let the researcher agent search files. Agents that can act are far more useful than agents that only talk.
Add vision and speech: plug in Vision AI and Whisper as input channels. "What do you see?" routes to the vision agent.
Fine-tune your agents: use the fine-tuning guide to train each agent on examples of its specific task.
Agent-to-agent delegation: let agents call each other. The researcher finds info, hands it to the writer to draft a summary, hands that to the coder to format it.
Run on your AI server: deploy the multi-agent API on your always-on AI server so it's available 24/7 from any device.
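The delegation idea from the list above can be prototyped as a plain function chain. This sketch stubs the three agents as callables so the hand-offs are visible without any models running; in practice each stub would be a `chat_with_memory` call:

```python
def delegate(task: str, researcher, writer, coder) -> str:
    """Each stage consumes the previous stage's output."""
    findings = researcher(f"Research this: {task}")
    draft = writer(f"Summarize these findings:\n{findings}")
    return coder(f"Format this summary as a markdown report:\n{draft}")

# Stub agents that just tag their input, to show the flow
result = delegate(
    "local LLM routers",
    researcher=lambda p: f"[research] {p}",
    writer=lambda p: f"[draft] {p}",
    coder=lambda p: f"[report] {p}",
)
print(result.startswith("[report]"))  # True: the coder stage ran last
```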
💡 The complete picture. You now have every piece: Ollama (models) → Model Library (storage) → Fine-Tune (customize) → Whisper (hear) → Vision (see) → Multi-Agent (orchestrate) → Server (host) → NAS (store). That's a full local AI operating system. That's what AI OS is.