Use a webcam with local vision models for object detection, image description, and OCR — no cloud APIs, no data leaving your machine.
⏱ ~20 minutes · 💻 Mac / Linux / Windows · 📷 Webcam or USB camera
What You'll Need
Python 3.8+ installed
A webcam — built-in laptop cam works, USB webcam is better
Ollama running locally (for the LLaVA vision model — see our Ollama guide)
8GB+ RAM (16GB recommended if running YOLO + Ollama together)
💡 Two approaches in this guide: We cover both LLaVA (describe/understand images via Ollama — "what's in this photo?") and YOLO (real-time object detection with bounding boxes — "there's a person, a cup, and a cat"). They're complementary: YOLO tells you what and where, LLaVA tells you why and how.
1 Test Your Camera
First, confirm Python can read from your camera by grabbing a single frame with OpenCV. If you get a captured image, your camera is ready.
⚠️ macOS camera permissions: You'll need to grant Terminal (or your IDE) camera access in System Settings → Privacy & Security → Camera. Python may not prompt you — just add it manually if the camera won't open.
2 Describe Images with LLaVA (via Ollama)
LLaVA is a vision-language model that can describe, analyze, and answer questions about images. It runs through Ollama just like a text model.
```shell
# Pull the vision model
ollama pull llava:7b
```
Now send an image to it from Python:
```python
import cv2, base64, requests

# Capture a photo
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

# Encode as base64
_, buf = cv2.imencode(".jpg", frame)
img_b64 = base64.b64encode(buf).decode()

# Ask LLaVA what it sees
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava:7b",
    "messages": [{
        "role": "user",
        "content": "Describe what you see in this image in detail.",
        "images": [img_b64]
    }],
    "stream": False
})
print(resp.json()["message"]["content"])
```
You'll get a natural language description: "I see a person sitting at a desk with a laptop computer. There's a coffee mug to their left and bookshelves in the background..."
💡 Try different prompts:
• "Read any text visible in this image" — OCR
• "Is anyone in this image? Describe them." — person detection
• "What room is this? List all objects you can see." — scene inventory
• "Rate the lighting in this photo on a scale of 1-10" — analysis
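Every one of these prompts uses the same request shape, so it's handy to factor the payload out into a helper. A sketch (the function name `build_llava_request` is mine, not part of Ollama's API; only the JSON shape comes from the example above):

```python
import base64

def build_llava_request(prompt, image_bytes, model="llava:7b"):
    """Build the JSON body for Ollama's /api/chat endpoint with one image attached."""
    img_b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [img_b64],
        }],
        "stream": False,
    }

# Usage with the capture code above:
# requests.post("http://localhost:11434/api/chat",
#               json=build_llava_request("Read any text visible in this image",
#                                        buf.tobytes()))
```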
3 Real-Time Object Detection with YOLO
YOLO (You Only Look Once) detects objects in real time and draws bounding boxes around them. It's fast enough for live video.
```python
from ultralytics import YOLO
import cv2

# Load YOLOv8 (downloads automatically on first run, ~6MB)
model = YOLO("yolov8n.pt")  # nano model — fastest

cap = cv2.VideoCapture(0)
print("Press 'q' to quit")

while True:
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame, verbose=False)
    annotated = results[0].plot()  # draw boxes + labels
    cv2.imshow("YOLO Vision", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```
You'll see a live video feed with colored boxes around detected objects — people, chairs, cups, phones, laptops, pets — labeled with confidence scores.
| YOLO Model | Size | Speed (CPU) | Accuracy | Best For |
|---|---|---|---|---|
| yolov8n | 6MB | ~30 FPS | Good | Real-time on any hardware, webcam |
| yolov8s | 22MB | ~15 FPS | Better | Good balance of speed and accuracy |
| yolov8m | 50MB | ~8 FPS | High | When accuracy matters more than speed |
| yolov8x | 131MB | ~3 FPS | Highest | Photo analysis, not real-time on CPU |
💡 Tip: YOLO detects 80 common objects out of the box (person, car, dog, chair, phone, etc.). Need custom objects? You can fine-tune it on your own images — but the default covers most use cases.
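When you only care about a few of those 80 classes, filter detections by name before acting on them. A sketch using plain `(name, confidence)` tuples in place of YOLO's box objects (with real results you'd build these from `results[0].names` and `box.cls`, as in the next section):

```python
def filter_detections(detections, wanted, min_conf=0.5):
    """Keep only the classes we care about, above a confidence floor."""
    return [(name, conf) for name, conf in detections
            if name in wanted and conf >= min_conf]

seen = [("person", 0.91), ("chair", 0.64), ("cup", 0.42), ("dog", 0.88)]
print(filter_detections(seen, wanted={"person", "dog"}))
# → [('person', 0.91), ('dog', 0.88)]
```

This is useful in the motion-camera project later: trigger only on "person" or "dog", and ignore a chair that never moves.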
4 Combine YOLO + LLaVA for Smart Vision
The real power: YOLO spots objects fast, then LLaVA explains what it found. The example below sends YOLO's labels along with the full frame to LLaVA; you can also crop each detection and describe it individually.
```python
from ultralytics import YOLO
import cv2, base64, requests

yolo = YOLO("yolov8n.pt")

cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

# Detect objects
results = yolo(frame, verbose=False)
detections = results[0].boxes
print(f"Found {len(detections)} objects")
for box in detections:
    cls = results[0].names[int(box.cls)]
    conf = float(box.conf)
    print(f"  - {cls} ({conf:.0%})")

# Send the full frame to LLaVA for deeper analysis
_, buf = cv2.imencode(".jpg", frame)
img_b64 = base64.b64encode(buf).decode()
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava:7b",
    "messages": [{
        "role": "user",
        "content": "YOLO detected: " + ", ".join(
            results[0].names[int(b.cls)] for b in detections
        ) + ". Describe the scene and what's happening.",
        "images": [img_b64]
    }],
    "stream": False
})
print("\nLLaVA says:", resp.json()["message"]["content"])
```
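To describe a single detection rather than the whole scene, crop the box region first and send just that crop to LLaVA. A sketch (with ultralytics, the pixel coordinates come from `box.xyxy[0]`; the padding default is my own choice):

```python
import numpy as np

def crop_box(frame, xyxy, pad=10):
    """Cut a detection out of the frame, padded and clamped to the image bounds."""
    x1, y1, x2, y2 = (int(v) for v in xyxy)
    h, w = frame.shape[:2]
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
    x2, y2 = min(w, x2 + pad), min(h, y2 + pad)
    return frame[y1:y2, x1:x2]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a webcam frame
crop = crop_box(frame, (100, 50, 200, 150))
print(crop.shape)  # → (120, 120, 3)
```

Encode the crop with `cv2.imencode` exactly as before and pass it in the `images` field, with a prompt like "Describe this object."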
💡 This is how AI OS will use vision. YOLO runs fast as a "what's there" check. When something interesting is detected, LLaVA gives the deeper understanding. Pair this with Whisper and you have a system that can see, hear, and think — all locally.
5 Build a Motion-Triggered Camera
A practical project: only capture and analyze when something moves. Great for monitoring a room, a pet, or a 3D printer.
```python
import cv2, time
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")
cap = cv2.VideoCapture(0)
prev_frame = None
MOTION_THRESHOLD = 25000  # adjust sensitivity
COOLDOWN = 10             # seconds between analyses
last_analysis = 0

print("Watching for motion... (Ctrl+C to stop)")
try:
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (21, 21), 0)
        if prev_frame is None:
            prev_frame = gray
            continue
        delta = cv2.absdiff(prev_frame, gray)
        motion_score = delta.sum() / 1000
        prev_frame = gray
        if motion_score > MOTION_THRESHOLD and time.time() - last_analysis > COOLDOWN:
            print(f"\n🔔 Motion detected (score: {motion_score:.0f})")
            results = yolo(frame, verbose=False)
            for box in results[0].boxes:
                cls = results[0].names[int(box.cls)]
                print(f"  Detected: {cls} ({float(box.conf):.0%})")
            last_analysis = time.time()
except KeyboardInterrupt:
    print("\nStopped.")
finally:
    cap.release()
```
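25000 is only a starting point: the right MOTION_THRESHOLD depends on your camera, lighting, and scene. The score is just the summed absolute pixel difference divided by 1000, which you can reproduce with NumPy alone to get a feel for the numbers, a sketch:

```python
import numpy as np

def motion_score(prev_gray, gray):
    """Same score the loop computes: summed absolute pixel difference / 1000."""
    diff = np.abs(prev_gray.astype(np.int32) - gray.astype(np.int32))
    return diff.sum() / 1000

still = np.full((480, 640), 128, dtype=np.uint8)
moved = still.copy()
moved[:100, :100] = 255  # a bright object enters one corner
print(motion_score(still, still))  # → 0.0
print(motion_score(still, moved))  # 100*100 pixels * 127 / 1000 = 1270.0
```

A practical way to tune it: print the score for a minute of normal footage, note the idle baseline (sensor noise keeps it above zero), and set MOTION_THRESHOLD comfortably above that.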
⚠️ Privacy first. This is a local-only system — nothing goes to any cloud. But if you're pointing a camera at shared spaces, be thoughtful about who you're recording. Vision AI is powerful; use it responsibly.
✅ What You've Set Up
A webcam working with Python and OpenCV for image capture
LLaVA via Ollama for image description, OCR, and scene understanding
YOLO for real-time object detection with bounding boxes
A combined pipeline: YOLO detects → LLaVA analyzes
A motion-triggered camera that only processes when something happens
Next Steps
Add an Edge TPU — a Coral USB Accelerator (around $60) runs YOLO at 100+ FPS and barely uses CPU. Plug and play.
Log detections — write to a JSON file or SQLite database with timestamps. Build a history of what your camera has seen.
Pair with Whisper — add a mic to create a multimodal system: "What do you see?" → camera capture → LLaVA describes → TTS reads it back.
Run on a Pi — a Raspberry Pi 5 + USB camera + Coral TPU makes a dedicated vision station for ~$100.
Train custom detection — fine-tune YOLO on your own objects with as few as 50 labeled images using Roboflow.
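The "log detections" idea above needs nothing beyond the standard library; appending one JSON object per line (JSON Lines) keeps the file easy to tail and parse. A sketch, with a record schema of my own suggestion:

```python
import json, time

def log_detection(path, label, confidence):
    """Append one detection record as a JSON line with a Unix timestamp."""
    record = {"ts": time.time(), "label": label, "confidence": round(confidence, 3)}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Call this inside the motion loop for each YOLO box:
log_detection("detections.jsonl", "person", 0.91)
log_detection("detections.jsonl", "cup", 0.42)
```

Later you can load the history with one `json.loads` per line, or import the file into SQLite if you outgrow it.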
💡 The full stack:Whisper (hear) + Vision AI (see) + Ollama (think) + a speaker (speak) = a fully local AI assistant that perceives the world. No internet required.