Intermediate

Local Vision AI with a Webcam

Use a webcam with local vision models for object detection, image description, and OCR — no cloud APIs, no data leaving your machine.

⏱ ~20 minutes 💻 Mac / Linux / Windows 📷 Webcam or USB camera

What You'll Need

  • Python 3.8+ installed
  • A webcam — built-in laptop cam works, USB webcam is better
  • Ollama running locally (for the LLaVA vision model — see our Ollama guide)
  • 8GB+ RAM (16GB recommended if running YOLO + Ollama together)
💡 Two approaches in this guide: We cover both LLaVA (describe/understand images via Ollama — "what's in this photo?") and YOLO (real-time object detection with bounding boxes — "there's a person, a cup, and a cat"). They're complementary: YOLO tells you what and where, LLaVA tells you why and how.

1 Get Your Camera Working with Python

```shell
# Install OpenCV
pip install opencv-python
```

Test that your camera works:

```python
import cv2

cap = cv2.VideoCapture(0)  # 0 = default camera
if not cap.isOpened():
    print("Can't open camera")
    exit()

ret, frame = cap.read()
if ret:
    cv2.imwrite("test_photo.jpg", frame)
    print(f"Captured {frame.shape[1]}x{frame.shape[0]} image → test_photo.jpg")
cap.release()
```

Run it. If you see a captured image, your camera is ready.

⚠️ macOS camera permissions: You'll need to grant Terminal (or your IDE) camera access in System Settings → Privacy & Security → Camera. Python may not prompt you — just add it manually if the camera won't open.

2 Describe Images with LLaVA (via Ollama)

LLaVA is a vision-language model that can describe, analyze, and answer questions about images. It runs through Ollama just like a text model.

```shell
# Pull the vision model
ollama pull llava:7b
```

Now send an image to it from Python:

```python
import cv2, base64, requests

# Capture a photo
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

# Encode as base64
_, buf = cv2.imencode(".jpg", frame)
img_b64 = base64.b64encode(buf).decode()

# Ask LLaVA what it sees
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava:7b",
    "messages": [{
        "role": "user",
        "content": "Describe what you see in this image in detail.",
        "images": [img_b64]
    }],
    "stream": False
})
print(resp.json()["message"]["content"])
```

You'll get a natural language description: "I see a person sitting at a desk with a laptop computer. There's a coffee mug to their left and bookshelves in the background..."

💡 Try different prompts:
  • "Read any text visible in this image" — OCR
  • "Is anyone in this image? Describe them." — person detection
  • "What room is this? List all objects you can see." — scene inventory
  • "Rate the lighting in this photo on a scale of 1-10" — analysis
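All of these prompts reuse the same request shape; only the `content` string changes. A small helper makes that explicit (a sketch — `build_llava_payload` is a name invented here, not part of the Ollama API):

```python
def build_llava_payload(prompt, img_b64, model="llava:7b"):
    """Build the JSON body for Ollama's /api/chat with one attached image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [img_b64],
        }],
        "stream": False,
    }

# Usage, with requests as in the capture example above:
# resp = requests.post("http://localhost:11434/api/chat",
#                      json=build_llava_payload("Read any text visible in this image", img_b64))
```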

3 Real-Time Object Detection with YOLO

YOLO (You Only Look Once) detects objects in real time and draws bounding boxes around them. It's fast enough for live video.

```shell
# Install ultralytics (YOLO)
pip install ultralytics
```

Run real-time detection on your webcam:

```python
from ultralytics import YOLO
import cv2

# Load YOLOv8 (downloads automatically on first run, ~6MB)
model = YOLO("yolov8n.pt")  # nano model — fastest

cap = cv2.VideoCapture(0)
print("Press 'q' to quit")

while True:
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame, verbose=False)
    annotated = results[0].plot()  # draw boxes + labels
    cv2.imshow("YOLO Vision", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```

You'll see a live video feed with colored boxes around detected objects — people, chairs, cups, phones, laptops, pets — labeled with confidence scores.

| Model   | Size  | Speed (CPU) | Accuracy | Best For                               |
|---------|-------|-------------|----------|----------------------------------------|
| yolov8n | 6MB   | ~30 FPS     | Good     | Real-time on any hardware, webcam      |
| yolov8s | 22MB  | ~15 FPS     | Better   | Good balance of speed and accuracy     |
| yolov8m | 50MB  | ~8 FPS      | High     | When accuracy matters more than speed  |
| yolov8x | 131MB | ~3 FPS      | Highest  | Photo analysis, not real-time on CPU   |
💡 Tip: YOLO detects 80 common objects out of the box (person, car, dog, chair, phone, etc.). Need custom objects? You can fine-tune it on your own images — but the default covers most use cases.
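In practice you often care about only a few of those 80 classes. Filtering is a plain loop over the detections; the sketch below stands in for `results[0].boxes` with (class_name, confidence) tuples so the logic is clear on its own (the detection values are made up):

```python
WATCH = {"person", "dog", "cat"}  # classes we care about
MIN_CONF = 0.5                    # ignore low-confidence detections

# Stand-in for YOLO output: (class_name, confidence) pairs
detections = [("person", 0.91), ("chair", 0.84), ("dog", 0.42), ("cat", 0.77)]

interesting = [(cls, conf) for cls, conf in detections
               if cls in WATCH and conf >= MIN_CONF]
for cls, conf in interesting:
    print(f"{cls}: {conf:.0%}")
```

With real YOLO results, the tuple unpacking becomes `results[0].names[int(box.cls)]` and `float(box.conf)`, exactly as in the combined example below.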

4 Combine YOLO + LLaVA for Smart Vision

The real power: YOLO spots objects fast, then LLaVA analyzes what it found. Detect → crop → describe.

```python
from ultralytics import YOLO
import cv2, base64, requests

yolo = YOLO("yolov8n.pt")

cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

# Detect objects
results = yolo(frame, verbose=False)
detections = results[0].boxes
print(f"Found {len(detections)} objects")
for box in detections:
    cls = results[0].names[int(box.cls)]
    conf = float(box.conf)
    print(f"  - {cls} ({conf:.0%})")

# Send the full frame to LLaVA for deeper analysis
_, buf = cv2.imencode(".jpg", frame)
img_b64 = base64.b64encode(buf).decode()
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava:7b",
    "messages": [{
        "role": "user",
        "content": "YOLO detected: " + ", ".join(
            results[0].names[int(b.cls)] for b in detections
        ) + ". Describe the scene and what's happening.",
        "images": [img_b64]
    }],
    "stream": False
})
print("\nLLaVA says:", resp.json()["message"]["content"])
```
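The example above sends the whole frame to LLaVA. To follow the detect → crop → describe idea literally, slice each bounding box out of the frame before encoding it — a NumPy sketch with a synthetic frame and hypothetical box coordinates (real boxes expose corners via `box.xyxy`):

```python
import numpy as np

# Synthetic 480x640 BGR frame standing in for a webcam capture
frame = np.zeros((480, 640, 3), dtype=np.uint8)

# Hypothetical (x1, y1, x2, y2) corners from one YOLO detection
x1, y1, x2, y2 = 100, 50, 300, 250

crop = frame[y1:y2, x1:x2]   # rows are y, columns are x
print(crop.shape)            # (200, 200, 3)

# Each crop can then be encoded and sent to LLaVA exactly like the full frame:
# _, buf = cv2.imencode(".jpg", crop)
```

Cropping keeps the LLaVA prompt focused on one object at a time, at the cost of one request per detection.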
💡 This is how AI OS will use vision. YOLO runs fast as a "what's there" check. When something interesting is detected, LLaVA gives the deeper understanding. Pair this with Whisper and you have a system that can see, hear, and think — all locally.

5 Build a Motion-Triggered Camera

A practical project: only capture and analyze when something moves. Great for monitoring a room, a pet, or a 3D printer.

```python
import cv2, time
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")
cap = cv2.VideoCapture(0)

prev_frame = None
MOTION_THRESHOLD = 25000  # adjust sensitivity
COOLDOWN = 10             # seconds between analyses
last_analysis = 0

print("Watching for motion... (Ctrl+C to stop)")
try:
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (21, 21), 0)
        if prev_frame is None:
            prev_frame = gray
            continue
        delta = cv2.absdiff(prev_frame, gray)
        motion_score = delta.sum() / 1000
        prev_frame = gray
        if motion_score > MOTION_THRESHOLD and time.time() - last_analysis > COOLDOWN:
            print(f"\n🔔 Motion detected (score: {motion_score:.0f})")
            results = yolo(frame, verbose=False)
            for box in results[0].boxes:
                cls = results[0].names[int(box.cls)]
                print(f"  Detected: {cls} ({float(box.conf):.0%})")
            last_analysis = time.time()
except KeyboardInterrupt:
    print("\nStopped.")
cap.release()
```
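To get a feel for MOTION_THRESHOLD, compute the score on frames you control. This sketch uses plain NumPy arrays in place of camera frames (the frame size and change region are arbitrary choices for illustration):

```python
import numpy as np

# Two 480x640 grayscale "frames": identical except a 100x100 patch
# that brightens by 50 levels (e.g. an arm entering the shot)
prev = np.zeros((480, 640), dtype=np.uint8)
curr = prev.copy()
curr[100:200, 100:200] += 50

delta = np.abs(curr.astype(int) - prev.astype(int))
motion_score = delta.sum() / 1000   # same scaling as the loop above
print(motion_score)                  # 500.0

# 100*100 pixels * 50 levels / 1000 = 500, well under the default
# threshold of 25000, so a small localized change won't trigger analysis.
```

Lower the threshold if real motion is being missed; raise it if lighting flicker keeps triggering it.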
⚠️ Privacy first. This is a local-only system — nothing goes to any cloud. But if you're pointing a camera at shared spaces, be thoughtful about who you're recording. Vision AI is powerful; use it responsibly.

✅ What You've Set Up

  • A webcam working with Python and OpenCV for image capture
  • LLaVA via Ollama for image description, OCR, and scene understanding
  • YOLO for real-time object detection with bounding boxes
  • A combined pipeline: YOLO detects → LLaVA analyzes
  • A motion-triggered camera that only processes when something happens

Next Steps

  • Add an Edge TPU — a Coral USB Accelerator offloads detection from the CPU and can reach 100+ FPS, though the model must first be converted to a quantized TFLite Edge TPU format.
  • Log detections — write to a JSON file or SQLite database with timestamps. Build a history of what your camera has seen.
  • Pair with Whisper — add a mic to create a multimodal system: "What do you see?" → camera capture → LLaVA describes → TTS reads it back.
  • Run on a Pi — a Raspberry Pi 5 + USB camera + Coral TPU makes a dedicated vision station for ~$100.
  • Train custom detection — fine-tune YOLO on your own objects with as few as 50 labeled images using Roboflow.
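The detection-logging idea above can be sketched with the standard library's sqlite3 — the schema and function names here are invented for illustration, so swap in whatever fields you need:

```python
import sqlite3, time

def open_log(path="detections.db"):
    """Open (or create) a SQLite log of timestamped detections."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS detections (
        ts REAL, label TEXT, confidence REAL)""")
    return conn

def log_detection(conn, label, confidence):
    conn.execute("INSERT INTO detections VALUES (?, ?, ?)",
                 (time.time(), label, confidence))
    conn.commit()

# Usage inside the YOLO loop:
# for box in results[0].boxes:
#     log_detection(conn, results[0].names[int(box.cls)], float(box.conf))
```

Querying by time range then gives you the "what has my camera seen" history for free via plain SQL.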
💡 The full stack: Whisper (hear) + Vision AI (see) + Ollama (think) + a speaker (speak) = a fully local AI assistant that perceives the world. No internet required.
