Intermediate

Local Vision AI with a Webcam

Use a webcam with local vision models for object detection, image description, and OCR — no cloud APIs, no data leaving your machine.

⏱ ~20 minutes 💻 Mac / Linux / Windows 📷 Webcam or USB camera

What You'll Need

  • Python 3.8+ installed
  • A webcam — built-in laptop cam works, USB webcam is better
  • Ollama running locally (for the LLaVA vision model — see our Ollama guide)
  • 8GB+ RAM (16GB recommended if running YOLO + Ollama together)
💡 Two approaches in this guide: We cover both LLaVA (describe/understand images via Ollama — "what's in this photo?") and YOLO (real-time object detection with bounding boxes — "there's a person, a cup, and a cat"). They're complementary: YOLO tells you what and where, LLaVA tells you why and how.

1 Get Your Camera Working with Python

```shell
# Install OpenCV
pip install opencv-python
```

Test that your camera works:

```python
import cv2

cap = cv2.VideoCapture(0)  # 0 = default camera
if not cap.isOpened():
    print("Can't open camera")
    exit()

ret, frame = cap.read()
if ret:
    cv2.imwrite("test_photo.jpg", frame)
    print(f"Captured {frame.shape[1]}x{frame.shape[0]} image → test_photo.jpg")
cap.release()
```

Run it. If you see a captured image, your camera is ready.

⚠️ macOS camera permissions: You'll need to grant Terminal (or your IDE) camera access in System Settings → Privacy & Security → Camera. Python may not prompt you — just add it manually if the camera won't open.

2 Describe Images with LLaVA (via Ollama)

LLaVA is a vision-language model that can describe, analyze, and answer questions about images. It runs through Ollama just like a text model.

```shell
# Pull the vision model
ollama pull llava:7b
```

Now send an image to it from Python:

```python
import cv2, base64, requests

# Capture a photo
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

# Encode as base64
_, buf = cv2.imencode(".jpg", frame)
img_b64 = base64.b64encode(buf).decode()

# Ask LLaVA what it sees
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava:7b",
    "messages": [{
        "role": "user",
        "content": "Describe what you see in this image in detail.",
        "images": [img_b64]
    }],
    "stream": False
})
print(resp.json()["message"]["content"])
```

You'll get a natural language description: "I see a person sitting at a desk with a laptop computer. There's a coffee mug to their left and bookshelves in the background..."

💡 Try different prompts:
  • "Read any text visible in this image" — OCR
  • "Is anyone in this image? Describe them." — person detection
  • "What room is this? List all objects you can see." — scene inventory
  • "Rate the lighting in this photo on a scale of 1-10" — analysis
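All of these prompts reuse the same request shape; only the `content` string changes. A small helper makes that explicit (a sketch — `build_llava_payload` is a name invented here, not part of the Ollama API):

```python
def build_llava_payload(prompt, img_b64, model="llava:7b"):
    """Build the JSON body for Ollama's /api/chat with one attached image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [img_b64],
        }],
        "stream": False,
    }

# Usage, with requests as in the capture example above:
# resp = requests.post("http://localhost:11434/api/chat",
#                      json=build_llava_payload("Read any text visible in this image", img_b64))
```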

3 Real-Time Object Detection with YOLO

YOLO (You Only Look Once) detects objects in real time and draws bounding boxes around them. It's fast enough for live video.

```shell
# Install ultralytics (YOLO)
pip install ultralytics
```

Run real-time detection on your webcam:

```python
from ultralytics import YOLO
import cv2

# Load YOLOv8 (downloads automatically on first run, ~6MB)
model = YOLO("yolov8n.pt")  # nano model — fastest

cap = cv2.VideoCapture(0)
print("Press 'q' to quit")

while True:
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame, verbose=False)
    annotated = results[0].plot()  # draw boxes + labels
    cv2.imshow("YOLO Vision", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```

You'll see a live video feed with colored boxes around detected objects — people, chairs, cups, phones, laptops, pets — labeled with confidence scores.

| Model   | Size  | Speed (CPU) | Accuracy | Best For                               |
|---------|-------|-------------|----------|----------------------------------------|
| yolov8n | 6MB   | ~30 FPS     | Good     | Real-time on any hardware, webcam      |
| yolov8s | 22MB  | ~15 FPS     | Better   | Good balance of speed and accuracy     |
| yolov8m | 50MB  | ~8 FPS      | High     | When accuracy matters more than speed  |
| yolov8x | 131MB | ~3 FPS      | Highest  | Photo analysis, not real-time on CPU   |
💡 Tip: YOLO detects 80 common objects out of the box (person, car, dog, chair, phone, etc.). Need custom objects? You can fine-tune it on your own images — but the default covers most use cases.
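In practice you often care about only a few of those 80 classes. Filtering is a plain loop over the detections; the sketch below stands in for `results[0].boxes` with (class_name, confidence) tuples so the logic is clear on its own (the detection values are made up):

```python
WATCH = {"person", "dog", "cat"}  # classes we care about
MIN_CONF = 0.5                    # ignore low-confidence detections

# Stand-in for YOLO output: (class_name, confidence) pairs
detections = [("person", 0.91), ("chair", 0.84), ("dog", 0.42), ("cat", 0.77)]

interesting = [(cls, conf) for cls, conf in detections
               if cls in WATCH and conf >= MIN_CONF]
for cls, conf in interesting:
    print(f"{cls}: {conf:.0%}")
```

With real YOLO results, the tuple unpacking becomes `results[0].names[int(box.cls)]` and `float(box.conf)`, exactly as in the combined example below.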

4 Combine YOLO + LLaVA for Smart Vision

The real power: YOLO spots objects fast, then LLaVA analyzes what it found. Detect → crop → describe.

```python
from ultralytics import YOLO
import cv2, base64, requests

yolo = YOLO("yolov8n.pt")

cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

# Detect objects
results = yolo(frame, verbose=False)
detections = results[0].boxes
print(f"Found {len(detections)} objects")
for box in detections:
    cls = results[0].names[int(box.cls)]
    conf = float(box.conf)
    print(f"  - {cls} ({conf:.0%})")

# Send the full frame to LLaVA for deeper analysis
_, buf = cv2.imencode(".jpg", frame)
img_b64 = base64.b64encode(buf).decode()
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llava:7b",
    "messages": [{
        "role": "user",
        "content": "YOLO detected: " + ", ".join(
            results[0].names[int(b.cls)] for b in detections
        ) + ". Describe the scene and what's happening.",
        "images": [img_b64]
    }],
    "stream": False
})
print("\nLLaVA says:", resp.json()["message"]["content"])
```
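The example above sends the whole frame to LLaVA. To follow the detect → crop → describe idea literally, slice each bounding box out of the frame before encoding it — a NumPy sketch with a synthetic frame and hypothetical box coordinates (real boxes expose corners via `box.xyxy`):

```python
import numpy as np

# Synthetic 480x640 BGR frame standing in for a webcam capture
frame = np.zeros((480, 640, 3), dtype=np.uint8)

# Hypothetical (x1, y1, x2, y2) corners from one YOLO detection
x1, y1, x2, y2 = 100, 50, 300, 250

crop = frame[y1:y2, x1:x2]   # rows are y, columns are x
print(crop.shape)            # (200, 200, 3)

# Each crop can then be encoded and sent to LLaVA exactly like the full frame:
# _, buf = cv2.imencode(".jpg", crop)
```

Cropping keeps the LLaVA prompt focused on one object at a time, at the cost of one request per detection.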
💡 This is how AI OS will use vision. YOLO runs fast as a "what's there" check. When something interesting is detected, LLaVA gives the deeper understanding. Pair this with Whisper and you have a system that can see, hear, and think — all locally.

5 Build a Motion-Triggered Camera

A practical project: only capture and analyze when something moves. Great for monitoring a room, a pet, or a 3D printer.

```python
import cv2, time
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")
cap = cv2.VideoCapture(0)

prev_frame = None
MOTION_THRESHOLD = 25000  # adjust sensitivity
COOLDOWN = 10             # seconds between analyses
last_analysis = 0

print("Watching for motion... (Ctrl+C to stop)")
try:
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (21, 21), 0)
        if prev_frame is None:
            prev_frame = gray
            continue
        delta = cv2.absdiff(prev_frame, gray)
        motion_score = delta.sum() / 1000
        prev_frame = gray
        if motion_score > MOTION_THRESHOLD and time.time() - last_analysis > COOLDOWN:
            print(f"\n🔔 Motion detected (score: {motion_score:.0f})")
            results = yolo(frame, verbose=False)
            for box in results[0].boxes:
                cls = results[0].names[int(box.cls)]
                print(f"  Detected: {cls} ({float(box.conf):.0%})")
            last_analysis = time.time()
except KeyboardInterrupt:
    print("\nStopped.")
cap.release()
```
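To get a feel for MOTION_THRESHOLD, compute the score on frames you control. This sketch uses plain NumPy arrays in place of camera frames (the frame size and change region are arbitrary choices for illustration):

```python
import numpy as np

# Two 480x640 grayscale "frames": identical except a 100x100 patch
# that brightens by 50 levels (e.g. an arm entering the shot)
prev = np.zeros((480, 640), dtype=np.uint8)
curr = prev.copy()
curr[100:200, 100:200] += 50

delta = np.abs(curr.astype(int) - prev.astype(int))
motion_score = delta.sum() / 1000   # same scaling as the loop above
print(motion_score)                  # 500.0

# 100*100 pixels * 50 levels / 1000 = 500, well under the default
# threshold of 25000, so a small localized change won't trigger analysis.
```

Lower the threshold if real motion is being missed; raise it if lighting flicker keeps triggering it.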
⚠️ Privacy first. This is a local-only system — nothing goes to any cloud. But if you're pointing a camera at shared spaces, be thoughtful about who you're recording. Vision AI is powerful; use it responsibly.

✅ What You've Set Up

  • A webcam working with Python and OpenCV for image capture
  • LLaVA via Ollama for image description, OCR, and scene understanding
  • YOLO for real-time object detection with bounding boxes
  • A combined pipeline: YOLO detects → LLaVA analyzes
  • A motion-triggered camera that only processes when something happens

Next Steps

  • Add an Edge TPU — a Coral USB Accelerator offloads detection from the CPU and can reach 100+ FPS, though the model must first be converted to a quantized TFLite Edge TPU format.
  • Log detections — write to a JSON file or SQLite database with timestamps. Build a history of what your camera has seen.
  • Pair with Whisper — add a mic to create a multimodal system: "What do you see?" → camera capture → LLaVA describes → TTS reads it back.
  • Run on a Pi — a Raspberry Pi 5 + USB camera + Coral TPU makes a dedicated vision station for ~$100.
  • Train custom detection — fine-tune YOLO on your own objects with as few as 50 labeled images using Roboflow.
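The detection-logging idea above can be sketched with the standard library's sqlite3 — the schema and function names here are invented for illustration, so swap in whatever fields you need:

```python
import sqlite3, time

def open_log(path="detections.db"):
    """Open (or create) a SQLite log of timestamped detections."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS detections (
        ts REAL, label TEXT, confidence REAL)""")
    return conn

def log_detection(conn, label, confidence):
    conn.execute("INSERT INTO detections VALUES (?, ?, ?)",
                 (time.time(), label, confidence))
    conn.commit()

# Usage inside the YOLO loop:
# for box in results[0].boxes:
#     log_detection(conn, results[0].names[int(box.cls)], float(box.conf))
```

Querying by time range then gives you the "what has my camera seen" history for free via plain SQL.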
💡 The full stack: Whisper (hear) + Vision AI (see) + Ollama (think) + a speaker (speak) = a fully local AI assistant that perceives the world. No internet required.
