🎙️
Beginner

Voice-to-Text with Whisper

Set up OpenAI's Whisper locally for speech-to-text. Transcribe audio files or use your mic for real-time dictation — all offline.

⏱ ~10 minutes 💻 Mac / Linux / Windows 🎙️ USB mic recommended

What You'll Need

  • Python 3.8+ installed (python3 --version to check)
  • A computer with at least 8GB RAM (the tiny model works on less)
  • A microphone — built-in works, but a USB mic gives much better accuracy
  • ~1–3GB disk space depending on the model size you choose
💡 Tip: Whisper runs on CPU by default. With an NVIDIA GPU and CUDA installed, openai-whisper will use it automatically, and faster-whisper can with device="cuda". Expect 5–10x faster transcription. Apple Silicon GPU (MPS) support is spotty in both implementations, so plan on CPU speeds on a Mac.
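If you're unsure what to pass for faster-whisper's device and compute_type arguments, a tiny helper can encode the usual choices. This is a sketch: pick_device is my own naming, not part of any library, but the value pairs are faster-whisper's standard options.

```python
def pick_device(has_cuda):
    """Return (device, compute_type) for faster_whisper.WhisperModel.

    CUDA GPUs run well in float16; CPUs (including Apple Silicon,
    which faster-whisper can't use as a GPU) do best with int8.
    """
    return ("cuda", "float16") if has_cuda else ("cpu", "int8")

device, compute_type = pick_device(has_cuda=False)
print(device, compute_type)  # cpu int8
# Then: WhisperModel("base", device=device, compute_type=compute_type)
```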

1 Install Whisper

There are two main ways to run Whisper locally. We'll cover both.

Option A: OpenAI Whisper (original)

# Install via pip
pip install -U openai-whisper

# Also needs ffmpeg for audio processing
# Mac:     brew install ffmpeg
# Linux:   sudo apt install ffmpeg
# Windows: download from https://ffmpeg.org/download.html

Option B: Faster-Whisper (recommended)

A reimplementation of Whisper built on CTranslate2 that runs up to 4x faster with less memory, at the same accuracy.

pip install faster-whisper
💡 Tip: We recommend faster-whisper for most people. Same models, same quality, much faster. The Python examples below use faster-whisper; note that openai-whisper's Python API is slightly different (whisper.load_model("base").transcribe(...) returns a dict), so the two aren't drop-in interchangeable.

2 Transcribe an Audio File

The simplest use case — turn any audio or video file into text.

Command line (openai-whisper):

# Transcribe any audio/video file
whisper recording.mp3 --model base

# Specify language for better accuracy
whisper recording.mp3 --model base --language en

# Output as text, SRT subtitles, or VTT
whisper recording.mp3 --model base --output_format txt

Python script (faster-whisper):

from faster_whisper import WhisperModel

# Load the model (downloads on first run)
model = WhisperModel("base", device="cpu")

# Transcribe
segments, info = model.transcribe("recording.mp3")
for segment in segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

You'll see timestamped text output — each segment with its start/end time and the transcribed words.
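The CLI's --output_format srt flag produces subtitle files directly; in Python you format the segment timestamps yourself. Here's a small sketch, where srt_timestamp is my own helper, not part of faster-whisper:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(83.5))  # 00:01:23,500
# In the loop above, a subtitle line would be:
#   f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}"
```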

3 Choose The Right Model

Whisper has 5 model sizes. Bigger = more accurate, but slower and needs more RAM.

Model      Size     RAM     Speed           Best For
tiny       75MB     ~1GB    ~32x realtime   Quick & dirty, real-time on weak hardware
base       142MB    ~1GB    ~16x realtime   Good balance — our default recommendation
small      466MB    ~2GB    ~6x realtime    Noticeably better accuracy
medium     1.5GB    ~5GB    ~2x realtime    High accuracy, handles accents well
large-v3   3GB      ~10GB   ~1x realtime    Best accuracy, needs GPU for practical use
⚠️ Realtime speed note: "16x realtime" means a 1-minute file takes roughly 4 seconds to transcribe. These figures are for CPU; GPU users will see much faster speeds across all sizes.
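Those realtime factors turn into simple arithmetic when you're deciding which model to use for a backlog of recordings. A throwaway helper (estimate_seconds is my own name, not part of Whisper):

```python
def estimate_seconds(audio_seconds, realtime_factor):
    """Rough CPU transcription time: audio length / realtime factor."""
    return audio_seconds / realtime_factor

# A 10-minute podcast with the base model (~16x realtime):
print(f"{estimate_seconds(600, 16):.1f}s")  # 37.5s
# The same file with large-v3 (~1x realtime):
print(f"{estimate_seconds(600, 1):.0f}s")   # 600s
```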

4 Real-Time Mic Transcription

This is where it gets fun — talk into your mic and see text appear instantly. We'll use sounddevice to capture audio.

# Install audio capture library
pip install sounddevice numpy

Here's a minimal real-time transcription script:

import os, tempfile, wave

import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")
RATE = 16000
CHUNK_SEC = 5  # transcribe every 5 seconds

print("Listening... (Ctrl+C to stop)")
try:
    while True:
        audio = sd.rec(int(RATE * CHUNK_SEC), samplerate=RATE,
                       channels=1, dtype="float32")
        sd.wait()
        # Save the chunk to a temp WAV (closed first so Windows can reopen it)
        tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        tmp.close()
        with wave.open(tmp.name, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # 16-bit samples
            wf.setframerate(RATE)
            wf.writeframes((audio * 32767).astype(np.int16).tobytes())
        segments, _ = model.transcribe(tmp.name)
        for seg in segments:
            print(seg.text, end=" ", flush=True)
        os.remove(tmp.name)  # don't leak temp files
except KeyboardInterrupt:
    print("\nDone.")

Run it and start talking. Every 5 seconds, it transcribes the chunk and prints the text. Press Ctrl+C to stop.

💡 Tip: For better real-time results, use the tiny or base model. The small model gives better accuracy but may lag behind on CPU. Adjust CHUNK_SEC to balance latency vs. accuracy.
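One more easy CPU saver: skip transcribing chunks that are just silence. A minimal energy gate, assuming the float32 audio from the script above; the 0.01 RMS threshold is my guess and worth tuning for your mic:

```python
import numpy as np

def is_silent(audio, threshold=0.01):
    """Skip transcription when a chunk's RMS energy is below a threshold."""
    rms = np.sqrt(np.mean(audio ** 2))
    return rms < threshold

# Near-zero samples are silent; a loud 440 Hz tone is not.
silence = np.zeros(16000, dtype="float32")
tone = 0.5 * np.sin(np.linspace(0, 2 * np.pi * 440, 16000)).astype("float32")
print(is_silent(silence), is_silent(tone))  # True False
```

In the mic loop, wrap the transcribe call in `if not is_silent(audio):`. For real voice activity detection, see the silero-vad suggestion in Next Steps.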

5 Pair Whisper with Ollama

The real power comes when you pipe your voice into a local LLM. Talk → transcribe → AI responds.

# After transcribing with Whisper, send to Ollama
import requests

def ask_ollama(text):
    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": text}],
        "stream": False
    })
    return resp.json()["message"]["content"]

# In your transcription loop:
transcript = "What's the weather like on Mars?"
answer = ask_ollama(transcript)
print(f"AI: {answer}")
💡 Tip: This is basically what AI OS does under the hood — Whisper handles your voice, Ollama handles your thinking. The Ollama guide covers installation if you haven't set it up yet.
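To glue the two pieces together, the mic loop from step 4 hands each chunk's segments to the LLM. A minimal sketch of that glue; voice_turn is my own helper, and the send function is injected so you can pass in ask_ollama (or a stub for testing):

```python
def voice_turn(segments, send):
    """Join Whisper segment texts into one prompt and send it to the LLM."""
    transcript = " ".join(seg.strip() for seg in segments if seg.strip())
    if not transcript:
        return None  # nothing was said, skip the LLM call
    return send(transcript)

# Usage with a stub in place of ask_ollama:
reply = voice_turn(["What's the weather ", "like on Mars?"],
                   lambda t: f"echo: {t}")
print(reply)  # echo: What's the weather like on Mars?
```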

✅ What You've Set Up

  • Whisper installed and running locally — no cloud, no API keys
  • Ability to transcribe any audio/video file to text
  • Real-time microphone transcription with a Python script
  • A voice → AI pipeline pairing Whisper with Ollama

Next Steps

  • Get a better mic — a USB condenser mic dramatically improves transcription accuracy. Check our hardware store.
  • Try different languages — Whisper supports 99 languages. Add --language ja for Japanese, --language es for Spanish, etc.
  • Add voice activity detection — use silero-vad to only transcribe when someone is speaking (saves CPU).
  • Build a voice assistant — combine Whisper + Ollama + a TTS engine for a fully local voice AI.
⚠️ Mic permissions: On macOS, you'll need to grant Terminal (or your IDE) microphone access in System Settings → Privacy & Security → Microphone. On Linux, make sure your user is in the audio group.

📚 Learning Links

Videos

Official Docs

Community & Tools