🎙️
Beginner

Voice-to-Text with Whisper

Set up OpenAI's Whisper locally for speech-to-text. Transcribe audio files or use your mic for real-time dictation — all offline.

⏱ ~10 minutes 💻 Mac / Linux / Windows 🎙️ USB mic recommended

What You'll Need

  • Python 3.8+ installed (python3 --version to check)
  • A computer with at least 8GB RAM (the tiny model works on less)
  • A microphone — built-in works, but a USB mic gives much better accuracy
  • ~1–3GB disk space depending on the model size you choose
💡 Tip: Whisper runs on CPU by default. With an NVIDIA GPU and CUDA installed, openai-whisper will use it automatically, and faster-whisper can with device="cuda". Expect 5–10x faster transcription. Apple Silicon GPU (MPS) support is spotty in both implementations, so plan on CPU speeds on a Mac.
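If you're unsure what to pass for faster-whisper's device and compute_type arguments, a tiny helper can encode the usual choices. This is a sketch: pick_device is my own naming, not part of any library, but the value pairs are faster-whisper's standard options.

```python
def pick_device(has_cuda):
    """Return (device, compute_type) for faster_whisper.WhisperModel.

    CUDA GPUs run well in float16; CPUs (including Apple Silicon,
    which faster-whisper can't use as a GPU) do best with int8.
    """
    return ("cuda", "float16") if has_cuda else ("cpu", "int8")

device, compute_type = pick_device(has_cuda=False)
print(device, compute_type)  # cpu int8
# Then: WhisperModel("base", device=device, compute_type=compute_type)
```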

1 Install Whisper

There are two main ways to run Whisper locally. We'll cover both.

Option A: OpenAI Whisper (original)

# Install via pip
pip install -U openai-whisper

# Also needs ffmpeg for audio processing
# Mac:     brew install ffmpeg
# Linux:   sudo apt install ffmpeg
# Windows: download from https://ffmpeg.org/download.html

Option B: Faster-Whisper (recommended)

A reimplementation of Whisper built on CTranslate2 that runs up to 4x faster with less memory, at the same accuracy.

pip install faster-whisper
💡 Tip: We recommend faster-whisper for most people. Same models, same quality, much faster. The Python examples below use faster-whisper; note that openai-whisper's Python API is slightly different (whisper.load_model("base").transcribe(...) returns a dict), so the two aren't drop-in interchangeable.

2 Transcribe an Audio File

The simplest use case — turn any audio or video file into text.

Command line (openai-whisper):

# Transcribe any audio/video file
whisper recording.mp3 --model base

# Specify language for better accuracy
whisper recording.mp3 --model base --language en

# Output as text, SRT subtitles, or VTT
whisper recording.mp3 --model base --output_format txt

Python script (faster-whisper):

from faster_whisper import WhisperModel

# Load the model (downloads on first run)
model = WhisperModel("base", device="cpu")

# Transcribe
segments, info = model.transcribe("recording.mp3")
for segment in segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

You'll see timestamped text output — each segment with its start/end time and the transcribed words.
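The CLI's --output_format srt flag produces subtitle files directly; in Python you format the segment timestamps yourself. Here's a small sketch, where srt_timestamp is my own helper, not part of faster-whisper:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(83.5))  # 00:01:23,500
# In the loop above, a subtitle line would be:
#   f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}"
```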

3 Choose The Right Model

Whisper has 5 model sizes. Bigger = more accurate, but slower and needs more RAM.

Model      Size     RAM     Speed           Best For
tiny       75MB     ~1GB    ~32x realtime   Quick & dirty, real-time on weak hardware
base       142MB    ~1GB    ~16x realtime   Good balance — our default recommendation
small      466MB    ~2GB    ~6x realtime    Noticeably better accuracy
medium     1.5GB    ~5GB    ~2x realtime    High accuracy, handles accents well
large-v3   3GB      ~10GB   ~1x realtime    Best accuracy, needs GPU for practical use
⚠️ Realtime speed note: "16x realtime" means a 1-minute file takes roughly 4 seconds to transcribe. These figures are for CPU; GPU users will see much faster speeds across all sizes.
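Those realtime factors turn into simple arithmetic when you're deciding which model to use for a backlog of recordings. A throwaway helper (estimate_seconds is my own name, not part of Whisper):

```python
def estimate_seconds(audio_seconds, realtime_factor):
    """Rough CPU transcription time: audio length / realtime factor."""
    return audio_seconds / realtime_factor

# A 10-minute podcast with the base model (~16x realtime):
print(f"{estimate_seconds(600, 16):.1f}s")  # 37.5s
# The same file with large-v3 (~1x realtime):
print(f"{estimate_seconds(600, 1):.0f}s")   # 600s
```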

4 Real-Time Mic Transcription

This is where it gets fun — talk into your mic and see text appear instantly. We'll use sounddevice to capture audio.

# Install audio capture library
pip install sounddevice numpy

Here's a minimal real-time transcription script:

import os, tempfile, wave

import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")
RATE = 16000
CHUNK_SEC = 5  # transcribe every 5 seconds

print("Listening... (Ctrl+C to stop)")
try:
    while True:
        audio = sd.rec(int(RATE * CHUNK_SEC), samplerate=RATE,
                       channels=1, dtype="float32")
        sd.wait()
        # Save the chunk to a temp WAV (closed first so Windows can reopen it)
        tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        tmp.close()
        with wave.open(tmp.name, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)  # 16-bit samples
            wf.setframerate(RATE)
            wf.writeframes((audio * 32767).astype(np.int16).tobytes())
        segments, _ = model.transcribe(tmp.name)
        for seg in segments:
            print(seg.text, end=" ", flush=True)
        os.remove(tmp.name)  # don't leak temp files
except KeyboardInterrupt:
    print("\nDone.")

Run it and start talking. Every 5 seconds, it transcribes the chunk and prints the text. Press Ctrl+C to stop.

💡 Tip: For better real-time results, use the tiny or base model. The small model gives better accuracy but may lag behind on CPU. Adjust CHUNK_SEC to balance latency vs. accuracy.
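One more easy CPU saver: skip transcribing chunks that are just silence. A minimal energy gate, assuming the float32 audio from the script above; the 0.01 RMS threshold is my guess and worth tuning for your mic:

```python
import numpy as np

def is_silent(audio, threshold=0.01):
    """Skip transcription when a chunk's RMS energy is below a threshold."""
    rms = np.sqrt(np.mean(audio ** 2))
    return rms < threshold

# Near-zero samples are silent; a loud 440 Hz tone is not.
silence = np.zeros(16000, dtype="float32")
tone = 0.5 * np.sin(np.linspace(0, 2 * np.pi * 440, 16000)).astype("float32")
print(is_silent(silence), is_silent(tone))  # True False
```

In the mic loop, wrap the transcribe call in `if not is_silent(audio):`. For real voice activity detection, see the silero-vad suggestion in Next Steps.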

5 Pair Whisper with Ollama

The real power comes when you pipe your voice into a local LLM. Talk → transcribe → AI responds.

# After transcribing with Whisper, send to Ollama
import requests

def ask_ollama(text):
    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": text}],
        "stream": False
    })
    return resp.json()["message"]["content"]

# In your transcription loop:
transcript = "What's the weather like on Mars?"
answer = ask_ollama(transcript)
print(f"AI: {answer}")
💡 Tip: This is basically what AI OS does under the hood — Whisper handles your voice, Ollama handles your thinking. The Ollama guide covers installation if you haven't set it up yet.
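To glue the two pieces together, the mic loop from step 4 hands each chunk's segments to the LLM. A minimal sketch of that glue; voice_turn is my own helper, and the send function is injected so you can pass in ask_ollama (or a stub for testing):

```python
def voice_turn(segments, send):
    """Join Whisper segment texts into one prompt and send it to the LLM."""
    transcript = " ".join(seg.strip() for seg in segments if seg.strip())
    if not transcript:
        return None  # nothing was said, skip the LLM call
    return send(transcript)

# Usage with a stub in place of ask_ollama:
reply = voice_turn(["What's the weather ", "like on Mars?"],
                   lambda t: f"echo: {t}")
print(reply)  # echo: What's the weather like on Mars?
```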

✅ What You've Set Up

  • Whisper installed and running locally — no cloud, no API keys
  • Ability to transcribe any audio/video file to text
  • Real-time microphone transcription with a Python script
  • A voice → AI pipeline pairing Whisper with Ollama

Next Steps

  • Get a better mic — a USB condenser mic dramatically improves transcription accuracy. Check our hardware store.
  • Try different languages — Whisper supports 99 languages. Add --language ja for Japanese, --language es for Spanish, etc.
  • Add voice activity detection — use silero-vad to only transcribe when someone is speaking (saves CPU).
  • Build a voice assistant — combine Whisper + Ollama + a TTS engine for a fully local voice AI.
⚠️ Mic permissions: On macOS, you'll need to grant Terminal (or your IDE) microphone access in System Settings → Privacy & Security → Microphone. On Linux, make sure your user is in the audio group.

📚 Learning Links

Videos

Official Docs

Community & Tools