Voice-to-Text with Whisper
Set up OpenAI's Whisper locally for speech-to-text. Transcribe audio files or use your mic for real-time dictation — all offline.
What You'll Need
- Python 3.8+ installed (run `python3 --version` to check)
- A computer with at least 8GB RAM (the tiny model works on less)
- A microphone — built-in works, but a USB mic gives much better accuracy
- ~1–3GB disk space depending on the model size you choose
1 Install Whisper
There are two main ways to run Whisper locally. We'll cover both.
Option A: OpenAI Whisper (original)
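Install the original package with pip. Whisper relies on ffmpeg to decode audio, so install that too if you don't already have it:

```shell
# Install the original Whisper package
pip install -U openai-whisper

# Whisper needs ffmpeg to decode audio files:
#   macOS:         brew install ffmpeg
#   Ubuntu/Debian: sudo apt install ffmpeg
```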
Option B: Faster-Whisper (recommended)
A reimplementation of Whisper on the CTranslate2 inference engine that runs up to 4x faster with less memory, at the same accuracy.
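Installation is a single pip package; model weights are downloaded automatically the first time you load a model:

```shell
pip install faster-whisper
```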
2 Transcribe an Audio File
The simplest use case — turn any audio or video file into text.
Command line (openai-whisper):
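A typical invocation looks like this (`audio.mp3` is a placeholder for your own file):

```shell
# Transcribe with the base model; writes .txt, .srt, and other outputs
whisper audio.mp3 --model base

# Limit output to plain text and set the language explicitly
whisper audio.mp3 --model base --language en --output_format txt
```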
Python script (faster-whisper):
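A minimal sketch using the faster-whisper API (`meeting.mp3` is a placeholder filename; the base model downloads on first run):

```python
from faster_whisper import WhisperModel

# "base" downloads ~142MB on first use; set device="cuda" if you have a GPU
model = WhisperModel("base", device="cpu", compute_type="int8")

# transcribe() returns a generator of segments plus metadata about the file
segments, info = model.transcribe("meeting.mp3")
print(f"Detected language: {info.language}")

for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```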
You'll see timestamped text output — each segment with its start/end time and the transcribed words.
3 Choose The Right Model
Whisper has 5 model sizes. Bigger = more accurate, but slower and needs more RAM.
| Model | Size | RAM | Speed | Best For |
|---|---|---|---|---|
| tiny | 75MB | ~1GB | ~32x realtime | Quick & dirty, real-time on weak hardware |
| base | 142MB | ~1GB | ~16x realtime | Good balance — our default recommendation |
| small | 466MB | ~2GB | ~6x realtime | Noticeably better accuracy |
| medium | 1.5GB | ~5GB | ~2x realtime | High accuracy, handles accents well |
| large-v3 | 3GB | ~10GB | ~1x realtime | Best accuracy, needs GPU for practical use |
4 Real-Time Mic Transcription
This is where it gets fun — talk into your mic and see text appear instantly. We'll use sounddevice to capture audio.
Here's a minimal real-time transcription script:
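This is a sketch assuming faster-whisper, sounddevice, and numpy are installed. It buffers audio from the default microphone in 5-second chunks and transcribes each chunk as it completes:

```python
import queue

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio
CHUNK_SEC = 5        # transcribe every 5 seconds of audio

model = WhisperModel("base", device="cpu", compute_type="int8")
audio_q = queue.Queue()

def callback(indata, frames, time, status):
    # Runs on the audio thread: just buffer the samples, do no heavy work here
    audio_q.put(indata.copy())

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    dtype="float32", callback=callback):
    print("Listening... press Ctrl+C to stop.")
    try:
        while True:
            # Collect CHUNK_SEC worth of audio blocks from the queue
            frames, collected = [], 0
            while collected < SAMPLE_RATE * CHUNK_SEC:
                block = audio_q.get()
                frames.append(block)
                collected += len(block)
            chunk = np.concatenate(frames).flatten()

            # faster-whisper accepts a float32 numpy array directly
            segments, _ = model.transcribe(chunk, language="en")
            for seg in segments:
                print(seg.text, end="", flush=True)
            print()
    except KeyboardInterrupt:
        print("\nStopped.")
```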
Run it and start talking. Every 5 seconds, it transcribes the chunk and prints the text. Press Ctrl+C to stop.
For real-time use, stick with the tiny or base model. The small model gives better accuracy but may lag behind on CPU. Adjust `CHUNK_SEC` to balance latency vs. accuracy.
5 Pair Whisper with Ollama
The real power comes when you pipe your voice into a local LLM. Talk → transcribe → AI responds.
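Here's a sketch of that pipeline, assuming an Ollama server is running on its default port (11434) with a model already pulled (llama3.2 is used as an example) and `voice_note.wav` as a placeholder recording:

```python
import json
import urllib.request

from faster_whisper import WhisperModel

# 1. Transcribe the recording into a single prompt string
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, _ = model.transcribe("voice_note.wav")
prompt = " ".join(seg.text.strip() for seg in segments)
print("You said:", prompt)

# 2. Send the transcript to the local Ollama API and print the reply
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3.2",
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.loads(resp.read())["response"]
print("AI:", answer)
```

Swap the transcription step for the real-time mic script above and you have a push-to-talk local assistant.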
✅ What You've Set Up
- Whisper installed and running locally — no cloud, no API keys
- Ability to transcribe any audio/video file to text
- Real-time microphone transcription with a Python script
- A voice → AI pipeline pairing Whisper with Ollama
Next Steps
- Get a better mic — a USB condenser mic dramatically improves transcription accuracy. Check our hardware store.
- Try different languages — Whisper supports 99 languages. Add `--language ja` for Japanese, `--language es` for Spanish, etc.
- Add voice activity detection — use `silero-vad` to only transcribe when someone is speaking (saves CPU).
- Build a voice assistant — combine Whisper + Ollama + a TTS engine for a fully local voice AI.