4.5 KiB
4.5 KiB
Project History: Python Whisper Live Transcription
This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.
Phase 1: Basic Live Transcription
- Goal: Create a simple script to transcribe live audio.
- Initial Stack:
openai-whisper,sounddevice,numpy. - Approach:
- Captured audio in 2-second chunks using
sounddevice. - Used the
tiny.enmodel for initial testing.
- Captured audio in 2-second chunks using
- Outcome: Successful basic transcription, but limited by continuous processing (even during silence).
Phase 2: Voice Activity Detection (VAD)
- Goal: Improve efficiency by only transcribing when someone is speaking.
- Stack Addition:
silero-vad. - Approach:
- Integrated Silero VAD to monitor the audio stream.
- Transcription is only triggered after a speech segment is followed by a period of silence (500ms).
- Outcome: Significantly reduced CPU usage and cleaner output.
Phase 3: Apple Silicon Optimization (M-Series/M2)
- Goal: Leverage the M2's Neural Engine and GPU for better performance.
- Stack Transition:
mlx-whisper(via Apple's MLX framework). - Decision: Switched from
openai-whispertomlx-whisperand upgraded the model tosmall.enfor better accuracy without sacrificing speed. - Outcome: Faster inference and better battery efficiency.
Phase 4: Local Translation
- Approach A (LLM): Tried using
SmolLM2-135Mviamlx-lmfor translation.- Issue: The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
- Approach B (Dedicated MT): Switched to
MarianMT(Helsinki-NLP/opus-mt-en-es).- Decision: Chose a dedicated Translation Model for cleaner, direct mapping from English to Spanish.
- Technical Hurdle (LZMA Error):
- The local Python environment lacked
_lzmasupport, causingtransformersandhuggingface_hubto crash. - Solution: Implemented a comprehensive
lzmamock in the script to provide necessary constants (FORMAT_XZ, etc.) and bypass the system-level limitation.
- The local Python environment lacked
Current Status
The project now features a high-performance, Apple Silicon-optimized pipeline that:
- Detects speech using Silero VAD.
- Transcribes using MLX-Whisper (small.en).
- Translates using MarianMT (EN-ES).
- Operates entirely locally with hardware acceleration.
Phase 5: Simultaneous Multi-Language Translation
- Goal: Provide translations in Spanish, French, and Arabic at the same time.
- Approach:
- Refactored the script to support a dictionary of multiple
MarianMTmodels. - Each transcribed English segment is passed through each loaded translation engine sequentially.
- Refactored the script to support a dictionary of multiple
- Performance on M2: Loading 3-4 specialized models + Whisper is highly efficient, using ~1.5GB of RAM and providing near-instant results.
Phase 6: Memory & Generation Safety
- Issue: Occasionally, long inputs or model glitches caused "runaway" translation generation, which could consume excessive memory.
- Solution:
- Artificially truncated input transcription to a maximum of 250 characters.
- Added
max_new_tokens=150to the translation generation call to ensure the model terminates even if it gets stuck in a loop.
Phase 7: Multilingual Detection & Bridge Translation
- Goal: Support input in any language, detect it, and translate to English + others.
- Approach:
- Switched to
whisper-small-mlx(multilingual). - Hub-and-Spoke Model: If a non-English language is detected, Whisper's
task="translate"is used to create an English "bridge" text, which is then fed into the specialized MarianMT models.
- Switched to
- Outcome: Full support for multilingual input with centralized translation.
Phase 8: Compilation to Binary
- Goal: Distribute the script as a single, standalone executable for macOS terminal.
- Tool:
PyInstaller. - Process:
- Used
--onefileto bundle the entire Python runtime and its heavy dependencies (Torch, MLX, Transformers). - Excluded build artifacts (
build/,dist/,.spec) from the repository.
- Used
- Build Script:
chmod +x build.sh ./build.sh - Troubleshooting: Fixed a runtime
ModuleNotFoundError: No module named 'mlx._reprlib_fix'by explicitly adding--collect-all mlxand--hidden-import=mlx._reprlib_fixto the PyInstaller configuration. - Outcome: A standalone
whisper-transcribebinary in thedist/directory.