2.4 KiB
2.4 KiB
Project History: Python Whisper Live Transcription
This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.
Phase 1: Basic Live Transcription
- Goal: Create a simple script to transcribe live audio.
- Initial Stack:
openai-whisper,sounddevice,numpy. - Approach:
- Captured audio in 2-second chunks using
sounddevice. - Used the
tiny.enmodel for initial testing.
- Captured audio in 2-second chunks using
- Outcome: Successful basic transcription, but limited by continuous processing (even during silence).
Phase 2: Voice Activity Detection (VAD)
- Goal: Improve efficiency by only transcribing when someone is speaking.
- Stack Addition:
silero-vad. - Approach:
- Integrated Silero VAD to monitor the audio stream.
- Transcription is only triggered after a speech segment is followed by a period of silence (500ms).
- Outcome: Significantly reduced CPU usage and cleaner output.
Phase 3: Apple Silicon Optimization (M-Series/M2)
- Goal: Leverage the M2's Neural Engine and GPU for better performance.
- Stack Transition:
mlx-whisper(via Apple's MLX framework). - Decision: Switched from
openai-whispertomlx-whisperand upgraded the model tosmall.enfor better accuracy without sacrificing speed. - Outcome: Faster inference and better battery efficiency.
Phase 4: Local Translation
- Approach A (LLM): Tried using
SmolLM2-135Mviamlx-lmfor translation.- Issue: The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
- Approach B (Dedicated MT): Switched to
MarianMT(Helsinki-NLP/opus-mt-en-es).- Decision: Chose a dedicated Translation Model for cleaner, direct mapping from English to Spanish.
- Technical Hurdle (LZMA Error):
- The local Python environment lacked
_lzmasupport, causingtransformersandhuggingface_hubto crash. - Solution: Implemented a comprehensive
lzmamock in the script to provide necessary constants (FORMAT_XZ, etc.) and bypass the system-level limitation.
- The local Python environment lacked
Current Status
The project now features a high-performance, Apple Silicon-optimized pipeline that:
- Detects speech using Silero VAD.
- Transcribes using MLX-Whisper (small.en).
- Translates using MarianMT (EN-ES).
- Operates entirely locally with hardware acceleration.