# Project History: Python Whisper Live Transcription
This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.
## Phase 1: Basic Live Transcription
- **Goal:** Create a simple script to transcribe live audio.
- **Initial Stack:** `openai-whisper`, `sounddevice`, `numpy`.
- **Approach:**
  - Captured audio in 2-second chunks using `sounddevice`.
  - Used the `tiny.en` model for initial testing.
- **Outcome:** Basic transcription worked, but every chunk was processed continuously, even during silence.
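The chunked capture loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the 16 kHz sample rate and the names `frames_per_chunk` and `transcribe_forever` are assumptions, and the heavy libraries are imported lazily so the helper can be used without audio hardware.

```python
SAMPLE_RATE = 16_000   # assumption: Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 2      # chunk length stated in the approach above


def frames_per_chunk(sample_rate=SAMPLE_RATE, seconds=CHUNK_SECONDS):
    """Number of audio frames to record for one chunk."""
    return int(sample_rate * seconds)


def transcribe_forever():
    """Record fixed-size chunks and transcribe each one (hypothetical helper)."""
    # Imported here so the module loads even without these packages installed.
    import sounddevice as sd
    import whisper

    model = whisper.load_model("tiny.en")
    frames = frames_per_chunk()
    while True:
        # Blocking capture of one mono float32 chunk from the default input.
        audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        # Whisper expects a 1-D float32 array; fp16=False avoids a CPU warning.
        result = model.transcribe(audio.flatten(), fp16=False)
        text = result["text"].strip()
        if text:
            print(text)
```

Because recording and transcription alternate in a single thread, audio spoken while a chunk is being transcribed is not captured, and silent chunks are still sent to the model.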
## Phase 2: Voice Activity Detection (VAD)
- **Goal:** Improve efficiency by only transcribing when someone is speaking.
- **Stack Addition:** `silero-vad`.
- **Approach:**
  - Integrated Silero VAD to monitor the audio stream.
  - Transcription is only triggered after a speech segment is followed by a period of silence (500ms).
- **Outcome:** Significantly reduced CPU usage and cleaner output.
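The "speech followed by 500ms of silence" trigger can be sketched as a small state machine. This is an illustrative sketch under assumptions: the class name `SpeechSegmenter`, the 0.5 probability threshold, and the 32 ms frame length are hypothetical, and the per-frame speech probability is assumed to come from the Silero VAD model.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechSegmenter:
    """Collects frames while speech is active and emits a segment after silence."""
    threshold: float = 0.5   # assumption: probability at or above this is speech
    frame_ms: int = 32       # assumption: 512 samples at 16 kHz per frame
    silence_ms: int = 500    # silence required to close a segment
    _frames: list = field(default_factory=list, init=False)
    _silent_ms: int = field(default=0, init=False)
    _in_speech: bool = field(default=False, init=False)

    def push(self, frame, speech_prob):
        """Feed one frame; return the finished segment (list of frames) or None."""
        if speech_prob >= self.threshold:
            self._in_speech = True
            self._silent_ms = 0
            self._frames.append(frame)
            return None
        if not self._in_speech:
            return None  # silence before any speech is dropped
        # Trailing silence is kept so the segment does not end abruptly.
        self._frames.append(frame)
        self._silent_ms += self.frame_ms
        if self._silent_ms < self.silence_ms:
            return None
        segment = self._frames
        self._frames = []
        self._silent_ms = 0
        self._in_speech = False
        return segment
```

Each returned segment would then be concatenated and passed to the transcriber, so the model only runs once per utterance.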
## Phase 3: Apple Silicon Optimization (M-Series/M2)
- **Goal:** Leverage the M2's GPU and unified memory for better performance (MLX runs on the CPU and GPU, not the Neural Engine).
- **Stack Transition:** `mlx-whisper` (via Apple's MLX framework).
- **Decision:** Switched from `openai-whisper` to `mlx-whisper` and upgraded the model from `tiny.en` to `small.en` for better accuracy without sacrificing speed.
- **Outcome:** Faster inference and better battery efficiency.
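A minimal sketch of what the switch to `mlx-whisper` might look like in code. The repository name `mlx-community/whisper-small.en-mlx` and the helper `transcribe_segment` are assumptions for illustration; the `backend` parameter exists only so the function can be exercised without MLX installed.

```python
# Assumption: a converted small.en checkpoint hosted under this Hugging Face name.
MODEL_REPO = "mlx-community/whisper-small.en-mlx"


def transcribe_segment(audio, repo=MODEL_REPO, backend=None):
    """Transcribe one audio segment and return the stripped text.

    `audio` may be a file path or a 16 kHz mono float32 array.
    `backend` defaults to `mlx_whisper.transcribe` (hypothetical injection point).
    """
    if backend is None:
        # Imported lazily: mlx-whisper is only available on Apple Silicon.
        import mlx_whisper
        backend = mlx_whisper.transcribe
    result = backend(audio, path_or_hf_repo=repo)
    return result["text"].strip()
```

The weights are downloaded on first use and cached, so later calls with the same `repo` avoid the download.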
## Phase 4: Local Translation
- **Approach A (LLM):** Tried using `SmolLM2-135M` via `mlx-lm` for translation.
  - **Issue:** The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
- **Approach B (Dedicated MT):** Switched to `MarianMT` (`Helsinki-NLP/opus-mt-en-es`).
  - **Decision:** Chose a dedicated translation model for a cleaner, direct mapping from English to Spanish.
- **Technical Hurdle (LZMA Error):**
  - The local Python environment lacked `_lzma` support, causing `transformers` and `huggingface_hub` to crash.
  - **Solution:** Implemented a comprehensive `lzma` mock in the script to provide the necessary constants (`FORMAT_XZ`, etc.) and bypass the system-level limitation.
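One way such an `lzma` mock could be written is sketched below. The function names `build_lzma_stub` and `install_lzma_stub` are hypothetical, and the exact set of attributes the project's mock provides is not documented here; the constant values mirror those of the standard `lzma` module.

```python
import sys
import types


def build_lzma_stub():
    """Create a stand-in `lzma` module exposing constants but no compression."""
    stub = types.ModuleType("lzma")
    # Same values as the real module, so code that only reads them keeps working.
    stub.FORMAT_AUTO, stub.FORMAT_XZ, stub.FORMAT_ALONE, stub.FORMAT_RAW = 0, 1, 2, 3
    stub.CHECK_NONE, stub.CHECK_CRC32, stub.CHECK_CRC64, stub.CHECK_SHA256 = 0, 1, 4, 10

    class LZMAError(Exception):
        """Raised when anything tries to actually use the stub."""

    def _unsupported(*args, **kwargs):
        raise LZMAError("lzma is not available in this Python build")

    class LZMAFile:
        def __init__(self, *args, **kwargs):
            _unsupported()

    stub.LZMAError = LZMAError
    stub.LZMAFile = LZMAFile
    stub.open = stub.compress = stub.decompress = _unsupported
    return stub


def install_lzma_stub():
    """Register the stub only if the real module cannot be imported.

    Returns True when the stub was installed, False when real lzma exists.
    """
    try:
        import lzma  # noqa: F401
        return False
    except ImportError:
        sys.modules["lzma"] = build_lzma_stub()
        return True
```

For this to help, `install_lzma_stub()` would have to run before `transformers` or `huggingface_hub` is imported. The stub only works as long as nothing actually needs to read or write `.xz` data.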
## Current Status
The project now features a high-performance, Apple Silicon-optimized pipeline that:

1. Detects speech using **Silero VAD**.
2. Transcribes using **MLX-Whisper (small.en)**.
3. Translates using **MarianMT (EN-ES)**.
4. Operates entirely locally with hardware acceleration.
## Phase 5: Simultaneous Multi-Language Translation
- **Goal:** Provide translations in Spanish, French, and Arabic at the same time.
- **Approach:**
  - Refactored the script to support a dictionary of multiple `MarianMT` models.
  - Each transcribed English segment is passed through each loaded translation engine sequentially.
- **Performance on M2:** Loading 3-4 specialized models alongside Whisper uses ~1.5 GB of RAM and returns translations almost instantly.
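The dictionary-of-models approach might look like the sketch below. The French and Arabic checkpoint names are assumptions that follow the naming of the Spanish model, and `load_translators` and `translate_all` are hypothetical helper names. The loader requires `transformers`, `torch`, and `sentencepiece`.

```python
# Language code -> MarianMT checkpoint. Only the "es" entry is named in this
# document; the "fr" and "ar" names are assumed to follow the same pattern.
MODEL_NAMES = {
    "es": "Helsinki-NLP/opus-mt-en-es",
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "ar": "Helsinki-NLP/opus-mt-en-ar",
}


def load_translators(model_names=MODEL_NAMES):
    """Load each model once and return {language: translate_function}."""
    from transformers import MarianMTModel, MarianTokenizer

    def make_translate(tokenizer, model):
        def translate(text):
            batch = tokenizer([text], return_tensors="pt", padding=True)
            output_ids = model.generate(**batch)
            return tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return translate

    translators = {}
    for lang, name in model_names.items():
        tokenizer = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        translators[lang] = make_translate(tokenizer, model)
    return translators


def translate_all(text, translators):
    """Run one English segment through every translator, one after another."""
    return {lang: translate(text) for lang, translate in translators.items()}
```

Keeping the translators as plain callables means `translate_all` does not depend on any model library, so adding or removing a target language only changes the dictionary.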
## Phase 6: Memory & Generation Safety
- **Issue:** Occasionally, long inputs or model glitches caused "runaway" translation generation, which could consume excessive memory.
- **Solution:**
  - Truncated the input transcription to a maximum of 250 characters.
  - Added `max_new_tokens=150` to the translation generation call so the model terminates even if it gets stuck in a loop.
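Both safeguards can be combined in one small wrapper, sketched below. The helper names `clamp_input` and `safe_translate` are hypothetical. The `tokenizer` and `model` arguments are assumed to be a MarianMT tokenizer and model from `transformers`.

```python
MAX_INPUT_CHARS = 250   # input cap described above
MAX_NEW_TOKENS = 150    # generation cap described above


def clamp_input(text, limit=MAX_INPUT_CHARS):
    """Strip whitespace and cut the text to at most `limit` characters."""
    return text.strip()[:limit]


def safe_translate(text, tokenizer, model):
    """Translate `text` with bounded input length and bounded output length."""
    batch = tokenizer([clamp_input(text)], return_tensors="pt", padding=True)
    # max_new_tokens stops generation even if the model keeps repeating itself.
    output_ids = model.generate(**batch, max_new_tokens=MAX_NEW_TOKENS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A plain character slice can cut a word in half. The cap only bounds memory use and is not meant to preserve sentence boundaries.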
## Phase 7: Multilingual Detection & Bridge Translation
- **Goal:** Support input in any language Whisper recognizes, detect it, and translate it into English and the other target languages.
- **Approach:**
  - Switched to `whisper-small-mlx` (multilingual).
  - **Hub-and-Spoke Model:** If a non-English language is detected, Whisper's `task="translate"` is used to create an English "bridge" text, which is then fed into the specialized MarianMT models.
- **Outcome:** Full support for multilingual input with centralized translation.
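The hub-and-spoke routing can be sketched as below. This is one possible arrangement, not necessarily the project's: it assumes a first pass that transcribes and reports the detected language, followed by a second pass with `task="translate"` only when the language is not English. The names `bridge_to_english` and `make_mlx_transcriber` are hypothetical, as is the `mlx-community/` prefix on the model name.

```python
def bridge_to_english(audio, transcribe):
    """Return (detected_language, original_text, english_bridge_text).

    `transcribe(audio, **options)` must return a dict with "text" and
    "language" keys, as Whisper-style transcribe functions do.
    """
    first = transcribe(audio)
    language = first.get("language", "en")
    original = first["text"].strip()
    if language == "en":
        # English input is already the hub text; no second pass is needed.
        return language, original, original
    # Second pass: let Whisper translate the speech into English.
    bridged = transcribe(audio, task="translate", language=language)
    return language, original, bridged["text"].strip()


def make_mlx_transcriber(repo="mlx-community/whisper-small-mlx"):
    """Bind the multilingual model to a transcribe function (assumed repo name)."""
    import mlx_whisper

    def transcribe(audio, **options):
        return mlx_whisper.transcribe(audio, path_or_hf_repo=repo, **options)

    return transcribe
```

The English bridge text is what the MarianMT models receive, so each target language still needs only one English-to-X model regardless of the input language.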