# Project History: Python Whisper Live Transcription

This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.

## Phase 1: Basic Live Transcription

- **Goal:** Create a simple script to transcribe live audio.
- **Initial Stack:** `openai-whisper`, `sounddevice`, `numpy`.
- **Approach:**
  - Captured audio in 2-second chunks using `sounddevice`.
  - Used the `tiny.en` model for initial testing.
- **Outcome:** Successful basic transcription, but limited by continuous processing (even during silence).

## Phase 2: Voice Activity Detection (VAD)

- **Goal:** Improve efficiency by only transcribing when someone is speaking.
- **Stack Addition:** `silero-vad`.
- **Approach:**
  - Integrated Silero VAD to monitor the audio stream.
  - Transcription is only triggered after a speech segment is followed by a period of silence (500ms).
- **Outcome:** Significantly reduced CPU usage and cleaner output.

## Phase 3: Apple Silicon Optimization (M-Series/M2)

- **Goal:** Leverage the M2's Neural Engine and GPU for better performance.
- **Stack Transition:** `mlx-whisper` (via Apple's MLX framework).
- **Decision:** Switched from `openai-whisper` to `mlx-whisper` and upgraded the model to `small.en` for better accuracy without sacrificing speed.
- **Outcome:** Faster inference and better battery efficiency.

## Phase 4: Local Translation

- **Approach A (LLM):** Tried using `SmolLM2-135M` via `mlx-lm` for translation.
- **Issue:** The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
- **Approach B (Dedicated MT):** Switched to `MarianMT` (`Helsinki-NLP/opus-mt-en-es`).
- **Decision:** Chose a dedicated Translation Model for cleaner, direct mapping from English to Spanish.
- **Technical Hurdle (LZMA Error):**
  - The local Python environment lacked `_lzma` support, causing `transformers` and `huggingface_hub` to crash.
  - **Solution:** Implemented a comprehensive `lzma` mock in the script to provide necessary constants (`FORMAT_XZ`, etc.) and bypass the system-level limitation.

## Current Status

The project now features a high-performance, Apple Silicon-optimized pipeline that:

1. Detects speech using **Silero VAD**.
2. Transcribes using **MLX-Whisper (small.en)**.
3. Translates using **MarianMT (EN-ES)**.
4. Operates entirely locally with hardware acceleration.

## Phase 5: Simultaneous Multi-Language Translation

- **Goal:** Provide translations in Spanish, French, and Arabic at the same time.
- **Approach:**
  - Refactored the script to support a dictionary of multiple `MarianMT` models.
  - Each transcribed English segment is passed through each loaded translation engine sequentially.
- **Performance on M2:** Loading 3-4 specialized models + Whisper is highly efficient, using ~1.5GB of RAM and providing near-instant results.

## Phase 6: Memory & Generation Safety

- **Issue:** Occasionally, long inputs or model glitches caused "runaway" translation generation, which could consume excessive memory.
- **Solution:**
  - Artificially truncated input transcription to a maximum of 250 characters.
  - Added `max_new_tokens=150` to the translation generation call to ensure the model terminates even if it gets stuck in a loop.

## Phase 7: Multilingual Detection & Bridge Translation

- **Goal:** Support input in any language, detect it, and translate to English + others.
- **Approach:**
  - Switched to `whisper-small-mlx` (multilingual).
  - **Hub-and-Spoke Model:** If a non-English language is detected, Whisper's `task="translate"` is used to create an English "bridge" text, which is then fed into the specialized MarianMT models.
- **Outcome:** Full support for multilingual input with centralized translation.

## Phase 8: Compilation to Binary

- **Goal:** Distribute the script as a single, standalone executable for macOS terminal.
- **Tool:** `PyInstaller`.
- **Process:**
  - Used `--onefile` to bundle the entire Python runtime and its heavy dependencies (Torch, MLX, Transformers).
  - Excluded build artifacts (`build/`, `dist/`, `.spec`) from the repository.
- **Build Script:**
  ```bash
  chmod +x build.sh
  ./build.sh
  ```
- **Troubleshooting:** Fixed a runtime `ModuleNotFoundError: No module named 'mlx._reprlib_fix'` by explicitly adding `--collect-all mlx` and `--hidden-import=mlx._reprlib_fix` to the PyInstaller configuration. Also added `multiprocessing.freeze_support()` to fix infinite loops in the compiled binary.

## Phase 9: Real-Time Server Ingest

- **Goal:** Send live captions and translations to a central ingest server.
- **Backend:** `https://emiapi.reynafamily.com/live-captions/ingest`.
- **Approach:**
  - Implemented a background `ingest_worker` thread to handle HTTP POST requests without stalling the audio processing.
  - **Flat JSON Schema:** Used a key-value format as requested (e.g., `original`, `es`, `en`, `fr`).
  - **Reliability:** Integrated exponential backoff retries (1s to 15s) to handle network or server failures.
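
The Phase 9 design (a background thread draining a queue, flat JSON payloads, and retries that back off from 1s to 15s) can be sketched roughly as follows. This is an illustration, not the project's actual code: only the `ingest_worker` name, the endpoint URL, and the payload keys come from the notes above. The helpers `backoff_delays`, `post_json`, and `start_ingest_thread`, the doubling factor, the 5-second timeout, and the six-attempt cap are all assumptions made for the example.

```python
import json
import queue
import threading
import time
import urllib.request

INGEST_URL = "https://emiapi.reynafamily.com/live-captions/ingest"


def backoff_delays(initial=1.0, maximum=15.0, factor=2.0):
    """Yield retry delays that grow by `factor` and are capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def post_json(payload, url=INGEST_URL, timeout=5.0):
    """POST a flat JSON object; raises OSError subclasses on network or HTTP errors."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def ingest_worker(jobs, send=post_json, sleep=time.sleep, max_attempts=6):
    """Drain `jobs` until a None sentinel arrives, retrying failed sends."""
    while True:
        payload = jobs.get()
        if payload is None:  # sentinel: shut the worker down
            break
        delays = backoff_delays()
        for attempt in range(1, max_attempts + 1):
            try:
                send(payload)
                break
            except OSError as exc:  # URLError, HTTPError, and timeouts all derive from OSError
                if attempt == max_attempts:
                    print(f"dropping caption after {attempt} attempts: {exc}")
                else:
                    sleep(next(delays))


def start_ingest_thread(jobs, **worker_kwargs):
    """Run ingest_worker on a daemon thread so audio processing never blocks on HTTP."""
    thread = threading.Thread(
        target=ingest_worker, args=(jobs,), kwargs=worker_kwargs, daemon=True
    )
    thread.start()
    return thread
```

In this sketch the audio loop would only call `jobs.put({"original": text, "en": ..., "es": ..., "fr": ...})` and return immediately, and `jobs.put(None)` stops the worker on shutdown. Passing `send` and `sleep` as parameters is a choice made here so the retry logic can be exercised without touching the network.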