Project History: Python Whisper Live Transcription

This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.

Phase 1: Basic Live Transcription

Goal: Create a simple script to transcribe live audio.
Initial Stack: openai-whisper, sounddevice, numpy.
Approach:
- Captured audio in 2-second chunks using sounddevice.
- Used the tiny.en model for initial testing.
Outcome: Basic transcription worked, but the script ran inference on every chunk, even during silence.
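A minimal sketch of this first capture loop, assuming a 16 kHz mono stream (the rate Whisper models expect) and the default input device. The names chunk_frames, record_chunk, and run are illustrative and are not taken from the original script.

```python
SAMPLE_RATE = 16_000  # assumption: Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 2     # chunk length described above


def chunk_frames(sample_rate=SAMPLE_RATE, seconds=CHUNK_SECONDS):
    """Number of samples in one fixed-length chunk."""
    return int(sample_rate * seconds)


def record_chunk():
    """Block until one chunk has been recorded from the default input device."""
    import sounddevice as sd  # imported lazily so chunk_frames() has no dependencies

    audio = sd.rec(chunk_frames(), samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # wait for the recording to finish
    return audio.flatten()  # 1-D float32 array, the shape Whisper accepts


def run():
    """Transcribe chunk after chunk, including silent ones."""
    import whisper

    model = whisper.load_model("tiny.en")
    while True:
        result = model.transcribe(record_chunk(), fp16=False)
        print(result["text"].strip())
```

Because run() transcribes every chunk unconditionally, it reproduces the limitation noted in the outcome: the model is invoked even when nobody is speaking.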

Goal: Improve efficiency by only transcribing when someone is speaking.
Stack Addition: silero-vad.
Approach:
- Integrated Silero VAD to monitor the audio stream.
- Transcription is triggered only after a speech segment is followed by 500 ms of silence.
Outcome: Significantly reduced CPU usage and cleaner output.
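The gating logic can be sketched as a small state machine. This is an illustration under stated assumptions, not the original code: it assumes 32 ms frames, and it assumes the is_speech flag comes from thresholding the Silero VAD speech probability for each frame. SilenceGate is a hypothetical name.

```python
class SilenceGate:
    """Collects speech frames and releases them once enough silence follows."""

    def __init__(self, frame_ms=32, silence_ms=500):
        # Consecutive silent frames needed to cover at least silence_ms
        self.frames_needed = -(-silence_ms // frame_ms)  # ceiling division
        self.buffer = []
        self.silent_run = 0
        self.in_speech = False

    def push(self, frame, is_speech):
        """Feed one frame; return the finished segment (list of frames) or None."""
        if is_speech:
            self.in_speech = True
            self.silent_run = 0
            self.buffer.append(frame)
            return None
        if not self.in_speech:
            return None  # silence before any speech: nothing to buffer
        self.silent_run += 1
        self.buffer.append(frame)  # keep short pauses inside the segment
        if self.silent_run < self.frames_needed:
            return None
        # Enough silence: drop the trailing silent frames and release the segment
        segment = self.buffer[: len(self.buffer) - self.silent_run]
        self.buffer = []
        self.silent_run = 0
        self.in_speech = False
        return segment
```

Only a non-None return value from push() would be handed to the transcriber, so silent stretches never reach the model.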

Goal: Leverage the M2's GPU and unified memory for better performance. (MLX runs on the CPU and GPU; it does not use the Neural Engine.)
Stack Transition: mlx-whisper (via Apple's MLX framework).
Decision: Switched from openai-whisper to mlx-whisper and upgraded the model to small.en for better accuracy without sacrificing speed.
Outcome: Faster inference and better battery efficiency.
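A sketch of what the switched call might look like. The repository name below is an assumption (a community conversion on Hugging Face), and transcribe_segment is a hypothetical helper; the original script may load the model differently.

```python
MODEL_REPO = "mlx-community/whisper-small.en-mlx"  # assumed Hugging Face repo name


def transcribe_segment(audio, repo=MODEL_REPO):
    """Transcribe one audio segment (file path or 16 kHz float32 array) with MLX."""
    import mlx_whisper  # imported lazily; only available on Apple Silicon setups

    result = mlx_whisper.transcribe(audio, path_or_hf_repo=repo)
    return result["text"].strip()
```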

Approach A (LLM): Tried using SmolLM2-135M via mlx-lm for translation.
- Issue: The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
Approach B (Dedicated MT): Switched to MarianMT (Helsinki-NLP/opus-mt-en-es).
- Decision: Chose a dedicated Translation Model for cleaner, direct mapping from English to Spanish.
Technical Hurdle (LZMA Error):
- The local Python environment lacked _lzma support, causing transformers and huggingface_hub to crash.
- Solution: Implemented a comprehensive lzma mock in the script to provide necessary constants (FORMAT_XZ, etc.) and bypass the system-level limitation.
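A minimal sketch of such a mock, assuming the importing libraries only need the module to exist and expose a few names at import time. The helper names build_lzma_stub and ensure_lzma are hypothetical, and the real mock in the script may cover more of the lzma API. The FORMAT_* values mirror CPython's lzma constants.

```python
import importlib
import sys
import types


def build_lzma_stub():
    """Create a stand-in lzma module that imports cleanly but cannot (de)compress."""
    stub = types.ModuleType("lzma")
    stub.FORMAT_AUTO, stub.FORMAT_XZ, stub.FORMAT_ALONE, stub.FORMAT_RAW = 0, 1, 2, 3

    class LZMAError(Exception):
        pass

    def _unavailable(*args, **kwargs):
        raise LZMAError("lzma is stubbed out; this interpreter has no _lzma support")

    class LZMAFile:
        def __init__(self, *args, **kwargs):
            _unavailable()

    stub.LZMAError = LZMAError
    stub.LZMAFile = LZMAFile
    stub.open = stub.compress = stub.decompress = _unavailable
    return stub


def ensure_lzma():
    """Return the real lzma module if it imports, otherwise install the stub."""
    try:
        return importlib.import_module("lzma")
    except ImportError:
        stub = build_lzma_stub()
        sys.modules["lzma"] = stub
        return stub


ensure_lzma()  # must run before importing transformers or huggingface_hub
```

The stub only works as long as nothing actually tries to read or write .xz data; any such call raises LZMAError instead of failing silently.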

The project now features a high-performance, Apple Silicon-optimized pipeline that:

Goal: Provide translations in Spanish, French, and Arabic at the same time.
Approach:
- Refactored the script to support a dictionary of multiple MarianMT models.
- Each transcribed English segment is passed through each loaded translation engine sequentially.
Performance on M2: Loading three or four MarianMT models alongside Whisper uses about 1.5 GB of RAM, and translations appear almost immediately.
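The dictionary-of-models approach can be sketched as follows. The Spanish model name comes from the notes above; the French and Arabic names are assumed to follow the same Helsinki-NLP naming pattern. All function names are illustrative.

```python
MODEL_NAMES = {
    "es": "Helsinki-NLP/opus-mt-en-es",
    "fr": "Helsinki-NLP/opus-mt-en-fr",  # assumed model name
    "ar": "Helsinki-NLP/opus-mt-en-ar",  # assumed model name
}


def make_translator(tokenizer, model):
    """Wrap a tokenizer/model pair as a plain text -> text function."""
    def translate(text):
        batch = tokenizer([text], return_tensors="pt", padding=True)
        output_ids = model.generate(**batch)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return translate


def load_translators(model_names=MODEL_NAMES):
    """Load one MarianMT engine per target language."""
    from transformers import MarianMTModel, MarianTokenizer  # lazy, heavy import

    translators = {}
    for lang, name in model_names.items():
        tokenizer = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        translators[lang] = make_translator(tokenizer, model)
    return translators


def translate_all(text, translators):
    """Pass one English segment through every loaded engine, one after another."""
    return {lang: translate(text) for lang, translate in translators.items()}
```

Keeping each engine behind a plain function means adding a language is a one-line change to MODEL_NAMES.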

Issue: Occasionally, long inputs or model glitches caused "runaway" translation generation, which could consume excessive memory.
Solution:
- Truncated each input transcription to at most 250 characters.
- Added max_new_tokens=150 to the translation generation call to ensure the model terminates even if it gets stuck in a loop.
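Both safeguards can be sketched together. The limits are the ones stated above; clip_input and safe_translate are hypothetical names, and the tokenizer and model are assumed to be a MarianMT pair from transformers.

```python
MAX_INPUT_CHARS = 250  # input cap from the notes above
MAX_NEW_TOKENS = 150   # generation cap from the notes above


def clip_input(text, limit=MAX_INPUT_CHARS):
    """Trim whitespace and hard-cap the transcription length in characters."""
    return text.strip()[:limit]


def safe_translate(text, tokenizer, model):
    """Translate with a bounded input and a bounded number of generated tokens."""
    batch = tokenizer([clip_input(text)], return_tensors="pt", truncation=True)
    output_ids = model.generate(**batch, max_new_tokens=MAX_NEW_TOKENS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The character cap bounds the work going in, while max_new_tokens bounds the work coming out, so a looping decoder stops after at most 150 tokens.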

Goal: Support input in any language, detect it, and translate to English + others.
Approach:
- Switched to whisper-small-mlx (multilingual).
- Hub-and-Spoke Model: If a non-English language is detected, Whisper's task="translate" is used to create an English "bridge" text, which is then fed into the specialized MarianMT models.
Outcome: Full support for multilingual input with centralized translation.
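A sketch of the hub-and-spoke routing, written against an injected transcribe function so the logic stays independent of the backend. It assumes the result dictionary carries "text" and "language" keys and that task="translate" is accepted as a decoding option, as in openai-whisper. The two-pass structure and the names english_bridge and mlx_transcribe are illustrative; the original script may detect the language differently.

```python
MULTILINGUAL_REPO = "mlx-community/whisper-small-mlx"  # assumed Hugging Face repo name


def english_bridge(audio, transcribe):
    """Return (detected_language, original_text, english_text) for one segment."""
    first = transcribe(audio)
    original = first["text"].strip()
    language = first.get("language", "en")
    if language == "en":
        return language, original, original  # already English: no bridge needed
    # Non-English input: ask Whisper itself for the English "bridge" text
    bridged = transcribe(audio, task="translate")
    return language, original, bridged["text"].strip()


def mlx_transcribe(audio, **options):
    """Backend adapter; imported lazily so english_bridge can be tested without MLX."""
    import mlx_whisper

    return mlx_whisper.transcribe(audio, path_or_hf_repo=MULTILINGUAL_REPO, **options)
```

The English text returned here is what would be fed to the MarianMT engines, so every target language is reached through English regardless of the input language.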

Goal: Distribute the script as a single, standalone executable for the macOS terminal.
Tool: PyInstaller.
Process:
- Used --onefile to bundle the entire Python runtime and its heavy dependencies (Torch, MLX, Transformers).
- Excluded build artifacts (build/, dist/, .spec) from the repository.
Build Script:
```
chmod +x build.sh
./build.sh
```
Troubleshooting: Fixed a runtime ModuleNotFoundError: No module named 'mlx._reprlib_fix' by explicitly adding --collect-all mlx and --hidden-import=mlx._reprlib_fix to the PyInstaller configuration.
Outcome: A standalone whisper-transcribe binary in the dist/ directory.
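The contents of build.sh are not shown here. As a hedged equivalent, the same flags can be passed to PyInstaller from Python. The entry-point name transcribe.py is hypothetical, and the helper names are illustrative.

```python
ENTRY_SCRIPT = "transcribe.py"  # hypothetical entry-point name


def pyinstaller_args(entry_script=ENTRY_SCRIPT):
    """Command-line arguments matching the flags described above."""
    return [
        "--onefile",
        "--name", "whisper-transcribe",
        "--collect-all", "mlx",              # bundle every mlx submodule and data file
        "--hidden-import=mlx._reprlib_fix",  # module PyInstaller's analysis missed
        entry_script,
    ]


def build():
    """Run PyInstaller in-process with the arguments above."""
    import PyInstaller.__main__  # lazy import: only needed when actually building

    PyInstaller.__main__.run(pyinstaller_args())
```

Calling build() from the project root would be expected to write the binary to dist/whisper-transcribe.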