Add project history and decision log

2026-02-26 21:17:22 -05:00
parent 7a47781141
commit c4a39ef29e
1 changed files with 41 additions and 0 deletions
@@ -0,0 +1,41 @@
+# Project History: Python Whisper Live Transcription
+
+This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.
+
+## Phase 1: Basic Live Transcription
+- **Goal:** Create a simple script to transcribe live audio.
+- **Initial Stack:** `openai-whisper`, `sounddevice`, `numpy`.
+- **Approach:** 
+    - Captured audio in 2-second chunks using `sounddevice`.
+    - Used the `tiny.en` model for initial testing.
+- **Outcome:** Basic transcription worked, but every chunk was processed, including silence, which wasted compute.
+
+## Phase 2: Voice Activity Detection (VAD)
+- **Goal:** Improve efficiency by only transcribing when someone is speaking.
+- **Stack Addition:** `silero-vad`.
+- **Approach:** 
+    - Integrated Silero VAD to monitor the audio stream.
+    - Transcription is triggered only after a speech segment is followed by 500 ms of silence.
+- **Outcome:** Significantly reduced CPU usage and cleaner output.
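The silence-triggered flow in Phase 2 can be sketched as a small state machine. This is a minimal illustration, not the project's code: it assumes the VAD has already produced one boolean per audio frame (the `flags` input is a hypothetical stand-in for Silero VAD output), and the 32 ms frame length is an assumed value. Only the 500 ms silence threshold comes from the log.

```python
from typing import Iterable, Iterator, Tuple

SILENCE_MS = 500  # silence that closes a segment (value from the log)
FRAME_MS = 32     # assumed frame length, e.g. 512 samples at 16 kHz


def speech_segments(flags: Iterable[bool],
                    frame_ms: int = FRAME_MS,
                    silence_ms: int = SILENCE_MS) -> Iterator[Tuple[int, int]]:
    """Yield (start_frame, end_frame_exclusive) for each finished speech segment.

    `flags` holds one boolean per audio frame: True if the VAD judged it speech.
    A segment is emitted once the silence following it reaches `silence_ms`.
    A segment still open when the stream ends is not emitted.
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division: silent frames required
    start = None        # index of the first speech frame in the open segment
    last_speech = None  # index of the most recent speech frame
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= needed:
            # Enough trailing silence: hand this span to the transcriber.
            yield (start, last_speech + 1)
            start = None


# Hypothetical stream: 3 silent frames, 10 speech, 20 silent, 5 speech, 5 silent.
demo_flags = [False] * 3 + [True] * 10 + [False] * 20 + [True] * 5 + [False] * 5
demo_segments = list(speech_segments(demo_flags))
```

In a live loop, each yielded span would be cut from the audio buffer and passed to the transcriber. Short pauses inside a sentence stay within one segment because they never reach the silence threshold.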
+
+## Phase 3: Apple Silicon Optimization (M-Series/M2)
+- **Goal:** Leverage the M2's GPU and unified memory for better performance (MLX targets the CPU and GPU rather than the Neural Engine).
+- **Stack Transition:** `mlx-whisper` (via Apple's MLX framework).
+- **Decision:** Switched from `openai-whisper` to `mlx-whisper` and upgraded the model to `small.en` for better accuracy without sacrificing speed.
+- **Outcome:** Faster inference and better battery efficiency.
+
+## Phase 4: Local Translation
+- **Approach A (LLM):** Tried using `SmolLM2-135M` via `mlx-lm` for translation.
+    - **Issue:** The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
+- **Approach B (Dedicated MT):** Switched to `MarianMT` (`Helsinki-NLP/opus-mt-en-es`).
+    - **Decision:** Chose a dedicated translation model for cleaner, direct mapping from English to Spanish.
+- **Technical Hurdle (LZMA Error):** 
+    - The local Python environment lacked `_lzma` support, causing `transformers` and `huggingface_hub` to crash.
+    - **Solution:** Implemented a comprehensive `lzma` mock in the script to provide necessary constants (`FORMAT_XZ`, etc.) and bypass the system-level limitation.
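One way to build such a mock is to register a stand-in module in `sys.modules` before any library tries to import `lzma`. The sketch below is an assumption about the general technique, not the project's actual mock: the helper names `make_lzma_stub` and `ensure_lzma` are hypothetical, and the stub covers only constants, an error type, and placeholder functions. The constant values match those of the standard library's `lzma` module.

```python
import sys
import types


def make_lzma_stub() -> types.ModuleType:
    """Build a stand-in `lzma` module with constants and placeholder callables."""
    stub = types.ModuleType("lzma")
    # Same values as the standard library's lzma constants.
    stub.FORMAT_AUTO = 0
    stub.FORMAT_XZ = 1
    stub.FORMAT_ALONE = 2
    stub.FORMAT_RAW = 3
    stub.CHECK_NONE = 0
    stub.CHECK_CRC32 = 1
    stub.CHECK_CRC64 = 4
    stub.CHECK_SHA256 = 10

    class LZMAError(Exception):
        """Placeholder for lzma.LZMAError."""

    def _unsupported(*args, **kwargs):
        # Real compression is unavailable; fail loudly if anything calls it.
        raise NotImplementedError("lzma is stubbed out in this environment")

    stub.LZMAError = LZMAError
    stub.open = _unsupported
    stub.compress = _unsupported
    stub.decompress = _unsupported
    return stub


def ensure_lzma() -> bool:
    """Install the stub only if the real module cannot be imported.

    Returns True when the stub was installed, False when real lzma works.
    """
    try:
        import lzma  # noqa: F401
        return False
    except ImportError:
        sys.modules["lzma"] = make_lzma_stub()
        return True


# Must run before importing transformers or huggingface_hub.
stub_installed = ensure_lzma()
```

The stub only lets imports succeed. Anything that actually needs to read or write `.xz` data will still fail, so rebuilding Python with liblzma available remains the more complete fix.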
+
+## Current Status
+The project now features a high-performance, Apple Silicon-optimized pipeline that:
+1. Detects speech using **Silero VAD**.
+2. Transcribes using **MLX-Whisper (small.en)**.
+3. Translates using **MarianMT (EN-ES)**.
+4. Operates entirely locally with hardware acceleration.