From c4a39ef29e51febf7879fcf051f15ece666376c9 Mon Sep 17 00:00:00 2001
From: Adolfo Reyna
Date: Thu, 26 Feb 2026 21:17:22 -0500
Subject: [PATCH] Add project history and decision log

---
 history.md | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 history.md

diff --git a/history.md b/history.md
new file mode 100644
index 0000000..7f4634d
--- /dev/null
+++ b/history.md
@@ -0,0 +1,41 @@
+# Project History: Python Whisper Live Transcription
+
+This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.
+
+## Phase 1: Basic Live Transcription
+- **Goal:** Create a simple script to transcribe live audio.
+- **Initial Stack:** `openai-whisper`, `sounddevice`, `numpy`.
+- **Approach:**
+  - Captured audio in 2-second chunks using `sounddevice`.
+  - Used the `tiny.en` model for initial testing.
+- **Outcome:** Successful basic transcription, but limited by continuous processing (even during silence).
+
+## Phase 2: Voice Activity Detection (VAD)
+- **Goal:** Improve efficiency by only transcribing when someone is speaking.
+- **Stack Addition:** `silero-vad`.
+- **Approach:**
+  - Integrated Silero VAD to monitor the audio stream.
+  - Transcription is only triggered after a speech segment is followed by a period of silence (500ms).
+- **Outcome:** Significantly reduced CPU usage and cleaner output.
+
+## Phase 3: Apple Silicon Optimization (M-Series/M2)
+- **Goal:** Leverage the M2's Neural Engine and GPU for better performance.
+- **Stack Transition:** `mlx-whisper` (via Apple's MLX framework).
+- **Decision:** Switched from `openai-whisper` to `mlx-whisper` and upgraded the model to `small.en` for better accuracy without sacrificing speed.
+- **Outcome:** Faster inference and better battery efficiency.
+
+## Phase 4: Local Translation
+- **Approach A (LLM):** Tried using `SmolLM2-135M` via `mlx-lm` for translation.
+  - **Issue:** The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
+- **Approach B (Dedicated MT):** Switched to `MarianMT` (`Helsinki-NLP/opus-mt-en-es`).
+  - **Decision:** Chose a dedicated Translation Model for cleaner, direct mapping from English to Spanish.
+- **Technical Hurdle (LZMA Error):**
+  - The local Python environment lacked `_lzma` support, causing `transformers` and `huggingface_hub` to crash.
+  - **Solution:** Implemented a comprehensive `lzma` mock in the script to provide necessary constants (`FORMAT_XZ`, etc.) and bypass the system-level limitation.
+
+## Current Status
+The project now features a high-performance, Apple Silicon-optimized pipeline that:
+1. Detects speech using **Silero VAD**.
+2. Transcribes using **MLX-Whisper (small.en)**.
+3. Translates using **MarianMT (EN-ES)**.
+4. Operates entirely locally with hardware acceleration.
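
The Phase 2 trigger rule (transcribe only once speech has been followed by 500 ms of silence) can be sketched roughly as below. This is a minimal illustration, not the project's actual code: it works on a plain sequence of per-frame speech probabilities such as a VAD model might emit, and the names `speech_segments`, `FRAME_MS`, and `SPEECH_THRESHOLD`, along with the 32 ms frame length and the 0.5 threshold, are assumptions made for the example. Only the 500 ms silence window comes from the notes above.

```python
from typing import Iterable, Iterator, Optional, Tuple

FRAME_MS = 32           # assumed frame length (e.g. 512 samples at 16 kHz)
SILENCE_MS = 500        # silence window taken from the Phase 2 notes
SPEECH_THRESHOLD = 0.5  # assumed cut-off on the per-frame speech probability


def speech_segments(
    probs: Iterable[float],
    frame_ms: int = FRAME_MS,
    silence_ms: int = SILENCE_MS,
    threshold: float = SPEECH_THRESHOLD,
) -> Iterator[Tuple[int, int]]:
    """Yield (start_frame, end_frame) for each finished utterance.

    end_frame is exclusive and points just past the last speech frame.
    An utterance is only emitted once enough silent frames have followed it,
    so speech still in progress at the end of the input is not emitted.
    """
    # Number of consecutive silent frames that cover the silence window.
    frames_needed = -(-silence_ms // frame_ms)  # ceiling division
    start: Optional[int] = None
    last_speech: Optional[int] = None

    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i          # first speech frame of a new utterance
            last_speech = i
        elif start is not None and i - last_speech >= frames_needed:
            # Enough silence has passed: hand the utterance to the transcriber.
            yield (start, last_speech + 1)
            start = None
            last_speech = None


if __name__ == "__main__":
    # 3 silent frames, 5 speech frames, 16 silent frames, then more speech.
    demo = [0.1] * 3 + [0.9] * 5 + [0.1] * 16 + [0.9] * 2 + [0.1] * 5
    for seg in speech_segments(demo):
        print("transcribe frames", seg)
```

In a live loop, each yielded range would be sliced out of the audio buffer and passed to the transcription call, so silence never reaches the model.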
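
The `lzma` workaround from Phase 4 could look something like the following. This is a hedged sketch under stated assumptions rather than the script's real mock: the helper names `build_lzma_stub` and `ensure_lzma` are hypothetical, and the set of attributes exposed here is a guess at what importing libraries look up. The constant values mirror those of CPython's standard `lzma` module. The stub only lets imports succeed; any attempt to actually read or write xz data raises an error.

```python
import sys
import types


def build_lzma_stub() -> types.ModuleType:
    """Return a placeholder module exposing the names importers look up."""
    stub = types.ModuleType("lzma")

    # Constant values match CPython's real lzma module.
    stub.FORMAT_AUTO = 0
    stub.FORMAT_XZ = 1
    stub.FORMAT_ALONE = 2
    stub.FORMAT_RAW = 3
    stub.CHECK_NONE = 0
    stub.CHECK_CRC32 = 1
    stub.CHECK_CRC64 = 4
    stub.CHECK_SHA256 = 10

    class LZMAError(Exception):
        """Stand-in for lzma.LZMAError."""

    def _unavailable(*args, **kwargs):
        # Fail loudly if anything tries to use xz compression for real.
        raise RuntimeError("lzma is stubbed out; xz compression is unavailable")

    class LZMAFile:
        """Stand-in for lzma.LZMAFile that cannot be instantiated."""

        def __init__(self, *args, **kwargs):
            _unavailable()

    stub.LZMAError = LZMAError
    stub.LZMAFile = LZMAFile
    stub.open = _unavailable
    stub.compress = _unavailable
    stub.decompress = _unavailable
    return stub


def ensure_lzma() -> bool:
    """Install the stub only if the real module cannot be imported.

    Returns True when the stub was installed, False when lzma already works.
    """
    try:
        import lzma  # noqa: F401
        return False
    except ImportError:
        sys.modules["lzma"] = build_lzma_stub()
        return True


if __name__ == "__main__":
    # Must run before importing transformers or huggingface_hub.
    print("stub installed:", ensure_lzma())
```

The call to `ensure_lzma()` has to happen before the first import of any library that pulls in `lzma`. Rebuilding Python with the xz development headers present removes the need for the stub entirely.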