whisper-translation/history.md
2026-02-26 21:26:34 -05:00

Project History: Python Whisper Live Transcription

This document tracks the evolution, technical decisions, and optimizations made for this live audio transcription and translation tool.

Phase 1: Basic Live Transcription

  • Goal: Create a simple script to transcribe live audio.
  • Initial Stack: openai-whisper, sounddevice, numpy.
  • Approach:
    • Captured audio in 2-second chunks using sounddevice.
    • Used the tiny.en model for initial testing.
  • Outcome: Basic transcription worked, but the model ran on every chunk, including silent ones, which wasted CPU time.
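
The Phase 1 loop can be sketched roughly as follows. This is a minimal illustration, not the original script: the 16 kHz sample rate and the helper names frames_per_chunk and transcribe_forever are assumptions introduced here.

```python
SAMPLE_RATE = 16_000   # assumption: Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 2.0    # chunk length described in the approach above


def frames_per_chunk(sample_rate: int = SAMPLE_RATE,
                     seconds: float = CHUNK_SECONDS) -> int:
    """Number of samples in one fixed-length chunk."""
    return int(sample_rate * seconds)


def transcribe_forever(model_name: str = "tiny.en") -> None:
    """Record fixed-length chunks and transcribe each one (hypothetical helper)."""
    # Imported lazily so this module can be loaded without audio hardware.
    import sounddevice as sd
    import whisper

    model = whisper.load_model(model_name)
    n_frames = frames_per_chunk()
    while True:
        # Blocking capture of one chunk as float32 samples.
        audio = sd.rec(n_frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        # Whisper accepts a 1-D float32 array. Every chunk is processed,
        # including silent ones, which is the limitation noted above.
        result = model.transcribe(audio.flatten(), fp16=False)
        print(result["text"].strip())
```

Because nothing gates the call to transcribe, the model runs once per chunk regardless of whether anyone is speaking.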

Phase 2: Voice Activity Detection (VAD)

  • Goal: Improve efficiency by only transcribing when someone is speaking.
  • Stack Addition: silero-vad.
  • Approach:
    • Integrated Silero VAD to monitor the audio stream.
    • Transcription is triggered only once a speech segment has been followed by 500 ms of silence.
  • Outcome: Significantly reduced CPU usage and cleaner output.
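
The "speech followed by 500 ms of silence" rule can be sketched as a small state machine over per-frame speech probabilities. This is an illustration under stated assumptions: the 32 ms frame length, the 0.5 threshold, and the function name segment_boundaries are hypothetical, and the probabilities are assumed to come from a VAD model such as Silero VAD.

```python
from typing import Iterable, Iterator, Tuple

FRAME_MS = 32            # assumption: one VAD decision per 32 ms frame
SILENCE_MS = 500         # trailing silence that closes a segment (from the text)
SPEECH_THRESHOLD = 0.5   # assumption: probability at or above this is speech


def segment_boundaries(probs: Iterable[float],
                       frame_ms: int = FRAME_MS,
                       silence_ms: int = SILENCE_MS,
                       threshold: float = SPEECH_THRESHOLD
                       ) -> Iterator[Tuple[int, int]]:
    """Yield (start_frame, end_frame) for each finished speech segment.

    end_frame is exclusive and does not include the trailing silence.
    A segment still open when the stream ends is not yielded.
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division: silent frames to close
    start = None      # frame index where the current segment began
    last_speech = -1  # index of the most recent speech frame
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= needed:
            # Enough consecutive silence: hand this segment to the transcriber.
            yield (start, last_speech + 1)
            start = None
```

Pauses shorter than the silence window stay inside one segment, so a speaker's brief hesitation does not split a sentence into two transcription calls.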

Phase 3: Apple Silicon Optimization (M-Series/M2)

  • Goal: Leverage the M2's GPU and unified memory for better performance.
  • Stack Transition: mlx-whisper (via Apple's MLX framework).
  • Decision: Switched from openai-whisper to mlx-whisper and upgraded the model to small.en for better accuracy without sacrificing speed.
  • Outcome: Faster inference and better battery efficiency.

Phase 4: Local Translation

  • Approach A (LLM): Tried using SmolLM2-135M via mlx-lm for translation.
    • Issue: The LLM was "too talkative," often adding conversational filler or explaining the translation instead of just providing it.
  • Approach B (Dedicated MT): Switched to MarianMT (Helsinki-NLP/opus-mt-en-es).
    • Decision: Chose a dedicated translation model for a cleaner, more direct mapping from English to Spanish.
  • Technical Hurdle (LZMA Error):
    • The local Python environment lacked _lzma support, causing transformers and huggingface_hub to crash.
    • Solution: Implemented a comprehensive lzma mock in the script to provide necessary constants (FORMAT_XZ, etc.) and bypass the system-level limitation.
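
One way such a mock could look is sketched below. It is not the project's actual code: the helper names build_lzma_stub and ensure_lzma are hypothetical, and the stub only supplies constants (with the same values as the standard library's lzma module) while making any real compression attempt fail loudly.

```python
import sys
import types


def build_lzma_stub() -> types.ModuleType:
    """Create a stand-in `lzma` module that exposes constants only."""
    stub = types.ModuleType("lzma")
    # Values mirror the constants in the standard library's lzma module.
    stub.FORMAT_AUTO, stub.FORMAT_XZ, stub.FORMAT_ALONE, stub.FORMAT_RAW = 0, 1, 2, 3
    stub.CHECK_NONE, stub.CHECK_CRC32, stub.CHECK_CRC64, stub.CHECK_SHA256 = 0, 1, 4, 10

    class LZMAError(Exception):
        """Raised when code actually tries to use the stubbed compressor."""

    def _unavailable(*args, **kwargs):
        raise LZMAError("lzma is stubbed: this Python was built without _lzma")

    class _Unavailable:
        """Placeholder class so attribute lookups and isinstance checks work."""

        def __init__(self, *args, **kwargs):
            _unavailable()

    stub.LZMAError = LZMAError
    stub.open = stub.compress = stub.decompress = _unavailable
    stub.LZMAFile = stub.LZMACompressor = stub.LZMADecompressor = _Unavailable
    return stub


def ensure_lzma() -> bool:
    """Install the stub if the real module is missing. Returns True if stubbed."""
    try:
        import lzma  # noqa: F401  (raises ImportError when _lzma is absent)
        return False
    except ImportError:
        sys.modules["lzma"] = build_lzma_stub()
        return True
```

For this to help, ensure_lzma() would need to run before transformers or huggingface_hub is imported, since those imports are what trigger the failure. Rebuilding Python with the xz development headers installed removes the need for a stub entirely.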

Current Status

The project now features a high-performance, Apple Silicon-optimized pipeline that:

  1. Detects speech using Silero VAD.
  2. Transcribes using MLX-Whisper (small.en).
  3. Translates using MarianMT (EN-ES).
  4. Operates entirely locally with hardware acceleration.

Phase 5: Simultaneous Multi-Language Translation

  • Goal: Provide translations in Spanish, French, and Arabic at the same time.
  • Approach:
    • Refactored the script to support a dictionary of multiple MarianMT models.
    • Each transcribed English segment is passed through each loaded translation engine sequentially.
  • Performance on M2: Loading 3-4 MarianMT models alongside Whisper uses ~1.5 GB of RAM, and translations still appear near-instantly.
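
The dictionary-of-models approach might be structured like the sketch below. Only the en-es checkpoint name appears in this history; the French and Arabic names are assumed to follow the same Helsinki-NLP naming pattern, and load_marian and translate_all are hypothetical helper names.

```python
from typing import Callable, Dict

# Assumption: checkpoint names for each target language. Only the en-es
# entry is confirmed by this document.
MODEL_NAMES = {
    "es": "Helsinki-NLP/opus-mt-en-es",
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "ar": "Helsinki-NLP/opus-mt-en-ar",
}

Translator = Callable[[str], str]


def load_marian(model_name: str) -> Translator:
    """Load one MarianMT model and return a text -> text function."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(text: str) -> str:
        batch = tokenizer([text], return_tensors="pt", padding=True)
        output = model.generate(**batch)
        return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

    return translate


def translate_all(text: str, translators: Dict[str, Translator]) -> Dict[str, str]:
    """Run one English segment through every loaded engine, one after another."""
    return {lang: translate(text) for lang, translate in translators.items()}
```

The models would be loaded once at startup, for example with translators = {lang: load_marian(name) for lang, name in MODEL_NAMES.items()}, and translate_all would then be called for each transcribed segment. Keeping the engines behind plain callables makes it easy to add or drop a language by editing the dictionary.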

Phase 6: Memory & Generation Safety

  • Issue: Occasionally, long inputs or model glitches caused "runaway" translation generation, which could consume excessive memory.
  • Solution:
    • Truncated each input transcription to a maximum of 250 characters before translation.
    • Added max_new_tokens=150 to the translation generation call to ensure the model terminates even if it gets stuck in a loop.
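
Both safeguards can be combined in one wrapper, sketched below. The function name bounded_translate is hypothetical, and the tokenizer and model arguments are assumed to follow the Hugging Face transformers interface (for example MarianTokenizer and MarianMTModel).

```python
MAX_INPUT_CHARS = 250   # input cap described above
MAX_NEW_TOKENS = 150    # generation cap described above


def bounded_translate(text: str, tokenizer, model,
                      max_chars: int = MAX_INPUT_CHARS,
                      max_new_tokens: int = MAX_NEW_TOKENS) -> str:
    """Translate `text` with both the input cap and the output cap applied."""
    # Hard cut on characters: anything past the limit is dropped.
    clipped = text[:max_chars]
    batch = tokenizer([clipped], return_tensors="pt", padding=True)
    # max_new_tokens bounds the output length even if decoding starts looping.
    output = model.generate(**batch, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]
```

The character cut is deliberately blunt and can split a word or sentence; the trade-off is a predictable upper bound on both input and output size per segment.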