@TechNavii

Context

MLX Audio’s Japanese TTS/STT UX had several rough edges: missing or misconfigured dependencies caused runtime crashes, the TTS language picker allowed unsupported combinations, the download button failed intermittently, and transcript playback defaulted to a placeholder clip. This PR focuses on stabilising Japanese support so users can reliably synthesize and transcribe in Japanese.

Description

  • Harden the backend dependency chain for Kokoro TTS (auto-install friendly requirements, configure fugashi/MeCab
    dictionaries, and guard language requests).
  • Expand the TTS UI so language options reflect the selected model’s capabilities, automatically select valid
    voices, and generate real downloads (the language gating is sketched after this list).
  • Simplify STT uploads to always auto-detect language, capture the uploaded audio alongside the transcript, and
    improve the transcript playback experience.
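
A minimal sketch of the per-model language gating, assuming hypothetical names (MODEL_LANGUAGES, resolveLanguage, resolveVoice) rather than the PR’s actual identifiers in page.tsx:

```typescript
// Hypothetical map of model id -> supported language codes; "auto" is only
// offered where the model can actually auto-detect the input language.
const MODEL_LANGUAGES: Record<string, string[]> = {
  kokoro: ["auto", "en", "ja", "zh"],
  marvis: ["en"],
};

// Clamp the requested language to what the selected model supports,
// downgrading to the first supported entry on a mismatch (e.g. switching
// to Marvis forces English).
function resolveLanguage(modelId: string, requested: string): string {
  const supported = MODEL_LANGUAGES[modelId] ?? ["en"];
  return supported.includes(requested) ? requested : supported[0];
}

// Pick a voice consistent with the resolved language (Kokoro's Japanese
// voice ids share a "j" prefix, Chinese a "z" prefix); otherwise fall
// back to the first available voice.
function resolveVoice(voices: string[], language: string): string {
  const prefix = language === "ja" ? "j" : language === "zh" ? "z" : "";
  return voices.find((v) => prefix !== "" && v.startsWith(prefix)) ?? voices[0];
}
```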

Changes in the codebase

  • requirements.txt, mlx_audio/server.py, mlx_audio/tts/utils.py, README.md: add misaki extras (misaki[ja], misaki[zh],
    pyopenjtalk, fugashi[unidic-lite]), alias save_weights for older mlx_lm, bootstrap fugashi/MeCab env vars, and
    document the behavior.
  • mlx_audio/ui/app/text-to-speech/page.tsx: enforce per-model language support (auto-detect only when supported),
    auto-select matching Kokoro voices, and implement a working audio download button (sketched below).
  • mlx_audio/ui/app/speech-to-text/page.tsx + [id]/page.tsx: remove the manual language dropdown, always send an
    empty language field for auto-detect STT, persist the uploaded audio data URL, feed it into the transcript
    player, and fix the playback timer.
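
A hedged sketch of the download button fix using standard browser APIs; downloadAudio and its parameters are illustrative names, not necessarily the PR’s code:

```typescript
// Fetch the synthesized audio and trigger a real file download by routing
// the bytes through a Blob object URL and a temporary anchor element.
async function downloadAudio(audioUrl: string, filename: string): Promise<void> {
  const response = await fetch(audioUrl);
  const blob = await response.blob();
  const objectUrl = URL.createObjectURL(blob);
  const anchor = document.createElement("a");
  anchor.href = objectUrl;
  anchor.download = filename; // e.g. "kokoro-output.wav" (illustrative)
  document.body.appendChild(anchor);
  anchor.click();
  anchor.remove();
  URL.revokeObjectURL(objectUrl); // release the Blob once the click has fired
}
```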

Changes outside the codebase

  • No infrastructure or third-party service changes beyond the new pip dependencies (misaki extras and fugashi’s
    UniDic-lite). The server now configures MeCab automatically; no manual OS-level install steps are required.

Additional information

  • The TTS language guardrails automatically downgrade to the first supported language when switching between models
    (e.g., Marvis → English only).
  • Transcript playback now uses the actual uploaded audio; durations come from metadata rather than static
    defaults (see the sketch after this list).
  • STT auto-detect keeps working with Whisper-family models without extra backend logic.
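
A minimal sketch of the STT playback plumbing, assuming hypothetical helpers fileToDataUrl and audioDuration; the PR’s actual code in page.tsx may differ:

```typescript
// Persist the uploaded file as a data URL so the transcript page can
// replay the exact audio that was transcribed.
function fileToDataUrl(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

// Read the clip's real length from decoded metadata instead of a static
// placeholder duration.
function audioDuration(src: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const audio = new Audio(src);
    audio.addEventListener("loadedmetadata", () => resolve(audio.duration));
    audio.addEventListener("error", () => reject(new Error("failed to load audio")));
  });
}
```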

Development

Successfully merging this pull request may close these issues:

  • Model type kokoro not supported