@TechNavii

Context

MLX Audio’s Japanese TTS/STT UX had several rough edges: missing or misconfigured dependencies caused runtime crashes, the TTS language picker allowed unsupported combinations, the download button failed intermittently, and transcript playback defaulted to a placeholder clip. This PR focuses on stabilising Japanese support so users can reliably synthesize and transcribe in Japanese.

Description

  • Harden the backend dependency chain for Kokoro TTS (auto-install friendly requirements, configure fugashi/MeCab
    dictionaries, and guard language requests).
  • Expand the TTS UI so language options reflect the selected model’s capabilities, automatically select valid
    voices, and generate real downloads (the language gating is sketched after this list).
  • Simplify STT uploads to always auto-detect language, capture the uploaded audio alongside the transcript, and
    improve the transcript playback experience.
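
A minimal sketch of the per-model language gating, assuming hypothetical names (MODEL_LANGUAGES, resolveLanguage, resolveVoice) rather than the PR’s actual identifiers in page.tsx:

```typescript
// Hypothetical map of model id -> supported language codes; "auto" is only
// offered where the model can actually auto-detect the input language.
const MODEL_LANGUAGES: Record<string, string[]> = {
  kokoro: ["auto", "en", "ja", "zh"],
  marvis: ["en"],
};

// Clamp the requested language to what the selected model supports,
// downgrading to the first supported entry on a mismatch (e.g. switching
// to Marvis forces English).
function resolveLanguage(modelId: string, requested: string): string {
  const supported = MODEL_LANGUAGES[modelId] ?? ["en"];
  return supported.includes(requested) ? requested : supported[0];
}

// Pick a voice consistent with the resolved language (Kokoro's Japanese
// voice ids share a "j" prefix, Chinese a "z" prefix); otherwise fall
// back to the first available voice.
function resolveVoice(voices: string[], language: string): string {
  const prefix = language === "ja" ? "j" : language === "zh" ? "z" : "";
  return voices.find((v) => prefix !== "" && v.startsWith(prefix)) ?? voices[0];
}
```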

Changes in the codebase

  • requirements.txt, mlx_audio/server.py, mlx_audio/tts/utils.py, README.md: add misaki extras (misaki[ja], misaki[zh],
    pyopenjtalk, fugashi[unidic-lite]), alias save_weights for older mlx_lm, bootstrap fugashi/MeCab env vars, and
    document the behavior.
  • mlx_audio/ui/app/text-to-speech/page.tsx: enforce per-model language support (auto-detect only when supported),
    auto-select matching Kokoro voices, and implement a working audio download button (sketched below).
  • mlx_audio/ui/app/speech-to-text/page.tsx + [id]/page.tsx: remove the manual language dropdown, always send an
    empty language field for auto-detect STT, persist the uploaded audio data URL, feed it into the transcript
    player, and fix the playback timer.
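
A hedged sketch of the download button fix using standard browser APIs; downloadAudio and its parameters are illustrative names, not necessarily the PR’s code:

```typescript
// Fetch the synthesized audio and trigger a real file download by routing
// the bytes through a Blob object URL and a temporary anchor element.
async function downloadAudio(audioUrl: string, filename: string): Promise<void> {
  const response = await fetch(audioUrl);
  const blob = await response.blob();
  const objectUrl = URL.createObjectURL(blob);
  const anchor = document.createElement("a");
  anchor.href = objectUrl;
  anchor.download = filename; // e.g. "kokoro-output.wav" (illustrative)
  document.body.appendChild(anchor);
  anchor.click();
  anchor.remove();
  URL.revokeObjectURL(objectUrl); // release the Blob once the click has fired
}
```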

Changes outside the codebase

  • No infrastructure or third-party service changes beyond the new pip dependencies (misaki extras and fugashi’s
    UniDic-lite). The server now configures MeCab automatically; no manual OS-level install steps are required.

Additional information

  • The TTS language guardrails automatically downgrade to the first supported language when switching between models
    (e.g., Marvis → English only).
  • Transcript playback now uses the actual uploaded audio; durations come from metadata rather than static
    defaults (see the sketch after this list).
  • STT auto-detect keeps working with Whisper-family models without extra backend logic.
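
A minimal sketch of the STT playback plumbing, assuming hypothetical helpers fileToDataUrl and audioDuration; the PR’s actual code in page.tsx may differ:

```typescript
// Persist the uploaded file as a data URL so the transcript page can
// replay the exact audio that was transcribed.
function fileToDataUrl(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

// Read the clip's real length from decoded metadata instead of a static
// placeholder duration.
function audioDuration(src: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const audio = new Audio(src);
    audio.addEventListener("loadedmetadata", () => resolve(audio.duration));
    audio.addEventListener("error", () => reject(new Error("failed to load audio")));
  });
}
```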

Development

Successfully merging this pull request may close these issues:

  • Model type kokoro not supported