Skip to content

Implement audio chat provider #227

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 10, 2025
Merged

Implement audio chat provider #227

merged 11 commits into from
Jul 10, 2025

Conversation

julien-nc
Copy link
Member

Requires nextcloud/assistant#291 (for the mp3 recording).
Related with nextcloud/server#53759

If connected to OpenAI, we can use the chat completion endpoint with the gpt-4o-audio-preview model which accepts audio messages as input.

If not connected to OpenAI, it's safer to split the process in: STT -> LLM -> TTS.

@julien-nc julien-nc added enhancement New feature or request 3. to review labels Jul 2, 2025
@julien-nc julien-nc force-pushed the enh/noid/voice2voice branch from d285a78 to c9fa7c1 Compare July 2, 2025 12:27
Copy link
Contributor

@kyteinsky kyteinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just nitpicks :)

@julien-nc julien-nc force-pushed the enh/noid/voice2voice branch from a8ac1af to 264930e Compare July 3, 2025 16:15
@julien-nc julien-nc requested a review from lukasdotcom July 4, 2025 12:44
Copy link
Member

@lukasdotcom lukasdotcom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@julien-nc julien-nc force-pushed the enh/noid/voice2voice branch 3 times, most recently from 9adb87e to 720d619 Compare July 9, 2025 14:44
@julien-nc
Copy link
Member Author

A few new things:

  • The "one step" is actually 2 steps because we need the input transcription (that is not done by multimodal models)
  • We return the response "audio ID" as optional output so the assistant can use it when scheduling following tasks. This helps to make sure the multimodal model keeps responding with audio

@julien-nc julien-nc requested a review from lukasdotcom July 9, 2025 14:46
Copy link
Member

@lukasdotcom lukasdotcom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting it into two functions is a good idea, and I have some feedback on the audio id portion.

@julien-nc julien-nc force-pushed the enh/noid/voice2voice branch from 720d619 to 688d951 Compare July 10, 2025 09:09
Copy link
Contributor

@kyteinsky kyteinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

throw new RuntimeException($serviceName . ' text to speech generation failed with: ' . $e->getMessage());
}
} else {
$output = base64_decode($message['audio']['data']);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this audio data expires at a certain time expires_at. Should we check that and fallback to the above TTS approach to get the new audio and audio id?

@julien-nc julien-nc requested a review from lukasdotcom July 10, 2025 10:20
…he server task type is available, handle the case when the chat endpoint does not return audio

Signed-off-by: Julien Veyssier <[email protected]>
@julien-nc julien-nc force-pushed the enh/noid/voice2voice branch from 6a48831 to 3ea9b77 Compare July 10, 2025 10:21
Copy link
Member

@lukasdotcom lukasdotcom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than what @kyteinsky said everything else LGTM.

@julien-nc
Copy link
Member Author

@kyteinsky @lukasdotcom The expiration is checked where the history array is built. Currently the only place where we do that is when the assistant schedules an audio chat task, in https://github.com/nextcloud/assistant/pull/292/files#diff-bacc1ea613cd8effdacf1a14dd4d819266ed4315d83dc81e94cad93b8d071ff4R591-R592

In integration_openai, we won't receive the expiration date in the history input, we can't check it here.

@julien-nc julien-nc merged commit acbcb17 into main Jul 10, 2025
29 checks passed
@julien-nc julien-nc deleted the enh/noid/voice2voice branch July 10, 2025 12:45
@kyteinsky kyteinsky mentioned this pull request Jul 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3. to review enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants