I'm interested in potential alternatives to ChatGPT's advanced voice mode. When I see the word "multimodal" I'm hopeful it means the model understands text + voice, but it almost always seems to refer to text + images instead. Is there a keyword I can use to find models that work with voice, similar to ChatGPT's advanced voice mode?
I don't know that ChatGPT's voice mode is using audio as a transformer input directly.
It could just be using speech to text (e.g. Whisper) on your input, and then using its text model on the text of your words. Or has OpenAI said that they aren't doing this?
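The distinction being drawn here is between a cascaded pipeline and a natively multimodal model. A minimal sketch of the cascaded version, with hypothetical stand-in functions rather than any real API, shows why it can't pick up non-verbal cues: the text model only ever sees a transcript, so pace and tone are discarded before it runs.

```python
# Hedged sketch of a cascaded "voice mode" pipeline:
# speech-to-text (e.g. Whisper), a text-only LLM, then text-to-speech.
# All three functions are hypothetical placeholders for model calls.

def speech_to_text(audio: bytes) -> str:
    """Stand-in for an ASR model such as Whisper."""
    return "how do I say hello in French?"  # pretend transcription

def text_model(prompt: str) -> str:
    """Stand-in for a text-only LLM."""
    return f"Reply to: {prompt}"

def text_to_speech(text: str) -> bytes:
    """Stand-in for a TTS model."""
    return text.encode("utf-8")

def cascaded_voice_turn(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)  # audio -> text: tone/pace lost here
    reply_text = text_model(transcript)    # text -> text: LLM never hears audio
    return text_to_speech(reply_text)      # text -> audio
```

A natively multimodal model, by contrast, would accept audio tokens as transformer input directly, with no transcript bottleneck in the middle.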
OpenAI does not provide many details about its models these days, but it does state that "Advanced voice" within ChatGPT operates on audio input directly:
> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.
Thanks - it seems that Gemini Live is pretty far behind advanced voice mode at the moment. For example, I can't get it to speak slower when I want to understand what it is saying.
I'm still interested in what keyword I could use to search for the latest research in voice models.