I'm interested in potential alternatives to ChatGPT's advanced voice mode. When I see the word "multimodal" I'm hopeful it means the model understands text + voice, but it almost always seems to refer to text + images instead. Is there a keyword I can use to find models that work with voice, similar to ChatGPT's advanced voice mode?
I don't know that ChatGPT's voice mode is using audio as a transformer input directly.
It could just be using speech to text (e.g. Whisper) on your input, and then using its text model on the text of your words. Or has OpenAI said that they aren't doing this?
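The distinction being drawn here is between a cascaded pipeline and a natively multimodal model. A minimal sketch of the cascaded version, with hypothetical stand-in functions rather than any real API, shows why it can't pick up non-verbal cues: the text model only ever sees a transcript, so pace and tone are discarded before it runs.

```python
# Hedged sketch of a cascaded "voice mode" pipeline:
# speech-to-text (e.g. Whisper), a text-only LLM, then text-to-speech.
# All three functions are hypothetical placeholders for model calls.

def speech_to_text(audio: bytes) -> str:
    """Stand-in for an ASR model such as Whisper."""
    return "how do I say hello in French?"  # pretend transcription

def text_model(prompt: str) -> str:
    """Stand-in for a text-only LLM."""
    return f"Reply to: {prompt}"

def text_to_speech(text: str) -> bytes:
    """Stand-in for a TTS model."""
    return text.encode("utf-8")

def cascaded_voice_turn(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)  # audio -> text: tone/pace lost here
    reply_text = text_model(transcript)    # text -> text: LLM never hears audio
    return text_to_speech(reply_text)      # text -> audio
```

A natively multimodal model, by contrast, would accept audio tokens as transformer input directly, with no transcript bottleneck in the middle.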
OpenAI does not provide many details about its models these days, but it does state that "Advanced voice" within ChatGPT operates on audio input directly:
> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.
Thanks - it seems that Gemini Live is pretty far behind advanced voice mode at the moment. For example, I can't get it to speak slower when I want to understand what it is saying.
I'm still interested in what keyword I could use to search for the latest research in voice models.