This is great. Poking into the source, I find it interesting that the author implemented a custom turn detection strategy, instead of using Silero VAD (which is standard in the voice agents space). I'm very curious why they did it this way and what benefits they observed.
For folks that are curious about the state of the voice agents space, Daily (the WebRTC company) has a great guide [1], as well as an open-source framework that allows you to build AI voice chat similar to OP's with lots of utilities [2].
Disclaimer: I work at Cartesia, which serves many of these voice-agent use cases, and Daily is a friend.
It does in fact use Silero, via RealtimeSTT. RealtimeSTT detects when silence starts. Then a binary sentence classification model runs on the realtime transcription text; it infers blazingly fast (~10 ms) and returns a probability between 0 and 1 indicating whether the current spoken sentence is considered "complete". The turn detection component uses this probability to calculate how long to wait during silence before the turn is considered over.
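The mapping described above can be sketched roughly like this. Note this is a hypothetical illustration, not the author's actual code: function names, thresholds, and the linear interpolation are all assumptions.

```python
# Illustrative sketch (not the project's real implementation): map a
# sentence-completeness probability to a dynamic silence timeout, then
# end the turn once silence has lasted that long.

def silence_wait_time(completion_prob: float,
                      min_wait: float = 0.25,
                      max_wait: float = 1.5) -> float:
    """Interpolate the wait (seconds): a sentence that looks complete
    gets a short timeout; an incomplete one gets a longer grace period.
    The bounds here are made-up example values."""
    completion_prob = max(0.0, min(1.0, completion_prob))
    return max_wait - (max_wait - min_wait) * completion_prob


def turn_is_over(silence_duration: float, completion_prob: float) -> bool:
    """Declare the turn over once silence exceeds the dynamic wait."""
    return silence_duration >= silence_wait_time(completion_prob)


# A confidently complete sentence ends the turn after brief silence,
# while an incomplete one keeps waiting.
print(turn_is_over(silence_duration=0.4, completion_prob=0.95))  # True
print(turn_is_over(silence_duration=0.4, completion_prob=0.2))   # False
```

The appeal of this scheme is that a fast text classifier lets the agent cut the awkward fixed VAD delay when the utterance already reads as finished, while still giving the speaker room mid-sentence.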
This is the exact strategy I'm using for the real-time voice agent I'm building. LiveKit also published a custom turn detection model that, judging by the video they released, works really well; that was cool to see.
[1]: https://voiceaiandvoiceagents.com
[2]: https://docs.pipecat.ai/getting-started/overview