This is great. Poking into the source, I find it interesting that the author implemented a custom turn detection strategy, instead of using Silero VAD (which is standard in the voice agents space). I'm very curious why they did it this way and what benefits they observed.
For folks that are curious about the state of the voice agents space, Daily (the WebRTC company) has a great guide [1], as well as an open-source framework that allows you to build AI voice chat similar to OP's with lots of utilities [2].
Disclaimer: I work at Cartesia, which serves many of these voice-agent use cases, and Daily is a friend.
It does in fact use Silero, via RealtimeSTT. RealtimeSTT detects when silence starts. Then a binary sentence classification model runs on the realtime transcription text; it infers blazingly fast (~10 ms) and returns a probability between 0 and 1 indicating whether the current spoken sentence is considered "complete". The turn detection component uses this probability to calculate how long to wait during silence before the turn is considered over.
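The mapping described above can be sketched roughly like this. Note this is a hypothetical illustration, not the author's actual code: function names, thresholds, and the linear interpolation are all assumptions.

```python
# Illustrative sketch (not the project's real implementation): map a
# sentence-completeness probability to a dynamic silence timeout, then
# end the turn once silence has lasted that long.

def silence_wait_time(completion_prob: float,
                      min_wait: float = 0.25,
                      max_wait: float = 1.5) -> float:
    """Interpolate the wait (seconds): a sentence that looks complete
    gets a short timeout; an incomplete one gets a longer grace period.
    The bounds here are made-up example values."""
    completion_prob = max(0.0, min(1.0, completion_prob))
    return max_wait - (max_wait - min_wait) * completion_prob


def turn_is_over(silence_duration: float, completion_prob: float) -> bool:
    """Declare the turn over once silence exceeds the dynamic wait."""
    return silence_duration >= silence_wait_time(completion_prob)


# A confidently complete sentence ends the turn after brief silence,
# while an incomplete one keeps waiting.
print(turn_is_over(silence_duration=0.4, completion_prob=0.95))  # True
print(turn_is_over(silence_duration=0.4, completion_prob=0.2))   # False
```

The appeal of this scheme is that a fast text classifier lets the agent cut the awkward fixed VAD delay when the utterance already reads as finished, while still giving the speaker room mid-sentence.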
This is the exact strategy I'm using for the real-time voice agent I'm building. LiveKit also published a custom turn detection model that, judging by the video they released, works really well; that was cool to see.
[1]: https://voiceaiandvoiceagents.com
[2]: https://docs.pipecat.ai/getting-started/overview