> The Google Recorder app (...) transcribes meetings and interviews to text, instantly giving you a searchable transcription that is synced to the recorded audio (...) With this new update, the recorder can now identify and label each speaker automatically—an impressive feat. Google Recorder is exclusive to the Pixel 6 and newer Pixel devices.
Not that I know of, but I use Descript for content creation and it offers something like what you're describing. They have automatic speaker labels built into their transcription service.
There is a discussion on the Whisper GitHub page called something like “diarization” which details a few attempts to attain this functionality with additional tools.
I’ve had limited success combining the results of pyannote with Whisper transcripts (based on timestamps).
It’s ok, but the quality of speaker identification is nowhere near as good as the transcription itself.
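For anyone wondering what "combining based on timestamps" can look like, here is a minimal sketch, not the exact approach from that GitHub discussion. It assumes you have already pulled the transcript segments (start, end, text) and the diarization turns (start, end, speaker) out of the two tools into plain Python objects. The `Segment` and `Turn` classes, the `label_segments` helper, and the sample data are all made up for illustration. Each transcript segment simply gets the speaker whose turns overlap it the most.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment (hypothetical container, times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One diarization turn (hypothetical container, times in seconds)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns, unknown="UNKNOWN"):
    """Assign each segment the speaker with the largest total overlap."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        # Fall back to a placeholder when no diarization turn covers the segment.
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg.text))
    return labeled


# Made-up sample data, only to show the shape of the inputs.
segments = [
    Segment(0.0, 2.0, "Hello there."),
    Segment(2.0, 5.0, "Hi, how are you?"),
    Segment(9.0, 10.0, "Bye."),
]
turns = [
    Turn(0.0, 2.2, "SPEAKER_00"),
    Turn(2.2, 5.0, "SPEAKER_01"),
]

for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```

The weak point is visible in the sketch itself: a segment that straddles two speakers gets a single label, and any error in the diarization boundaries goes straight into the output. Matching at word level instead of segment level would probably help if word timestamps are available.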
I’d love to see models which try to use stereo information in recordings to solve the problem. Or, given a fixed camera and static speakers, I thought it should even be possible to use video to add information about who is speaking. There doesn’t seem to be anything like that right now tho.
For example, instead of:
You get something like: