Has anyone combined Whisper with a model that can assign owners to voices yet?

For example, instead of:

    Hello there!
    Hi 
    How are you?
    Good, and you?
You get something like:

    Voice A: Hello there!
    Voice B: Hi 
    Voice A: How are you?
    Voice B: Good, and you?


This is called speaker diarization, basically one of the 3 components of speaker recognition (verification, identification, diarization).

You can do this pretty conveniently using pyannote-audio[0].
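Something like this works (a minimal sketch assuming pyannote.audio 2.x; the pretrained pipeline is gated on Hugging Face, so the token and file name below are placeholders):

    from pyannote.audio import Pipeline

    # Load the pretrained diarization pipeline (gated model: requires
    # accepting its terms on Hugging Face and an access token).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="YOUR_HF_TOKEN",
    )

    # Run diarization; the result is an Annotation of speaker turns.
    diarization = pipeline("conversation.wav")

    # Each track is a (segment, track_id, speaker_label) triple.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker}")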

Coincidentally I did a small presentation on this at a university seminar yesterday :). I could post a Jupyter notebook if you're interested.

PS: Bai & Zhang (2020) is a great review of the literature [1].

[0] https://github.com/pyannote/pyannote-audio

[1] https://arxiv.org/abs/2012.00931


Yes please, posting a Jupyter notebook would be a great help.


yes please!


I modified a Jupyter notebook that’s been going around to do this and produce nice Markdown output.

This is the best AI diarization and transcription I’ve been able to get so far: https://github.com/zachlatta/openai-whisper-speaker-identifi...
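The Markdown rendering itself is the easy part once you have (speaker, text) pairs; a hypothetical helper (names made up here, not from the notebook) could be as simple as:

    # Hypothetical helper: render (speaker, text) pairs as a Markdown
    # transcript with bolded speaker labels.
    def to_markdown(labeled_segments):
        return "\n\n".join(f"**{speaker}:** {text}"
                           for speaker, text in labeled_segments)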


Not Whisper, but:

> The Google Recorder app (...) transcribes meetings and interviews to text, instantly giving you a searchable transcription that is synced to the recorded audio (...) With this new update, the recorder can now identify and label each speaker automatically—an impressive feat. Google Recorder is exclusive to the Pixel 6 and newer Pixel devices.

https://arstechnica.com/gadgets/2022/12/pixel-7-update-adds-...


That's awesome. I wonder if there's a way to upload existing recordings to Google Recorder to get the transcriptions.


I'd be down to code this up if anyone's interested. Contact is in my profile.


This is called "diarization"


Not that I know of, but I use Descript for content creation and it offers something like what you're describing. They have automatic speaker labels built into their transcription service.

Would be lovely to see this feature open-sourced.


There is a discussion on the Whisper GitHub page called something like “diarization” which details a few attempts to achieve this functionality with additional tools.


I’ve had limited success combining the results of pyannote with Whisper transcripts (based on timestamps).

It’s ok, but the quality of speaker identification is nowhere near as good as the transcription itself.

I’d love to see models that try to use stereo information in recordings to solve the problem. Or, given a fixed camera and static speakers, it should even be possible to use video to add information about who is speaking. There doesn’t seem to be anything like that right now, though.
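For what it’s worth, the timestamp merge I tried looks roughly like this (a sketch assuming pyannote.audio 2.x and the segment timestamps Whisper’s transcribe() returns; token and file name are placeholders):

    import whisper
    from pyannote.audio import Pipeline

    # Transcribe: result["segments"] carries start/end times and text.
    model = whisper.load_model("base")
    result = model.transcribe("conversation.wav")

    # Diarize the same file (gated model: needs a Hugging Face token).
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token="YOUR_HF_TOKEN")
    diarization = pipeline("conversation.wav")

    # Assign each Whisper segment to the speaker whose diarization turn
    # overlaps it the most; fall back to "unknown" if nothing overlaps.
    for seg in result["segments"]:
        best, best_overlap = "unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        print(f"{best}: {seg['text'].strip()}")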


Some attempts have been made but it still sucks.



