Has anyone combined Whisper with a model that can assign owners to voices yet?

For example, instead of:

    Hello there!
    Hi 
    How are you?
    Good, and you?
You get something like:

    Voice A: Hello there!
    Voice B: Hi 
    Voice A: How are you?
    Voice B: Good, and you?


This is called speaker diarization, basically one of the 3 components of speaker recognition (verification, identification, diarization).

You can do this pretty conveniently using pyannote-audio[0].
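Something like this works (a minimal sketch assuming pyannote.audio 2.x; the pretrained pipeline is gated on Hugging Face, so the token and file name below are placeholders):

    from pyannote.audio import Pipeline

    # Load the pretrained diarization pipeline (gated model: requires
    # accepting its terms on Hugging Face and an access token).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="YOUR_HF_TOKEN",
    )

    # Run diarization; the result is an Annotation of speaker turns.
    diarization = pipeline("conversation.wav")

    # Each track is a (segment, track_id, speaker_label) triple.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker}")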

Coincidentally I did a small presentation on this at a university seminar yesterday :). I could post a Jupyter notebook if you're interested.

PS: Bai & Zhang (2020) is a great review of the literature [1].

[0] https://github.com/pyannote/pyannote-audio

[1] https://arxiv.org/abs/2012.00931


Yes please, posting a Jupyter notebook would be a great help.


yes please!


I modified a Jupyter notebook that’s been going around to do this and produce nice Markdown output.

This is the best AI diarization and transcription I’ve been able to get so far: https://github.com/zachlatta/openai-whisper-speaker-identifi...
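The Markdown rendering itself is the easy part once you have (speaker, text) pairs; a hypothetical helper (names made up here, not from the notebook) could be as simple as:

    # Hypothetical helper: render (speaker, text) pairs as a Markdown
    # transcript with bolded speaker labels.
    def to_markdown(labeled_segments):
        return "\n\n".join(f"**{speaker}:** {text}"
                           for speaker, text in labeled_segments)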


Not Whisper, but:

> The Google Recorder app (...) transcribes meetings and interviews to text, instantly giving you a searchable transcription that is synced to the recorded audio (...) With this new update, the recorder can now identify and label each speaker automatically—an impressive feat. Google Recorder is exclusive to the Pixel 6 and newer Pixel devices.

https://arstechnica.com/gadgets/2022/12/pixel-7-update-adds-...


That's awesome. I wonder if there's a way to upload existing recordings to Google Recorder to get the transcriptions.


I'd be down to code this up if anyone's interested. Contact is in my profile.


This is called "diarization"


Not that I know of, but I use Descript for content creation and it offers something like what you're describing. They have automatic speaker labels built into their transcription service.

Would be lovely to see this feature open-sourced.


There is a discussion on the Whisper GitHub page called something like “diarization” which details a few attempts to achieve this functionality with additional tools.


I’ve had limited success combining the results of pyannote with Whisper transcripts (based on timestamps).

It’s ok, but the quality of speaker identification is nowhere near as good as the transcription itself.

I’d love to see models that try to use stereo information in recordings to solve the problem. Or, given a fixed camera and static speakers, it should even be possible to use video to add information about who is speaking. There doesn’t seem to be anything like that right now, though.
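For what it’s worth, the timestamp merge I tried looks roughly like this (a sketch assuming pyannote.audio 2.x and the segment timestamps Whisper’s transcribe() returns; token and file name are placeholders):

    import whisper
    from pyannote.audio import Pipeline

    # Transcribe: result["segments"] carries start/end times and text.
    model = whisper.load_model("base")
    result = model.transcribe("conversation.wav")

    # Diarize the same file (gated model: needs a Hugging Face token).
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token="YOUR_HF_TOKEN")
    diarization = pipeline("conversation.wav")

    # Assign each Whisper segment to the speaker whose diarization turn
    # overlaps it the most; fall back to "unknown" if nothing overlaps.
    for seg in result["segments"]:
        best, best_overlap = "unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        print(f"{best}: {seg['text'].strip()}")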


Some attempts have been made but it still sucks.



