I built a similar thing with the GPT-4 APIs a few weeks ago; thanks for the reminder that I must put it on GitHub at some point, as it's only about 30 lines of code.
It should be doable. What I built only works for videos with transcripts, but I've been looking to improve it using OpenAI's Whisper for speech-to-text. I'm just lazy, so I haven't gotten around to it (...which is why I spent an hour throwing together a 30-line script to summarize YouTube videos for me)
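For anyone curious what a script like that looks like, here's a minimal sketch. It assumes the third-party `youtube-transcript-api` and `openai` packages and an `OPENAI_API_KEY` in the environment; the package names, model name, and prompt are my assumptions, not the commenter's actual code.

```python
# Sketch of a transcript-based YouTube summarizer.
# Assumes: pip install youtube-transcript-api openai
# (both package names and the "gpt-4" model are assumptions for illustration).

def join_transcript(segments):
    """Flatten the list of {'text': ...} transcript segments into one string."""
    return " ".join(seg["text"].strip() for seg in segments)

def summarize(video_id):
    # Imports are local so the helper above works without these packages installed.
    from youtube_transcript_api import YouTubeTranscriptApi
    import openai  # expects OPENAI_API_KEY in the environment

    text = join_transcript(YouTubeTranscriptApi.get_transcript(video_id))
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize this video transcript in a few bullet points."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content
```

This only works when YouTube has a transcript for the video, which is exactly the limitation mentioned above.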
For you or anyone else reading this: I recently ran across this video documenting setting up and using Whisper. It's probably a little overdetailed, but I found the GitHub docs a little underdetailed, so it might be useful. Whisper is pretty powerful; it's one of the more useful open-source AI tools available right now.
But as you implied in your comment, it should be possible to do this quite well with any video by transcribing it with Whisper and then sending the text to GPT or another LLM to summarize.
I've done something similar here: https://github.com/mcdallas/summarize. It feeds an audio file to Whisper and then summarizes the transcript. You can easily wrap it with yt-dlp to download just the audio portion of a video.
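The yt-dlp wrapping step might look something like this. The flags (`-x`, `--audio-format`) are standard yt-dlp options; the output filename and the shape of the wrapper are my assumptions for illustration, not part of the linked repo.

```python
# Wrapping a Whisper-based summarizer with yt-dlp: download only the
# audio track of a video, then hand the resulting file to the summarizer.
import subprocess

def yt_dlp_cmd(url, out="audio.mp3"):
    """Build the yt-dlp invocation that extracts just the audio as mp3."""
    # -x: extract audio; --audio-format mp3: transcode; -o: output template
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", out, url]

def fetch_audio(url, out="audio.mp3"):
    """Run yt-dlp and return the path of the downloaded audio file."""
    subprocess.run(yt_dlp_cmd(url, out), check=True)
    return out
```

From there it's just `fetch_audio(video_url)` followed by whatever transcribe-and-summarize step you already have.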
I also did the same, but it's a web app: https://github.com/mkagenius/audioGPT (I also have it hosted, but I'm afraid that if I post the link it would eat through all my credits)
I'm currently working on this, with the caveat that I want to do the work locally. I'm using Whisper, but the summarization portion of this task is not straightforward given the limited context size of models.
Does anyone have any additional insight into this problem?
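Not a full answer, but one common workaround for the limited context size is map-reduce summarization: split the transcript into overlapping chunks, summarize each chunk, then summarize the concatenated summaries. A sketch, with the actual model call left as a callable you plug in (everything here is my assumption, not a known-good recipe for any particular local model):

```python
# Map-reduce summarization sketch for transcripts longer than the
# model's context window. chunk_text is pure Python; summarize_chunk
# is whatever str -> str call your local model exposes.

def chunk_text(text, max_chars=4000, overlap=200):
    """Split text into chunks of at most max_chars, with a small overlap
    so sentences at a boundary aren't lost. Requires overlap < max_chars."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

def summarize_long(text, summarize_chunk):
    """Summarize each chunk, then summarize the combined partial summaries."""
    partials = [summarize_chunk(c) for c in chunk_text(text)]
    if len(partials) == 1:
        return partials[0]
    return summarize_chunk("\n".join(partials))
```

The chunk size would need tuning to the model's actual context window (characters are a crude proxy for tokens), and a second reduce pass may be needed if there are many chunks, but this shape works with any summarizer.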
I'll check it out (or maybe let my script check it out first), thanks.
From what I remember, the Whisper API docs weren't too bad, but I didn't try actually implementing anything, so you could be right that they're underdetailed.