
I was curious how good a transcription I could get from what may be the best multimodal LLM currently, Gemini-1.5-Pro-Experiment-0801, so I had it transcribe five minutes of an interview between Ezra Klein and Nancy Pelosi from earlier today. The results are here:

https://www.gally.net/temp/20240809geminitranscription/index...

Aside from some minor punctuation and capitalization issues, Gemini’s transcription looks nearly perfect to me. There were only one or two words that I think it misheard. If I had transcribed the audio myself, I would have made more mistakes than that.

One passage struck me in particular:

  And then he comes up with "weird," which becomes viral and the rest, and here he is.

How did Gemini know to put “weird” in quotation marks, to indicate—correctly—that the speaker was referring to Walz’s use of the word as a word? According to Politico, Walz first used the word in that context in the media on July 23.

https://www.politico.com/news/2024/07/26/trump-vance-weird-0...



Maybe two factors helped achieve the impressive result with the quotation marks:

- auditory cues

- the sentence would be grammatically incorrect and make no sense without them

Just guessing out of the blue.

But I think it's likely that LLMs (and other speech recognition systems) need to exploit sentence context to recognize individual words and punctuation, and this is an example where it went well.

Human listening is similar in a way: we can recognize words even when they are mumbled or spoken fast, if we have context.

So we always hear phrases rather than words.
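
To make the idea concrete, here is a toy sketch of how context can flip a word choice. Everything in it is an assumption for illustration: the probabilities are invented, `pick_word` is a hypothetical helper, and this is not how Gemini or any particular recognizer actually works.

```python
import math

# Toy numbers, invented for illustration only; a real recognizer would get
# these from an acoustic model and a language model.
ACOUSTIC = {"their": 0.55, "there": 0.45}  # how well each word matches the sound
CONTEXT = {("over", "there"): 0.09, ("over", "their"): 0.001}  # P(word | previous word)
FLOOR = 1e-6  # fallback probability for word pairs missing from CONTEXT


def pick_without_context(acoustic):
    """Pick the candidate that best matches the sound alone."""
    return max(acoustic, key=acoustic.get)


def pick_word(previous, acoustic, context):
    """Pick the candidate with the highest combined acoustic + context log score."""
    def score(word):
        return math.log(acoustic[word]) + math.log(context.get((previous, word), FLOOR))
    return max(acoustic, key=score)
```

With these made-up numbers, the sound alone favors "their", while taking the previous word "over" into account flips the choice to "there".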


It's very likely that the model is capable of picking up on the verbal cues surrounding quotes.

Do you have the audio or video file?

I'd like to run it through our AI video editor and see how it punctuates the transcript.


The mp3 file that I gave to Gemini (a five-minute excerpt from the audio podcast) is linked in the source code of the page. Here is the full URL:

https://www.gally.net/temp/20240809geminitranscription/inter...

The full interview including video is on the New York Times website, though you might need a subscription to view it:

https://www.nytimes.com/2024/08/09/opinion/ezra-klein-podcas...

The NYT’s closed captions do not put “weird” in quotation marks; they also divide sentences weirdly and have other mistakes. But they get some things better than Gemini, such as capitalizing “House” when it means the House of Representatives.

I haven’t compared the audio-only podcast version and the video version carefully; it’s possible that parts of the audio were edited or re-recorded for one or the other.

Let us know how your AI video editor does!


OK, I ran it.

We have a grammar correction step which "fixed" the sentence to "...he comes up with something weird, which becomes viral...".
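
One way to catch this kind of over-correction would be to check that a correction step only touches punctuation and capitalization. The sketch below is an assumption about how such a guard could look, not a description of any existing pipeline; `words_only` and `preserves_wording` are hypothetical helper names.

```python
import re


def words_only(text: str) -> list[str]:
    """Lowercase the text and keep only word tokens, dropping punctuation."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Strip apostrophes used as single quotes around a word, keep inner ones (e.g. "it's").
    stripped = [token.strip("'") for token in tokens]
    return [token for token in stripped if token]


def preserves_wording(original: str, corrected: str) -> bool:
    """Return True if the correction changed only punctuation or capitalization."""
    return words_only(original) == words_only(corrected)
```

Applied to the sentence above, a correction that only moves the comma or adds quotation marks would pass, while inserting "something" would be flagged because the word sequence changes.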



