I was curious how good a transcription I could get from what may currently be the best multimodal LLM, Gemini-1.5-Pro-Experiment-0801, so I had it transcribe five minutes of an interview between Ezra Klein and Nancy Pelosi from earlier today. The results are here:
Aside from some minor punctuation and capitalization issues, Gemini’s transcription looks nearly perfect to me. There were only one or two words that I think it misheard. If I had transcribed the audio myself, I would have made more mistakes than that.
One passage struck me in particular:
And then he comes up with "weird," which becomes viral and the rest, and here he is.
How did Gemini know to put “weird” in quotation marks, to indicate—correctly—that the speaker was referring to Walz’s use of the word as a word? According to Politico, Walz first used the word in that context in the media on July 23.
Maybe two factors helped achieve the impressive result with the quotation marks:
- auditory cues
- the sentence would be grammatically incorrect and make no sense without them
Just guessing out of the blue.
But I think it's likely that LLMs (and other speech recognition systems) need to exploit sentence context to recognize individual words and punctuation, and this is an example where it went well.
Human listening is similar in a way: we can recognize words even when they are mumbled or spoken fast, if we have context.
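To make the "context helps" idea concrete, here is a toy sketch of the classic approach of combining an acoustic score with a context score. It is not how Gemini works internally (that isn't public, and it is presumably end-to-end), and every number and name below is invented for illustration: the candidate words, the probabilities, and the `pick_word` helper are all assumptions.

```python
import math

# Toy numbers, invented for illustration only.
# Hypothetical acoustic scores: how well each similar-sounding candidate
# matches the audio on its own.
ACOUSTIC = {"weird": 0.45, "wired": 0.55}

# Hypothetical context scores: how plausible each candidate is given the
# surrounding words, e.g. after "and then he comes up with".
CONTEXT = {"weird": 0.08, "wired": 0.01}


def pick_word(acoustic, context, context_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def score(word):
        return math.log(acoustic[word]) + context_weight * math.log(context[word])
    return max(acoustic, key=score)


print(pick_word(ACOUSTIC, CONTEXT, context_weight=0.0))  # audio alone
print(pick_word(ACOUSTIC, CONTEXT))                      # audio plus context
```

With the context weight at zero the slightly better acoustic match wins; once context counts, the word that fits the sentence wins even though it matched the audio slightly worse.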
The NYT’s closed captions do not put “weird” in quotation marks; they also divide sentences weirdly and have other mistakes. But they get some things better than Gemini, such as capitalizing “House” when it means the House of Representatives.
I haven’t compared the audio-only podcast version and the video version carefully; it’s possible that parts of the audio were edited or re-recorded for one or the other.
https://www.gally.net/temp/20240809geminitranscription/index...
https://www.politico.com/news/2024/07/26/trump-vance-weird-0...