Why can't Gemini, the product, do that by itself? Isn't the point of all this AI...

vlovich123 · 2025-11-18T22:41:29 1763505689

Multimodal models are only now starting to come into the space, and even then I don’t know that they really support diarization yet (and often multimodal means thinking+speech/images; I'm not sure about audio).

jrk · 2025-11-18T23:23:53 1763508233

I think they weren’t asking “why can’t Gemini 3, the model, just do good transcription,” they were asking “why can’t Gemini, the API/app, recognize the task as something best solved not by a single generic model call, but by breaking it down into an initial subtask for a specialized ASR model followed by LLM cleanup, automatically, rather than me having to manually break down the task to achieve that result.”

darkwater · 2025-11-19T09:18:10 1763543890

Exactly that. There is a layer (or more than one) between the user submitting the YT video and the actual model "reading" it and writing the digest. If the required outcome is a digest of a 3-hour video, and the best result comes from passing it first through a specialized transcription model and then through a generic one that can summarize, well, why doesn't Google/Gemini do that out of the box? I mean, I'm probably oversimplifying, but if you read the presentation post by Pichai himself, well, I would not expect less than this.
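To make that "layer" concrete, here is a minimal sketch of the kind of routing being asked for. Everything in it is hypothetical: `transcribe_stub`, `summarize_stub`, `handle_request` and the `"digest_video"` task name are made up for illustration and say nothing about how Gemini actually works internally.

```python
from typing import Callable


def transcribe_stub(media_url: str) -> str:
    """Hypothetical stand-in for a specialized ASR/diarization model."""
    return f"[raw transcript of {media_url}]"


def summarize_stub(text: str) -> str:
    """Hypothetical stand-in for a general-purpose LLM call."""
    return f"digest of: {text}"


def handle_request(
    task: str,
    payload: str,
    asr: Callable[[str], str] = transcribe_stub,
    llm: Callable[[str], str] = summarize_stub,
) -> str:
    # The routing layer: a video digest is split into two subtasks,
    # specialized transcription first, then the generic model on the text.
    if task == "digest_video":
        transcript = asr(payload)
        return llm(transcript)
    # Anything else goes straight to the generic model.
    return llm(payload)
```

The point is only that the decomposition happens inside `handle_request`, so the user never has to do it by hand.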

refulgentis · 2025-11-19T00:11:55 1763511115

Speech recognition, as described above, is an AI too :) These LLMs are huge AIs that I guess could eventually replace all other AIs, but that’s the sort of speculation no one with knowledge of the field would endorse.

Separately, in my role as wizened 16 year old veteran of HN: it was jarring to read that. There’s a “rules” section, but don’t be turned off by the name, it is more like a nice collection of guidelines of how to interact in a way that encourages productive discussion that illuminates. One of the key rules is not to interpret things weakly. Here, someone spelled out exactly how to do it, and we shouldn’t then assume it’s not AI, then tie it to a vague demeaning description of “AI hype”, then ask an unanswerable question of what’s the point of “AI hype”.

If you’re nontechnical, to be clear, it would be hard to be nontechnical and new to HN and know how to ask that a different way, I suppose.

darkwater · 2025-11-19T09:20:20 1763544020

> There’s a “rules” section, but don’t be turned off by the name, it is more like a nice collection of guidelines of how to interact in a way that encourages productive discussion that illuminates. One of the key rules is not to interpret things weakly. Here, someone spelled out exactly how to do it, and we shouldn’t then assume it’s not AI, then tie it to a vague demeaning description of “AI hype”, then ask an unanswerable question of what’s the point of “AI hype”.

I think you misunderstood my comment. https://news.ycombinator.com/item?id=45973656 has got the right reading of it.