An update to Gemini Diffusion is one of my most eagerly anticipated AI releases. It launched to mild fanfare (mostly because you needed to request access to use it), and there has been silence ever since.
Hopefully it's not more Google abandonware, because it was wicked fast and a delight to use
It's not a very promising direction because autoregressive LLMs still deliver better output quality per model weight, as a rule.
Now, is it possible that a model can combine the advantages of both? Combine the fast generation and multidirectional causality of diffusion with the precision, capabilities, and generalization of autoregression?
Maybe. This paper is research in that direction. So far, it's not a clear upgrade over autoregressive LLMs.
Diffusion LMs do seem to be able to get more out of the same data. In a world where we are already training transformer-based LLMs on all available text, diffusion LMs' ability to keep learning from a fixed dataset may let them outperform transformers.
As a rule, but the devil is in the details. The thing, the one big thing I want to use multimodal LLMs for, is accessing the data in historical, mostly handwritten texts.
None of the big LLMs do an acceptable job. This is a task a trained human can do, but it's a lot of work. You have to learn not just the script style of the period (which can vary far more than people think), but even the idiosyncrasies of a given writer. All the time, you run into an unreadable word, and you need to look around for context that might give a clue, or for other places the same word (or a similar-looking word) is used in cleaner contexts. It's very much not a beginning-to-end task; trying to read a document from start to finish would be like solving a crossword puzzle in strict left-to-right, top-to-bottom order.
Maybe autoregressive models can eventually become powerful enough that they can just do that! But so far, they haven't. And I have a lot more faith that the diffusion approach is closer to how you have to do it.
That looks like something that can be solved by autoregressive models of today, no architectural changes needed.
What you need is: good image understanding, at least GPT-5 tier; training in general-purpose reasoning over images; and then some domain-specific training, or at least some few-shot guidance to get it to adopt the correct reasoning patterns.
If I had to guess which model would be able to do it best out of the box, few-shot, I'd say Gemini 3 Pro.
There is nothing preventing an autoregressive LLM from revisiting images and rewriting the text as new clues come in. This is how they can solve puzzles like sudoku.
> still deliver better output quality per model weight, as a rule.
Is it possible to quantify that and just have a linked slider for quality and speed? If I can get an answer that's 80% right in 1/10th the time and then iterate on it, who comes out ahead?
Yes, but you can also do the same thing with autoregressive models just by making them smaller. This tradeoff always exists; the question is whether the Pareto curve for diffusion models ever crosses or dominates the best autoregressive option at the same throughput (or quality).
Latency may be better, but throughput (the thing companies care about) may be the same or worse, since at every step the entire diffusion window has to be passed through the model. With AR models only the most recent token goes through, which is much more compute-efficient and lets you stay memory bound. The trade-off is that diffusion produces more than one token per forward pass, but I don't know the point where that becomes worth it (probably depends on the model and diffusion window size).
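To make that trade-off concrete, here's a toy back-of-envelope sketch (my own, with made-up but plausible numbers: a 7B-parameter fp16 model and a hypothetical accelerator) using a roofline-style cost model where each forward pass costs the max of compute time and the time to stream the weights from memory:

```python
PARAMS = 7e9                    # assumed model size (parameters)
WEIGHT_BYTES = PARAMS * 2       # fp16 weights
PEAK_FLOPS = 1e15               # assumed peak compute, FLOP/s
MEM_BW = 2e12                   # assumed memory bandwidth, bytes/s

def step_time(tokens_per_pass):
    """One forward pass: max of compute time and weight-streaming time."""
    compute = 2 * PARAMS * tokens_per_pass / PEAK_FLOPS  # ~2 FLOPs/param/token
    memory = WEIGHT_BYTES / MEM_BW                       # weights read once per pass
    return max(compute, memory)

# Autoregressive decode with a KV cache: one token per pass -> memory bound.
ar_time_per_token = step_time(1)

# Diffusion: a window of W tokens refined over S denoising steps,
# yielding W tokens per window. Each step pushes the whole window through.
W, S = 64, 16
diff_time_per_token = S * step_time(W) / W

print(f"AR:        {ar_time_per_token * 1e3:.2f} ms/token")
print(f"Diffusion: {diff_time_per_token * 1e3:.2f} ms/token")
```

Under these assumptions both regimes are memory bound (streaming 14 GB of weights dominates), so the diffusion model amortizes each weight read over W/S ≈ 4 effective tokens and comes out ahead per token; crank S up toward W and the advantage disappears. This ignores KV-cache traffic, batching, and attention cost, so it's only meant to show where the crossover point comes from, not to settle it.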