
The algorithms are not deterministic: they output a probability distribution over next tokens, which is then sampled. That’s why clicking “retry” gives you a different answer. An LM could easily (in principle) compute a 50/50 distribution when asked to flip a coin.
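
A toy sketch of that sampling step (the logits and tokens are made up, nothing model-specific):

    import numpy as np

    # Hypothetical logits for just two candidate next tokens, "heads" and "tails";
    # the softmax turns them into the probability distribution that gets sampled.
    logits = np.array([2.3, 2.3])
    probs = np.exp(logits) / np.exp(logits).sum()   # -> [0.5, 0.5]

    # Each "retry" re-samples from this distribution, so the answer can differ.
    print(np.random.default_rng().choice(["heads", "tails"], p=probs))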


They are still deterministic. You can set temperature to zero to get consistent output, but even with a nonzero temperature the sampling usually uses a seed and a pseudo-random number generator, though this depends on the implementation.

https://github.com/huggingface/transformers/blob/d538293f62f...


As someone who has tried really hard to get deterministic output out of them: they really are not.

Layers can be computed in slightly different orders (due to parallelism), on different GPU models, and this will cause small numerical differences which will compound due to auto-regression.
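
A toy illustration of that numerical effect (pure Python, nothing GPU-specific): floating-point addition is not associative, so summing the same values in a different order gives a slightly different result.

    import random

    xs = [0.1 * i for i in range(10_000)]
    shuffled = xs[:]
    random.shuffle(shuffled)

    # Same numbers, different order of additions -> tiny nonzero difference,
    # which auto-regressive decoding can amplify into different tokens.
    print(sum(xs) - sum(shuffled))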


Could someone enlighten me on how to compute layers in parallel? I was under the impression that the linearity of the layer computation was why we were mostly bandwidth constrained. If you can compute the layers in parallel, then why do we need high bandwidth?



All things being equal, if you fix all of those things and the hardware isn't buggy, you get the same results, and I've set up CI with golden values that requires this to be true. Indeed, occasionally you have to change golden values depending on the implementation, but mathematically the algorithm is deterministic, even if in practice determinism requires a bit more effort.
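
Something like this sketch (run_model and the golden file are hypothetical stand-ins); the point is that the comparison is exact, not approximate:

    import json

    def test_model_is_deterministic():
        with open("golden_logits.json") as f:        # stored "golden values"
            golden = json.load(f)
        logits = run_model("fixed prompt", seed=0)   # hypothetical model wrapper
        assert logits == golden                      # bit-exact, no tolerance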


But the reality is that all things aren’t equal and you can’t fix all of those things, not in a way that is practical. You’d have to run everything serially (or at least in a way you can guarantee identical order) and likely emulated so you can guarantee identical precision and operations. You’ll be waiting a long time for results.

Sure, it’s theoretically deterministic, but so are many natural processes like air pressure, or the three body problem, or nuclear decay, if only we had all the inputs and fixed all the variables, but the reality is that we can’t and it’s not particularly useful to say that well if we could it’d be deterministic.


It's definitely achievable in practice. Gemini 2.0 Flash is 100% deterministic at temperature 0, for example. I guess it's due to the TPU hardware (but then why aren't other Gemini models like that...).


Anyway, this is all immaterial to the original question, which is whether LLMs can do randomness [for a single user with a given query]. From a practical standpoint the question itself needs to survive "all things being equal", that is to say: suppose I stand up an LLM on my own GPU rig, and the algorithmic scheduler doesn't do too many out-of-order operations (very possible depending on the ollama or vllm build).


Setting the temperature to zero reduces the process to greedy search, which changes the output in more ways than just making it non-random.
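
A rough sketch of the difference (made-up logits): at temperature 0 every step is just an argmax, so even a perfectly flat heads/tails distribution always yields the same token.

    import numpy as np

    logits = np.array([2.3, 2.3])                    # hypothetical: heads, tails
    probs = np.exp(logits) / np.exp(logits).sum()

    greedy = int(np.argmax(logits))                  # always index 0, every run
    sampled = int(np.random.default_rng().choice(2, p=probs))   # 0 or 1, roughly 50/50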


Yes so it's basically asking whether that probability distribution is 50/50 or not. And it turns out that it's sometimes very skewed. Which is a non-obvious result.


So, what ‘algorithms’ are you talking about? The randomness comes from an input value (the random seed). Once you give it a random seed, a pseudo-random number generator (PRNG) produces a sequence from that seed. When the LLM needs to ‘flip a coin’, it just consumes a value from the PRNG’s output sequence.

Think of each new ‘interaction’ with the LLM as having two things that can change: the context and the PRNG state. We can also think of the PRNG state as having two things: the random seed (which makes the output sequence), and the index of the last consumed random value from the PRNG. If the context, random seed, and index are the same, then the LLM will always give the same answer. Just to be clear, the only ‘randomness’ in these state values comes from the random seed itself.

The LLM doesn’t produce any randomness; it needs randomness as an input (hyper)parameter.
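
A minimal sketch of that state (the distribution and tokens are made up): the same seed at the same position in the generator's stream gives the same ‘coin flips’, so all the randomness is in the seed.

    import numpy as np

    probs = [0.5, 0.5]                           # hypothetical coin-flip distribution

    def flips(seed, n=5):
        rng = np.random.default_rng(seed)        # PRNG state starts at the seed
        return [int(rng.choice(2, p=probs)) for _ in range(n)]

    print(flips(42))    # some fixed sequence of 0s and 1s
    print(flips(42))    # the exact same sequence again
    print(flips(43))    # a different seed gives a different sequence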


The raw output of a transformer model is a list of logits, confidence scores for each token in its vocabulary. It's only deterministic in this sense (same input = same scores). But it can easily assign equal scores to 1 and 0 and zero to other tokens, and you'll have to sample it randomly to produce the result. Whether you consider it external or internal doesn't matter, transformers are inherently probabilistic by design. Randomness is all they produce. And typically they aren't trained with the case of temperature 0 and greedy sampling in mind.


> But it can easily assign equal scores to 1 and 0 and zero to other tokens, and you’ll have to sample it randomly to produce the result. Whether you consider it external or internal doesn’t matter, transformers are inherently probabilistic by design.

The transformer operates on the probability distributions in a fully deterministic fashion; you might be missing the forest for the trees here. In your hypothetical, the transformer does not have a non-deterministic way of selecting the 1 or 0 token, so it will rely on an external noise source which does. It does not produce any randomness at all.


That's one way to look at it, but consider that you necessarily need the noise source in case 1 and 0 are strictly equal. You can't tell which one is the answer until you decide randomly.


Right, so the LLM needs some randomness to make that decision. The LLM performs a series of deterministic operations until it needs the randomness to make this decision; there is no randomness within the LLM itself.


But the randomness doesn't directly translate to a random outcome in results. It may randomly choose from a thousand possible choices, where 90% of the choices are some variant of 'the coin comes up heads'.

I think a more useful approach is to give the LLM access to an API that returns a random number, and let it ask for one during response formulation, when needed.
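
Something along these lines (tool name and schema are illustrative, not any particular framework's API):

    import json
    import random

    def random_int(low: int, high: int) -> int:
        """Tool the model can call when a response actually needs randomness."""
        return random.randint(low, high)

    TOOLS = {"random_int": random_int}

    def handle_tool_call(call: str) -> str:
        # e.g. the model emits: {"name": "random_int", "args": {"low": 0, "high": 1}}
        req = json.loads(call)
        return json.dumps({"result": TOOLS[req["name"]](**req["args"])})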


I think GP would consider the sampling bit a part of the API, not a part of the algorithm.




