
To put it another way: it is not clear when and how they hallucinate. With a person, you can gauge both their competence and their limits. But an LLM can happily give different answers based on trivial changes in the question, with no warning.


In a conversation (conversation and attached pictures at https://bsky.app/profile/liotier.bsky.social/post/3ldxvutf76...), I deleted a spurious "de" ("Produce de two-dimensional chart [..]" to "Produce two-dimensional [..]") and ChatGPT generated a new version of the graph illustrating a different function, although nothing else had changed and the whole preceding conversation suggested that ChatGPT held a firm model of the problem. This confirmed my current doctrine: use the LLM to surface concepts from a huge messy corpus, then check those against sources from said corpus.


LLMs are non-deterministic: they'll happily give different answers to the same prompt based on nothing at all. This is actually great if you want to use them for "creative" content generation tasks, which is IMHO what they're best at. (Along with processing of natural language input.)

Expecting them to do non-trivial amounts of technical or mathematical reasoning, or even something as simple as code generation (other than "translate these complex natural-language requirements into a first sketch of viable computer code") is a total dead end; these will always be language systems first and foremost.


This confuses me. You have your model, you have your tokens.

If the tokens are bit-for-bit-identical, where does the non-determinism come in?

If the tokens are only roughly-the-same-thing-to-a-human, sure I guess, but convergence on roughly the same output for roughly the same input should inherently be a goal of LLM development.


Most any LLM has a "temperature" setting: a knob that controls how much randomness the sampling step injects (the weights themselves stay fixed), intentionally causing exactly this nondeterministic behavior. Good for creative tasks, bad for repeatability. If you're running one of the open models, turn the temperature down to 0 and it suddenly becomes perfectly consistent.
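
Roughly what that knob does, as a toy sketch (not any particular inference engine's actual code; the logits are made up):

  import numpy as np

  def sample_next_token(logits, temperature, rng):
      # Temperature rescales the logits before they become probabilities;
      # the model weights themselves are untouched.
      if temperature == 0:
          return int(np.argmax(logits))          # greedy: always the top token
      scaled = logits / temperature
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      return int(rng.choice(len(logits), p=probs))

  rng = np.random.default_rng()
  logits = np.array([4.0, 3.5, 1.0])             # made-up scores for 3 candidate tokens
  print([sample_next_token(logits, 0.0, rng) for _ in range(5)])  # always token 0
  print([sample_next_token(logits, 1.5, rng) for _ in range(5)])  # mix of 0s and 1s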


You can get deterministic output even with a high temp.

Whatever "random" seed was used can be reused.


The model outputs probabilities, which you then have to sample from. Always choosing the highest-probability token leads to poor results in practice, such as the model tending to repeat itself. It's a sort of Monte Carlo approach.
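
The repetition problem is easy to see even with a toy model; here a made-up bigram table stands in for a real LLM:

  import random

  # Made-up next-word probabilities given the current word.
  bigram = {
      "the": [("cat", 0.5), ("dog", 0.3), ("mat", 0.2)],
      "cat": [("sat", 0.6), ("ran", 0.4)],
      "dog": [("ran", 0.7), ("sat", 0.3)],
      "sat": [("on", 0.9), ("down", 0.1)],
      "ran": [("to", 0.8), ("away", 0.2)],
      "on":  [("the", 0.9), ("a", 0.1)],
      "to":  [("the", 0.9), ("a", 0.1)],
      "a":   [("cat", 0.5), ("dog", 0.5)],
      "down": [("the", 1.0)],
      "away": [("the", 1.0)],
      "mat": [("the", 1.0)],
  }

  def decode(start, steps, greedy):
      word, out = start, [start]
      for _ in range(steps):
          options = bigram[word]
          if greedy:
              word = max(options, key=lambda o: o[1])[0]        # always the top choice
          else:
              words, weights = zip(*options)
              word = random.choices(words, weights=weights)[0]  # sample instead
          out.append(word)
      return " ".join(out)

  print(decode("the", 12, greedy=True))   # loops: "the cat sat on the cat sat on ..."
  print(decode("the", 12, greedy=False))  # sampled: varied, less repetitive, each run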


The trained model is just a bunch of statistics. To use those statistics to generate text you need to "sample" from the model. If you always sampled by taking the model's #1 token prediction that would be deterministic, but more commonly a random top-K or top-p token selection is made, which is where the randomness comes in.
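
A sketch of what top-k / top-p selection does with a made-up distribution (real implementations work on logits and are fussier about ties and renormalization):

  import numpy as np

  def top_k_top_p_sample(probs, k=None, p=None, rng=None):
      # Restrict sampling to the k most likely tokens and/or to the smallest
      # set of tokens whose cumulative probability reaches p (the "nucleus").
      rng = rng or np.random.default_rng()
      order = np.argsort(probs)[::-1]            # token ids, most likely first
      sorted_probs = probs[order]
      keep = np.ones(len(probs), dtype=bool)
      if k is not None:
          keep &= np.arange(len(probs)) < k
      if p is not None:
          keep &= np.cumsum(sorted_probs) - sorted_probs < p
      kept = sorted_probs * keep
      kept /= kept.sum()                         # renormalize over the survivors
      return int(order[rng.choice(len(probs), p=kept)])

  probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
  print(top_k_top_p_sample(probs, k=3))          # only tokens 0-2 can come back
  print(top_k_top_p_sample(probs, p=0.9))        # the long tail is cut off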


It is technically possible to make it fully deterministic if you have complete control over the model, quantization and sampling processes. The GP probably meant that most commercially available LLM services don't give such control.
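
For the curious, here's roughly what exercising that control looks like when running an open model locally with the Hugging Face transformers library (a sketch; "gpt2" is just an example model, and GPU kernels or quantization can still reintroduce variation, which is part of the point):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

  torch.use_deterministic_algorithms(True)   # fail loudly on nondeterministic kernels
  name = "gpt2"                              # example; any local causal LM works
  tok = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name)
  ids = tok("The same prompt, twice:", return_tensors="pt").input_ids

  set_seed(0)                                # seeds the Python, NumPy and torch RNGs
  out1 = model.generate(ids, do_sample=True, temperature=0.9, max_new_tokens=20)
  set_seed(0)                                # reset before the second run
  out2 = model.generate(ids, do_sample=True, temperature=0.9, max_new_tokens=20)
  assert torch.equal(out1, out2)             # identical, despite sampling at temp 0.9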


Actually you just have to set temperature to zero.


> If the tokens are bit-for-bit-identical, where does the non-determinism come in?

By design, most LLMs have a randomization factor in how they sample their output. Many use the concept of "temperature", which makes them sometimes choose the 2nd- or 3rd-highest-ranked next token; the higher the temperature, the more often (and the lower-ranked) the non-best token they pick. OpenAI described this in their papers around the GPT-2 timeframe, IIRC.
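
A quick way to see that effect, with made-up logits for the top three candidate tokens:

  import numpy as np

  logits = np.array([5.0, 4.0, 3.0])   # made-up scores: best, 2nd, 3rd ranked token

  for t in (0.2, 0.7, 1.0, 2.0):
      p = np.exp(logits / t)
      p /= p.sum()
      print(f"temperature {t}: P(best)={p[0]:.2f}  P(2nd)={p[1]:.2f}  P(3rd)={p[2]:.2f}")
  # At 0.2 the best token wins ~99% of the time; at 2.0 the 2nd and 3rd
  # together get picked roughly half the time.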


Computers are deterministic. LLMs run on computers. If you use the same seed for the random number generator, you'll see that it produces the same output for a given input.


The unreliability of LLMs is mostly unrelated to their (artificially injected) non-determinism.


There's no need for any change to the question. LLMs have an RNG step built into the sampling algorithm, so one can happily give you the right answer and then the wrong one.


> trivial changes in the question

I love how those changes often amount to nothing more than a different random seed... pure chance.

I ran some repeated tests requiring deeper-than-surface knowledge of some niche subjects and was impressed that it gave the right answer... about 20% of the time.

(on earlier OpenAI models)


Ask survey designers how “trivial” changes to questions impact results from humans. It’s a huge thing in the field.



