
I was hoping that this would be about Llama 1 and a comparison with GPT-contaminated models.

Unfortunately, I am also worried that that is the case.

There was an era where there were a lot of completely free sites, because they were mostly academic or passion projects, both of which were subsidized by other means.

Then there were ads: banner ads, Google's less obtrusive text ads, etc. There were a number of sites completely supported by ads, including a lot of blogs.

And forums. Google+ managed to kill a lot of niche communities by offering them a much easier way to create a community and then killing it off.

Now forums have been replaced by Discord and Reddit. Deep project sites still exist but are rarer. Social media has consolidated. Most people don't have personal home pages. There's a bunch of stuff that's paywalled behind Patreon.

And all of that has been happening before anyone threw AI into the mix.


Buying a book scanner and frequenting used book stores seems like a pastime worth starting that'll pay off in the long term.


There is an awful lot of "looking for my keys under the street light" going around these days. I've seen a bunch of projects proposed that are either based on existing data (but have no useful application of that data) or have a specific application (but lack the data and evaluation required to perform that task). It doesn't matter how good your data is if no one has any use for things like it, and it doesn't matter how neat your application would be if the data doesn't match.

I'm including things like RL metrics as data here, for lack of a better umbrella term. The number of proposed projects I've seen that decided ongoing evaluation of actual effectiveness was a distraction from the more important task of having expensive engineers turn expensive servers into expensive heatsinks is maddening.


The importance of having good metrics cannot be overstated.

On the "applying X" problem - this almost feels to me like another argument against fine tuning? Because it seems like Applying can be a surprisingly broad skill, and frontier lab AIs are getting good at Applying in a broad fashion.


Not rainforest, but rather savanna [1].

The Arabian desert is technically considered to be part of the Sahara, climate-wise, and participates in the same cycle [2].

This article is about researching evidence for what those transitions looked like, focusing on evidence that dates to around the end of that particular dry period, pre-Holocene.

> Prior to the onset of the Holocene humid period, little is known about the relatively arid period spanning the end of the Pleistocene and the earliest Holocene in Arabia. An absence of dated archaeological sites has led to a presumed absence of human occupation of the Arabian interior. However, superimpositions in the rock art record appear to show earlier phases of human activity, prior to the arrival of domesticated livestock [25].

[1]: https://en.wikipedia.org/wiki/African_humid_period

[2]: https://www.nationalgeographic.com/environment/article/green...


Not strictly true: while this was previously believed to be the case, Anthropic demonstrated that transformers can "think ahead" in some sense, for example when planning rhymes in a poem [1]:

> Instead, we found that Claude plans ahead. Before starting the second line, it began "thinking" of potential on-topic words that would rhyme with "grab it". Then, with these plans in mind, it writes a line to end with the planned word.

They described the mechanism that it uses internally for planning [2]:

> Language models are trained to predict the next word, one word at a time. Given this, one might think the model would rely on pure improvisation. However, we find compelling evidence for a planning mechanism.

> Specifically, the model often activates features corresponding to candidate end-of-next-line words prior to writing the line, and makes use of these features to decide how to compose the line.

[1]: https://www.anthropic.com/research/tracing-thoughts-language...

[2]: https://transformer-circuits.pub/2025/attribution-graphs/bio...


Thank you for these links! Their "circuits" research is fascinating. In the example you mention, note how the planned rhyme is piggybacking on the newline token. The internal state that the emergent circuits can use is mapped 1:1 to the tokens. The model cannot trigger the insertion of a "null" token for the purpose of storing this plan-ahead information during inference. Nor are there any sort of "registers" available aside from the tokens. The "thinking" LLMs are not quite that, because the thinking tokens are still forced to become text.


So, what I think most people don't realize is that the amount of computation an LLM can do in one pass is strictly bounded. You can see that here with the layers. (This applies to a lot of neural networks [1].)

Remember, they feed in the context on one side of the network, pass it through each layer doing matrix multiplication, and get a value on the other end that we convert back into our representation space. You can view the bit in the middle as doing a kind of really fancy compression, if you like. The important thing is that there are only so many layers, and thus only so many operations.

Therefore, past a certain point the model can't revise anything, because it runs out of layers. This is one reason why reasoning can help answer more complicated questions. You can train a special token for this purpose [2].
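
To make that concrete, here's a rough PyTorch-style sketch (toy sizes, and I'm skipping details like positional encodings and the causal mask): the layer stack is a fixed-depth pipeline, so every forward pass gets the same bounded amount of sequential computation, and emitting more tokens is the only way to buy more passes.

    import torch
    import torch.nn as nn

    class TinyTransformer(nn.Module):
        def __init__(self, vocab=32000, d_model=512, n_layers=12, n_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.layers = nn.TransformerEncoder(block, num_layers=n_layers)
            self.unembed = nn.Linear(d_model, vocab)

        def forward(self, token_ids):
            x = self.embed(token_ids)   # (batch, seq, d_model)
            x = self.layers(x)          # exactly n_layers sequential blocks, never more
            return self.unembed(x)      # next-token logits

    model = TinyTransformer()
    logits = model(torch.randint(0, 32000, (1, 16)))  # one pass, one fixed compute budget

No matter how hard the question is, the depth of that pipeline doesn't change; only generating additional (e.g. reasoning) tokens triggers additional passes.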

[1]: https://proceedings.neurips.cc/paper_files/paper/2023/file/f...

[2]: https://arxiv.org/abs/2310.02226


There have been a few attempts at training a backspace token, though.

e.g.:

https://arxiv.org/abs/2502.04404

https://arxiv.org/abs/2306.05426


Adding knowledge works, depending on how you define "knowledge" and "works"; given sufficient data you can teach an LLM new things [1].

However, the frontier models keep improving at a quick enough rate that it's often more effective to just wait for the general solution to catch up with your task than to spend months training a model yourself. Unless you need a particularly tightly controlled behavior, or a smaller, faster model, or what have you. Training new knowledge in can get weird [2].

And in-context learning takes literal seconds to minutes if your information fits in the context window, so it's a lot faster to go that route if you can.
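
As a hypothetical sketch of that in-context route (the document and question are invented, and I've left out the actual inference call, which varies by provider): "teaching" the model this way is just prompt assembly, with no weight updates involved.

    def build_prompt(question: str, documents: list[str]) -> str:
        """Prepend the new knowledge to the prompt; the weights never change."""
        context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
        return (
            "Answer using only the reference material below.\n\n"
            f"Reference material:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )

    if __name__ == "__main__":
        docs = ["The Foo 3000 widget shipped in March and runs on a 5V supply."]  # made-up fact
        print(build_prompt("What voltage does the Foo 3000 use?", docs))

Whatever that prompt gets sent to, the turnaround is one request rather than a training run.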

[1] https://arxiv.org/abs/2404.00213

[2] https://openreview.net/forum?id=NGKQoaqLpo


That's consistent with other research I've seen, where varied presentation of the data is key to effective knowledge injection [1].

My assumption, based on the research, is that training on different prompts with the same answer gives you more robust Q&A behavior; training on variations of how to express the same concept generalizes. Training on the same prompt with different answers gives you creative diversity [2].
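
As a toy illustration of those two directions (my own invented examples, not the actual setup from either paper):

    # Knowledge injection: vary the question, hold the answer fixed.
    fact_prompts = [
        "When was the Acme reactor commissioned?",
        "What year did the Acme reactor come online?",
        "The Acme reactor was commissioned in which year?",
    ]
    fact_answer = "The Acme reactor was commissioned in 1987."  # invented fact
    knowledge_pairs = [{"prompt": p, "completion": fact_answer} for p in fact_prompts]

    # Creative diversity: hold the prompt fixed, vary the answer.
    creative_prompt = "Write a one-line slogan for the Acme reactor."
    creative_answers = [
        "Power that never sleeps.",
        "Energy you can set your watch by.",
        "Still glowing strong since 1987.",
    ]
    diversity_pairs = [{"prompt": creative_prompt, "completion": a} for a in creative_answers]

The first kind of pair seems to drive robust recall; the second seems to drive output variety.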

[1] https://arxiv.org/abs/2404.00213

[2] https://arxiv.org/abs/2503.17126

