wizzwizz4's comments | Hacker News

It's not that hard to drive a car! Unfortunately, physics motivates us to have unreasonable expectations of our drivers, like "doesn't drive off the road at 100km/h ever", and "avoids all obstacles all of the time". That's the hard part.

You can't use minimax with alpha-beta pruning for proof search, but that's sufficient to play chess at a high level. I don't see what you're seeing. Chess and mathematics are completely different kinds of problem.

I know it as Stone Porridge. These stories probably share an origin. https://sites.pitt.edu/~dash/type1548.html is sitting in my browser bookmarks, and https://en.wikipedia.org/wiki/Stone_Soup has some variations not listed on that page.

In Sweden it is "Koka soppa på en spik": "to make soup with an (iron) nail".

We also eat nettle soup with half a boiled egg. I would not call it bland; it is just a dish that does not scream in your face at the top of its voice.


I'd also like to add that I'd consider it a delicacy, because it is pretty much the first vegetable you can harvest in spring. And you don't have to have a garden; you can just go out and pick it just about anywhere.

One of my all-time favorite stories.

My dear mother told me this story when I was just a boy. I was enchanted by the idea of this magical stone, too young to consider the clever trick the tramp was playing on the woman.

The sense of cooking being a magical endeavor has stayed with me ever since.


It wasn't a trick: the magic stone was a big salt rock, the most important ingredient from a flavor standpoint.

Hah, I misinterpreted it a different way as a kid: for a long time I thought it was like a collective delusion, where the shared experience of contributing insubstantial garnishes to a pot of water tricked everyone into finding it filling and enjoyable.

While that was the way it was taught to me as a kid, I thought it was more of a story about con men who came to a village, tricked the townsfolk into eating their entire winter rations in a grand feast, and then skipped town before anyone realized what they had done.

It's not the "modern expert system", unless you're throwing away the existing definition of "expert system" entirely, and re-using the term-of-art to mean "system that has something to do with experts".

I don't know what the parent was referring to, but IMO "expert system" is one of the more accurate and insightful ways of describing LLMs.

An expert system is generically a system of declarative rules, capturing an expert's knowledge, that can be used to solve problems.

Traditionally, expert systems are symbolic systems, representing the rules in a language such as Prolog, with those rules having been laboriously hand-derived, but none of this seems core to the definition.

A pre-trained LLM can be considered an expert system that captures the rules of auto-regressive language generation needed to predict the training data. These rules are represented by the weights of a transformer, and were learnt by SGD rather than hand-coded, but so what?


If you can extract anything resembling a declarative rule from the weights of a transformer, I will put you in for a Turing award.

Expert systems are a specific kind of thing (see https://en.wikipedia.org/wiki/Expert_system#Software_archite...): any definition you've read is a description. If the definition includes GPT models, the definition is imprecise.


Well, OK, perhaps not a declarative rule, more a procedural one (induction heads copying data around, and all that) given the mechanics of transformer layers, but does it really make a conceptual difference?

Would you quibble if an expert system was procedurally coded in C++ rather than in Prolog? "You see this pattern, do this".


Yes, it makes a conceptual difference. Expert systems make decisions according to an explicit, explicable world model consisting of a database of facts, which can be cleanly separated from the I/O subsystems. This does not describe a transformer-based generative language model. The mathematical approaches for bounding the behaviour of a language model are completely different to those involved in bounding the behaviour of an expert system. (And I do mean completely different: computer programs and formal logic are unified in fields like descriptive complexity theory, but I'm not aware of any way to sensibly unify mathematical models of expert systems and LLMs under the same umbrella – unless you cheat and say something like cybernetics.)
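
To make that concrete, here's a toy forward-chaining sketch (the facts and rules are invented purely for illustration): the fact database and the rule set are explicit objects you can read and reason about, independently of whatever I/O sits around them.

    # Toy expert system: an explicit fact database plus declarative rules.
    # The domain knowledge here is made up for illustration only.
    facts = {("engine", "won't start"), ("battery", "voltage ok")}

    # Each rule: (premises that must all be in the fact base, conclusion).
    rules = [
        ({("engine", "won't start"), ("battery", "voltage ok")},
         ("suspect", "starter motor")),
        ({("suspect", "starter motor")},
         ("recommend", "test the starter relay")),
    ]

    # Forward chaining: keep applying rules until no new facts appear.
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True

    print(facts)  # every derived fact traces back to named rules and facts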

You could compile an expert system into C++, and I'd still call it an expert system (even if the declarative version was never written down), but most C++ programs are not expert systems. Heck, a lot of Prolog programs aren't! To the extent a C++ program representing GPT inference is an expert system, it's the trivial expert system with one fact.


That's not what "world-model" means: see https://en.wiktionary.org/wiki/world_model. Your [2] is equivocating in an attempt to misrepresent the state-of-the-art. Genie 3 is technically impressive, don't get me wrong, but it's strictly inferior to procedural generation techniques from the 20th century, physics simulation techniques from the 20th century, and PlayStation 2-era graphics engines. (Have you seen the character models in the 2001 PS2 port of Half-Life? That's good enough.)

Inferior in what sense? Genie 3 is addressing a fundamentally different problem to a physics sim or procgen: building a good-enough (and broad-enough) model of the real world to train agents that act in the real world. Sims are insufficient for that purpose, hence the "sim2real" gap that has stymied robotics development for years.

Genie 3 is inferior in the sense you just described: the sim2real gap would be greater, because it's a less accurate model of the aspects of the world that are relevant to robotics.

A random walk can do mathematics, with this kind of infrastructure.

Isabelle/HOL has a tool called Sledgehammer, which is the hackiest hack that ever hacked[0], basically amounting to "run a load of provers in parallel, with as much munging as it takes". (Plumbing them together is a serious research contribution, which I'm not at all belittling.) I've yet to see ChatGPT achieve anything like what it's capable of.
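
The core portfolio trick is easy enough to sketch, mind (this is only an illustration of the idea, not Isabelle's implementation; the prover command lines are placeholders):

    # Portfolio proving, crudely: throw several external provers at the
    # same goal in parallel, and take the first that reports a proof.
    # The commands below are placeholders, not real invocations.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor, as_completed

    PROVERS = {
        "prover_a": ["prover_a", "goal.p"],
        "prover_b": ["prover_b", "goal.p"],
        "prover_c": ["prover_c", "goal.p"],
    }

    def run_prover(name, cmd, timeout=30):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout)
            return name, result.returncode == 0, result.stdout
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return name, False, ""

    with ThreadPoolExecutor(max_workers=len(PROVERS)) as pool:
        futures = [pool.submit(run_prover, n, c) for n, c in PROVERS.items()]
        for future in as_completed(futures):
            name, proved, output = future.result()
            if proved:
                # The munging mentioned above (translating goals out and
                # proofs back into Isabelle) is where the real work is.
                print(f"{name} found a proof")
                break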

[0]: https://lawrencecpaulson.github.io/2022/04/13/Sledgehammer.h...


Yeah, but random walks can't improve upon the state of the art on many-dimensional numerical optimisation problems of the nature discussed here: they're easy enough to implement to have been tried already and to have had their usefulness exhausted. This does present a meaningful improvement over them in its domain.

When I see announcements that say "we used a language model for X, and got novel results!", I play a little game where I identify the actual function of the language model in the system, and then replace it with something actually suited for that task. Here, the language model is used as the mutation / crossover component of a search through the space of computer programs.

What you really want here is to represent the programs using an information-dense scheme, endowed with a pseudoquasimetric such that semantically-similar programs are nearby (and vice versa); then explore the vicinity of successful candidates. Ordinary compression algorithms satisfy "information-dense", but the metrics they admit aren't that great. Something that does work pretty well is embedding the programs into the kind of high-dimensional vector space you get out of a predictive text model: there may be lots of non-programs in the space, but (for a high-quality model) those are mostly far away from the programs, so exploring the neighbourhood of programs won't encounter them often. Because I'm well aware of the flaws of such embeddings, I'd add some kind of token-level fuzzing to the output, biased to avoid obvious syntax errors: that usually won't move the embedding much, but will occasionally jump further (in vector space) than the system would otherwise search.
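
A sketch of that loop, to be concrete (embed, score and fuzz_tokens stand in for an embedding model, a fitness function and a syntax-aware fuzzer; none of this is from the paper):

    # Neighbourhood search over programs in an embedding space (sketch).
    # embed(src) -> vector, score(src) -> fitness, fuzz_tokens(src) -> variant.
    import random
    import numpy as np

    def search(seed_programs, embed, score, fuzz_tokens,
               generations=100, children=20, radius=2.0, far_accept=0.05,
               pool_size=50):
        pool = [(score(p), p, embed(p)) for p in seed_programs]
        for _ in range(generations):
            _, parent, parent_vec = max(pool, key=lambda t: t[0])
            for _ in range(children):
                child = fuzz_tokens(parent)      # mostly small semantic steps
                child_vec = embed(child)
                dist = float(np.linalg.norm(child_vec - parent_vec))
                # Mostly stay in the parent's semantic vicinity; keep the
                # occasional longer jump so the search can escape ruts.
                if dist > radius and random.random() > far_accept:
                    continue
                pool.append((score(child), child, child_vec))
            pool = sorted(pool, key=lambda t: t[0], reverse=True)[:pool_size]
        return max(pool, key=lambda t: t[0])[1]

The far_accept knob is the "occasionally jump further" part; everything else is just hill-climbing on the embedding's neighbourhood structure.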

So, an appropriate replacement for this generative language model would be some kind of… generative language model. Which is why I'm impressed by this paper.

There are enough other contributions in this paper that slotting a bog-standard genetic algorithm over program source in place of the language model could achieve comparable results; but I wouldn't expect it to be nearly as effective in each generation. If the language model is a particularly expensive part of the runtime (as the paper suggests might be the case), then I expect it's worth trying to replace it with a cruder-but-cheaper bias function; but otherwise, you'd need something more sophisticated to beat it.

(P.S.: props for trying to bring this back on-topic, but this subthread was merely about AI hype, not actually about the paper.)

Edit: Just read §3.2 of the paper. The empirical observations match the theory I've described here.


A random walk could not do the mathematics in this article, which was essentially the entire starting point for the article.

Embedding other people's work in a vector space, then sampling from the distribution at a different point in the vector space, is not a central member of the "transformative" category. The justifications for allowing transformative uses do not apply to it.

That does seem to be the plurality opinion, yes. But you are responding to someone saying that what counts as transformative hasn't been decided, by saying that you have decided. We don't know how human brains do it. What if we found that humans actually do it in the same way? Would that alter the dialog, or should we still give preference to humans? If we should, why should we?

> or should we still give preference to humans? If we should, why should we?

Because of the scaling limits of a human brain: you cannot plug more brains into a building to pump out massive amounts of transformative work. It takes a great deal for humans to do it, which creates a natural limit on the scale that's possible.

Scale and degree matter even if the process is 100% analogous to how humans do it. The only natural limitation for computers is compute, which requires some physical server space and electricity, both of which can be minimised with further technological advances. This completely changes the foundation of the concept of "transformative work", which before required a human being.


This is a good observation, and motivates adjusting the legal definition of "transformative" even if the previous definition did include what generative AI systems can now do.

> What if we found that humans actually do it in the same way?

We know that humans don't – or, at least, aren't limited to this approach. A quick perusal of an Organization for Transformative Works project (e.g. AO3, Fanlore) will reveal a lot of ideas which are novel. See the current featured article on Fanlore (https://fanlore.org/wiki/Stormtrooper_Rebellion), or (after heavy filtering) the Crack Treated Seriously tag on AO3 (https://archiveofourown.org/works?work_search[sort_column]=k...). You can't get stuff like this from a large language model.


People could be doing their own transformative works, and then posting them to tumblr or whatever with a “Ghibli style” tag or something.

Critiques like this dismiss AI as a bunch of multiplications, while in reality it is backed by extensive research, implementation, and data preparation. There's enormous complexity behind it, making it difficult to categorize as simply transformative or not.

The Pirate Bay is also backed by extensive research, implementation, and data preparation. I'm not dismissing anything as "a bunch of multiplications" – you'll note I talked about embedding in vector spaces, not matrix multiplication. (I do, in fact, know what I'm talking about: if you want to dismiss my criticism, please actually engage with it like the other commenters have.)

Development of a product like ChatGPT has been orders of magnitude more resource-intensive than the Pirate Bay, by any measure. It's perplexing how, when some people think of LLMs, they talk as if they could develop one in an afternoon; it's specifically their complexity that warrants the question of whether they can be considered transformative in nature.

> A new language feature is released, you cannot apply it to old code, since that would make a big PR.

Good. Don't change code for the sake of shiny new things syndrome.

> A better static type checker, that finds some bugs for you, you cannot fix them as your PR would be too big,

Good. Report each bug separately, with a suggested fix, categorised by region of the code. Just because you ran the program, that doesn't mean you understand the code well enough to actually fix stuff: those bugs may be symptomatic of a deeper issue with the module they're part of. The last thing you need is to turn accidentally-correct code into subtly-wrong code.

If you do understand the code well enough, what's the harm in submitting each bugfix as a separate (independent) commit? It makes it easier for the reviewers to go "yup, yup, yup", rather than having to think "does this part affect that part?".


Look into SAG-AFTRA.

Unfortunately, (this kind of) AI doesn't accelerate review. (That's before you get into the ease of producing adversarial inputs: a moderation system not susceptible to these could be wired up backwards as a generation system that produces worthwhile research output, and we don't have one of those.)

I'm skeptical: use two different AIs which don't share the same weaknesses + random sample of manual reviews + blacklisting users that submit adversarial inputs for X years as a deterrent.

But how do you know an input is adversarial? There are other issues: verdicts are arbitrary; the false positive rate means you'd need manual review of all the rejects (unless you wanted to reject something like 5% of genuine research); you need the appeals process to exist, and you can't automate that, so bad actors can still flood your bureaucracy even if you do implement an automated review process…

I'm not on the moderation bandwagon to begin with per the above, but if an organization invents a bunch of fake reasons that they find convincing, then any system they come up with is going to have its flaws. Ultimately, the goal is to make cooperation easy and defection costly.

> But how do you know an input is adversarial?

Prompt injection and jailbreaking attempts are pretty clear. I don't think anything else is particularly concerning.

> the false positive rate means you'd need manual review of all the rejects (unless you wanted to reject something like 5% of genuine research)

Not all rejects, just those that submit an appeal. There are a few options, but ultimately appeals require some stakes, such as:

1. Every appeal carries a receipt for a monetary donation to arxiv that's refunded only if the appeal succeeds.

2. Appeal failures trigger the ban hammer with exponentially increasing durations, e.g. 1 month, 3 months, 9 months, 27 months, etc. (sketched below).

Bad actors either respond to deterrence or get filtered out while funding the review process itself.
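
The escalation in option 2 is trivial to pin down precisely (a sketch; the numbers are just the ones above):

    # Option 2: exponentially escalating ban durations for failed appeals.
    def ban_months(failed_appeals: int) -> int:
        return 3 ** (failed_appeals - 1) if failed_appeals > 0 else 0

    print([ban_months(n) for n in range(1, 5)])  # [1, 3, 9, 27]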


> I don't think anything else is particularly concerning.

You can always generate slop that passes an anti-slop filter, if the anti-slop filter uses the same technology as the slop generator. Side-effects may include: making it exceptionally difficult for humans to distinguish between adversarial slop, and legitimate papers. See also: generative adversarial networks.

> Not all rejects, just those that submit an appeal.

So, drastically altering the culture around how the arXiv works. You have correctly observed that "appeals require some stakes" under your system, but the arXiv isn't designed that way – and for good reason. An appeal is either "I think you made a procedural error" or "the valid procedural reasons no longer apply": adding penalties for using the appeals system creates a chilling effect, skewing the metrics that people need to gain insight as to whether a problem exists.

Look at the article numbers. Year, month, and then a 5-digit code. It is not expected that more than 100k articles will be submitted in a given month, across all categories. If the arXiv ever needs a system that scales in the way yours does, with such sloppy tolerances, then it'll be so different to what it is today that it should probably have a different name.

If we were to add stakes, I think "revoke endorsement, requiring a new set of endorsers" would be sufficient. (arXiv endorsers already need to fend off cranks, so I don't think this would significantly impact them.) Exponential banhammer isn't the right tool for this kind of job, and I think we certainly shouldn't be getting the financial system involved (see the famous paper A Fine is a Price by Uri Gneezy and Aldo Rustichini: https://rady.ucsd.edu/_files/faculty-research/uri-gneezy/fin...).

