More

Zababa · 2026-01-20T13:46:43 1768916803

Can you give examples of how that "LLM's do not think, understand, reason, reflect, comprehend and they never shall" or that "completely mechanical process" helps you understand better when LLM works and when they don't?

Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics to use LLMs well. The "they don't reason" people tend to, in my opinion/experience, underestimate them by a lot, often claiming that they will never be able to do <thing that LLMs have been able to do for a year>.

To be fair, the "they reason/are conscious" people tend to, in my opinion/experience, overestimate how much a LLM being able to "act" a certain way in a certain situation says about the LLM/LLMs as a whole ("act" is not a perfect word here, another way of looking at it is that they visit only the coast of a country and conclude that the whole country must be sailors and have a sailing culture).

encyclopedism · 2026-01-20T13:59:23 1768917563

We know what an LLM is in fact you can build one from scratch if you like. e.g. https://www.manning.com/books/build-a-large-language-model-f...

It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?

> Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics to use LLMs well.

My digital thermometer doesn't think. Imbibing LLM's with thought will start leading to some absurd conclusions.

A cursory read of basic philosophy would help elucidate why casually saying LLM's think, reason etc is not good enough.

What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition. Some things are so hard to define (and people have tried for centuries) e.g. what is consciousness? That they are a problem set within themselves please see Hard problem of consciousness.

https://en.wikipedia.org/wiki/Hard_problem_of_consciousness

amenhotep · 2026-01-21T22:26:10 1769034370

A cursory read of basic philosophy would surely include the arguments against Searle's Chinese room, no? It's hardly settled.

Zababa · 2026-01-20T14:11:15 1768918275

>My digital thermometer doesn't think. Imbibing LLM's with thought will start leading to some absurd conclusions.

What kind of absurd conclusions? And what kind of non absurd conclusions can you make when you follow your let's call it "mechanistic" view?

>It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?

I don't, just like I don't think a human or animal brain has any magical power that imbues it with "intelligence" and "reasoning".

>A cursory read of basic philosophy would help elucidate why casually saying LLM's think, reason etc is not good enough.

I'm not saying they do or they don't, I'm saying that from what I've seen having a strong opinion about whether they think or they don't seem to lead people to weird places.

>What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition.

You see pretty certain that whatever those three things are a LLM isn't doing it, a paper and pencil aren't doing it even when manipulated by a human, the system of a human manipulating a paper and pencil isn't doing it.

Zababa · 2026-01-20T11:02:05 1768906925

> Mistakes made by chatbots will be considered more important than honest human mistakes, resulting in the loss of more points.

>I thought this was fair. You can use chatbots, but you will be held accountable for it.

So you're held more accountable for the output actually? I'd be interested in how many students would choose to use LLMs if faults weren't penalized more.

jcattle · 2026-01-20T11:32:44 1768908764

I thought this part especially was quite ingenious.

If you have this great resource available to you (an LLM) you better show that you read and checked its output. If there's something in the LLM output you do not understand or check to be true, you better remove it.

If you do not use LLMs and just misunderstood something, you will have an (flawed) justification for why you wrote this. If there's something flawed in an LLM, the likelihood that you do not have any justification except for "the LLM said so" is quite high and should thus be penalized higher.

One shows a misunderstanding, the other doesn't necessarily show any understanding at all.

Zababa · 2026-01-20T12:37:00 1768912620

>If you have this great resource available to you (an LLM) you better show that you read and checked its output. If there's something in the LLM output you do not understand or check to be true, you better remove it.

You could say the same about what people find on the web, yet LLMs are penalized more than web search.

>If you do not use LLMs and just misunderstood something, you will have an (flawed) justification for why you wrote this. If there's something flawed in an LLM, the likelihood that you do not have any justification except for "the LLM said so" is quite high and should thus be penalized higher.

Swap "LLMs" for "websites" and you could say the exact same thing.

The author has this in their conclusions:

>One clear conclusion is that the vast majority of students do not trust chatbots. If they are explicitly made accountable for what a chatbot says, they immediately choose not to use it at all.

This is not true. What is true is that if the students are more accountable for their use of LLMs than their use of websites, they prefer using websites. What is "more" here? We have no idea, the author didn't say so. It could be that an error from a website or your own mind is -1 point and from a LLM is -2, so LLMs have to make two times less mistakes than websites and your mind. It could be -1 and -1.25. It could be -1 and -10.

The author even says themselves:

>In retrospect, my instructions were probably too harsh and discouraged some students from using chatbots.

But they don't note the bias they introduced against LLMs with their notation.

Zababa · 2026-01-14T13:25:17 1768397117

I don't think appreciating art separated from the author is solipsistic, in fact I'd argue the opposite. Needing a human presence to engage with art is very human-centric. Or maybe that's due to your definition of art? I can be stunned by how beautiful a sunset is, the same way that I am by a painting, even if no human had a hand in that sunset. I can appreciate the cleverness of a gull stealing some bread from a duck the same way I can appreciate the cleverness of a specific music being used at a specific point in a movie. I can shiver at the brutality of humanity watching Night and Fog, just like I can shiver at the brutality of a praying mantis, eating alive a roach.

>Maybe one day machines will be able to make art in the same way humans do: by going out into the world, having experiences, making mistakes, learning, connecting with others, loving and being loved, or being rejected soundly, and understanding deeply what it means to be a living thing in this universe.

I think this is a good description of the process of how some art is created, but not all? Some art is a pursuit of "what is beautiful" rather than "what it means to be human" ie a sensory experience, some art is accidental, some art just is. For some art knowing the person behind is important, to me; for some not; for some it adds to the experience; for some it removes from it.

I would also highlight some small contradiction:

>I can imagine that this is true for a lot of people. There are certainly folks out there who see music as an interesting sensory stimulus. This song makes you dance, this one makes you cry, this other one makes you feel nostalgic. To these people, the only thing that matters is what the music makes them feel. It's a strange, solipsistic way of engaging with art, but who am I to judge?

>Here's an admittedly extreme example, but it's demonstrative of how I personally relate to music. In the wake of the #MeToo movement (see https://en.wikipedia.org/wiki/MeToo_movement), some of the musicians I used to love as a teenager were outed as sexual predators. When I found out, I scoured my music library and deleted all their work. The music was still the exact same music I fell in love with all those years ago, but I could no longer listen to it without being reminded of the horrible actions of the musicians. Listening to it was triggering.

That seems to me a case of "the only thing that matters is what the music makes them feel".

JohnFen · 2026-01-14T14:59:16 1768402756

> I can be stunned by how beautiful a sunset is, the same way that I am by a painting, even if no human had a hand in that sunset.

As can I, but a gorgeous sunset is not art. It's beauty.

CapsAdmin · 2026-01-15T10:48:49 1768474129

If the definition of art is that a human must be involved, then fine. AI generated music is not art. But it is everything art is minus the human component? ie, it can be beautiful, ugly, etc, just like how a sunset can be beautiful and a rotting corpse can be ugly.

Zababa · 2026-01-14T09:29:50 1768382990

>Code is not an asset it's a liability

This would imply companies could delete all their code and do better, which doesn't seem true?

thehappypm · 2026-01-15T03:02:55 1768446175

A more accurate description of code is that it’s a depreciating asset, perhaps, or an asset that requires maintenance cost. Neither of which is a liability

Zababa · 2026-01-12T16:04:11 1768233851

> Rather: an intended part of the ordinary course of using a Sprite. Like git, but for the whole system.

What I've been waiting for, for a long time. Basically the thing you need if you want agents to run freely but still in a safe way kinda.

>For reasons we’ll get into when we write up how we built these things, you wouldn’t want to ship an app to millions of people on a Sprite. But most apps don’t want to serve millions of people. The most important day-to-day apps disproportionately won’t have million-person audiences.

I appreciate a lot this vision of personal computing.

I'll give sprites a try, they sound super cool.

Zababa · 2026-01-09T13:15:35 1767964535

Depends on what level of macro you want, but with modern phones you can get pretty close, usually with the wide angle lens.

On iPhones: https://support.apple.com/guide/iphone/take-macro-photos-and...

On Pixel: https://store.google.com/intl/en/ideas/articles/pixel-macro-...

I'd recommend playing around with it, it's a lot of fun!

hermitcrab · 2026-01-09T13:33:32 1767965612

Had a quick play with my iPhone 15. It doesn't give the sort of magnification you would need for insect close-ups. I will stick with my Nikon DSLR + 100mm macro lens!

Zababa · 2026-01-09T14:52:38 1767970358

Yeah it's far from being as good as a DLSR or mirrorless with a dedicated macro lens. Still, most people reading HN have one in their pocket and it can be a good test to see if you like the idea of macro. It does work with larger insects, on a pixel 10 pro my mantis fill most of the frame.

Zababa · 2026-01-09T09:58:41 1767952721

> With modern technology, the USPS estimates that a similar sized letter would take a maximum of five days. With planes, trains, and automobiles available to us, we’ve shaved off about two days.

> Two days. In 165 years.

Going back to the start of the article, just like the Pony Express was closed due to telegrams, letter speed is not that important most of the time thanks to phones, internet, etc.

Zababa · 2026-01-09T09:44:21 1767951861

>Is this a workaround to let us see “what it would look like”, or are there optical reasons why this produces an image that is inherently artificial, and could never really be perceived that way?

Both in a way. When you look at a landscape, your eyes and brain are constantly adjusting everything so what you look at "directly" is sharp, and you don't really realize most of what is in your field of view is low resolution, maybe a bit blurry. Same when looking at something really close.

When you look at a picture, if some parts of it are blurry, your eyes/brain can't adjust so that it becomes sharp, because it was captured blurry. Even if you had a camera that exactly reproduces your eye, the pictures would look nothing like what your eyes see, because your eyes and brain are a very different system from a camera.

In photo there is something called "depth of field", which is "the distance between the nearest and the farthest objects that are in acceptably sharp focus in an image captured with a camera" [1]. You can see on the wikipedia page that there's an equation for approximating depth of field, that has in it 2u², where u is the distance to the subject. That means the closer the subject, the smaller the size of depth of field. You can test this with your eye. Take an object 30cm away, put your finger between your eye and the object, and you can change the focus of your eye between your finger and the object. When you focus on your finger the object is a bit blurry, when you focus on the object your finger is a bit blurry [2]. Now take two object that are 15cm away from each other, but 2m or more away from you. Changing the focus from one object to the other won't make the first object as blurry as when you did that close. This is because your depth of field is larger, as the distance increases.

Finally macro. In macro photography, you're often extremely close, so depth of field is extremely thin. When I say extremely thin, I mean "it can take 10 or more pictures to cover a whole fly". A solution in that case, to get all your subject in focus (sharp), is to take lots of different pictures, focusing a tiny bit closer/farther away each time, and then taking all the sharp parts of each picture. That's the technique used here, often called "focus stacking".

[1]: https://en.wikipedia.org/wiki/Depth_of_field

[2]: This might be harder if you're older, as we age we slowly lose the ability to adjust focus, hence the need for reading glasses (cameras can also use "reading glasses" when they can't focus close enough, they're called "close-up filters" and work the same).

pixl97 · 2026-01-09T21:45:31 1767995131

>and you don't really realize most of what is in your field of view is low resolution, maybe a bit blurry

PBS did a great video on this

https://www.youtube.com/watch?v=HU6LfXNeQM4&t=514s

This links to the section in question, but it's well worth watching all of it to see and example of how your brain tricks you. The computer doing eye tracking and blurring everything else out to the user really points out how much your brain lies to you about reality.

Zababa · 2026-01-08T16:07:14 1767888434

It is silly because the problem isn't becoming worse, and not caused by AI labs training on user outputs. Reward hacking is a known problem, as you can see in Opus 4.5 system card (https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-...) and they are working to reduce the problem, and measure it better. The assertions in the article seem to be mostly false and/or based on speculation, but it's impossible to really tell since the author doesn't offer a lot of detail (for example for the 10h task that used to take 5h and now takes 7-8h) except for a very simple test (that reminds me more of "count the r in strawberry" than coding performance tbh).

Zababa · 2026-01-08T16:02:49 1767888169

From what I understand model collapse/GIGO are not a problem in that labs generally know where the data comes from, so even if it causes problem in the long run you could filter it out. It's not like labs are forced to train models on the user outputs.

toss1 · 2026-01-08T20:53:07 1767905587

Indeed they are not forced to train them on user outputs, but the author of the article seems to have found good evidence that they are actually doing that, and will need more expert data-tagging/filtering on the inputs to regain their previous performance

Zababa · 2026-01-09T09:16:09 1767950169

I don't think the author of the article found "good evidence". He found a specific case where there was a regression. This could be due to:

- models actually getting worse in general

- his specific style of prompting working well with older models and less well with newer models

- the thing his test tests no longer being a priority for big AI labs

From the article:

> GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there.

Here ignoring the instructions to give a "useful answer" (as evaluated by the author) is considered a good thing. This would mean if a model is trained to be better at instruction following, it would lose points in that test.

To me this article feels a bit like saying "this new gun that shoot straight 100% of the time is worse than the older gun that shot straight only 50% of the time, because sometimes I shoot at something I don't actually want to shoot at!". And in a way, it is true, if you're used to being able to shoot at things without them getting hurt, the new gun will be worse from that point of view. But to spin up a whole theory about garbage in/garbage out from that? Or to think all models are getting worse rather than, you're maybe no longer the target audience? That seems weird to me.

toss1 · 2026-01-09T14:45:57 1767969957

You're right - I wasn't considering how narrow his case is and was perhaps overgeneralizing, particularly about the cause.

Seems we agree the better solution for column_index_+1 doesn't exist is to call it out instead of stealthily append a new column, but the why the newer models have that behavior is indeed speculative.

It a bit echos the conundrum from back in the PC days where IBM hardware was the de-facto standard, and companies building "compatible" hardware had to decide whether to be compatible with the spec, or compatible with every detail of the implementation, including buggy behavior, of which OFC some software took advantage. So, do they build to be "compatible" or "bug-compatible"?

Was the ChatGPT v4 response highlighting the missing column a bug or failure to shoot straight? Not sure I'd characterize it that way, but there definitely could be many other reasons for the change in behavior (other than training on lower-skilled programmers' inputs) — we really have to consider that as a conjecture on the author's part.