
How do you deny access to prod credentials from an assistant running on your dev machine assuming you need to store them on that same machine to do manual prod investigation/maintenance work from that machine?

I keep them in env variables rather than files. Not 100% secure - technically Claude Code could still run printenv - but it's never tried. The main thing is it won't stumble into them while reading config files or grepping around.

A process does not need to run printenv to see environment variables; they are literally part of the environment it runs in.

The LLM doesn't have direct access to the process env unless the harness forwards it (and it doesn't)
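
For what it's worth, a harness can also enforce that explicitly by spawning tool commands with a scrubbed environment. A minimal sketch in Python (the PROD_/AWS_/DATABASE_ prefixes are just hypothetical examples of what you might strip, not what any particular harness actually does):

    import os
    import subprocess

    # Hypothetical prefixes of variables the spawned tool should never see.
    SENSITIVE_PREFIXES = ("PROD_", "AWS_", "DATABASE_")

    def run_tool_command(cmd: list[str]) -> subprocess.CompletedProcess:
        """Run an agent-issued command with secrets stripped from its environment."""
        clean_env = {
            k: v for k, v in os.environ.items()
            if not k.startswith(SENSITIVE_PREFIXES)
        }
        return subprocess.run(cmd, env=clean_env, capture_output=True, text=True)

    # Even `printenv` run through the harness would not list the stripped variables.
    print(run_tool_command(["printenv"]).stdout)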

chown other_user; chmod 000; sudo -k

When you run Antigravity for the first time, it asks you to choose a profile (I don't remember the exact naming), and what each option entails w.r.t. the level of command execution confirmation is well explained.

Yeah but it also says something like "Auto (recommended). We'll automatically make sure Antigravity doesn't run dangerous commands." so they're strongly encouraging people to enable it, and suggesting they have some kind of secondary filter which should catch things like this!

I think there is far less than a 1% chance of this happening, but there are probably millions of Antigravity users at this point; even a one-in-a-million chance of this happening is already a problem.

We need local sandboxing for FS and network access (e.g. via Linux namespaces/`cgroups`, or similar mechanisms on non-Linux OSes) to run these kinds of tools more safely.


Codex does such sandboxing, fwiw. In practice it gets pretty annoying when e.g. it wants to use the Go CLI, which uses a global module cache. Claude Code recently got something similar[0] but I haven’t tried it yet.

In practice I just use a docker container when I want to run Claude with --dangerously-skip-permissions.

[0]: https://code.claude.com/docs/en/sandboxing
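
If it helps, this is roughly what that looks like when driven from a small Python wrapper; the image name is a placeholder and it assumes the agent CLI is already installed inside the image, so treat it as a sketch rather than a recipe:

    import subprocess
    from pathlib import Path

    def run_agent_sandboxed(workdir: Path) -> None:
        """Run the agent in a throwaway container so only the project dir is visible."""
        subprocess.run(
            [
                "docker", "run", "--rm", "-it",
                "-v", f"{workdir}:/workspace",  # the only host path mounted into the container
                "-w", "/workspace",
                "my-agent-image",               # hypothetical image with the CLI baked in
                "claude", "--dangerously-skip-permissions",
            ],
            check=True,
        )

    run_agent_sandboxed(Path.cwd())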


We also need laws. Releasing an AI product that can (and does) do this should be like selling a car that blows your finger off when you start it up.

This is more akin to selling a car to an adult who cannot drive, who then proceeds to ram it through their garage door.

It's perfectly within the capabilities of the car to do so.

The burden of proof is much lower, though, since the worst that can happen is you lose some money or, in this case, hard drive content.

For the car, the seller would be investigated because there was a possible threat to life; for an AI, it's buyer beware.


I think the general public has a MUCH better grasp on the potential consequences of crashing a car into a garage than some sort of auto-run terminal command mode in an AI agent.

These are being sold as a way for non-developers to create software; I don't think it's reasonable to expect that kind of user to have the same understanding as an actual developer.

I think a lot of these products avoid making that clear because the products suddenly become a lot less attractive if there are warnings like "we might accidentally delete your whole hard drive or destroy a production database."


Responsibility is shared.

Google (and others) are (in my opinion) flirting with false advertising with how they advertise the capabilities of these "AI"s to mainstream audiences.

At the same time, the user is responsible for their device and what code and programs they choose to run on it, and any outcomes as a result of their actions are their responsibility.

Hopefully they've learned that you can't trust everything a big corporation tells you about their products.


This is an archetypal case of where a law wouldn't help. The other side of the coin is that this is exactly a data loss bug in a product that is perfectly capable of being modified to make it harder for a user to screw up this way. Have people forgotten how comically easy it was to do this without any AI involved? Then shells got just a wee bit smarter and it got harder to do this to yourself.

LLM makers that make this kind of thing possible share the blame. It wouldn't take a lot of manual functional testing to find this bug. And it is a bug. It's unsafe for users. But it's unsafe in a way that doesn't call for a law. Just like rm -rf * did not need a law.


there are laws about waiving liability for experimental products

sure, it would be amazing if everyone had to do a 100 hour course on how LLMs work before interacting with one


Where are these laws? Are they country, state, province?

varies by jurisdiction, but just as you can

- sell a knife that can lead to digit loss, or

- sell software that interacts with your computer and can lead to data loss, you can

- give people software for free that can lead to data loss.

...

the Antigravity installer comes with a ToS that has this

   The Service includes goal-oriented AI systems or workflows that perform
   actions or tasks on your behalf in a supervised or autonomous manner that you
   may create, orchestrate, or initiate within the Service (“AI Agents”). You
   are solely responsible for: (a) the actions and tasks performed by an AI
   Agent; (b) determining whether the use an AI Agent is fit for its use case;
   (c) authorizing an AI Agent’s access and connection to data, applications,
   and systems; and (d) exercising judgment and supervision when and if an AI
   Agent is used in production environments to avoid any potential harm the AI
   Agent may cause.

Google will fix the issue, just like auto makers fix their issues. Your comparison is ridiculous.

I think it would help to open an issue on GitHub making the following three points explicit in the report:

- steps to reproduce from scratch;

- what you expected to happen;

- what you actually observed (include the screenshot or video capture in addition to a textual description).

Otherwise, you might risk your report being ignored due to a silent misunderstanding about the mismatch between your expectations and the actual results.


At the time I wasn't sure if it was PEBCAK, which is why I started a discussion in the forums. As there were no replies, I received no notifications, and so I forgot all about it.

If anyone is interested in opening a bug report you can see the issue here: https://imgur.com/a/hZ1ja9o


Personally, I do not understand why you think there is a bug from this screen capture alone. Maybe it's because I am not that familiar with Penpot and Figma, but still, I do not find it obvious.

This is why it's important to describe explicitly the three points in text:

- steps to reproduce;

- what you expected to happen;

- what actual result you observe instead.

Something that might be obvious to you but isn't for others will just be silently ignored most of the time.

EDIT: I now see the problem after reading your other reply above:

https://news.ycombinator.com/item?id=46064757#46069546

This is why it's important to describe explicitly the difference between what you expected and what you observed. I swear I did not see the change in button width before reading the linked comment.


> This is why it's important to describe explicitly

That is a fair point. I will take it on board when giving people screenshots and videos of bugs in future.

> I did not see the change in button width

There are actually a lot more visual changes than just the button, but I will leave that to the reader as an exercise in spot-the-difference ;)


> There are actually a lot more visual changes than just the button, but I will leave that to the reader as an exercise in spot-the-difference ;)

This is fair. But issues like this will never get my attention in general because I don’t have time to do this exercise - I would much rather have it all spelled out. Even if there are a bunch of related issues, they won’t get fixed in a single PR; it will likely take multiple.

I guess my point is that if you really want OSS projects to improve, the issue submitter can’t just tell the maintainer to “figure it out”. It totally works this way in the corporate world though (IME).

Edit: I’m sorry to have jumped to conclusions. Leaving my comment up for accountability.


I didn’t ask the maintainer to “figure it out”. I posted a thread in the forum with multiple videos to start a discussion.

People here have stated I should have filed on GitHub, and because I don’t want to link my GitHub to this account I suggested someone else do it.

That was 6 hours ago, and people are still commenting about my lack of a suitable report rather than actually reporting it correctly themselves - as is evident from the lack of a new issue on GitHub.


I’m sorry for jumping at you like that.


No problem :)


> I swear I did not see the change in button width before reading the linked comment.

I didn’t either! I stared at that gif for a few minutes and couldn’t tell what the problem was (or what to look for). It wasn’t until you said “changing button width” that I knew where to focus my attention.


"Content not available in your region"

So, given that Penpot appears to mostly be developed in the EU, you'd need to fix that part first.


I’m not sure what you are referring to. If you mean the video link, I am in the EU and can see it.


Doesn't work in Austria, doesn't work from the UK, doesn't work from Finland.


I am physically in the EU and it’s working for me.

I just checked using nordVPN for Austria and Finland and it's working there too, so maybe you have some other issue going on?

I am assuming you are in the UK, as Imgur are specifically blocking the UK: https://help.imgur.com/hc/en-us/articles/41592665292443-Imgu...

Imgur isn’t my site, and I don’t vote in the UK, so I’m not sure how you expect me to resolve that.



Not being able to view imgur in the UK is a pain.


I hate how every time someone even talks about an issue with an open source project, some smart alec replies "well did you raise an issue?" - or worse - "did you send a PR to fix it?".

We are all very aware how bug reporting works. And user criticism of bugs isn't somehow invalidated just because the users didn't go to the sometimes very large effort to report bugs.

I wouldn't have reported this bug either. If the example documents are getting corrupted just by navigating them that indicates that it's just a really buggy project (corroborated by other comments here) that I'm not even going to use, so why would I spend my time working on it?


I opened an issue based on the discussion here and it didn't take much time or effort.

(It was one of those form-based issue templates that requires you to explicitly list out Steps to Reproduce, Expected behavior, Actual behavior, OS version, etc. which IMO causes slightly more friction for anyone who knows how to put together a good bug report, but I've also seen enough poorly-specified issues to know that it's necessary sometimes)


"And user criticism of bugs isn't somehow invalidated just because the users didn't go to the sometimes very large effort to report bugs."

Yes, it is.


No it isn't.


Yeah, just let everyone else do the work while you sit back and gripe.


I can see both sides of the dilemma and I don't necessarily like when a maintainer defaults to "open a PR" but asking for a reproducible issue wherever requested is not too much to ask.

With a PR I understand not wanting to put the effort in as it may not be merged. But offering up a reproducible example on the correct forum is the least you could do. If you want the problem fixed that's the best way forward.


> offering up a reproducible example on the correct forum is the least you could do

I suggested someone do that 8 hrs ago:

https://news.ycombinator.com/item?id=46069471

So far no takers. Just people saying what they would do instead of actually doing it :)


I've done it:

https://github.com/penpot/penpot/issues/7850

Thanks for sharing all the details about the issue, and shame on all the armchair critics :D


Thank you :)


You cannot share arbitrarily structured objects in the `ShareableList`, only atomic scalars and bytes / strings.

If you want to share structured Python objects between instances, you have to pay the cost of `pickle.dumps`/`pickle.loads` (CPU overhead for interprocess communication) + the memory cost of replicated objects in the processes.
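
A small sketch of both sides of that trade-off, using only the standard library:

    import pickle
    from multiprocessing import shared_memory

    # ShareableList only accepts int, float, bool, str, bytes and None items.
    sl = shared_memory.ShareableList([1, 2.5, "abc", b"raw", None, True])
    sl[0] = 42               # fine: updated in place, visible to attached processes
    # sl[1] = {"a": 1}       # would raise: structured objects are not supported

    # For structured objects you fall back to serialization, i.e. a per-process copy:
    payload = pickle.dumps({"a": 1, "nested": [1, 2, 3]})   # CPU cost to serialize
    replica = pickle.loads(payload)                          # memory cost per process

    sl.shm.close()
    sl.shm.unlink()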


We need a dataclass-like interface on top of a ShareableList.


Actually, ShareableList feels like a tuple really (as it’s impossible to change its length). If we could mix ShareableList and collections.namedtuple together, it would get us 90% there (99.9% if we use typing.NamedTuple). Unfortunately, I can’t decipher either one [1, 2] at first glance – maybe if I get some more sleep?

[1]: https://github.com/python/cpython/blob/3.13/Lib/collections/...

[2]: https://github.com/python/cpython/blob/3.13/Lib/typing.py#L2...
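
You may not need to untangle either implementation: a thin attribute-access view over a ShareableList already gets most of the way there. A rough sketch (the class and field names are made up for illustration; fields are fixed at class definition time and values must be ShareableList-compatible scalars/strings):

    from multiprocessing import shared_memory

    class SharedPoint:
        """namedtuple-ish, fixed-length, attribute-access view over a ShareableList."""

        _fields = ("x", "y", "label")  # hypothetical schema

        def __init__(self, values=None, *, name=None):
            if values is not None:
                assert len(values) == len(self._fields)
                self._sl = shared_memory.ShareableList(list(values))
            else:
                # attach to a record created by another process, by shared memory name
                self._sl = shared_memory.ShareableList(name=name)

        def __getattr__(self, field):
            try:
                return self._sl[self._fields.index(field)]
            except ValueError:
                raise AttributeError(field) from None

        @property
        def shm_name(self):
            return self._sl.shm.name

    rec = SharedPoint((1.0, 2.0, "origin"))
    print(rec.x, rec.label)  # 1.0 origin
    # In another process: SharedPoint(name=rec.shm_name) attaches to the same buffer.
    # (Cleanup via rec._sl.shm.close()/unlink() omitted for brevity.)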


I can fit a lot of json into bytes/strings though?


If all your state is already json-serializable, yeah. But that's just as expensive as copying if not more, hence what cjbgkagh said about flatbuffers.


oh nvm, that doesn't solve this either


Perhaps flatbuffers would be better?


I love learning from folks on HN -- thanks! Will check it out.


Take a look at https://capnproto.org/ as well, while at it.

Neither solve the copying problem, though.


Ah, I forgot capnproto doesn't let you edit a serialized proto in-memory, it's read-only. In theory this should be possible as long as you're not changing the length of anything, but I'm not surprised such trickery is unsupported.

So this doesn't seem like a versatile solution for sharing data structs between two Python processes. You're gonna have to reserialize the whole thing if one side wants to edit, which is basically copying.


let me introduce you to quickle.


What’s the point? The whole idea is to share an object, not to serialize it, whether as JSON, pickle, or whatever.


I mean, the answer to this is pretty straightforward -- because we can, not because we should :)


That’s even worse than pickle.


pickle pickles to pickle binary, yeah? So can stream that too with an io Buffer :D


So don’t do that? Send data to workers as primitives, and have a separate process that reads the results and serializes it into whatever form you want.


According to the following paper, it's possible to get calibrated confidence scores by directly asking the LLM to verbalize a confidence level, but it strongly depends on how you prompt it to do so:

https://arxiv.org/abs/2412.14737
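
Concretely, that boils down to appending an instruction to the prompt and parsing the model's self-reported number; the template below is just an illustration of the idea, not the exact wording from the paper:

    import re

    CONFIDENCE_SUFFIX = (
        "\n\nAfter your answer, add a final line of the form "
        "'Confidence: <number between 0 and 100>' describing how likely "
        "you think it is that your answer is correct."
    )

    def build_prompt(question: str) -> str:
        return question + CONFIDENCE_SUFFIX

    def parse_verbalized_confidence(completion: str) -> float | None:
        """Return the self-reported confidence as a probability, if present."""
        match = re.search(r"Confidence:\s*(\d{1,3})", completion)
        if match is None:
            return None
        return min(int(match.group(1)), 100) / 100.0

    print(parse_verbalized_confidence("The answer is 42.\nConfidence: 85"))  # 0.85

Calibration is then measured by comparing these self-reported numbers against empirical accuracy on a labeled evaluation set.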


It appears that they reused a lot of the data preparation provided by the AllenAI team:

https://github.com/allenai/OLMoE

https://github.com/allenai/dolma

https://github.com/AMD-AIG-AIMA/Instella


Software engineering is difficult to verify because it requires dealing with an ambiguous understanding of the end-user's actual needs / value and subtle trade-offs about code maintainability vs feature coverage vs computational performance.

Algorithmic puzzles, on the other hand, both require reasoning and are easy to verify.

There are other things in coding that are both useful and easy to verify: checking that the generated code follows formatting standards or generating outputs with a specific data schema and so on.


I agree with you on the first part, but no, code is not easy to verify. I think you missed part of what I wrote. I mean verify that your code is bug free. This cannot be done purely through testing. Formal verification still remains an unsolved problem.


But if you have a large set of problems to which you already know the answer and you use that in reinforcement learning, wouldn't the expertise transfer later to problems with no known answers? That would be a feasible strategy, right?

Another issue is how much data you can synthesize in such a way that you can construct both the problem and the solution, so that you know the answer before using it as a sample.

I.e., some problems are easy to create when you can construct the problem yourself, but would be hard to solve with no prior knowledge, and could therefore be used as a scoring signal?

I.e., you are the oracle, and whatever model is being trained doesn't know the answer, only whether it is right or wrong. But I don't know if the reward function must be binary or on a scale.

Does that make sense or is it wrong?


I don't think this makes sense and I'm not quite sure why you went to ML, but that's okay. I am a machine learning researcher, but also frustrated with the state of machine learning, in part because, well... you can probably see how "proof by empirical evidence" is dialed up to 11.

Sorry, long answer incoming. It is far from complete too but I think it will help build strong intuition around your questions.

Will knowledge transfer? That entirely depends on the new problem. It also entirely depends on how related the problem is. But also, what information was used to solve the pre-transfer state. Take LLMs for example. There are lots of works that have shown they are difficult to train for solving calculations: they will do well on problems with the same number of digits, but this degrades rapidly as the number of digits increases. It can be weird to read some of these papers as there will sometimes be periodic relationships with the number of digits, but that should give us information about how they're encoding the problems. But that lack of transferability indicates that even when the model solves what we'd believe is just the same problem, that doesn't mean it is the same problem to the model. So you have to be really careful here, because us humans are really fucking good at generalization (yeah, we also suck, but a big part is that our proficiency makes us recognize where we lack. But also, this is more a "humans can" than a "humans do" type of thing. So be careful when comparing). This generalization is really because we're focused on building causal relationships, while on the other hand the ML algorithms are built around compression (i.e. fitting data). Which, if you notice, is the same issue I was pointing to above.

  > I.e., you are the oracle, and whatever model is being trained doesn't know the answer, only whether it is right or wrong. But I don't know if the reward function must be binary or on a scale.
This entirely depends on the problem. We can construct simple problems that illustrate both success and failure. What you really need to think about here is the information gain from the answer. If you check how to calculate that, you will see the dependence (we could get into Bayesian learning or experiment design but this is long enough). But let's think of a simple example in the negative direction. If I ask you to guess where I'm from, you're going to have a very hard time pinning down the exact location. Definitely in this example there is an efficient method, but our ML algorithms don't start with prior knowledge about strategies and so they aren't going to know to binary search. If you gave that to the model, you baked in that information. This is a tricky form of information leakage. It can be totally fine to bake in knowledge, but we should be aware of how that changes how we evaluate things (we always bake in knowledge btw. There is no escaping this). But most models would not have a hard time if instead we played "hot/cold", because the information gain is much higher. We've provided a gradient to the solution space. We might call this hard and soft labels, respectively.

I picked this because there's a rather famous paper about emergent abilities (I fucking hate this term[0]) in ML models[1], and a far less famous counter to it[2]. There are a lot of problems with [1] that require a different discussion, but [2] shows how a big part of the issue is that many of the loss landscapes are fairly flat, and so when feedback is discrete the smaller models just wander around that flat landscape, needing to get lucky to find the optima (btw, this also shows that technically this can be done too! But that would require different training methods and optimizers). But when given continuous feedback (i.e. you're wrong, but closer than your last guess), they are able to actually optimize. A big criticism of the work is that it is an unfair comparison because there are "right and wrong" answers here, but it'd be naive to not recognize that some answers are more wrong than others. Plus, their work shows a clear, testable way we can confirm or deny if this works or not. We schedule learning rates; there's no reason you cannot schedule labels. In fact, this does work.

But also look at the ways they tackled these problems. They are entirely different. [1] tries to do proof by evidence while [2] uses proof by contradiction. Granted, [2] has an easier problem since they only need to counter the claims of [1], but that's a discussion about how you formulate proofs.

So I'd be very careful when using the recent advancements in ML as a framework for modeling reasoning. The space is noisy. It is undeniable that we've made a lot of advancements, but there are some issues with what work gets noticed and what doesn't. A lot does come down to this proof-by-evidence fallacy. Evidence can only bound confidence; it unfortunately cannot prove things. But this is helpful, and well, we can bound our confidence to limit the search space before we change strategies, right? I picked [1] and [2] for a reason ;) And to be clear, I'm not saying [1] shouldn't exist as a paper or that the researchers were dumb for doing it. Read back on this paragraph, because we've got multiple meta layers here. It's good to place a flag in the ground, even if it is wrong, because you gotta start somewhere, and science is much, much better at ruling things out than ruling things in. We focus more on proving things don't work until there's not much left and then accept those things (limits here too, but this is too long already).

I'll leave with this, because now there should be a lot of context that makes this much more meaningful: https://www.youtube.com/watch?v=hV41QEKiMlM

[0] It significantly diverges from the terminology used in fields such as physics. ML models are de facto weakly emergent by nature of composition. But the ML definition can entirely be satisfied by "Information was passed to the model but I wasn't aware of it" (again, same problem: exhaustive testing)

[1] (2742 citations) https://arxiv.org/abs/2206.07682

[2] (447 citations) https://arxiv.org/abs/2304.15004


Thanks a lot for the detailed reply, it was better than I had hoped for :)

So knowledge transfer is something incredibly specific and much narrower than I thought. They don't transfer concepts by generalization, but compress knowledge instead, where I assume the difference is that generalization is much more fluid, while compression is much more static, like a dictionary where each key has a probability of being chosen and all the relationships are frozen, and the only generalization that happens is the generalization that is an expression of the training method used, since the training method freezes its "model of the world" into the weights, so to say? So if the training method itself cannot generalize, but only compress, why would the resulting model that the training method produces be able to? Have I understood that correctly?

Does there exist a computational model, which can be used to analyse a training method and put a bound on the expressiveness of the resulting model?

It's fascinating that the emergent abilities of models disappear if you measure them differently. Guess the difference is that "emergent abilities" are kinda nonsensical, since they have no explanation of causality (i.e. it "just" happens), and just seeing the model getting linearly better with training fits into a much more sane framework. That is, like you said, when your success metric is measuring discretely, you also see the model itself as discrete, and it hides the continuous hill climbing you would otherwise see the model exhibit with a different non-discrete metric.

But the model still gets better over time, so would you expect the model to get progressively worse on a more generalized metric, or does it only relate to the spikes in the graph that they talk about? I.e., they answer the question of why jumps in performance are not emergent, but they don't answer why performance keeps increasing, even if it is linear, and whether it is detrimental to other, less related tasks?

And if you wanted to test "emergence", wouldn't it be more interesting to test the model on tasks that are much more unrelated to the task at hand? That would be to test generalization, more as we humans see it? So it wouldn't really be emergence, but generalization of concepts?

It makes sense that it is more straightforward to refute a claim by using contradiction. Would it be good practice for papers, to try and refute their own claims by contradiction first? I guess that would save a lot of time.

The point about knowledge leakage is interesting, because I was thinking about the concept of world simulations and using models to learn about scenarios through simulations and consequence. But the act of creating a model to perceive the world taints the model itself with bias, so the difficulty lies in creating a model which can rearrange itself to get rid of incorrect assumptions while disconnecting its initial inherent bias. I thought about models which can create other models etc., but then how does the model itself measure success? If everything is changing, then so is the metric, so the model could decide to change what it measures as well. I thought about hard-coding a metric into the model, but what if the metric I choose is bad? Then we are stuck with the same problem of bias as well. So it seems like there are only two options: it either converges towards total uncontrollability or it is inherently biased; there doesn't seem to be any in-between?

I admit I'm just trying to learn things about ML; I find general intelligence research fascinating (neuroscience as well), but the more I learn, the more I realize I should really go back to the fundamentals and build up. Because even things which seem like they make sense on a surface level really have a lot of meaning behind them, and each needs a well-built intuition not from a practical level, but from a theoretical level.

From the papers I've read which I find interesting, it's like there's always the right combination of creativity in thinking, which sometimes my intuition/curiosity about things proved right, but I lack the deeper understanding, which can lead to false confidence in results.


Well fuck... My comment was too long... and it doesn't get cached -___-

I'll come back and retype some of what I said but I need to do some other stuff right now. So I'll say that you're asking really good questions and I think you're mostly understanding things.

So give you very quick answers:

Yes, things are frozen. There's active/online learning but even that will not solve all the issues at hand.

Yes, we can put bounds. Causal models naturally do this but statistics is all about this too. Randomness is a measurement of uncertainty. Note that causal models are essentially perfect embeddings. Because if you've captured all causal relationships, you gain no more value from additional information, right?

Also note that we have to be very careful about assumptions. It is always important to uncover what assumptions have been made and what the implications are. This is useful in general problems solving and applies to anything in your life, not just AI/ML/coding. Unfortunately, assumptions are almost never explicitly stated, so you got to go hunting.

See how physics defines strong emergence and weak emergence. There are no known strongly emerging phenomena and we generally believe they do not exist. For weakly emerging, well it's rather naive to discuss this in the context of ML if we're dedicating so little time and effort to interpretation, right? That's kinda the point I was making previously about not being able to differentiate an emergent phenomena from not knowing we gave it information.

For the "getting better" it is about the spikes. See the first two figures and their captions in the response paper.

More parameters do help btw, but make sure you distinguish the difference between a problem being easier to solve and a problem not being solvable. The latter is rather hard to show. But the paper is providing strong evidence to the underlying issues being about the ease of problem solving rather than incapacity.

Proof is hard. There's nothing wrong with being empirical, but we need to understand that this is a crutch. It is evidence, not proof. We leaned on this because we needed to start somewhere. But as progress is made so too must all the metrics and evaluations. It gets exponentially harder to evaluate as progress is made.

I do not think it is best to put everyone in ML into the theory first and act like physicists. Rather we recognize the noise and do not lock out others from researching other ideas. The review process has been contaminated and we lost sight. I'd say that the problem is that we look at papers as if we are looking at products. But in reality, papers need to be designed with understanding the experimental framework. What question is being addressed, are variables being properly isolated, and do the results make a strong case for the conclusion? If we're benchmark chasing we aren't doing this and we're providing massive advantage to "gpu rich" as they can hyper-parameter tune their way to success. We're missing a lot of understanding because of this. You don't need state of the art to prove a hypothesis. Nor to make improvements on architectures or in our knowledge. Benchmarks are very lazy.

For information leakage, you can never remove the artist from the art, right? They always leave part of themselves. That's okay, but we must be aware of the fact so we can properly evaluate.

Take the passion, and dive deep. Don't worry about what others are doing, and pursue your interests. That won't make you successful in academia, but it is the necessary mindset of a researcher. Truth is no one knows where we're going and which rabbit holes are dead ends (or which look like dead ends but aren't). It is good to revisit because you table questions when learning, but then we forget to come back to them.

  > needs a well-built intuition not from a practical level, but from a theoretical level.
The magic is at the intersection. You need both and you cannot rely on only one. This is a downfall in the current ML framework and many things are black boxes only because no one has bothered to look.


Formal verification of arbitrary programs with arbitrary specifications will remain an unsolved problem (see halting problem). But formal verification of specific programs with specific specifications definitely is a solved problem.
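
As a toy illustration of "specific program, specific specification", here is a fully machine-checked example in Lean 4 (assuming a recent toolchain where the `omega` tactic is available); the proof holds for every input, which no finite amount of testing can give you:

    -- A concrete program...
    def double (n : Nat) : Nat := n + n

    -- ...and a concrete specification, proved once and for all.
    theorem double_spec (n : Nat) : double n = 2 * n := by
      unfold double
      omega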


As someone who came over from physics to CS, this has always been one of the weirdest aspects of CS to me: that CS people believe testing code (observing output) is sufficient to assume code correctness. You'd be laughed at in most hard sciences for doing this. I mean you can even ask the mathematicians, and there's a clear reason why proofs by contradiction are so powerful. But proof through empirical analysis is like saying "we haven't found a proof by contradiction, therefore it is true."

It seems that if this were true, formal verification would be performed much more frequently. It would no doubt be cheaper than hiring pen testers, paying out bug bounties, or incurring the costs of getting hacked (even more so getting unknowingly hacked). It also stands to reason that the NSA would have a pretty straightforward job: grab source code, run verification, exploit flaws, repeat the process as momentum is in your favor.

That should be easy to reason through even if you don't really know the formal verification process. We are constantly bombarded with evidence that testing isn't sufficient. This is why it's been so weird for me, because it's talked about in schooling and you can't program without running into this. So why has it been such a difficult lesson to learn?


The fact that a lot of code doesn't even have tests, and that a lot of people don't think even writing tests is a good thing, should shock you even more.


I teach so I'm not too shocked lol. But there's a big difference between beginners doing this and junior devs. But the amount of Sr devs doing this shit and not even understanding the limits of testing, well... a bit more shameful than surprising if you ask me


I don't think this is really true either practically or theoretically. On the practical side, formally verifying program correctness is still very difficult for anything other than very simple programs. And on the theoretical side, some programs require arbitrarily difficult proofs to show that they satisfy even very simple specifications (e.g. consider a program to encode the fixpoint of the Collatz conjecture procedure, and our specification is that it always halts and returns 1).


Have you looked into the busy beaver problem?


Similarly, for paywalled news/journals.


Fifteen years ago people were talking about micropayments per article for news services being just around the corner. What ever happened to that?


A big problem with micropayments is that the transaction costs tend to dwarf the actual payment, which isn't good for the buyer or seller. I don't think it is an unsolvable problem, but there are significant network effects that would need to be overcome.


AFAIK it never pans out really. People turn out very stingy if they're faced with a decision to pay or not to pay for every article, so the revenues end up a lot lower than what the subscription model would pay.


When people are confronted with the actual cost, they tend to say no. With a subscription, their head tells them that for 10 €/$ they get an infinite number of articles.


No, they get the articles that _you_ provide. But if _you_ provide only 50% of the interesting articles, as does every other provider, then approaching the ability to access 100% of interesting articles gets very expensive. Just getting to 90% of the articles would cost 40€/$. And pushing that to 99% will cost 70€/$.


A ton of newspapers actually tried micropayments (something like 50c per article). Almost nobody was interested, consumers do not want micropayments.

I believe instead that the future for textual content is mass syndication, just like it worked out for video content and audio content.


God I fear the advent of Reportify and Poetify


It's better to be specific:

- open-source inference code

- open weights (for inference and fine-tuning)

- open pretraining recipe (code + data)

- open fine-tuning recipe (code + data)

Very few entities publish the latter two items (https://huggingface.co/blog/smollm and https://allenai.org/olmo come to mind). Arguably, publishing curated large-scale pretraining data is very costly, but publishing code to automatically curate pretraining data from uncurated sources is already very valuable.


Also open-weights comes in several flavors -- there is "restricted" open-weights like Mistral's research license that prohibits most use cases (most importantly, commercial applications), then there are licenses like Llama's or DeepSeek's with some limitations, and then there are some Apache 2.0 or MIT licensed model weights.


Has it been established whether the weights can even be copyrighted? My impression has been that AI companies want to have their cake and eat it too: on one hand they argue that the models are more like a database in a search engine, hence are not violating copyright of the data they have been trained with, but on the other hand they argue they meet the threshold that they are copyrightable in their own right.

So it seems to me that it's at least dubious if those restricted licences can be enforced (that said you likely need deep pockets to defend yourself from a lawsuit)


Then those should not be considered “open” in any real sense—when we say “open source,” we’re talking about the four freedoms (more or less—cf. the negligible difference between OSI and FSF definitions).

So when we apply the same principles to another category, such as weights, we should not call things “open” that don’t grant those same freedoms. In the case of this research license, Freedom 0 at least is not maintained. Therefore, the weights aren’t open, and to call them “open” would be to indeed dilute the meaning of open qua open source.


Wait timeout. I thought DeepSeek's stuff was all MIT licensed too no? What limitations are you thinking of that DeepSeek still has?


I am referring to this one: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/LIC...

It is a bit more permissive than Llama's (no MAU threshold, it seems).


Wow. Your link is frustrating because I thought everything was under the MIT license. Why did people claim it is MIT licensed if they sneaked in this additional license?


So, the older DeepSeek-V3 model weights are sadly not permissively licensed.

But the recent DeepSeek-R1-Zero and DeepSeek-R1 have MIT licensed weights.


Thank you very much. That was helpful. Do we need the older model weights to use the recent DeepSeek-R1-Zero and DeepSeek-R1 models?


I can't be 100% certain, but I think the good news is: no. There seem to be the exact same number of safetensor files for both, and AFAICT the file sizes are identical.

https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main https://huggingface.co/deepseek-ai/DeepSeek-R1/tree/main

