If I publish some c++ code that has some hard-coded magic values in it, can the code not be considered open source until I also publish how I came up with those magic values?
It depends on what those magic numbers are for. If they represent pure data, and it's obvious what the data is (perhaps a bitmap image), then sure, it's open source.
If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
Even algorithms can be open source in spirit but closed source in practice. See ECDSA: the NSA has never revealed, in any verifiable way, how the specific curves used with the algorithm were chosen, so there is lingering suspicion that they were picked for some inherent (but hard-to-find) weakness.
I don't know a ton about AI, but I gather there are lots of areas in the process of producing a model where they can claim everything is "open source" as a marketing gimmick but in reality, there is no explanation for how certain results were achieved. (Trade secrets, in other words.)
> If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
To my understanding, the contents of a .safetensors file are purely numerical weights - used by the model defined in MIT-licensed code[0] and described in a technical report[1]. The weights are arguably only really "executed" to the same extent kernel weights of a gaussian blur filter would be, though there is a large difference in scale and effect.
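To make the blur analogy concrete, here's a minimal C++ sketch (mine, not from the linked code) where the kernel weights are hard-coded data consumed by generic convolution code - structurally the same relationship model weights have to inference code:

```cpp
#include <array>
#include <vector>
#include <cstddef>

// Hard-coded 1-D Gaussian kernel weights: pure data, not instructions.
constexpr std::array<float, 3> kKernel = {0.25f, 0.5f, 0.25f};

// Generic code that "executes" the weights by convolving them with input.
std::vector<float> blur(const std::vector<float>& in) {
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        for (int k = -1; k <= 1; ++k) {
            // Clamp the index at the edges.
            std::ptrdiff_t j = static_cast<std::ptrdiff_t>(i) + k;
            if (j < 0) j = 0;
            if (j >= static_cast<std::ptrdiff_t>(in.size()))
                j = static_cast<std::ptrdiff_t>(in.size()) - 1;
            out[i] += kKernel[static_cast<std::size_t>(k + 1)]
                      * in[static_cast<std::size_t>(j)];
        }
    }
    return out;
}
```

Swapping in different weights changes the filter's behavior without touching a line of code - which is exactly what makes the code/data question interesting for model weights.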
Code is data is code. Fundamentally, they are the same. We treat the two things as distinct categories only for practical convenience. Most of the time, it's pretty clear which is which, but we all regularly encounter situations in which the distinction gets blurry. For example:
- Windows MetaFiles (WMF, EMF, EMF+), still in use (mostly inside MS Office suite) - you'd think they're just another vector image format, i.e. clearly "data", but this one is basically a list of function calls to Windows GDI APIs, i.e. interpreted code.
- Any sufficiently complex XML or JSON config file ends up turning into an ad-hoc Lisp language, with ugly syntax and a parser that's a bug-ridden, slow implementation of a Lisp runtime. People don't realize that the moment they add conditionals and ability to include or refer back to other parts of config, they're more than halfway to a Turing-complete language.
- From the POV of hardware, all native code is executed "to the same extent kernel weights of a gaussian blur filter" are. In general, all code is just data for the runtime that executes it.
And so on.
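A toy sketch of the point above (hypothetical opcodes, invented just for this example): a plain byte array that is "data" to the compiler but "code" to the few-line interpreter that walks it:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Made-up opcodes for a trivial accumulator machine.
enum Op : std::uint8_t { ADD = 0, MUL = 1, END = 2 };

// Interprets the byte array as a program. Assumes each opcode
// is followed by a one-byte operand.
int run(const std::vector<std::uint8_t>& program) {
    int acc = 0;
    for (std::size_t pc = 0; pc + 1 < program.size(); pc += 2) {
        switch (program[pc]) {
            case ADD: acc += program[pc + 1]; break;
            case MUL: acc *= program[pc + 1]; break;
            case END: return acc;
        }
    }
    return acc;
}
```

Whether `{ADD, 2, MUL, 3, END, 0}` counts as code or data depends entirely on whether you're looking at it from the compiler's side or the interpreter's side.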
Point being, what is code and what is data depends on practical reasons you have to make this distinction in the first place. IMHO, for OSS licensing, when considering the reasons those licenses exist, LLM weights are code.
If you publish only the binary, it's not open source.
If you open the source, then it is open source.
If you write a book/blog about how you came up with the ideas but don't publish the source, it's not open source, even if you publish the blog + binaries.
I don't know if that compares to an AI model, where the most significant portions are the data preparation and training. The code DeepSeek released only demonstrates how to use the given weights for inferencing with Torch/Triton. I wouldn't consider that an open-source model, just wrapper code for publicly available weights.
I think a closer comparison would be Android and GApps, where if you remove the latter, most would deem the phone unusable.
The Open Source Definition is quite clear on its #2 requirement:
`The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed.`
https://opensource.org/osd
Arguably this would still apply to DeepSeek. While they didn't release a way of recreating the weights, it is perfectly valid and common to modify the neural network using only what was released (when doing fine-tuning or RLHF, for example, the original training data is not required). Modifying the weights directly certainly seems like the preferred way of modifying the model to me.
Another note is that this may be the more ethical option. I’m sure the training data contained lots of copyrighted content, and if my content was in there I would prefer that it was released as opaque weights rather than published in a zip file for anyone to read for free.
It takes away the ability to know what it does, though, which is also often considered an important aspect. By not publishing details on how the model was trained, there's no way to know whether intentional misbehavior was baked into the training. If they provided everything needed to train your own model, you could rule that out by training on data you chose yourself, using the same methodology.
IMO it should be considered freeware, and only partially open. It's like releasing an open source program with a part of it delivered as a binary.
It's not that they want to keep the training content secret, it's the fact that they stole the training content, and who they stole it from, that they want to keep secret.