If I publish some c++ code that has some hard-coded magic values in it, can the code not be considered open source until I also publish how I came up with those magic values?
It depends on what those magic numbers are for. If they represent pure data, and it's obvious what the data is (perhaps a bitmap image), then sure, it's open source.
If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
Even algorithms can be open source in spirit but closed source in practice. See ECDSA: the NSA has never revealed, in any verifiable way, how the specific curves used with the algorithm were chosen, so there is lingering suspicion that they were picked for some inherent (but hard-to-find) weakness.
I don't know a ton about AI, but I gather there are lots of areas in the process of producing a model where they can claim everything is "open source" as a marketing gimmick but in reality, there is no explanation for how certain results were achieved. (Trade secrets, in other words.)
> If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
To my understanding, the contents of a .safetensors file are purely numerical weights - used by the model defined in MIT-licensed code[0] and described in a technical report[1]. The weights are arguably only really "executed" to the same extent kernel weights of a gaussian blur filter would be, though there is a large difference in scale and effect.
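To make the blur analogy concrete, here's a minimal C++ sketch (mine, not from the linked code) where the kernel weights are hard-coded data consumed by generic convolution code - structurally the same relationship model weights have to inference code:

```cpp
#include <array>
#include <vector>
#include <cstddef>

// Hard-coded 1-D Gaussian kernel weights: pure data, not instructions.
constexpr std::array<float, 3> kKernel = {0.25f, 0.5f, 0.25f};

// Generic code that "executes" the weights by convolving them with input.
std::vector<float> blur(const std::vector<float>& in) {
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        for (int k = -1; k <= 1; ++k) {
            // Clamp the index at the edges.
            std::ptrdiff_t j = static_cast<std::ptrdiff_t>(i) + k;
            if (j < 0) j = 0;
            if (j >= static_cast<std::ptrdiff_t>(in.size()))
                j = static_cast<std::ptrdiff_t>(in.size()) - 1;
            out[i] += kKernel[static_cast<std::size_t>(k + 1)]
                      * in[static_cast<std::size_t>(j)];
        }
    }
    return out;
}
```

Swapping in different weights changes the filter's behavior without touching a line of code - which is exactly what makes the code/data question interesting for model weights.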
Code is data is code. Fundamentally, they are the same. We treat the two things as distinct categories only for practical convenience. Most of the time, it's pretty clear which is which, but we all regularly encounter situations in which the distinction gets blurry. For example:
- Windows MetaFiles (WMF, EMF, EMF+), still in use (mostly inside MS Office suite) - you'd think they're just another vector image format, i.e. clearly "data", but this one is basically a list of function calls to Windows GDI APIs, i.e. interpreted code.
- Any sufficiently complex XML or JSON config file ends up turning into an ad-hoc Lisp language, with ugly syntax and a parser that's a bug-ridden, slow implementation of a Lisp runtime. People don't realize that the moment they add conditionals and ability to include or refer back to other parts of config, they're more than halfway to a Turing-complete language.
- From the POV of hardware, all native code is executed "to the same extent kernel weights of a gaussian blur filter" are. In general, all code is just data for the runtime that executes it.
And so on.
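A toy sketch of the point above (hypothetical opcodes, invented just for this example): a plain byte array that is "data" to the compiler but "code" to the few-line interpreter that walks it:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Made-up opcodes for a trivial accumulator machine.
enum Op : std::uint8_t { ADD = 0, MUL = 1, END = 2 };

// Interprets the byte array as a program. Assumes each opcode
// is followed by a one-byte operand.
int run(const std::vector<std::uint8_t>& program) {
    int acc = 0;
    for (std::size_t pc = 0; pc + 1 < program.size(); pc += 2) {
        switch (program[pc]) {
            case ADD: acc += program[pc + 1]; break;
            case MUL: acc *= program[pc + 1]; break;
            case END: return acc;
        }
    }
    return acc;
}
```

Whether `{ADD, 2, MUL, 3, END, 0}` counts as code or data depends entirely on whether you're looking at it from the compiler's side or the interpreter's side.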
Point being, what is code and what is data depends on practical reasons you have to make this distinction in the first place. IMHO, for OSS licensing, when considering the reasons those licenses exist, LLM weights are code.
If you publish only the binary, it's not open source.
If you open the source, then it is open source.
If you write a book/blog about how you came up with the ideas but don't publish the source, it's not open source, even if you publish the blog + binaries.
I don't know if that compares to an AI model, where the most significant portions are the data preparation and training. The code DeepSeek released only demonstrates how to use the given weights for inferencing with Torch/Triton. I wouldn't consider that an open-source model, just wrapper code for publicly available weights.
I think a closer comparison would be Android and GApps, where if you remove the latter, most would deem the phone unusable.
The Open Source Definition is quite clear on its #2 requirement:
`The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed.`
https://opensource.org/osd
Arguably this would still apply to DeepSeek. While they didn't release a way of recreating the weights, it is perfectly valid and common to modify the neural network using only what was released (when doing fine-tuning or RLHF, for example, the original training data is not required). Modifying the weights directly certainly seems like the preferred way of modifying the model to me.
Another note is that this may be the more ethical option. I’m sure the training data contained lots of copyrighted content, and if my content was in there I would prefer that it was released as opaque weights rather than published in a zip file for anyone to read for free.
It takes away the ability to know what it does, though, which is also often considered an important aspect. By not publishing details on how the model was trained, there's no way to know whether intentional misbehavior was baked into the training. If they provided everything needed to train your own model, you could rule that out by training on data you chose yourself, using the same methodology.
IMO it should be considered freeware, and only partially open. It's like releasing an open source program with a part of it delivered as a binary.
It's not that they want to keep the training content secret, it's the fact that they stole the training content, and who they stole it from, that they want to keep secret.