Weights are kind of like a compiled binary, because they are an incomprehensible blob of bits. But they are also unlike a compiled binary, because they can be fine-tuned.
If hexl-mode on the binary works on my home PC but compiling the source code costs me millions of dollars in compute then I want the binary. Someone with millions of dollars to spend on compute may have a differing opinion.
This argument only barely holds water for the big SOTA models like Llama derivatives, and only because of the practical costs involved.
Or should I say, it held water until a few days ago.
Personally though, I never bought it. Saying that weights are the "preferred form of the work for making modifications to it" because a) approximately no one can afford to start with the training data, and b) fine-tuning and training LoRAs are cheap enough, is basically like saying binary blobs are "open source" as long as they provide an API (or ABI) for other programs to use. By this line of reasoning, NVIDIA GPU stack and Broadcom chipset firmware would qualify as open source, too.
So, as you basically state yourself, the result also depends on the training data, which makes it part of the "source" that gets compiled, in a way, just like the architecture of the model. If you have the training data, you can modify that.
But it is probably impossible for them to release the training data, as they have probably not made it all reproducible but ingested it live, and the data has since changed in many places. So the code to live-ingest the data becomes the actual source, I guess.
Cost of building is a real concern, but it doesn't stop people from forking large open projects like Chrome or Firefox, trying to build a project that pursues their own ideas, and contributing back to the upstream projects when it makes sense.
I don't build my browser, it's too expensive, but the cost of building has nothing to do with how open the access is. It'd be cool if the community could fork the project, propose changes, and maybe crowdfund a training/build run to experiment.
Comparing fine tuning to editing binaries by hand is not a fair comparison. If I could show the decompiler some output I liked and it edited the binary for me to make the output match, then the comparison would be closer.
> If I could show the decompiler some output I liked and it edited the binary for me to make the output match, then the comparison would be closer.
That's fundamentally the same thing though - you run an optimization algorithm on a binary blob. I don't see why this couldn't work. Sure, a neural net is designed to be differentiable, while ELF and PE executables aren't, but then backprop isn't the be-all, end-all of optimization algorithms.
Off the top of my head, you could reframe the task as a special kind of genetic programming problem, one that starts with a large program instead of starting from scratch, and that works on assembly instead of an abstract syntax tree. Hell, you could first decompile the executable and then have the genetic programming solver run on the decompiled code.
I'd be really surprised if no one tried that before. Or, if such functionality isn't already available in some RE tools (or as a plugin for one). My own hands-on experience with reverse engineering is limited to a few attempts at adding extra UI and functionality to StarCraft by writing some assembly, turning it into object code, and injecting it straight into the running game process[0] - but that was me doing exactly what you described, just by hand. I imagine doing such things is common practice in RE that someone already automated finding the specific parts of the binary that produce the outputs you want to modify.
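Something like this toy Python sketch is what I have in mind - everything here is made up for illustration, and run_and_score() is a hypothetical stand-in for sandbox-executing the patched binary and measuring how far its output is from what you want:

    import random

    def mutate(blob: bytearray, rate: float = 1e-4) -> bytearray:
        # flip a small fraction of bytes at random; no gradients or differentiability needed
        child = bytearray(blob)
        for i in range(len(child)):
            if random.random() < rate:
                child[i] = random.randrange(256)
        return child

    def evolve(binary: bytes, run_and_score, generations: int = 1000, population: int = 32) -> bytes:
        # keep a small pool of candidate binaries, always retaining the best scorers
        pool = [bytearray(binary)]
        for _ in range(generations):
            children = [mutate(random.choice(pool)) for _ in range(population)]
            pool = sorted(pool + children, key=run_and_score)[:8]  # lower score = closer output
        return bytes(pool[0])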
--
[0] - I sometimes miss the times before Data Execution Prevention became a thing.
The question is not whether it is ideal to do some ML tasks with it; the question is whether you can do the things you could typically do with open-source software, including looking at the source and building it, or modifying the source and building it. If you don't have the original training data, or a mechanism for getting the training data, the compiled result is not reproducible the way normal code would be, and you cannot make a version saying, for example: "I want just the same, but without it ever learning from CCP prop."
It is a fair comparison. Normal programming takes inputs and a function and produces outputs. Deep learning takes inputs and outputs and derives a function. Of course the decompilers for traditional programs do not work on inputs and outputs, it is a different paradigm!
How hard can it be to wrap it in a loop and apply some off-the-shelf good old fashioned AI^H^H optimization technique?
"Given specific inputs X and outputs Y, have a computer automatically find modifications to F so that F(X) gives Y" is a problem that's been studied for nearly a century now (longer, if relax the meaning of "computer"), with plenty of well-known solutions, most of which don't require F to be differentiable.
Isn't "operational research" a standard part of undergrad CS curriculum? It was at my alma mater.
It's amazing to me that "open source" has been so diluted that it is now used to mean "we will give you an opaque binary and permission to run it on your own computer."
Yes, training is left as an exercise to the user, but it's outlined in the paper, and a good ML engineer should be able to get started with it (cluster of GPUs not included).
There was an article saying they used hand-tuned PTX instead of CUDA, so it might be a bit hard to match just from the paper without some good performance experts.
CUDA isn't so bad that hand writing PTX will give you a huge performance improvement, but when you're spending a few million dollars on training it makes sense to chase even a single digit percentage improvement, maybe more in a very hot code-path. Also these articles are based on a single mention of PTX in a paper.
"3.2.2. Efficient Implementation of Cross-Node All-to-All Communication
In order to ensure sufficient computational performance for DualPipe, we customize efficient
cross-node all-to-all communication kernels (including dispatching and combining) to conserve
the number of SMs dedicated to communication. The implementation of the kernels is codesigned with the MoE gating algorithm and the network topology of our cluster. To be specific,
in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications
are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB
(50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each
token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its
routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node
index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is
instantaneously forwarded via NVLink to specific GPUs that host their target experts, without
being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink
are fully overlapped, and each token can efficiently select an average of 3.2 experts per node
without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3
selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts
(4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under
such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB
and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition
20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2)
IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The
number of warps allocated to each communication task is dynamically adjusted according to the
actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending,
(2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also
handled by dynamically adjusted warps. In addition, both dispatching and combining kernels
overlap with the computation stream, so we also consider their impact on other SM computation
kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and
auto-tune the communication chunk size, which significantly reduces the use of the L2 cache
and the interference to other SMs."
It's definitely not the full model written in PTX or anything, but still some significant engineering effort to replicate, from people commanding 7-figure salaries in this wave, since the training code isn't open.
And it really isn't so surprising to have to go 'down' to PTX for such a low-level optimisation. For all the love AVX512 articles get here, I, for one, am glad some people talk about their PTX secret sauce.
I wish Intel hadn't killed the maxas effort back then (buying the team out and ... what?) as they were going even lower down the stack.
To me, this feels the same as saying Sonic Colors Ultimate is open source because it was made with Godot. The engine is open source and making the game is left as an exercise to the user.
But you have all the assets of the actual finished game as well as the code used to run it, using your example. You don't get the game dev studio, i.e. datasets, expertise, and compute. Just because someone gives you all the source code and methods they used to make a game, doesn't mean anyone can just go and easily make a sequel, but it helps.
Very few entities publish the latter two items (https://huggingface.co/blog/smollm and https://allenai.org/olmo come to mind). Arguably, publishing curated large-scale pretraining data is very costly, but publishing code to automatically curate pretraining data from uncurated sources is already very valuable.
Also open-weights comes in several flavors -- there is "restricted" open-weights like Mistral's research license that prohibits most use cases (most importantly, commercial applications), then there are licenses like Llama's or DeepSeek's with some limitations, and then there are some Apache 2.0 or MIT licensed model weights.
Has it even been established whether the weights can be copyrighted? My impression has been that AI companies want to have their cake and eat it too: on one hand they argue that the models are more like a database in a search engine, hence not violating the copyright of the data they were trained on, but on the other hand they argue the weights meet the threshold of being copyrightable in their own right.
So it seems to me that it's at least dubious whether those restricted licences can be enforced (that said, you likely need deep pockets to defend yourself from a lawsuit).
Then those should not be considered “open” in any real sense—when we say “open source,” we’re talking about the four freedoms (more or less—cf. the negligible difference between OSI and FSF definitions).
So when we apply the same principles to another category, such as weights, we should not call things “open” that don’t grant those same freedoms. In the case of this research license, Freedom 0 at least is not maintained. Therefore, the weights aren’t open, and to call them “open” would be to indeed dilute the meaning of open qua open source.
Wow. Your link is frustrating because I thought everything was under the MIT license. Why did people claim it is MIT licensed if they sneaked in this additional license?
I can't be 100% certain, but I think the good news is: no. There seem to be the exact same number of safetensor files for both, and AFAICT the file sizes are identical.
If I publish some C++ code that has some hard-coded magic values in it, can the code not be considered open source until I also publish how I came up with those magic values?
It depends on what those magic numbers are for. If they represent pure data, and it's obvious what the data is (perhaps a bitmap image), then sure, it's open source.
If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
Even algorithms can be open source in spirit but closed source in practice. See ECDSA. The NSA has never revealed in any verifiable way how they came up with the specific curves used in the algorithm, so there is room for doubt that they weren't specifically chosen due to some inherent (but hard to find) weakness.
I don't know a ton about AI, but I gather there are lots of areas in the process of producing a model where they can claim everything is "open source" as a marketing gimmick but in reality, there is no explanation for how certain results were achieved. (Trade secrets, in other words.)
> If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
To my understanding, the contents of a .safetensors file is purely numerical weights - used by the model defined in MIT-licensed code[0] and described in a technical report[1]. The weights are arguably only really "executed" to the same extent kernel weights of a gaussian blur filter would be, though there is a large difference in scale and effect.
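For the sake of the analogy, a throwaway Python example of blur-kernel "weights" - they're pure numbers, and "running" them is just multiply-adds over the input:

    import numpy as np
    from scipy.signal import convolve2d

    # 3x3 Gaussian blur kernel: nothing executable about these numbers on their own
    kernel = np.array([[1, 2, 1],
                       [2, 4, 2],
                       [1, 2, 1]], dtype=float) / 16.0

    image = np.random.rand(64, 64)          # stand-in for an input image
    blurred = convolve2d(image, kernel, mode="same", boundary="symm")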
Code is data is code. Fundamentally, they are the same. We treat the two things as distinct categories only for practical convenience. Most of the time, it's pretty clear which is which, but we all regularly encounter situations in which the distinction gets blurry. For example:
- Windows MetaFiles (WMF, EMF, EMF+), still in use (mostly inside MS Office suite) - you'd think they're just another vector image format, i.e. clearly "data", but this one is basically a list of function calls to Windows GDI APIs, i.e. interpreted code.
- Any sufficiently complex XML or JSON config file ends up turning into an ad-hoc Lisp language, with ugly syntax and a parser that's a bug-ridden, slow implementation of a Lisp runtime. People don't realize that the moment they add conditionals and ability to include or refer back to other parts of config, they're more than halfway to a Turing-complete language.
- From the POV of hardware, all native code is executed "to the same extent kernel weighs of a gaussian blur filter" are. In general, all code is just data for the runtime that executes it.
And so on.
Point being, what is code and what is data depends on practical reasons you have to make this distinction in the first place. IMHO, for OSS licensing, when considering the reasons those licenses exist, LLM weights are code.
if you publish only the binary it's not open source
if you open the source then it is open source
if you write a book/blog about how you came up with the ideas but didn't publish the source it's not open source, even if you publish the blog+binaries
I don't know if that compares to an AI model, where the most significant portions are the data preparation and training. The code DeepSeek released only demonstrates how to use the given weights for inferencing with Torch/Triton. I wouldn't consider that an open-source model, just wrapper code for publicly available weights.
I think a closer comparison would be Android and GApps, where if you remove the latter, most would deem the phone unusable.
The Open Source Definition is quite clear on its #2 requirement:
`The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed.`
https://opensource.org/osd
Arguably this would still apply to deepseek. While they didn’t release a way of recreating the weights, it is perfectly valid and common to modify the neural network using only what was released (when doing fine-tuning or RLHF for example, previous training data is not required). Doing modifications based on the weights certainly seems like the preferred way of modifying the model to me.
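As a rough sketch of what that looks like in practice (assuming the Hugging Face transformers and peft libraries; the checkpoint name and target_modules below are illustrative and depend on the model):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # load the released weights, then attach small trainable LoRA adapters to them
    model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3")
    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()      # only the adapter matrices get updated
    # ...followed by a standard fine-tuning loop over your own data;
    # the original training set is never needed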
Another note is that this may be the more ethical option. I’m sure the training data contained lots of copyrighted content, and if my content was in there I would prefer that it was released as opaque weights rather than published in a zip file for anyone to read for free.
It takes away the ability to know what it does, though, which is also often considered an important aspect. By not publishing details on how to train the model, there's no way to know if they have included intentional misbehavior in the training. If they provided everything needed to train your own model, you could ensure it isn't there by choosing your own data and using the same methodology.
IMO it should be considered freeware, and only partially open. It's like releasing an open source program with a part of it delivered as a binary.
It's not that they want to keep the training content secret, it's the fact that they stole the training content, and who they stole it from, that they want to keep secret.
It’s because prominent people with large followings are confusing the terms on purpose. Yann LeCun of Meta and Clem Delangue of Hugging Face constantly use the wrong terms for models that only release weights, and market them to their huge audiences as “open source”. This is a willful open washing campaign to benefit from the positivity that label generates.
I agree it would be nice to have the training specifics. Nevertheless, everything DeepSeek released is under the MIT license, right? So you can go set up a cloud LLM, fine-tune it, and do whatever else you wish with it, right? That is pretty significant, no?
Except the "binary" is not really opaque, and can be "edited" in exactly the same way it was produced in the first place (continued pre-training / fine-tuning).
Even with the training material, what good is it? The model isn’t reproducible, and even if it were, you’re not going to spend the money to verify the output.
Frontier models will never be reproducible in the freedom-loving countries that enforce intellectual property law, since they all depend on copyrighted content in their training data.
Why not? If we could get a version of ChatGPT that wasn't censored and would tell me how to make meth, or an uncensored version of deepseek that was willing to talk about tank man, you don't think the Internet would come together and make that happen?
> amazing to me that "open source" has been so diluted
It’s not and I called it [1].
We had three options: (A) Open weights (favoured by Altman et al); (B) Open training data (favoured by some FOSS advocates); and (C) Open weights and model, which doesn’t provide the training data, but would let you derive the weights if you had it.
OSI settled on (C) [2], but it did so late. FOSS argued for (B), but it’s impractical. So the world, for a while, had a choice between impractical (B) and the useful-if-flawed (A). The public, predictably, went with the pragmatic.
This was Betamax vs VHS, except in natural linguistics. There is still hope for (C). But it relies on (A) being rendered impractical. Unfortunately, the path to that flows through institutionalising OpenAI et al’s TOS-based fair use paradigm. Which means while we may get a definition (not exactly (B), but (A) absent use restrictions) we’ll also get restrictions on even using Chinese AI.
We absolutely had a choice (D), in that no one was forced to call it "open source" at all, which was arguably done to unfaithfully communicate benefits that don't exist. This is the part that riles people up, and that furthermore is causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS.
If you want to prioritize pragmatism, the fact that every discussion of this includes a lengthy "so what open source do you mean, exactly?" subthread proves this was a poor choice. It causes uncertainty that also makes it harder for the folks releasing these models to make their case and be taken seriously for their approach.
We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture" to appreciate the Python file that utilizes the weights more.
> We absolutely had a choice (D), in that no one was forced to call it "open source" at all
Technically yes, practically no.
You’re describing a prisoner’s dilemma. The term was available, there was (and remains) genuine ambiguity over what it meant in this context, and there are first-mover advantages in branding. (Exhibit A: how we label charges).
> causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS
Standards wars have collateral damage.
> We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture"
Language is parsimonious. A neologism will never win when a semantic shift will do.
> Language is parsimonious. A neologism will never win when a semantic shift will do.
Agreed, but I think it's worth lamenting the danger in that. History is certainly full of transitory calamity and harm when semantic shifts detach labels from reality.
I guess we're in any case in "damage is done" territory. The question is more about where to go next. It does appear that the term "open source" isn't working for what these folks are doing (you could even argue whether the "available" term they chose was a strong one to lean on in the first place), so we'll see what direction the next shift takes.
The source code is absolutely open which is the traditional meaning of open source. You are wanting to expand this to include data sets, which is fine, but that is the divergence.
Nonono, the code for (pre-)training wasn't released either and is non-trivial to replicate. Releasing the weights without the dataset and training code is the equivalent of releasing a binary executable and calling it open source. Freeware would be more accurate terminology.
I think I see what you mean. I suppose it is kinda like an opaque binary, nevertheless, you can use it freely since all is under the MIT license right?
Yes, even for commercial purposes, which is great, but the point of, and the reason why, "open source" became popular is that you can modify the underlying source code of the binary, which you can then recompile with your modifications included (as well as selling/publishing your modifications). You can't do that with DeepSeek or most other LLMs that claim to be open source. The point isn't that this makes it bad; the point is we shouldn't call it open source, because we shouldn't lose focus on the goal of a truly open source (or free software) LLM on the same level as ChatGPT/o1.
You can modify the weights which is exactly what they do when training initially. You do not even need to do it in exactly the same fashion. You could change things such as the optimizer and it would still work. So in my opinion it is nothing like an opaque binary. It's just data.
We have the weights and the code for inference, in the analogy this is an executable binary. We are missing the code and data for training, that's the "source code".
Then it’s never distributable, and any definition of open source requiring it to be is DOA. It’s interesting as an argument against copyright. But that’s academic.
It's not academic. Why can't ChatGPT tell me how to make meth? Why doesn't deepseek want to talk about Tiananmen Square? What else has the model been nudged into doing or refusing to do? Without the full source, we don't know.
While I appreciate the argument that the term "open source" is problematic in the context of AI models, I think saying the training data is the "source code" is even worse, because it broadens the definition to be almost meaningless. We never considered data to be source code, and realistically, for 99.9999% of users the training data is not the preferred way of modifying the model: they don't have millions of dollars to retrain the full model, and likely don't even have the HDD space to store the training data.
Also, I would say arguing that the model weights are just the "binary" is disingenuous, because nobody wants releases that contain only the training data and training scripts and not the model weights (which would be perfectly fine for open source software if we argue that the weights are just the binaries); such releases would be useless to almost everyone, since almost no one has the resources to train the model.
When you can use LLMs to write code with English (or other) language, it's pretty disingenuous to not call the training data source code just because it's not exclusively written in a programming language like Python or C++.
If it's fully open source, where's the code for training it? The implementation - at least, theirs - is also not trivial, as they've mentioned optimising below the CUDA level to get maximum throughput out of their cluster.
I'm very appreciative of what they've done, but it's open weights and methodology, not open source.
For people who have the disciplinary background in neural networks and machine learning, I imagine that replicating that paper into some type of framework would be straightforward, right? Or am I mistaken?
The model itself, yes. The changes from previous architectures are often quite small code-wise, quite often just adding/changing a few lines in a torch model.
Things like tweaking all the hyperparameters to make the training process actually work may be trickier, though.
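For illustration (not DeepSeek's actual code), an architectural tweak in a torch model can literally be a one-line swap, e.g. replacing LayerNorm with RMSNorm in a block:

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            # normalize by the root-mean-square of the features instead of mean/variance
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return self.weight * x * rms

    class Block(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.norm = RMSNorm(dim)   # the "few changed lines": this used to be nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            return x + self.ffn(self.norm(x))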
With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them. The "source code" for an LLM, is the process used to create the outcome, and to an extent, the data used to train with. DeepSeek released a highly detailed paper that describes the process used to create the outcome. People/Companies are actively trying to reproduce the work of DeepSeek to confirm the findings.
It's more akin to scientific research where everyone is using the same molecules, but depending on the process you put the molecules through, you get a different outcome.
> With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them
How is that different than the 0s and 1s of a program?
Assembly instructions are literally standard. What’s more, if said program uses something like Java, the byte code is even _more_ understandable. So much so that there is an ecosystem of Java decompilers.
Binary files are not the “source” in question when talking about “open source”
There is no way to decompile an LLM's weights and obtain a somewhat meaningful, reproducible source, like with a program binary as you say. In fact, if we were to compare both in this way that would make a program binary more "open source".
Yes, all the training code is still closed, and it doesn't seem like it will ever be released. Here's[0] a comment from a dev that worked at DeepSeek.
tldr: we're already on to the next model, don't expect anything else to get open sourced.
> I was just told that the amount of people there are too limited, and open-sourcing needs another layer of hard work beyond making the training framework brrr on their own infra. So their priority has been to open-source everything that is MINIMUM + NECESSARY to the community while pushing most efforts on iterating to the next generation of models I think. They have been write everything clearly in technical reports and encourage the community to engage in reproduction , which is the unique insight of the team as well I think.
Weights are actually all you have. The "Open Source" name never applies to LLMs because they don't have a source.
But China did distribute them with sharing-friendly terms, which is completely different from others, like Meta, and makes the name way less misleading this time.