Weights are kind of like a compiled binary, because they are an incomprehensible blob of bits. But they are also unlike a compiled binary, because they can be fine-tuned.
If hexl-mode on the binary works on my home PC but compiling the source code costs me millions of dollars in compute then I want the binary. Someone with millions of dollars to spend on compute may have a differing opinion.
This argument only barely holds water for the big SOTA models like Llama derivatives, and only because of the practical costs involved.
Or should I say, it held water until a few days ago.
Personally though, I never bought it. Saying that weights are the "preferred form of the work for making modifications to it" because a) approximately no one can afford to start with the training data, and b) fine-tuning and training LoRAs are cheap enough, is basically like saying binary blobs are "open source" as long as they provide an API (or ABI) for other programs to use. By this line of reasoning, NVIDIA GPU stack and Broadcom chipset firmware would qualify as open source, too.
So, as you basically state yourself, the result also depends on the training data, which makes it part of the "source" that gets compiled, in a way, just like the architecture of the model. If you have the training data, you can modify that.
But it is probably impossible for them to release the training data, as they have probably not made it all reproducible but ingested it live, and the data has since changed in many places. So the code to live-ingest the data becomes the actual source, I guess.
Cost of building is a real concern, but it doesn't stop people from forking large open projects like Chrome or Firefox, trying to build a project that pursues their own ideas, and contributing back to the upstream projects when it makes sense.
I don't build my browser, it's too expensive, but the cost of building has nothing to do with how open the access is. It'd be cool if the community could fork the project, propose changes, and maybe crowdfund a training/build run to experiment.
Comparing fine tuning to editing binaries by hand is not a fair comparison. If I could show the decompiler some output I liked and it edited the binary for me to make the output match, then the comparison would be closer.
> If I could show the decompiler some output I liked and it edited the binary for me to make the output match, then the comparison would be closer.
That's fundamentally the same thing though - you run an optimization algorithm on a binary blob. I don't see why this couldn't work. Sure, a neural net is designed to be differentiable, while ELF and PE executables aren't, but then backprop isn't the be-all, end-all of optimization algorithms.
Off the top of my head, you could reframe the task as a special kind of genetic programming problem, one that starts with a large program instead of starting from scratch, and that works on assembly instead of an abstract syntax tree. Hell, you could first decompile the executable and then have the genetic programming solver run on the decompiled code.
I'd be really surprised if no one tried that before. Or, if such functionality isn't already available in some RE tools (or as a plugin for one). My own hands-on experience with reverse engineering is limited to a few attempts at adding extra UI and functionality to StarCraft by writing some assembly, turning it into object code, and injecting it straight into the running game process[0] - but that was me doing exactly what you described, just by hand. I imagine doing such things is common practice in RE that someone already automated finding the specific parts of the binary that produce the outputs you want to modify.
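Something like this toy Python sketch is what I have in mind - everything here is made up for illustration, and run_and_score() is a hypothetical stand-in for sandbox-executing the patched binary and measuring how far its output is from what you want:

    import random

    def mutate(blob: bytearray, rate: float = 1e-4) -> bytearray:
        # flip a small fraction of bytes at random; no gradients or differentiability needed
        child = bytearray(blob)
        for i in range(len(child)):
            if random.random() < rate:
                child[i] = random.randrange(256)
        return child

    def evolve(binary: bytes, run_and_score, generations: int = 1000, population: int = 32) -> bytes:
        # keep a small pool of candidate binaries, always retaining the best scorers
        pool = [bytearray(binary)]
        for _ in range(generations):
            children = [mutate(random.choice(pool)) for _ in range(population)]
            pool = sorted(pool + children, key=run_and_score)[:8]  # lower score = closer output
        return bytes(pool[0])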
--
[0] - I sometimes miss the times before Data Execution Prevention became a thing.
The question is not whether it is ideal to do some ML tasks with it; the question is whether you can do the things you could typically do with open-source software, including looking at the source and building it, or modifying the source and building it. If you don't have the original training data, or a mechanism for getting the training data, the compiled result is not reproducible the way normal code would be, and you cannot make a version saying, for example: "I want just the same, but without it ever learning from CCP prop."
It is a fair comparison. Normal programming takes inputs and a function and produces outputs. Deep learning takes inputs and outputs and derives a function. Of course the decompilers for traditional programs do not work on inputs and outputs, it is a different paradigm!
How hard can it be to wrap it in a loop and apply some off-the-shelf good old fashioned AI^H^H optimization technique?
"Given specific inputs X and outputs Y, have a computer automatically find modifications to F so that F(X) gives Y" is a problem that's been studied for nearly a century now (longer, if relax the meaning of "computer"), with plenty of well-known solutions, most of which don't require F to be differentiable.
Isn't "operational research" a standard part of undergrad CS curriculum? It was at my alma mater.
It's amazing to me that "open source" has been so diluted that it is now used to mean "we will give you an opaque binary and permission to run it on your own computer."
Yes, training is left as an exercise to the user, but it's outlined in the paper, and a good ML engineer should be able to get started with it (cluster of GPUs not included).
There was an article saying they used hand-tuned PTX instead of CUDA, so it might be a bit hard to match just from the paper without some good performance experts.
CUDA isn't so bad that hand writing PTX will give you a huge performance improvement, but when you're spending a few million dollars on training it makes sense to chase even a single digit percentage improvement, maybe more in a very hot code-path. Also these articles are based on a single mention of PTX in a paper.
"3.2.2. Efficient Implementation of Cross-Node All-to-All Communication
In order to ensure sufficient computational performance for DualPipe, we customize efficient
cross-node all-to-all communication kernels (including dispatching and combining) to conserve
the number of SMs dedicated to communication. The implementation of the kernels is codesigned with the MoE gating algorithm and the network topology of our cluster. To be specific,
in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications
are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB
(50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each
token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its
routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node
index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is
instantaneously forwarded via NVLink to specific GPUs that host their target experts, without
being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink
are fully overlapped, and each token can efficiently select an average of 3.2 experts per node
without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3
selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts
(4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under
such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB
and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition
20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2)
IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The
number of warps allocated to each communication task is dynamically adjusted according to the
actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending,
(2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also
handled by dynamically adjusted warps. In addition, both dispatching and combining kernels
overlap with the computation stream, so we also consider their impact on other SM computation
kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and
auto-tune the communication chunk size, which significantly reduces the use of the L2 cache
and the interference to other SMs."
It's definitely not the full model written in PTX or anything, but still some significant engineering effort to replicate, from people commanding 7-figure salaries in this wave, since the training code isn't open.
And it really isn't so surprising to have to go 'down' to PTX for such a low-level optimisation. For all the love AVX512 articles get here, I, for one, am glad some people talk about their PTX secret sauce.
I wish Intel hadn't killed the maxas effort back then (buying the team out and ... what?) as they were going even lower down the stack.
To me, this feels the same as saying Sonic Colors Ultimate is open source because it was made with Godot. The engine is open source and making the game is left as an exercise to the user.
But you have all the assets of the actual finished game as well as the code used to run it, using your example. You don't get the game dev studio, i.e. datasets, expertise, and compute. Just because someone gives you all the source code and methods they used to make a game, doesn't mean anyone can just go and easily make a sequel, but it helps.
Very few entities publish the latter two items (https://huggingface.co/blog/smollm and https://allenai.org/olmo come to mind). Arguably, publishing curated large-scale pretraining data is very costly, but publishing code to automatically curate pretraining data from uncurated sources is already very valuable.
Also open-weights comes in several flavors -- there is "restricted" open-weights like Mistral's research license that prohibits most use cases (most importantly, commercial applications), then there are licenses like Llama's or DeepSeek's with some limitations, and then there are some Apache 2.0 or MIT licensed model weights.
Has it even been established whether the weights can be copyrighted? My impression has been that AI companies want to have their cake and eat it too: on one hand they argue that the models are more like a database in a search engine, hence not violating the copyright of the data they were trained on, but on the other hand they argue the weights meet the threshold of being copyrightable in their own right.
So it seems to me that it's at least dubious whether those restricted licences can be enforced (that said, you likely need deep pockets to defend yourself from a lawsuit).
Then those should not be considered “open” in any real sense—when we say “open source,” we’re talking about the four freedoms (more or less—cf. the negligible difference between OSI and FSF definitions).
So when we apply the same principles to another category, such as weights, we should not call things “open” that don’t grant those same freedoms. In the case of this research license, Freedom 0 at least is not maintained. Therefore, the weights aren’t open, and to call them “open” would be to indeed dilute the meaning of open qua open source.
Wow. Your link is frustrating because I thought everything was under the MIT license. Why did people claim it is MIT licensed if they sneaked in this additional license?
I can't be 100% certain, but I think the good news is: no. There seem to be the exact same number of safetensor files for both, and AFAICT the file sizes are identical.
If I publish some C++ code that has some hard-coded magic values in it, can the code not be considered open source until I also publish how I came up with those magic values?
It depends on what those magic numbers are for. If they represent pure data, and it's obvious what the data is (perhaps a bitmap image), then sure, it's open source.
If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
Even algorithms can be open source in spirit but closed source in practice. See ECDSA. The NSA has never revealed in any verifiable way how they came up with the specific curves used in the algorithm, so there is room for doubt that they weren't specifically chosen due to some inherent (but hard to find) weakness.
I don't know a ton about AI, but I gather there are lots of areas in the process of producing a model where they can claim everything is "open source" as a marketing gimmick but in reality, there is no explanation for how certain results were achieved. (Trade secrets, in other words.)
> If the magic values are some kind of microcode or firmware, or something else that is executed in some way, then no, it is not really open source.
To my understanding, the contents of a .safetensors file is purely numerical weights - used by the model defined in MIT-licensed code[0] and described in a technical report[1]. The weights are arguably only really "executed" to the same extent kernel weights of a gaussian blur filter would be, though there is a large difference in scale and effect.
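For the sake of the analogy, a throwaway Python example of blur-kernel "weights" - they're pure numbers, and "running" them is just multiply-adds over the input:

    import numpy as np
    from scipy.signal import convolve2d

    # 3x3 Gaussian blur kernel: nothing executable about these numbers on their own
    kernel = np.array([[1, 2, 1],
                       [2, 4, 2],
                       [1, 2, 1]], dtype=float) / 16.0

    image = np.random.rand(64, 64)          # stand-in for an input image
    blurred = convolve2d(image, kernel, mode="same", boundary="symm")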
Code is data is code. Fundamentally, they are the same. We treat the two things as distinct categories only for practical convenience. Most of the time, it's pretty clear which is which, but we all regularly encounter situations in which the distinction gets blurry. For example:
- Windows MetaFiles (WMF, EMF, EMF+), still in use (mostly inside MS Office suite) - you'd think they're just another vector image format, i.e. clearly "data", but this one is basically a list of function calls to Windows GDI APIs, i.e. interpreted code.
- Any sufficiently complex XML or JSON config file ends up turning into an ad-hoc Lisp language, with ugly syntax and a parser that's a bug-ridden, slow implementation of a Lisp runtime. People don't realize that the moment they add conditionals and ability to include or refer back to other parts of config, they're more than halfway to a Turing-complete language.
- From the POV of hardware, all native code is executed "to the same extent kernel weighs of a gaussian blur filter" are. In general, all code is just data for the runtime that executes it.
And so on.
Point being, what is code and what is data depends on practical reasons you have to make this distinction in the first place. IMHO, for OSS licensing, when considering the reasons those licenses exist, LLM weights are code.
if you publish only the binary it's not open source
if you open the source then it is open source
if you write a book/blog about how you came up with the ideas but didn't publish the source it's not open source, even if you publish the blog+binaries
I don't know if that compares to an AI model, where the most significant portions are the data preparation and training. The code DeepSeek released only demonstrates how to use the given weights for inferencing with Torch/Triton. I wouldn't consider that an open-source model, just wrapper code for publicly available weights.
I think a closer comparison would be Android and GApps, where if you remove the latter, most would deem the phone unusable.
The Open Source Definition is quite clear on its #2 requirement:
`The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed.`
https://opensource.org/osd
Arguably this would still apply to deepseek. While they didn’t release a way of recreating the weights, it is perfectly valid and common to modify the neural network using only what was released (when doing fine-tuning or RLHF for example, previous training data is not required). Doing modifications based on the weights certainly seems like the preferred way of modifying the model to me.
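As a rough sketch of what that looks like in practice (assuming the Hugging Face transformers and peft libraries; the checkpoint name and target_modules below are illustrative and depend on the model):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # load the released weights, then attach small trainable LoRA adapters to them
    model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3")
    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()      # only the adapter matrices get updated
    # ...followed by a standard fine-tuning loop over your own data;
    # the original training set is never needed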
Another note is that this may be the more ethical option. I’m sure the training data contained lots of copyrighted content, and if my content was in there I would prefer that it was released as opaque weights rather than published in a zip file for anyone to read for free.
It takes away the ability to know what it does, though, which is also often considered an important aspect. By not publishing details on how to train the model, there's no way to know if they have included intentional misbehavior in the training. If they provided everything needed to train your own model, you could ensure it isn't there by choosing your own data and using the same methodology.
IMO it should be considered freeware, and only partially open. It's like releasing an open source program with a part of it delivered as a binary.
It's not that they want to keep the training content secret, it's the fact that they stole the training content, and who they stole it from, that they want to keep secret.
It’s because prominent people with large followings are confusing the terms on purpose. Yann LeCun of Meta and Clem Delangue of Hugging Face constantly use the wrong terms for models that only release weights, and market them to their huge audiences as “open source”. This is a willful open washing campaign to benefit from the positivity that label generates.
I agree it would be nice to have the training specifics. Nevertheless, everything DeepSeek released is under the MIT license, right? So you can go set up a cloud LLM, fine-tune it, and do whatever else you wish with it, right? That is pretty significant, no?
Except the "binary" is not really opaque, and can be "edited" in exactly the same way it was produced in the first place (continued pre-training / fine-tuning).
Even with the training material, what good is it? The model isn’t reproducible, and even if it were, you’re not going to spend the money to verify the output.
Frontier models will never be reproducible in the freedom-loving countries that enforce intellectual property law, since they all depend on copyrighted content in their training data.
Why not? If we could get a version of ChatGPT that wasn't censored and would tell me how to make meth, or an uncensored version of deepseek that was willing to talk about tank man, you don't think the Internet would come together and make that happen?
> amazing to me that "open source" has been so diluted
It’s not and I called it [1].
We had three options: (A) Open weights (favoured by Altman et al); (B) Open training data (favoured by some FOSS advocates); and (C) Open weights and model, which doesn’t provide the training data, but would let you derive the weights if you had it.
OSI settled on (C) [2], but it did so late. FOSS argued for (B), but it’s impractical. So the world, for a while, had a choice between impractical (B) and the useful-if-flawed (A). The public, predictably, went with the pragmatic.
This was Betamax vs VHS, except in natural linguistics. There is still hope for (C). But it relies on (A) being rendered impractical. Unfortunately, the path to that flows through institutionalising OpenAI et al’s TOS-based fair use paradigm. Which means while we may get a definition (not exactly (B), but (A) absent use restrictions) we’ll also get restrictions on even using Chinese AI.
We absolutely had a choice (D), in that no one was forced to call it "open source" at all, which was arguably done to unfaithfully communicate benefits that don't exist. This is the part that riles people up, and that furthermore is causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS.
If you want to prioritize pragmatism, the fact that every discussion of this includes a lengthy "so what open source do you mean, exactly?" subthread proves this was a poor choice. It causes uncertainty that also makes it harder for the folks releasing these models to make their case and be taken seriously for their approach.
We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture" to appreciate the Python file that utilizes the weights more.
> We absolutely had a choice (D), in that no one was forced to call it "open source" at all
Technically yes, practically no.
You’re describing a prisoner’s dilemma. The term was available, there was (and remains) genuine ambiguity over what it meant in this context, and there are first-mover advantages in branding. (Exhibit A: how we label charges).
> causing collateral damage outside the AI bubble, and is nothing like Betamax vs. VHS
Standards wars have collateral damage.
> We should probably call them "free to run", if the "it's cheap" connotation of "freeware" needs to be avoided. Or maybe "open architecture"
Language is parsimonious. A neologism will never win when a semantic shift will do.
> Language is parsimonious. A neologism will never win when a semantic shift will do.
Agreed, but I think it's worth lamenting the danger in that. History is certainly full of transitory calamity and harm when semantic shifts detach labels from reality.
I guess we're in any case in "damage is done" territory. The question is more about where to go next. It does appear that the term "open source" isn't working for what these folks are doing (you could even argue whether the "available" term they chose was a strong one to lean on in the first place), so we'll see what direction the next shift takes.
The source code is absolutely open which is the traditional meaning of open source. You are wanting to expand this to include data sets, which is fine, but that is the divergence.
Nonono, the code for (pre-)training wasn't released either and is non-trivial to replicate. Releasing the weights without the dataset and training code is the equivalent of releasing a binary executable and calling it open source. Freeware would be more accurate terminology.
I think I see what you mean. I suppose it is kinda like an opaque binary, nevertheless, you can use it freely since all is under the MIT license right?
Yes, even for commercial purposes, which is great, but the point of, and the reason why, "open source" became popular is that you can modify the underlying source code of the binary, which you can then recompile with your modifications included (as well as selling/publishing your modifications). You can't do that with DeepSeek or most other LLMs that claim to be open source. The point isn't that this makes it bad; the point is we shouldn't call it open source, because we shouldn't lose focus on the goal of a truly open source (or free software) LLM on the same level as ChatGPT/o1.
You can modify the weights which is exactly what they do when training initially. You do not even need to do it in exactly the same fashion. You could change things such as the optimizer and it would still work. So in my opinion it is nothing like an opaque binary. It's just data.
We have the weights and the code for inference, in the analogy this is an executable binary. We are missing the code and data for training, that's the "source code".
Then it’s never distributable, and any definition of open source requiring it to be is DOA. It’s interesting as an argument against copyright. But that’s academic.
It's not academic. Why can't ChatGPT tell me how to make meth? Why doesn't deepseek want to talk about Tiananmen Square? What else has the model been nudged into doing or refusing to do? Without the full source, we don't know.
While I appreciate the argument that the term "open source" is problematic in the context of AI models, I think saying the training data is the "source code" is even worse, because it broadens the definition to be almost meaningless. We never considered data to be source code, and realistically, for 99.9999% of users the training data is not the preferred way of modifying the model: they don't have millions of dollars to retrain the full model, and likely don't even have the HDD space to store the training data.
Also, I would say arguing that the model weights are just the "binary" is disingenuous, because nobody wants releases that contain only the training data and training scripts and not the model weights (which would be perfectly fine for open source software if we argue that the weights are just the binaries); such releases would be useless to almost everyone, since almost no one has the resources to train the model.
When you can use LLMs to write code with English (or other) language, it's pretty disingenuous to not call the training data source code just because it's not exclusively written in a programming language like Python or C++.
If it's fully open source, where's the code for training it? The implementation - at least, theirs - is also not trivial, as they've mentioned optimising below the CUDA level to get maximum throughput out of their cluster.
I'm very appreciative of what they've done, but it's open weights and methodology, not open source.
For people who have the disciplinary background in neural networks and machine learning, I imagine that replicating that paper into some type of framework would be straightforward, right? Or am I mistaken?
The model itself, yes. The changes from previous architectures are often quite small code-wise, quite often just adding/changing a few lines in a torch model.
Things like tweaking all the hyperparameters to make the training process actually work may be trickier, though.
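For illustration (not DeepSeek's actual code), an architectural tweak in a torch model can literally be a one-line swap, e.g. replacing LayerNorm with RMSNorm in a block:

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            # normalize by the root-mean-square of the features instead of mean/variance
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return self.weight * x * rms

    class Block(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.norm = RMSNorm(dim)   # the "few changed lines": this used to be nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            return x + self.ffn(self.norm(x))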
With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them. The "source code" for an LLM, is the process used to create the outcome, and to an extent, the data used to train with. DeepSeek released a highly detailed paper that describes the process used to create the outcome. People/Companies are actively trying to reproduce the work of DeepSeek to confirm the findings.
It's more akin to scientific research where everyone is using the same molecules, but depending on the process you put the molecules through, you get a different outcome.
> With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them
How is that different than the 0s and 1s of a program?
Assembly instructions are literally standard. What’s more, if said program uses something like Java, the byte code is even _more_ understandable. So much so that there is an ecosystem of Java decompilers.
Binary files are not the “source” in question when talking about “open source”
There is no way to decompile an LLM's weights and obtain a somewhat meaningful, reproducible source, like with a program binary as you say. In fact, if we were to compare both in this way that would make a program binary more "open source".
Yes, all the training code is still closed, and it doesn't seem like it will ever be released. Here's[0] a comment from a dev that worked at DeepSeek.
tldr: we're already on to the next model, don't expect anything else to get open sourced.
> I was just told that the amount of people there are too limited, and open-sourcing needs another layer of hard work beyond making the training framework brrr on their own infra. So their priority has been to open-source everything that is MINIMUM + NECESSARY to the community while pushing most efforts on iterating to the next generation of models I think. They have been write everything clearly in technical reports and encourage the community to engage in reproduction , which is the unique insight of the team as well I think.
Weights are actually all you have. The "Open Source" name never applies to LLMs because they don't have a source.
But China did distribute them with sharing-friendly terms, which is completely different from others, like Meta, and makes the name way less misleading this time.