More

patelajay285 · on Nov 14, 2024

We found the same result a few years ago in our ICLR paper: https://arxiv.org/pdf/2209.14500

We found Google's T5 models which were released in 2019, pre-GPT-3, were "secretly" capable of in-context learning with a simple inference technique.

Given they use a bidirectional MLM (Masked Language Modeling) objective, it wasn't obvious how to do it, but MLM objectives are known to produce better language representations than causal (next token prediction) objectives. We were able to outperform much larger sized GPT-3 models or get very close to their performance with far smaller T5 models.

cscurmudgeon · on Nov 14, 2024

Are there any intrinsic dis/advantages of bidirectional models over causal models for in-context learning? It seems that unidirectional model just have been explored and worked on more.

patelajay285 · on Nov 14, 2024

When you train bidirectionally only, you don't get a generative model, that would be the downside. However, you can train on a mixture of causal and bidirectional objectives as some LLM pre-training has done. As far as I am aware, there are no downsides of that, but it is not more common simply because the standard practice has been to train causal only and there just isn't enough funding/attention to go into experimenting on every axis of pre-training (which can be very expensive).

namibj · on Nov 14, 2024

No, you can generate with them using diffusion.

zxexz · on Nov 15, 2024

Yep. That technique works very well. Surprised that it’s not more widely used.

byefruit · on Nov 15, 2024

This is very interesting. Have you got any references describing this approach?

namibj · on Nov 28, 2024

I'll try to remember to search some out that I read back when I did a literature review of the subject (probably was 11 months ago).

mycall · on Nov 15, 2024

Isn't Q* (or Quiet-STaR) a causal and bidirectional objective learning system?

toxik · on Nov 14, 2024

From that paper it seems the sampling method (SAP) is also slower, so that it beats larger models seems expected.

patelajay285 · on Nov 14, 2024

It's not at all expected. T5 models are not generative models by default and they were not thought to be able to perform generation, let alone in-context learning. Remember these models were released before any of the existing LLMs and in-context learning/prompting as a technique became popularized with GPT-3.

While the technique requires multiple samples to coax generations from this particular model, other LLM training schemes have incorporated both unidirectional and bidirectional objectives in their training now. However, this exploration hasn't been fully resolved as most models are still trained only on the causal objective by standard practice. There's still a a lot of exploration that can be done on pre-training objectives.

authorfly · on Nov 15, 2024

You are right, but it's a little misleading (as it sounds like is the usefulness of your work nowadays) - Comparisons on language modelling prowess of BERT/T5 being compared to the default, non-instruct GPT-3 or OPT-3 isn't really that useful if done by size, because in practice we don't use 1.3B generative models, and more importantly, because focusing on default decoding generation without an instruct/PPO step is not how these models are used practically. The instruct models blow this performance out of the water, but instruct plus better-performance-at-size for GPT models completely shows the dominance of decoder-only architectures in my opinion for now.
I think you have to consider that in 2020/2021 many PhDs and Professors attempted to shift grant funded research with BERT and T5 to explore how they could compete with GPT-3 or to express other properties of it that supposedley outdid GPT-3. Very few (besides sentence transformers) succeeded. It's not like this is an unexplored niche. A lot of people in denial were trying to keep on with BERT research for a while despite the fact their work was essentially made obsolete by GPT-3.

(and notably Table 1 and Figure 4 are cherrypicking the smallest size with the largest gaps in task difference, and a size we know decoding is not performative at - 1.3B param mark - the characteristics and conclusions the authors come to (wow, BERT is trained on less data but does better!) obviously can't be made at larger sizes because the actual GPT models become much larger)

patelajay285 · on Oct 31, 2024

We've been working on a Python framework where one of the use cases is easy distillation from larger models to smaller open-source models and smaller-closed source models (where you don't have to still use / pay for the closed-source API service): https://datadreamer.dev/docs/latest/

Here's an (now slightly outdated) example of OpenAI GPT-4 => OpenAI GPT-3.5: https://datadreamer.dev/docs/latest/pages/get_started/quick_...

But you can also do GPT-4 to any model on HuggingFace. Or something like Llama-70B to Llama-1B.

For some tasks, this kind of distillation works extremely well given even a few hundred examples of the larger model performing the task.

bangaladore · on Oct 31, 2024

> OpenAI GPT-4 => OpenAI GPT-3.5

I'm confused why you are mentioning 3.5 here. The weights aren't public, so you aren't actually running any derivative of GPT-3.5

Or am I mistaken. Can you clarify?

Tiberium · on Oct 31, 2024

> distillation from larger models to smaller open-source models and smaller-closed source models

They don't limit it only to open-source models. And you can finetune 3.5 Turbo on OpenAI API.

patelajay285 · on Feb 11, 2024

Collecting data is hard, but the library is also a synthetic data generation library, so for example you can create the data for DPO fully synthetically, check out the self-rewarding LLMs example: https://datadreamer.dev/docs/latest/pages/get_started/quick_...

patelajay285 · on Feb 11, 2024

Yes it is :), but the library is also a synthetic data generation library, so for example you can create the data for DPO fully synthetically, check out the self-rewarding LLMs example:

https://datadreamer.dev/docs/latest/pages/get_started/quick_...

sillysaurusx · on Feb 11, 2024

I’m extremely skeptical of this approach. Until proven otherwise, with a model that users actually find useful, I don’t think this can work.

It would be nice. But I’ve seen too many nice ideas completely fall apart in practice to accept this without some justification. Even if there are papers on the topic, and those papers show that the models rank highly according to some eval metrics, the only metric that truly matters is "the user likes the model and it solves their problems."

By the way, on a separate topic, the 90/10 dataset split that you do in all of your examples turns out to be fraught with peril in practice. The issue is that the validation dataset quality turns out to be crucial, and randomly yeeting 10% of your data into the validation dataset without manual review is a recipe for problems.

patelajay285 · on Feb 11, 2024

It's a demo snippet of how to setup the workflow, it's not meant to be a working production example a self-rewarding model or a faithful reproduction of the original paper. Whether self-rewarding LLMs are a good idea or not, it's a valuable and very active area of research in the literature today. This is a library for ML researchers who should actively research and study these avenues along with the pitfalls you're mentioning. But in order for them to do that, building these workflows have to be accessible to them, which is what this library is meant to do. It's not meant for the "hobbyist" ML-community, they should not be using synthetic data today in this way, it would likely lead to subpar results for any practical model or task.

sillysaurusx · on Feb 11, 2024

There’s a lot to unpack here.

First, I’m an ML researcher. I don’t go around saying so because appeal to authority is bogus, but since every one of your comments seems to do this, it’s unavoidable.

You say the code is for ML researchers, then flat out say that it’s not a working production example, nor is it a faithful reproduction of a paper. So what is it?

Whether you want it to be or not, your audience is the hobbyist ML community, because without benchmarks to back up your code examples, no one from the research community will trust your examples without actual proof that they work. That’s the hard part of research, and it’s most of the effort.

My advice is, write something that can train useful models. Implement a production grade workflow, and show some reasons why it works. If you’re trying to get the wider ML research community to buy in to this, there’s not much other way to do it. No one will want to take easy code that does the wrong thing, and most of your examples show the wrong thing to do, like the 90/10 split.

You’re also a bit defensive about accepting feedback. Trust me that it’s better to accept that your code sucks and does the wrong thing, and then try to make it suck less and do the right thing. That’s how the majority of good software is written, unless you’re cperciva. But he’d also publish a paper explaining why his code is correct.

Anyway, the whole point of posting this to HN is to get feedback on it. (If you were hoping that a bunch of people would suddenly use it, then you need to appeal to the hobbyist community. They’ve told you a bunch of things that you’ve straight up said is out of scope.) And it sounds like you were hoping for feedback from ML researchers. Maybe others will chime in, but for now, that’s the best I’ve got.

patelajay285 · on Feb 11, 2024

I think you're interpreting hostility where there is none, so I don't have much to say other than it's an infrastructure library, a demonstration snippet doesn't need to show how to train a production grade model. I appreciate the feedback and it's noted.

sillysaurusx · on Feb 12, 2024

Well, this is a decent example. I didn’t say you were hostile, just defensive.

As an ML researcher, infrastructure libraries need to show how to train a production grade model, or else they’re useless for research. This is why research is hard. You keep handwaving this in various ways, but if you want ML researchers to take this seriously, you need a serious example.

"Production grade" doesn’t mean that it needs to have a deployable API. It memes the model needs to not suck. And until your training code can train a model that doesn’t suck, every ML researcher will view this and think "this code is guaranteed to produce a model that sucks," since there’s no evidence to the contrary. It’s incredibly hard to get the details right, and I can’t count the number of times I’ve had to track down some obscure bug buried deep within abstraction layers.

I’m trying to help you here. Ask yourself: who are my users? Are your users ML researchers? I already explained the problems we have, and why your library doesn’t meet those needs. Are your users ML hobbyists? You’ve already said no to this, and I think that’s a mistake. Most ML researchers behave as hobbyists, in the sense that they’re always looking for simple, understandable examples. Your library gives that, but without any of the rigor necessary to show that it can be trusted. Are your users ML devops, since it’s infrastructure? No, because it’s training models.

So you’re excluding every possible user, whether you realize it or not. But we’ll see; in a few months, if your library has significant traction, I’m empirically wrong. But I’m trying to help you avoid the default outcome of nobody uses your code because you’re not designing it for any particular user.

patelajay285 · on Feb 12, 2024

Thanks for clarifying, for the record, I generally agree with you. I think we just disagree on the snippets and how in-depth they need to be. Our library is built on HF libraries (we don't implement the training code ourselves), which are popular and commonly utilized by researchers, and people know how to build good models on those libraries. The package is simply meant to provide an easier interface to create some of these complex multi-stage LLM workflows that are starting to become common at ML research conferences and reduce boilerplate code around common functions (caching or tokenizing).

But I hear you on it would be useful to also have some examples that show a proper, reliable model being trained with the library v.s. just example models. The project is pretty early, and we'll work on adding more examples.

patelajay285 · on Feb 11, 2024

Thanks for the question. This is built for ML researchers, so in examples we use the defacto source for datasets researchers often use, HF Hub.

However, there is a lot of documentation on the site to help guide users. This documentation page shows you can load in data via local datasets as well. For example, JSON, CSV, text files, a local HF Dataset folder, or even from a Python `dict` or `list`:

https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...

We'll definitely keep improving documentation, guides, and examples. We have a lot of it already, and more to come! This has only recently become a public project :)

If anyone has any questions on using it, feel free to email me directly (email on the site and HN bio) for help in the meantime.

mk_stjames · on Feb 11, 2024

I did glance at the docs first before commenting but I was looking in 'datasets' to try and understand importing a potential CSV/JOSN etc and all I saw was verbage on accessing the output.

I would not have guessed that the base input data processing would have been filed under 'steps'. But now I kinda see how you are working, but I admit I'm not the target audience.

If you want this to really take off for people outside of a very, very specific class of researchers... setup an example on your landing page that calls to a local JSON of user prompts/answers/rejects finetuning a llama model with your datadreamer.steps.JSONDataSource into the loader. Or, a txt file with the system/user/assistant prompts tagged and examples given. Yes, your 'lines of code' for your frontpage example may grow a bit!

Maybe there are a lot of 'ML researchers' that are used to the type of super-abstract OOP API, load-it-from-huggingface-scheme-people you are targeting but also know that there are a ton that aren't.

patelajay285 · on Feb 11, 2024

That's totally fair and good feedback, it's hard to support everyone's use cases simultaneously, but from my own research and other researchers we collaborate with, this solves and streamlines the right set of problems, but we want to make this as broadly useful as possible. Always happy to chat more / provide support if you would like, feel free to reach out if you want to try it and run into any sharp edges I could help make easier.

patelajay285 · on Feb 11, 2024

This was discussed in another comment, DPO is pretty much strictly better than RLHF + PPO, and far more stable when training. Yes, DPO is not technically "RL", but it's semantics for the most part. DataDreamer does support PPO training if you want, but it's so unstable, it's a less popular choice now.

antonvs · on Feb 11, 2024

In the DPO paper linked from the OP page, DPO is described as "a simple RL-free algorithm for training language models from preferences." So as you say, "not technically RL."

Given that, shouldn't the first sentence on the linked page end with "...in a process known as DPO (...)" ? Ditto for the title.

It sounds like you're saying that the terms RL and RLHF should subsume DPO because they both solve the same problem, with similar results. But they're different techniques, and there are established terms for both of them.

patelajay285 · on Feb 11, 2024

I think the discussion in the other comment thread discusses this well. They are different techniques, but the line between RL & SL is quite fuzzy. The DPO authors advertise this as a "non-RL" technique to precisely get away from the reputation of unstable training RL has, but they also say and treat the language model as an (implicit) reward model, similar to PPO. The point is well taken though, I will update this page to clarify the differences to avoid confusion.

vvrm · on Feb 11, 2024

> DPO is pretty much strictly better than RLHF + PPO

Out of genuine curiosity, do you have any pointers/evidence to support this. I know that some of the industry leading research labs haven't switched over to DPO yet, in spite of the fact that DPO is significantly faster than RLHF. It might just be organizational inertia, but I do not know. I would be very happy if simpler alternatives like DPO were as good as RLHF or better, but I haven't seen that proof yet.

changoplatanero · on Feb 12, 2024

I can second that. From what I’ve heard from people at leading labs, it’s not clear that dpo is worth switching to from RLHF

patelajay285 · on Feb 11, 2024

This is built for ML researchers out of an academic lab. There's a ton of functionality in the library (beyond RLHF and alignment) that ML researchers do every day to write papers and run experiments that the library helps abstract and make repeatable and usable.

Unless your research hypothesis is specifically around improving or changing RLHF, it's unlikely you should be implementing it from scratch. Abstractions are useful for a reason. The library is quite configurable to let you tune any knobs you would want.

patelajay285 · on Feb 11, 2024

That’s totally valid and something we would even encourage! This project is for researchers so if there is a point where the abstraction is no longer useful, by all means configure, or subclass, or copy code.

patelajay285 · on Feb 11, 2024

Thanks, appreciate the feedback, will update when I get a chance!

patelajay285 · on Feb 11, 2024

Yep, DPO is not technically “RL” and implicitly uses the LLM itself as a reward model, but training with DPO is far more stable for that reason.

espadrine · on Feb 11, 2024

DPO is as close to RL as RLHF. The latter also uses the LLM as a reward model.

I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Still, what the code does isn't what is described in the paper that the page links to.

nextaccountic · on Feb 11, 2024

> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Isn't this just because reinforcement learning and supervised learning are both optimization problems?

espadrine · on Feb 11, 2024

In part, yes! But also because what used to define it was the human-curated datasets: SL contained input/output pairs, while RL contained episodes with sporadic rewards.

Nowadays, many datasets have different forms or are synthetic. DPO uses datasets with both positive and negative examples (instead of just a target output as with traditional SL); RLHF uses synthetic rewards.

patelajay285 · on Feb 11, 2024

I tend to agree @espadrine, it's semantics for the most part