> DPO is pretty much strictly better than RLHF + PPO
Out of genuine curiosity, do you have any pointers/evidence to support this. I know that some of the industry leading research labs haven't switched over to DPO yet, in spite of the fact that DPO is significantly faster than RLHF. It might just be organizational inertia, but I do not know. I would be very happy if simpler alternatives like DPO were as good as RLHF or better, but I haven't seen that proof yet.
Out of genuine curiosity, do you have any pointers/evidence to support this. I know that some of the industry leading research labs haven't switched over to DPO yet, in spite of the fact that DPO is significantly faster than RLHF. It might just be organizational inertia, but I do not know. I would be very happy if simpler alternatives like DPO were as good as RLHF or better, but I haven't seen that proof yet.