I think the other comment thread covers this well. They are different techniques, but the line between RL and SL is quite fuzzy. The DPO authors advertise it as a "non-RL" technique precisely to get away from RL's reputation for unstable training, but they also describe and treat the language model as an (implicit) reward model, similar to the reward model used with PPO. The point is well taken though; I will update this page to clarify the differences and avoid confusion.
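
To make the "implicit reward model" point concrete, here is a minimal sketch of how I read the DPO objective. It assumes you already have the summed log-probabilities of a chosen and a rejected response under the policy and under a frozen reference model. The function names and the `beta=0.1` default are mine, purely illustrative, and not taken from any particular library.

```python
import math


def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    The prompt-only normalizing term is dropped because it cancels
    when two responses to the same prompt are compared.
    """
    return beta * (policy_logp - ref_logp)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Negative log-sigmoid of the implicit reward margin for one preference pair."""
    margin = (implicit_reward(policy_chosen, ref_chosen, beta)
              - implicit_reward(policy_rejected, ref_rejected, beta))
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

No reward model is trained or sampled from here: the "reward" is just a log-ratio computed from the language model itself, and the loss is an ordinary supervised classification loss over preference pairs. That is where the RL/SL line gets blurry.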