Direct Preference Optimization (DPO)

2 Posts


More Factual LLMs: FactTune, a method to fine-tune LLMs for factual accuracy without human feedback

Large language models sometimes generate false statements. New work makes them more likely to produce factual output.

Human Feedback Without Reinforcement Learning: Direct Preference Optimization (DPO) fine-tunes pretrained large language models on human preferences without the cumbersome step of reinforcement learning.

Reinforcement learning from human feedback (RLHF) is widely used to fine-tune pretrained models to deliver outputs that align with human preferences. New work aligns pretrained models without the cumbersome step of reinforcement learning.
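For readers curious how DPO sidesteps reinforcement learning in practice, here is a minimal sketch of the published DPO objective. It assumes the per-sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the policy being trained and a frozen reference model; the function and argument names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities that the
    policy (or the frozen reference model) assigns to the chosen and
    rejected responses. `beta` controls how far the policy may drift
    from the reference model.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Train the policy to widen the margin between the two log-ratios.
    # This replaces the reward model plus RL step used in RLHF.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with random log-probabilities for 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

Because the loss depends only on log-probabilities of responses that are already labeled as preferred or dispreferred, it can be minimized with ordinary gradient descent, with no reward model or sampling loop in the training step.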
