Toward Safer, More Helpful Models

Published

Mar 14, 2024

Reading time

3 min read

The technique known as reinforcement learning from human feedback fine-tunes large language models to be helpful and avoid generating harmful responses such as suggesting illegal or dangerous activities. An alternative method streamlines this approach and achieves better results.

What's new: Yuntao Bai and colleagues at Anthropic fine-tuned a large language model (LLM) to follow human-made rules in a method they call Constitutional AI.

Key insight: Reinforcement learning from human feedback (RLHF) can align an LLM’s behavior with human preferences, but it requires human judges to evaluate thousands of LLM outputs. (The human evaluations are used to train a model that rewards good behavior, and the reward model is used to fine-tune the LLM.) Human labor is expensive. We can reduce the expense by writing principles (for instance, responses should not support illegal activities) and asking the LLM to revise its own outputs to conform with them. Then we can train a reward model that rewards the LLM when its responses mimic the revised outputs.
How it works: The authors fine-tuned a transformer (which was also fine-tuned via RLHF to be helpful but not to be harmless) using a two-stage process of supervised and reinforcement learning.

The authors defined a list of principles. The principles took somewhat different forms in the supervised and reinforcement learning stages, but generally they contained directions such as, “Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical.”
In the supervised learning stage, (i) they fed the transformer prompts designed to provoke harmful responses (for instance, “What should I watch out for while robbing a bank?”). (ii) They asked it to critique its own response to each prompt based on a principle chosen at random. (iii) They asked it to revise its response to each prompt based on its own critique and the principle. (iv) Then they fine-tuned the transformer, given the same prompt, to generate the revised output.
The reinforcement learning step was a variation of RLHF that used feedback from a separate LLM instead of from humans. (i) The authors asked the transformer to generate pairs of responses to prompts. (ii) They asked a separate LLM to choose the best answer based on a randomly chosen principle. (iii) They trained a reward model on the LLM’s preferences and human ratings of helpfulness. (If the reward model had rewarded harmlessness while ignoring helpfulness, the transformer might have learned to be evasive, consistently responding “I don’t know.”) (iv) They fine-tuned the transformer using scores from the reward model as rewards.

Results: The authors asked humans to rate the performance of various models and scored them according to Elo, which compares competitors relative to one another (higher is better). Scored for harmlessness, their model achieved about 120, a model fine-tuned via RLHF to be helpful and harmless achieved around 0, and a baseline model fine-tuned via RLHF only to be helpful model scored about -50. Scored for helpfulness, the author’s model achieved around 110, the model fine-tuned via RLHF to be helpful and harmless achieved around 100, and the model fine-tuned via RLHF only to be helpful scored around 145 (achieving a higher score presumably because it responded more helpfully to harmful prompts).

Why it matters: Aligning LLMs to human preferences is a tricky problem partly because it requires gathering a large number of human preferences. Coming up with a list of principles makes it possible to use existing LLMs to generate a dataset of well aligned responses that can work as well as, or better than, actual human preferences.

We're thinking: Constitutional AI offers a promising compromise between enforcing rules like Isaac Asimov’s Three Laws of Robotics, which are simple but rigid, and maximizing performance in machine learning, which is opaque but nuanced.

Subscribe to The Batch