Better Teachers Make Better Students: Microsoft’s Orca 2 strengthens the native reasoning abilities of smaller models

Jun 5, 2024

A relatively small student LLM that learns to mimic a larger teacher model can perform nearly as well as the teacher while using much less computation. It can come even closer if the teacher also strengthens the student’s native reasoning skills.

What’s new: Arindam Mitra and colleagues at Microsoft proposed Orca 2, a technique that improves the output of student LLMs an order of magnitude smaller than their teachers.

Key insight: Large language models can provide better output when they’re prompted to use a particular reasoning strategy such as think step by step, recall then generate, or explain then generate. Which strategy yields the best output depends on the task at hand. Moreover, given the same task, different models may perform better using different reasoning strategies. Consequently, in a teacher-student situation, the teacher and student models may need to use different strategies to achieve their highest performance on a given task. The student performs best when it mimics the reasoning and response the teacher produces while using the student’s best-performing strategy rather than the teacher’s own.

How it works: The teacher, GPT-4, helped generate a fine-tuning dataset to improve the output of the student, Llama 2 (13 billion parameters). Both models had been pretrained. The authors created the fine-tuning dataset and fine-tuned Llama 2 as follows:

  • The authors assembled an initial dataset that included examples (prompts and responses) of roughly 1,500 tasks. They drew from datasets including FLAN (which includes text classification, math questions, logic questions, and multiple choice questions), math problems from 10 datasets not in FLAN, few-shot prompts in the Orca dataset, and summarizations generated using GPT-4.
  • The authors fed each prompt to Llama 2 using each of several reasoning strategies including direct answer, think step by step, explain then answer, and more. (The authors don’t specify all the strategies they used.) They measured Llama 2’s performance on each task under each reasoning strategy.
  • For each task, they prompted GPT-4 with all examples of that task, specifying the reasoning strategy that had enabled Llama 2 to achieve its highest performance on that task. In this way, GPT-4 augmented the dataset to include, for each prompt, both the response and the reasoning it used to arrive at it. 
  • They fine-tuned Llama 2 to produce, given only the prompt (with no reasoning strategy specified), the detailed reasoning and response generated by GPT-4.
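
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ code: the strategy instructions, the scoring rule (the response ends with the reference answer), and the toy stand-ins for Llama 2 and GPT-4 are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical strategy instructions; the article names only some strategies.
STRATEGIES: Dict[str, str] = {
    "direct_answer": "Answer directly.",
    "step_by_step": "Think step by step, then answer.",
    "explain_then_answer": "Explain your reasoning, then answer.",
}

Example = Tuple[str, str, str]     # (task name, prompt, reference answer)
Model = Callable[[str, str], str]  # (strategy instruction, prompt) -> response


def best_strategy_per_task(student: Model, data: List[Example]) -> Dict[str, str]:
    """Score the student on every task under every strategy; keep the best."""
    correct = defaultdict(lambda: defaultdict(int))
    for task, prompt, answer in data:
        for name, instruction in STRATEGIES.items():
            response = student(instruction, prompt).strip()
            # Assumed scoring rule: the response ends with the reference answer.
            correct[task][name] += int(response.endswith(answer))
    # Ties resolve to the strategy listed first in STRATEGIES.
    return {
        task: max(STRATEGIES, key=lambda name: scores[name])
        for task, scores in correct.items()
    }


def build_finetuning_set(
    teacher: Model, data: List[Example], best: Dict[str, str]
) -> List[Dict[str, str]]:
    """Have the teacher answer with the student's best strategy for each task."""
    records = []
    for task, prompt, _ in data:
        # The teacher is told which strategy to use...
        completion = teacher(STRATEGIES[best[task]], prompt)
        # ...but the training record omits the instruction, so the student
        # learns to choose the strategy from the prompt alone.
        records.append({"prompt": prompt, "completion": completion})
    return records


def toy_student(instruction: str, prompt: str) -> str:
    """Stand-in for Llama 2: right on math only when reasoning step by step."""
    if prompt.startswith("2+2"):
        return "2 plus 2 is 4. Answer: 4" if "step" in instruction else "Answer: 5"
    return "Answer: Paris" if "directly" in instruction else "Answer: Lyon"


def toy_teacher(instruction: str, prompt: str) -> str:
    """Stand-in for GPT-4: echoes the strategy it was asked to follow."""
    return f"[{instruction}] reasoning and answer for {prompt!r}"


data = [("math", "2+2=?", "4"), ("geography", "Capital of France?", "Paris")]
best = best_strategy_per_task(toy_student, data)
records = build_finetuning_set(toy_teacher, data, best)
```

In a real pipeline, `toy_student` and `toy_teacher` would be replaced by calls to the actual models, and `records` would feed a standard supervised fine-tuning job.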

Results: The authors compared their model to models of similar size including WizardLM-13B (also based on Llama 2) and larger models including GPT-3.5 Turbo (an order of magnitude larger) and GPT-4 (parameter count undisclosed). They measured the percentage of correct responses, averaged over six reasoning benchmarks such as AGIEval, which includes multiple-choice and fill-in-the-blank questions from the Scholastic Aptitude Test, American Mathematics Competitions, and other tests designed for humans. Their model exactly matched the correct answer 66.92 percent of the time, compared to 50.32 percent for WizardLM-13B. It performed nearly as well as the 10x larger GPT-3.5 Turbo (67.65 percent) but well short of GPT-4 (79.03 percent).
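
To make the metric and the gaps concrete, here is a small sketch. The exact-match function is an assumption about how such a score is typically computed (the whitespace and case normalization is ours, not the authors’), and the label "Orca 2 student" is simply a name for the fine-tuned model; the scores are the averages reported above.

```python
from typing import Dict, List


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions equal to their reference answer.

    Assumed normalization: surrounding whitespace and letter case are ignored.
    """
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Average percent correct over six reasoning benchmarks, as reported above.
reported: Dict[str, float] = {
    "Orca 2 student": 66.92,
    "WizardLM-13B": 50.32,
    "GPT-3.5 Turbo": 67.65,
    "GPT-4": 79.03,
}

# Gap in percentage points between each baseline and the Orca 2 student.
gaps = {
    name: round(score - reported["Orca 2 student"], 2)
    for name, score in reported.items()
    if name != "Orca 2 student"
}
```

Here `gaps` shows the student about 16.6 points ahead of WizardLM-13B, 0.73 points behind GPT-3.5 Turbo, and 12.11 points behind GPT-4.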

Why it matters: Learning how to reason is an important complement to learning facts and perspectives. A model that has been trained to reason using its most effective strategy generally will provide better output. Users don’t need to tell it which strategy to apply. They can simply enter a prompt, and the model will work out how to reason its way to a response.

We’re thinking: Perhaps a similar approach could be used to prompt a model to improve its own output. In effect, this would be similar to an agentic workflow designed to enable a model to produce its own training data, as recently described in The Batch.

