Better Reasoning from ChatGPT Iterative bootstrapping, a new method to improve chain-of-thought prompting

Published

Oct 18, 2023

Reading time

3 min read

You can get a large language model to solve math problems more accurately if your prompts include a chain of thought: an example that solves a similar problem through a series of intermediate reasoning steps. A new approach to this sort of prompting improved ChatGPT’s accuracy on a variety of reasoning problems.

What's new: Jiashuo Sun and colleagues at Xiamen University, Microsoft, and IDEA Research, introduced iterative bootstrapping in chain-of-thought-prompting, a method that prompts a large language model to generate correct chains of thought for difficult problems, so it can use them as guides to solving other problems.

Key insight: Researchers have developed a few ways to prompt a large language model to apply a chain of thought (CoT). The typical method is for a human to write an example CoT for inclusion in a prompt. A faster way is to skip the hand-crafted example and simply instruct the model to “think step by step,” prompting it to generate not only a solution but its own CoT (this is called zero-shot CoT). To improve zero-shot CoT, other work both (i) asked a model to “think step by step” and (ii) provided generated CoTs (auto-CoT). The weakness of this approach is that the model can generate fallacious CoTs and rely on them when responding to the prompt at hand, which can lead to incorrect responses. To solve this problem, we can draw example prompts from a dataset that includes correct responses, and the model can check its responses against the dataset labels. If it’s wrong, it can try repeatedly until it answers correctly. In this way, it generates correct CoT examples to use in solving other problems.

How it works: To prompt ChatGPT to reason effectively, the authors built a database of example problems, chains of thought, and solutions. They drew problems from 11 datasets: six arithmetic reasoning datasets (such as grade-school math word problems), four common-sense reasoning datasets (for example, questions like “Did Aristotle use a laptop?”), and a symbolic reasoning dataset consisting of tasks that involved manipulating letters in words (for instance, “Take the last letters of the words in ‘Steve Sweeney’ and concatenate them”).

The authors prompted the model with a problem and instructed it to “think step by step” as it generated a solution, and they recorded the input and output.
When the model’s solution did not match the solution in the dataset, the authors instructed the model to try again using prompts such as, “The answer is not right, can you think more carefully and give me the final answer?” They repeated this step until the model delivered the correct solution.
Once the model had solved a problem correctly, they prompted it to present the answer again along with the steps that led to it. This output generally rendered the chain of thought more concisely than the model’s initial correct responses. They stored the problem, chain of thought, and solution in a database.
At inference, when prompting the model to solve a problem, the authors included in the prompt four to eight database entries selected at random.

Results: The authors evaluated their method versus hand-crafting and auto-CoT. Of the 11 datasets, their method achieved the best results on 8. For example, on grade-school math word problems, ChatGPT prompted using their method achieved 73.6 percent accuracy; using hand-crafted prompts, it achieved 69.3 percent accuracy, and using auto-CoT, it achieved 71.4 percent accuracy. Their method underperformed hand-crafted prompts on two common-sense reasoning datasets (76.8 percent versus 77.1 percent and 69.3 percent versus 71.1 percent). It underperformed auto-CoT on one arithmetic dataset (91.9 percent versus 92.5 percent.)

Why it matters: Large language models have powerful latent capabilities that can be activated by clever prompting. ChatGPT was able to solve the problems in the authors’ database, but only after multiple tries. Prompting it with examples of its own correct solutions to these problems apparently enabled it to solve other, similarly difficult problems without needing multiple tries.

We're thinking: It may be possible to modify this method to make human input unnecessary by asking the model to fix the problems in its previous generations or use external tools to validate its outputs.

Subscribe to The Batch