Schooling Language Models in Math

GOAT (Good at Arithmetic Tasks), a method to boost large language models' arithmetic abilities

Published

Mar 06, 2024

Large language models are not good at math. Researchers devised a way to make them better.

What's new: Tiedong Liu and Bryan Kian Hsiang Low at the National University of Singapore proposed a method to fine-tune large language models for arithmetic tasks.

Key insight: Large language models (LLMs) do fairly well at addition and subtraction as well as multiplication and division by single digits or by powers of 10. They’re less adept at the more challenging tasks of multiplying and dividing larger numbers. One way to perform these tasks well is to divide them into simpler subtasks. For example, a relatively easy way to multiply two large numbers like 123 and 321 is to:

Split one number into decimal places (123 becomes 100 + 20 + 3)
Multiply the other number by each of these (100 * 321 + 20 * 321 + 3 * 321)
Add the resulting products to arrive at the solution (32100 + 6420 + 963 = 39483)

A similar technique exists for division. Together, these approaches can enable LLMs to perform more complicated mathematical tasks.
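The multiplication steps above can be sketched in a few lines of Python. This is a minimal illustration of the place-value decomposition, not code from the paper; the function names `place_value_terms` and `multiply_by_decomposition` are hypothetical.

```python
def place_value_terms(n: int) -> list[int]:
    """Split a non-negative integer into its nonzero place-value terms.

    Example: 123 -> [100, 20, 3]
    """
    digits = str(n)
    terms = []
    for i, d in enumerate(digits):
        if d != "0":
            terms.append(int(d) * 10 ** (len(digits) - 1 - i))
    return terms


def multiply_by_decomposition(a: int, b: int) -> tuple[list[int], int]:
    """Multiply a * b by splitting a into place values and summing the partial products."""
    partial_products = [term * b for term in place_value_terms(a)]
    return partial_products, sum(partial_products)


products, total = multiply_by_decomposition(123, 321)
print(products)  # [32100, 6420, 963]
print(total)     # 39483
```

Each partial product is a multiplication by a single digit followed by a shift by a power of 10, which are the operations the article says LLMs already handle fairly well.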

How it works: The authors built GOAT (a model GOod at Arithmetic Tasks) by fine-tuning LLaMA on a synthetic dataset of 1 million examples of arithmetic operations on integers, with the harder operations broken into steps for easier calculation.

The prompts were simple instructions like “Calculate 397 x 4429” or “I would appreciate it if you could assist me in calculating 1463456 + 2107”.
The answers were either numbers (for the simpler operations) or chains of reasoning (for multiplications and divisions of larger numbers). For example, if the prompt was “Calculate 24x79”, the target was “24 * 79 = 24 * (70 + 9) = 24 * 70 + 24 * 9 = 1680 + 216 = 1896”.
To create these chains, the authors wrote a Python script. For multiplication, the script randomly generated two numbers, split one number into decimal places, multiplied the second number by each of those terms, then added the products. It followed a similar procedure for division.
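A script along these lines might look like the following sketch for multiplication. It is not the authors' script: the function names, the operand range, and the choice to sum all partial products in a single step are assumptions made for illustration. The output format follows the "24 * 79" example above.

```python
import random


def split_place_values(n: int) -> list[int]:
    """Return the nonzero place-value terms of n, e.g. 79 -> [70, 9]."""
    digits = str(n)
    return [int(d) * 10 ** (len(digits) - 1 - i) for i, d in enumerate(digits) if d != "0"]


def multiplication_target(a: int, b: int) -> str:
    """Build a target string for a * b, splitting b by place value."""
    terms = split_place_values(b)
    if len(terms) <= 1:
        # Single nonzero digit (or zero): treated as a simple case, answer directly.
        return f"{a} * {b} = {a * b}"
    products = [a * t for t in terms]
    steps = [
        f"{a} * {b}",
        f"{a} * ({' + '.join(map(str, terms))})",
        " + ".join(f"{a} * {t}" for t in terms),
        " + ".join(map(str, products)),
        # Assumption: all partial products are summed in one step.
        str(sum(products)),
    ]
    return " = ".join(steps)


def make_example(rng: random.Random, max_digits: int = 5) -> dict[str, str]:
    """Generate one hypothetical prompt/target pair with random operands."""
    a = rng.randint(2, 10 ** max_digits - 1)
    b = rng.randint(2, 10 ** max_digits - 1)
    return {"prompt": f"Calculate {a} x {b}", "target": multiplication_target(a, b)}


print(multiplication_target(24, 79))
# 24 * 79 = 24 * (70 + 9) = 24 * 70 + 24 * 9 = 1680 + 216 = 1896
```

Calling `make_example(random.Random(0))` repeatedly with a seeded generator would yield a reproducible stream of training pairs.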

Results: The authors compared GOAT and GPT-4 on BIG-bench, which contains arithmetic operations on integers up to five digits. GOAT performed either on par with or better than GPT-4 for all operations. Specifically, GPT-4 struggled to multiply and divide large numbers. Multiplying five-digit numbers, GPT-4 achieved 0 percent accuracy, while GOAT achieved 96.7 percent. Dividing five-digit numbers, GPT-4 achieved 53.4 percent, while GOAT achieved 96.5 percent. GOAT also performed better than other LLMs (BLOOM, GPT-NeoX, OPT, and Pythia) that had been fine-tuned in the same way. The authors attribute this to the fact that LLaMA generates a separate token for each digit (and does not learn tokens that represent multiple digits), while the other models learn tokens for multiple digits (for example, separate tokens for 748, 74, and 7).
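To see why tokenization could matter, consider the toy comparison below. It is only an illustration: the greedy longest-match rule and the three-entry vocabulary are assumptions chosen to mirror the 748 / 74 / 7 example, not the algorithm or vocabulary of any real tokenizer.

```python
def digit_tokens(number: str) -> list[str]:
    """One token per digit, the behavior the article attributes to LLaMA."""
    return list(number)


def greedy_tokens(number: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation over a toy vocabulary (illustrative only)."""
    tokens, i = [], 0
    while i < len(number):
        # Try the longest candidate first; a single digit is always accepted as a fallback.
        for j in range(len(number), i, -1):
            piece = number[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


toy_vocab = {"748", "74", "7"}  # hypothetical multi-digit tokens

print(digit_tokens("7481"))              # ['7', '4', '8', '1']
print(greedy_tokens("7481", toy_vocab))  # ['748', '1']
print(greedy_tokens("1748", toy_vocab))  # ['1', '748']
```

With one token per digit, every digit sits at a predictable position. With multi-digit tokens, the same digits can be grouped differently depending on their neighbors, so place value is not aligned with token boundaries.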

Why it matters: LLMs have latent mathematical knowledge that can be unlocked by thoughtful fine-tuning.

We’re thinking: Humans, too, aren’t great at multiplying or dividing numbers directly. But give us a pencil and paper so we can work things out step by step, and we’re much better.
