Programming is hard. Programming competitions are harder. Yet transformers proved themselves up to the task.
What’s new: Yujia Li, David Choi, Junyoung Chung, and a team at DeepMind built AlphaCode, a system that beat roughly half of competitors in coding contests where many examples of program inputs and outputs were available.
Key insight: Previous work showed that transformers can generate code, though their output doesn’t always solve the task at hand. But transformers can generate millions of possible solutions to the same problem instantly, and the solutions can be filtered by checking their performance automatically. Those that remain should solve the problem.
How it works: The authors trained a transformer to generate programs based on problems from a dataset they built containing 13,000 challenges mainly from Codeforces, a platform that hosts coding contests. Each problem included hundreds of solution programs (incorrect as well as correct) along with roughly 100 examples of test cases (expected inputs and outputs) mostly created by the authors.
- The authors pretrained a transformer on 86 million programs in 12 programming languages. Given the first part of a program, the transformer learned to generate the next part.
- They fine-tuned the model to generate each program in their challenge dataset based on the difficulty, problem description, programming language, suggested techniques that might solve the problem, and whether the solution was correct. They used the GOLD loss function, which gave the model more encouragement to be confident in its predictions when it had some confidence, and less encouragement to be confident when it had little confidence. In this way, the model increased its chance of generating, over many tries, at least one correct program.
- They fine-tuned a second transformer to generate test-case inputs given a problem description.
- At inference, they randomly sampled a difficulty and suggested techniques, and they told the first transformer to generate a correct solution. They repeated this 1 million times and filtered out programs that failed to solve all test cases. This left thousands of programs.
- To filter the programs further, they used the second transformer to generate 50 test-case inputs and ran the remaining programs on those 50 inputs. Then they clustered programs that produced the same outputs and randomly picked one from each of the 10 largest clusters. This procedure yielded 10 diverse programs to be entered into a contest.
Results: The authors used AlphaCode in 10 simulated Codeforces competitions, allowing it two hours to generate solutions for each. Ranking its performance among 5,000 Codeforces competitors, it averaged in the 54th percentile (lower is better). It correctly solved 34.2 percent of problems in the validation set.
Why it matters: AlphaCode generated 1 million possible solutions and culled the bad ones to solve problems it had never seen before and beat a substantial portion of competitive human programmers. It goes to show that there are still benefits to be gained from scaling up.
We’re thinking: AlphaCode is an impressive demonstration of high-throughput code generation and testing. That said, considering its performance on the validation set, there’s still a distance to go.