OpenAI’s Generative Pre-Trained Transformer (GPT) architecture has created coherent essays, images, and code. Now it generates mathematical proofs as well.

**What’s new:** Stanislas Polu and Ilya Sutskever at OpenAI premiered GPT-f, a state-of-the-art transformer network (adapted from a language model) that synthesized proofs good enough to impress mathematicians.

**Key insight:** A proof is a lot like a board game. You start with the pieces on the board (assumptions) and make a sequence of moves (steps) to reach a conclusion (theorem). AlphaGo famously beat world champions of the strategy game Go by iteratively building a tree of possible sequences of moves to find a winner. Similarly, GPT-f builds a tree of possible steps to prove a theorem.

**How it works:** GPT-f is based on transformers similar to GPT-2 and GPT-3. It outputs pairs of statements (vertices) and steps (edges) in syntax readable by Metamath Proof Explorer, an automated proof verifier, and assembles them into a tree. The authors pretrained it on web data scraped by Common Crawl (the corpus of choice for GPT-3) as well as arXiv, GitHub, and Mathematics StackExchange to strengthen its sense of logic. They fine-tuned it on existing proofs verified by Metamath to give it familiarity with that system’s syntax.

- Given a set of assumptions and a statement to prove (called the goal in the figure above), GPT-f generates a candidate for the next statement (the next goal) and steps that prove it (the tactic). For instance, to prove the theorem (A ⇒ B), the model may first propose a tactic that proves (A ⇒ C) and then attempt the next goal (C ⇒ B). It produces up to 32 next-goal and tactic candidates.
- The system uses each next-goal candidate to build a tree of statements provable from the assumptions. If one of those statements is the original goal, then GPT-f has produced a proof.
- The authors used Metamath to label each step correct or incorrect. They fed Metamath-verified GPT-f proofs back into the fine-tuning dataset. As GPT-f generated its shortest proof of a given theorem, it learned to create even shorter ones.
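
The search described in the steps above can be sketched as a toy program. This is a minimal illustration under stated assumptions, not GPT-f's implementation: the `propose` function and its hard-coded `TOY_CANDIDATES` table are hypothetical stand-ins for the transformer, statements are plain strings rather than Metamath syntax, no verifier is called, and the search is simple depth-first backtracking, which may differ from the strategy the authors used.

```python
# Toy proof search: a goal is proved if it is an assumption, or if some
# candidate tactic reduces it to subgoals that can all be proved.

MAX_CANDIDATES = 32  # the article's cap on next-goal/tactic candidates

# Hypothetical stand-in for the model: goal -> ranked (tactic, subgoals).
# In a real system these candidates would be sampled from the network.
TOY_CANDIDATES = {
    "A=>B": [
        ("chain through D", ["A=>D", "D=>B"]),  # dead end in this toy setup
        ("chain through C", ["A=>C", "C=>B"]),
    ],
    "C=>B": [("chain through E", ["C=>E", "E=>B"])],
}

ASSUMPTIONS = {"A=>C", "C=>E", "E=>B"}


def propose(goal):
    """Return up to MAX_CANDIDATES (tactic, subgoals) pairs for a goal."""
    return TOY_CANDIDATES.get(goal, [])[:MAX_CANDIDATES]


def prove(goal, assumptions, depth=5):
    """Return a proof tree (goal, tactic, subproofs), or None if none is found."""
    if goal in assumptions:
        return (goal, "assumption", [])
    if depth == 0:
        return None  # give up on this branch
    for tactic, subgoals in propose(goal):
        subproofs = []
        for subgoal in subgoals:
            subproof = prove(subgoal, assumptions, depth - 1)
            if subproof is None:
                break  # this tactic fails; try the next candidate
            subproofs.append(subproof)
        else:
            # Every subgoal was proved, so the tactic closes the goal.
            return (goal, tactic, subproofs)
    return None


def count_tactic_steps(proof):
    """Count the non-assumption steps in a proof tree (a rough proof length)."""
    _, tactic, subproofs = proof
    own = 0 if tactic == "assumption" else 1
    return own + sum(count_tactic_steps(p) for p in subproofs)


proof = prove("A=>B", ASSUMPTIONS)
print(proof)
print("tactic steps:", count_tactic_steps(proof))
```

In this sketch the first candidate for `A=>B` fails because `A=>D` can be neither assumed nor reduced, so the search backtracks to the second candidate. A step count like `count_tactic_steps` is one simple way to compare proofs when preferring shorter ones.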

**Results:** The researchers compared GPT-f to MetaGen-IL, a recurrent neural network and the previous state-of-the-art theorem prover that uses Metamath syntax. Given a test set of theorems with Metamath-verified proofs, GPT-f generated valid proofs for 56.22 percent of them, MetaGen-IL 21.16 percent. Active members in the Metamath community were impressed by the economy of GPT-f’s proofs. The model shortened 23 previously verified proofs, which are now part of Metamath’s proof library.

**Why it matters:** Historically, AI has suffered from a gulf between deep learning and traditional symbolic approaches. This work shows that a sufficiently sophisticated neural network can manipulate symbols and logic as well.

**We’re thinking:** If this model were to find a solution to one of the Millennium Prize Problems, the authors could add $1 million to the training budget.