Code for replicating the paper [ProofNet: Autoformalizing and Formally Proving Undergraduate Mathematics](https://arxiv.org/abs/2302.12433).
This repo is intended for replicating experimental results and accepting PRs to the dataset. To use ProofNet for your own experiments, use the [Huggingface dataset](https://huggingface.co/datasets/hoskinson-center/proofnet).
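For instance, here is a minimal sketch of loading the dataset with the Hugging Face `datasets` library; the split and field names shown are assumptions and should be checked against the dataset card.

```python
# Minimal sketch: load ProofNet from the Hugging Face Hub.
# Split and field names are assumptions -- verify against the dataset card.
from datasets import load_dataset

proofnet = load_dataset("hoskinson-center/proofnet")

example = proofnet["test"][0]           # assumed split name
print(example["nl_statement"])          # assumed field: natural language statement
print(example["formal_statement"])      # assumed field: Lean 3 statement
```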
ProofNet is a benchmark for autoformalization and formal proving of undergraduate-level mathematics.
The ProofNet benchmark consists of 371 examples, each consisting of a formal theorem statement in Lean 3, a natural language theorem statement, and a natural language proof. The problems are primarily drawn from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology. We intend for ProofNet to be a challenging benchmark that will drive progress in autoformalization and automatic theorem proving.
---
## Axler LADR
[[Lean Proofs for Axler LADR Exercises]]
**ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics**

Zhangir Azerbayev (Yale University), Bartosz Piotrowski (University of Warsaw), Hailey Schoelkopf (EleutherAI, Yale University), Edward W. Ayers (Carnegie Mellon University), Dragomir Radev (Yale University), Jeremy Avigad (Carnegie Mellon University)

**Abstract** We introduce ProofNet, a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The ProofNet benchmark consists of 371 examples, each consisting of a formal theorem statement in Lean 3, a natural language theorem statement, and a natural language proof. The problems are primarily drawn from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology. We intend for ProofNet to be a challenging benchmark that will drive progress in autoformalization and automatic theorem proving. We report baseline results on statement autoformalization via in-context learning. Moreover, we demonstrate improvements over our baselines by applying prompt retrieval and distilled backtranslation.

### 1 Introduction

The creation of an automatic mathematician, that is, a system capable of autonomously posing conjectures and proving theorems, is a longstanding challenge in mathematics and artificial intelligence [Gelernter, 1959]. In recent years, neural generative language modeling has emerged as a promising approach to automating aspects of mathematics [Rabe and Szegedy, 2021].

One approach to applying language models to mathematics has been to treat mathematical reasoning in natural language as a sequence learning task [Welleck et al., 2021a, 2022, Lewkowycz et al., 2022]. A key advantage of mathematical reasoning in natural language is the abundance of natural language mathematics data on the internet [Lewkowycz et al., 2022].

An alternative approach is to use language models to guide formal proof-search in an interactive theorem prover (ITP) [Whalen, 2016, Yang and Deng, 2019, Wang and Deng, 2020, Polu et al., 2022, Jiang et al., 2022a, Lample et al., 2022, First et al., 2023]. A salient advantage of this method is that the ITP can automatically verify the correctness of the model's reasoning.
To remedy this gap, we propose ProofNet, a benchmark consisting of parallel natural language and formal mathematics that can be used to evaluate autoformalization and theorem proving. The ProofNet benchmark consists of 371 parallel formal theorem statements, natural language theorem statements, and natural language proofs sourced from the exercises of popular undergraduate-level pure mathematics textbooks. Formal statements are expressed in the Lean 3 theorem prover [de Moura et al., 2015], and depend on Lean's mathlib [mathlib Community, 2020].

Language-model-based theorem provers and autoformalization systems have typically been evaluated on benchmarks consisting of competition and olympiad-style problems [Zheng et al., 2022, Wu et al., 2022a]. While such problems require complex reasoning, their solutions only depend on a relatively small set of elementary facts about integers, real numbers, counting, and geometry. In contrast, modern research mathematics requires the mastery of a massive body of theory made up of thousands of definitions, lemmas, and theorems. The Lean 3 formalization of perfectoid spaces, an important definition in research-level arithmetic geometry, depends on over 3000 distinct theorems and definitions [Buzzard et al., 2020]. How to effectively reason over such a large repository of knowledge is an important unsolved problem in applying language models to mathematics [Irving et al., 2016, Wu et al., 2022b, Tworkowski et al., 2022].

ProofNet falls short of requiring mastery of all of modern mathematics, but poses the still ambitious goal of reasoning over the core of an undergraduate mathematics curriculum, including basic analysis, algebra, number theory, and topology. We hope that this benchmark will spur the development of language models that are able to reason effectively over large knowledge bases.

In order to obtain stronger baselines on ProofNet, we train and open-source the PROOFGPT language models at scales of 1.3 billion and 6.7 billion parameters. These models are trained on the proof-pile, an 8.3 billion token dataset of mathematical text. To our knowledge, these are the only open-source language models fine-tuned for general mathematics.

We establish baselines for ProofNet theorem autoformalization using in-context learning [Brown et al., 2020]. Moreover, we introduce two novel theorem autoformalization methods that outperform our few-shot baselines. Prompt retrieval uses nearest-neighbor search against an embedding database to create a prompt consisting of the mathlib declarations most relevant to a particular natural language theorem. Distilled backtranslation is a method inspired by work in unsupervised machine translation [Lample et al., 2017, Han et al., 2021a] that finetunes a language model for autoformalization at a large scale without the need for parallel data.
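To illustrate the prompt-retrieval idea, the following is a minimal sketch rather than the paper's implementation: a toy bag-of-words embedding stands in for whatever text-embedding model the real pipeline uses, and the `retrieve` and `build_prompt` helpers are invented for this example.

```python
# Sketch of prompt retrieval: select the mathlib declarations most similar to a
# natural-language theorem statement and prepend them to the prompt.
# A toy bag-of-words embedding stands in for a real text-embedding model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy stand-in for a text-embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def retrieve(nl_statement: str, declarations: list[str], k: int = 4) -> list[str]:
    """Return the k declarations nearest to the query statement."""
    query = embed(nl_statement)
    return sorted(declarations, key=lambda d: cosine(query, embed(d)), reverse=True)[:k]

def build_prompt(nl_statement: str, declarations: list[str]) -> str:
    context = "\n".join(retrieve(nl_statement, declarations))
    return f"{context}\n\nNatural language statement: {nl_statement}\nLean 3 statement:"
```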
Not all textbook exercises lend themselves naturally to formalization. In particular, we only consider for inclusion in ProofNet problems meeting the following criteria:

- **Self-containment.** Problems should only depend on the results commonly taught in an undergraduate curriculum. In particular, this rules out problems that are split into multiple sequentially dependent parts, or those using nonstandard notations.
- **Naturality of formalization.** Not all kinds of mathematical problems can be naturally formalized, such as word problems, and such problems are excluded. We do not include exercises that require computing an unknown quantity. We do not include problems that depend on parts of Lean's mathlib that are relatively less mature, such as Euclidean geometry or combinatorics.
- **Low risk of train-test overlap.** Because language models are often pre-trained on large corpora mined from the internet that include mathlib, we refrain from including statements that are in mathlib or are likely to be added to mathlib in the future. In practice, this means we avoid the abstract "theory-building" style of theorems that constitute mathlib, and instead choose problems that involve applying general results to specific cases. For more insight into the stylistic differences between mathlib and ProofNet problems, see Figure 1.

Beyond the above criteria, problems were selected for broad coverage of the undergraduate curriculum and to range in difficulty from straightforward applications of the definitions to those requiring tremendous creativity. Problem statements are transcribed into LaTeX and formalized by human annotators proficient in Lean. Natural language proofs are adapted from online solutions manuals, or in a few cases, written by the annotators.

**Supported Tasks** As ProofNet includes parallel natural language statements, natural language proofs, and formal statements, the dataset supports the evaluation of the following distinct tasks:

- **Formal theorem proving.** Given a formal statement of a theorem, produce a formal proof.
- **Informal theorem proving.** Given an informal statement, produce an informal proof. This facilitates direct comparison between formal and informal theorem proving approaches.
- **Autoformalization and informalization of statements.** Given an informal (formal) statement, produce a corresponding formal (informal) statement.
- **Autoformalization of proofs.** Given an informal theorem statement, its informal proof, and its formal statement, produce a formal proof.

### 3 The PROOFGPT models and the proof-pile dataset

In order to obtain stronger baselines on the ProofNet benchmark, we introduce the PROOFGPT language models and a text dataset named the proof-pile that these models are trained on. Many approaches to quantitative reasoning with language models depend on pre-training or fine-tuning a model on large corpora of mathematical text, which significantly boosts downstream performance [Hendrycks et al., 2021b, Polu and Sutskever, 2020, Lample et al., 2022, Lewkowycz et al., 2022]. Motivated by these results, we train and open-source the PROOFGPT models at sizes of 1.3 billion and 6.7 billion parameters. The PROOFGPT models are decoder-only causal language models initialized with Pythia weights [Biderman et al., 2023] and then fine-tuned on the proof-pile, a corpus of unstructured mathematical text gathered from internet sources whose composition is detailed in Table 1. The proof-pile contains roughly 8.3 billion GPT-NeoX [Andonian et al., 2021] tokens. Fine-tuning was performed using the GPT-NeoX library [Andonian et al., 2021]. For training hyperparameters, see Appendix A. In Table 2, we show that the PROOFGPT models outperform Pythia base models at standard mathematical reasoning tasks.

We regard the PROOFGPT model suite as inferior to the Minerva models [Lewkowycz et al., 2022] due to the fact that the PROOFGPT models are fine-tuned on an order of magnitude less mathematical text and span a smaller parameter range. However, we hope that the research community will benefit from PROOFGPT's open-source weights and dataset.
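Since the PROOFGPT models are ordinary decoder-only causal language models, they can presumably be loaded with the standard Hugging Face `transformers` API. The sketch below makes that concrete; the hub repository id is an assumption and should be checked against the released weights.

```python
# Sketch: load a PROOFGPT model with the standard transformers API.
# The hub id below is an assumption -- check the released weights for the real name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hoskinson-center/proofGPT-v0.1"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Natural language: Every subgroup of an abelian group is normal.\nLean 3 statement:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```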
In Table 3, we present our experimental results for autoformalization and informalization of ProofNet theorem statements. Although conceptually simple and easy to implement, our Code-davinci-002 in-context learning baseline achieves highly nontrivial performance, correctly formalizing 12.9% of theorems. The PROOFGPT models do not formalize any statements correctly, likely owing to their smaller parameter count. However, they demonstrate some signal on the typecheck rate and BLEU metrics. Note that even generating statements that typecheck in Lean 3's strict type system is a nontrivial feat.

Informalization accuracy is much higher than formalization accuracy for all models, supporting the intuitive claim that informalization is an easier task than formalization. This result also suggests that large pre-trained language models have a strong grasp of the semantics of formal mathematics, and primarily struggle with generating lexically correct and type-correct Lean code.

We further observe that among Code-davinci-002's generations that typecheck, roughly half are correct formalizations. This is consistent with our hypothesis that Code-davinci-002 has a strong grasp of the semantics of mathematics, since the model displays high accuracy conditional on having generated valid Lean.
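As a rough illustration of the BLEU metric reported here (the actual scoring code lives in the repository and may tokenize or score differently), corpus-level BLEU between generated and reference formal statements can be computed with sacrebleu; the example statements below are invented for the sketch.

```python
# Sketch: corpus BLEU between generated and reference Lean statements.
# sacrebleu is used for illustration; the repository's evaluation code may differ.
import sacrebleu

generated = [
    "theorem example_1 (a b : ℝ) (h : a < b) : ∃ c : ℝ, a < c ∧ c < b",
]
references = [
    "theorem example_1 (a b : ℝ) (h : a < b) : ∃ c, a < c ∧ c < b",
]

bleu = sacrebleu.corpus_bleu(generated, [references])
print(f"BLEU = {bleu.score:.1f}")
```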
---
**Case study 3.** The following is an incorrect formalization of exercise 6.16 from Sheldon Axler's *Linear Algebra* produced by Code-davinci-002. The retrieval-augmented Code-davinci-002 autoformalization, not shown here, is also incorrect.

NL: Suppose $U$ is a subspace of $V$. Prove that $U^{\perp} = \{0\}$ if and only if $U = V$.

Code-davinci-002 output:

```lean
theorem exercise_6_16 {K : Type*} [field K]
  {V : Type*} [add_comm_group V] [vector_space K V]
  {U : subspace V} :
  U.orthogonal = {0} ↔ U = ⊤
```

ProofNet label:

```lean
theorem exercise_6_16 {K : Type*} [is_R_or_C K]
  [inner_product_space K V] (U : submodule K V) :
  U.orthogonal = ⊥ ↔ U = ⊤
```

Here, Code-davinci-002 makes three straightforward lexical mistakes: in mathlib, vector spaces over a field K are denoted by `module K V` rather than `vector_space K V`, subspaces of a vector space by `submodule K V` rather than `subspace V`, and the trivial subspace is denoted by `⊥` rather than `{0}`. However, the model also makes a much more significant logical error. In order for the orthogonal complement $U^{\perp}$ of a subspace to make sense, the space must be endowed with an inner product space structure rather than merely a vector space structure, which is expressed in mathlib as `inner_product_space K V`. Furthermore, inner product spaces are only defined over the real and complex fields, so one must also declare `[is_R_or_C K]`. Reliably inferring these kinds of implicit hypotheses is a major challenge for autoformalization systems.
---
**Interactive Theorem Proving** Work in formal theorem proving and autoformalization depends on libraries of formalized mathematics. This work directly depends on Lean's mathlib, but indirectly benefits from lessons learned from other proof systems such as Coq [Bertot and Castéran, 2004], Mizar [Grabowski et al., 2010], and Isabelle [Nipkow et al., 2002].

**Unsupervised Machine Translation** Because the amount of parallel formal and natural language text is negligible, autoformalization faces many of the same challenges as unsupervised machine translation [Lample et al., 2017, Conneau et al., 2018, Lample et al., 2018, Han et al., 2021a, Garcia et al., 2023]. Our distilled backtranslation method is inspired by the distilled and iterated backtranslation algorithm of Han et al. [2021a]. However, the authors of this work regard backtranslation as a temporary workaround and foresee that in-context learning will be enough to elicit maximal performance from a sufficiently good language model, as is now the case for unsupervised translation [Garcia et al., 2023].
### A PROOFGPT training

Table 5 displays hyperparameters for PROOFGPT training on the proof-pile.

### B Problem Sources

The following is a complete list of sources ProofNet draws from:

- Analysis: Walter Rudin's *Principles of Mathematical Analysis* 3rd ed., Charles C. Pugh's *Real Mathematical Analysis* 1st ed., and Elias M. Stein and Rami Shakarchi's *Complex Analysis* 1st ed.
- Linear Algebra: Sheldon Axler's *Linear Algebra Done Right* 2nd ed.
- Abstract Algebra: David S. Dummit and Richard M. Foote's *Abstract Algebra* 3rd ed., I.N. Herstein's *Abstract Algebra* 3rd ed., and Michael Artin's *Algebra* 1st ed.
- Topology: James Munkres' *Topology* 2nd ed.
- Examinations: Putnam Competition.

### C Comparison to Existing Benchmarks

For a comparison of ProofNet to existing mathematical reasoning benchmarks, see Table 4.

### D Prompts

Prompts are viewable in the open-source repository (https://github.com/zhangir-azerbayev/ProofNet/tree/main/eval/prompts). The retrieval knowledge base and the code for generating it are also available in the repository (https://github.com/zhangir-azerbayev/ProofNet/blob/main/train_backtranslation/make_data/docgen_export_with_nl/docgen_export_with_nl.jsonl). We use a 12-shot prompt for Code-davinci-002 autoformalization and informalization, and a 6-shot prompt for PROOFGPT autoformalization and informalization. We give the PROOFGPT models fewer examples because of their shorter context (2048 tokens compared to 8192); we only use the last six examples of the prompt when prompting PROOFGPT. For retrieval-augmented models, we use a 3-shot prompt, where each example consists of 4 reference formal statements and one NL-formal pair.
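As an illustration only, here is a minimal sketch of assembling a retrieval-augmented few-shot prompt with the shape described above (3 examples, each pairing 4 reference formal statements with one NL-formal pair, followed by the query); the helper names and prompt wording are invented for this sketch and do not reproduce the actual prompts in the repository.

```python
# Sketch: assemble a retrieval-augmented few-shot autoformalization prompt.
# Wording and helper names are illustrative, not the repository's prompts.
from typing import Optional

def format_example(reference_statements: list[str], nl: str, formal: Optional[str]) -> str:
    refs = "\n".join(reference_statements)          # 4 retrieved mathlib statements
    block = f"Reference mathlib statements:\n{refs}\n\nNatural language: {nl}\nLean 3 statement:"
    return block + (f" {formal}" if formal is not None else "")

def build_retrieval_prompt(examples: list[dict], query_refs: list[str], query_nl: str) -> str:
    # Three worked examples followed by the query, which the model must complete.
    shots = [format_example(e["refs"], e["nl"], e["formal"]) for e in examples[:3]]
    query = format_example(query_refs, query_nl, None)
    return "\n\n".join(shots + [query])
```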