arXiv:2408.03350v1
Abstract: We introduce miniCTX, which tests a model’s ability to prove formal mathematical theorems that depend on new definitions, lemmas, or other contextual information that was not observed during training. miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem’s repository, which contains context that is helpful or needed for the proof. As a baseline for miniCTX, we introduce file-tuning, a simple recipe that trains a model to generate a proof step conditioned on the preceding file contents. File-tuning substantially outperforms the traditional neural theorem proving approach that fine-tunes on states alone. Additionally, our file-tuned model improves performance on the standard miniF2F benchmark, achieving a pass rate of 33.61%, which is a new state-of-the-art for 1.3B parameter models. Alongside miniCTX, we offer ntp-toolkit for automatically extracting and annotating theorem proving data, making it easy to add new projects into miniCTX to ensure that contexts are not seen during training. miniCTX offers a challenging and realistic perspective on evaluating neural theorem provers.
Analyzing miniCTX: Evaluating Neural Theorem Provers
miniCTX presents a new approach to evaluating neural theorem provers. Unlike earlier benchmarks, which largely test the ability to prove theorems whose supporting material may already have been seen during training, miniCTX evaluates proving theorems that depend on new definitions, lemmas, or other contextual information absent from the training data.
A significant aspect of miniCTX is its grounding in real projects. Because theorems are sourced from real Lean repositories and textbooks, each theorem is associated with a context that can span tens of thousands of tokens. Neural theorem provers must therefore not only command the underlying mathematics but also interpret and use the definitions and lemmas surrounding a theorem, integrating mathematical reasoning with context comprehension.
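To make this setup concrete, the sketch below illustrates the kind of entry such a benchmark pairs with each theorem: the statement to prove together with the file contents that precede it, which may introduce definitions the model has never seen. The field names and the embedded Lean snippet are illustrative only, not the actual miniCTX schema.

```python
from dataclasses import dataclass

@dataclass
class ContextualTheorem:
    """One benchmark entry: a theorem paired with the code that precedes it.

    Field names here are illustrative, not the actual miniCTX schema.
    """
    repo: str               # source repository the theorem was taken from
    file_path: str          # file containing the theorem
    preceding_context: str  # everything in the file before the theorem
                            # (imports, new definitions, helper lemmas)
    theorem_statement: str  # the statement the model must prove


# The theorem below depends on `isEven`, a definition introduced in the same
# file rather than in any standard library, so a prover must read the
# preceding context to make sense of the goal.
example = ContextualTheorem(
    repo="my-lean-project",  # hypothetical project name
    file_path="MyProject/Parity.lean",
    preceding_context=(
        "def isEven (n : Nat) : Prop := ∃ k, n = 2 * k\n"
        "\n"
        "theorem isEven_zero : isEven 0 := ⟨0, rfl⟩\n"
    ),
    theorem_statement="theorem isEven_two : isEven 2 := by",
)
```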
To establish a baseline for miniCTX, the authors introduce file-tuning, a simple recipe that trains a model to generate the next proof step conditioned on the preceding file contents. Notably, file-tuning substantially outperforms the traditional neural theorem proving approach of fine-tuning on proof states alone, suggesting that exposing the model to surrounding context significantly improves its ability to generate proofs.
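The contrast between the two training setups can be shown with a minimal prompt-construction sketch. The `[GOAL]`/`[PROOFSTEP]` markers and function names below are assumptions for illustration, not the exact template used by the authors.

```python
def state_only_prompt(proof_state: str) -> str:
    """Traditional setup: the model conditions only on the current proof state."""
    return f"[GOAL]\n{proof_state}\n[PROOFSTEP]\n"


def file_tuning_prompt(preceding_file: str, proof_state: str) -> str:
    """File-tuning setup: the model also conditions on the file contents
    preceding the theorem, so new definitions and lemmas from the project
    are available in-context."""
    return (
        f"{preceding_file}\n"
        f"[GOAL]\n{proof_state}\n[PROOFSTEP]\n"
    )


# In both cases the training target is the next tactic, e.g. "exact ⟨1, rfl⟩";
# only the conditioning information differs.
```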
In addition to its gains on miniCTX, the file-tuned model achieves a pass rate of 33.61% on the standard miniF2F benchmark (the fraction of benchmark theorems it proves), setting a new state of the art for 1.3B-parameter models. This demonstrates that file-tuning also strengthens general theorem proving, not only performance on context-dependent problems.
Furthermore, the authors introduce ntp-toolkit, a practical tool for automatically extracting and annotating theorem proving data. The toolkit simplifies adding new projects to miniCTX, ensuring that the contexts associated with evaluation theorems were not seen during training. This keeps miniCTX challenging and realistic: models are tested on their ability to handle unfamiliar contexts and apply mathematical reasoning across a wider range of scenarios.
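A rough sketch of what such an extraction-and-filtering pipeline might produce is shown below. The record fields, the cutoff date, and the output format are hypothetical; they only illustrate how date-based filtering can keep evaluation contexts out of the training data, and they do not reflect the actual ntp-toolkit output schema.

```python
import json
from datetime import date

# Illustrative extracted proof-step records; a sketch, not ntp-toolkit output.
extracted = [
    {
        "file": "MyProject/Parity.lean",
        "commit_date": "2024-06-01",
        "preceding_context": "def isEven (n : Nat) : Prop := ...",
        "proof_state": "⊢ isEven 2",
        "next_tactic": "exact ⟨1, rfl⟩",
    },
]

# Keep only theorems from commits after the model's training cutoff, so their
# contexts could not have been observed during training.
TRAINING_CUTOFF = date(2023, 12, 31)  # hypothetical cutoff date

fresh = [
    r for r in extracted
    if date.fromisoformat(r["commit_date"]) > TRAINING_CUTOFF
]

with open("minictx_candidates.jsonl", "w") as f:
    for record in fresh:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```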
Through miniCTX, the evaluation of neural theorem provers moves beyond theorems whose supporting material was observed during training, providing a more comprehensive assessment of models’ capabilities. The combination of miniCTX’s context-rich, project-grounded problems, the success of file-tuning, and the ntp-toolkit sets the stage for future advances in neural theorem proving.