arXiv:2405.19444v1 Announce Type: new
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in mathematical problem solving, particularly in single-turn question-answering formats. However, real-world scenarios often involve mathematical question answering that requires multiturn or interactive information exchanges, and the performance of LLMs on these tasks is still underexplored. This paper introduces MathChat, a comprehensive benchmark specifically designed to evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are structured to assess the models’ abilities in multiturn interactions and open-ended generation. We evaluate various state-of-the-art (SOTA) LLMs on the MathChat benchmark and observe that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding. To address these limitations of existing LLMs on multiturn and open-ended tasks, we develop MathChatsync, a synthetic dialogue-based math dataset for LLM finetuning, focused on improving models’ interaction and instruction-following capabilities in conversations. Experimental results emphasize the need for training LLMs with diverse, conversational instruction-tuning datasets like MathChatsync. We believe this work outlines a promising direction for improving the multiturn mathematical reasoning abilities of LLMs, thus pushing forward the development of LLMs that are more adept at interactive mathematical problem solving and real-world applications.
Improving the Multiturn Mathematical Reasoning Abilities of Large Language Models
Large language models (LLMs) have made significant advances in mathematical problem solving, particularly in single-turn question-answering formats. However, real-world scenarios often involve math problems that require multiple turns of interaction and open-ended generation, and the performance of LLMs on these tasks remains underexplored. This paper introduces MathChat, a comprehensive benchmark specifically designed to evaluate LLMs across a broader spectrum of mathematical tasks.
MathChat assesses the abilities of LLMs in multiturn interactions and open-ended generation. The benchmark consists of structured tasks that simulate real-world conversations involving mathematical problem solving. Evaluating various state-of-the-art LLMs on MathChat, the researchers found that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding.
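To make the multiturn setup concrete, the sketch below is a minimal illustration, not the authors' evaluation harness; the function names and record format are assumptions. It replays a dialogue turn by turn, conditioning each question on the full preceding conversation and scoring each answer.

```python
# Illustrative multiturn evaluation loop (hypothetical harness, not MathChat's own code).
from typing import Callable, Dict, List

def evaluate_dialogue(
    model: Callable[[List[Dict[str, str]]], str],  # maps chat history -> assistant reply
    turns: List[Dict[str, str]],                   # [{"question": ..., "answer": ...}, ...]
) -> float:
    """Fraction of turns answered correctly, with each question conditioned
    on the full preceding conversation rather than asked in isolation."""
    history: List[Dict[str, str]] = []
    correct = 0
    for turn in turns:
        history.append({"role": "user", "content": turn["question"]})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        # Naive containment check; a real harness would normalize and extract answers.
        if turn["answer"] in reply:
            correct += 1
    return correct / len(turns) if turns else 0.0
```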
To address the limitations of existing LLMs on multiturn and open-ended tasks, the researchers developed MathChatsync, a synthetic dialogue-based math dataset for finetuning LLMs that focuses on improving models’ interaction and instruction-following capabilities in conversations.
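For illustration, a synthetic dialogue record might look like the following; the field names and the example problem are hypothetical, not taken from the MathChatsync release.

```python
# Hypothetical record format for one synthetic multiturn math dialogue.
example = {
    "id": "synthetic-0001",
    "messages": [
        {"role": "user", "content": "A train travels 120 km in 2 hours. What is its average speed?"},
        {"role": "assistant", "content": "Average speed = distance / time = 120 km / 2 h = 60 km/h."},
        {"role": "user", "content": "At that speed, how far would it travel in 3.5 hours?"},
        {"role": "assistant", "content": "Distance = speed * time = 60 km/h * 3.5 h = 210 km."},
    ],
}
```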
The experimental results highlight the importance of training LLMs with diverse, conversational instruction-tuning datasets like MathChatsync. This suggests that LLMs need exposure to a wide range of mathematical problem-solving scenarios that involve sustained reasoning and dialogue understanding. By incorporating such datasets into training, LLMs can better adapt to interactive mathematical problem solving and real-world applications.
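One common way to use such conversational data for supervised finetuning is to expand each dialogue into per-turn training pairs, supervising every assistant turn on the full preceding context. The sketch below is a minimal illustration under that assumption; it is not the paper's training pipeline, and the prompt formatting is arbitrary.

```python
# Minimal sketch: expand a multiturn dialogue into (prompt, target) pairs for SFT.
from typing import Dict, List, Tuple

def to_sft_pairs(record: Dict) -> List[Tuple[str, str]]:
    """One (prompt, target) pair per assistant turn, conditioned on all prior turns."""
    pairs: List[Tuple[str, str]] = []
    context = ""
    for msg in record["messages"]:
        if msg["role"] == "user":
            context += f"User: {msg['content']}\n"
        else:
            # Supervise on the assistant turn, then append it to the running context.
            pairs.append((context + "Assistant:", " " + msg["content"]))
            context += f"Assistant: {msg['content']}\n"
    return pairs
```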
This work highlights the multi-disciplinary nature of the concepts involved. It brings together elements from natural language processing, mathematical problem solving, and dialogue understanding. By combining these domains, the researchers aim to enhance the performance of LLMs in mathematical reasoning across interactive scenarios.
Future Directions
As LLMs continue to evolve, further research in this area could explore the development of more sophisticated benchmarks and datasets that capture the complexity of real-world mathematical problem-solving scenarios. Additionally, investigating techniques to improve sustained reasoning and dialogue understanding in LLMs could result in significant advancements in their multiturn mathematical reasoning abilities.
Moreover, investigations into incorporating external knowledge sources into LLMs could enable them to leverage a wider range of information during mathematical problem solving. This integration of external knowledge could enhance their reasoning abilities and enable them to tackle more complex tasks.
In summary, the MathChat benchmark and the MathChatsync dataset serve as stepping stones toward improving the multiturn mathematical reasoning abilities of LLMs. By addressing the limitations of existing models and incorporating diverse training data, researchers are paving the way for more capable and interactive LLMs in mathematical problem solving.