arXiv:2510.03965v1 Announce Type: new
Abstract: Predicting corporate earnings surprises is a profitable yet challenging task, as accurate forecasts can inform significant investment decisions. However, progress in this domain has been constrained by a reliance on expensive, proprietary, and text-only data, limiting the development of advanced models. To address this gap, we introduce FinCall-Surprise (Financial Conference Call for Earning Surprise Prediction), the first large-scale, open-source, and multi-modal dataset for earnings surprise prediction. Comprising 2,688 unique corporate conference calls from 2019 to 2021, our dataset features word-for-word conference call transcripts, full audio recordings, and corresponding presentation slides. We establish a comprehensive benchmark by evaluating 26 state-of-the-art unimodal and multi-modal LLMs. Our findings reveal that (1) while many models achieve high accuracy, this performance is often an illusion caused by significant class imbalance in the real-world data; (2) some specialized financial models demonstrate unexpected weaknesses in instruction-following and language generation; and (3) although incorporating audio and visual modalities provides some performance gains, current models still struggle to leverage these signals effectively. These results highlight critical limitations in the financial reasoning capabilities of existing LLMs and establish a challenging new baseline for future research.

Expert Commentary: Exploring the Multi-Disciplinary Nature of Financial Earnings Surprise Prediction

In the realm of corporate finance, predicting earnings surprises is a critical task that can have significant implications for investment decisions. The introduction of the FinCall-Surprise dataset represents a groundbreaking development in this field, as it combines text, audio, and visual data from corporate conference calls to create a multi-modal dataset for earnings surprise prediction.

This approach highlights the multi-disciplinary nature of the concepts involved in financial forecasting. By incorporating a variety of modalities, including textual transcripts, audio recordings, and presentation slides, researchers are able to capture a more comprehensive view of the data and potentially uncover hidden patterns and insights that may not be apparent from a single source of information. This multi-modal approach aligns with the broader field of multimedia information systems, which explores the integration of various types of media to enhance understanding and decision-making.
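One simple way to see how multiple modalities can be combined is late fusion: each modality produces its own per-class scores, and the scores are averaged into a final prediction. This is a minimal illustrative sketch, not the method used in the paper; the modality names, probabilities, and the `late_fusion` helper are all hypothetical.

```python
def late_fusion(scores_by_modality, weights=None):
    """Combine per-class probability scores from several modalities
    by (weighted) averaging -- a simple late-fusion baseline."""
    modalities = list(scores_by_modality)
    if weights is None:
        # Equal weight per modality unless told otherwise
        weights = {m: 1.0 / len(modalities) for m in modalities}
    classes = set().union(*(s.keys() for s in scores_by_modality.values()))
    fused = {
        c: sum(weights[m] * scores_by_modality[m].get(c, 0.0) for m in modalities)
        for c in classes
    }
    # Predicted label is the class with the highest fused score
    return max(fused, key=fused.get), fused

# Hypothetical per-modality scores for one conference call (illustrative values only)
scores = {
    "text":   {"beat": 0.55, "miss": 0.45},
    "audio":  {"beat": 0.40, "miss": 0.60},
    "slides": {"beat": 0.50, "miss": 0.50},
}
label, fused = late_fusion(scores)
print(label)  # "miss" -- the audio signal tips the averaged decision
```

Note how the audio modality overturns the text-only prediction here: capturing vocal cues that transcripts miss is precisely the motivation for a multi-modal dataset.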

Furthermore, the evaluation of 26 state-of-the-art unimodal and multi-modal large language models (LLMs) reveals interesting insights into the performance of these models in the context of financial earnings surprise prediction. The findings indicate that while many models achieve high accuracy, much of that accuracy is an artifact of class imbalance in real-world data rather than genuine predictive skill. Additionally, some specialized financial models exhibit unexpected weaknesses in instruction-following and language generation, underscoring the need for further refinement and improvement in this area.
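The accuracy illusion under class imbalance is easy to demonstrate: a model that always predicts the majority class scores high on accuracy while learning nothing. The sketch below uses a hypothetical 90/10 "beat"/"miss" split (illustrative, not the dataset's actual ratio) and contrasts accuracy with macro-F1, which averages per-class F1 and exposes the failure.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so minority classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical imbalanced labels: 90 "beat", 10 "miss"
y_true = ["beat"] * 90 + ["miss"] * 10
y_pred = ["beat"] * 100  # a degenerate model that always predicts the majority class

print(accuracy(y_true, y_pred))  # 0.9  -- looks strong
print(macro_f1(y_true, y_pred))  # ~0.47 -- reveals zero skill on the "miss" class
```

This is why balanced metrics, rather than raw accuracy, are the more honest yardstick for benchmarks built on real-world financial data.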

From a broader perspective, the results of this study also have implications for Augmented and Virtual Reality (AR/VR) applications. The incorporation of audio and visual modalities in financial forecasting represents a step towards creating more immersive and interactive experiences for analysts and investors. However, the difficulty current models have in leveraging these signals effectively highlights the complexities involved in bridging the gap between traditional financial analysis and emerging technologies.

Overall, the FinCall-Surprise dataset and the insights gained from evaluating various LLMs shed light on the critical limitations of existing models in the context of financial reasoning and set a challenging new baseline for future research in this field.

Read the original article