arXiv:2504.07981v1 Announce Type: cross
Abstract: Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
Advancements in Multi-modal Large Language Models
Recent advancements in Multi-modal Large Language Models (MLLMs) have driven significant progress in GUI agents for general tasks such as web browsing and mobile phone use. Their application in professional domains, however, remains under-explored. This article introduces ScreenSpot-Pro, a benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings.
Challenges in Professional Domains
Professional workflows present unique challenges for GUI perception models: high-resolution displays, targets that are small relative to the full screen, and visually complex environments. To navigate and understand professional applications effectively, MLLMs need to be trained and evaluated on specialized datasets that reflect these complexities.
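To make the target-size challenge concrete, here is a small illustrative calculation (the icon and screen dimensions are assumptions, not figures from the paper): the same icon covers a much smaller fraction of a 4K professional display than of a typical 1080p screenshot, which is part of what makes grounding harder.

```python
# Illustrative only: assumed icon and screen sizes, not numbers from ScreenSpot-Pro.
def relative_area(target_w, target_h, screen_w, screen_h):
    """Fraction of the screen occupied by a target element."""
    return (target_w * target_h) / (screen_w * screen_h)

icon = (24, 24)  # a hypothetical 24x24 px toolbar icon
print(f"1080p: {relative_area(*icon, 1920, 1080):.5%}")  # ~0.028% of the screen
print(f"4K:    {relative_area(*icon, 3840, 2160):.5%}")  # ~0.007% of the screen
```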
ScreenSpot-Pro Benchmark
The ScreenSpot-Pro benchmark is designed to rigorously evaluate the performance of MLLMs in high-resolution professional settings. It comprises authentic high-resolution images from a variety of professional domains, with expert annotations. The benchmark covers 23 applications across five industries and three operating systems.
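For readers unfamiliar with GUI grounding benchmarks, the evaluation setup can be sketched roughly as follows. The field names and the point-in-box scoring rule below are assumptions based on how ScreenSpot-style benchmarks are commonly scored, not a verbatim description of the ScreenSpot-Pro release.

```python
from dataclasses import dataclass

@dataclass
class GroundingSample:
    image_path: str   # full-resolution screenshot of a professional application
    instruction: str  # e.g. "open the layer blending options"
    bbox: tuple       # ground-truth target box (x1, y1, x2, y2) in pixels

def is_correct(pred_xy, bbox):
    """A prediction counts as correct if the predicted click point
    falls inside the ground-truth bounding box (ScreenSpot-style scoring)."""
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def accuracy(preds, samples):
    """Fraction of predicted click points that land inside their target boxes."""
    hits = sum(is_correct(p, s.bbox) for p, s in zip(preds, samples))
    return hits / len(samples)
```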
Weak Performance of Existing GUI Grounding Models
Existing GUI grounding models perform poorly on the ScreenSpot-Pro dataset, with the best model achieving only 18.9% accuracy. This highlights the need for specialized approaches that can effectively handle the challenges present in professional domains.
ScreenSeekeR: A Novel Visual Search Method
Motivated by the observation that strategically reducing the search area improves accuracy, the authors propose ScreenSeekeR, a visual search method that uses the GUI knowledge of a strong planner to guide a cascaded search. Without any additional training, it achieves state-of-the-art performance of 48.1% on ScreenSpot-Pro.
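The article does not reproduce ScreenSeekeR's implementation, but the cascaded idea can be sketched as follows: a planner proposes a sub-region of the screen likely to contain the target, the screenshot is cropped to that region, and a grounding model then localizes the target inside the much smaller crop. The `planner` and `grounder` interfaces (and the PIL-style `crop` call) below are hypothetical placeholders, not the paper's actual API.

```python
def cascaded_ground(image, instruction, planner, grounder, max_steps=3):
    """Minimal sketch of a planner-guided cascaded search (assumed interfaces).

    planner(image, instruction)  -> (x1, y1, x2, y2) region likely to contain the target
    grounder(image, instruction) -> (x, y) click point in the given image's coordinates
    """
    offset_x, offset_y = 0, 0
    region = image
    for _ in range(max_steps):
        x1, y1, x2, y2 = planner(region, instruction)  # narrow the search area
        offset_x, offset_y = offset_x + x1, offset_y + y1
        region = region.crop((x1, y1, x2, y2))         # zoom into the proposed region
    x, y = grounder(region, instruction)               # ground inside the small crop
    return offset_x + x, offset_y + y                  # map back to full-image coordinates
```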
The Multi-disciplinary Nature of the Concepts
The concepts discussed in this paper draw upon multiple disciplines. First, there is the field of multimedia information systems, which focuses on the management and retrieval of multimedia data; the high-resolution screenshots in the benchmark demand efficient storage and retrieval techniques to be processed effectively.
Animations, artificial reality, augmented reality, and virtual reality are also relevant: these technologies shape the user interfaces of professional applications, and MLLMs must perceive and interact with such graphical elements effectively.
Advancing GUI Agents for Professional Applications
The benchmark and findings presented in this paper aim to advance the development of GUI agents for professional applications. By addressing the challenges specific to professional domains and proposing novel approaches like ScreenSeekeR, researchers can improve the accuracy and performance of MLLMs in these specialized workflows.
Overall, the paper underscores the value of drawing on these related disciplines, from multimedia information systems to animation, augmented reality, and virtual reality, when developing and evaluating GUI agents for professional applications.
Code, data, and a leaderboard for the ScreenSpot-Pro benchmark can be accessed at https://gui-agent.github.io/grounding-leaderboard.