arXiv:2406.13264v1 Announce Type: new Abstract: Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task – full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today – simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g. recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more "human-centered" AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: https://github.com/HazyResearch/wonderbread
Exploring the Potential of Multimodal Foundation Models in Business Process Management
In the field of machine learning, benchmarks play a crucial role in evaluating the performance of models on specific tasks. However, when it comes to business process management (BPM), existing benchmarks often lack the depth and diversity of annotations necessary to assess models accurately. BPM involves documenting, measuring, improving, and automating enterprise workflows. Unfortunately, most research in this area has focused primarily on full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4.
While automation is undoubtedly a valuable goal, it overlooks how BPM tools are typically applied in practice. Simply documenting the relevant workflow accounts for roughly 60% of the time invested in a typical process optimization project, so this aspect of BPM deserves attention and evaluation in its own right. To bridge this gap, the authors present WONDERBREAD.
Introducing WONDERBREAD: Beyond Automation
WONDERBREAD is the first benchmark designed explicitly to evaluate multimodal FMs on BPM tasks that extend beyond automation. It makes three contributions:
- A comprehensive dataset consisting of 2928 documented workflow demonstrations
- Six innovative BPM tasks, sourced from real-world applications, which cover a wide range of workflow-related activities such as documentation, knowledge transfer, and process improvement
- An automated evaluation harness to facilitate the analysis of model performance
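To make the dataset contribution concrete, a documented workflow demonstration can be pictured as a recorded sequence of UI actions paired with a written Standard Operating Procedure (SOP). The schema below is purely illustrative — the field names and structure are assumptions for exposition, not the actual format used in the WONDERBREAD repository:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one workflow demonstration record.
# Field names are illustrative; consult the WONDERBREAD repo for the real format.

@dataclass
class Step:
    index: int            # position of the action within the demonstration
    action: str           # e.g. "click", "type", "scroll"
    target: str           # UI element the action was applied to
    screenshot_path: str  # frame captured around the action

@dataclass
class Demonstration:
    workflow_id: str              # identifier of the underlying workflow/task
    sop: List[str]                # written SOP steps describing the workflow
    steps: List[Step] = field(default_factory=list)  # recorded UI actions

demo = Demonstration(
    workflow_id="order-refund-001",
    sop=["Open the orders page", "Locate the order", "Issue a refund"],
)
demo.steps.append(Step(0, "click", "Orders tab", "frames/000.png"))
print(len(demo.sop))  # → 3
```

Pairing the low-level action trace with the human-written SOP is what enables tasks like documentation generation (trace → SOP) and completion validation (SOP + trace → pass/fail) to be scored automatically.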
Applying WONDERBREAD to state-of-the-art FMs yields a mixed picture. These models can automatically generate workflow documentation with impressive accuracy, recalling 88% of the steps taken in a video demonstration. However, they struggle to re-apply that knowledge to finer-grained validation of workflow completion, where their F1 score falls below 0.3.
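The two metrics above can be made concrete. Step recall is the fraction of ground-truth steps the model's generated documentation recovers; the validation task is scored with F1, which balances precision and recall. The sketch below uses exact string matching between steps purely for illustration (the benchmark's actual matching procedure may differ):

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over two collections of items.

    `predicted` could be the steps a model extracted from a video
    demonstration and `gold` the human-annotated steps (or, for the
    validation task, predicted vs. true completion labels).
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # items the model recovered correctly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold_steps = ["open orders page", "locate order", "issue refund", "confirm"]
model_steps = ["open orders page", "locate order", "issue refund"]
p, r, f1 = precision_recall_f1(model_steps, gold_steps)
print(round(r, 2))  # → 0.75  (3 of 4 gold steps recovered)
```

Note how a model can score well on recall (documentation) while still scoring poorly on F1 for validation: the latter penalizes both missed items and spurious ones, which matches the gap the benchmark reports.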
These results highlight the need for more “human-centered” AI tooling for enterprise applications. While automation is undoubtedly valuable, it is crucial to develop models that can adapt to and enhance the existing practices of BPM professionals. By achieving a deeper understanding of BPM tasks and actively involving humans in the workflow optimization process, AI-powered tools can yield even more significant benefits.
Encouraging Innovation and Exploration
With the introduction of WONDERBREAD, the hope is to encourage the development of AI tooling that is better aligned with the needs of enterprises. By shifting the focus beyond automation, this benchmark invites researchers and practitioners to explore the potential of multimodal FMs in a broader range of BPM tasks. By publishing the dataset and experiments associated with WONDERBREAD, the authors have provided the necessary resources for further exploration and innovation in this domain.
To access the WONDERBREAD dataset and experiments, visit the following GitHub repository: https://github.com/HazyResearch/wonderbread
“The only way to discover the limits of the possible is to go beyond them into the impossible.” – Arthur C. Clarke