by jsendak | Apr 26, 2025 | Computer Science
arXiv:2504.16936v1 Announce Type: new
Abstract: Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
Expert Commentary: Evaluating the Audio-Visual Capabilities of Multi-Modal Large Language Models
In recent years, multi-modal large language models (MLLMs) have gained significant attention and achieved remarkable success in processing and understanding information from various modalities such as text, audio, and visual signals. However, despite their widespread use, there has been a lack of comprehensive evaluation measuring the audio-visual capabilities of these models across diverse scenarios.
This paper fills this knowledge gap by presenting a multifaceted evaluation of MLLMs’ audio-visual capabilities, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. These dimensions encompass different aspects that are crucial for assessing the overall performance and potential limitations of MLLMs in processing audio-visual data.
Effectiveness refers to how accurately MLLMs process and understand audio-visual information. The experiments in this study reveal that MLLMs demonstrate strong zero-shot and few-shot generalization: even with limited data or entirely new examples, they can still achieve impressive performance. This finding highlights the potential of MLLMs for tasks that require quick adaptation to new scenarios or concepts, making them highly flexible and versatile.
Efficiency is another important aspect evaluated in the study. Although MLLMs excel in effectiveness, their computational efficiency needs attention. Given their large size and complexity, MLLMs tend to be computationally intensive, which can pose challenges in real-time applications or systems with limited computational resources. Further research and optimization techniques are required to enhance their efficiency without sacrificing performance.
Generalizability is a critical factor in assessing the practical usability of MLLMs. The results indicate that MLLMs heavily rely on the vision modality, and their performance suffers when visual input is corrupted or missing. This limitation implies that MLLMs may not be suitable for tasks where visual information is unreliable or incomplete, such as in scenarios with noisy or degraded visual signals. Addressing this issue is crucial to improve the robustness and generalizability of MLLMs across diverse real-world situations.
Lastly, the study explores the robustness of MLLMs against adversarial attacks, which attempt to deceive or mislead a model by introducing subtly crafted perturbations to the input data. While MLLMs are not immune to these attacks, they exhibit greater robustness than traditional models. This finding suggests that MLLMs have some inherent defenses against adversarial inputs, which opens up possibilities for leveraging this robustness in security-sensitive settings.
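To make the attack setting concrete, here is a minimal sketch of the kind of gradient-based perturbation described above, in the style of FGSM and written against a generic PyTorch classifier. The model, loss function, and epsilon are placeholders for illustration; the paper's actual attack setup may differ.

```python
import torch

def fgsm_perturb(model, x, y, loss_fn, epsilon=0.03):
    """Craft an FGSM-style adversarial example: a small perturbation in the
    direction of the loss gradient, clipped to keep a valid input range."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)   # forward pass on the (copied) clean input
    loss.backward()                   # gradient of the loss w.r.t. the input
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()  # step in the gradient-sign direction
        x_adv = x_adv.clamp(0.0, 1.0)                # stay within the valid pixel range
    return x_adv.detach()
```

A robust model is one whose predictions change little between `x` and `fgsm_perturb(model, x, y, loss_fn)`.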
From a broader perspective, this research is highly relevant to the field of multimedia information systems, animations, artificial reality, augmented reality, and virtual realities. The evaluation of MLLMs’ audio-visual capabilities contributes to our understanding of how these models can be effectively utilized in multimedia processing, including tasks like video captioning, content understanding, and interactive virtual environments. The findings also shed light on the interdisciplinary nature of MLLMs, as they demonstrate the fusion and interplay of language understanding, computer vision, and audio processing.
In conclusion, this paper provides a comprehensive evaluation of the audio-visual capabilities of multi-modal large language models. The findings offer valuable insights into the strengths and limitations of these models, paving the way for future improvements and guiding further research towards enhancing the effectiveness, efficiency, generalizability, and robustness of MLLMs in processing and understanding multi-modal information.
Read the original article
by jsendak | Apr 26, 2025 | Computer Science
Abstract:
The application of Predictive Process Monitoring (PPM) techniques is becoming increasingly widespread due to their capacity to provide organizations with accurate predictions regarding the future behavior of business processes, thereby facilitating more informed decision-making. A plethora of solutions have been proposed in the literature employing these techniques, yet they differ from one another due to a number of factors.
In light of the growing recognition of the value of object-centric event logs, including in the context of PPM, this survey focuses on the differences among PPM techniques employed with different event logs, namely traditional event logs and object-centric event logs. By understanding these differences, organizations can gain better insights into which techniques are most suitable for their specific needs.
This survey also examines the classification of PPM methods based on the prediction task they address and the specific methodologies they employ. This categorization allows organizations to identify the most appropriate PPM techniques based on their desired prediction outcomes and the resources and expertise available to them.
Traditional event logs typically capture data about process instances and activities, while object-centric event logs go a step further by also including information about objects involved in the process. This additional information enables PPM techniques to provide more accurate predictions, as they can take into account not only the sequence of activities but also the characteristics and behavior of objects throughout the process.
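To make the distinction concrete, here is a minimal sketch of how the two kinds of records might look, loosely following the object-centric event log (OCEL) idea; the field names and values are illustrative, not a prescribed schema.

```python
# A traditional event log record: each event belongs to exactly one case (process instance).
traditional_event = {
    "case_id": "order-42",
    "activity": "Approve Purchase Order",
    "timestamp": "2025-04-26T10:15:00",
    "resource": "clerk-7",
}

# An object-centric event log record: one event may reference several objects of
# different types, each carrying its own attributes.
object_centric_event = {
    "event_id": "e-1001",
    "activity": "Approve Purchase Order",
    "timestamp": "2025-04-26T10:15:00",
    "objects": {
        "order":    [{"id": "order-42", "total": 1280.0}],
        "item":     [{"id": "item-7", "category": "laptop"},
                     {"id": "item-9", "category": "dock"}],
        "supplier": [{"id": "supplier-3", "region": "EU"}],
    },
}
```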
Various PPM techniques have been proposed and classified based on the prediction task they aim to achieve. These tasks include predicting the next event in a process, predicting the remaining time for a process instance to complete, and predicting the outcome or performance measures of a process instance.
The specific methodologies employed in PPM techniques include but are not limited to: sequence-based techniques that rely on patterns of past behavior, machine learning-based techniques that learn from historical data to make predictions, and rule-based techniques that use expert knowledge to define prediction rules.
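As a concrete illustration of the sequence-based family, the following sketch builds a simple frequency-based (first-order Markov) next-activity predictor from historical traces; it is illustrative only and not a specific technique from the survey.

```python
from collections import Counter, defaultdict

def train_next_activity_model(traces):
    """Count activity bigrams over historical traces (lists of activity names)."""
    transitions = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next_activity(transitions, prefix):
    """Predict the most frequent successor of the last observed activity."""
    if not prefix or prefix[-1] not in transitions:
        return None
    return transitions[prefix[-1]].most_common(1)[0][0]

# Toy usage: three historical order-handling traces.
traces = [
    ["create", "approve", "ship", "invoice"],
    ["create", "approve", "invoice", "ship"],
    ["create", "reject"],
]
model = train_next_activity_model(traces)
print(predict_next_activity(model, ["create", "approve"]))  # -> "ship" (ties broken by insertion order)
```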
In conclusion, this survey highlights the importance of considering the event logs used in PPM techniques and the specific prediction tasks and methodologies employed. By understanding these differences and selecting the most appropriate techniques, organizations can harness the power of PPM to make more informed and accurate predictions about their business processes, leading to improved decision-making and operational efficiency.
Read the original article
by jsendak | Apr 25, 2025 | Computer Science
arXiv:2504.16405v1 Announce Type: new
Abstract: The furnishing of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities.
Among these, understanding image-evoked emotions aims to enhance MLLMs’ empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking.
To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories.
Our core contributions include:
1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated.
2) We design four tasks to evaluate MLLMs’ ability to capture the evoked emotions by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model’s proficiency in performing joint and comparative analysis.
In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly-used MLLMs.
The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal.
Our EEmo-Bench paves the path for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.
Enhancing Multi-Modal Large Language Models (MLLMs) with Image-Evoked Emotions
This article introduces the concept of image-evoked emotions and its relevance in enhancing the empathy of multi-modal large language models (MLLMs). MLLMs have gained significant attention in various domains, including human-machine interaction and advertising recommendations. However, the evaluation of MLLMs’ understanding of image-evoked emotions is currently limited and lacks a systematic and comprehensive assessment.
The Importance of Emotion in MLLMs
Emotion plays a crucial role in human communication and understanding, and the ability to perceive and understand emotions is highly desirable in MLLMs. By incorporating image-evoked emotions into MLLMs, these models can better empathize with users and provide more tailored responses and recommendations.
The EEmo-Bench Benchmark
To address the limitations in evaluating MLLMs’ understanding of image-evoked emotions, the authors introduce EEmo-Bench, a novel benchmark specifically designed for this purpose. EEmo-Bench focuses on the analysis of the evoked emotions in images across diverse content categories.
The benchmark includes the following core contributions:
- Diversity of evoked emotions: To assess emotional attributes, the authors adopt an emotion ranking strategy and utilize the Valence-Arousal-Dominance (VAD) model. A dataset of 1,960 images is collected and manually annotated for emotional assessment (a minimal sketch of such an annotation appears after this list).
- Four evaluation tasks: Four tasks are designed to evaluate MLLMs’ ability to capture evoked emotions and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced for joint and comparative analysis.
- Thorough assessment of MLLMs: A comprehensive evaluation is conducted on 19 commonly-used MLLMs, with a collection of 6,773 question-answer pairs. The results highlight the performance of different models in various evaluation dimensions.
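As an illustration of what a VAD-based annotation and ranking comparison could look like, here is a minimal Python sketch; the record fields, the 1-9 VAD scale, and the pairwise agreement score are assumptions for illustration, not EEmo-Bench's actual schema or metric.

```python
from dataclasses import dataclass, field

@dataclass
class ImageEmotionAnnotation:
    """Illustrative record for one image: ranked evoked emotions plus
    Valence-Arousal-Dominance (VAD) attributes on an assumed 1-9 scale."""
    image_id: str
    ranked_emotions: list          # strongest evoked emotion first
    vad: dict = field(default_factory=dict)

def ranking_agreement(gold_ranking, predicted_ranking):
    """Fraction of emotion pairs whose relative order the model gets right
    (a simple pairwise agreement score, not the benchmark's official metric)."""
    pairs = [(a, b) for i, a in enumerate(gold_ranking) for b in gold_ranking[i + 1:]]
    pos = {e: i for i, e in enumerate(predicted_ranking)}
    correct = sum(1 for a, b in pairs if a in pos and b in pos and pos[a] < pos[b])
    return correct / len(pairs) if pairs else 0.0

ann = ImageEmotionAnnotation(
    image_id="img_0001",
    ranked_emotions=["awe", "contentment", "fear"],
    vad={"valence": 7.2, "arousal": 5.1, "dominance": 4.3},
)
print(ranking_agreement(ann.ranked_emotions, ["awe", "fear", "contentment"]))  # 0.666...
```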
Insights and Future Directions
The results of the EEmo-Bench benchmark reveal that while some proprietary and large-scale open-source MLLMs show promising overall performance, there are still areas in which these models’ analytical capabilities can be improved. This highlights the need for further research and innovation to enhance MLLMs’ comprehension and perception of image-evoked emotions.
The concepts discussed in this article are highly relevant to the wider field of multimedia information systems, as they bridge the gap between textual data and visual content analysis. Incorporating image-evoked emotions into MLLMs opens up new avenues for research in areas such as virtual reality, augmented reality, and artificial reality.
The multi-disciplinary nature of the concepts presented here underscores the importance of collaboration between researchers from fields such as computer vision, natural language processing, and psychology. By combining expertise from these diverse domains, we can develop more sophisticated MLLMs that truly understand and respond to the emotions evoked by visual stimuli.
In conclusion, the EEmo-Bench benchmark serves as a stepping stone for future research in enhancing the comprehension and perception capabilities of MLLMs in the context of image-evoked emotions. This research has significant implications for machine-centric emotion perception and understanding, with applications ranging from personalized user experiences to improved advertising recommendations.
Read the original article
by jsendak | Apr 25, 2025 | Computer Science
Expert Commentary:
The article highlights the challenges faced by small and medium-sized enterprises (SMEs) in the context of sustainability and compliance with global carbon regulations. SMEs often struggle to navigate the complex carbon trading process and face entry barriers into carbon markets.
The proposed solution, a blockchain-based decentralized carbon credit trading platform tailored specifically for SMEs in Taiwan, offers several advantages. By leveraging blockchain technology, the platform aims to reduce informational asymmetry and intermediary costs, two key challenges in carbon markets.
One interesting aspect of this proposal is the integration of Ethereum-based smart contracts. Smart contracts automate transactions, provide transparency, and reduce administrative burdens. This tackles the technical complexities and market risks associated with carbon trading, making it more accessible for SMEs.
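To illustrate why such automation removes intermediaries, here is a toy Python model of the ledger logic an Ethereum smart contract would encode (on-chain this would be written in Solidity); the account names, credit units, and rules are illustrative, not the paper's contract.

```python
class CarbonCreditLedger:
    """Toy in-memory model of the transfer logic a carbon-credit smart contract
    could automate: balances, atomic transfers, and an auditable event log."""

    def __init__(self):
        self.balances = {}   # account address -> credits (tonnes CO2e)
        self.events = []     # append-only log, analogous to emitted contract events

    def issue(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.events.append(("Issued", account, amount))

    def transfer(self, sender, recipient, amount):
        if self.balances.get(sender, 0) < amount:
            raise ValueError("insufficient credits")   # a contract would revert here
        self.balances[sender] -= amount
        self.balances[recipient] = self.balances.get(recipient, 0) + amount
        self.events.append(("Transferred", sender, recipient, amount))

# Toy usage: an SME buys 10 credits directly from another participant, no broker involved.
ledger = CarbonCreditLedger()
ledger.issue("0xSupplier", 100)
ledger.transfer("0xSupplier", "0xSME", 10)
print(ledger.balances, ledger.events[-1])
```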
To validate the effectiveness of the proposed system, a controlled experiment was conducted comparing it with a conventional centralized carbon trading platform. The statistical analysis confirmed that the blockchain-based platform reduced time and expenses relative to the centralized baseline while ensuring compliance with the Carbon Border Adjustment Mechanism (CBAM) and the Clean Competition Act (CCA).
The study also applied the Kano model to measure user satisfaction, identifying essential features and prioritizing future enhancements. This approach ensures that the platform meets the needs of SMEs and continues to evolve based on their requirements.
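For readers unfamiliar with the Kano method, the sketch below shows its standard evaluation logic: each respondent answers a "functional" question (feature present) and a "dysfunctional" question (feature absent), and the answer pair maps to a category such as Must-be, One-dimensional, or Attractive. The answer labels, example feature, and majority-vote aggregation are illustrative assumptions; the paper's questionnaire details are not reproduced here.

```python
from collections import Counter

def kano_category(functional, dysfunctional):
    """Standard Kano evaluation table mapping one respondent's answer pair
    (feature present vs. feature absent) to a category:
    A=Attractive, O=One-dimensional, M=Must-be, I=Indifferent, R=Reverse, Q=Questionable."""
    if (functional, dysfunctional) in {("like", "like"), ("dislike", "dislike")}:
        return "Q"                                            # contradictory answers
    if functional == "like":
        return "O" if dysfunctional == "dislike" else "A"
    if functional == "dislike" or dysfunctional == "like":
        return "R"
    if dysfunctional == "dislike":
        return "M"
    return "I"

def classify_feature(responses):
    """Aggregate per-respondent categories for one feature by majority vote."""
    counts = Counter(kano_category(f, d) for f, d in responses)
    return counts.most_common(1)[0][0], counts

# Toy usage: responses for a hypothetical "automated CBAM compliance report" feature.
responses = [("like", "dislike"), ("like", "neutral"),
             ("like", "dislike"), ("must-be", "dislike")]
print(classify_feature(responses))   # -> ('O', Counter({'O': 2, 'A': 1, 'M': 1}))
```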
Overall, this research contributes a comprehensive solution for SMEs seeking to achieve carbon neutrality. By harnessing blockchain technology, the platform addresses key barriers and empowers SMEs to participate in global carbon markets. It highlights the transformative potential of blockchain in creating a more sustainable and transparent future.
Read the original article
by jsendak | Apr 24, 2025 | Computer Science
arXiv:2504.15376v1 Announce Type: cross
Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like “follow” (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
CameraBench: A Step Towards Understanding Camera Motion in Videos
In the world of multimedia information systems, understanding camera motion in videos is a crucial task. It has applications in various domains such as animations, artificial reality, augmented reality, and virtual realities. To improve camera motion understanding, a team of researchers has introduced CameraBench, a large-scale dataset and benchmark.
CameraBench comprises approximately 3,000 diverse internet videos that have been annotated by experts using a rigorous multi-stage quality control process. This dataset presents a significant contribution to the field, as it provides a valuable resource for assessing and improving camera motion understanding algorithms.
One key aspect of CameraBench is the collaboration with cinematographers, which has led to the development of a taxonomy of camera motion primitives. This taxonomy helps classify different types of camera motions and their dependencies on scene content. For example, a camera motion like “follow” requires understanding of moving subjects in the scene.
To evaluate human annotation performance, a large-scale human study was conducted. The results showed that domain expertise and tutorial-based training significantly enhance accuracy. Novices may initially struggle with differentiating between camera motions like zoom-in (a change of intrinsics) and translating forward (a change of extrinsics). However, through training, they can learn to differentiate between these motions.
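The geometric distinction is easy to see with a pinhole camera model: a zoom changes the intrinsics (focal length) and scales all image points by the same factor about the principal point, while a forward translation changes the extrinsics and moves image points by a depth-dependent amount (parallax). The following numpy sketch, with made-up point coordinates, illustrates this; it is not code from CameraBench.

```python
import numpy as np

def project(points_cam, f, cx=0.0, cy=0.0):
    """Pinhole projection of 3-D points in the camera frame (x, y, z with z forward);
    f is the focal length in pixels, (cx, cy) the principal point."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

# Two points at different depths, expressed in the camera frame.
pts = np.array([[1.0, 0.5, 4.0],     # near point
                [1.0, 0.5, 20.0]])   # far point
base = project(pts, f=1000.0)

# Zoom-in: change intrinsics (larger focal length); every point is scaled
# by the same factor, so there is no parallax.
zoomed = project(pts, f=2000.0)

# Translate forward: change extrinsics (move the camera 2 units along z);
# nearer points move much more than far ones, producing parallax.
moved = pts.copy()
moved[:, 2] -= 2.0
translated = project(moved, f=1000.0)

print(zoomed / base)        # uniform factor of 2 for both points
print(translated / base)    # depth-dependent factors (2.0 vs. ~1.11)
```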
The researchers also evaluated Structure-from-Motion (SfM) models and Video-Language Models (VLMs) using CameraBench. They found that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle with geometric primitives that require precise estimation of trajectories. To address these limitations, a generative VLM was fine-tuned with CameraBench to achieve a hybrid model that combines the strengths of both approaches.
This hybrid model opens up a range of applications, including motion-augmented captioning, video question answering, and video-text retrieval. By better understanding camera motions in videos, these applications can be enhanced, providing more immersive experiences for users.
The taxonomy, benchmark, and tutorials provided with CameraBench are valuable resources for researchers and practitioners working towards the ultimate goal of understanding camera motions in any video. The multi-disciplinary nature of camera motion understanding makes it relevant to various fields, including multimedia information systems, animations, artificial reality, augmented reality, and virtual realities.
Read the original article
by jsendak | Apr 24, 2025 | Computer Science
Expert Commentary: Improving Code Editing with EditLord
In software development, code editing is a foundational task that plays a crucial role in ensuring the effectiveness and functionality of the software. The article introduces EditLord, a code editing framework that aims to enhance the performance, robustness, and generalization of code editing procedures.
A key insight presented in EditLord is the use of a language model (LM) as an inductive learner to extract code editing rules from training code pairs. This approach allows for the formulation of concise meta-rule sets that can be utilized for various code editing tasks.
One notable advantage of explicitly defining the code transformation steps is that it addresses the limitations of existing approaches that treat code editing as an implicit end-to-end task. By breaking down the editing process into discrete and explicit steps, EditLord overcomes the challenges related to suboptimal performance and lack of robustness and generalization.
The use of LMs in EditLord offers several benefits. Firstly, it enables training samples to be augmented with the rule sets specific to each sample, which can strengthen fine-tuning or support prompting-based and iterative code editing. Secondly, by leveraging LMs in this way, EditLord achieves improved editing performance and robustness compared to existing state-of-the-art methods.
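As a rough sketch of this rule-extraction-then-rule-guided-editing idea, the snippet below asks an LM to induce editing rules from before/after code pairs and then applies them when prompting for an edit. The `call_llm` helper is a hypothetical stand-in for whatever chat-completion client is available, and the prompts and rule format are illustrative rather than EditLord's actual ones.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping whatever chat-completion API is available."""
    raise NotImplementedError("plug in an actual LLM client here")

def extract_edit_rules(code_pairs):
    """Ask the LM to induce concise, reusable editing rules from (before, after) pairs."""
    examples = "\n\n".join(
        f"# Before:\n{before}\n# After:\n{after}" for before, after in code_pairs
    )
    prompt = (
        "Study the following code edits and state, as a numbered list, the general "
        "transformation rules they follow (e.g. 'validate inputs before use'):\n\n"
        f"{examples}"
    )
    return call_llm(prompt)

def edit_with_rules(code, rules, instruction):
    """Prompt the LM to edit code while explicitly applying the induced rules."""
    prompt = (
        f"Apply the following editing rules where relevant:\n{rules}\n\n"
        f"Task: {instruction}\n\nCode to edit:\n{code}\n\n"
        "Return only the edited code."
    )
    return call_llm(prompt)
```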
Furthermore, EditLord demonstrates its effectiveness across critical software engineering and security applications, different LMs, and editing modes. The framework achieves an average improvement of 22.7% in editing performance and 58.1% in robustness, and it ensures a 20.2% higher level of functional correctness, which is crucial for developing reliable and secure software.
The advancements brought by EditLord have significant implications for the field of code editing and software development as a whole. By explicitly defining code transformation steps and utilizing LM models, developers can benefit from enhanced performance, robustness, generalization, and functional correctness. This can lead to more efficient and reliable software development processes, ultimately resulting in higher-quality software products.
Future Outlook
Looking ahead, the concepts and techniques introduced by EditLord open doors for further research and development in code editing. One possible direction is the exploration of different types of language models and their impact on code editing performance. Additionally, investigating the integration of other machine learning techniques and algorithms with EditLord could yield even more significant improvements.
Moreover, the application of EditLord to specific domains, such as machine learning or cybersecurity, may uncover domain-specific code editing rules and optimizations. This domain-specific approach could further enhance the performance and accuracy of code editing in specialized software development areas.
Overall, EditLord presents a promising framework for code editing, offering a more explicit and robust approach to code transformation. Its adoption has the potential to revolutionize the software development process, leading to higher efficiency, reliability, and security in software creation.
Read the original article