by jsendak | Apr 29, 2025 | Computer Science
arXiv:2504.17938v1 Announce Type: new
Abstract: The Quality of Experience (QoE) is the user's satisfaction while streaming a video session over an over-the-top (OTT) platform like YouTube. The QoE of YouTube reflects a smooth streaming session without any buffering or quality-shift events. One of the most important factors affecting the QoE of YouTube nowadays is frequent shifts from higher to lower resolutions and vice versa. These shifts ensure a smooth streaming session; however, they may result in a lower mean opinion score. For instance, dropping from 1080p to 480p during a video can preserve continuity but might reduce the viewer's enjoyment. Over time, OTT platforms are looking for alternative ways to boost user experience instead of relying on traditional Quality of Service (QoS) metrics such as bandwidth, latency, and throughput. As a result, we look into the relationship between quality shifting in YouTube streaming sessions and the channel metrics RSRP, RSRQ, and SNR. Our findings show that these channel metrics positively correlate with shifts. Thus, in real time, OTT platforms can rely on them alone to classify video streaming sessions into lower- and higher-resolution categories and provide more resources to improve user experience. Using traditional Machine Learning (ML) classifiers, we achieved an accuracy of 77 percent while using only RSRP, RSRQ, and SNR. In the era of 5G and beyond, where ultra-reliable, low-latency networks promise enhanced streaming capabilities, the proposed methodology can be used to improve OTT services.
The Impact of Quality Shifting on YouTube Streaming Sessions
In the increasingly digital world we live in, the demand for high-quality streaming services has skyrocketed. As users turn to platforms like YouTube to consume video content, their satisfaction, known as Quality of Experience (QoE), becomes a key factor in their overall viewing experience. In this context, it is essential to understand how the quality shifting phenomenon affects QoE, and how it can be optimized to enhance user satisfaction.
Traditionally, QoS metrics such as bandwidth, latency, and throughput have been used to assess streaming performance. However, as the article points out, these metrics alone are no longer sufficient to measure QoE accurately. This is where the concept of quality shifting comes into play. By dynamically adjusting video quality during a streaming session, platforms like YouTube can ensure a smooth viewing experience without buffering interruptions. However, this practice can also impact viewer enjoyment. For example, sudden shifts from higher to lower resolutions can lead to a decrease in satisfaction.
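To make the mechanism concrete, here is a minimal, purely illustrative sketch of the kind of rate-based rule an adaptive streaming client might apply when deciding whether to shift resolution; the thresholds and resolution ladder are hypothetical and are not YouTube's actual adaptation logic.

```python
# Illustrative sketch of a rate-based adaptive-quality rule (not YouTube's actual
# ABR logic). Thresholds and the resolution ladder below are hypothetical.

RESOLUTION_LADDER = [
    # (minimum sustainable throughput in Mbps, resolution label)
    (0.7, "360p"),
    (1.5, "480p"),
    (3.0, "720p"),
    (5.0, "1080p"),
]

def pick_resolution(estimated_throughput_mbps: float) -> str:
    """Return the highest resolution the estimated throughput can sustain."""
    chosen = "240p"  # fallback when throughput is below every rung
    for min_mbps, label in RESOLUTION_LADDER:
        if estimated_throughput_mbps >= min_mbps:
            chosen = label
    return chosen

# Example: a drop in measured throughput triggers a downward quality shift.
print(pick_resolution(6.2))  # -> "1080p"
print(pick_resolution(1.1))  # -> "360p"
```

A real client also weighs buffer occupancy and switching penalties, which is precisely why frequent shifts can hurt the mean opinion score even when rebuffering is avoided.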
The study discussed in the article delves into the relationship between quality shifting in YouTube streaming sessions and specific channel metrics: RSRP, RSRQ, and SNR. These metrics, which are related to signal strength and quality, were found to positively correlate with shifts. In other words, they can serve as indicators to predict when a video streaming session might transition between lower and higher resolutions. By leveraging this information in real-time, over-the-top (OTT) platforms can allocate appropriate resources to improve user experience.
The researchers' use of traditional Machine Learning (ML) classifiers to reach 77% accuracy using only RSRP, RSRQ, and SNR is a significant finding. It demonstrates the potential of predictive algorithms to enhance QoE by proactively managing quality shifts in streaming sessions.
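As a rough illustration of this kind of pipeline (not the authors' exact setup), the sketch below trains a standard classifier on synthetic RSRP/RSRQ/SNR measurements to separate sessions into lower- and higher-resolution categories; the feature ranges, labels, and model choice are assumptions made only for the example.

```python
# Minimal sketch of classifying streaming sessions into lower- vs higher-resolution
# categories from RSRP, RSRQ, and SNR alone. The data is synthetic and the model
# choice is illustrative, not the paper's exact configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
# Hypothetical channel measurements: RSRP (dBm), RSRQ (dB), SNR (dB).
X = np.column_stack([
    rng.uniform(-120, -70, n),   # RSRP
    rng.uniform(-20, -3, n),     # RSRQ
    rng.uniform(0, 30, n),       # SNR
])
# Hypothetical label: 1 = session stays mostly in higher resolutions.
y = ((X[:, 0] > -95) & (X[:, 2] > 12)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```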
In the wider field of multimedia information systems, this research has important implications. As the demand for high-quality video content continues to rise and technologies such as 5G promise enhanced streaming capabilities, finding innovative ways to optimize QoE becomes imperative. By combining insights from multiple disciplines, including computer science, telecommunications, and human-computer interaction, this study contributes to improving the overall streaming experience for users.
Beyond YouTube, the concepts discussed in this article also have implications for other forms of multimedia, such as animations, artificial reality, augmented reality, and virtual realities. These immersive multimedia experiences heavily rely on streaming technologies, and ensuring a smooth and uninterrupted experience is crucial for user engagement. By further exploring the relationship between quality shifting and user satisfaction, researchers can develop innovative solutions to enrich multimedia experiences across various platforms and applications.
Conclusion
The study presented in this article highlights the impact of quality shifting on YouTube streaming sessions and its relationship with channel metrics such as RSRP, RSRQ, and SNR. By leveraging these metrics and utilizing machine learning techniques, OTT platforms can predict quality shifts in real-time and allocate appropriate resources to enhance user experience. The multi-disciplinary nature of this research, spanning areas like multimedia information systems, animations, artificial reality, augmented reality, and virtual realities, makes it a valuable contribution to the field. As technologies evolve and demand for high-quality streaming services grows, innovative approaches like those presented in this study will play a crucial role in delivering an optimal multimedia experience.
Read the original article
by jsendak | Apr 29, 2025 | Computer Science
Safety-Critical Data and Autonomous Vehicles: Barriers to Sharing
Autonomous vehicles (AVs) have the potential to transform transportation by greatly improving road safety. However, to ensure their safety and efficacy, it is crucial to have access to safety-critical data, such as crash and near-crash records. Sharing this data among AV companies, academic researchers, regulators, and the public can contribute to the overall improvement of AV design and development.
Despite the benefits of sharing safety-critical data, AV companies have been reluctant to do so. A recent study conducted interviews with twelve employees from AV companies to explore the reasons behind this reluctance and identify potential barriers to data sharing.
Barriers to Data Sharing
The study revealed two key barriers that were previously unknown. The first barrier is the inherent nature of the datasets themselves. Safety-critical data contains knowledge that is essential for improving AV safety, and the process of collecting, analyzing, and sharing this data is resource-intensive. Even within a single company, sharing such data can be complicated due to the politics involved. Different teams within a company may have competing interests and priorities, leading to reluctance in sharing data internally.
The second barrier identified by the study is the perception of AV safety knowledge as private rather than public. Interviewees believed that the knowledge gained from safety-critical data gives their companies a competitive edge. They view it as proprietary information that should be guarded to maintain their advantage in the market. This perception hinders the sharing of safety-critical data for the greater social good.
Implications and Way Forward
The findings of this study have important implications for promoting safety-critical AV data sharing. To overcome the barriers identified, several strategies can be considered.
- Debating and Stratifying Public and Private Knowledge: It is essential to initiate discussions and debates within the AV industry and regulatory bodies regarding the classification of safety knowledge as public or private. By defining clear boundaries, companies can feel more secure in sharing data without compromising their competitive advantages.
- Innovating Data Tools and Sharing Pipelines: Developing new tools and technologies that streamline the process of sharing safety-critical data can alleviate resource constraints and minimize the politics associated with data sharing. Companies could collaborate to create standardized data formats and sharing pipelines to facilitate easier and more efficient exchange of information.
- Offsetting Costs and Incentivizing Sharing: Given the resource-intensive nature of collecting safety-critical data, it is crucial to find ways to offset the costs associated with data curation. Incentives, such as tax breaks or grants, could be provided to companies that actively participate in data sharing initiatives. This would encourage greater participation and promote a culture of collaboration in the AV industry.
In conclusion, the barriers to sharing safety-critical data in the autonomous vehicle industry are rooted in the complexities of data collection, internal politics, and the perception of knowledge as a competitive advantage. Addressing these barriers requires industry-wide discussions, technological innovations, and the provision of incentives to encourage data sharing. By overcoming these obstacles, the AV industry can collectively work towards improving AV safety and realizing the full potential of autonomous vehicles.
Read the original article
by jsendak | Apr 26, 2025 | Computer Science
arXiv:2504.16936v1 Announce Type: new
Abstract: Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
Expert Commentary: Evaluating the Audio-Visual Capabilities of Multi-Modal Large Language Models
In recent years, multi-modal large language models (MLLMs) have gained significant attention and achieved remarkable success in processing and understanding information from various modalities such as text, audio, and visual signals. However, despite their widespread use, there has been a lack of comprehensive evaluation measuring the audio-visual capabilities of these models across diverse scenarios.
This paper fills this knowledge gap by presenting a multifaceted evaluation of MLLMs’ audio-visual capabilities, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. These dimensions encompass different aspects that are crucial for assessing the overall performance and potential limitations of MLLMs in processing audio-visual data.
Effectiveness refers to how well MLLMs can accurately process and understand audio-visual information. The experiments conducted in this study reveal that MLLMs demonstrate strong zero-shot and few-shot generalization abilities. This means that even with limited data or completely new examples, they can still achieve impressive performance. This finding highlights the potential of MLLMs in handling tasks that require quick adaptation to new scenarios or concepts, making them highly flexible and versatile.
Efficiency is another important aspect evaluated in the study. Although MLLMs excel in effectiveness, their computational efficiency needs attention. Given their large size and complexity, MLLMs tend to be computationally intensive, which can pose challenges in real-time applications or systems with limited computational resources. Further research and optimization techniques are required to enhance their efficiency without sacrificing performance.
Generalizability is a critical factor in assessing the practical usability of MLLMs. The results indicate that MLLMs heavily rely on the vision modality, and their performance suffers when visual input is corrupted or missing. This limitation implies that MLLMs may not be suitable for tasks where visual information is unreliable or incomplete, such as in scenarios with noisy or degraded visual signals. Addressing this issue is crucial to improve the robustness and generalizability of MLLMs across diverse real-world situations.
Lastly, the study explores the robustness of MLLMs against adversarial attacks. Adversarial attacks attempt to deceive or mislead a model by introducing subtly crafted perturbations to the input data. While MLLMs are not immune to these attacks, they exhibit greater robustness than traditional models. This finding suggests that MLLMs possess some inherent resilience to adversarial inputs, which opens up possibilities for leveraging that robustness in security-sensitive settings.
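To make "subtly crafted perturbations" concrete, the sketch below implements the classic Fast Gradient Sign Method (FGSM) against a placeholder model; the paper's actual attack setup is not described here, so treat this as an assumed, minimal example rather than the evaluation's method.

```python
# Sketch of the Fast Gradient Sign Method (FGSM), a classic way to craft the kind of
# subtle adversarial perturbation discussed above. The model and data are placeholders;
# the paper's actual attack configuration may differ.
import torch
import torch.nn as nn

def fgsm_perturb(model: nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float = 0.01) -> torch.Tensor:
    """Return an adversarially perturbed copy of x within an L-infinity ball of radius eps."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to a valid input range.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

# Toy usage with a placeholder linear "classifier" over flattened 8x8 inputs.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 10))
x = torch.rand(4, 1, 8, 8)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_perturb(model, x, y, eps=0.03)
print((x_adv - x).abs().max())  # perturbation magnitude stays within eps
```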
From a broader perspective, this research is highly relevant to the field of multimedia information systems, animations, artificial reality, augmented reality, and virtual realities. The evaluation of MLLMs’ audio-visual capabilities contributes to our understanding of how these models can be effectively utilized in multimedia processing, including tasks like video captioning, content understanding, and interactive virtual environments. The findings also shed light on the interdisciplinary nature of MLLMs, as they demonstrate the fusion and interplay of language understanding, computer vision, and audio processing.
In conclusion, this paper provides a comprehensive evaluation of the audio-visual capabilities of multi-modal large language models. The findings offer valuable insights into the strengths and limitations of these models, paving the way for future improvements and guiding further research towards enhancing the effectiveness, efficiency, generalizability, and robustness of MLLMs in processing and understanding multi-modal information.
Read the original article
by jsendak | Apr 26, 2025 | Computer Science
Abstract:
The application of Predictive Process Monitoring (PPM) techniques is becoming increasingly widespread due to their capacity to provide organizations with accurate predictions regarding the future behavior of business processes, thereby facilitating more informed decision-making. A plethora of solutions employing these techniques have been proposed in the literature, yet they differ from one another in the event logs they use, the prediction tasks they address, and the methodologies they employ.
In light of the growing recognition of the value of object-centric event logs, including in the context of PPM, this survey focuses on the differences among PPM techniques employed with different event logs, namely traditional event logs and object-centric event logs. By understanding these differences, organizations can gain better insights into which techniques are most suitable for their specific needs.
This survey also examines the classification of PPM methods based on the prediction task they address and the specific methodologies they employ. This categorization allows organizations to identify the most appropriate PPM techniques based on their desired prediction outcomes and the resources and expertise available to them.
Traditional event logs typically capture data about process instances and activities, while object-centric event logs go a step further by also including information about objects involved in the process. This additional information enables PPM techniques to provide more accurate predictions, as they can take into account not only the sequence of activities but also the characteristics and behavior of objects throughout the process.
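As a rough illustration of that distinction (field names are hypothetical and do not follow any particular log standard's exact schema), the sketch below contrasts a traditional event record with an object-centric one:

```python
# Illustrative sketch contrasting a traditional event-log record with an
# object-centric one. Field names and values are hypothetical.

traditional_event = {
    "case_id": "order-1042",          # a single case notion per log
    "activity": "Ship Order",
    "timestamp": "2025-04-20T10:15:00",
}

object_centric_event = {
    "event_id": "e-7781",
    "activity": "Ship Order",
    "timestamp": "2025-04-20T10:15:00",
    # An event may relate to several objects of different types at once.
    "objects": {
        "order":   ["order-1042"],
        "item":    ["item-9", "item-12"],
        "package": ["pkg-3"],
    },
    # Object attributes let predictions use object characteristics,
    # not just the order of activities.
    "object_attributes": {"pkg-3": {"weight_kg": 2.4, "carrier": "DHL"}},
}
```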
Various PPM techniques have been proposed and classified based on the prediction task they aim to achieve. These tasks include predicting the next event in a process, predicting the remaining time for a process instance to complete, and predicting the outcome or performance measures of a process instance.
The specific methodologies employed in PPM techniques include but are not limited to: sequence-based techniques that rely on patterns of past behavior, machine learning-based techniques that learn from historical data to make predictions, and rule-based techniques that use expert knowledge to define prediction rules.
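To ground the ML-based family in something concrete, here is a minimal sketch (with invented traces and a deliberately simple prefix encoding, not any specific surveyed method) of next-activity prediction from historical traces:

```python
# Minimal sketch of an ML-based PPM approach to next-activity prediction: encode each
# trace prefix as a bag of activity counts and train a standard classifier on it.
# Traces and the encoding are illustrative only.
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

traces = [
    ["Register", "Check Credit", "Approve", "Ship"],
    ["Register", "Check Credit", "Reject"],
    ["Register", "Check Stock", "Approve", "Ship"],
]

# Build (prefix, next-activity) training pairs from historical traces.
X_dicts, y = [], []
for trace in traces:
    for i in range(1, len(trace)):
        X_dicts.append(dict(Counter(trace[:i])))   # prefix encoded as activity counts
        y.append(trace[i])                          # the activity that followed

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predict what is likely to happen next for a running case.
running_prefix = dict(Counter(["Register", "Check Credit"]))
print(clf.predict(vec.transform([running_prefix])))
```

Remaining-time prediction follows the same pattern with a regressor instead of a classifier, while rule-based approaches replace the learned model with expert-defined conditions over the prefix.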
In conclusion, this survey highlights the importance of considering the event logs used in PPM techniques and the specific prediction tasks and methodologies employed. By understanding these differences and selecting the most appropriate techniques, organizations can harness the power of PPM to make more informed and accurate predictions about their business processes, leading to improved decision-making and operational efficiency.
Read the original article
by jsendak | Apr 25, 2025 | Computer Science
arXiv:2504.16405v1 Announce Type: new
Abstract: The furnishing of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities.
Among these, understanding image-evoked emotions aims to enhance MLLMs’ empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking.
To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories.
Our core contributions include:
1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated.
2) We design four tasks to evaluate MLLMs’ ability to capture the evoked emotions by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model’s proficiency in performing joint and comparative analysis.
In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly-used MLLMs.
The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal.
Our EEmo-Bench paves the path for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.
Enhancing Multi-Modal Large Language Models (MLLMs) with Image-Evoked Emotions
This article introduces the concept of image-evoked emotions and its relevance in enhancing the empathy of multi-modal large language models (MLLMs). MLLMs have gained significant attention in various domains, including human-machine interaction and advertising recommendations. However, the evaluation of MLLMs’ understanding of image-evoked emotions is currently limited and lacks a systematic and comprehensive assessment.
The Importance of Emotion in MLLMs
Emotion plays a crucial role in human communication and understanding, and the ability to perceive and understand emotions is highly desirable in MLLMs. By incorporating image-evoked emotions into MLLMs, these models can better empathize with users and provide more tailored responses and recommendations.
The EEmo-Bench Benchmark
To address the limitations in evaluating MLLMs’ understanding of image-evoked emotions, the authors introduce EEmo-Bench, a novel benchmark specifically designed for this purpose. EEmo-Bench focuses on the analysis of the evoked emotions in images across diverse content categories.
The benchmark includes the following core contributions:
- Diversity of evoked emotions: To assess emotional attributes, the authors adopt an emotion ranking strategy and utilize the Valence-Arousal-Dominance (VAD) model (illustrated in the sketch after this list). A dataset of 1,960 images is collected and manually annotated for emotional assessment.
- Four evaluation tasks: Four tasks are designed to evaluate MLLMs’ ability to capture evoked emotions and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced for joint and comparative analysis.
- Thorough assessment of MLLMs: A comprehensive evaluation is conducted on 19 commonly-used MLLMs, with a collection of 6,773 question-answer pairs. The results highlight the performance of different models in various evaluation dimensions.
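To make the first contribution concrete, the sketch below shows one hypothetical way such an annotation could be represented, with per-image VAD attributes and a ranking of evoked emotions; the field names and scales are assumptions for illustration, not the benchmark's released schema.

```python
# Hypothetical sketch of an EEmo-Bench-style annotation record: Valence-Arousal-
# Dominance (VAD) attributes plus a ranking of evoked emotions for one image.
# Field names and value ranges are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class EvokedEmotionAnnotation:
    image_id: str
    # VAD attributes, each on a hypothetical 1-9 scale.
    valence: float
    arousal: float
    dominance: float
    # Emotions ordered from most to least strongly evoked.
    emotion_ranking: list = field(default_factory=list)

sample = EvokedEmotionAnnotation(
    image_id="img_00042",
    valence=7.5, arousal=6.0, dominance=4.5,
    emotion_ranking=["awe", "joy", "surprise"],
)

# A Ranking-task item could then ask a model to reproduce the annotated order,
# scored for example by exact match or rank correlation against the reference.
print(sample)
```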
Insights and Future Directions
The results of the EEmo-Bench benchmark reveal that while some proprietary and large-scale open-source MLLMs show promising overall performance, there are still areas in which these models’ analytical capabilities can be improved. This highlights the need for further research and innovation to enhance MLLMs’ comprehension and perception of image-evoked emotions.
The concepts discussed in this article are highly relevant to the wider field of multimedia information systems, as they bridge the gap between textual data and visual content analysis. Incorporating image-evoked emotions into MLLMs opens up new avenues for research in areas such as virtual reality, augmented reality, and artificial reality.
The multi-disciplinary nature of the concepts presented here underscores the importance of collaboration between researchers from fields such as computer vision, natural language processing, and psychology. By combining expertise from these diverse domains, we can develop more sophisticated MLLMs that truly understand and respond to the emotions evoked by visual stimuli.
In conclusion, the EEmo-Bench benchmark serves as a stepping stone for future research in enhancing the comprehension and perception capabilities of MLLMs in the context of image-evoked emotions. This research has significant implications for machine-centric emotion perception and understanding, with applications ranging from personalized user experiences to improved advertising recommendations.
Read the original article
by jsendak | Apr 25, 2025 | Computer Science
Expert Commentary:
The article highlights the challenges faced by small and medium-sized enterprises (SMEs) in the context of sustainability and compliance with global carbon regulations. SMEs often struggle to navigate the complex carbon trading process and face entry barriers into carbon markets.
The proposed solution, a blockchain-based decentralized carbon credit trading platform tailored specifically for SMEs in Taiwan, offers several advantages. By leveraging blockchain technology, the platform aims to reduce informational asymmetry and intermediary costs, two key challenges in carbon markets.
One interesting aspect of this proposal is the integration of Ethereum-based smart contracts. Smart contracts automate transactions, provide transparency, and reduce administrative burdens. This tackles the technical complexities and market risks associated with carbon trading, making it more accessible for SMEs.
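As a conceptual sketch only (plain Python rather than Solidity, with hypothetical names, and not the paper's actual contract), the snippet below mimics the core logic such a contract would automate: balance checks, atomic credit transfers, and a transparent, append-only record of events.

```python
# Conceptual sketch, in plain Python, of the transfer logic an Ethereum smart contract
# could automate for carbon credits. This is not the paper's contract and not Solidity;
# names and units are hypothetical.

class CarbonCreditLedger:
    def __init__(self):
        self.balances = {}   # account -> credit balance (tonnes CO2e)
        self.events = []     # append-only record, analogous to emitted contract events

    def issue(self, account: str, amount: float) -> None:
        self.balances[account] = self.balances.get(account, 0.0) + amount
        self.events.append(("Issued", account, amount))

    def transfer(self, sender: str, receiver: str, amount: float) -> None:
        if self.balances.get(sender, 0.0) < amount:
            raise ValueError("insufficient credits")  # a contract would revert here
        self.balances[sender] -= amount
        self.balances[receiver] = self.balances.get(receiver, 0.0) + amount
        self.events.append(("Transferred", sender, receiver, amount))

ledger = CarbonCreditLedger()
ledger.issue("sme_taipei", 100.0)
ledger.transfer("sme_taipei", "buyer_eu", 25.0)
print(ledger.balances, ledger.events)
```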
To validate the effectiveness of the proposed system, a controlled experimental design was conducted, comparing it with a conventional centralized carbon trading platform. The statistical analysis confirmed that the blockchain-based platform minimized time and expenses while ensuring compliance with the Carbon Border Adjustment Mechanism (CBAM) and the Clean Competition Act (CCA).
The study also applied the Kano model to measure user satisfaction, identifying essential features and prioritizing future enhancements. This approach ensures that the platform meets the needs of SMEs and continues to evolve based on their requirements.
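For readers unfamiliar with the Kano model, the sketch below implements a simplified version of the standard Kano evaluation table, which maps a paired "functional" answer (feature present) and "dysfunctional" answer (feature absent) to a feature category; the paper's exact questionnaire and mapping may differ.

```python
# Simplified sketch of the standard Kano evaluation table. Each feature gets a
# "functional" answer (how users feel if the feature is present) and a "dysfunctional"
# answer (how they feel if it is absent); the pair maps to a Kano category.

ANSWERS = ["like", "must_be", "neutral", "live_with", "dislike"]

def kano_category(functional: str, dysfunctional: str) -> str:
    f, d = ANSWERS.index(functional), ANSWERS.index(dysfunctional)
    if (f, d) in [(0, 0), (4, 4)]:
        return "Questionable"
    if f == 0 and d in (1, 2, 3):
        return "Attractive"
    if f == 0 and d == 4:
        return "One-dimensional"
    if f in (1, 2, 3) and d == 4:
        return "Must-be"
    if f in (1, 2, 3) and d in (1, 2, 3):
        return "Indifferent"
    return "Reverse"

# Hypothetical example: users like automated CBAM reporting when present
# and dislike its absence, so it behaves as a performance (one-dimensional) feature.
print(kano_category("like", "dislike"))      # -> One-dimensional
print(kano_category("neutral", "dislike"))   # -> Must-be
```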
Overall, this research contributes a comprehensive solution for SMEs seeking to achieve carbon neutrality. By harnessing blockchain technology, the platform addresses key barriers and empowers SMEs to participate in global carbon markets. It highlights the transformative potential of blockchain in creating a more sustainable and transparent future.
Read the original article