by jsendak | May 1, 2025 | Computer Science
arXiv:2504.20370v1 Announce Type: new
Abstract: The Bayer-patterned color filter array (CFA) has been the go-to solution for color image sensors. In augmented reality (AR), although color interpolation (i.e., demosaicing) of pre-demosaic RAW images facilitates a user-friendly rendering, it creates no benefits in offloaded DNN analytics but increases the image channels by 3 times, inducing higher transmission overheads. The potential optimization in frame preprocessing of DNN offloading is yet to be investigated. To that end, we propose ABO, an adaptive RAW frame offloading framework that parallelizes demosaicing with DNN computation. Its contributions are three-fold: First, we design a configurable tile-wise RAW image neural codec to compress frame sizes while sustaining downstream DNN accuracy under bandwidth constraints. Second, based on content-aware tiles-in-frame selection and runtime bandwidth estimation, a dynamic transmission controller adaptively calibrates codec configurations to maximize the DNN accuracy. Third, we further optimize the system pipelining to achieve lower end-to-end frame processing latency and higher throughput. Through extensive evaluations on a prototype platform, ABO consistently achieves 40% more frame processing throughput and 30% less end-to-end latency while improving the DNN accuracy by up to 15% over SOTA baselines. It also exhibits improved robustness against dim lighting and motion blur.
Analysis: Adaptation and Optimization in RAW Frame Offloading for Augmented Reality
The article introduces a novel approach called ABO (Adaptive RAW frame offloading) for optimizing the preprocessing of RAW images in the context of augmented reality (AR). The authors highlight the limitations of the traditional color interpolation (demosaicing) technique in AR, which increases image channels and transmission overheads without providing any benefits in offloaded deep neural network (DNN) analytics. This motivates the need for a new framework that optimizes the preprocessing of RAW frames to enhance DNN accuracy, frame processing throughput, and end-to-end latency in AR applications.
The multidisciplinary nature of this research becomes evident as it combines concepts from various fields such as computer vision, image processing, multimedia information systems, and augmented reality. By addressing the specific challenges posed by color interpolation in AR, the proposed framework brings together techniques from image compression, neural codec design, bandwidth estimation, and system optimization. This interdisciplinary approach allows for a holistic solution that improves the performance of AR systems.
Relevance to Multimedia Information Systems
Within the field of multimedia information systems, this research contributes to the area of image processing and optimization techniques for efficient data transmission and preprocessing. By considering the unique requirements of AR applications, the authors propose a configurable tile-wise RAW image neural codec that compresses frame sizes while maintaining DNN accuracy. This not only reduces transmission overheads but also allows for efficient storage and processing of RAW frames in multimedia systems.
Additionally, the incorporation of content-aware tiles-in-frame selection and runtime bandwidth estimation in the dynamic transmission controller demonstrates the integration of intelligent decision-making mechanisms in multimedia information systems. These techniques leverage contextual information to dynamically adjust codec configurations and maximize DNN accuracy. The optimization of system pipelining further enhances frame processing latency and throughput, which are crucial factors for real-time multimedia systems.
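The controller's core decision — which tiles get higher-fidelity encoding under the current bandwidth estimate — can be pictured as a greedy allocation over per-tile saliency scores. The sketch below is illustrative only, not the authors' algorithm: the quality levels, per-tile sizes, and content scores are hypothetical placeholders.

```python
def select_tile_configs(tile_scores, sizes_kb, budget_kb):
    """Greedy sketch: start every tile at the lowest codec quality,
    then upgrade the most content-salient tiles first until the
    estimated bandwidth budget is exhausted.
    tile_scores: content-saliency per tile (hypothetical values)
    sizes_kb[q]: per-tile encoded size at quality level q
    budget_kb:  runtime bandwidth estimate for this frame
    """
    n = len(tile_scores)
    configs = [0] * n                         # everyone starts at level 0
    budget = budget_kb - n * sizes_kb[0]      # cost of the baseline encoding
    order = sorted(range(n), key=lambda i: -tile_scores[i])
    for i in order:
        # try the highest affordable quality level for this tile
        for q in range(len(sizes_kb) - 1, 0, -1):
            extra = sizes_kb[q] - sizes_kb[configs[i]]
            if extra <= budget:
                budget -= extra
                configs[i] = q
                break
    return configs

# e.g. 3 tiles, quality sizes 10/20/40 KB, 80 KB budget
configs = select_tile_configs([0.9, 0.1, 0.5], [10, 20, 40], 80)
```

With those numbers the most salient tile gets the top quality level and the remaining budget is split across the others, which matches the paper's intuition of spending bandwidth where the DNN accuracy gain is largest.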
Connection to Animation, Artificial Reality, Augmented Reality, and Virtual Realities
While the focus of this article is specifically on augmented reality, it is worth noting the connections between this research and other areas such as animation, artificial reality, and virtual realities. These domains often rely on similar underlying technologies and face similar challenges related to image processing, system optimization, and rendering.
For instance, the optimization of image preprocessing in augmented reality can also apply to virtual reality systems, where the efficient handling of high-resolution image data is essential for creating immersive experiences. Similarly, the concept of adaptive offloading and intelligent decision-making algorithms can be extended to animation and artificial reality systems, where real-time rendering and content adaptation play a crucial role.
In conclusion, this article presents a comprehensive framework, ABO, that addresses the limitations of color interpolation in AR and optimizes RAW frame preprocessing for enhanced DNN accuracy, frame processing throughput, and end-to-end latency. With its multidisciplinary approach and relevance to multimedia information systems, animations, artificial reality, augmented reality, and virtual realities, this research contributes to the advancement of various fields and lays the foundation for more efficient and immersive multimedia experiences in the future.
Read the original article
by jsendak | May 1, 2025 | Computer Science
Tabular data embedded within PDF files, web pages, and other document formats are widely used in various sectors, such as government, engineering, science, and business. These tabular datasets, known as human-centric tables (HCTs), have unique characteristics that make them valuable for deriving critical insights. However, their complex layouts, and the limited ability of existing tools to operate on them at scale, pose significant challenges for traditional data extraction, processing, and querying methods.
Current solutions in the field primarily aim to transform these tables into relational formats for SQL queries. While this approach has been helpful to some extent, it falls short when dealing with the diverse and complex layouts of HCTs. Consequently, querying such tables becomes a challenging task.
To address this challenge, the authors of this paper introduce HCT-QA, an extensive benchmark for question answering over HCTs, pairing tables with natural language queries and their corresponding answers. The benchmark dataset consists of 2,188 real-world HCTs along with 9,835 question-answer (QA) pairs, plus 4,679 synthetic tables with 67.5K QA pairs.
While HCTs can potentially be processed by different types of query engines, this paper primarily focuses on assessing the capabilities of Large Language Models (LLMs) as potential engines for processing and querying such tables. LLMs, such as GPT-3, have shown remarkable advancements in natural language processing tasks and have the potential to handle the challenges presented by HCTs.
The HCT-QA benchmark provides an opportunity to evaluate the performance of LLMs in processing and querying complex HCTs. By assessing their ability to answer a wide range of questions posed in natural language, researchers can gain insights into the strengths and limitations of LLMs in this context. This analysis can inform the development of novel techniques and approaches that harness the power of LLMs to effectively process and query HCTs.
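A common way to score such a benchmark is normalized exact match over the QA pairs. The sketch below is an assumed scoring convention for illustration, not HCT-QA's official metric; the normalization rules (lowercasing, whitespace collapsing) are my own.

```python
def exact_match_accuracy(predictions, golds):
    """Fraction of model answers matching the gold answer after
    lowercasing and whitespace normalization (assumed metric)."""
    def norm(s):
        return " ".join(str(s).lower().split())
    hits = sum(norm(p) == norm(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

# e.g. two of three LLM answers match after normalization
score = exact_match_accuracy(["42", " Total  Revenue ", "N/A"],
                             ["42", "total revenue", "none"])
```

In practice, benchmarks often add fuzzier matching (numeric tolerance, set comparison for multi-cell answers), but exact match gives a conservative lower bound on LLM performance.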
In conclusion, the HCT-QA benchmark and the focus on Large Language Models present an exciting avenue for advancing the field of tabular data processing and querying. By addressing the challenges posed by complex HCT layouts, researchers can unlock new possibilities for deriving insights from tabular data in various domains.
Read the original article
by jsendak | Apr 30, 2025 | Computer Science
arXiv:2504.18799v1 Announce Type: new
Abstract: Multimodal music emotion recognition (MMER) is an emerging discipline in music information retrieval that has experienced a surge in interest in recent years. This survey provides a comprehensive overview of the current state-of-the-art in MMER. Discussing the different approaches and techniques used in this field, the paper introduces a four-stage MMER framework, including multimodal data selection, feature extraction, feature processing, and final emotion prediction. The survey further reveals significant advancements in deep learning methods and the increasing importance of feature fusion techniques. Despite these advancements, challenges such as the need for large annotated datasets, datasets with more modalities, and real-time processing capabilities remain. This paper also contributes to the field by identifying critical gaps in current research and suggesting potential directions for future research. The gaps underscore the importance of developing robust, scalable, and interpretable models for MMER, with implications for applications in music recommendation systems, therapeutic tools, and entertainment.
Expert Commentary: Multimodal Music Emotion Recognition in the Context of Multimedia Information Systems and Virtual Realities
Music holds great emotional power, and understanding and predicting the emotions it evokes is a fascinating and important area of research. The emerging discipline of Multimodal Music Emotion Recognition (MMER) aims to leverage multiple modalities such as audio, lyrics, gestures, and physiological signals to recognize and predict the emotional content of music. This survey paper provides a comprehensive overview of the current state-of-the-art in MMER, shedding light on the various approaches and techniques used in this field.
The field of MMER intersects with several other domains, making it a truly multi-disciplinary subject. Multimedia Information Systems, for instance, play a significant role in MMER by providing the infrastructure and tools to handle and analyze large volumes of multimodal music data. The techniques discussed in this survey, such as feature extraction and processing, are fundamental to extracting relevant information from music and its associated modalities. These techniques are shared with other fields, such as Speech and Image Processing, highlighting the cross-pollination of knowledge and methodologies.
Furthermore, Animations, Artificial Reality, Augmented Reality, and Virtual Realities are all related to MMER. These technologies offer new ways to experience and interact with music, providing additional modalities for MMER. For example, in Virtual Reality environments, users can be fully immersed in a musical experience and their physiological signals and gestures can be captured, enhancing the multimodal data available for emotion recognition. By incorporating these technologies, MMER can have practical applications in areas such as interactive entertainment, virtual music therapy, and even music recommendation systems that can generate personalized playlists based on the user’s emotional state.
The survey paper highlights the advancements in deep learning methods in MMER. Deep learning algorithms, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have shown remarkable performance in various domains, and their application in MMER has yielded promising results. Deep learning allows for the automatic extraction of relevant features from music and other modalities, reducing the need for manual feature engineering. However, it is important to mention that large annotated datasets are still required to train these models effectively, and creating such datasets can be a laborious and resource-intensive task.
The paper also emphasizes the increasing importance of feature fusion techniques in MMER. As the field progresses, researchers are moving towards combining information from multiple modalities to improve emotion recognition accuracy. Fusion techniques such as early fusion, late fusion, and hybrid fusion are discussed in the paper, each with its advantages and trade-offs. The choice of fusion technique depends on the specific requirements of the application and the available data. This trend towards multimodal fusion reflects the realization that a holistic understanding of music emotions requires the integration of information from different sources.
Despite the advancements in MMER, several challenges still need to be addressed. The need for large annotated datasets that cover a wide range of music genres, emotions, and demographic diversity is one significant challenge. Building such datasets is crucial for developing robust and generalizable MMER models. Additionally, the field would benefit from datasets with more modalities, including visual and physiological signals, as they can provide richer information for emotion recognition. Furthermore, real-time processing capabilities are essential for practical applications of MMER, such as interactive music systems. Developing efficient and scalable algorithms to handle real-time multimodal music data is a direction that future research should aim to pursue.
In conclusion, this survey paper provides a comprehensive overview of MMER, its current state-of-the-art, and potential avenues for future research. The multi-disciplinary nature of MMER, with its connections to Multimedia Information Systems, Animations, Artificial Reality, Augmented Reality, and Virtual Realities, opens up exciting possibilities for understanding and harnessing the emotional power of music.
Read the original article
by jsendak | Apr 30, 2025 | Computer Science
Analysis: Asymmetric numeral systems (ANS) have gained significant attention in recent years due to their high compression efficiency and low computational complexity. This article presents several algorithms for generating tables for ANS, with a focus on optimizing the discrepancy and entropy loss.
Discrepancy refers to how well the generated tables distribute the probability mass across different symbols. Lower discrepancy values indicate better distribution, leading to more efficient compression. The article claims that the presented algorithms are optimal in terms of discrepancy, which is a significant achievement in ANS research.
The optimization of entropy loss is another crucial aspect discussed in the article. Entropy loss refers to the difference between the theoretical entropy of a data source and the compressed representation using ANS. Minimizing entropy loss is essential to ensure that the compressed data retains as much information as possible.
The article also introduces improved theoretical bounds for entropy loss in tabled ANS. These bounds provide a better understanding of the expected compression performance and can guide future research in optimizing ANS algorithms.
In addition to the theoretical analysis, the article includes a brief empirical evaluation of the stream variant of ANS. Empirical evaluations are crucial to validate the theoretical claims and assess the performance of the proposed algorithms in practice.
Expert Insights:
The presented algorithms for generating tables in ANS are indeed a significant contribution to the field. Optimizing discrepancy and entropy loss is a crucial step in improving the compression efficiency of ANS. By providing algorithms that are proven to be optimal in terms of discrepancy, the article enables researchers and practitioners to achieve state-of-the-art compression performance.
The improved theoretical bounds for entropy loss also enhance our understanding of ANS and its limitations. These bounds can guide future research in developing new algorithms or refining existing ones to further minimize entropy loss and improve compression performance.
The empirical evaluation of the stream variant of ANS complements the theoretical analysis by demonstrating the real-world performance of the proposed algorithms. This evaluation allows us to assess the practical impact of the algorithms and provides insights into their suitability for different types of data sources.
Overall, this article contributes to the advancement of ANS by presenting optimized algorithms for table generation and offering improved theoretical bounds for entropy loss. The combination of theoretical analysis and empirical evaluation strengthens the credibility of the findings and sets a foundation for future research in ANS compression.
Read the original article
by jsendak | Apr 29, 2025 | Computer Science
arXiv:2504.17938v1 Announce Type: new
Abstract: The Quality of Experience (QoE) is the user's satisfaction while streaming a video session over an over-the-top (OTT) platform like YouTube. QoE on YouTube reflects a smooth streaming session without any buffering or quality-shift events. One of the most important factors nowadays affecting QoE of YouTube is frequent shifts from higher to lower resolutions and vice versa. These shifts ensure a smooth streaming session; however, it might get a lower mean opinion score. For instance, dropping from 1080p to 480p during a video can preserve continuity but might reduce the viewer's enjoyment. Over time, OTT platforms are looking for alternative ways to boost user experience instead of relying on traditional Quality of Service (QoS) metrics such as bandwidth, latency, and throughput. As a result, we look into the relationship between quality shifting in YouTube streaming sessions and the channel metrics RSRP, RSRQ, and SNR. Our findings state that these channel metrics positively correlate with shifts. Thus, in real time, OTT platforms can rely on these metrics alone to classify video streaming sessions into lower- and higher-resolution categories, providing more resources to improve user experience. Using traditional Machine Learning (ML) classifiers, we achieved an accuracy of 77 percent while using only RSRP, RSRQ, and SNR. In the era of 5G and beyond, where ultra-reliable, low-latency networks promise enhanced streaming capabilities, the proposed methodology can be used to improve OTT services.
The Impact of Quality Shifting on YouTube Streaming Sessions
In the increasingly digital world we live in, the demand for high-quality streaming services has skyrocketed. As users turn to platforms like YouTube to consume video content, their satisfaction, known as Quality of Experience (QoE), becomes a key factor in their overall viewing experience. In this context, it is essential to understand how the quality shifting phenomenon affects QoE, and how it can be optimized to enhance user satisfaction.
Traditionally, QoS metrics such as bandwidth, latency, and throughput have been used to assess streaming performance. However, as the article points out, these metrics alone are no longer sufficient to measure QoE accurately. This is where the concept of quality shifting comes into play. By dynamically adjusting video quality during a streaming session, platforms like YouTube can ensure a smooth viewing experience without buffering interruptions. However, this practice can also impact viewer enjoyment. For example, sudden shifts from higher to lower resolutions can lead to a decrease in satisfaction.
The study discussed in the article delves into the relationship between quality shifting in YouTube streaming sessions and specific channel metrics: RSRP, RSRQ, and SNR. These metrics, which are related to signal strength and quality, were found to positively correlate with shifts. In other words, they can serve as indicators to predict when a video streaming session might transition between lower and higher resolutions. By leveraging this information in real-time, over-the-top (OTT) platforms can allocate appropriate resources to improve user experience.
The researchers' use of traditional Machine Learning (ML) classifiers to reach a 77% accuracy rate using only RSRP, RSRQ, and SNR is a significant finding. It demonstrates the potential of predictive algorithms to enhance QoE by proactively managing quality shifts in streaming sessions.
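The classification setup can be sketched in a few lines. The data below is synthetic — invented channel measurements standing in for real traces — and the logistic model is one plausible choice of "traditional ML classifier"; no claim is made about reproducing the paper's 77% figure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 400

# Synthetic RSRP (dBm), RSRQ (dB), SNR (dB) for low- vs high-resolution
# sessions; the means and spreads are illustrative assumptions.
low  = np.column_stack([rng.normal(-105, 4, n),
                        rng.normal(-14, 2, n),
                        rng.normal(5, 3, n)])
high = np.column_stack([rng.normal(-85, 4, n),
                        rng.normal(-8, 2, n),
                        rng.normal(18, 3, n)])
X = np.vstack([low, high])
y = np.array([0] * n + [1] * n)   # 0 = lower-resolution, 1 = higher-resolution

clf = LogisticRegression().fit(X, y)
accuracy = clf.score(X, y)        # near-perfect on this well-separated toy data
```

On real traces the classes overlap far more, which is why the paper reports 77% rather than the near-perfect score this cleanly separated toy data yields.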
In the wider field of multimedia information systems, this research has important implications. As the demand for high-quality video content continues to rise and technologies such as 5G promise enhanced streaming capabilities, finding innovative ways to optimize QoE becomes imperative. By combining insights from multiple disciplines, including computer science, telecommunications, and human-computer interaction, this study contributes to improving the overall streaming experience for users.
Beyond YouTube, the concepts discussed in this article also have implications for other forms of multimedia, such as animations, artificial reality, augmented reality, and virtual realities. These immersive multimedia experiences heavily rely on streaming technologies, and ensuring a smooth and uninterrupted experience is crucial for user engagement. By further exploring the relationship between quality shifting and user satisfaction, researchers can develop innovative solutions to enrich multimedia experiences across various platforms and applications.
Conclusion
The study presented in this article highlights the impact of quality shifting on YouTube streaming sessions and its relationship with channel metrics such as RSRP, RSRQ, and SNR. By leveraging these metrics and utilizing machine learning techniques, OTT platforms can predict quality shifts in real-time and allocate appropriate resources to enhance user experience. The multi-disciplinary nature of this research, spanning areas like multimedia information systems, animations, artificial reality, augmented reality, and virtual realities, makes it a valuable contribution to the field. As technologies evolve and demand for high-quality streaming services grows, innovative approaches like those presented in this study will play a crucial role in delivering an optimal multimedia experience.
Read the original article
by jsendak | Apr 29, 2025 | Computer Science
Safety-Critical Data and Autonomous Vehicles: Barriers to Sharing
Autonomous vehicles (AVs) have the potential to transform transportation by greatly improving road safety. However, to ensure their safety and efficacy, it is crucial to have access to safety-critical data, such as crash and near-crash records. Sharing this data among AV companies, academic researchers, regulators, and the public can contribute to the overall improvement of AV design and development.
Despite the benefits of sharing safety-critical data, AV companies have been reluctant to do so. A recent study conducted interviews with twelve employees from AV companies to explore the reasons behind this reluctance and identify potential barriers to data sharing.
Barriers to Data Sharing
The study revealed two key barriers that were previously unknown. The first barrier is the inherent nature of the datasets themselves. Safety-critical data contains knowledge that is essential for improving AV safety, and the process of collecting, analyzing, and sharing this data is resource-intensive. Even within a single company, sharing such data can be complicated due to the politics involved. Different teams within a company may have competing interests and priorities, leading to reluctance in sharing data internally.
The second barrier identified by the study is the perception of AV safety knowledge as private rather than public. Interviewees believed that the knowledge gained from safety-critical data gives their companies a competitive edge. They view it as proprietary information that should be guarded to maintain their advantage in the market. This perception hinders the sharing of safety-critical data for the greater social good.
Implications and Way Forward
The findings of this study have important implications for promoting safety-critical AV data sharing. To overcome the barriers identified, several strategies can be considered.
- Debating and Stratifying Public and Private Knowledge: It is essential to initiate discussions and debates within the AV industry and regulatory bodies regarding the classification of safety knowledge as public or private. By defining clear boundaries, companies can feel more secure in sharing data without compromising their competitive advantages.
- Innovating Data Tools and Sharing Pipelines: Developing new tools and technologies that streamline the process of sharing safety-critical data can alleviate resource constraints and minimize the politics associated with data sharing. Companies could collaborate to create standardized data formats and sharing pipelines to facilitate easier and more efficient exchange of information.
- Offsetting Costs and Incentivizing Sharing: Given the resource-intensive nature of collecting safety-critical data, it is crucial to find ways to offset the costs associated with data curation. Incentives, such as tax breaks or grants, could be provided to companies that actively participate in data sharing initiatives. This would encourage greater participation and promote a culture of collaboration in the AV industry.
In conclusion, the barriers to sharing safety-critical data in the autonomous vehicle industry are rooted in the complexities of data collection, internal politics, and the perception of knowledge as a competitive advantage. Addressing these barriers requires industry-wide discussions, technological innovations, and the provision of incentives to encourage data sharing. By overcoming these obstacles, the AV industry can collectively work towards improving AV safety and realizing the full potential of autonomous vehicles.
Read the original article