by jsendak | Apr 12, 2025 | AI
arXiv:2504.07375v1 Announce Type: new Abstract: Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awareness of multimodal environmental information from both 2D and 3D observations, hindering the further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multimodal information as input encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompt. Besides, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories compared to the state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models will be released at https://github.com/IRMVLab/MMTwin.
The article “Predicting Hand Motion with Multimodal Diffusion Models” addresses the challenge of accurately predicting hand trajectories in 3D space, which is crucial for understanding human intentions and enabling seamless interaction between humans and robots. Existing hand trajectory prediction (HTP) methods are limited to 2D egocentric video inputs and fail to leverage multimodal environmental information. Additionally, these models overlook the relationship between hand movements and headset camera egomotion. To overcome these limitations, the authors propose a novel diffusion model called MMTwin, which takes 2D RGB images, 3D point clouds, past hand waypoints, and text prompts as input. MMTwin integrates two latent diffusion models, egomotion diffusion and HTP diffusion, to predict camera egomotion and future hand trajectories simultaneously. The authors also introduce a hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to effectively fuse multimodal features. Experimental results on multiple datasets demonstrate that MMTwin outperforms existing baselines and generalizes well to unseen environments. The code and pretrained models are available for further exploration.
Predicting Multimodal 3D Hand Trajectories with MMTwin
In the field of robotics, predicting hand motion plays a crucial role in understanding human intentions and bridging the gap between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods have focused primarily on forecasting the future hand waypoints in 3D space based on past egocentric observations. However, these models are designed to accommodate only 2D egocentric video inputs, which limits their ability to leverage multimodal environmental information from both 2D and 3D observations, hindering the overall performance of 3D HTP.
In addition to the limitations posed by the lack of multimodal awareness, current models also overlook the synergy between hand movements and headset camera egomotion. They often either predict hand trajectories in isolation or encode egomotion solely from past frames. This oversight hampers the accuracy and effectiveness of the predictions.
To address these limitations and pioneer a new approach to multimodal 3D hand trajectory prediction, we propose MMTwin, a pair of novel diffusion models. MMTwin is designed to absorb multimodal information as input, encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompts. By integrating two latent diffusion models, the egomotion diffusion and the HTP diffusion, as twins, MMTwin predicts camera egomotion and future hand trajectories concurrently.
A key element of MMTwin is its hybrid Mamba-Transformer module, which serves as the denoising model of the HTP diffusion. This module fuses the multimodal features more effectively, yielding better predictions than existing baselines in the field.
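To make the twin structure concrete, below is a minimal, hypothetical PyTorch sketch of a single denoising step in which an egomotion latent and a hand-trajectory latent are refined concurrently, with the trajectory denoiser conditioned on fused multimodal tokens. The module names, dimensions, and the GRU plus Transformer layers standing in for the hybrid Mamba-Transformer block are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HTPDenoiser(nn.Module):
    """Simplified stand-in for MMTwin's hybrid Mamba-Transformer denoiser.
    A GRU models the sequential branch (placeholder for the Mamba block) and a
    Transformer encoder layer models global attention; both are assumptions."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.seq_branch = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        self.attn_branch = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latent, cond_tokens):
        # Concatenate noisy trajectory tokens with multimodal condition tokens.
        x = torch.cat([noisy_latent, cond_tokens], dim=1)
        seq_out, _ = self.seq_branch(x)
        fused = self.attn_branch(seq_out)
        # Predict noise only for the trajectory tokens.
        return self.out(fused[:, : noisy_latent.size(1)])

# Toy tensors: batch 2, 8 future waypoints, 16 condition tokens, latent dim 256.
B, T, C, D = 2, 8, 16, 256
ego_denoiser = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)  # egomotion twin (simplified)
htp_denoiser = HTPDenoiser(D)

ego_latent = torch.randn(B, T, D)      # noisy egomotion latent
htp_latent = torch.randn(B, T, D)      # noisy hand-trajectory latent
cond = torch.randn(B, C, D)            # fused RGB / point-cloud / text / past-waypoint features

# One concurrent denoising step: the egomotion output also conditions the HTP twin.
ego_eps = ego_denoiser(ego_latent)
htp_eps = htp_denoiser(htp_latent, torch.cat([cond, ego_eps], dim=1))
print(ego_eps.shape, htp_eps.shape)    # torch.Size([2, 8, 256]) twice
```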
The efficacy of our proposed MMTwin model was evaluated through extensive experimentation on three publicly available datasets, as well as our self-recorded data. The results demonstrate that MMTwin predicts more plausible future 3D hand trajectories than state-of-the-art baselines. Furthermore, MMTwin generalizes well to unseen environments.
We are excited to announce that the code and pretrained models of MMTwin are available for public access. We believe that the release of our work will provide researchers in the field with valuable resources to further advance multimodal 3D hand trajectory prediction.
For more information and access to the code and pretrained models, please visit our GitHub repository at: https://github.com/IRMVLab/MMTwin.
The paper titled “MMTwin: Multimodal 3D Hand Trajectory Prediction with Egomotion Diffusion” addresses the challenge of predicting hand motion in order to understand human intentions and bridge the gap between human movements and robot manipulations. The authors highlight the limitations of existing hand trajectory prediction (HTP) methods, which are designed for 2D egocentric video inputs and do not effectively utilize multimodal environmental information from both 2D and 3D observations.
To overcome these limitations, the authors propose a novel diffusion model called MMTwin for multimodal 3D hand trajectory prediction. MMTwin takes various modalities as input, including 2D RGB images, 3D point clouds, past hand waypoints, and text prompts. The model consists of two latent diffusion models, egomotion diffusion and HTP diffusion, which work together to predict camera egomotion and future hand trajectories concurrently.
A key contribution of this work is the introduction of a hybrid Mamba-Transformer module as the denoising model of the HTP diffusion. This module helps in effectively fusing multimodal features and improving the prediction performance. The authors evaluate the proposed MMTwin on three publicly available datasets as well as their self-recorded data. The experimental results demonstrate that MMTwin outperforms state-of-the-art baselines in terms of predicting plausible future 3D hand trajectories. Furthermore, the model generalizes well to unseen environments.
Overall, this paper introduces a novel approach to multimodal 3D hand trajectory prediction by incorporating various modalities and leveraging the synergy between hand movements and headset camera egomotion. The proposed MMTwin model shows promising results and opens up possibilities for further research in this domain. The release of code and pretrained models on GitHub will facilitate the adoption and extension of this work by the research community.
Read the original article
by jsendak | Feb 21, 2025 | Computer Science
Personalized Age Transformation Using Diffusion Model
Age transformation of facial images is a task that involves modifying a person’s appearance to make them look older or younger while maintaining their identity. While deep learning methods have been successful in creating natural age transformations, they often fail to capture the individual-specific features influenced by a person’s life history. In this paper, the authors propose a novel approach for personalized age transformation using a diffusion model.
The authors’ diffusion model takes a facial image and a target age as input and generates an age-edited face image as output. This model is able to capture not only the average age transitions but also the individual-specific appearances influenced by their life histories. To achieve this, the authors incorporate additional supervision using self-reference images, which are facial images of the same person at different ages.
The authors fine-tune a pretrained diffusion model for personalized adaptation using approximately 3 to 5 self-reference images. This allows the model to learn and understand the unique characteristics of the individual’s aging process. By incorporating self-reference images, the model is able to better preserve the identity of the person while performing age editing.
In addition to using self-reference images, the authors also design an effective prompt to further enhance the performance of age editing and identity preservation. The prompt serves as a guiding signal for the diffusion model, helping it generate more accurate and visually pleasing age-edited face images.
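The authors' exact fine-tuning recipe is not reproduced here, but the general idea of adapting a pretrained diffusion model to a handful of self-reference images with a guiding prompt can be sketched roughly as follows. This is a generic, DreamBooth-style training step using the Hugging Face diffusers library; the checkpoint, prompt wording, and hyperparameters are assumptions, not the authors' settings.

```python
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler

# Generic pretrained text-to-image diffusion model (checkpoint choice is an assumption).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, text_encoder, tok = pipe.unet, pipe.vae, pipe.text_encoder, pipe.tokenizer
noise_sched = DDPMScheduler.from_config(pipe.scheduler.config)

vae.requires_grad_(False)
text_encoder.requires_grad_(False)
opt = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# self_refs: 3-5 preprocessed self-reference images, shape (N, 3, 512, 512) in [-1, 1].
# prompts:   one caption per image, e.g. "a photo of sks person at age 25" (hypothetical wording).
def finetune_step(self_refs, prompts):
    with torch.no_grad():
        latents = vae.encode(self_refs).latent_dist.sample() * vae.config.scaling_factor
        ids = tok(prompts, padding="max_length", truncation=True,
                  max_length=tok.model_max_length, return_tensors="pt").input_ids
        text_emb = text_encoder(ids)[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_sched.config.num_train_timesteps, (latents.size(0),))
    noisy = noise_sched.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = torch.nn.functional.mse_loss(pred, noise)   # standard noise-prediction objective
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```

Running a few hundred such steps over the small self-reference set is what lets the model internalise the person's individual aging characteristics before age editing is performed with the target-age prompt.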
The experiments conducted by the authors demonstrate that their proposed method outperforms existing methods both quantitatively and qualitatively. The personalized age transformation achieved by the diffusion model is superior in terms of preserving individual-specific appearances and maintaining identity.
This research has significant implications for domains such as entertainment, forensics, and the cosmetics industry. The ability to accurately and realistically age-transform facial images can be used in applications such as creating age-progressed images of missing persons or simulating the effects of aging for entertainment purposes.
The availability of the code and pretrained models further enhances the practicality of this research. By making these resources accessible to the public, researchers and developers can easily implement and build upon the proposed method.
In conclusion, the authors’ personalized age transformation method using a diffusion model and self-reference images is a significant advancement in the field. This approach not only achieves superior performance in age editing and identity preservation but also opens up new possibilities for personalized image transformation.
Read the original article
by jsendak | Feb 11, 2025 | Computer Science
arXiv:2502.05695v1 Announce Type: new
Abstract: This paper proposes a novel framework for real-time adaptive-bitrate video streaming by integrating latent diffusion models (LDMs) within the FFmpeg techniques. This solution addresses the challenges of high bandwidth usage, storage inefficiencies, and quality of experience (QoE) degradation associated with traditional constant bitrate streaming (CBS) and adaptive bitrate streaming (ABS). The proposed approach leverages LDMs to compress I-frames into a latent space, offering significant storage and semantic transmission savings without sacrificing high visual quality. While it keeps B-frames and P-frames as adjustment metadata to ensure efficient video reconstruction at the user side, the proposed framework is complemented with the most state-of-the-art denoising and video frame interpolation (VFI) techniques. These techniques mitigate semantic ambiguity and restore temporal coherence between frames, even in noisy wireless communication environments. Experimental results demonstrate the proposed method achieves high-quality video streaming with optimized bandwidth usage, outperforming state-of-the-art solutions in terms of QoE and resource efficiency. This work opens new possibilities for scalable real-time video streaming in 5G and future post-5G networks.
New Framework for Real-Time Adaptive-Bitrate Video Streaming: A Multi-disciplinary Approach
Video streaming has become an integral part of our daily lives, and the demand for high-quality video content is increasing exponentially. However, traditional streaming methods face challenges such as high bandwidth usage, storage inefficiencies, and degradation of quality of experience (QoE). In this paper, a novel framework is proposed to address these challenges by integrating latent diffusion models (LDMs) within the FFmpeg techniques.
One of the key contributions of this framework is the use of LDMs to compress I-frames into a latent space. By leveraging latent diffusion models, significant storage and semantic transmission savings can be achieved without sacrificing visual quality. This is crucial in modern multimedia information systems, where efficient storage and transmission are vital.
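As a rough illustration of the storage savings, the snippet below encodes a single I-frame into the latent space of a pretrained latent-diffusion autoencoder and decodes it back. The checkpoint and frame size are assumptions; the paper's full pipeline additionally keeps B-frames and P-frames as adjustment metadata within FFmpeg and applies denoising and frame interpolation at the receiver, none of which is shown here.

```python
import torch
from diffusers import AutoencoderKL

# A pretrained latent-diffusion autoencoder; this specific checkpoint is an
# assumption, not the one used in the paper.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def compress_iframe(frame):
    """Encode an I-frame (1, 3, H, W) in [-1, 1] into a compact latent
    (8x spatial downsampling, 4 channels)."""
    return vae.encode(frame).latent_dist.mode() * vae.config.scaling_factor

@torch.no_grad()
def restore_iframe(latent):
    """Reconstruct the I-frame at the receiver from its transmitted latent."""
    return vae.decode(latent / vae.config.scaling_factor).sample

frame = torch.rand(1, 3, 512, 512) * 2 - 1
z = compress_iframe(frame)              # shape (1, 4, 64, 64)
recon = restore_iframe(z)
print(frame.numel(), "->", z.numel())   # 786432 -> 16384 latent values
```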
Furthermore, the proposed framework considers the multi-disciplinary nature of video streaming by incorporating state-of-the-art denoising and video frame interpolation (VFI) techniques. These techniques help mitigate semantic ambiguity and restore temporal coherence between frames, even in noisy wireless communication environments. By addressing temporal coherence, the framework ensures a smooth and seamless video streaming experience.
From a wider perspective, this research aligns with the field of Artificial Reality, Augmented Reality, and Virtual Realities. The integration of LDMs, denoising, and VFI techniques in video streaming has potential applications in these fields. For example, in augmented reality, the reduction of semantic ambiguity can enhance the accuracy and realism of virtual objects overlaid onto the real world.
This novel framework also has implications for 5G and future post-5G networks. As video streaming becomes more prevalent with the advent of faster network technologies, resource efficiency becomes crucial. The proposed method not only achieves high-quality video streaming but also optimizes bandwidth usage, making it well-suited for scalable real-time video streaming in these networks.
In conclusion, this paper introduces a groundbreaking framework for real-time adaptive-bitrate video streaming. By leveraging latent diffusion models, denoising, and video frame interpolation techniques, this framework tackles the challenges of traditional streaming methods and opens up new possibilities for multimedia information systems, artificial reality, augmented reality, and virtual realities. As technology continues to evolve, this research paves the way for more efficient and immersive video streaming experiences.
Read the original article
by jsendak | Dec 10, 2024 | Computer Science
arXiv:2412.05694v1 Announce Type: new
Abstract: This study presents a novel method for generating music visualisers using diffusion models, combining audio input with user-selected artwork. The process involves two main stages: image generation and video creation. First, music captioning and genre classification are performed, followed by the retrieval of artistic style descriptions. A diffusion model then generates images based on the user’s input image and the derived artistic style descriptions. The video generation stage utilises the same diffusion model to interpolate frames, controlled by audio energy vectors derived from key musical features of harmonics and percussives. The method demonstrates promising results across various genres, and a new metric, Audio-Visual Synchrony (AVS), is introduced to quantitatively evaluate the synchronisation between visual and audio elements. Comparative analysis shows significantly higher AVS values for videos generated using the proposed method with audio energy vectors, compared to linear interpolation. This approach has potential applications in diverse fields, including independent music video creation, film production, live music events, and enhancing audio-visual experiences in public spaces.
Music Visualizers: Blending Art and Technology
Music visualizers have long been used to enhance the auditory experience by adding a visual component to sound. This study presents a unique and innovative method for generating music visualizers using diffusion models, combining audio input with user-selected artwork. The multi-disciplinary nature of this concept lies in its integration of music analysis, art interpretation, and video generation techniques.
Image Generation and Artistic Style Descriptions
In the first stage of the process, music captioning and genre classification algorithms are employed to analyze the audio input. Based on this analysis, artistic style descriptions are retrieved and combined with the user’s input image. Separately, the key musical features of harmonics and percussives are analysed to derive the audio energy vectors used later in the video stage.
The diffusion model plays a central role in generating the images based on the user’s input and the artistic style descriptions. This technique allows for the creation of unique and visually stunning visuals that are in harmony with the music. The blending of audio and visual elements in this stage showcases the potential of this method to create immersive experiences.
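A hedged sketch of such image-conditioned generation, using an off-the-shelf image-to-image diffusion pipeline from the diffusers library rather than the authors' model, might look like the following; the checkpoint, prompt template, and strength value are assumptions.

```python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Off-the-shelf image-to-image pipeline; the checkpoint is an assumption.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

user_image = Image.open("artwork.png").convert("RGB").resize((512, 512))  # user-selected artwork (placeholder path)
caption = "an energetic electronic track with driving percussion"         # hypothetical music caption
style = "vivid neon synthwave poster art"                                 # hypothetical retrieved style description

# Condition generation on both the user's artwork and the derived style description.
keyframe = pipe(
    prompt=f"{caption}, {style}",
    image=user_image,
    strength=0.6,            # how far to move away from the input artwork (assumed value)
    guidance_scale=7.5,
).images[0]
keyframe.save("keyframe_0.png")
```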
Video Creation and Audio-Visual Synchrony
Once the images are generated, the same diffusion model is employed to interpolate frames and create a video. However, what sets this method apart is the use of audio energy vectors derived from the key musical features. These vectors control the interpolation, ensuring that the visual elements synchronize with the changes in audio energy.
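One plausible way to derive such audio energy vectors, sketched below, is to separate the track into harmonic and percussive components and measure their per-frame energy; the cumulative energy then gives a nonlinear interpolation schedule between generated keyframes. The feature choices (librosa's HPSS and RMS energy) and the normalisation are assumptions, not the authors' exact definition.

```python
import numpy as np
import librosa

def audio_energy_vectors(path, fps=24):
    """Per-video-frame energy of harmonic and percussive components
    (a plausible reading of 'audio energy vectors'; exact definition assumed)."""
    y, sr = librosa.load(path, sr=None)
    harmonic, percussive = librosa.effects.hpss(y)            # harmonic/percussive separation
    hop = int(sr / fps)                                       # one analysis window per video frame
    h_energy = librosa.feature.rms(y=harmonic, hop_length=hop)[0]
    p_energy = librosa.feature.rms(y=percussive, hop_length=hop)[0]
    energy = h_energy + p_energy
    return energy / (energy.max() + 1e-8)                     # normalise to [0, 1]

def energy_weighted_steps(energy, start, end):
    """Map cumulative audio energy to interpolation positions between two
    generated keyframe latents, so visual change tracks musical change."""
    t = np.cumsum(energy) / energy.sum()                      # monotone schedule in [0, 1]
    return [(1 - ti) * start + ti * end for ti in t]
```

Under this reading, frames advance quickly through latent space during energetic passages and slowly during quiet ones, which is what distinguishes the method from plain linear interpolation.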
The introduction of a new metric, Audio-Visual Synchrony (AVS), allows for a quantitative evaluation of the synchronisation between visual and audio elements. Comparative analysis has shown significantly higher AVS values for videos generated using the proposed method with audio energy vectors compared to linear interpolation. This indicates the effectiveness of this method in creating visually appealing and synchronized music visualizers.
Applications and Future Developments
The potential applications of this method are vast and span across various fields. Independent music video creators can use this technique to generate captivating visuals that complement their music. Film producers can incorporate this method in their productions to create unique and engaging visual experiences. Live music events can leverage this technology to enhance the audio-visual spectacle for the audience. Furthermore, this method can be applied in public spaces to create interactive and immersive audio-visual displays.
In relation to the wider field of multimedia information systems, animations, artificial reality, augmented reality, and virtual realities, this study showcases the potential for integration of audio and visual elements in new and innovative ways. It highlights the important role that technology, such as diffusion models, can play in enhancing multimedia experiences. By bridging the gap between art and technology, this method paves the way for future developments in the field of music visualization and beyond.
Read the original article
by jsendak | Dec 9, 2024 | Computer Science
arXiv:2412.04746v1 Announce Type: cross
Abstract: Modern music retrieval systems often rely on fixed representations of user preferences, limiting their ability to capture users’ diverse and uncertain retrieval needs. To address this limitation, we introduce Diff4Steer, a novel generative retrieval framework that employs lightweight diffusion models to synthesize diverse seed embeddings from user queries that represent potential directions for music exploration. Unlike deterministic methods that map user query to a single point in embedding space, Diff4Steer provides a statistical prior on the target modality (audio) for retrieval, effectively capturing the uncertainty and multi-faceted nature of user preferences. Furthermore, Diff4Steer can be steered by image or text inputs, enabling more flexible and controllable music discovery combined with nearest neighbor search. Our framework outperforms deterministic regression methods and LLM-based generative retrieval baseline in terms of retrieval and ranking metrics, demonstrating its effectiveness in capturing user preferences, leading to more diverse and relevant recommendations. Listening examples are available at tinyurl.com/diff4steer.
Diff4Steer: A Novel Generative Retrieval Framework for Music Exploration
Modern music retrieval systems often struggle to capture the diverse and uncertain retrieval needs of users. This limitation is due to their reliance on fixed representations of user preferences. To overcome this challenge, a team of researchers has introduced Diff4Steer, a highly innovative generative retrieval framework that aims to synthesize diverse seed embeddings from user queries, representing potential directions for music exploration.
Unlike deterministic methods that map user queries to a single point in embedding space, Diff4Steer employs lightweight diffusion models to provide a statistical prior on the target modality, which in this case is audio. This approach effectively captures the uncertainty and multi-faceted nature of user preferences, allowing for a more nuanced understanding of their musical tastes.
One of the standout features of Diff4Steer is its ability to be steered by image or text inputs, in addition to traditional audio queries. This unique functionality enables a more flexible and controllable music discovery experience, combined with advanced nearest neighbor search techniques. By incorporating different modalities, the framework allows users to explore music based on visual cues or textual descriptions, bridging the gap between different sensory experiences.
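The retrieval loop can be pictured as sampling several seed embeddings per query and pooling their nearest neighbours in the catalogue. In the sketch below the diffusion sampler is replaced by a hypothetical stand-in that simply perturbs the query embedding; the real system would instead denoise random latents conditioned on the audio, text, or image query. All names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# catalog_emb: precomputed audio embeddings for the music catalogue, shape (num_tracks, d).
rng = np.random.default_rng(0)
d, num_tracks, K = 128, 10_000, 8
catalog_emb = rng.normal(size=(num_tracks, d)).astype(np.float32)

def sample_seeds(query_emb, k=K, noise=0.3):
    """Hypothetical stand-in for the diffusion sampler: jitter the query
    embedding to mimic K diverse seed embeddings."""
    return query_emb + noise * rng.normal(size=(k, d)).astype(np.float32)

def retrieve(query_emb, n_per_seed=5):
    seeds = sample_seeds(query_emb)                            # K directions for exploration
    nn = NearestNeighbors(n_neighbors=n_per_seed, metric="cosine").fit(catalog_emb)
    _, idx = nn.kneighbors(seeds)                              # nearest tracks to each seed
    return np.unique(idx.ravel())                              # pooled, de-duplicated candidates

query = rng.normal(size=(d,)).astype(np.float32)
print(retrieve(query)[:10])
```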
The use of diffusion models in Diff4Steer holds promise for the wider field of multimedia information systems. The concept of using statistical priors to capture uncertainty and leverage diverse data sources is not only relevant to music retrieval but can also be applied to other domains where unstructured multimedia data is prevalent. By expanding the scope of this framework beyond music, researchers and practitioners can explore its potential in analyzing and retrieving multimedia content such as images, videos, and text.
Furthermore, Diff4Steer’s integration of artificial reality, augmented reality, and virtual realities can enhance the music exploration experience. By incorporating these technologies, users can visualize and interact with music in immersive environments, adding a new layer of engagement and sensory stimulation. This multidisciplinary approach opens up avenues for cross-pollination between the fields of multimedia information systems and virtual reality, leading to the development of more immersive and interactive music retrieval systems.
In terms of performance, Diff4Steer demonstrates its effectiveness in capturing user preferences and generating more diverse and relevant recommendations. It outperforms deterministic regression methods and an LLM-based generative retrieval baseline, showcasing the strength of its statistical approach. By providing a wider range of music options to users, Diff4Steer has the potential to enhance music discovery and foster a deeper connection between listeners and their preferred genres.
In conclusion, Diff4Steer offers a groundbreaking solution to the limitations of traditional music retrieval systems. By incorporating lightweight diffusion models and the ability to be steered by different modalities, it provides a more comprehensive understanding of user preferences and enables a more flexible and controllable music exploration experience. Its implications extend beyond the field of music, opening up new possibilities in multimedia information systems, artificial reality, augmented reality, and virtual realities.
Read the original article