There has been a growing interest in the task of generating sound for silent
videos, primarily because of its practicality in streamlining video
post-production. However, existing methods for video-sound generation attempt
to directly create sound from visual representations, which can be challenging
due to the difficulty of aligning visual representations with audio
representations. In this paper, we present SonicVisionLM, a novel framework
aimed at generating a wide range of sound effects by leveraging vision language
models. Instead of generating audio directly from video, we use the
capabilities of powerful vision language models (VLMs). When provided with a
silent video, our approach first identifies events within the video using a VLM
to suggest possible sounds that match the video content. This shift in approach
transforms the challenging task of aligning image and audio into more
well-studied sub-problems of aligning image-to-text and text-to-audio through
the popular diffusion models. To improve the quality of audio recommendations
with LLMs, we have collected an extensive dataset that maps text descriptions
to specific sound effects and developed temporally controlled audio adapters.
Our approach surpasses current state-of-the-art methods for converting video to
audio, resulting in enhanced synchronization with the visuals and improved
alignment between audio and video components. Project page:
https://yusiissy.github.io/SonicVisionLM.github.io/

Analysis: SonicVisionLM – Generating Sound for Silent Videos

Generating sound for silent videos has gained significant interest in recent years due to its practicality in streamlining video post-production. However, existing methods face challenges in aligning visual representations with audio representations. In this paper, the authors propose SonicVisionLM, a novel framework that leverages vision language models (VLMs) to generate a wide range of sound effects.

The adoption of VLMs in SonicVisionLM reflects a multi-disciplinary approach that combines computer vision and natural language processing. By using a VLM, the framework identifies events within a silent video and suggests sounds that match the visual content. This shift simplifies the hard problem of aligning image and audio directly, decomposing it into two better-studied sub-problems: image-to-text and text-to-audio alignment.
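To make the two-stage idea concrete, here is a minimal sketch that uses off-the-shelf stand-ins rather than the authors' released components: a BLIP captioning model plays the role of the VLM that describes on-screen events, and AudioLDM (via diffusers) plays the role of the text-to-audio diffusion model. The model choices, frame-sampling rate, and helper functions are illustrative assumptions, not SonicVisionLM's actual pipeline.

```python
# Minimal two-stage sketch: video -> event descriptions (VLM stand-in) -> sound (text-to-audio stand-in).
# BLIP and AudioLDM are illustrative substitutes, not SonicVisionLM's actual components.
import cv2                                  # frame extraction
import soundfile as sf                      # writing the generated waveform
from PIL import Image
from transformers import pipeline           # image captioning as a VLM stand-in
from diffusers import AudioLDMPipeline      # text-to-audio diffusion stand-in


def describe_keyframes(video_path: str, every_n_frames: int = 30) -> list[str]:
    """Caption sampled frames as a rough proxy for VLM-based event detection."""
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    cap = cv2.VideoCapture(video_path)
    captions, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            captions.append(captioner(rgb)[0]["generated_text"])
        idx += 1
    cap.release()
    return captions


def sound_effect_for(description: str, out_path: str = "sfx.wav") -> None:
    """Turn one suggested event description into a short sound effect."""
    tta = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
    audio = tta(prompt=description, num_inference_steps=50, audio_length_in_s=5.0).audios[0]
    sf.write(out_path, audio, samplerate=16000)  # AudioLDM generates 16 kHz audio


if __name__ == "__main__":
    for i, event in enumerate(describe_keyframes("silent_clip.mp4")):
        sound_effect_for(event, out_path=f"sfx_{i}.wav")
```

A naive chain like this produces plausible effects but says nothing about when they should occur, which is where the temporally controlled adapters discussed next come in.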

These sub-problems are handled with diffusion models, which have become the dominant approach for generative tasks such as text-to-audio synthesis; a text-to-audio diffusion model converts the suggested descriptions into concrete sound effects. To improve the quality of those suggestions, the authors collected an extensive dataset mapping text descriptions to specific sound effects, and they developed temporally controlled audio adapters so that generated effects start and stop in step with on-screen events. Together, these components tighten the synchronization between the audio and video components.
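The adapter mechanism is only described at a high level in the abstract, so the snippet below is a conceptual sketch of one way timestamp conditioning can work, not the paper's architecture: a per-frame on/off mask marking when an effect should sound is projected into the conditioning space and would then be fused with the text embedding inside the diffusion backbone. The class name, dimensions, and fusion strategy are assumptions for illustration.

```python
# Conceptual sketch of timestamp-style conditioning for a text-to-audio diffusion
# model; TimestampAdapter and its dimensions are hypothetical, not the paper's code.
import torch
import torch.nn as nn


class TimestampAdapter(nn.Module):
    """Maps a per-frame event mask to per-frame conditioning embeddings."""

    def __init__(self, cond_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.frame_proj = nn.Sequential(
            nn.Linear(1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, cond_dim),
        )

    def forward(self, event_mask: torch.Tensor) -> torch.Tensor:
        # event_mask: (batch, num_frames) with 1 where the sound effect is active.
        return self.frame_proj(event_mask.unsqueeze(-1))  # (batch, num_frames, cond_dim)


# Example: an effect that should sound between frames 20 and 45 of a 100-frame clip.
mask = torch.zeros(1, 100)
mask[:, 20:45] = 1.0
temporal_cond = TimestampAdapter()(mask)   # shape (1, 100, 768)
# In a full system this tensor would be fused with the text conditioning
# (e.g., via cross-attention) so that the generated audio energy follows the mask.
```

The key design point is that timing is supplied as an explicit conditioning signal rather than inferred implicitly from pixels, which is what allows the generated effect to land on the visual event.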

With the proposed SonicVisionLM framework, the authors report results that surpass current state-of-the-art video-to-audio methods, with tighter synchronization and better alignment between the audio and the visuals. By combining VLMs and diffusion models, the framework shows how advances in vision, language, and audio generation can reinforce one another, and it points toward applications in video post-production, animation, and augmented and virtual reality.

For more details and access to the project page, please visit: https://yusiissy.github.io/SonicVisionLM.github.io/
