Qubixity.net

“AutoCap: Improving Ambient Sound Generation with High-Quality Audio Captions”

by jsendak | Jun 28, 2024 | Computer Science

arXiv:2406.19388v1
Abstract: Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.

Improving Ambient Sound Generation with New Models

Ambient sound generation is difficult because training data is scarce and the captions that accompany existing audio are often of poor quality. The paper addresses both problems with two new models. The first, AutoCap, is an automatic audio captioning model that uses metadata available alongside the audio to improve caption quality. AutoCap reaches a CIDEr score of 83.2, a 3.2% improvement over the best available captioning model, while running inference four times faster.
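
The abstract does not say whether the 3.2% CIDEr gain is relative or measured in absolute points. The short sketch below works out the baseline score implied by each reading. The function names are illustrative, and neither baseline is a figure reported in the paper.

```python
# Two possible readings of "a 3.2% improvement" for a CIDEr score of 83.2.
# Which reading the authors intend is an assumption, not stated in the abstract.


def baseline_if_relative(score: float, gain_percent: float) -> float:
    """Baseline implied when the gain is a percentage of the baseline."""
    return score / (1.0 + gain_percent / 100.0)


def baseline_if_absolute(score: float, gain_points: float) -> float:
    """Baseline implied when the gain is a difference in score points."""
    return score - gain_points


autocap_cider = 83.2
relative_baseline = baseline_if_relative(autocap_cider, 3.2)
absolute_baseline = baseline_if_absolute(autocap_cider, 3.2)

print(f"Implied baseline, relative reading: {relative_baseline:.2f}")
print(f"Implied baseline, absolute reading: {absolute_baseline:.2f}")
```

The two readings differ by well under one CIDEr point, so the headline conclusion is the same either way.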

The authors then use AutoCap to caption clips from existing datasets, producing 761,000 audio clips with high-quality captions. They describe the result as the largest available audio-text dataset, and it serves as the training set for their generation model.

The second model is GenAu, a scalable transformer-based audio generation architecture that the authors scale up to 1.25B parameters and train on the new dataset. Compared with state-of-the-art audio generators, GenAu improves FAD score by 15.7%, IS by 22.7%, and CLAP score by 13.5%, indicating higher-quality generated audio than previous works.
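
For readers unfamiliar with these metrics, here is a minimal sketch of the quantities behind two of them. FAD (Fréchet Audio Distance) is a Fréchet distance between Gaussians fitted to embeddings of real and generated audio, where lower is better. A CLAP score is commonly computed as the cosine similarity between an audio embedding and a text embedding, where higher is better. To stay dependency-free, the sketch assumes diagonal covariances, whereas the full metric uses complete covariance matrices and a matrix square root. The embeddings are toy values, and this is not the paper's evaluation code.

```python
import math
from typing import List, Sequence, Tuple


def mean_and_variance(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) variance of a set of embeddings."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    variances = [sum((r[d] - means[d]) ** 2 for r in rows) / n for d in range(dim)]
    return means, variances


def frechet_distance_diagonal(
    mu_a: Sequence[float],
    var_a: Sequence[float],
    mu_b: Sequence[float],
    var_b: Sequence[float],
) -> float:
    """Squared Frechet distance between two Gaussians with diagonal covariance.

    Simplification: with diagonal covariances the trace term reduces to a
    per-dimension sum, so no matrix square root is needed.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b))
    return mean_term + cov_term


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, the quantity behind CLAP-style audio-text scores."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy embeddings standing in for real and generated audio features.
real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
generated = [[0.5, 1.0], [2.5, 3.5], [4.5, 5.5]]

mu_r, var_r = mean_and_variance(real)
mu_g, var_g = mean_and_variance(generated)
print("Frechet distance:", frechet_distance_diagonal(mu_r, var_r, mu_g, var_g))
print("Cosine similarity:", cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

A distance of zero means the two embedding distributions match under this Gaussian approximation, so the reported 15.7% FAD improvement corresponds to a lower distance.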

These results suggest that the quality of training data is often as important as its quantity. Because AutoCap is fully automatic, new audio samples can be captioned and added to the training dataset without manual annotation, which opens the way to training even larger generative models for audio synthesis.

The work draws on audio processing, natural language processing, and generative modeling. AutoCap combines metadata with the audio modality to produce text, while GenAu applies a transformer-based architecture to audio generation.
