arXiv:2409.08489v1
Abstract: Systems that automatically generate text captions for audio, images and video lack a confidence indicator of the relevance and correctness of the generated sequences. To address this, we build on existing methods of confidence measurement for text by introducing selective pooling of token probabilities, which aligns better with traditional correctness measures than conventional pooling does. Further, we propose directly measuring the similarity between input audio and text in a shared embedding space. To measure self-consistency, we adapt semantic entropy for audio captioning, and find that these two methods align even better than pooling-based metrics with the correctness measure that calculates acoustic similarity between captions. Finally, we explain why temperature scaling of confidences improves calibration.

Improving Confidence Measurement in Automatic Caption Generation

Automatic caption generation systems play a crucial role in multimedia information systems, as they enable better accessibility and understanding of audio, images, and videos. However, a key challenge in these systems is the lack of a confidence indicator for the generated captions, making it difficult to assess their relevance and correctness.

In this research, the authors propose novel techniques to address this challenge. They first introduce selective pooling of token probabilities as a confidence measure. Rather than averaging the probabilities of every token in the caption, selective pooling aggregates only a chosen subset, so that a few uncertain tokens are not washed out by the many easy, high-probability ones. This aligns better with traditional correctness measures than conventional pooling does, yielding a more reliable sequence-level confidence score (a sketch of the idea follows).
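To make the contrast concrete, here is a minimal sketch, not the authors' exact formulation: it compares conventional mean pooling with a selective variant that averages only the k least-confident tokens. The cutoff k and the example log-probabilities are illustrative assumptions.

```python
import numpy as np

def mean_pool_confidence(token_logprobs):
    """Conventional pooling: average log-probability over all tokens."""
    return float(np.mean(token_logprobs))

def selective_pool_confidence(token_logprobs, k=2):
    """Selective pooling (illustrative): average only the k least-confident
    tokens, so a few low-probability tokens dominate the score instead of
    being diluted by many easy, high-probability tokens."""
    lp = np.sort(np.asarray(token_logprobs))  # ascending: least confident first
    return float(np.mean(lp[:k]))

# Example: one uncertain token among otherwise confident tokens.
logprobs = [-0.05, -0.10, -2.80, -0.07, -0.12, -0.04]
print(mean_pool_confidence(logprobs))            # ~ -0.53, doubt mostly hidden
print(selective_pool_confidence(logprobs, k=2))  # ~ -1.46, doubt is flagged
```

In the example, a single low-probability token barely moves the mean-pooled score, while the selective score drops sharply, which is closer to how a human would judge a caption with one dubious word.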

Additionally, the authors propose directly measuring the similarity between the input audio and the generated text in a shared embedding space. This allows a direct comparison between the audio and its textual description, enabling a more accurate assessment of the caption's relevance. The approach is multi-disciplinary by nature, combining techniques from natural language processing, audio processing, and machine learning into a single measure of similarity (illustrated in the sketch below).
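A minimal sketch of this idea follows, assuming access to a pretrained audio-text encoder such as a CLAP-style model; the audio_emb and caption_emb vectors stand in for that encoder's outputs and are toy values, not a specific library API.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def audio_text_confidence(audio_embedding, caption_embedding):
    """Map cosine similarity in [-1, 1] to a confidence score in [0, 1]."""
    return 0.5 * (1.0 + cosine_similarity(audio_embedding, caption_embedding))

# Toy vectors; real embeddings would come from a shared audio-text encoder
# (e.g., a CLAP-style model) applied to the input audio and the caption.
audio_emb = [0.2, 0.9, 0.1]
caption_emb = [0.25, 0.85, 0.05]
print(audio_text_confidence(audio_emb, caption_emb))  # ~0.998, close match
```

The key property is that both modalities land in the same vector space, so a single cosine similarity can serve as a reference-free relevance score for the caption.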

To measure the self-consistency of generated captions, the authors adapt semantic entropy to audio captioning. Semantic entropy quantifies uncertainty by sampling several candidate captions and measuring how much their meanings diverge: if the samples cluster around one meaning, entropy is low and confidence is high. Notably, the authors find that semantic entropy and the audio-text embedding similarity align even better than pooling-based metrics with a correctness measure based on the acoustic similarity between captions (see the sketch after this paragraph).
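The sketch below illustrates the general recipe, assuming a same_meaning predicate that decides semantic equivalence between two captions; for audio captioning this predicate could compare captions in an acoustic embedding space. Both the greedy clustering and the toy predicate are illustrative assumptions, not the paper's exact procedure.

```python
import math

def semantic_entropy(captions, probs, same_meaning):
    """Semantic entropy over sampled captions (illustrative sketch).

    captions: list of sampled caption strings
    probs: their sequence probabilities (normalized inside)
    same_meaning: callable(a, b) -> bool deciding semantic equivalence
    """
    clusters = []  # each cluster is a list of caption indices
    for i, cap in enumerate(captions):
        for cluster in clusters:
            if same_meaning(cap, captions[cluster[0]]):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    total = sum(probs)
    entropy = 0.0
    for cluster in clusters:
        p = sum(probs[i] for i in cluster) / total
        entropy -= p * math.log(p)
    return entropy  # low entropy = samples agree in meaning = high confidence

# Toy usage: two of three samples share a meaning under a crude predicate.
caps = ["a dog barks", "a dog is barking", "rain falls on a roof"]
ps = [0.5, 0.3, 0.2]
print(semantic_entropy(caps, ps, lambda a, b: ("dog" in a) == ("dog" in b)))
# ~0.50: mass 0.8 on the "dog" cluster, 0.2 on the "rain" cluster
```

Entropy is computed over meaning clusters rather than surface strings, so paraphrases of the same caption do not spuriously inflate the uncertainty estimate.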

Finally, the authors explain why temperature scaling of confidences improves calibration. Temperature scaling divides a model's logits by a single scalar temperature fitted on held-out data, softening or sharpening the predicted probability distribution so that reported confidences better match empirical accuracy.
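A standard sketch of post-hoc temperature scaling follows: a scalar T is fitted on held-out logits and labels by minimizing negative log-likelihood. The optimizer bounds are illustrative, and the setup is the generic recipe rather than the paper's specific experiments.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(T, logits, labels):
    """Negative log-likelihood of labels under temperature-scaled softmax."""
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels):
    """Fit a single scalar T > 0 on held-out data.
    T > 1 softens overconfident distributions; T < 1 sharpens them."""
    result = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels),
                             method="bounded")
    return result.x

# Usage: T = fit_temperature(val_logits, val_labels)
#        calibrated_probs = softmax(test_logits / T)
```

Because a single T rescales all logits monotonically, accuracy is unchanged; only the confidence values move, which is exactly the property a calibration method should have.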

Overall, this research contributes to the field of multimedia information systems by addressing the need for confidence measurement in automatic caption generation. The techniques examined (selective pooling, audio-text similarity, semantic entropy, and temperature scaling) reflect the multi-disciplinary nature of the field, integrating concepts from natural language processing, audio processing, and machine learning.

Furthermore, the findings have implications for adjacent areas such as animation, augmented reality, and virtual reality, where captions enable better understanding of and interaction with multimedia content. By improving confidence measurement, this research can enhance the overall user experience and accessibility of these immersive environments.

Read the original article