arXiv:2408.11237v1 Announce Type: new
Abstract: Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. The majority of existing OOD detection methods address uni-modal inputs, such as images or texts, and have primarily been developed with a focus on computer vision tasks; their performance on multi-modal documents remains largely unexplored. We propose a novel methodology termed attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches, reducing the false positive rate (FPR) by up to 7.5% compared to existing solutions. This methodology generalizes well to multi-modal data, such as documents, where visual and textual information are modeled under the same Transformer architecture. To address the scarcity of high-quality publicly available document datasets and encourage further research on OOD detection for documents, we introduce FinanceDocs, a new document AI dataset. Our code and dataset are publicly available.

Multi-Modal OOD Detection in Document Classification Systems

In the field of machine learning, detecting out-of-distribution (OOD) data is crucial for enhancing the reliability and safety of deployed systems. OOD detection methods have primarily focused on uni-modal inputs like images or texts, leaving a notable gap in research for multi-modal documents. In this article, a novel methodology called attention head masking (AHM) is proposed for multi-modal OOD tasks in document classification systems.
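The article does not reproduce the exact AHM procedure, so the following is only a minimal, hypothetical sketch of the general idea: zero out selected attention heads in a Transformer layer and score samples with a distance-based OOD criterion on the resulting features. All names (mask_heads, ood_score, heads_to_mask) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of attention head masking for OOD scoring (not the paper's exact algorithm).
import torch


def mask_heads(attn_weights: torch.Tensor, heads_to_mask: list) -> torch.Tensor:
    """Zero out the attention weights of selected heads.

    attn_weights: (batch, num_heads, seq_len, seq_len)
    heads_to_mask: indices of heads whose attention is suppressed
    """
    masked = attn_weights.clone()
    masked[:, heads_to_mask, :, :] = 0.0
    return masked


def ood_score(features: torch.Tensor, id_mean: torch.Tensor, id_cov_inv: torch.Tensor) -> torch.Tensor:
    """Mahalanobis distance of pooled document features to an in-distribution mean.

    features: (batch, dim) pooled representations from the (masked) encoder
    """
    diff = features - id_mean
    return torch.einsum("bi,ij,bj->b", diff, id_cov_inv, diff)
```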

The authors demonstrate the effectiveness of AHM by comparing it to existing state-of-the-art approaches. Their empirical results show that AHM outperforms all existing methods and reduces the false positive rate (FPR) by up to 7.5%. This indicates that AHM is better at accurately identifying OOD data in the context of multi-modal documents.
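For context, FPR in OOD evaluation is usually reported at a fixed true positive rate on in-distribution data (commonly FPR@95). The snippet below is a small reference implementation of that metric with synthetic scores; it is illustrative only and does not reproduce the paper's numbers.

```python
# Minimal FPR@TPR computation, the standard OOD detection metric (synthetic data).
import numpy as np


def fpr_at_tpr(id_scores: np.ndarray, ood_scores: np.ndarray, tpr: float = 0.95) -> float:
    """FPR on OOD data at the threshold that accepts `tpr` of in-distribution data.

    Convention: higher score = more in-distribution.
    """
    threshold = np.quantile(id_scores, 1.0 - tpr)    # keep `tpr` of ID samples above it
    return float(np.mean(ood_scores >= threshold))   # OOD samples wrongly accepted as ID


rng = np.random.default_rng(0)
id_s = rng.normal(1.0, 1.0, 5000)    # toy in-distribution scores
ood_s = rng.normal(-1.0, 1.0, 5000)  # toy out-of-distribution scores
print(f"FPR@95: {fpr_at_tpr(id_s, ood_s):.3f}")
```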

One particularly noteworthy aspect of this methodology is its multi-modal nature. Because visual and textual information are modeled under a single Transformer architecture, the method can form a comprehensive representation of document content, which in turn improves the accuracy of OOD detection.
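To make the "single Transformer over both modalities" idea concrete, here is a minimal toy encoder in which text token embeddings and projected image patch features are concatenated into one sequence and processed by a shared encoder. This is in the spirit of document models such as LayoutLMv3, but it is a simplified sketch, not the authors' architecture; all dimensions and module names are assumptions.

```python
# Toy joint text+image encoder illustrating one-Transformer multi-modal modeling.
import torch
import torch.nn as nn


class TinyMultiModalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=768, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)      # word tokens -> model dim
        self.patch_proj = nn.Linear(patch_dim, d_model)          # image patch features -> model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)    # shared encoder over both modalities

    def forward(self, token_ids, patch_feats):
        # token_ids: (B, T) integer tokens; patch_feats: (B, P, patch_dim) image patches
        seq = torch.cat([self.text_embed(token_ids), self.patch_proj(patch_feats)], dim=1)
        return self.encoder(seq)                                  # joint text+image representation


enc = TinyMultiModalEncoder()
out = enc(torch.randint(0, 30522, (2, 16)), torch.randn(2, 49, 768))
print(out.shape)  # torch.Size([2, 65, 256])
```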

To support further research in this domain and address the scarcity of high-quality publicly available document datasets, the authors introduce FinanceDocs, a new document AI dataset. The availability of this dataset, along with the publicly accessible code, encourages researchers to delve deeper into OOD detection tailored specifically to multi-modal documents.

This study highlights the importance of considering multi-modal data in OOD detection and contributes to the growing field of document classification systems. The proposed methodology, AHM, sets a benchmark for accurately identifying OOD data within multi-modal documents by leveraging the power of Transformer architectures. As the prevalence of multi-modal data continues to increase, further research and development in this area are expected to yield even more sophisticated and effective OOD detection methods for diverse applications.

Read the original article