Following the success of Large Language Models (LLMs), Large Multimodal
Models (LMMs), such as the Flamingo model and its subsequent competitors, have
started to emerge as natural steps towards generalist agents. However,
interacting with recent LMMs reveals major limitations that are hardly captured
by the current evaluation benchmarks. Indeed, task performance (e.g., VQA
accuracy) alone does not provide enough clues to understand their real
capabilities and limitations, or to what extent such models are aligned with human
expectations. To refine our understanding of those flaws, we deviate from the
current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from
3B up to 80B parameters, on 5 different axes: hallucinations, abstention,
compositionality, explainability, and instruction following. Our evaluation on
these axes reveals major flaws in LMMs. While the current go-to solution to
align these models is based on training, such as instruction tuning or RLHF, we
rather (2) explore the training-free in-context learning (ICL) as a solution,
and study how it affects these limitations. Based on our ICL study, (3) we push
ICL further and propose new multimodal ICL variants such as Multitask-ICL,
Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows.
(1) Despite their success, LMMs have flaws that remain unsolved with scaling
alone. (2) The effect of ICL on LMM flaws is nuanced: despite its
effectiveness for improving explainability and answer abstention, ICL only slightly
improves instruction following, does not improve compositional abilities, and
can even amplify hallucinations. (3) The proposed ICL variants are
promising as post-hoc approaches to efficiently tackle some of those flaws. The
code is available here: https://github.com/mshukor/EvALign-ICL.

Exploring the Limits of Large Multimodal Models and the Role of In-Context Learning

In recent years, Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks. As a natural progression, researchers have started developing Large Multimodal Models (LMMs), such as the Flamingo model and its competitors, to explore the intersection of language and visual information. These LMMs aim to be more generalist agents by incorporating both text and image data.

However, a closer examination of these LMMs reveals that they have significant limitations that are not adequately captured by current evaluation benchmarks. Merely assessing task performance, such as Visual Question Answering (VQA) accuracy, does not provide a comprehensive understanding of their true capabilities or their alignment with human expectations.

To address these limitations, the paper's authors deviate from the current evaluation paradigm and propose a novel evaluation framework. They evaluate 10 recent open-source LMMs, ranging from 3 billion to 80 billion parameters, along five different axes: hallucinations, abstention, compositionality, explainability, and instruction following.
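To make one of these axes concrete, the sketch below shows how an object-hallucination check can be scored in the spirit of CHAIR-style metrics: generated captions are scanned for object words that do not appear in the image's ground-truth annotations. The function name, model interface, and data format here are illustrative assumptions, not the authors' actual evaluation code.

```python
def hallucination_rate(model, samples, object_vocab):
    """Fraction of generated object mentions absent from the ground-truth objects."""
    hallucinated, mentioned = 0, 0
    for image, gt_objects in samples:            # gt_objects: set of annotated object names
        caption = model.generate_caption(image)  # hypothetical LMM captioning call
        for obj in object_vocab:                 # object_vocab: object names to scan for
            if obj in caption.lower():
                mentioned += 1
                if obj not in gt_objects:
                    hallucinated += 1
    return hallucinated / max(mentioned, 1)
```

The other axes follow the same pattern of probing model outputs beyond raw task accuracy, e.g., checking whether the model abstains on unanswerable questions or follows the instruction format it was given.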

The evaluation on these axes highlights major flaws in LMMs. It becomes evident that scaling alone is not sufficient to address these flaws. While training has been the go-to solution for aligning LMMs, the authors take a different approach by exploring training-free in-context learning (ICL) as a potential solution. They investigate how ICL affects the identified limitations and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
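To give a flavor of what these variants look like in practice, here is a minimal sketch of how few-shot multimodal prompts can be assembled, assuming a Flamingo-style interface in which images are interleaved with the text via an `<image>` placeholder (the images themselves are passed to the model alongside the prompt). The prompt wording and the Chain-of-Hindsight-style format are illustrative assumptions, not the paper's exact templates.

```python
def icl_prompt(demos, query_question):
    """Standard multimodal ICL: prepend (image, question, answer) demonstrations."""
    parts = []
    for _image, question, answer in demos:       # each demo is (image, question, answer)
        parts.append(f"<image>Question: {question} Answer: {answer}")
    parts.append(f"<image>Question: {query_question} Answer:")
    return "".join(parts)


def chain_of_hindsight_icl_prompt(demos, query_question):
    """Chain-of-Hindsight-style variant: each demonstration contrasts a good and a bad answer."""
    parts = []
    for _image, question, good, bad in demos:    # each demo is (image, question, good, bad)
        parts.append(f"<image>Question: {question} Good answer: {good} Bad answer: {bad}")
    parts.append(f"<image>Question: {query_question} Good answer:")
    return "".join(parts)
```

Under the same assumptions, Multitask-ICL would mix demonstrations from several tasks in a single prompt, and Self-Correcting-ICL would plausibly add a second round in which the model is prompted to verify or revise its initial answer.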

The findings of the study are threefold. Firstly, despite their success, LMMs still have unresolved flaws that cannot be addressed solely through scaling. Secondly, the effect of ICL on these flaws is nuanced; while it improves explainability and answer abstention, it only marginally enhances instruction following and fails to improve compositional abilities. Surprisingly, ICL even amplifies hallucinations to some extent. Lastly, the proposed ICL variants show promise as post-hoc approaches to efficiently tackle some of the identified flaws.

This research highlights the multidisciplinary nature of the concepts discussed. It sits at the intersection of natural language processing and computer vision, focusing on large multimodal models that integrate language and visual information. The study not only provides a deeper understanding of the limitations of LMMs but also explores innovative, training-free approaches to address these limitations through in-context learning.

Key Takeaways:

  • Large Multimodal Models (LMMs) have significant limitations beyond what current evaluation benchmarks capture.
  • Scaling alone is not sufficient to address the flaws in LMMs.
  • In-Context Learning (ICL) is explored as a training-free solution to tackle the limitations of LMMs.
  • ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL show promise for improving LMMs.
  • This research sits at the intersection of natural language processing and computer vision, centered on large multimodal models that integrate language and visual information.

Read the original article