Evaluating the quality and variability of text generated by Large Language
Models (LLMs) poses a significant yet unresolved research challenge.
Traditional evaluation methods, such as ROUGE and BERTScore, which measure
token-level similarity, often fail to capture holistic semantic equivalence.
This results in low correlation with human judgments and intuition, which is
especially problematic in high-stakes applications like healthcare and finance,
where reliability, safety, and robust decision-making are critical. This
work proposes DCR, an automated framework for evaluating and improving the
consistency of LLM-generated texts using a divide-conquer-reasoning approach.
Unlike existing LLM-based evaluators that operate at the paragraph level, our
method employs a divide-and-conquer evaluator (DCE) that breaks down the
paragraph-to-paragraph comparison between two generated responses into
individual sentence-to-paragraph comparisons, each evaluated based on
predefined criteria. To facilitate this approach, we introduce an automatic
metric converter (AMC) that translates the output from DCE into an
interpretable numeric score. Beyond the consistency evaluation, we further
present a reason-assisted improver (RAI) that leverages the analytical reasons
with explanations identified by DCE to generate new responses aimed at reducing
these inconsistencies. Through comprehensive and systematic empirical analysis,
we show that our approach outperforms state-of-the-art methods by a large
margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the
consistency of LLM generation across multiple benchmarks in semantic, factual,
and summarization consistency tasks. Our approach also reduces output
inconsistencies by nearly 90%, showing promise for effective hallucination
mitigation.

Expert Commentary: Evaluating and Improving Consistency in LLM-Generated Texts

The article highlights the challenges of evaluating the quality and variability of text generated by Large Language Models (LLMs). Traditional evaluation methods like ROUGE and BERTScore focus on token-level similarity and often fail to capture holistic semantic equivalence. This limitation is especially problematic in critical domains like healthcare and finance, where reliability, safety, and robust decision-making are paramount.

In response to this challenge, the authors propose DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts. The framework utilizes a divide-conquer-reasoning approach by employing a divide-and-conquer evaluator (DCE) at the sentence-to-paragraph level. By breaking down the comparison between two generated responses into individual sentence-to-paragraph evaluations, the method can capture finer-grained nuances and evaluate consistency based on predefined criteria.
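To make this divide-and-conquer step concrete, here is a minimal sketch in Python that splits a candidate paragraph into sentences and asks an LLM judge to compare each sentence against the full reference paragraph. The prompt wording, the judge callable, and the verdict format are illustrative assumptions, not the paper's exact DCE prompts or criteria.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical prompt template; the paper's actual DCE criteria and wording may differ.
DCE_PROMPT = (
    "You are checking consistency. Reference paragraph:\n{reference}\n\n"
    "Candidate sentence:\n{sentence}\n\n"
    "Is the sentence consistent with the reference? Answer 'consistent' or "
    "'inconsistent', then give a one-sentence reason."
)

def split_sentences(paragraph: str) -> List[str]:
    # Naive splitter on sentence-ending punctuation; a real system would use a tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def divide_and_conquer_evaluate(
    candidate: str,
    reference: str,
    judge: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Compare each candidate sentence against the whole reference paragraph.

    `judge` is any function that sends a prompt to an LLM and returns its text
    reply (e.g., a thin wrapper around whichever API client you use).
    Returns (sentence, raw_verdict_with_reason) pairs for downstream scoring.
    """
    results = []
    for sentence in split_sentences(candidate):
        prompt = DCE_PROMPT.format(reference=reference, sentence=sentence)
        results.append((sentence, judge(prompt)))
    return results
```

In this reading, the per-sentence verdicts and reasons are the raw material that the metric converter and the improver consume downstream.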

An important contribution of this work is the introduction of an automatic metric converter (AMC), which translates the output from DCE into interpretable numeric scores. This allows for easier interpretation and comparison of consistency scores across different evaluations. Additionally, the authors present a reason-assisted improver (RAI) that leverages analytical reasons identified by DCE to generate new responses aimed at reducing inconsistencies. This approach not only evaluates consistency but also provides insights for improving the generated text.
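One plausible way to realize these two components, under the assumption that each verdict can be mapped to +1 (consistent) or -1 (inconsistent) and averaged, is sketched below. The scoring formula, keyword parsing, and rewrite prompt are illustrative guesses, not the paper's exact AMC and RAI definitions.

```python
from typing import Callable, List, Tuple

def amc_score(verdicts: List[Tuple[str, str]]) -> float:
    """Convert per-sentence verdicts into a single score in [0, 1].

    Assumes each judge reply begins with 'consistent' or 'inconsistent';
    real parsing would need to be more robust than a prefix check.
    """
    if not verdicts:
        return 1.0
    signs = [
        -1 if reply.lower().lstrip().startswith("inconsistent") else 1
        for _, reply in verdicts
    ]
    return (sum(signs) / len(signs) + 1) / 2  # rescale mean from [-1, 1] to [0, 1]

def reason_assisted_improve(
    candidate: str,
    reference: str,
    verdicts: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> str:
    """Ask the generator to rewrite the candidate, guided by the reasons
    attached to sentences flagged as inconsistent (an RAI-style step)."""
    flagged = [
        f"- {sentence}\n  reason: {reply}"
        for sentence, reply in verdicts
        if reply.lower().lstrip().startswith("inconsistent")
    ]
    if not flagged:
        return candidate  # nothing to fix
    prompt = (
        "Rewrite the candidate paragraph so that it is consistent with the "
        f"reference.\n\nReference:\n{reference}\n\nCandidate:\n{candidate}\n\n"
        "Sentences flagged as inconsistent, with reasons:\n" + "\n".join(flagged)
    )
    return generate(prompt)
```

Scoring and rewriting are kept as separate functions so the score can be recomputed on the rewritten text, mirroring the evaluate-then-improve loop described above.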

A particularly noteworthy aspect of this research is its multidisciplinary nature. By addressing the evaluation and improvement of LLM-generated texts in high-stakes applications like healthcare and finance, it brings together natural language processing expertise, domain-specific knowledge, and critical decision-making. The framework's ability to handle semantic, factual, and summarization consistency tasks showcases its versatility and applicability across various domains.

The empirical analysis presented in the study demonstrates the effectiveness of the proposed approach. The DCR framework outperforms state-of-the-art methods by a significant margin (e.g., +19.3% and +24.3% on the SummEval dataset), indicating its strength in evaluating the consistency of LLM generation. Moreover, the approach reduces output inconsistencies by nearly 90%, which is crucial for mitigating hallucination in LLM-generated texts.

In conclusion, the DCR framework offers a promising solution to the challenge of evaluating and improving the consistency of LLM-generated texts. Its divide-conquer-reasoning approach, together with the automatic metric converter and reason-assisted improver, provides both a more interpretable measure of generation quality and a mechanism for improving it. Given the multidisciplinary demands of high-stakes text generation, this research contributes to the development of more reliable and robust decision-making systems across critical domains.
