arXiv:2508.14214v1
Abstract: Emotions exert an immense influence over human behavior and cognition in both commonplace and high-stress tasks. Discussions of whether or how to integrate large language models (LLMs) into everyday life (e.g., acting as proxies for, or interacting with, human agents) should be informed by an understanding of how these tools evaluate emotionally loaded stimuli or situations. A model’s alignment with human behavior in these cases can inform the effectiveness of LLMs for certain roles or interactions. To help build this understanding, we elicited ratings from multiple popular LLMs for datasets of words and images that were previously rated for their emotional content by humans. We found that when performing the same rating tasks, GPT-4o responded very similarly to human participants across modalities, stimuli, and most rating scales (r = 0.9 or higher in many cases). However, arousal ratings were less well aligned between human and LLM raters, while happiness ratings were most highly aligned. Overall, LLMs aligned better within a five-category (happiness, anger, sadness, fear, disgust) emotion framework than within a two-dimensional (arousal and valence) organization. Finally, LLM ratings were substantially more homogeneous than human ratings. Together, these results begin to describe how LLM agents interpret emotional stimuli and highlight similarities and differences between biological and artificial intelligence in key behavioral domains.
Expert Commentary: Understanding Emotional Evaluation by Large Language Models
Emotions play a crucial role in human behavior and decision-making, shaping how we perceive and interact with the world around us. As large language models (LLMs) are integrated into everyday life, an important question is how these systems evaluate emotionally loaded stimuli and how closely their evaluations align with human behavior.
This topic is inherently multidisciplinary, sitting at the intersection of psychology, artificial intelligence, and human-computer interaction. By eliciting ratings from LLMs for stimuli that human raters have already normed for emotional content and then comparing the two sets of ratings, researchers can quantify how well these models track human emotional judgments and, by extension, how well suited they are to particular roles and interactions (a minimal sketch of this comparison follows below).
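As a concrete illustration, the Python sketch below shows the general shape of such a comparison: collect model ratings for stimuli that already have human norms, measure agreement with a Pearson correlation, and compare the spread of the two rating distributions. The rate_with_llm function, the rating scale, and all numerical values are hypothetical placeholders, not the authors' actual prompts, datasets, or results.

```python
# Minimal sketch of the rating-comparison pipeline described above, assuming a
# 1-9 valence scale; rate_with_llm() and all values are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical human valence norms for a handful of words.
human_norms = {"sunshine": 8.1, "funeral": 1.9, "table": 5.0, "spider": 3.3}

def rate_with_llm(word: str) -> float:
    """Stand-in for an API call that asks the model to rate `word` on the
    same 1-9 valence scale used for the human norms."""
    canned = {"sunshine": 8.4, "funeral": 1.6, "table": 5.2, "spider": 2.9}
    return canned[word]

words = list(human_norms)
human = np.array([human_norms[w] for w in words])
model = np.array([rate_with_llm(w) for w in words])

# Alignment between the two sets of raters.
r, p = pearsonr(human, model)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# The abstract also reports that LLM ratings were more homogeneous than human
# ratings; comparing standard deviations is one simple way to probe that.
print(f"human SD = {human.std(ddof=1):.2f}, LLM SD = {model.std(ddof=1):.2f}")
```

In practice the placeholder function would issue a prompt to the model for each stimulus and parse a numeric rating from its reply, and the norms would come from an established human-rated dataset rather than a toy dictionary.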
The study's findings on the alignment between LLMs and humans are striking. GPT-4o's ratings correlated strongly with human ratings across modalities and stimuli (r = 0.9 or higher in many cases), with happiness ratings the most closely aligned. Arousal ratings, by contrast, were less well aligned, indicating that the model's sense of how intense or activating a stimulus is diverges more from human judgments.
Furthermore, the study found that LLM ratings aligned more closely with human ratings within a five-category framework of emotions (happiness, anger, sadness, fear, and disgust) than within a two-dimensional arousal-valence organization. This insight can inform the design of AI systems intended to interpret and respond to human emotions accurately; one way such a framework-level comparison might be summarized is sketched below.
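One plausible way to make the framework comparison concrete is to compute a human-LLM correlation for each rating scale and then average within each framework. The sketch below does exactly that; the per-scale correlation values are invented for illustration and are not the paper's reported numbers, and this summary is an assumption about the analysis rather than the authors' stated method.

```python
# Hedged illustration of a framework-level summary: average per-scale
# human-LLM correlations within each framework. All values are made up.
import numpy as np

per_scale_r = {
    # five-category framework
    "happiness": 0.93, "anger": 0.90, "sadness": 0.91, "fear": 0.88, "disgust": 0.89,
    # two-dimensional framework
    "valence": 0.90, "arousal": 0.62,
}

categorical = ["happiness", "anger", "sadness", "fear", "disgust"]
dimensional = ["valence", "arousal"]

cat_mean = np.mean([per_scale_r[s] for s in categorical])
dim_mean = np.mean([per_scale_r[s] for s in dimensional])
print(f"categorical mean r = {cat_mean:.2f}, dimensional mean r = {dim_mean:.2f}")
```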
Overall, the results highlight both the promise and the limits of LLMs as interpreters of emotional stimuli: the models track human judgments closely on many scales, yet their ratings are substantially more homogeneous than human ratings, a clear behavioral difference between biological and artificial intelligence. Moving forward, interdisciplinary studies and collaborations will be crucial for deepening our understanding of how AI systems perceive and engage with emotions in human-computer interaction.