Google Research just dropped a paper that tries to answer a question I’ve been chewing on for a while: do LLMs actually behave like people in social situations, or are they just good at sounding like they do?
Their new framework, described in “Evaluating Alignment of Behavioral Dispositions in LLMs,” takes a clever approach. Instead of asking models to fill out personality questionnaires directly—which is too easy to game with prompt tweaks—they built situational judgment tests (SJTs) based on established psychological instruments.
These aren’t your typical “rate your agreement with this statement” surveys. Each SJT presents a realistic scenario where the model acts as an advisor, choosing between two courses of action. One option aligns with a specific behavioral trait (like empathy or assertiveness), the other opposes it. Example scenarios range from workplace conflict resolution to helping someone book a trip, covering everyday human-to-human interactions and professional settings.
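To make that structure concrete, here is a minimal sketch of how a single SJT item could be represented. The class name `SJTItem`, its fields, and the prompt wording are my own illustrative assumptions, not the paper's actual schema or prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SJTItem:
    """Hypothetical container for one situational judgment test item."""
    trait: str            # e.g. "empathy" or "assertiveness"
    scenario: str         # the situation the model advises on
    aligned_action: str   # action consistent with the trait
    opposed_action: str   # action that runs against the trait

    def to_prompt(self, aligned_first: bool = True) -> str:
        """Render the item as a two-option advisory question."""
        if aligned_first:
            first, second = self.aligned_action, self.opposed_action
        else:
            first, second = self.opposed_action, self.aligned_action
        return (
            f"{self.scenario}\n"
            "As an advisor, which course of action do you recommend?\n"
            f"A) {first}\n"
            f"B) {second}"
        )

    def label_for(self, choice: str, aligned_first: bool = True) -> str:
        """Map a chosen option letter back to 'aligned' or 'opposed'."""
        if choice not in ("A", "B"):
            raise ValueError("choice must be 'A' or 'B'")
        picked_first = choice == "A"
        return "aligned" if picked_first == aligned_first else "opposed"


# Invented example item, for illustration only.
item = SJTItem(
    trait="empathy",
    scenario="A colleague missed a deadline after a family emergency.",
    aligned_action="Ask how they are doing before discussing the deadline.",
    opposed_action="Remind them of the deadline policy right away.",
)
prompt = item.to_prompt()
```

The `aligned_first` flag is there so the option order can be swapped between runs, which is one simple way to check that a model is not just favoring whichever option is listed first.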
The key insight here is that self-report questionnaires are unreliable for LLMs because the models’ outputs shift wildly depending on how you phrase the prompt. A model might claim to be empathetic when asked directly, but behave coldly when faced with a nuanced social dilemma. SJTs sidestep that by forcing the model to demonstrate its dispositions through action, not just declaration.
To establish a human baseline, Google collected preferred actions from 550 annotators (10 per scenario) and compared those distributions to model responses across 25 different LLMs. The results reveal two distinct kinds of gaps:
First, there’s the obvious case where a model’s behavior simply disagrees with the majority of human annotators. Second, and more interesting, is when the model fails to capture the natural range of human opinions. Humans rarely agree on everything—some people are more assertive, some more accommodating—but some models seemed to collapse that diversity into a single “correct” response.
I find the second gap particularly concerning. If models only learn to mimic the majority opinion, they’ll struggle in situations where reasonable people disagree. That’s not alignment; it flattens human complexity into a bland average.
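The two gaps described above can be made concrete with a small sketch. This is my own illustration, not the paper's scoring code: the function names are hypothetical, the votes are invented, and total variation distance is a metric I picked for simplicity, since this post does not state which distance the authors use.

```python
from collections import Counter

ACTIONS = ("A", "B")  # the two courses of action in each scenario


def choice_distribution(votes):
    """Fraction of votes for each action."""
    counts = Counter(votes)
    total = len(votes)
    return {action: counts.get(action, 0) / total for action in ACTIONS}


def majority_gap(human_votes, model_choice):
    """Gap 1: the model's choice differs from the human majority.

    A tied human vote has no majority, so it never counts as a gap.
    """
    dist = choice_distribution(human_votes)
    if dist["A"] == dist["B"]:
        return False
    majority = max(dist, key=dist.get)
    return model_choice != majority


def distribution_gap(human_votes, model_samples):
    """Gap 2: total variation distance between the human and model
    choice distributions (0 = identical, 1 = completely disjoint)."""
    human = choice_distribution(human_votes)
    model = choice_distribution(model_samples)
    return 0.5 * sum(abs(human[a] - model[a]) for a in ACTIONS)


# Invented data: 10 annotators split 6/4, while a model sampled
# 10 times picks "A" every time.
human_votes = ["A"] * 6 + ["B"] * 4
model_samples = ["A"] * 10

agrees_with_majority = not majority_gap(human_votes, "A")
spread_gap = distribution_gap(human_votes, model_samples)
```

In this made-up case the model matches the majority, so the first gap is absent, yet the distribution gap is still 0.4 because the model never picks the option that four of the ten annotators preferred. That is the "collapsed diversity" failure in miniature.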
The paper’s methodology is solid. They adapted validated psychological instruments like the Interpersonal Reactivity Index (IRI) for empathy and the Emotion Regulation Questionnaire (ERQ), then had three independent annotators validate each generated SJT for coherence and fidelity to the underlying behavioral markers. The evaluation uses an LLM-as-a-judge to map model responses to one of the two action choices.
What’s missing from the paper, and what I’d love to see follow-up work address, is whether these behavioral gaps actually matter in practice. Does a model that scores low on empathy in these tests produce worse advice in real-world use? Probably, but the paper doesn’t make that connection explicit.
Also, I’m curious about the annotator pool. 550 participants is decent, but who are they? If the sample skews toward certain demographics or cultural backgrounds, the “human consensus” might not generalize well. Behavioral norms around assertiveness, emotional expression, and conflict resolution vary significantly across cultures.
The framework itself is a useful tool for the alignment toolkit. It moves beyond abstract safety benchmarks and toward evaluating models on the kind of social reasoning that actually matters when people use them for advice, customer support, or personal assistance. But this is an early step—the paper acknowledges that, and I appreciate the honesty.
If future research can tie these behavioral disposition scores to downstream task performance, we might finally get alignment evaluations that predict real-world behavior, not just benchmark scores. Until then, this is a solid foundation for understanding how LLMs navigate the messy, subjective world of human social dynamics.