News

Humans vs Vision-Language Models

Centre for Human-Centred Computing · Centre for Fundamentals of AI and Computational Theory

26 March 2026

Shalom Lappin is part of a team, centred at the Centre for Linguistic Theory and Studies in Probability (CLASP) at the University of Gothenburg, that has published new work comparing humans with vision-language models.

The paper proposes a unified way to measure narrative coherence in writing about sequences of visual scenes. The experimental results suggest that human descriptions are more coherent, across different dimensions, than vision-language model (VLM) ones, despite the fluency of the latter. Human writing about visual narratives also shows significantly more elements of surprise.

Abstract

We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at this https URL.

Reference

Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga (2026) Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence, arXiv, March. DOI: 10.48550/arXiv.2603.25537

People: Shalom Lappin

Contact: Shalom Lappin
Email: s.lappin@qmul.ac.uk

Updated by: Paul Curzon