Benchmarking Visual Language Models on Standardized Visualization Literacy Tests

IF 2.9; Zone 4 (Computer Science); Q2 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

Computer Graphics Forum Pub Date : 2025-05-23 DOI:10.1111/cgf.70137

Saugat Pandey, Alvitta Ottley

Computer Graphics Forum, Volume 44, Issue 3. Publication type: Journal Article.


Abstract

The increasing integration of Visual Language Models (VLMs) into visualization systems demands a comprehensive understanding of their visual interpretation capabilities and constraints. While existing research has examined individual models, systematic comparisons of VLMs' visualization literacy remain unexplored. We bridge this gap through a rigorous, first-of-its-kind evaluation of four leading VLMs (GPT-4, Claude, Gemini, and Llama) using standardized assessments: the Visualization Literacy Assessment Test (VLAT) and Critical Thinking Assessment for Literacy in Visualizations (CALVI). Our methodology uniquely combines randomized trials with structured prompting techniques to control for order effects and response variability, a critical consideration overlooked in many VLM evaluations. Our analysis reveals that while specific models demonstrate competence in basic chart interpretation (Claude achieving 67.9% accuracy on VLAT), all models exhibit substantial difficulties in identifying misleading visualization elements (maximum 30.0% accuracy on CALVI). We uncover distinct performance patterns: strong capabilities in interpreting conventional charts like line charts (76-96% accuracy) and detecting hierarchical structures (80-100% accuracy), but consistent difficulties with data-dense visualizations involving multiple encodings (bubble charts: 18.6-61.4%) and anomaly detection (25-30% accuracy). Significantly, we observe distinct uncertainty management behavior across models, with Gemini displaying heightened caution (22.5% question omission) compared to others (7-8%). These findings provide crucial insights for the visualization community by establishing reliable VLM evaluation benchmarks, identifying areas where current models fall short, and highlighting the need for targeted improvements in VLM architectures for visualization tasks.
To promote reproducibility, encourage further research, and facilitate benchmarking of future VLMs, our complete evaluation framework, including code, prompts, and analysis scripts, is available at https://github.com/washuvis/VisLit-VLM-Eval.
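
As an illustration only, and not the authors' released code (which lives in the repository linked above), the following minimal Python sketch shows one way randomized answer ordering and the two reported metrics, accuracy and question omission rate, could be computed. The function names, the data layout, and the choice to count omitted questions in the accuracy denominator are all assumptions made for this sketch.

```python
import random
from collections import Counter


def shuffled_options(options, rng):
    """Return a shuffled copy of the answer options for one trial.

    Shuffling per trial is one common way to control for order effects;
    the original list is left unchanged.
    """
    order = list(options)
    rng.shuffle(order)
    return order


def score_trials(responses, answer_key):
    """Summarize (question_id, answer) pairs from repeated trials.

    An answer of None marks a question the model declined to answer.
    Assumption: omitted questions count toward the accuracy denominator.
    """
    counts = Counter()
    for question_id, answer in responses:
        if answer is None:
            counts["omitted"] += 1
        elif answer == answer_key[question_id]:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"accuracy": 0.0, "omission_rate": 0.0}
    return {
        "accuracy": counts["correct"] / total,
        "omission_rate": counts["omitted"] / total,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be reproduced
    print(shuffled_options(["A", "B", "C", "D"], rng))

    # Toy data, invented for illustration; not results from the paper.
    key = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}
    trial = [("q1", "A"), ("q2", None), ("q3", "B"), ("q4", "C")]
    print(score_trials(trial, key))
```

Reporting the omission rate separately from accuracy keeps a cautious model, one that skips many questions, distinguishable from one that answers every question but answers incorrectly.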


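Source journal

Computer Graphics Forum (Engineering & Technology; Computer Science: Software Engineering)

CiteScore: 5.80

Self-citation rate: 12.00%

Publication volume: 175

Review time: 3-6 weeks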

Journal description: Computer Graphics Forum is the official journal of Eurographics, published in cooperation with Wiley-Blackwell, and is a unique, international source of information for computer graphics professionals interested in graphics developments worldwide. It is now one of the leading journals for researchers, developers and users of computer graphics in both commercial and academic environments. The journal reports on the latest developments in the field throughout the world and covers all aspects of the theory, practice and application of computer graphics.