Evaluating vision-capable chatbots in interpreting kinematics graphs: a comparative study of free and subscription-based models
Giulia Polverini, Bor Gregorcic
arXiv - PHYS - Physics Education, published 2024-06-20
DOI: https://doi.org/arxiv-2406.14685
Citations: 0
Abstract
This study investigates the performance of eight large multimodal model
(LMM)-based chatbots on the Test of Understanding Graphs in Kinematics (TUG-K),
a research-based concept inventory. Graphs are a widely used representation in
STEM and medical fields, making them a relevant topic for exploring LMM-based
chatbots' visual interpretation abilities. We evaluated both freely available
chatbots (Gemini 1.0 Pro, Claude 3 Sonnet, Microsoft Copilot, and ChatGPT-4o)
and subscription-based ones (Gemini 1.0 Ultra, Gemini 1.5 Pro API, Claude 3
Opus, and ChatGPT-4). We found that OpenAI's chatbots outperformed all the
others, with ChatGPT-4o showing the best overall performance. Contrary to
expectations, we found no notable differences in the overall performance
between freely available and subscription-based versions of Gemini and Claude 3
chatbots, with the exception of Gemini 1.5 Pro, available via API. In addition,
we found that tasks relying more heavily on linguistic input were generally
easier for chatbots than those requiring visual interpretation. The study
provides a basis for considerations of LMM-based chatbot applications in STEM
and medical education, and suggests directions for future research.