Crowdsourcing Thumbnail Captions: Data Collection and Validation

IF 4.3 3区 材料科学 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC
Carlos Alejandro Aguirre, Shiye Cao, Amama Mahmood, Chien-Ming Huang
{"title":"Crowdsourcing Thumbnail Captions: Data Collection and Validation","authors":"Carlos Alejandro Aguirre, Shiye Cao, Amama Mahmood, Chien-Ming Huang","doi":"10.1145/3589346","DOIUrl":null,"url":null,"abstract":"Speech interfaces, such as personal assistants and screen readers, read image captions to users. Typically, however, only one caption is available per image, which may not be adequate for all situations (e.g., browsing large quantities of images). Long captions provide a deeper understanding of an image but require more time to listen to, whereas shorter captions may not allow for such thorough comprehension yet have the advantage of being faster to consume. We explore how to effectively collect both thumbnail captions—succinct image descriptions meant to be consumed quickly—and comprehensive captions—which allow individuals to understand visual content in greater detail. We consider text-based instructions and time-constrained methods to collect descriptions at these two levels of detail and find that a time-constrained method is the most effective for collecting thumbnail captions while preserving caption accuracy. Additionally, we verify that caption authors using this time-constrained method are still able to focus on the most important regions of an image by tracking their eye gaze. We evaluate our collected captions along human-rated axes—correctness, fluency, amount of detail, and mentions of important concepts—and discuss the potential for model-based metrics to perform large-scale automatic evaluations in the future.","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2023-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3589346","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Speech interfaces, such as personal assistants and screen readers, read image captions to users. Typically, however, only one caption is available per image, which may not be adequate for all situations (e.g., browsing large quantities of images). Long captions provide a deeper understanding of an image but require more time to listen to, whereas shorter captions may not allow for such thorough comprehension yet have the advantage of being faster to consume. We explore how to effectively collect both thumbnail captions—succinct image descriptions meant to be consumed quickly—and comprehensive captions—which allow individuals to understand visual content in greater detail. We consider text-based instructions and time-constrained methods to collect descriptions at these two levels of detail and find that a time-constrained method is the most effective for collecting thumbnail captions while preserving caption accuracy. Additionally, we verify that caption authors using this time-constrained method are still able to focus on the most important regions of an image by tracking their eye gaze. We evaluate our collected captions along human-rated axes—correctness, fluency, amount of detail, and mentions of important concepts—and discuss the potential for model-based metrics to perform large-scale automatic evaluations in the future.
众包缩略图说明:数据收集和验证
语音界面,如个人助理和屏幕阅读器,会向用户读取图像说明。然而,通常情况下,每张图片只有一个标题,这可能不适用于所有情况(例如,浏览大量图片)。较长的字幕提供了对图像更深入的理解,但需要更多的时间来听,而较短的字幕可能不允许如此彻底的理解,但具有更快的消费优势。我们探讨了如何有效地收集缩略图标题(简洁的图像描述,旨在快速消费)和综合标题(允许个人更详细地理解视觉内容)。我们考虑了基于文本的指令和时间约束的方法来收集这两个细节级别的描述,并发现时间约束的方法在收集缩略图标题的同时保持标题的准确性是最有效的。此外,我们验证了使用这种时间约束方法的标题作者仍然能够通过跟踪他们的眼睛注视来关注图像中最重要的区域。我们按照人类评定的标准(正确性、流畅性、细节数量和重要概念的提及)评估收集到的标题,并讨论基于模型的指标在未来执行大规模自动评估的潜力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
7.20
自引率
4.30%
发文量
567
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信