Title: Language-based machine perception: linguistic perspectives on the compilation of captioning datasets
Authors: Laura Hekanaho, Maija Hirvonen, Tuomas Virtanen
Journal: Digital Scholarship in the Humanities (JCR category: Humanities, Multidisciplinary; impact factor 0.7)
DOI: 10.1093/llc/fqae029
Publication date: 2024-06-22 (Journal Article)
Citations: 0
Abstract
Over the last decade, a plethora of training datasets have been compiled for use in language-based machine perception and in human-centered AI, alongside research regarding their compilation methods. From a primarily linguistic perspective, we add to these studies in two ways. First, we provide an overview of sixty-six training datasets used in automatic image, video, and audio captioning, examining their compilation methods through a metadata analysis. Second, we delve into the annotation process of crowdsourced datasets, aiming to understand the linguistic factors that affect the form and content of the captions, such as contextualization and perspectivation. Using a qualitative content analysis, we examine the annotator instructions of a selection of eleven datasets. Drawing on various theoretical frameworks that help assess the effectiveness of the instructions, we discuss the visual and textual presentation of the instructions, as well as the perspective-guidance that is an essential part of the language instructions. While our analysis indicates that some standards for formulating instructions seem to have emerged in the field, we also identified several recurring issues that potentially hinder the readability and comprehensibility of the instructions and, consequently, caption quality. To enhance readability, we emphasize the importance of text structure, organization of information, consistent use of typographical cues, and clarity of language use. Finally, engaging with previous research, we assess the compilation of both web-sourced and crowdsourced captioning datasets from various perspectives, discussing factors affecting the diversity of the datasets.
Journal description:
DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal which publishes original contributions on all aspects of digital scholarship in the humanities, including, but not limited to, the field of what is currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research, and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.