Linguistic issues behind visual question answering

Impact factor: 2.8 | Category: LANGUAGE & LINGUISTICS

Language and Linguistics Compass, Volume 15, Issue 6. Pub Date: 2021-06-04. DOI: 10.1111/lnc3.12417

Raffaella Bernardi, Sandro Pezzelle


Citations: 9

Abstract

Answering a question that is grounded in an image is a crucial ability that requires understanding the question, the visual context, and their interaction at many linguistic levels: among others, semantics, syntax and pragmatics. As such, visually-grounded questions have long been of interest to theoretical linguists and cognitive scientists. Moreover, they have inspired the first attempts to computationally model natural language understanding, where pioneering systems were faced with the highly challenging task—still unsolved—of jointly dealing with syntax, semantics and inference whilst understanding a visual context. Boosted by impressive advancements in machine learning, the task of answering visually-grounded questions has experienced a renewed interest in recent years, to the point of becoming a research sub-field at the intersection of computational linguistics and computer vision. In this paper, we review current approaches to the problem which encompass the development of datasets, models and frameworks. We conduct our investigation from the perspective of the theoretical linguists; we extract from pioneering computational linguistic work a list of desiderata that we use to review current computational achievements. We acknowledge that impressive progress has been made to reconcile the engineering with the theoretical view. At the same time, we claim that further research is needed to get to a unified approach which jointly encompasses all the underlying linguistic problems. We conclude the paper by sharing our own desiderata for the future.




Source journal

Language and Linguistics Compass (LANGUAGE & LINGUISTICS)

CiteScore: 5.40

Self-citation rate: 4.00%

About the journal: Unique in its range, Language and Linguistics Compass is an online-only journal publishing original, peer-reviewed surveys of current research from across the entire discipline. Language and Linguistics Compass publishes state-of-the-art reviews, supported by a comprehensive bibliography and accessible to an international readership. Language and Linguistics Compass is aimed at senior undergraduates, postgraduates and academics, and will provide a unique reference tool for researching essays, preparing lectures, writing a research proposal, or just keeping up with new developments in a specific area of interest.