NLP Meets Vision for Visual Interpretation - A Retrospective Insight and Future directions

A. Jamshed, M. Fraz
DOI: 10.1109/ICoDT252288.2021.9441517
Published in: 2021 International Conference on Digital Futures and Transformative Technologies (ICoDT2)
Publication date: 2021-05-20
Citations: 1

Abstract

Recent advances in NLP (Natural Language Processing) and CV (Computer Vision) have sparked curiosity among researchers to test the limits of the latest deep learning techniques by employing them in more complex AI tasks. One such task is VQA (Visual Question Answering), which inherently spans many layers of complexity. Some questions are simple, with obvious answers, while others are more complex and require logical reasoning, common sense, and factual knowledge. Starting simple and gradually incorporating complexity is always a good idea in scientific research and development. At first, datasets were simpler, consisting of straightforward question-answer pairs over images depicting simple concepts, and relatively naive VQA models were trained on them. Over time, VQA datasets became more complicated and tangled, demanding greater cognitive capabilities from VQA models. This evolution pushed VQA models to better match human cognitive abilities, using reasoning based on common sense and factual knowledge. In this survey, we first discuss some well-known datasets in the VQA domain, and then some of the crucial advances in VQA architectures and the current work on integrating common sense and knowledge into these models. Moreover, reasoning is crucial for truly intelligent systems, but representations in deep learning models are inherently fuzzy and vague. We need models that can transparently generate reasoning about their predictions, like old-school expert systems that worked on symbolic knowledge; architectures based on an amalgam of deep learning techniques and symbolic representations are therefore also part of our discussion. We will also shed some light on the impact of transformers in deep learning and how transformer-based models are quickly becoming state of the art in almost every deep learning task.
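The abstract describes VQA models that combine an image representation with a question representation to predict an answer. As a minimal illustrative sketch (not a model from the paper), the common "late fusion" pattern can be shown with random feature vectors: each modality is encoded into a shared dimension, the two vectors are fused element-wise, and a linear classifier scores a small answer vocabulary. All names and dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16          # shared feature dimension (hypothetical)
N_ANSWERS = 4   # size of a toy answer vocabulary (hypothetical)

def fuse(img_feat, q_feat):
    """Element-wise (Hadamard) fusion of image and question features."""
    return img_feat * q_feat

def answer_scores(img_feat, q_feat, W, b):
    """Linear classifier over the fused multimodal representation."""
    return fuse(img_feat, q_feat) @ W + b

# Stand-ins for the outputs of an image encoder and a question encoder.
img = rng.standard_normal(D)
q = rng.standard_normal(D)

# Classifier parameters (randomly initialized, untrained).
W = rng.standard_normal((D, N_ANSWERS))
b = np.zeros(N_ANSWERS)

scores = answer_scores(img, q, W, b)
pred = int(np.argmax(scores))  # index of the highest-scoring candidate answer
```

In real systems the stand-in vectors would come from a CNN or vision transformer for the image and an RNN or text transformer for the question, and element-wise fusion is only the simplest of the joint-embedding strategies the survey covers.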