Toward Two-Stream Foveation-Based Active Vision Learning

IF 4.9 | Zone 3: Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Cognitive and Developmental Systems | Pub Date: 2024-04-17 | DOI: 10.1109/TCDS.2024.3390597

Timur Ibrayev;Amitangshu Mukherjee;Sai Aparna Aketi;Kaushik Roy

Volume 16, Issue 5, pp. 1843-1860. Full text: https://ieeexplore.ieee.org/document/10504691/

Citations: 0

Abstract


Toward Two-Stream Foveation-Based Active Vision Learning

Deep neural network (DNN) based machine perception frameworks process the entire input in a one-shot manner to provide answers to both “what object is being observed” and “where it is located.” In contrast, the “two-stream hypothesis” from neuroscience explains the neural processing in the human visual cortex as an active vision system that utilizes two separate regions of the brain to answer the what and the where questions. In this work, we propose a machine learning framework inspired by the “two-stream hypothesis” and explore the potential benefits that it offers. Specifically, the proposed framework models the following mechanisms: 1) ventral (what) stream focusing on the input regions perceived by the fovea part of an eye (foveation); 2) dorsal (where) stream providing visual guidance; and 3) iterative processing of the two streams to calibrate visual focus and process the sequence of focused image patches. The training of the proposed framework is accomplished by label-based DNN training for the ventral stream model and reinforcement learning (RL) for the dorsal stream model. We show that the two-stream foveation-based learning is applicable to the challenging task of weakly-supervised object localization (WSOL), where the training data is limited to the object class or its attributes. The framework is capable of both predicting the properties of an object and successfully localizing it by predicting its bounding box. We also show that, due to the independent nature of the two streams, the dorsal model can be applied on its own to unseen images to localize objects from different datasets.
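
As a rough illustration of the three mechanisms listed in the abstract, the Python sketch below alternates a "what" step and a "where" step over a sequence of cropped glimpses. It is not the authors' implementation. The function names (`crop_glimpse`, `ventral_stub`, `dorsal_stub`, `active_vision_loop`), the glimpse size, the step count, and the centroid-based fixation rule are all assumptions made for illustration. In the paper the ventral model is a label-trained DNN and the dorsal model is trained with RL; here both are replaced by hand-written placeholders so that the control flow can run on its own.

```python
import numpy as np


def crop_glimpse(image, center, size):
    """Crop a size x size patch around `center`, clamped to stay inside the image."""
    h, w = image.shape
    half = size // 2
    r = int(np.clip(center[0], half, h - (size - half)))
    c = int(np.clip(center[1], half, w - (size - half)))
    top, left = r - half, c - half
    return image[top:top + size, left:left + size], (top, left)


def ventral_stub(glimpse):
    """Placeholder 'what' model (assumption): a real one would be a trained DNN."""
    return float(glimpse.mean())


def dorsal_stub(glimpse, origin, center):
    """Placeholder 'where' model (assumption): a real one would be an RL policy.

    The next fixation is the intensity-weighted centroid of the current glimpse.
    """
    total = glimpse.sum()
    if total == 0:
        return center  # nothing salient in view, keep the current fixation
    rows, cols = np.indices(glimpse.shape)
    r = origin[0] + (rows * glimpse).sum() / total
    c = origin[1] + (cols * glimpse).sum() / total
    return (int(r), int(c))


def active_vision_loop(image, start, size=16, steps=5):
    """Alternate the two stubs; return per-step scores, final fixation, final box."""
    center, scores = start, []
    for _ in range(steps):
        glimpse, origin = crop_glimpse(image, center, size)
        scores.append(ventral_stub(glimpse))           # "what" on the focused patch
        center = dorsal_stub(glimpse, origin, center)  # "where" to look next
    # Box of the glimpse that the final fixation would produce.
    _, (top, left) = crop_glimpse(image, center, size)
    box = (top, left, top + size, left + size)  # (row0, col0, row1, col1)
    return scores, center, box
```

Keeping the two stubs as separate functions mirrors the independence of the two streams described in the abstract: the fixation rule can be swapped or reused without touching the patch-level predictor.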


Source journal

IEEE Transactions on Cognitive and Developmental Systems Computer Science-Software

CiteScore: 7.20

Self-citation rate: 10.00%

Articles published: 170

Journal description: The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.