Toward Two-Stream Foveation-Based Active Vision Learning
Timur Ibrayev, Amitangshu Mukherjee, Sai Aparna Aketi, Kaushik Roy
IEEE Transactions on Cognitive and Developmental Systems, vol. 16, no. 5, pp. 1843-1860, published 2024-04-17
DOI: 10.1109/TCDS.2024.3390597
https://ieeexplore.ieee.org/document/10504691/
Abstract
Deep neural network (DNN) based machine perception frameworks process the entire input in a one-shot manner to provide answers to both “what object is being observed” and “where it is located.” In contrast, the “two-stream hypothesis” from neuroscience explains the neural processing in the human visual cortex as an active vision system that utilizes two separate regions of the brain to answer the what and the where questions. In this work, we propose a machine learning framework inspired by the “two-stream hypothesis” and explore the potential benefits that it offers. Specifically, the proposed framework models the following mechanisms: 1) ventral (what) stream focusing on the input regions perceived by the fovea part of an eye (foveation); 2) dorsal (where) stream providing visual guidance; and 3) iterative processing of the two streams to calibrate visual focus and process the sequence of focused image patches. The training of the proposed framework is accomplished by label-based DNN training for the ventral stream model and reinforcement learning (RL) for the dorsal stream model. We show that the two-stream foveation-based learning is applicable to the challenging task of weakly-supervised object localization (WSOL), where the training data is limited to the object class or its attributes. The framework is capable of both predicting the properties of an object and successfully localizing it by predicting its bounding box. We also show that, due to the independent nature of the two streams, the dorsal model can be applied on its own to unseen images to localize objects from different datasets.
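The iterative interplay of the two streams described above can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' implementation. In the paper the dorsal model is trained with RL and the ventral model is a label-trained DNN; here both are replaced by hand-written stand-ins. All names (`foveate`, `toy_dorsal_step`, `toy_ventral_predict`, `make_demo_image`) and the box format `(top, left, height, width)` are hypothetical.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # assumed format: (top, left, height, width)


def crop(image: Image, box: Box) -> Image:
    """Return the image patch covered by the box (the 'foveated' region)."""
    top, left, h, w = box
    return [row[left:left + w] for row in image[top:top + h]]


def clamp_box(box: Box, img_h: int, img_w: int) -> Box:
    """Keep the box inside the image bounds."""
    top, left, h, w = box
    h = max(1, min(h, img_h))
    w = max(1, min(w, img_w))
    top = max(0, min(top, img_h - h))
    left = max(0, min(left, img_w - w))
    return (top, left, h, w)


def foveate(image: Image,
            dorsal_step: Callable[[Image, Box], Box],
            ventral_predict: Callable[[Image], str],
            steps: int = 3) -> Tuple[str, Box]:
    """Alternate 'where' (move the focus) and 'what' (classify the patch)."""
    img_h, img_w = len(image), len(image[0])
    box: Box = (0, 0, img_h, img_w)  # start by looking at the whole image
    label = ventral_predict(image)
    for _ in range(steps):
        box = clamp_box(dorsal_step(image, box), img_h, img_w)  # where
        label = ventral_predict(crop(image, box))               # what
    return label, box


def toy_dorsal_step(image: Image, box: Box) -> Box:
    """Stand-in for the learned dorsal policy: halve the box (minimum 2x2)
    and center it near the brightest pixel."""
    _, _, h, w = box
    _, r, c = max((image[r][c], r, c)
                  for r in range(len(image))
                  for c in range(len(image[0])))
    new_h, new_w = max(2, h // 2), max(2, w // 2)
    return (r - new_h // 2, c - new_w // 2, new_h, new_w)


def toy_ventral_predict(patch: Image) -> str:
    """Stand-in for the ventral DNN: threshold the mean patch intensity."""
    values = [v for row in patch for v in row]
    return "object" if sum(values) / len(values) > 0.5 else "background"


def make_demo_image() -> Image:
    """8x8 dark image with a bright 2x2 'object' at rows 4-5, columns 5-6."""
    image = [[0.0] * 8 for _ in range(8)]
    for r in (4, 5):
        for c in (5, 6):
            image[r][c] = 1.0
    return image
```

Calling `foveate(make_demo_image(), toy_dorsal_step, toy_ventral_predict, steps=3)` narrows the focus from the full image to the 2x2 bright block. The final box serves as the predicted bounding box, which mirrors in miniature how a localization can emerge without box-level labels. Because the dorsal step takes only the image and the current box, it can also be run without the ventral stand-in, loosely echoing the abstract's point about the independence of the two streams.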
About the journal:
The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.