{"title":"航空场景下单目深度估计的时间关注","authors":"Vlad-Cristian Miclea, S. Nedevschi","doi":"10.1109/ICECCME55909.2022.9988383","DOIUrl":null,"url":null,"abstract":"Monocular depth estimation (MDE) is a key task for a large set of computer vision applications, convolutional neural networks (CNNs) being nowadays employed for this task. The objective of measuring the world from a single image is cumbersome, especially in case of highly complex scenarios where there is a lack in scene structure. State of the art deep learning-based methods cope with this problem by employing very powerful feature extractors, mixed with additional scene priors such as geometrical or semantic information. The usage of such approaches generally leads to high amounts of resources, computations which make the system incapable for real-time processing. In this work we propose a novel method that tries to account for the time constraints while providing accurate depth maps from a monocular system. Thus, instead of providing geometric or semantic priors which need complex additional processing (generally an additional CNN), we aid the depth estimation process with features extracted and preserved from previous frames. To this end, we propose a novel temporal attention sub-network, that properly extracts the aforementioned features and it combines them with the last available depth map. This sub-network is then inserted into a novel CNN architecture, that proves to generate better depth maps. We test the efficiency of our method on aerial images and obtain an improved accuracy while keeping the amount of resources as low as possible.","PeriodicalId":202568,"journal":{"name":"2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Temporal Attention for Monocular Depth Estimation in Aerial Scenarios\",\"authors\":\"Vlad-Cristian Miclea, S. Nedevschi\",\"doi\":\"10.1109/ICECCME55909.2022.9988383\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Monocular depth estimation (MDE) is a key task for a large set of computer vision applications, convolutional neural networks (CNNs) being nowadays employed for this task. The objective of measuring the world from a single image is cumbersome, especially in case of highly complex scenarios where there is a lack in scene structure. State of the art deep learning-based methods cope with this problem by employing very powerful feature extractors, mixed with additional scene priors such as geometrical or semantic information. The usage of such approaches generally leads to high amounts of resources, computations which make the system incapable for real-time processing. In this work we propose a novel method that tries to account for the time constraints while providing accurate depth maps from a monocular system. Thus, instead of providing geometric or semantic priors which need complex additional processing (generally an additional CNN), we aid the depth estimation process with features extracted and preserved from previous frames. To this end, we propose a novel temporal attention sub-network, that properly extracts the aforementioned features and it combines them with the last available depth map. This sub-network is then inserted into a novel CNN architecture, that proves to generate better depth maps. 
We test the efficiency of our method on aerial images and obtain an improved accuracy while keeping the amount of resources as low as possible.\",\"PeriodicalId\":202568,\"journal\":{\"name\":\"2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME)\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICECCME55909.2022.9988383\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICECCME55909.2022.9988383","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Temporal Attention for Monocular Depth Estimation in Aerial Scenarios
Monocular depth estimation (MDE) is a key task for a wide range of computer vision applications, with convolutional neural networks (CNNs) now being the standard tool for it. Measuring the world from a single image is a difficult objective, especially in highly complex scenarios where scene structure is lacking. State-of-the-art deep learning-based methods cope with this problem by employing very powerful feature extractors combined with additional scene priors such as geometric or semantic information. Such approaches generally demand large amounts of resources and computation, which makes the system unsuitable for real-time processing. In this work we propose a novel method that accounts for these time constraints while still providing accurate depth maps from a monocular system. Instead of relying on geometric or semantic priors that require complex additional processing (generally an additional CNN), we aid the depth estimation process with features extracted and preserved from previous frames. To this end, we propose a novel temporal attention sub-network that extracts these features and combines them with the last available depth map. This sub-network is then inserted into a novel CNN architecture, which is shown to generate better depth maps. We evaluate the efficiency of our method on aerial images and obtain improved accuracy while keeping resource usage as low as possible.
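The abstract does not describe the exact architecture, but the core idea (re-weighting features preserved from the previous frame, conditioned on the current features and the last available depth map, before fusing them into the depth decoder) can be sketched as below. This is a minimal, hypothetical PyTorch illustration under assumed design choices; the module name, channel counts, and the particular attention/fusion layout are not taken from the paper.

```python
# Hypothetical sketch of a temporal attention fusion block, not the authors'
# implementation. It combines current-frame encoder features with features
# cached from the previous frame and the previous depth prediction.
import torch
import torch.nn as nn


class TemporalAttentionFusion(nn.Module):
    def __init__(self, feat_channels: int):
        super().__init__()
        # Attention weights computed from current features, previous features,
        # and the previous depth map (one extra channel) -- an assumed design.
        self.attention = nn.Sequential(
            nn.Conv2d(2 * feat_channels + 1, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Fuse the attended previous-frame features with the current features.
        self.fuse = nn.Conv2d(2 * feat_channels, feat_channels, kernel_size=3, padding=1)

    def forward(self, curr_feat, prev_feat, prev_depth):
        # curr_feat, prev_feat: (B, C, H, W); prev_depth: (B, 1, H, W)
        attn = self.attention(torch.cat([curr_feat, prev_feat, prev_depth], dim=1))
        attended_prev = attn * prev_feat  # re-weight the preserved features
        return self.fuse(torch.cat([curr_feat, attended_prev], dim=1))


if __name__ == "__main__":
    block = TemporalAttentionFusion(feat_channels=64)
    curr = torch.randn(1, 64, 48, 64)
    prev = torch.randn(1, 64, 48, 64)
    depth = torch.randn(1, 1, 48, 64)
    print(block(curr, prev, depth).shape)  # torch.Size([1, 64, 48, 64])
```

In such a scheme the fused output would replace the plain encoder features at one decoder stage, so the previous frame only adds a few convolutions rather than a second CNN pass, which is consistent with the paper's stated goal of keeping resource usage low.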