{"title":"MGSGNet-S*: Multilayer Guided Semantic Graph Network via Knowledge Distillation for RGB-Thermal Urban Scene Parsing","authors":"Wujie Zhou;Hongping Wu;Qiuping Jiang","doi":"10.1109/TIV.2024.3456437","DOIUrl":null,"url":null,"abstract":"Owing to rapid developments in driverless technologies, vision tasks for unmanned vehicles have gained considerable attention, particularly in multimodal-based urban scene parsing. Although deep-learning algorithms have outperformed traditional models in such tasks, they cannot operate on mobile devices and edge networks owing to the coarse-grained cross-modal complementary information alignment, inadequate modeling of semantic-category relations, overabundance of parameters, and high computational complexity. To address these issues, a multilayer guided semantic graph network via knowledge distillation (MGSGNet-S<sup>*</sup>) is proposed for red-green-blue-thermal urban scene parsing. First, a new cross-modal adaptive fusion module adjusts pixel-level adaptive modal complementary information by incorporating additional deep modal information and residual cross-modal matrix fine-grained attention. Second, a novel semantic graph module overcomes the misclassification problems of objects of the same semantic class during low-level encoding by incorporating high-level information in the Euclidean space and modeling semantic graph relationships in the non-Euclidean space. Finally, to strike the balance between accuracy and efficiency, a tailored framework optimally utilizes effective knowledge of pixel intra- and inter-class similarity, fusion features, and cross-modal correlation. Experimental results indicate that MGSGNet-S<sup>*</sup> considerably outperforms relevant state-of-the-art methods with fewer parameters and lower computational costs. The numbers of parameters and floating-point operations were reduced by 95.69% and 93.34%, respectively, relative to those for the teacher model, thus demonstrating stronger inferencing capabilities at 28.65 frames per second.","PeriodicalId":36532,"journal":{"name":"IEEE Transactions on Intelligent Vehicles","volume":"10 5","pages":"3543-3559"},"PeriodicalIF":14.3000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Intelligent Vehicles","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10669814/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Owing to rapid developments in driverless technologies, vision tasks for unmanned vehicles have gained considerable attention, particularly multimodal urban scene parsing. Although deep-learning algorithms outperform traditional models in such tasks, they are difficult to deploy on mobile devices and edge networks owing to coarse-grained alignment of cross-modal complementary information, inadequate modeling of semantic-category relations, an overabundance of parameters, and high computational complexity. To address these issues, a multilayer guided semantic graph network trained via knowledge distillation (MGSGNet-S*) is proposed for red-green-blue-thermal urban scene parsing. First, a new cross-modal adaptive fusion module adaptively adjusts pixel-level complementary modal information by incorporating additional deep modal information and residual cross-modal matrix fine-grained attention. Second, a novel semantic graph module overcomes the misclassification of objects of the same semantic class during low-level encoding by incorporating high-level information in Euclidean space and modeling semantic graph relationships in non-Euclidean space. Finally, to strike a balance between accuracy and efficiency, a tailored distillation framework optimally utilizes effective knowledge of pixel intra- and inter-class similarity, fusion features, and cross-modal correlation. Experimental results indicate that MGSGNet-S* considerably outperforms relevant state-of-the-art methods with fewer parameters and lower computational costs: the numbers of parameters and floating-point operations are reduced by 95.69% and 93.34%, respectively, relative to the teacher model, while inference runs at 28.65 frames per second.
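The abstract only outlines these components, so the paper's exact formulations are not reproduced here. As a rough illustration of the pixel-level adaptive fusion idea (blending RGB and thermal features through a learned per-pixel gate), the following is a minimal PyTorch-style sketch; the module name, channel sizes, and gating scheme are assumptions for illustration, not the authors' cross-modal adaptive fusion module, which additionally uses deeper-layer guidance and residual cross-modal matrix attention.

```python
import torch
import torch.nn as nn

class CrossModalAdaptiveFusion(nn.Module):
    """Hypothetical sketch of pixel-level adaptive RGB-thermal fusion.

    Not the paper's module: the real design also incorporates additional
    deep modal information and residual cross-modal matrix fine-grained
    attention, neither of which is specified in the abstract.
    """
    def __init__(self, channels: int):
        super().__init__()
        # Learn a per-pixel gate in [0, 1] from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )
        # Residual refinement of the fused feature map.
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([rgb, thermal], dim=1))  # (B, C, H, W)
        fused = g * rgb + (1.0 - g) * thermal            # pixel-adaptive blend
        return fused + self.refine(fused)                # residual refinement


if __name__ == "__main__":
    caf = CrossModalAdaptiveFusion(channels=64)
    rgb_feat = torch.randn(2, 64, 120, 160)
    thermal_feat = torch.randn(2, 64, 120, 160)
    print(caf(rgb_feat, thermal_feat).shape)  # torch.Size([2, 64, 120, 160])
```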
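The distillation framework is likewise described only at a high level (pixel intra-/inter-class similarity, fusion features, cross-modal correlation). As an illustrative stand-in rather than the paper's tailored objective, a common way to combine logit and feature distillation for semantic segmentation is a temperature-scaled KL term on per-pixel class logits plus an L2 feature-imitation term; all weights and the loss composition below are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat,
                      labels, T: float = 4.0,
                      alpha: float = 0.5, beta: float = 0.1):
    """Illustrative segmentation KD loss (not MGSGNet-S*'s exact objective).

    student_logits / teacher_logits: (B, num_classes, H, W)
    student_feat / teacher_feat:     (B, C, H', W'), matching shapes
    labels:                          (B, H, W) ground-truth class indices
    """
    # Supervised cross-entropy on the student's predictions.
    ce = F.cross_entropy(student_logits, labels)

    # Per-pixel soft-label KL divergence at temperature T.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # L2 imitation of the teacher's intermediate features.
    feat = F.mse_loss(student_feat, teacher_feat)

    return ce + alpha * kl + beta * feat
```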
Journal Introduction
The IEEE Transactions on Intelligent Vehicles (T-IV) is a premier platform for publishing peer-reviewed articles that present innovative research concepts, application results, significant theoretical findings, and application case studies in the field of intelligent vehicles. With a particular emphasis on automated vehicles within roadway environments, T-IV aims to raise awareness of pressing research and application challenges.
T-IV focuses on providing critical information to the intelligent-vehicle community and serves as a dissemination channel for IEEE ITS Society members and others interested in state-of-the-art developments and progress in research and applications related to intelligent vehicles. Join us in advancing knowledge and innovation in this dynamic field.