Gong Cheng, Xi Yong, Xin Lyu, Tao Zeng, Xinyu Wang, Jiale Chen, Xin Li
MSYOLOF: Multi-input-single-output encoder network with tripartite feature enhancement for object detection
Proceedings of the 2023 5th International Conference on Pattern Recognition and Intelligent Systems, 2023-07-28. DOI: 10.1145/3609703.3609710
Citations: 0
Abstract
Object detection with a single-level feature is a challenging task, as representations of objects at different scales must be extracted from a single feature map. However, existing object detectors that use a one-level feature suffer from inadequate representations of objects at different scales, resulting in low accuracy for multi-scale object detection, especially for smaller objects. To address this problem, a novel object detector named MSYOLOF is proposed to construct an effective single feature map for detecting objects of different scales. The proposed network introduces three modules that bring considerable improvements: Feature Pyramid Pooling (FPP), Feature Perception Enhancement (FPE), and Dual Branch Receptive Field (DBRF). First, the FPP module aggregates contextual information from various regions to improve the network's ability to capture global information, strengthening the model's understanding of the overall scene. Second, the FPE module uses coordinate attention to construct a residual block that yields orientation-aware and position-sensitive information, enabling the network to accurately locate and identify objects of interest. Third, by rethinking the Dilated Encoder of YOLOF, the DBRF module reduces information loss and mitigates the tendency of dilated convolutions with large dilation rates to be sensitive only to large objects. Extensive experiments on the COCO benchmark validate the effectiveness of our network, which exhibits superior performance compared with other state-of-the-art networks.
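The abstract does not specify how the FPP module pools context, but pyramid-style pooling over regions of several granularities (as popularized by spatial pyramid pooling) is the standard way to aggregate multi-region context into one feature map. A minimal numpy sketch of that general idea follows; the bin sizes, nearest-neighbor upsampling, and channel concatenation are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def adaptive_avg_pool(x, out_size):
    """Average-pool a (C, H, W) map into an (C, out_size, out_size) grid of regions."""
    C, H, W = x.shape
    out = np.zeros((C, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            h0, h1 = i * H // out_size, (i + 1) * H // out_size
            w0, w1 = j * W // out_size, (j + 1) * W // out_size
            out[:, i, j] = x[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def upsample_nearest(x, H, W):
    """Nearest-neighbor upsample a (C, h, w) map back to (C, H, W)."""
    _, h, w = x.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return x[:, rows][:, :, cols]

def feature_pyramid_pooling(x, bin_sizes=(1, 2, 3, 6)):
    """Aggregate context at several region granularities (hypothetical bin sizes)
    and concatenate the upsampled context maps with the input along channels."""
    C, H, W = x.shape
    branches = [x]
    for s in bin_sizes:
        pooled = adaptive_avg_pool(x, s)          # regional context at granularity s
        branches.append(upsample_nearest(pooled, H, W))
    return np.concatenate(branches, axis=0)       # (C * (1 + len(bin_sizes)), H, W)
```

The 1x1 branch carries a global scene summary (every spatial position sees the whole-image mean), while finer bins keep coarse spatial layout, which matches the abstract's goal of strengthening global scene understanding on a single feature map.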