Multi-scale cross-modal feature fusion and cost-sensitive loss function for differential detection of occluded bagging pears in practical orchards
Shengli Yan, Wenhui Hou, Yuan Rao, Dan Jiang, Xiu Jin, Tan Wang, Yuwei Wang, Lu Liu, Tong Zhang, Arthur Genis
Artificial Intelligence in Agriculture, Volume 15, Issue 4 (2025), Pages 573-589
Publication date: 2025-05-18
DOI: 10.1016/j.aiia.2025.05.002
URL: https://www.sciencedirect.com/science/article/pii/S2589721725000558
Citations: 0
Abstract
In practical orchards, the challenges posed by fruit overlap and by branch and leaf occlusion significantly impede automated picking, particularly for bagging pears. To address this issue, this paper introduces MCCNet, a multi-scale cross-modal feature fusion and cost-sensitive classification loss function network designed to accurately detect bagging pears across different occlusion categories. The network employs a dual-stream convolutional neural network as its backbone, enabling parallel extraction of multi-modal features. In addition, we propose a novel lightweight cross-modal feature fusion method that enhances the features shared between modalities while extracting modality-specific features from the RGB and depth streams. This cross-modal fusion improves the perceptual capability of the model by combining complementary information from paired multimodal bagging pear images. Furthermore, we reformulate the classification loss as a cost-sensitive loss function, aiming to improve detection and classification efficiency and to reduce missed and false detections during picking. Experimental results on a bagging pear dataset show that MCCNet achieves mAP0.5 and mAP0.5:0.95 values of 97.3 % and 80.3 %, respectively, improvements of 3.6 % and 6.3 % over the classical YOLOv10m model. Benchmarked against several state-of-the-art detection models, MCCNet has only 19.5 million parameters while maintaining superior inference speed.
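
The abstract does not give implementation details of the lightweight cross-modal fusion, so the following is only a minimal illustrative sketch (not the authors' MCCNet module) of how a block could enhance shared features while preserving modality-specific RGB and depth features; the class name, channel sizes, and attention gate are assumptions for illustration.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Illustrative lightweight fusion of RGB and depth feature maps.

    Hypothetical sketch, not the paper's module: modality-specific features
    are kept via 1x1 projections, and shared features (the element-wise sum)
    are enhanced with a channel-attention gate.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.proj_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_depth = nn.Conv2d(channels, channels, kernel_size=1)
        # Channel attention computed on the shared (summed) representation.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        specific_rgb = self.proj_rgb(rgb)
        specific_depth = self.proj_depth(depth)
        shared = rgb + depth
        weights = self.gate(shared)            # per-channel weights in (0, 1)
        return weights * shared + specific_rgb + specific_depth


if __name__ == "__main__":
    rgb_feat = torch.randn(1, 256, 40, 40)     # dummy RGB feature map
    depth_feat = torch.randn(1, 256, 40, 40)   # dummy depth feature map
    fused = CrossModalFusion(256)(rgb_feat, depth_feat)
    print(fused.shape)                         # torch.Size([1, 256, 40, 40])
```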
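
Likewise, the cost-sensitive classification loss is described only at a high level. A common way to realize such a loss is to weight each sample's classification error by the misclassification cost of its true class; the sketch below assumes PyTorch, and the occlusion categories and cost values are made-up placeholders, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CostSensitiveCE(nn.Module):
    """Cross-entropy weighted by a per-class misclassification cost.

    Generic illustration of a cost-sensitive classification loss; the costs
    are hypothetical, e.g. penalising errors on heavily occluded pears more
    than on unoccluded ones.
    """

    def __init__(self, class_costs: torch.Tensor):
        super().__init__()
        self.register_buffer("class_costs", class_costs)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        per_sample_ce = F.cross_entropy(logits, targets, reduction="none")
        costs = self.class_costs[targets]      # cost of each sample's true label
        return (costs * per_sample_ce).mean()


if __name__ == "__main__":
    # Hypothetical categories: unoccluded, leaf-occluded, fruit-overlapped.
    criterion = CostSensitiveCE(torch.tensor([1.0, 1.5, 2.0]))
    logits = torch.randn(8, 3)                 # dummy predictions for 8 samples
    labels = torch.randint(0, 3, (8,))         # dummy ground-truth classes
    print(criterion(logits, labels).item())
```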