{"title":"Two–stage multimodal 3D point localization framework for automatic grape harvesting","authors":"Qian Shen , Dayu Xu , Tianyu Guo , Xiaobo Mao , Fang Xia","doi":"10.1016/j.atech.2025.101062","DOIUrl":null,"url":null,"abstract":"<div><div>This study proposes a lightweight Two–Stage multimodal 3D point localization framework for automated grape harvesting, addressing the challenge of precise 3D harvesting point localization. Unlike traditional methods, it employs a Two–Stage multimodal fusion framework, linking RGB and depth images. In the first–stage, pedicels in RGB images are segmented to generate masks. To tackle missing depth information and outliers, an Adaptive Percentile Filtering and Irregular Group-Based Completion (APF–IGBC) algorithm is proposed, leveraging depth distribution patterns and morphological features of grape pedicels. Guided by the mask, APF–IGBC efficiently filters and complements depth information. In the second stage, semantic features from the mask are integrated into the depth image via the Inward Shrinkage Method (ISM) for pose estimation, extracting three key points on pedicels for precise 3D localization. The framework enhances depth restoration and pose estimation accuracy through multimodal fusion. To address multi-scale pedicel challenges, Shared Self–learning YOLO (SSL–YOLO) is introduced, utilizing a Shared Self–learning Head (SSL–Head) for cross-scale information flow. SSL–YOLO achieves 103.9 FPS (9.8 GFLOPs, 2.7M Params) in instance segmentation and 118.8 FPS (6.1 GFLOPs, 2.6M Params) in pose estimation, demonstrating lightweight efficiency, with AP@50 scores of 99.1% and 99.5%, respectively. In comprehensive experiments on a self-constructed grape dataset, the framework achieves a P of 99.2% and a R of 99.2% for 3D harvesting point localization within 600 mm. It has a computational cost of 15.9 GFLOPs and 5.3M Params, running at 100.6 FPS on a GPU and 27.6 FPS on a CPU, showcasing high accuracy and practicality.</div></div>","PeriodicalId":74813,"journal":{"name":"Smart agricultural technology","volume":"12 ","pages":"Article 101062"},"PeriodicalIF":5.7000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Smart agricultural technology","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772375525002953","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AGRICULTURAL ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
This study proposes a lightweight two-stage multimodal 3D point localization framework for automated grape harvesting, addressing the challenge of precise 3D harvesting-point localization. Unlike traditional methods, it employs a two-stage multimodal fusion framework linking RGB and depth images. In the first stage, pedicels in RGB images are segmented to generate masks. To tackle missing depth information and outliers, an Adaptive Percentile Filtering and Irregular Group-Based Completion (APF-IGBC) algorithm is proposed, leveraging the depth distribution patterns and morphological features of grape pedicels. Guided by the mask, APF-IGBC efficiently filters and completes the depth information. In the second stage, semantic features from the mask are integrated into the depth image via the Inward Shrinkage Method (ISM) for pose estimation, extracting three key points on pedicels for precise 3D localization. The framework improves depth restoration and pose estimation accuracy through multimodal fusion. To address multi-scale pedicel challenges, Shared Self-learning YOLO (SSL-YOLO) is introduced, using a Shared Self-learning Head (SSL-Head) for cross-scale information flow. SSL-YOLO achieves 103.9 FPS (9.8 GFLOPs, 2.7M parameters) in instance segmentation and 118.8 FPS (6.1 GFLOPs, 2.6M parameters) in pose estimation, demonstrating lightweight efficiency, with AP@50 scores of 99.1% and 99.5%, respectively. In comprehensive experiments on a self-constructed grape dataset, the framework achieves a precision (P) of 99.2% and a recall (R) of 99.2% for 3D harvesting-point localization within 600 mm. It has a computational cost of 15.9 GFLOPs and 5.3M parameters, running at 100.6 FPS on a GPU and 27.6 FPS on a CPU, showcasing high accuracy and practicality.
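The abstract does not give the APF-IGBC equations, but the general idea of mask-guided percentile filtering of depth values followed by completion of the removed or missing pixels can be sketched roughly as below. This is a minimal illustrative sketch under stated assumptions, not the authors' algorithm: the function names, the 5th/95th percentile band, and the neighborhood-median hole filling are all assumptions standing in for the paper's irregular group-based completion.

```python
import numpy as np

def percentile_filter_depth(depth, mask, low_pct=5.0, high_pct=95.0):
    """Keep only depth values inside a percentile band computed from the
    masked pedicel region; out-of-band pixels are treated as outliers and
    set to zero (i.e. marked as missing).  Percentile bounds are assumed."""
    region = depth[(mask > 0) & (depth > 0)]          # valid depths under the pedicel mask
    if region.size == 0:
        return depth.copy()
    lo, hi = np.percentile(region, [low_pct, high_pct])
    filtered = depth.copy()
    outliers = (mask > 0) & ((depth < lo) | (depth > hi))
    filtered[outliers] = 0                            # flag as missing for the completion step
    return filtered

def fill_missing_with_neighborhood_median(depth, mask, win=5):
    """Naive completion stand-in: replace missing (zero) pixels inside the
    mask with the median of valid masked depths in a small neighborhood."""
    filled = depth.copy()
    h, w = depth.shape
    ys, xs = np.where((mask > 0) & (depth == 0))
    r = win // 2
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        patch = depth[y0:y1, x0:x1]
        pmask = mask[y0:y1, x0:x1]
        valid = patch[(pmask > 0) & (patch > 0)]
        if valid.size:
            filled[y, x] = np.median(valid)
    return filled

# Example usage on a synthetic depth map and pedicel mask (all values assumed):
depth = np.random.uniform(400, 600, size=(64, 64)).astype(np.float32)
depth[10:12, 10:12] = 0            # simulated missing depth
mask = np.zeros((64, 64), dtype=np.uint8)
mask[5:30, 5:30] = 1               # simulated pedicel mask
completed = fill_missing_with_neighborhood_median(
    percentile_filter_depth(depth, mask), mask)
```

In the actual framework the completion step reportedly groups irregular missing regions and exploits the pedicel's morphology rather than a fixed window, so the sketch above only conveys the mask-guided filter-then-complete flow, not the published method.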