{"title":"基于视觉变换器的机器人感知技术,用于田间环境中的早茶菊花计数","authors":"Chao Qi, Kunjie Chen, Junfeng Gao","doi":"10.1002/rob.22398","DOIUrl":null,"url":null,"abstract":"<p>The current mainstream approaches for plant organ counting are based on convolutional neural networks (CNNs), which have a solid local feature extraction capability. However, CNNs inherently have difficulties for robust global feature extraction due to limited receptive fields. Visual transformer (ViT) provides a new opportunity to complement CNNs' capability, and it can easily model global context. In this context, we propose a deep learning network based on a convolution-free ViT backbone (tea chrysanthemum-visual transformer [TC-ViT]) to achieve the accurate and real-time counting of TCs at their early flowering stage under unstructured environments. First, all cropped fixed-size original image patches are linearly projected into a one-dimensional vector sequence and fed into a progressive multiscale ViT backbone to capture multiple scaled feature sequences. Subsequently, the obtained feature sequences are reshaped into two-dimensional image features and using a multiscale perceptual field module as a regression head to detect the overall scale and density variance. The resulting model was tested on 400 field images in the collected TC test data set, showing that the proposed TC-ViT achieved the mean absolute error and mean square error of 12.32 and 15.06, with the inference speed of 27.36 FPS (512 × 512 image size) under the NVIDIA Tesla V100 GPU environment. It is also shown that light variation had the greatest effect on TC counting, whereas blurring had the least effect. This proposed method enables accurate counting for high-density and occlusion objects in field environments and this perception system could be deployed in a robotic platform for selective harvesting and flower phenotyping.</p>","PeriodicalId":192,"journal":{"name":"Journal of Field Robotics","volume":"42 1","pages":"65-78"},"PeriodicalIF":4.2000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/rob.22398","citationCount":"0","resultStr":"{\"title\":\"A vision transformer-based robotic perception for early tea chrysanthemum flower counting in field environments\",\"authors\":\"Chao Qi, Kunjie Chen, Junfeng Gao\",\"doi\":\"10.1002/rob.22398\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The current mainstream approaches for plant organ counting are based on convolutional neural networks (CNNs), which have a solid local feature extraction capability. However, CNNs inherently have difficulties for robust global feature extraction due to limited receptive fields. Visual transformer (ViT) provides a new opportunity to complement CNNs' capability, and it can easily model global context. In this context, we propose a deep learning network based on a convolution-free ViT backbone (tea chrysanthemum-visual transformer [TC-ViT]) to achieve the accurate and real-time counting of TCs at their early flowering stage under unstructured environments. First, all cropped fixed-size original image patches are linearly projected into a one-dimensional vector sequence and fed into a progressive multiscale ViT backbone to capture multiple scaled feature sequences. 
Subsequently, the obtained feature sequences are reshaped into two-dimensional image features and using a multiscale perceptual field module as a regression head to detect the overall scale and density variance. The resulting model was tested on 400 field images in the collected TC test data set, showing that the proposed TC-ViT achieved the mean absolute error and mean square error of 12.32 and 15.06, with the inference speed of 27.36 FPS (512 × 512 image size) under the NVIDIA Tesla V100 GPU environment. It is also shown that light variation had the greatest effect on TC counting, whereas blurring had the least effect. This proposed method enables accurate counting for high-density and occlusion objects in field environments and this perception system could be deployed in a robotic platform for selective harvesting and flower phenotyping.</p>\",\"PeriodicalId\":192,\"journal\":{\"name\":\"Journal of Field Robotics\",\"volume\":\"42 1\",\"pages\":\"65-78\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-07-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1002/rob.22398\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Field Robotics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/rob.22398\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ROBOTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Field Robotics","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/rob.22398","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
A vision transformer-based robotic perception for early tea chrysanthemum flower counting in field environments
The current mainstream approaches for plant organ counting are based on convolutional neural networks (CNNs), which have a solid local feature extraction capability. However, CNNs inherently struggle with robust global feature extraction because of their limited receptive fields. The vision transformer (ViT) provides a new opportunity to complement CNNs' capability, as it can easily model global context. In this context, we propose a deep learning network based on a convolution-free ViT backbone (tea chrysanthemum-vision transformer [TC-ViT]) to achieve accurate, real-time counting of TCs at their early flowering stage in unstructured environments. First, all cropped fixed-size original image patches are linearly projected into a one-dimensional vector sequence and fed into a progressive multiscale ViT backbone to capture feature sequences at multiple scales. Subsequently, the obtained feature sequences are reshaped into two-dimensional image features, and a multiscale perceptual field module serves as a regression head to handle variations in overall scale and density. The resulting model was tested on 400 field images from the collected TC test data set, showing that the proposed TC-ViT achieved a mean absolute error of 12.32 and a mean square error of 15.06, with an inference speed of 27.36 FPS (512 × 512 image size) on an NVIDIA Tesla V100 GPU. The results also show that light variation had the greatest effect on TC counting, whereas blurring had the least. The proposed method enables accurate counting of high-density and occluded objects in field environments, and the perception system could be deployed on a robotic platform for selective harvesting and flower phenotyping.
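To make the described pipeline concrete, below is a minimal PyTorch sketch of a ViT-based density-map counting model of the kind the abstract outlines: patch tokens from a linear projection, a transformer encoder over the token sequence, a reshape back to a 2D feature map, and a multiscale regression head producing a density map. Everything here is an illustrative assumption rather than the authors' TC-ViT implementation: the class names (PatchEmbed, CountingViT), all hyperparameters, and the parallel dilated convolutions standing in for the multiscale perceptual field module are hypothetical, and the paper's progressive multiscale backbone is reduced to a plain transformer encoder for brevity.

```python
# Minimal sketch of a ViT-style density-map counting pipeline, loosely
# following the stages described in the abstract. All module names and
# hyperparameters are illustrative assumptions, not the authors' TC-ViT.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Linearly project fixed-size image patches into a 1D token sequence."""
    def __init__(self, img_size=512, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the standard way to realize the linear
        # projection of non-overlapping patches.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, C, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, N, C) token sequence


class CountingViT(nn.Module):
    """Transformer encoder over patch tokens, then a regression head that
    maps the reshaped 2D features to a single-channel density map."""
    def __init__(self, img_size=512, patch_size=16, embed_dim=256,
                 depth=6, heads=8):
        super().__init__()
        self.grid = img_size // patch_size
        self.embed = PatchEmbed(img_size, patch_size, 3, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Stand-in for the multiscale perceptual field module: parallel
        # dilated convolutions covering different receptive field sizes.
        self.head = nn.ModuleList([
            nn.Conv2d(embed_dim, 64, 3, padding=d, dilation=d)
            for d in (1, 2, 3)
        ])
        self.out = nn.Conv2d(64 * 3, 1, 1)      # 1-channel density map

    def forward(self, x):
        tokens = self.encoder(self.embed(x) + self.pos)   # (B, N, C)
        feat = tokens.transpose(1, 2).reshape(            # back to a 2D map
            x.shape[0], -1, self.grid, self.grid)
        feat = torch.cat([torch.relu(conv(feat)) for conv in self.head], dim=1)
        return self.out(feat)                             # (B, 1, grid, grid)


model = CountingViT()
density = model(torch.randn(1, 3, 512, 512))
# The predicted flower count is the integral (sum) of the density map.
print(density.shape, float(density.sum()))
```

Under this density-map formulation, the count for an image is the sum over the predicted map, and the MAE/MSE figures reported above would be computed between that sum and the ground-truth flower count per image.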
Journal overview:
The Journal of Field Robotics seeks to promote scholarly publications dealing with the fundamentals of robotics in unstructured and dynamic environments.
The Journal focuses on experimental robotics and encourages publication of work that has both theoretical and practical significance.