{"title":"基于视觉变换器的机器人感知技术,用于田间环境中的早茶菊花计数","authors":"Chao Qi, Kunjie Chen, Junfeng Gao","doi":"10.1002/rob.22398","DOIUrl":null,"url":null,"abstract":"<p>The current mainstream approaches for plant organ counting are based on convolutional neural networks (CNNs), which have a solid local feature extraction capability. However, CNNs inherently have difficulties for robust global feature extraction due to limited receptive fields. Visual transformer (ViT) provides a new opportunity to complement CNNs' capability, and it can easily model global context. In this context, we propose a deep learning network based on a convolution-free ViT backbone (tea chrysanthemum-visual transformer [TC-ViT]) to achieve the accurate and real-time counting of TCs at their early flowering stage under unstructured environments. First, all cropped fixed-size original image patches are linearly projected into a one-dimensional vector sequence and fed into a progressive multiscale ViT backbone to capture multiple scaled feature sequences. Subsequently, the obtained feature sequences are reshaped into two-dimensional image features and using a multiscale perceptual field module as a regression head to detect the overall scale and density variance. The resulting model was tested on 400 field images in the collected TC test data set, showing that the proposed TC-ViT achieved the mean absolute error and mean square error of 12.32 and 15.06, with the inference speed of 27.36 FPS (512 × 512 image size) under the NVIDIA Tesla V100 GPU environment. It is also shown that light variation had the greatest effect on TC counting, whereas blurring had the least effect. This proposed method enables accurate counting for high-density and occlusion objects in field environments and this perception system could be deployed in a robotic platform for selective harvesting and flower phenotyping.</p>","PeriodicalId":192,"journal":{"name":"Journal of Field Robotics","volume":"42 1","pages":"65-78"},"PeriodicalIF":4.2000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/rob.22398","citationCount":"0","resultStr":"{\"title\":\"A vision transformer-based robotic perception for early tea chrysanthemum flower counting in field environments\",\"authors\":\"Chao Qi, Kunjie Chen, Junfeng Gao\",\"doi\":\"10.1002/rob.22398\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The current mainstream approaches for plant organ counting are based on convolutional neural networks (CNNs), which have a solid local feature extraction capability. However, CNNs inherently have difficulties for robust global feature extraction due to limited receptive fields. Visual transformer (ViT) provides a new opportunity to complement CNNs' capability, and it can easily model global context. In this context, we propose a deep learning network based on a convolution-free ViT backbone (tea chrysanthemum-visual transformer [TC-ViT]) to achieve the accurate and real-time counting of TCs at their early flowering stage under unstructured environments. First, all cropped fixed-size original image patches are linearly projected into a one-dimensional vector sequence and fed into a progressive multiscale ViT backbone to capture multiple scaled feature sequences. 
Subsequently, the obtained feature sequences are reshaped into two-dimensional image features and using a multiscale perceptual field module as a regression head to detect the overall scale and density variance. The resulting model was tested on 400 field images in the collected TC test data set, showing that the proposed TC-ViT achieved the mean absolute error and mean square error of 12.32 and 15.06, with the inference speed of 27.36 FPS (512 × 512 image size) under the NVIDIA Tesla V100 GPU environment. It is also shown that light variation had the greatest effect on TC counting, whereas blurring had the least effect. This proposed method enables accurate counting for high-density and occlusion objects in field environments and this perception system could be deployed in a robotic platform for selective harvesting and flower phenotyping.</p>\",\"PeriodicalId\":192,\"journal\":{\"name\":\"Journal of Field Robotics\",\"volume\":\"42 1\",\"pages\":\"65-78\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-07-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1002/rob.22398\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Field Robotics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/rob.22398\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ROBOTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Field Robotics","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/rob.22398","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
A vision transformer-based robotic perception for early tea chrysanthemum flower counting in field environments
The current mainstream approaches for plant organ counting are based on convolutional neural networks (CNNs), which have a solid local feature extraction capability. However, CNNs inherently struggle with robust global feature extraction because of their limited receptive fields. The vision transformer (ViT) provides a new opportunity to complement CNNs' capability, as it can easily model global context. In this context, we propose a deep learning network based on a convolution-free ViT backbone (tea chrysanthemum-vision transformer [TC-ViT]) to achieve accurate, real-time counting of TCs at their early flowering stage in unstructured environments. First, all cropped fixed-size original image patches are linearly projected into a one-dimensional vector sequence and fed into a progressive multiscale ViT backbone to capture feature sequences at multiple scales. Subsequently, the obtained feature sequences are reshaped into two-dimensional image features, and a multiscale perceptual field module serves as a regression head to handle variations in overall scale and density. The resulting model was tested on 400 field images from the collected TC test data set, showing that the proposed TC-ViT achieved a mean absolute error of 12.32 and a mean square error of 15.06, with an inference speed of 27.36 FPS (512 × 512 image size) on an NVIDIA Tesla V100 GPU. The results also show that light variation had the greatest effect on TC counting, whereas blurring had the least. The proposed method enables accurate counting of high-density and occluded objects in field environments, and the perception system could be deployed on a robotic platform for selective harvesting and flower phenotyping.
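To make the described pipeline concrete, below is a minimal PyTorch sketch of a ViT-based density-map counting model of the kind the abstract outlines: patch tokens from a linear projection, a transformer encoder over the token sequence, a reshape back to a 2D feature map, and a multiscale regression head producing a density map. Everything here is an illustrative assumption rather than the authors' TC-ViT implementation: the class names (PatchEmbed, CountingViT), all hyperparameters, and the parallel dilated convolutions standing in for the multiscale perceptual field module are hypothetical, and the paper's progressive multiscale backbone is reduced to a plain transformer encoder for brevity.

```python
# Minimal sketch of a ViT-style density-map counting pipeline, loosely
# following the stages described in the abstract. All module names and
# hyperparameters are illustrative assumptions, not the authors' TC-ViT.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Linearly project fixed-size image patches into a 1D token sequence."""
    def __init__(self, img_size=512, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the standard way to realize the linear
        # projection of non-overlapping patches.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, C, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, N, C) token sequence


class CountingViT(nn.Module):
    """Transformer encoder over patch tokens, then a regression head that
    maps the reshaped 2D features to a single-channel density map."""
    def __init__(self, img_size=512, patch_size=16, embed_dim=256,
                 depth=6, heads=8):
        super().__init__()
        self.grid = img_size // patch_size
        self.embed = PatchEmbed(img_size, patch_size, 3, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Stand-in for the multiscale perceptual field module: parallel
        # dilated convolutions covering different receptive field sizes.
        self.head = nn.ModuleList([
            nn.Conv2d(embed_dim, 64, 3, padding=d, dilation=d)
            for d in (1, 2, 3)
        ])
        self.out = nn.Conv2d(64 * 3, 1, 1)      # 1-channel density map

    def forward(self, x):
        tokens = self.encoder(self.embed(x) + self.pos)   # (B, N, C)
        feat = tokens.transpose(1, 2).reshape(            # back to a 2D map
            x.shape[0], -1, self.grid, self.grid)
        feat = torch.cat([torch.relu(conv(feat)) for conv in self.head], dim=1)
        return self.out(feat)                             # (B, 1, grid, grid)


model = CountingViT()
density = model(torch.randn(1, 3, 512, 512))
# The predicted flower count is the integral (sum) of the density map.
print(density.shape, float(density.sum()))
```

Under this density-map formulation, the count for an image is the sum over the predicted map, and the MAE/MSE figures reported above would be computed between that sum and the ground-truth flower count per image.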
Journal overview:
The Journal of Field Robotics seeks to promote scholarly publications dealing with the fundamentals of robotics in unstructured and dynamic environments.
The Journal focuses on experimental robotics and encourages publication of work that has both theoretical and practical significance.